snorkel.labeling.LFAnalysis

class snorkel.labeling.LFAnalysis(L, lfs=None)

Bases: object

Run analyses on labeling functions (LFs) using a label matrix.

Parameters

L (ndarray) – Label matrix where L_{i,j} is the label given by the jth LF to the ith candidate (using -1 for abstain)
lfs (Optional[List[LabelingFunction]]) – Labeling functions used to generate L

Raises

ValueError – If the number of LFs differs from the number of columns in the label matrix

L: See above.

__init__(L, lfs=None)

Initialize self. See help(type(self)) for accurate signature.

Return type: None

Methods

`__init__`(L[, lfs])	Initialize self.
`label_conflict`()	Compute the fraction of data points with conflicting (non-abstain) labels.
`label_coverage`()	Compute the fraction of data points with at least one label.
`label_overlap`()	Compute the fraction of data points with at least two (non-abstain) labels.
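`lf_conflicts`([normalize_by_overlaps])	Compute the fraction of examples that each LF labels and that another LF labels differently.
`lf_coverages`()	Compute the fraction of examples that each LF labels.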
`lf_empirical_accuracies`(Y)	Compute empirical accuracy against a set of labels Y for each LF.
`lf_empirical_probs`(Y, k)	Estimate conditional probability tables for each LF.
`lf_overlaps`([normalize_by_coverage])	Compute the fraction of examples that each LF labels and that another LF also labels.
`lf_polarities`()	Infer the polarities of each LF based on evidence in a label matrix.
`lf_summary`([Y, est_weights])	Create a pandas DataFrame with the various per-LF statistics.

label_conflict()

Compute the fraction of data points with conflicting (non-abstain) labels.

Returns: Fraction of data points with conflicting labels
Return type: float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_conflict()
0.2
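The value 0.2 can be reproduced by hand. The following is a minimal NumPy sketch, not the library's implementation; the helper name `conflict_fraction` is hypothetical. It assumes a row counts as conflicting when its non-abstain entries contain more than one distinct label.

```python
import numpy as np

ABSTAIN = -1  # abstain value used throughout this page


def conflict_fraction(L):
    """Fraction of rows whose non-abstain labels disagree (illustrative helper)."""
    conflicting = 0
    for row in L:
        votes = row[row != ABSTAIN]  # drop abstains
        if np.unique(votes).size > 1:  # two or more distinct labels
            conflicting += 1
    return conflicting / L.shape[0]


L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
print(conflict_fraction(L))  # 0.2
```

Only the third row (labels 1 and 0) disagrees, so one of five data points conflicts.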

label_coverage()

Compute the fraction of data points with at least one label.

Returns: Fraction of data points with labels
Return type: float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_coverage()
0.8

label_overlap()

Compute the fraction of data points with at least two (non-abstain) labels.

Returns: Fraction of data points with overlapping labels
Return type: float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_overlap()
0.6
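Both the coverage and the overlap fraction reduce to counting non-abstain entries per row. The snippet below is an illustrative NumPy sketch under that reading, not the library's code; the variable names are chosen for this example only.

```python
import numpy as np

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])

# Number of non-abstain labels each data point receives: [2, 0, 2, 1, 3]
votes_per_row = (L != -1).sum(axis=1)

coverage = (votes_per_row >= 1).mean()  # at least one label
overlap = (votes_per_row >= 2).mean()   # at least two labels
print(coverage, overlap)  # 0.8 0.6
```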

lf_conflicts(normalize_by_overlaps=False)

Compute the fraction of examples that each LF labels and that another LF labels differently.

A conflicting example is one that at least one other LF returns a different (non-abstain) label for.

Note that the maximum possible conflict fraction for an LF is the LF’s overlaps fraction, unless normalize_by_overlaps=True, in which case it is 1.

Parameters: normalize_by_overlaps (bool) – Normalize by the overlaps of each LF, so that the result is the fraction of that LF's overlaps that are conflicts.
Returns: Fraction of conflicting examples for each LF
Return type: numpy.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_conflicts()
array([0.2, 0.2, 0. ])
>>> LFAnalysis(L).lf_conflicts(normalize_by_overlaps=True)
array([0.5       , 0.33333333, 0.        ])
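To make the relationship between the two outputs concrete, here is a hedged NumPy sketch that recomputes per-LF conflicts and overlaps and then divides one by the other. The helper `per_lf_conflicts_and_overlaps` is hypothetical and is not how the library necessarily computes these values.

```python
import numpy as np

ABSTAIN = -1


def per_lf_conflicts_and_overlaps(L):
    """Return (conflicts, overlaps) as fractions of all rows, one entry per LF."""
    n, m = L.shape
    conflicts = np.zeros(m)
    overlaps = np.zeros(m)
    for j in range(m):
        others = np.delete(L, j, axis=1)  # columns of the other LFs
        labeled = L[:, j] != ABSTAIN
        other_voted = others != ABSTAIN
        # Overlap: this LF voted and at least one other LF voted too.
        overlaps[j] = (labeled & other_voted.any(axis=1)).mean()
        # Conflict: this LF voted and another LF gave a different non-abstain label.
        differs = other_voted & (others != L[:, [j]])
        conflicts[j] = (labeled & differs.any(axis=1)).mean()
    return conflicts, overlaps


L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
conflicts, overlaps = per_lf_conflicts_and_overlaps(L)
# Guard against LFs with no overlaps to avoid dividing by zero.
normalized = np.divide(
    conflicts, overlaps, out=np.zeros_like(conflicts), where=overlaps > 0
)
```

Here `conflicts` is `[0.2, 0.2, 0.0]` and `overlaps` is `[0.4, 0.6, 0.4]`, so the normalized values are 0.5, 1/3, and 0, matching the example above.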

lf_coverages()

Compute the fraction of examples that each LF labels.

Returns: Fraction of labeled examples for each LF
Return type: numpy.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_coverages()
array([0.4, 0.8, 0.4])

lf_empirical_accuracies(Y)

Compute empirical accuracy against a set of labels Y for each LF.

Usually, Y represents development set labels.

Parameters: Y (ndarray) – [n] or [n, 1] np.ndarray of gold labels
Returns: Empirical accuracies for each LF
Return type: numpy.ndarray
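This method has no example above, so the following sketch illustrates one plausible reading of "empirical accuracy". It assumes accuracy is measured only over the examples on which an LF does not abstain; that convention, the gold labels `Y`, and the variable names are assumptions made for illustration, not taken from the library.

```python
import numpy as np

ABSTAIN = -1

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
Y = np.array([0, 1, 1, 0, 0])  # made-up gold labels for illustration

voted = L != ABSTAIN                  # where each LF gave a label
correct = voted & (L == Y[:, None])   # where that label matches the gold label

# Accuracy over non-abstain votes; an LF that never votes would yield nan here.
accuracies = correct.sum(axis=0) / voted.sum(axis=0)
print(accuracies)
```

With these labels the first and third LFs are right on both of their votes, and the second LF is right on three of its four votes, giving 1.0, 0.75, and 1.0.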

lf_empirical_probs(Y, k)

Estimate conditional probability tables for each LF.

Computes conditional probability tables, P(L | Y), for each LF using the provided true labels Y.

Parameters

Y (ndarray) – The length-n array of true labels in {1,…,k}
k (int) – The cardinality i.e. number of classes

Returns

An m x (k+1) x k np.ndarray holding one (k+1) x k conditional probability table P_i per LF, where P_i[l,y] is the empirical estimate of P(LF_i = l | Y = y)

Return type

np.ndarray
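The shape of the returned array is easier to see with a small construction. The sketch below builds an m x (k+1) x k array by counting. It is not the library's implementation, and its indexing is an assumption: labels are taken to be 0-indexed as in the label matrices on this page (whereas the parameter description states {1,…,k}), row 0 of each table holds abstains, and row l+1 holds label l. The helper `empirical_cpts` and the gold labels are hypothetical.

```python
import numpy as np


def empirical_cpts(L, Y, k):
    """Estimate P(LF_i = l | Y = y) by counting (illustrative helper)."""
    n, m = L.shape
    P = np.zeros((m, k + 1, k))
    for y in range(k):
        in_class = Y == y
        n_y = in_class.sum()
        if n_y == 0:
            continue  # no examples of this class; leave the column at zero
        for i in range(m):
            outputs = L[in_class, i] + 1  # shift so abstain (-1) maps to row 0
            counts = np.bincount(outputs, minlength=k + 1)
            P[i, :, y] = counts / n_y
    return P


L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
Y = np.array([0, 1, 1, 0, 0])  # made-up gold labels for illustration
P = empirical_cpts(L, Y, k=2)
print(P.shape)  # (3, 3, 2)
```

Each column `P[i, :, y]` sums to 1 whenever class `y` appears in `Y`, because every example of that class receives exactly one output (a label or an abstain) from each LF.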

lf_overlaps(normalize_by_coverage=False)

Compute the fraction of examples that each LF labels and that another LF also labels.

An overlapping example is one that at least one other LF returns a (non-abstain) label for.

Note that the maximum possible overlap fraction for an LF is the LF’s coverage, unless normalize_by_coverage=True, in which case it is 1.

Parameters: normalize_by_coverage (bool) – Normalize by the coverage of each LF, so that the result is the fraction of that LF's labels that have overlaps.
Returns: Fraction of overlapping examples for each LF
Return type: numpy.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_overlaps()
array([0.4, 0.6, 0.4])
>>> LFAnalysis(L).lf_overlaps(normalize_by_coverage=True)
array([1.  , 0.75, 1.  ])
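The normalized output is the unnormalized overlap divided by the LF's coverage. The following NumPy sketch recomputes both quantities to show this; it is an illustration under that assumption rather than the library's code, and the variable names are chosen for this example.

```python
import numpy as np

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])

voted = L != -1
coverages = voted.mean(axis=0)  # fraction of rows each LF labels

# A vote overlaps when its row holds at least two non-abstain labels.
overlapping = voted & (voted.sum(axis=1, keepdims=True) >= 2)
overlaps = overlapping.mean(axis=0)

# Assumes every LF labels at least one row, so no coverage is zero.
normalized = overlaps / coverages
```

For this matrix `coverages` is `[0.4, 0.8, 0.4]` and `overlaps` is `[0.4, 0.6, 0.4]`, so the ratio is 1, 0.75, and 1: only the second LF labels a row (the fourth) that no other LF labels.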

lf_polarities()

Infer the polarities of each LF based on evidence in a label matrix.

Returns: Unique output labels for each LF
Return type: List[List[int]]

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_polarities()
[[0, 1], [0], [0]]
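The polarities are the distinct non-abstain values found in each column of the label matrix. A short sketch of that idea, written independently of the library:

```python
import numpy as np

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])

# Collect the unique non-abstain labels emitted by each LF (one column per LF).
polarities = [
    sorted(int(v) for v in np.unique(L[:, j]) if v != -1)
    for j in range(L.shape[1])
]
print(polarities)  # [[0, 1], [0], [0]]
```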

lf_summary(Y=None, est_weights=None)

Create a pandas DataFrame with the various per-LF statistics.

Parameters

Y (Optional[ndarray]) – [n] or [n, 1] np.ndarray of gold labels. If provided, the empirical accuracy for each LF will be calculated.
est_weights (Optional[ndarray]) – Learned weights for each LF

Returns

Summary statistics for each LF

Return type

pandas.DataFrame