snorkel.labeling.LFAnalysis

class snorkel.labeling.LFAnalysis(L, lfs=None)[source]

Bases: object

Run analyses on LFs using a label matrix.

Parameters
  • L (ndarray) – Label matrix where L_{i,j} is the label given by the jth LF to the ith candidate (using -1 for abstain)

  • lfs (Optional[List[LabelingFunction]]) – Labeling functions used to generate L

Raises

ValueError – If the number of LFs differs from the number of columns in the label matrix

L[source]

See above.

__init__(L, lfs=None)[source]

Initialize the analysis with a label matrix L and, optionally, the labeling functions that produced it.

Return type

None

Methods

__init__(L[, lfs])

Initialize self.

label_conflict()

Compute the fraction of data points with conflicting (non-abstain) labels.

label_coverage()

Compute the fraction of data points with at least one label.

label_overlap()

Compute the fraction of data points with at least two (non-abstain) labels.

lf_conflicts([normalize_by_overlaps])

Compute the fraction of examples each LF labels that another LF labels differently.

lf_coverages()

Compute the fraction of examples each LF labels.

lf_empirical_accuracies(Y)

Compute empirical accuracy against a set of labels Y for each LF.

lf_empirical_probs(Y, k)

Estimate conditional probability tables for each LF.

lf_overlaps([normalize_by_coverage])

Compute the fraction of examples each LF labels that are also labeled by another LF.

lf_polarities()

Infer the polarities of each LF based on evidence in a label matrix.

lf_summary([Y, est_weights])

Create a pandas DataFrame with the various per-LF statistics.

label_conflict()[source]

Compute the fraction of data points with conflicting (non-abstain) labels.

Returns

Fraction of data points with conflicting labels

Return type

float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_conflict()
0.2

label_coverage()[source]

Compute the fraction of data points with at least one label.

Returns

Fraction of data points with labels

Return type

float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_coverage()
0.8

label_overlap()[source]

Compute the fraction of data points with at least two (non-abstain) labels.

Returns

Fraction of data points with overlapping labels

Return type

float

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).label_overlap()
0.6
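
The three data-point-level fractions documented above (coverage, overlap, and conflict) can be reproduced directly with NumPy. The following is a minimal sketch for intuition, not the library's implementation; the ABSTAIN constant and the variable names are assumptions of the sketch.

```python
import numpy as np

ABSTAIN = -1  # abstain value, as stated in the description of L

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])

voted = L != ABSTAIN          # True where an LF emitted a label
n_votes = voted.sum(axis=1)   # number of non-abstain labels per data point

coverage = (n_votes >= 1).mean()  # at least one label
overlap = (n_votes >= 2).mean()   # at least two labels

# A data point conflicts when its non-abstain labels are not all equal.
conflict = np.mean([np.unique(row[row != ABSTAIN]).size > 1 for row in L])

print(coverage, overlap, conflict)
```

On this matrix the three values agree with the doctest outputs shown for label_coverage(), label_overlap(), and label_conflict().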

lf_conflicts(normalize_by_overlaps=False)[source]

Compute the fraction of examples each LF labels that another LF labels differently.

A conflicting example is one that at least one other LF returns a different (non-abstain) label for.

Note that the maximum possible conflict fraction for an LF is the LF’s overlaps fraction, unless normalize_by_overlaps=True, in which case it is 1.

Parameters

normalize_by_overlaps (bool) – Normalize by the overlaps of each LF, so that the result is the fraction of that LF's overlaps that have conflicts.

Returns

Fraction of conflicting examples for each LF

Return type

numpy.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_conflicts()
array([0.2, 0.2, 0. ])
>>> LFAnalysis(L).lf_conflicts(normalize_by_overlaps=True)
array([0.5       , 0.33333333, 0.        ])
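
To make the definition of a conflicting example concrete, here is a hedged NumPy sketch of the per-LF computation. It is an illustration only: the function name lf_conflicts_sketch, the ABSTAIN constant, and the choice to return 0 for an LF with no overlaps are assumptions, not the library's code.

```python
import numpy as np

ABSTAIN = -1  # abstain value, as stated in the description of L


def lf_conflicts_sketch(L, normalize_by_overlaps=False):
    """Per-LF fraction of data points that another LF labels differently."""
    voted = L != ABSTAIN
    m = L.shape[1]
    conflicts = np.zeros(m)
    for j in range(m):
        others = [i for i in range(m) if i != j]
        # Another LF voted AND its label differs from LF j's label.
        differs = voted[:, others] & (L[:, others] != L[:, [j]])
        conflicts[j] = (voted[:, j] & differs.any(axis=1)).mean()
    if normalize_by_overlaps:
        # LF j overlaps on a point if it voted and at least two LFs voted there.
        shared = voted.sum(axis=1, keepdims=True) >= 2
        overlaps = (voted & shared).mean(axis=0)
        # Assumption of this sketch: report 0 when an LF has no overlaps.
        return np.divide(conflicts, overlaps, out=np.zeros(m), where=overlaps > 0)
    return conflicts


L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
print(lf_conflicts_sketch(L))
print(lf_conflicts_sketch(L, normalize_by_overlaps=True))
```

Because a conflict requires an overlap, the unnormalized value for an LF can never exceed its overlap fraction, which is why the normalized variant is bounded by 1.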

lf_coverages()[source]

Compute the fraction of examples each LF labels.

Returns

Fraction of labeled examples for each LF

Return type

numpy.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_coverages()
array([0.4, 0.8, 0.4])

lf_empirical_accuracies(Y)[source]

Compute empirical accuracy against a set of labels Y for each LF.

Usually, Y represents development set labels.

Parameters

Y (ndarray) – [n] or [n, 1] np.ndarray of gold labels

Returns

Empirical accuracies for each LF

Return type

numpy.ndarray
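
No example is given for this method, so the following NumPy sketch illustrates one plausible reading of "empirical accuracy". It assumes accuracy is measured only over the data points an LF actually labels (abstains are ignored) and that an LF which never votes gets 0. The gold labels Y and all names below are hypothetical, chosen only for illustration.

```python
import numpy as np

ABSTAIN = -1  # abstain value, as stated in the description of L

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
Y = np.array([0, 1, 1, 0, 0])  # hypothetical gold labels, one per data point

voted = L != ABSTAIN
correct = voted & (L == Y[:, None])  # non-abstain votes that match the gold label
n_voted = voted.sum(axis=0)

# Accuracy over non-abstain votes; 0 for an LF that never votes (sketch assumption).
acc = np.divide(
    correct.sum(axis=0),
    n_voted,
    out=np.zeros(L.shape[1]),
    where=n_voted > 0,
)
print(acc)
```

With these hypothetical labels, the second LF is wrong on one of its four votes, so its accuracy is lower than that of the other two.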

lf_empirical_probs(Y, k)[source]

Estimate conditional probability tables for each LF.

Computes conditional probability tables, P(L | Y), for each LF using the provided true labels Y.

Parameters
  • Y (ndarray) – The length-n array of true labels in {1,…,k}

  • k (int) – The cardinality i.e. number of classes

Returns

An np.ndarray of shape (m, k+1, k) holding one (k+1) x k conditional probability table P_i per LF, where P_i[l,y] is the empirical estimate of P(LF_i = l | Y = y)

Return type

np.ndarray
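
The shape of the returned tables is easier to see in a small sketch. The version below is an assumption-laden illustration, not the library's code: it assumes class labels 0,…,k-1 with -1 for abstain (matching the example matrices elsewhere in this document; shift the labels if yours are in {1,…,k}), and it stores the abstain outcome in row 0, so that outcome l sits at row index l + 1. The function name and the gold labels Y are hypothetical.

```python
import numpy as np

ABSTAIN = -1  # abstain value, as stated in the description of L


def empirical_probs_sketch(L, Y, k):
    """Return an (m, k+1, k) array of empirical P(LF_j = l | Y = y)."""
    m = L.shape[1]
    tables = np.zeros((m, k + 1, k))
    for y in range(k):
        is_y = Y == y
        if is_y.sum() == 0:
            continue  # no data points of class y: leave that column at zero
        for j in range(m):
            for l in range(ABSTAIN, k):
                # Fraction of class-y points on which LF j outputs l.
                tables[j, l + 1, y] = np.mean(L[is_y, j] == l)
    return tables


L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])
Y = np.array([0, 1, 1, 0, 0])  # hypothetical gold labels
tables = empirical_probs_sketch(L, Y, k=2)
print(tables.shape)
print(tables[0])  # table for the first LF
```

Each column of a table is a probability distribution over the k + 1 possible outputs of that LF (abstain plus k classes), so it sums to 1 whenever the corresponding class appears in Y.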

lf_overlaps(normalize_by_coverage=False)[source]

Compute the fraction of examples each LF labels that are also labeled by another LF.

An overlapping example is one that at least one other LF returns a (non-abstain) label for.

Note that the maximum possible overlap fraction for an LF is the LF’s coverage, unless normalize_by_coverage=True, in which case it is 1.

Parameters

normalize_by_coverage (bool) – Normalize by the coverage of each LF, so that the result is the fraction of that LF's labels that have overlaps.

Returns

Fraction of overlapping examples for each LF

Return type

numpy.ndarray

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_overlaps()
array([0.4, 0.6, 0.4])
>>> LFAnalysis(L).lf_overlaps(normalize_by_coverage=True)
array([1.  , 0.75, 1.  ])
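
The relationship between per-LF coverage and overlap can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the library's implementation; the ABSTAIN constant and variable names are assumptions, and the final division assumes every LF labels at least one data point.

```python
import numpy as np

ABSTAIN = -1  # abstain value, as stated in the description of L

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])

voted = L != ABSTAIN
coverages = voted.mean(axis=0)  # fraction of data points each LF labels

# A vote overlaps when at least two LFs voted on the same data point.
shared = voted.sum(axis=1, keepdims=True) >= 2
overlaps = (voted & shared).mean(axis=0)

# Fraction of each LF's own labels that overlap with another LF's label.
normalized = overlaps / coverages

print(coverages, overlaps, normalized)
```

Since an LF can only overlap on points it labels, overlaps is bounded by coverages elementwise, and the normalized values lie in [0, 1].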

lf_polarities()[source]

Infer the polarities of each LF based on evidence in a label matrix.

Returns

Unique output labels for each LF

Return type

List[List[int]]

Example

>>> L = np.array([
...     [-1, 0, 0],
...     [-1, -1, -1],
...     [1, 0, -1],
...     [-1, 0, -1],
...     [0, 0, 0],
... ])
>>> LFAnalysis(L).lf_polarities()
[[0, 1], [0], [0]]
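
Polarity here means the set of distinct non-abstain labels an LF emits. A minimal NumPy sketch of that idea, with the ABSTAIN constant and variable names as assumptions of the sketch:

```python
import numpy as np

ABSTAIN = -1  # abstain value, as stated in the description of L

L = np.array([
    [-1, 0, 0],
    [-1, -1, -1],
    [1, 0, -1],
    [-1, 0, -1],
    [0, 0, 0],
])

# For each LF (column), collect the sorted unique labels it emitted, ignoring abstains.
polarities = [np.unique(col[col != ABSTAIN]).tolist() for col in L.T]
print(polarities)
```

An LF with more than one polarity, like the first column here, can vote for different classes on different data points.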

lf_summary(Y=None, est_weights=None)[source]

Create a pandas DataFrame with the various per-LF statistics.

Parameters
  • Y (Optional[ndarray]) – [n] or [n, 1] np.ndarray of gold labels. If provided, the empirical accuracy for each LF will be calculated.

  • est_weights (Optional[ndarray]) – Learned weights for each LF

Returns

Summary statistics for each LF

Return type

pandas.DataFrame