Candidates

To apply machine learning (here, a classifier) to information extraction problems, we need a base set of objects to classify. In Snorkel, these are the Candidate subclasses, which are defined over Context arguments and represent potential mentions to extract. We use Matcher operators to extract a set of Candidate objects from the input data.

Core Data Models

class snorkel.models.candidate.Candidate(**kwargs)[source]

An abstract candidate relation.

New relation types should be defined by calling candidate_subclass(), not subclassing this class directly.

class snorkel.models.candidate.CandidateSet(**kwargs)[source]

A set of Candidates, uniquely identified by a name.

CandidateSets have many-to-many relationships with Candidates, so users can create new subsets, supersets, etc.

stats(gold_set=None)[source]

Print diagnostic stats about CandidateSet.

snorkel.models.candidate.candidate_subclass(class_name, args, table_name=None)[source]

Creates and returns a Candidate subclass with the provided argument names, each of which is of type Context. Creates the corresponding table in the database if it does not exist yet.

Import using:

from snorkel.models import candidate_subclass
Parameters:
  • class_name – The name of the class, which should be in camel case, e.g. NewCandidateClass
  • args – A list of names of constituent arguments, which refer to the Contexts (representing mentions) that comprise the candidate
  • table_name – The name of the corresponding table in the database; if not provided, it is converted from the camel-case class name by default, e.g. new_candidate_class
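
The default table name described above can be illustrated with a small, self-contained sketch. This is not Snorkel's own conversion code; the helper name camel_to_snake and the regular expression are assumptions chosen only to reproduce the documented example.

```python
import re


def camel_to_snake(name):
    """Convert a camel-case class name to an underscore-separated name.

    Hypothetical helper for illustration; not part of Snorkel's API.
    """
    # Insert an underscore before each capital letter that follows a
    # lowercase letter or digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


print(camel_to_snake("NewCandidateClass"))  # new_candidate_class
```

Under this convention, a class named NewCandidateClass would be stored in a table named new_candidate_class unless table_name is given explicitly.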

Core Objects for Candidate Extraction

class snorkel.candidates.CandidateExtractor(candidate_class, cspaces, matchers, self_relations=False, nested_relations=False, symmetric_relations=True)[source]

An operator to extract Candidate objects from Context objects.

Parameters:
  • candidate_class – The type of relation to extract, defined using snorkel.models.candidate_subclass
  • cspaces – A single CandidateSpace object or a list of them, one for each relation argument. Defines the space of Contexts to consider
  • matchers – A single snorkel.matchers.Matcher object or a list of them, one for each relation argument. Only tuples of Contexts for which each element is accepted by the corresponding Matcher will be returned as Candidates
  • self_relations – Boolean indicating whether to extract Candidates that relate the same context. Only applies to binary relations. Default is False.
  • nested_relations – Boolean indicating whether to extract Candidates that relate one Context with another that contains it. Only applies to binary relations. Default is False.
  • symmetric_relations – Boolean indicating whether to extract symmetric Candidates, i.e., rel(A,B) and rel(B,A), where A and B are Contexts. Only applies to binary relations. Default is True.
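
The effect of the three boolean flags on binary relations can be sketched in plain Python. The code below is a simplified illustration and not the CandidateExtractor implementation: spans are modelled as (start, end) character offsets, both relation arguments are assumed to draw from the same list of matched spans, and the names contains and extract_pairs are hypothetical.

```python
from itertools import product


def contains(outer, inner):
    """True if span `inner` lies inside span `outer`; spans are (start, end)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def extract_pairs(spans, self_relations=False, nested_relations=False,
                  symmetric_relations=True):
    """Return ordered pairs of spans, filtered by the three flags."""
    indexed = list(enumerate(spans))
    pairs = []
    for (i, a), (j, b) in product(indexed, repeat=2):
        if not self_relations and a == b:
            continue  # drop rel(A, A)
        # Assumption of this sketch: identical spans are handled by
        # self_relations only, so they are excluded from the nesting check.
        if not nested_relations and a != b and (contains(a, b) or contains(b, a)):
            continue  # drop pairs where one span contains the other
        if not symmetric_relations and i > j:
            continue  # keep rel(A, B) but drop the flipped rel(B, A)
        pairs.append((a, b))
    return pairs


spans = [(0, 5), (0, 11), (20, 25)]  # (0, 5) is nested inside (0, 11)
print(len(extract_pairs(spans)))                             # 4
print(len(extract_pairs(spans, symmetric_relations=False)))  # 2
```

With the defaults, the nested pair is dropped in both orders, leaving four ordered pairs; turning off symmetric_relations keeps only one ordering of each remaining pair.
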
class snorkel.candidates.CandidateSpace[source]

Defines the space of candidate objects. Calling _apply(x)_ given an object _x_ returns a generator over candidates in _x_.

class snorkel.candidates.Ngrams(n_max=5, split_tokens=['-', '/'])[source]

Defines the space of candidates as all n-grams (n <= n_max) in a Sentence _x_, indexing by character offset.
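
To make "all n-grams indexed by character offset" concrete, here is a minimal sketch. It assumes whitespace tokenization and ignores split_tokens entirely, so it illustrates the idea rather than the behavior of the Ngrams class; the function name ngrams and the (start, end, text) tuple layout are assumptions.

```python
import re


def ngrams(sentence, n_max=5):
    """Yield (char_start, char_end, text) for every n-gram with n <= n_max."""
    # Token boundaries as character offsets. Whitespace tokenization is an
    # assumption of this sketch.
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", sentence)]
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            start = tokens[i][0]
            end = tokens[i + n - 1][1]  # exclusive end offset
            yield start, end, sentence[start:end]


spans = list(ngrams("Barack Obama spoke", n_max=2))
print(spans)
# [(0, 6, 'Barack'), (7, 12, 'Obama'), (13, 18, 'spoke'),
#  (0, 12, 'Barack Obama'), (7, 18, 'Obama spoke')]
```

Indexing by character offset means each candidate span can be recovered by slicing the original sentence text, independent of how it was tokenized.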

snorkel.candidates.gold_stats(candidates, gold)[source]

Returns precision and recall relative to a “gold” CandidateSet.
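
For reference, precision and recall against a gold set can be computed as in the sketch below. It uses plain Python sets in place of CandidateSet objects, and the function name precision_recall is hypothetical; it shows the standard definitions, not the body of gold_stats.

```python
def precision_recall(candidates, gold):
    """Return (precision, recall) of `candidates` relative to `gold`."""
    candidates, gold = set(candidates), set(gold)
    true_positives = len(candidates & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


precision, recall = precision_recall({"a", "b", "c", "d"}, {"b", "c", "e"})
print(precision)  # 0.5
```

Here two of the four extracted candidates are in the gold set (precision 0.5), and two of the three gold items were extracted (recall 2/3).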

class snorkel.matchers.Concat(*children, **opts)[source]

Selects candidates which are the concatenation of adjacent matches from child operators. NOTE: Currently slices on word index and considers concatenation along these divisions only.

class snorkel.matchers.DateMatcher(**kwargs)[source]

Matches Spans that are dates, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a date.

class snorkel.matchers.DictionaryMatch(*children, **opts)[source]

Selects candidate Ngrams that match against a given list d

class snorkel.matchers.LambdaFunctionMatch(*children, **opts)[source]

Selects candidate Ngrams for which a given function returns True.

class snorkel.matchers.LocationMatcher(**kwargs)[source]

Matches Spans that are the names of locations, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a location.

class snorkel.matchers.Matcher(*children, **opts)[source]

Applies a function f : c -> {True,False} to a generator of candidates, returning only candidates _c_ such that _f(c) == True_, where f can be compositionally defined.

apply(candidates)[source]

Apply the Matcher to a generator of candidates. Optionally only takes the longest match (NOTE: assumes this is the first match).

f(c)[source]

The recursively composed version of filter function f. By default, returns the logical conjunction of the operator and its single child operator.
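
The composition described for apply and f can be sketched with a small stand-alone class. This is an illustration of the pattern, not the snorkel.matchers.Matcher source: the per-operator hook name _f, the example subclasses, and the use of plain strings as candidates are all assumptions, and the longest-match option is omitted.

```python
class Matcher:
    """Filter candidates with a predicate, optionally composed with one child."""

    def __init__(self, *children):
        self.children = children

    def _f(self, c):
        # Per-operator test; subclasses override this. Accepts everything
        # by default. (Hypothetical hook name for this sketch.)
        return True

    def f(self, c):
        # Conjunction of this operator's test and its single child's f.
        if len(self.children) == 0:
            return self._f(c)
        if len(self.children) == 1:
            return self._f(c) and self.children[0].f(c)
        raise ValueError("this sketch supports at most one child")

    def apply(self, candidates):
        # Lazily yield only the candidates accepted by f.
        for c in candidates:
            if self.f(c):
                yield c


class Capitalized(Matcher):
    def _f(self, c):
        return c[:1].isupper()


class ShorterThan(Matcher):
    def __init__(self, limit, *children):
        super().__init__(*children)
        self.limit = limit

    def _f(self, c):
        return len(c) < self.limit


matcher = ShorterThan(6, Capitalized())
print(list(matcher.apply(["Obama", "spoke", "Washington", "Ohio"])))
# ['Obama', 'Ohio']
```

Because f recurses into the child, nesting operators narrows the accepted set: a candidate must pass every operator in the chain.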

class snorkel.matchers.MiscMatcher(**kwargs)[source]

Matches Spans that are miscellaneous named entities, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as miscellaneous.

class snorkel.matchers.NgramMatcher(*children, **opts)[source]

Matcher base class for Ngram objects

class snorkel.matchers.NumberMatcher(**kwargs)[source]

Matches Spans that are numbers, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a number.

class snorkel.matchers.OrganizationMatcher(**kwargs)[source]

Matches Spans that are the names of organizations, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as an organization.

class snorkel.matchers.PersonMatcher(**kwargs)[source]

Matches Spans that are the names of people, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a person.

class snorkel.matchers.RegexMatch(*children, **opts)[source]

Base regex class; does not yet specify the semantics of what is being matched.

class snorkel.matchers.RegexMatchEach(*children, **opts)[source]

Matches regex pattern on each token

class snorkel.matchers.RegexMatchSpan(*children, **opts)[source]

Matches regex pattern on full concatenated span
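
The difference between matching each token and matching the full concatenated span can be shown with the standard re module. The functions below are a hedged sketch rather than the Snorkel classes: full-string matching, the single-space separator, and the function names are assumptions.

```python
import re


def regex_match_each(pattern, tokens):
    """True if every token, taken on its own, fully matches the pattern."""
    rgx = re.compile(pattern)
    return all(rgx.fullmatch(t) is not None for t in tokens)


def regex_match_span(pattern, tokens, sep=" "):
    """True if the tokens joined into one string fully match the pattern."""
    rgx = re.compile(pattern)
    return rgx.fullmatch(sep.join(tokens)) is not None


tokens = ["March", "2015"]
print(regex_match_each(r"\w+", tokens))                # True
print(regex_match_span(r"\w+", tokens))                # False
print(regex_match_span(r"[A-Z][a-z]+ \d{4}", tokens))  # True
```

The per-token form suits tests such as "every token carries a given tag", while the span form suits patterns that describe the shape of the whole mention.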

class snorkel.matchers.SlotFillMatch(*children, **opts)[source]

Matches a slot fill pattern of matchers _at the character level_

class snorkel.matchers.Union(*children, **opts)[source]

Takes the union of candidate sets returned by child operators
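
A union of child operators accepts a candidate if at least one child does. The sketch below expresses this with plain predicate functions over strings; the names union, is_year, and is_capitalized are hypothetical and stand in for child matchers.

```python
def union(*predicates):
    """Return a predicate that accepts a candidate if any child accepts it."""
    def accepts(c):
        return any(p(c) for p in predicates)
    return accepts


def is_year(c):
    return c.isdigit() and len(c) == 4


def is_capitalized(c):
    return c[:1].isupper()


either = union(is_year, is_capitalized)
print([c for c in ["2015", "Obama", "spoke"] if either(c)])
# ['2015', 'Obama']
```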