Candidates

To apply machine learning (in this case, a classifier) to information extraction problems, we need a base set of objects to classify. In Snorkel, these are the Candidate subclasses, which are defined over Context arguments and represent potential mentions to extract. We use Matcher operators to extract a set of Candidate objects from the input data.

Core Data Models

class snorkel.models.candidate.Candidate(**kwargs)[source]

An abstract candidate relation.

New relation types should be defined by calling candidate_subclass(), not subclassing this class directly.

get_cids()[source]

Get a tuple of the canonical IDs (CIDs) of the contexts making up this candidate

get_contexts()[source]

Get a tuple of the constituent contexts making up this candidate

class snorkel.models.candidate.Marginal(**kwargs)[source]

A marginal probability corresponding to a (Candidate, value) pair.

Represents:

P(candidate = value) = probability

@training: If True, this is a training marginal; otherwise, it is an end prediction.

snorkel.models.candidate.candidate_subclass(class_name, args, table_name=None, cardinality=None, values=None)[source]

Creates and returns a Candidate subclass with the provided argument names, each of which is of Context type. Creates the table in the DB if it does not exist yet.

Import using:

from snorkel.models import candidate_subclass

Parameters:

class_name – The name of the class; should be “camel case”, e.g. NewCandidate

args – A list of names of the constituent arguments, which refer to the Contexts (representing mentions) that comprise the candidate

table_name – The name of the corresponding table in the DB; if not provided, it is converted from camel case by default, e.g. new_candidate

cardinality – The cardinality of the variable corresponding to the Candidate. Defaults to 2, i.e. a binary value, e.g. is or is not a true mention.
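
The default table_name conversion described above can be illustrated with a short sketch. This is a stand-alone approximation, not Snorkel's own code: the helper name camel_to_snake is hypothetical, and the library's actual conversion may treat edge cases (such as runs of capital letters) differently.

```python
import re


def camel_to_snake(class_name):
    """Hypothetical helper: turn a camel-case class name into a snake-case table name."""
    # Insert an underscore between a lowercase letter or digit and the capital
    # that follows it, then lowercase the whole string.
    return re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", class_name).lower()


# The example name used in the parameter description above.
default_table = camel_to_snake("NewCandidate")
```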

Core Objects for Candidate Extraction

class snorkel.candidates.CandidateExtractor(candidate_class, cspaces, matchers, self_relations=False, nested_relations=False, symmetric_relations=False)[source]

An operator to extract Candidate objects from a Context.

Parameters:

candidate_class – The type of relation to extract, defined using snorkel.models.candidate_subclass

cspaces – One or a list of CandidateSpace objects, one for each relation argument. Defines the space of Contexts to consider

matchers – One or a list of snorkel.matchers.Matcher objects, one for each relation argument. Only tuples of Contexts for which each element is accepted by the corresponding Matcher will be returned as Candidates

self_relations – Boolean indicating whether to extract Candidates that relate the same context. Only applies to binary relations. Default is False.

nested_relations – Boolean indicating whether to extract Candidates that relate one Context with another that contains it. Only applies to binary relations. Default is False.

symmetric_relations – Boolean indicating whether to extract symmetric Candidates, i.e., rel(A,B) and rel(B,A), where A and B are Contexts. Only applies to binary relations. Default is False.

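
The effect of the three boolean flags on a binary relation can be sketched in plain Python. This is an illustration under stated assumptions rather than the extractor's actual code: both arguments are drawn from the same list of spans, each span is a (char_start, char_end) tuple, and the names contains and candidate_pairs are hypothetical.

```python
from itertools import product


def contains(outer, inner):
    """True if span `outer` fully covers span `inner` (both are (start, end) tuples)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def candidate_pairs(spans, self_relations=False, nested_relations=False,
                    symmetric_relations=False):
    """Hypothetical helper: enumerate ordered pairs of spans, filtered by the three flags."""
    pairs = []
    for (i, a), (j, b) in product(enumerate(spans), repeat=2):
        # self_relations: a pair relating a context to itself
        if not self_relations and a == b:
            continue
        # nested_relations: a pair in which one context contains the other
        if not nested_relations and a != b and (contains(a, b) or contains(b, a)):
            continue
        # symmetric_relations: keep rel(B, A) in addition to rel(A, B)
        if not symmetric_relations and i > j:
            continue
        pairs.append((a, b))
    return pairs


spans = [(0, 4), (0, 10), (12, 15)]  # (0, 4) is nested inside (0, 10)
default_pairs = candidate_pairs(spans)
with_symmetric = candidate_pairs(spans, symmetric_relations=True)
```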
class snorkel.candidates.CandidateSpace[source]

Defines the space of candidate objects. Calling _apply(x)_ on an object _x_ returns a generator over the candidates in _x_.

class snorkel.candidates.Ngrams(n_max=5, split_tokens=('-', '/'))[source]

Defines the space of candidates as all n-grams (n <= n_max) in a Sentence _x_, indexing by character offset.
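
To make the n-gram space concrete, here is a minimal stand-alone sketch that enumerates every n-gram (n <= n_max) of a sentence together with its character offsets. It assumes whitespace tokenization and an exclusive end offset, and it ignores split_tokens; the function name ngram_spans is hypothetical and not part of Snorkel.

```python
import re


def ngram_spans(text, n_max=5):
    """Yield (char_start, char_end, span_text) for every n-gram with n <= n_max."""
    # Character offsets of each whitespace-delimited token: (start, exclusive end).
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            start = tokens[i][0]
            end = tokens[i + n - 1][1]
            yield start, end, text[start:end]


sentence = "Barack Obama visited Paris"
spans = list(ngram_spans(sentence, n_max=2))
```

With four tokens and n_max=2, the sketch produces four unigrams and three bigrams.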

class snorkel.candidates.PretaggedCandidateExtractor(candidate_class, entity_types, self_relations=False, nested_relations=False, symmetric_relations=True, entity_sep='~@~')[source]

UDFRunner for PretaggedCandidateExtractorUDF

class snorkel.candidates.PretaggedCandidateExtractorUDF(candidate_class, entity_types, self_relations=False, nested_relations=False, symmetric_relations=False, entity_sep='~@~', **kwargs)[source]

An extractor for Sentences whose entities are pre-tagged and stored in the entity_types and entity_cids fields.

apply(context, clear, split, check_for_existing=True, **kwargs)[source]

Extract Candidates from a Context

class snorkel.matchers.Concat(*children, **opts)[source]

Selects candidates which are the concatenation of adjacent matches from child operators. NOTE: Currently slices on word index and considers concatenation along these divisions only.
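
One way to read the description above is that a span matches when it can be cut at some word boundary into a left part accepted by the first child and a right part accepted by the second. The sketch below illustrates that reading for two children only. It is an assumption about the behavior, not the library's code, and concat_match, is_capitalized and is_company_suffix are hypothetical names.

```python
def concat_match(words, left, right):
    """True if `words` splits at a word boundary into a prefix accepted by
    `left` and a suffix accepted by `right` (both parts non-empty)."""
    return any(left(words[:i]) and right(words[i:]) for i in range(1, len(words)))


def is_capitalized(words):
    """Child test 1: every word starts with an uppercase letter."""
    return all(w[:1].isupper() for w in words)


def is_company_suffix(words):
    """Child test 2: the part is exactly the single token 'Inc.'."""
    return words == ["Inc."]


matched = concat_match(["Acme", "Widgets", "Inc."], is_capitalized, is_company_suffix)
```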

class snorkel.matchers.DateMatcher(*children, **kwargs)[source]

Matches Spans that are dates, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a date.
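
The idea behind this kind of convenience matcher, accepting a span only when every one of its tokens carries the expected named-entity tag, can be sketched without Snorkel. The example assumes each span exposes a list of per-token NER tags such as "DATE" or "O"; the function name each_tag_matches is hypothetical.

```python
import re


def each_tag_matches(ner_tags, pattern):
    """True if the span is non-empty and every token's NER tag fully matches `pattern`."""
    return bool(ner_tags) and all(re.fullmatch(pattern, tag) for tag in ner_tags)


# "March 3 , 2015" with every token tagged as a date
all_date = each_tag_matches(["DATE", "DATE", "DATE", "DATE"], "DATE")
# "on March 3" where the first token is not part of the date
mixed = each_tag_matches(["O", "DATE", "DATE"], "DATE")
```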

class snorkel.matchers.DictionaryMatch(*children, **opts)[source]

Selects candidate Ngrams that match against a given list d

class snorkel.matchers.LambdaFunctionMatcher(*children, **opts)[source]

Selects candidate Ngrams that return True when fed to a function f.
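
Dictionary-based and function-based matching are both simple filters over a stream of candidate n-grams. The following stand-alone sketch treats candidates as plain strings; the names dictionary_match and lambda_match, and the ignore_case flag as used here, are assumptions for illustration rather than the library's interface.

```python
def dictionary_match(spans, d, ignore_case=True):
    """Yield the spans whose text appears in the list d."""
    entries = {e.lower() for e in d} if ignore_case else set(d)
    for s in spans:
        key = s.lower() if ignore_case else s
        if key in entries:
            yield s


def lambda_match(spans, func):
    """Yield the spans for which func(span) is truthy."""
    for s in spans:
        if func(s):
            yield s


spans = ["aspirin", "Paris", "ibuprofen tablets", "Ibuprofen"]
drugs = ["Aspirin", "Ibuprofen"]

in_dictionary = list(dictionary_match(spans, drugs))
multi_word = list(lambda_match(spans, lambda s: " " in s))
```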

class snorkel.matchers.LocationMatcher(*children, **kwargs)[source]

Matches Spans that are the names of locations, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a location.

class snorkel.matchers.Matcher(*children, **opts)[source]

Applies a function f : c -> {True,False} to a generator of candidates, returning only candidates _c_ s.t. _f(c) == True_, where f can be compositionally defined.

apply(candidates)[source]

Apply the Matcher to a generator of candidates. Optionally only takes the longest match (NOTE: assumes this is the first match).

f(c)[source]

The recursively composed version of the filter function f. By default, returns the logical conjunction of this operator and its single child operator.
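
The composition described above, where an operator's filter is combined by logical conjunction with that of a single child, can be sketched as follows. This is a minimal illustration over plain strings, not the Matcher implementation; the class names SketchMatcher, MinLength and StartsUpper and the method own_test are hypothetical.

```python
class SketchMatcher:
    """Hypothetical stand-in for a matcher: a filter that may wrap one child filter."""

    def __init__(self, *children):
        if len(children) > 1:
            raise ValueError("this sketch supports at most one child")
        self.children = children

    def own_test(self, c):
        """This operator's own test; subclasses override it."""
        return True

    def f(self, c):
        """Conjunction of this operator's test and its child's f, if there is a child."""
        return self.own_test(c) and all(child.f(c) for child in self.children)

    def apply(self, candidates):
        """Lazily yield only the candidates c for which f(c) is True."""
        for c in candidates:
            if self.f(c):
                yield c


class MinLength(SketchMatcher):
    def __init__(self, n, *children):
        super().__init__(*children)
        self.n = n

    def own_test(self, c):
        return len(c) >= self.n


class StartsUpper(SketchMatcher):
    def own_test(self, c):
        return c[:1].isupper()


# Keep candidates that start with a capital AND have at least four characters.
matcher = StartsUpper(MinLength(4))
kept = list(matcher.apply(iter(["Paris", "a", "Rome", "london", "Berlin"])))
```

Because apply is a generator, candidates are filtered lazily as they are consumed.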

class snorkel.matchers.MiscMatcher(*children, **kwargs)[source]

Matches Spans that are miscellaneous named entities, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as miscellaneous.

class snorkel.matchers.NgramMatcher(*children, **opts)[source]

Matcher base class for Ngram objects

class snorkel.matchers.NumberMatcher(*children, **kwargs)[source]

Matches Spans that are numbers, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a number.

class snorkel.matchers.OrganizationMatcher(*children, **kwargs)[source]

Matches Spans that are the names of organizations, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as an organization.

class snorkel.matchers.PersonMatcher(*children, **kwargs)[source]

Matches Spans that are the names of people, as identified by CoreNLP.

A convenience class for setting up a RegexMatchEach to match spans for which each token was tagged as a person.

class snorkel.matchers.RegexMatch(*children, **opts)[source]

Base regex class; does not yet specify the semantics of what is being matched.

class snorkel.matchers.RegexMatchEach(*children, **opts)[source]

Matches regex pattern on each token

class snorkel.matchers.RegexMatchSpan(*children, **opts)[source]

Matches regex pattern on full concatenated span
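
The difference between matching a pattern on each token and matching it on the full concatenated span is easiest to see side by side. The sketch below uses only the standard re module and assumes the pattern must match the whole token or span, with tokens joined by a single space; whether the library anchors patterns this way is not stated here, and the function names are hypothetical.

```python
import re


def match_each(pattern, tokens):
    """True if every token, taken on its own, fully matches the pattern."""
    return bool(tokens) and all(re.fullmatch(pattern, t) for t in tokens)


def match_span(pattern, tokens, sep=" "):
    """True if the tokens joined into one string fully match the pattern."""
    return re.fullmatch(pattern, sep.join(tokens)) is not None


tokens = ["New", "York"]
word = r"[A-Z][a-z]+"               # one capitalized word
two_words = r"[A-Z][a-z]+ [A-Z][a-z]+"  # two capitalized words separated by a space

each_ok = match_each(word, tokens)       # each token is a capitalized word
span_ok = match_span(two_words, tokens)  # the joined span is two capitalized words
```

The same single-word pattern fails at the span level because the joined string contains a space, and the two-word pattern fails at the token level because no single token contains one.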

class snorkel.matchers.SlotFillMatch(*children, **opts)[source]

Matches a slot fill pattern of matchers _at the character level_

class snorkel.matchers.Union(*children, **opts)[source]

Takes the union of candidate sets returned by child operators
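
A union over child operators can be pictured as merging the candidates each child accepts while dropping duplicates. The sketch below works on plain hashable values and keeps first-seen order; the function name union_of and the ordering choice are assumptions for illustration, not a description of the library's internals.

```python
def union_of(*candidate_streams):
    """Yield every candidate produced by any stream, once, in first-seen order."""
    seen = set()
    for stream in candidate_streams:
        for c in stream:
            if c not in seen:
                seen.add(c)
                yield c


people = ["Ada Lovelace", "Alan Turing"]
authors = ["Alan Turing", "Grace Hopper"]
merged = list(union_of(people, authors))
```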