snorkel.preprocess.nlp.SpacyPreprocessor

class snorkel.preprocess.nlp.SpacyPreprocessor(text_field, doc_field, language='en_core_web_sm', disable=None, pre=None, memoize=False, memoize_key=None, gpu=False)

Bases: snorkel.preprocess.core.Preprocessor

Preprocessor that parses input text via a SpaCy model.

A common approach to writing LFs over text is to first use a natural language parser to decompose the text into tokens, part-of-speech tags, etc. SpaCy (https://spacy.io/) is a popular tool for doing this. This preprocessor adds a SpaCy Doc object to the data point. A Doc object is a sequence of Token objects, which contain information on lemmatization, parts-of-speech, etc. Doc objects also contain fields like Doc.ents, a list of named entities, and Doc.noun_chunks, a list of noun phrases. For details of SpaCy Doc objects and a full attribute listing, see https://spacy.io/api/doc.

Parameters
  • text_field (str) – Name of the data point field that holds the input text

  • doc_field (str) – Name of the data point field to write the parsed document to

  • language (str) – SpaCy model to load. See https://spacy.io/usage/models#usage

  • disable (Optional[List[str]]) – List of pipeline components to disable. See https://spacy.io/usage/processing-pipelines#disabling

  • pre (Optional[List[BaseMapper]]) – Preprocessors to run before this preprocessor is executed

  • memoize (bool) – Whether to memoize preprocessor outputs

  • memoize_key (Optional[Callable[[Any], Hashable]]) – Hashing function used for memoization (defaults to snorkel.map.core.get_hashable)

  • gpu (bool) – Whether to prefer SpaCy GPU processing
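The parameters above can be exercised with a short usage sketch. It is a minimal illustration, not an official recipe. It assumes snorkel and the en_core_web_sm model are installed, uses the illustrative field names "text" and "doc", and uses a SimpleNamespace as a stand-in data point. The helpers make_spacy_preprocessor, person_mentions, and demo are hypothetical names introduced here.

```python
from types import SimpleNamespace


def make_spacy_preprocessor():
    """Build the preprocessor; imported lazily because it loads a SpaCy model."""
    from snorkel.preprocess.nlp import SpacyPreprocessor

    # "text" and "doc" are illustrative field names, not required values.
    return SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)


def person_mentions(x):
    """Return the text of every PERSON entity in the parsed document on x.doc."""
    return [ent.text for ent in x.doc.ents if ent.label_ == "PERSON"]


def demo():
    """Parse one data point and list its PERSON entities.

    Requires snorkel and the en_core_web_sm model to be installed.
    """
    preprocessor = make_spacy_preprocessor()
    x = SimpleNamespace(text="Jane Doe joined the lab in Boston.")
    x_parsed = preprocessor(x)  # returns a copy with a new "doc" field
    return person_mentions(x_parsed)
```

Calling demo() runs the full pipeline. Which entities come back depends on the loaded SpaCy model, so no output is shown here.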

__init__(text_field, doc_field, language='en_core_web_sm', disable=None, pre=None, memoize=False, memoize_key=None, gpu=False)

Initialize self. See help(type(self)) for accurate signature.

Return type

None

Methods

  • __init__(text_field, doc_field[, language, …]) – Initialize self.

  • reset_cache() – Reset the memoization cache.

  • run(text) – Run the SpaCy model on input text.

__call__(x)

Run mapping function on input data point.

Deep copies the data point first so as not to make accidental in-place changes. If memoize is set to True, an internal cache is checked for results. If no cached results are found, the computed results are added to the cache.

Parameters

x (Any) – Data point to run mapping function on

Returns

Mapped data point of same format but possibly different fields

Return type

DataPoint
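The copy-then-cache behavior described for __call__ can be pictured with a small pure-Python sketch. This is a simplified stand-in, not Snorkel's implementation. The class ToyMemoizedMapper, its run_count counter, its whitespace "parser", and its hashing scheme are all assumptions made for illustration.

```python
import copy
from types import SimpleNamespace


class ToyMemoizedMapper:
    """Simplified stand-in for the copy-then-cache behavior; not Snorkel's code."""

    def __init__(self, memoize=False):
        self.memoize = memoize
        self._cache = {}
        self.run_count = 0  # counts real computations, for illustration only

    def run(self, text):
        """Toy 'parse': split on whitespace instead of calling SpaCy."""
        self.run_count += 1
        return {"doc": text.split()}

    def reset_cache(self):
        """Drop every cached result."""
        self._cache = {}

    def __call__(self, x):
        # Hypothetical hashing scheme: the sorted (field, value) pairs of x.
        key = tuple(sorted(vars(x).items()))
        if self.memoize and key in self._cache:
            return self._cache[key]
        x_mapped = copy.deepcopy(x)  # never mutate the caller's data point
        for name, value in self.run(x.text).items():
            setattr(x_mapped, name, value)
        if self.memoize:
            self._cache[key] = x_mapped
        return x_mapped


mapper = ToyMemoizedMapper(memoize=True)
point = SimpleNamespace(text="Jane plays soccer.")
first = mapper(point)
second = mapper(point)  # served from the cache, run() is not called again
```

In this sketch the second call does not recompute anything, and the original point never gains a doc field because the work is done on a deep copy.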

reset_cache()

Reset the memoization cache.

Return type

None

run(text)

Run the SpaCy model on input text.

Parameters

text (str) – Text of document to parse

Returns

Dictionary with a single key ("doc"), mapping to the parsed SpaCy Doc object

Return type

FieldMap
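The relationship between the "doc" key returned by run and the doc_field constructor argument can be sketched as follows. This is an assumption-laden illustration: store_run_output is a hypothetical helper, the token list stands in for a SpaCy Doc, and the real renaming is handled inside Snorkel's mapper machinery rather than by code like this.

```python
from types import SimpleNamespace


def store_run_output(x, run_output, doc_field):
    """Copy the value under the "doc" key of run_output onto x.<doc_field>."""
    # Hypothetical helper: it only mimics the renaming from "doc" to doc_field.
    setattr(x, doc_field, run_output["doc"])
    return x


x = SimpleNamespace(text="Jane plays soccer.")
run_output = {"doc": ["Jane", "plays", "soccer", "."]}  # stand-in for a SpaCy Doc
x = store_run_output(x, run_output, doc_field="parsed")
```

Here the parsed result ends up on x.parsed, which mirrors how the value returned under "doc" is expected to land in whatever field name was passed as doc_field.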