snorkel.preprocess.nlp.SpacyPreprocessor
class snorkel.preprocess.nlp.SpacyPreprocessor(text_field, doc_field, language='en_core_web_sm', disable=None, pre=None, memoize=False, memoize_key=None, gpu=False)
Bases: snorkel.preprocess.core.Preprocessor
Preprocessor that parses input text via a SpaCy model.
A common approach to writing LFs over text is to first use a natural language parser to decompose the text into tokens, part-of-speech tags, etc. SpaCy (https://spacy.io/) is a popular tool for doing this. This preprocessor adds a SpaCy Doc object to the data point. A Doc object is a sequence of Token objects, which contain information on lemmatization, parts-of-speech, etc. Doc objects also contain fields like Doc.ents, a list of named entities, and Doc.noun_chunks, a list of noun phrases. For details of SpaCy Doc objects and a full attribute listing, see https://spacy.io/api/doc.
Parameters
text_field (str) – Name of data point text field to input
doc_field (str) – Name of data point field to output parsed document to
language (str) – SpaCy model to load. See https://spacy.io/usage/models#usage
disable (Optional[List[str]]) – List of pipeline components to disable. See https://spacy.io/usage/processing-pipelines#disabling
pre (Optional[List[BaseMapper]]) – Preprocessors to run before this preprocessor is executed
memoize (bool) – Memoize preprocessor outputs?
memoize_key (Optional[Callable[[Any], Hashable]]) – Hashing function to handle the memoization (defaults to snorkel.map.core.get_hashable)
gpu (bool) – Prefer SpaCy GPU processing?
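The parameters above are typically used by passing the preprocessor to a labeling function through its pre argument. The following is a minimal sketch rather than a definitive recipe: the field names "text" and "doc", the label values HAM and ABSTAIN, the 20-token threshold, and the helper names vote and build_has_person_lf are all illustrative assumptions. It also assumes that snorkel, spacy, and the en_core_web_sm model are installed; the snorkel imports are deferred into the builder function so the decision rule can be exercised without them.

```python
# Illustrative label values (assumptions, not defined by SpacyPreprocessor).
ABSTAIN = -1
HAM = 0


def vote(num_tokens, entity_labels):
    """Pure decision rule: short texts that mention a person vote HAM."""
    if num_tokens < 20 and "PERSON" in entity_labels:
        return HAM
    return ABSTAIN


def build_has_person_lf():
    """Build a labeling function that reads the parsed Doc from x.doc.

    Imports are deferred so this module loads even when snorkel or the
    SpaCy model is not installed; calling this function requires both.
    """
    from snorkel.labeling import labeling_function
    from snorkel.preprocess.nlp import SpacyPreprocessor

    # Read raw text from x.text, write the parsed Doc to x.doc, and cache
    # results so repeated data points are not parsed again.
    spacy_pre = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

    @labeling_function(pre=[spacy_pre])
    def has_person(x):
        # x.doc is the SpaCy Doc added by the preprocessor.
        return vote(len(x.doc), [ent.label_ for ent in x.doc.ents])

    return has_person
```

Keeping the decision rule in a plain function makes it easy to unit test separately from the SpaCy parse, which is the slow part of the pipeline.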
__init__(text_field, doc_field, language='en_core_web_sm', disable=None, pre=None, memoize=False, memoize_key=None, gpu=False)
Initialize self. See help(type(self)) for accurate signature.
Return type
None
Methods
__init__(text_field, doc_field[, language, …]) – Initialize self.
Reset the memoization cache.
run(text) – Run the SpaCy model on input text.
__call__(x)
Run mapping function on input data point.
Deep copies the data point first so as not to make accidental in-place changes. If memoize is set to True, an internal cache is checked for results. If no cached results are found, the computed results are added to the cache.
Parameters
x (Any) – Data point to run mapping function on
Returns
Mapped data point of same format but possibly different fields
Return type
DataPoint
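The copy-then-cache behavior described for __call__ can be illustrated with a small standard-library sketch. This is not Snorkel's implementation: the class MemoizedMapperSketch, the calls counter, the use of repr as the hashing function, and the add_tokens mapping function are all hypothetical stand-ins chosen for the example.

```python
import copy
from types import SimpleNamespace


class MemoizedMapperSketch:
    """Hypothetical stand-in that mimics the documented __call__ behavior."""

    def __init__(self, fn, memoize=False, memoize_key=repr):
        self.fn = fn
        self.memoize = memoize
        self.memoize_key = memoize_key  # maps a data point to a hashable key
        self._cache = {}
        self.calls = 0  # number of times fn actually ran (for demonstration)

    def __call__(self, x):
        key = None
        if self.memoize:
            key = self.memoize_key(x)
            if key in self._cache:
                return self._cache[key]  # cache hit: skip the computation
        # Work on a deep copy so the caller's data point is never mutated.
        x_mapped = self.fn(copy.deepcopy(x))
        self.calls += 1
        if self.memoize:
            self._cache[key] = x_mapped  # cache miss: store the new result
        return x_mapped


def add_tokens(x):
    """Toy mapping function: add a whitespace-tokenized field."""
    x.tokens = x.text.split()
    return x


mapper = MemoizedMapperSketch(add_tokens, memoize=True)
point = SimpleNamespace(text="hello world")
first = mapper(point)
second = mapper(point)  # served from the cache
```

Here the original point is left without a tokens field, and the second call returns the cached object instead of recomputing. With a real SpaCy model, that cache hit is what avoids parsing the same text twice.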