snorkel.map.Mapper
class snorkel.map.Mapper(name, field_names=None, mapped_field_names=None, pre=None, memoize=False)
Bases: snorkel.map.core.BaseMapper
Base class for any data point to data point mapping in the pipeline.
Map data points to new data points by transforming, adding additional information, or decomposing into primitives. This module provides base classes for other operators like TransformationFunction and Preprocessor. We don’t expect people to construct Mapper objects directly.
A Mapper maps a data point to a new data point, possibly with a different schema. Subclasses of Mapper need to implement the run method, which takes fields of the data point as input and outputs new fields for the mapped data point as a dictionary. The run method should only be called internally by the Mapper object, not directly by a user.
Mapper derivatives work for data points that have mutable attributes, like SimpleNamespace, pd.Series, or dask.Series. An example of a data point type without mutable fields is pyspark.sql.Row. Use snorkel.map.spark.make_spark_mapper for PySpark compatibility.
For an example of a Mapper, see snorkel.preprocess.nlp.SpacyPreprocessor.
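The mutable-attribute requirement described above can be illustrated with the standard library alone. The sketch below is only an illustration: FrozenRow is a hypothetical namedtuple used as a stand-in for an immutable row type, not pyspark.sql.Row itself, and no Snorkel code is involved.

```python
from collections import namedtuple
from types import SimpleNamespace

# A SimpleNamespace accepts new attributes, so mapped fields can be attached.
mutable_point = SimpleNamespace(text="hello")
mutable_point.num_chars = len(mutable_point.text)

# Hypothetical stand-in for an immutable row type (an assumption for this
# illustration; it is not pyspark.sql.Row).
FrozenRow = namedtuple("FrozenRow", ["text"])
frozen_point = FrozenRow(text="hello")

try:
    # namedtuple instances have no per-instance __dict__, so this fails.
    frozen_point.num_chars = len(frozen_point.text)
    attached = True
except AttributeError:
    attached = False
```

A data point type that rejects new attributes in this way cannot receive mapped fields directly, which is presumably why a separate PySpark helper exists.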
- Parameters
name (str) – Name of mapper
field_names (Optional[Mapping[str, str]]) – A map from attribute names of the incoming data points to the input argument names of the run method. If None, the parameter names in the function signature are used.
mapped_field_names (Optional[Mapping[str, str]]) – A map from output keys of the run method to attribute names of the output data points. If None, the original output keys are used.
pre (Optional[List[BaseMapper]]) – Mappers to run before this mapper is executed
memoize (bool) – Memoize mapper outputs?
- Raises
NotImplementedError – Subclasses must implement the run method
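The two defaults documented for field_names and mapped_field_names can be sketched with a small stand-in. Everything below is hypothetical: apply_run and count_words are illustrative names, and the function is a simplified model of the documented behavior, not Snorkel's implementation. It covers only the field_names=None case, where inputs are looked up by the parameter names of run.

```python
import inspect
from copy import deepcopy
from types import SimpleNamespace


def apply_run(run, x, mapped_field_names=None):
    """Toy model of the documented field plumbing (not Snorkel code)."""
    # field_names=None case: read the attributes named by run's parameters.
    arg_names = list(inspect.signature(run).parameters)
    inputs = {name: getattr(x, name) for name in arg_names}
    outputs = run(**inputs)
    # mapped_field_names=None case: keep the original output keys.
    names = mapped_field_names or {key: key for key in outputs}
    x_mapped = deepcopy(x)  # leave the caller's data point untouched
    for key, value in outputs.items():
        setattr(x_mapped, names[key], value)
    return x_mapped


def count_words(text):
    """Hypothetical run function: one input field, one output field."""
    return {"num_words": len(text.split())}


x = SimpleNamespace(text="a short example sentence")
default = apply_run(count_words, x)
renamed = apply_run(count_words, x, mapped_field_names={"num_words": "n_tokens"})
```

With the defaults, the output lands on default.num_words. With the renaming map, the same value lands on renamed.n_tokens instead, and x itself is left unchanged in both cases.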
__init__(name, field_names=None, mapped_field_names=None, pre=None, memoize=False)
Initialize self. See help(type(self)) for accurate signature.
- Return type
None
Methods
__init__(name[, field_names, …]) – Initialize self.
Reset the memoization cache.
run(**kwargs) – Run the mapping operation using the input fields.
__call__(x)
Run mapping function on input data point.
Deep copies the data point first so as not to make accidental in-place changes. If memoize is set to True, an internal cache is checked for results. If no cached results are found, the computed results are added to the cache.
- Parameters
x (Any) – Data point to run mapping function on
- Returns
Mapped data point of same format but possibly different fields
- Return type
DataPoint
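The copy-then-cache behavior described for __call__ can be modeled roughly as follows. ToyMemoizedMapper is a hypothetical class written for this illustration. The cache key (a tuple of the data point's attributes) and the choice to return the cached object itself are assumptions, since the text above does not specify how the cache is keyed.

```python
from copy import deepcopy
from types import SimpleNamespace


class ToyMemoizedMapper:
    """Hypothetical stand-in that mimics the documented __call__ behavior."""

    def __init__(self, memoize=False):
        self.memoize = memoize
        self._cache = {}
        self.num_runs = 0  # counts how often run actually executes

    def run(self, text):
        self.num_runs += 1
        return {"text_lower": text.lower()}

    def __call__(self, x):
        # Assumption: the cache key is built from the data point's attributes.
        key = tuple(sorted(vars(x).items()))
        if self.memoize and key in self._cache:
            return self._cache[key]
        x_mapped = deepcopy(x)  # avoid accidental in-place changes to x
        for name, value in self.run(text=x_mapped.text).items():
            setattr(x_mapped, name, value)
        if self.memoize:
            self._cache[key] = x_mapped
        return x_mapped


mapper = ToyMemoizedMapper(memoize=True)
x = SimpleNamespace(text="Hello World")
first = mapper(x)
second = mapper(x)  # served from the cache, run is not called again
```

In this sketch the second call reuses the cached result, so run executes once, and the original x never gains the text_lower field.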
run(**kwargs)
Run the mapping operation using the input fields.
The inputs to this function are fed by extracting the fields of the input data point using the keys of field_names. The output field names are converted using mapped_field_names and added to the data point.
- Returns
A mapping from canonical output field names to their values.
- Return type
Optional[FieldMap]
- Raises
NotImplementedError – Subclasses must implement this method