snorkel.map.Mapper

class snorkel.map.Mapper(name, field_names=None, mapped_field_names=None, pre=None, memoize=False, memoize_key=None)[source]

Bases: snorkel.map.core.BaseMapper

Base class for any data point to data point mapping in the pipeline.

Map data points to new data points by transforming, adding additional information, or decomposing into primitives. This module provides base classes for other operators like TransformationFunction and Preprocessor. We don’t expect people to construct Mapper objects directly.

A Mapper maps a data point to a new data point, possibly with a different schema. Subclasses of Mapper need to implement the run method, which takes fields of the data point as input and outputs new fields for the mapped data point as a dictionary. The run method should only be called internally by the Mapper object, not directly by a user.

Mapper derivatives work for data points that have mutable attributes, like SimpleNamespace, pd.Series, or dask.Series. An example of a data point type without mutable fields is pyspark.sql.Row. Use snorkel.map.spark.make_spark_mapper for PySpark compatibility.

For an example of a Mapper, see snorkel.preprocess.nlp.SpacyPreprocessor.
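
As a rough sketch of the subclassing pattern (the class name SplitWordsMapper and the field names below are illustrative, not part of the library):

    from snorkel.map import Mapper

    class SplitWordsMapper(Mapper):
        """Illustrative Mapper that lower-cases and tokenizes a text field."""

        def __init__(self, name, lower_field, words_field, **kwargs):
            # field_names is left as None, so the parameter name of run()
            # ("text") is used to pull that field off the incoming data point.
            # mapped_field_names renames the run() output keys ("lower", "words")
            # to the attribute names requested by the caller.
            super().__init__(
                name,
                mapped_field_names=dict(lower=lower_field, words=words_field),
                **kwargs,  # forwards pre, memoize, memoize_key
            )

        def run(self, text):
            # Return a FieldMap: a dict of new fields for the mapped data point.
            return dict(lower=text.lower(), words=text.split())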

Parameters
  • name (str) – Name of mapper

  • field_names (Optional[Mapping[str, str]]) – A map from attribute names of the incoming data points to the input argument names of the run method. If None, the parameter names in the function signature are used.

  • mapped_field_names (Optional[Mapping[str, str]]) – A map from output keys of the run method to attribute names of the output data points. If None, the original output keys are used.

  • pre (Optional[List[BaseMapper]]) – Mappers to run before this mapper is executed

  • memoize (bool) – Memoize mapper outputs?

  • memoize_key (Optional[Callable[[Any], Hashable]]) – Hashing function used for memoization (defaults to snorkel.map.core.get_hashable)

Raises

NotImplementedError – Subclasses must implement the run method
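
The pre, memoize, and memoize_key options can be combined; below is a hedged sketch that reuses the SplitWordsMapper sketch from above (WordCountMapper and the custom cache key are hypothetical, not part of the library):

    from snorkel.map import Mapper

    class WordCountMapper(Mapper):
        """Illustrative Mapper that counts the words produced by SplitWordsMapper."""

        def __init__(self, name, **kwargs):
            # run() reads the "text_words" field added by the pre-mapper below.
            super().__init__(name, mapped_field_names=dict(n="num_words"), **kwargs)

        def run(self, text_words):
            return dict(n=len(text_words))

    split_words = SplitWordsMapper("split_words", "text_lower", "text_words")
    count_words = WordCountMapper(
        "count_words",
        pre=[split_words],             # run split_words before this mapper
        memoize=True,                  # cache mapped outputs
        memoize_key=lambda x: x.text,  # hypothetical custom cache key
    )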

field_names[source]

See above

mapped_field_names[source]

See above

memoize[source]

Memoize mapper outputs?

__init__(name, field_names=None, mapped_field_names=None, pre=None, memoize=False, memoize_key=None)[source]

Initialize self. See help(type(self)) for accurate signature.

Return type

None

Methods

__init__(name[, field_names, …])    Initialize self.
reset_cache()                       Reset the memoization cache.
run(**kwargs)                       Run the mapping operation using the input fields.

__call__(x)[source]

Run mapping function on input data point.

Deep copies the data point first so as not to make accidental in-place changes. If memoize is set to True, an internal cache is checked for results. If no cached results are found, the computed results are added to the cache.

Parameters

x (Any) – Data point to run mapping function on

Returns

Mapped data point of same format but possibly different fields

Return type

DataPoint
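
For example, calling the count_words sketch from above on a SimpleNamespace data point:

    from types import SimpleNamespace

    x = SimpleNamespace(text="Henry has fun")

    y = count_words(x)   # first call: runs split_words, then run(), and caches the result
    z = count_words(x)   # second call: answered from the memoization cache

    print(y.text_words)             # ['Henry', 'has', 'fun']
    print(y.num_words)              # 3
    print(hasattr(x, "num_words"))  # False: x was deep copied, not modified in place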

reset_cache()[source]

Reset the memoization cache.

Return type

None
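
For instance, to discard the cached outputs of the memoized count_words sketch above:

    count_words.reset_cache()  # subsequent calls recompute instead of reusing cached results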

run(**kwargs)[source]

Run the mapping operation using the input fields.

The inputs to this method are extracted from the input data point using the keys of field_names. The output field names are converted using mapped_field_names and added to the data point.

Returns

A mapping from canonical output field names to their values.

Return type

Optional[FieldMap]

Raises

NotImplementedError – Subclasses must implement this method