snorkel.labeling.apply.spark.SparkLFApplier

class snorkel.labeling.apply.spark.SparkLFApplier(lfs)[source]

Bases: snorkel.labeling.apply.core.BaseLFApplier

LF applier for a Spark RDD.

Data points are stored as Rows in an RDD, and a Spark map job is submitted to execute the LFs. A common way to obtain an RDD is from a PySpark DataFrame, through its rdd attribute. For example usage with AWS EMR instructions, see test/labeling/apply/lf_applier_spark_test_script.py.
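The workflow described above can be sketched as follows. This is a minimal, hedged example and not the referenced test script: the LFs (lf_contains_link, lf_shouting), the text field, the label values, and the helper label_with_spark are all hypothetical names chosen for illustration, and the sketch assumes pyspark and snorkel are installed with a local Spark master available.

```python
ABSTAIN = -1  # assumption: Snorkel's conventional "no vote" label
SPAM = 1      # hypothetical class label used only in this sketch


def lf_contains_link(x):
    """Vote SPAM if the (hypothetical) text field contains a URL."""
    return SPAM if "http" in x.text.lower() else ABSTAIN


def lf_shouting(x):
    """Vote SPAM if the text is written entirely in upper case."""
    return SPAM if x.text.isupper() else ABSTAIN


def label_with_spark(texts):
    """Build an RDD of Rows from a DataFrame and label it with the LFs above."""
    # Imports are local so the plain-Python LFs above stay importable
    # on machines without pyspark or snorkel.
    from pyspark.sql import Row, SparkSession
    from snorkel.labeling import LabelingFunction
    from snorkel.labeling.apply.spark import SparkLFApplier

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("spark-lf-applier-sketch")
        .getOrCreate()
    )
    try:
        # A DataFrame's .rdd attribute yields an RDD of Row objects.
        df = spark.createDataFrame([Row(text=t) for t in texts])
        lfs = [
            LabelingFunction(name="lf_contains_link", f=lf_contains_link),
            LabelingFunction(name="lf_shouting", f=lf_shouting),
        ]
        applier = SparkLFApplier(lfs)
        return applier.apply(df.rdd)
    finally:
        spark.stop()
```

Calling label_with_spark(["BUY NOW", "see http://example.com", "hello"]) should return the label matrix for those three data points. Keeping the LFs as plain functions makes them easy to unit test without a Spark session.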

__init__(lfs)[source]

Initialize the applier with the labeling functions (LFs) to apply. See help(type(self)) for accurate signature.

Return type

None

Methods

__init__(lfs)

Initialize self.

apply(data_points)

Label PySpark RDD of data points with LFs.

apply(data_points)[source]

Label PySpark RDD of data points with LFs.

Parameters

data_points (pyspark.RDD) – PySpark RDD containing data points to be labeled by LFs

Returns

Matrix of labels emitted by LFs, with one row per data point and one column per LF

Return type

np.ndarray
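As a rough illustration of how the returned matrix might be read, the sketch below computes per-LF coverage (the fraction of data points on which each LF did not abstain). It assumes the usual Snorkel layout of one row per data point and one column per LF, with -1 meaning abstain. The helper lf_coverage is hypothetical and not part of the library, and a nested list stands in for the np.ndarray so the sketch runs without Spark or NumPy.

```python
ABSTAIN = -1  # assumption: Snorkel's conventional "no vote" label


def lf_coverage(label_matrix):
    """Return, for each LF (column), the fraction of rows with a non-abstain label."""
    rows = [list(row) for row in label_matrix]
    if not rows:
        return []
    n_lfs = len(rows[0])
    return [
        sum(1 for row in rows if row[j] != ABSTAIN) / len(rows)
        for j in range(n_lfs)
    ]


# Stand-in for the output of apply(): 4 data points, 2 LFs.
L = [
    [1, -1],
    [1, -1],
    [-1, -1],
    [1, 1],
]
print(lf_coverage(L))  # [0.75, 0.25]
```

The same function accepts an np.ndarray, since it only iterates over rows and indexes columns.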