Preprocessed input data is represented in Snorkel as a hierarchy of Context subclass objects. For text, the current default hierarchy is: Corpus -> Document -> Sentence -> Span.

Core Data Models

class snorkel.models.context.Context(**kwargs)[source]

A piece of content from which Candidates are composed.

class snorkel.models.context.Document(**kwargs)[source]

A root Context.

class snorkel.models.context.Sentence(**kwargs)[source]

A sentence Context in a Document.

class snorkel.models.context.Span(**kwargs)[source]

A span of characters, identified by Context id and character-index start, end (inclusive).

char_offsets are relative to the Context start
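Because the end index of a Span is inclusive, it differs by one from Python's usual half-open slicing. The following is a minimal sketch in plain Python, not Snorkel code; `sentence_text` and `span_text` are hypothetical names used only for illustration.

```python
# Hypothetical stand-in for the text of a parent Context; not a Snorkel object.
sentence_text = "Snorkel labels data programmatically."


def span_text(text, char_start, char_end):
    """Return the characters from char_start to char_end, both inclusive."""
    # Python slices exclude the end index, so add 1 to honour the inclusive end.
    return text[char_start:char_end + 1]


first_word = span_text(sentence_text, 0, 6)
second_word = span_text(sentence_text, 8, 13)
```

Here the offsets are counted from the start of the parent text, mirroring the statement that char_offsets are relative to the Context start.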

class snorkel.models.context.TemporaryContext[source]

A context which does not incur the overhead of a proper ORM-based Context object. The TemporaryContext class is specifically for the candidate extraction process, during which a CandidateSpace object will generate many TemporaryContexts, which will then be filtered by Matchers prior to materialization of Candidates and constituent Context objects.

Every Context object has a corresponding TemporaryContext object from which it inherits.

A TemporaryContext must have specified equality / set membership semantics, a stable_id for checking uniqueness against the database, and a promote() method which returns a corresponding Context object.
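The three requirements above (equality / set membership semantics, a stable ID, and a promote() method) can be sketched with a toy class. This is an illustrative sketch only: `ToyTemporarySpan`, its ID layout, and the dict returned by `promote()` are assumptions, not the Snorkel implementation.

```python
class ToyTemporarySpan:
    """Hypothetical lightweight span; not the Snorkel class."""

    def __init__(self, doc_name, char_start, char_end):
        self.doc_name = doc_name
        self.char_start = char_start
        self.char_end = char_end

    def _key(self):
        # The fields that define identity for this toy object.
        return (self.doc_name, self.char_start, self.char_end)

    def __eq__(self, other):
        return isinstance(other, ToyTemporarySpan) and self._key() == other._key()

    def __hash__(self):
        # Consistent with __eq__, so duplicates collapse inside a set.
        return hash(self._key())

    def get_stable_id(self):
        # Assumed ID layout, chosen for illustration only.
        return "%s::span:%d:%d" % self._key()

    def promote(self):
        # A real implementation would return an ORM-backed Context;
        # a plain dict stands in for it here.
        return {"stable_id": self.get_stable_id()}


candidates = {
    ToyTemporarySpan("doc1", 0, 6),
    ToyTemporarySpan("doc1", 0, 6),   # duplicate, dropped by the set
    ToyTemporarySpan("doc1", 8, 13),
}
```

Defining `__eq__` and `__hash__` over the same key is what lets many temporary objects be deduplicated cheaply before any of them is promoted.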

class snorkel.models.context.TemporarySpan(sentence, char_start, char_end, meta=None)[source]

The TemporaryContext version of Span


Given a character-level index (offset), return the index of the word this char is in

get_attrib_span(a, sep=' ')[source]

Get the span of sentence attribute _a_ over the range defined by word_offset, n


Get the tokens of sentence attribute _a_ over the range defined by word_offset, n


Given a word-level index, return the character-level index (offset) of the word’s start
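The character-to-word and word-to-character conversions described for TemporarySpan can be sketched with a list of word start offsets. This is a plain-Python illustration under the assumption that the sentence stores one start offset per word; the names `words`, `char_offsets`, and both helper functions are hypothetical here.

```python
from bisect import bisect_right

words = ["Snorkel", "labels", "data"]
# Character offset at which each word starts in "Snorkel labels data".
char_offsets = [0, 8, 15]


def char_to_word_index(char_index):
    """Index of the word whose start offset is the last one at or before char_index."""
    return bisect_right(char_offsets, char_index) - 1


def word_to_char_index(word_index):
    """Character offset at which the given word starts."""
    return char_offsets[word_index]
```

In this sketch a character that falls in the whitespace after a word is attributed to that preceding word; whether the library makes the same choice is not stated in the text above.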

snorkel.models.context.construct_stable_id(parent_context, polymorphic_type, relative_char_offset_start, relative_char_offset_end)[source]

Construct a stable ID for a Context given its parent and its character offsets relative to the parent.

Split stable id, returning:
  • Document (root) stable ID
  • Context polymorphic type
  • Character offset start, end relative to document start

Returns tuple of four values.
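As a rough illustration of constructing and splitting a stable ID, the sketch below packs a document ID, a polymorphic type, and document-relative offsets into one string and unpacks it into four values. The separator layout and the function names `make_stable_id` and `parse_stable_id` are assumptions for illustration; the sketch also takes document-relative offsets directly rather than converting from parent-relative ones.

```python
def make_stable_id(doc_id, polymorphic_type, char_start, char_end):
    """Build an ID string; the '::' and ':' separators are an assumed layout."""
    return "%s::%s:%d:%d" % (doc_id, polymorphic_type, char_start, char_end)


def parse_stable_id(stable_id):
    """Split an ID built by make_stable_id back into a tuple of four values."""
    doc_id, rest = stable_id.split("::", 1)
    polymorphic_type, char_start, char_end = rest.split(":")
    return doc_id, polymorphic_type, int(char_start), int(char_end)


example_id = make_stable_id("doc1", "span", 8, 13)
```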

Core Objects for Preprocessing and Loading

class snorkel.parser.doc_preprocessors.CSVPathsPreprocessor(path, parser_factory=<class 'snorkel.parser.doc_preprocessors.TextDocPreprocessor'>, column=None, delim=', ', *args, **kwargs)[source]

This DocumentPreprocessor treats the input file as an index of paths to the actual documents; each line in the input file contains a path to a document.

Defaults and Customization:

  • The input file is treated as a simple text file with one path per line. If the input is a CSV file, the column and delim parameters may be used together to select the value that holds the path.
  • The referenced documents are treated as text documents and are therefore parsed with TextDocPreprocessor. If the referenced files are more complex, a more advanced parser may be used by passing the parser_factory parameter to the constructor.
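The path-selection behaviour described in the list above can be sketched with the standard csv module. This is not the Snorkel implementation: `read_paths` is a hypothetical helper, and treating `column` as an integer index is an assumption.

```python
import csv
import io


def read_paths(lines, column=None, delim=","):
    """Yield one document path per input line."""
    if column is None:
        # Plain text index: each non-empty line is a path.
        for line in lines:
            line = line.strip()
            if line:
                yield line
    else:
        # CSV index: pick the path out of the chosen column (assumed integer index).
        for row in csv.reader(lines, delimiter=delim):
            if row:
                yield row[column].strip()


plain_index = io.StringIO("a.txt\nb.txt\n")
csv_index = io.StringIO("1,docs/a.pdf\n2,docs/b.pdf\n")

plain_paths = list(read_paths(plain_index))
csv_paths = list(read_paths(csv_index, column=1))
```

Each resulting path would then be handed to whichever per-document parser is configured, which is the role parser_factory plays in the description above.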

class snorkel.parser.doc_preprocessors.DocPreprocessor(path, encoding='utf-8', max_docs=inf)[source]

Processes a file or directory of files into a set of Document objects.

  • encoding – file encoding to use, default='utf-8'
  • path – filesystem path to file or directory to parse
  • max_docs – the maximum number of Documents to produce, default=float('inf')

Parses a file or directory of files into a set of Document objects.

class snorkel.parser.doc_preprocessors.HTMLDocPreprocessor(path, encoding='utf-8', max_docs=inf)[source]

Simple parsing of raw HTML files, assuming one document per file

class snorkel.parser.doc_preprocessors.TSVDocPreprocessor(path, encoding='utf-8', max_docs=inf)[source]

Simple parsing of TSV file with one (doc_name <tab> doc_text) per line
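The one-document-per-line TSV layout, combined with the max_docs limit from the constructor signature, can be sketched as a small generator. `parse_tsv_lines` is a hypothetical helper written for illustration, not the library's parsing code.

```python
def parse_tsv_lines(lines, max_docs=float("inf")):
    """Yield (doc_name, doc_text) pairs, stopping after max_docs documents."""
    produced = 0
    for line in lines:
        if produced >= max_docs:
            break
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside the text are preserved.
        doc_name, doc_text = line.split("\t", 1)
        yield doc_name, doc_text
        produced += 1


sample_lines = [
    "d1\tFirst text.\n",
    "d2\tSecond text.\n",
    "d3\tThird text.\n",
]
first_two = list(parse_tsv_lines(sample_lines, max_docs=2))
```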

class snorkel.parser.doc_preprocessors.TextDocPreprocessor(path, encoding='utf-8', max_docs=inf)[source]

Simple parsing of raw text files, assuming one document per file

class snorkel.parser.doc_preprocessors.TikaPreprocessor(path, encoding='utf-8', max_docs=inf)[source]

This preprocessor uses the Apache Tika parser to retrieve text content from complex file types such as DOCX, HTML, and PDF.

For customization options, see the Tika documentation.


!find pdf_dir -name '*.pdf' > input.csv # list of files
from snorkel.parser import (
    TikaPreprocessor, CSVPathsPreprocessor, CorpusParser
)
CorpusParser().apply(
    CSVPathsPreprocessor('input.csv', parser_factory=TikaPreprocessor)
)

class snorkel.parser.doc_preprocessors.XMLMultiDocPreprocessor(path, doc='.//document', text='./text/text()', id='./id/text()', keep_xml_tree=False, *args, **kwargs)[source]

Parse an XML file _which contains multiple documents_ into a set of Document objects.

Use XPath queries to specify a _document_ object, and then for each document, a set of _text_ sections and an _id_.

Note: Set keep_xml_tree=True to include the full document XML etree in the attribs dict.
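The idea of selecting document nodes with one query and then reading an id and text from each can be sketched with the standard library. This is an approximation: `xml.etree.ElementTree` supports only a subset of XPath and has no `text()` step, so the sketch uses `./text` and `./id` with `findtext` instead of the default queries in the signature. `iter_xml_docs` and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

xml_source = """
<corpus>
  <document><id>d1</id><text>First document.</text></document>
  <document><id>d2</id><text>Second document.</text></document>
</corpus>
""".strip()


def iter_xml_docs(xml_string, doc=".//document", text="./text", id_path="./id"):
    """Yield (id, text) for every node matched by the doc query."""
    root = ET.fromstring(xml_string)
    for node in root.findall(doc):
        # findtext returns the text content of the first matching child.
        yield node.findtext(id_path), node.findtext(text)


docs = list(iter_xml_docs(xml_source))
```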

class snorkel.parser.parser.ParserConnection(parser)[source]

Default connection object assumes local parser object

class snorkel.parser.parser.URLParserConnection(parser, retries=5)[source]

URL parser connection

parse(document, text)[source]

Return a parse generator.
  • document
  • text

post(url, data, allow_redirects=True)[source]
  • url
  • data
  • allow_redirects
  • timeout