Preprocessed input data is represented in Snorkel as a hierarchy of Context subclass objects. For text, the current default hierarchy is: Corpus -> Document -> Sentence -> Span.

Core Data Models

class snorkel.models.context.Context(**kwargs)[source]

A piece of content from which Candidates are composed.

class snorkel.models.context.Corpus(**kwargs)[source]

A set of Documents, uniquely identified by a name.

Corpora have many-to-many relationships with Documents, so users can create new subsets, supersets, etc.


Given a parent context class, gets all the child context classes, and returns histograms of the number of children per parent.


Print summary / diagnostic stats about the corpus

class snorkel.models.context.Document(**kwargs)[source]

A root Context.

class snorkel.models.context.Sentence(**kwargs)[source]

A sentence Context in a Document.

class snorkel.models.context.Span(**kwargs)[source]

A span of characters, identified by Context id and character-index start, end (inclusive).

char_offsets are relative to the Context start
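Because the end index is inclusive, recovering a span's text from its parent with a Python slice needs an end bound of char_end + 1. The following is a minimal sketch of that convention only; span_text is a hypothetical helper, not part of Snorkel.

```python
def span_text(context_text, char_start, char_end):
    """Return the text covered by an inclusive [char_start, char_end] range.

    Hypothetical helper for illustration; offsets are relative to the
    start of the parent context's text.
    """
    # Python slices exclude the end bound, so extend one past char_end.
    return context_text[char_start:char_end + 1]


sentence = "Snorkel labels data."
first_word = span_text(sentence, 0, 6)
second_word = span_text(sentence, 8, 13)
```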

class snorkel.models.context.TemporaryContext[source]

A context which does not incur the overhead of a proper ORM-based Context object. The TemporaryContext class is specifically for the candidate extraction process, during which a CandidateSpace object will generate many TemporaryContexts, which will then be filtered by Matchers prior to materialization of Candidates and constituent Context objects.

Every Context object has a corresponding TemporaryContext object from which it inherits.

A TemporaryContext must have specified equality / set membership semantics, a stable_id for checking uniqueness against the database, and a promote() method which returns a corresponding Context object.
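The three requirements above can be illustrated with a small stand-alone class. This is a sketch under stated assumptions, not Snorkel's implementation: the class name SketchTemporarySpan, the stable ID format, and the dictionary returned by promote() are all hypothetical stand-ins (in Snorkel, promote() would return an ORM-backed Context).

```python
class SketchTemporarySpan:
    """Hypothetical lightweight context; not a Snorkel class."""

    def __init__(self, parent_id, char_start, char_end):
        self.parent_id = parent_id
        self.char_start = char_start
        self.char_end = char_end

    def _key(self):
        # The fields that define identity for equality and hashing.
        return (self.parent_id, self.char_start, self.char_end)

    def __eq__(self, other):
        return isinstance(other, SketchTemporarySpan) and self._key() == other._key()

    def __hash__(self):
        # Consistent with __eq__, so instances deduplicate in sets and dicts.
        return hash(self._key())

    @property
    def stable_id(self):
        # Assumed format, for illustration only.
        return "{}::span:{}:{}".format(*self._key())

    def promote(self):
        # Stand-in for building a persistent Context object.
        return {
            "stable_id": self.stable_id,
            "char_start": self.char_start,
            "char_end": self.char_end,
        }


# Duplicates generated during candidate extraction collapse in a set.
candidates = {
    SketchTemporarySpan("doc1", 0, 6),
    SketchTemporarySpan("doc1", 0, 6),
    SketchTemporarySpan("doc1", 8, 13),
}
```

Defining __eq__ and __hash__ over the same key is what gives the object well-defined set membership, so many temporary objects can be generated and deduplicated cheaply before any are promoted.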

class snorkel.models.context.TemporarySpan(parent, char_start, char_end, meta=None)[source]

The TemporaryContext version of Span


Given a character-level index (offset), return the index of the word this char is in

get_attrib_span(a, sep=' ')[source]

Get the span of sentence attribute _a_ over the range defined by word_offset, n


Get the tokens of sentence attribute _a_ over the range defined by word_offset, n


Given a word-level index, return the character-level index (offset) of the word’s start
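The two index conversions described above can be sketched with a sorted list of word start offsets. This is an illustration only, assuming a list char_offsets in which entry i is the starting character offset of word i; the function names are hypothetical and the code is not taken from Snorkel.

```python
from bisect import bisect_right


def char_to_word_index(char_offsets, char_index):
    """Return the index of the word containing char_index.

    char_offsets[i] is assumed to be the start offset of word i, in
    ascending order. A character between two words (e.g. whitespace)
    is attributed to the preceding word.
    """
    word_index = bisect_right(char_offsets, char_index) - 1
    if word_index < 0:
        raise ValueError("char_index precedes the first word")
    return word_index


def word_to_char_index(char_offsets, word_index):
    """Return the character offset at which word word_index starts."""
    return char_offsets[word_index]


# "Snorkel labels data." has words starting at characters 0, 8 and 15.
offsets = [0, 8, 15]
```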

snorkel.models.context.construct_stable_id(parent_context, polymorphic_type, relative_char_offset_start, relative_char_offset_end)[source]

Construct a stable ID for a Context given its parent and its character offsets relative to the parent.

Split stable id, returning:
  • Document (root) stable ID
  • Context polymorphic type
  • Character offset start, end relative to document start

Returns a tuple of four values (the character offset start and end are returned as separate values).
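The relationship between constructing and splitting a stable ID can be sketched as a round trip. The string format used here ("doc_id::type:start:end") is an assumption made for illustration, and the _sketch functions are hypothetical; unlike the documented function, the constructor below takes the parent's stable ID string rather than a parent Context object.

```python
def split_stable_id_sketch(stable_id):
    """Split an ID of the assumed form 'doc_id::type:start:end'.

    Returns (doc_id, context_type, char_start, char_end), with both
    offsets relative to the document start.
    """
    # rsplit keeps any '::' inside the document ID intact.
    doc_id, rest = stable_id.rsplit("::", 1)
    context_type, start, end = rest.split(":")
    return doc_id, context_type, int(start), int(end)


def construct_stable_id_sketch(parent_stable_id, polymorphic_type,
                               relative_char_offset_start,
                               relative_char_offset_end):
    """Build a child ID from parent-relative character offsets."""
    doc_id, _, parent_start, _ = split_stable_id_sketch(parent_stable_id)
    # Shift parent-relative offsets so they are relative to the document.
    start = parent_start + relative_char_offset_start
    end = parent_start + relative_char_offset_end
    return "{}::{}:{}:{}".format(doc_id, polymorphic_type, start, end)


sentence_id = "doc1::sentence:100:150"
span_id = construct_stable_id_sketch(sentence_id, "span", 5, 9)
```

Storing document-relative offsets in the ID means a Context can be identified without reference to its immediate parent, which is what makes the ID stable across re-parsing.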

Core Objects for Preprocessing and Loading

class snorkel.parser.CorpusParser(doc_parser, sent_parser, max_docs=None)[source]

Invokes a DocParser and runs the output through a SentenceParser to produce a Corpus.

class snorkel.parser.DocParser(path, encoding='utf-8')[source]

Parse a file or directory of files into a set of Document objects.

  • Input: A file or directory path.
  • Output: A set of Document objects, which at least have a _text_ attribute, and possibly a dictionary of other attributes.

class snorkel.parser.HTMLDocParser(path, encoding='utf-8')[source]

Simple parsing of raw HTML files, assuming one document per file

class snorkel.parser.TSVDocParser(path, encoding='utf-8')[source]

Simple parsing of TSV file with one (doc_name <tab> doc_text) per line
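The one-record-per-line layout can be sketched as follows. This is a minimal illustration of the file format only; parse_tsv_lines is a hypothetical function and does not reproduce TSVDocParser, which produces Document objects rather than tuples.

```python
def parse_tsv_lines(lines):
    """Yield (doc_name, doc_text) pairs from 'name<tab>text' lines.

    Hypothetical helper; blank lines are skipped as an assumption.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the first tab only, so tabs inside the text survive.
        doc_name, doc_text = line.split("\t", 1)
        yield doc_name, doc_text


rows = list(parse_tsv_lines([
    "doc1\tFirst document text.\n",
    "\n",
    "doc2\tSecond\tdocument text.\n",
]))
```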

class snorkel.parser.TextDocParser(path, encoding='utf-8')[source]

Simple parsing of raw text files, assuming one document per file

class snorkel.parser.XMLMultiDocParser(path, doc='.//document', text='./text/text()', id='./id/text()', keep_xml_tree=False)[source]

Parse an XML file _which contains multiple documents_ into a set of Document objects.

Use XPath queries to specify a _document_ object, and then for each document, a set of _text_ sections and an _id_.

Note: Include the full document XML etree in the attribs dict with keep_xml_tree=True
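The query-per-document pattern can be sketched with the standard library. This is not XMLMultiDocParser itself: the function name, the sample XML, and the "root" key in the attribs dict are hypothetical. The default queries shown in the signature above end in text(), which the standard library's ElementTree does not support, so this sketch uses element paths ('./text', './id') and reads each element's text directly.

```python
import xml.etree.ElementTree as ET


def parse_multi_doc_xml(xml_string, doc=".//document", text="./text",
                        id_path="./id", keep_xml_tree=False):
    """Yield (doc_id, doc_text, attribs) for each document element."""
    root = ET.fromstring(xml_string)
    for node in root.findall(doc):
        doc_id = node.findtext(id_path)
        # A document may contain several text sections; join them.
        doc_text = "\n".join(t.text or "" for t in node.findall(text))
        # Optionally keep the document's XML subtree alongside the text.
        attribs = {"root": node} if keep_xml_tree else {}
        yield doc_id, doc_text, attribs


SAMPLE = """<corpus>
  <document><id>d1</id><text>First doc.</text></document>
  <document><id>d2</id><text>Second doc.</text><text>More.</text></document>
</corpus>"""

docs = list(parse_multi_doc_xml(SAMPLE))
```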