Corpora API

The corpora module is very simple so far and consists of only one class.

class corpora.Corpus(path)

The Corpus class is responsible for creating a new corpus and also represents an existing corpus as an object.

exception ExceptionDuplicate

Exception raised when appending a document with a duplicate id.

exception Corpus.ExceptionTooBig

Exception raised when a document is too big to fit in a chunk file.

Corpus.add(text, ident, **headers)

Appends a new document to the corpus.

static Corpus.create(path, **properties)

Static method for creating a new corpus at the given path. Additional properties can be given as named arguments.

Corpus.get(ident)

Get a document from the corpus by its ident (random access).

Corpus.get_by_idx(idx)

Get the document pointed to by an idx structure, which holds offset information for a chunk file.

Corpus.get_chunk(number=None)

Getter for a chunk. The default chunk is current_chunk. The method caches opened chunk files.

Corpus.get_idx(index)

Return the tuple (chunk, offset, header len, text len) for the given index.

Corpus.get_ridx(key)

Return the index in idx for the given key.

Corpus.make_new_chunk()

Creates a new chunk with the next sequential chunk number.

Corpus.save_config()

Saves the properties of the corpus to the config file.

Corpus.save_indexes()

Saves all indexes to the appropriate files.

Corpus.test_chunk_size(new_size)

Tests whether data of size new_size will fit into the current chunk.

The only argument, path, should point to the directory containing the corpus.
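The index methods above (get_ridx, get_idx, get_by_idx) can be pictured with a small in-memory model. This is only a toy sketch: the names toy_add and toy_get, the single chunk, and the assumption that header bytes are stored directly before the text are all hypothetical, and the real on-disk layout of corpora may differ.

```python
# Toy model only; not the corpora implementation.
chunks = {0: b""}   # chunk number -> chunk bytes (stand-in for chunk files)
idx = []            # list of (chunk, offset, header_len, text_len) tuples
ridx = {}           # str(ident) -> position in idx


def toy_add(text, ident, header_bytes=b""):
    """Append a document to chunk 0 and record where it was stored."""
    body = text.encode("utf-8")
    offset = len(chunks[0])
    chunks[0] += header_bytes + body
    ridx[str(ident)] = len(idx)
    idx.append((0, offset, len(header_bytes), len(body)))


def toy_get(ident):
    """Resolve ident -> idx entry -> byte ranges inside the chunk."""
    chunk, offset, header_len, text_len = idx[ridx[str(ident)]]
    data = chunks[chunk]
    text_start = offset + header_len
    header = data[offset:text_start]
    text = data[text_start:text_start + text_len].decode("utf-8")
    return header, text


toy_add("This is some text", 2, b"id: 2\n")
toy_add("Other text", 1, b"id: 1\n")
header, text = toy_get(1)  # looks up ridx["1"], then the matching idx tuple
```

The point of the two-level lookup is that random access never scans a chunk: the key resolves to a position, and the position resolves to exact byte ranges.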

Creating a new corpus

To create a new corpus, use the static method:

static Corpus.create(path, **properties)

Static method for creating a new corpus at the given path. Additional properties can be given as named arguments.

Example:

Corpus.create('/tmp/test_corpus', name="My test corpus", chunk_size=1024*1024*10)

This will create an empty corpus named My test corpus, with a 10 MB chunk size, in the directory /tmp/test_corpus.

Appending a document to a corpus

Corpus.add(text, ident, **headers)

Appends a new document to the corpus.

text
the raw text of the document as a Unicode string; it is encoded using the encode property of the corpus before being saved to a chunk.
ident
a unique identifier of the document; this can be a number or any string (e.g. a hash value); it is needed for random access.

Warning

You should assume that ident will be converted to a string, so 1 and ‘1’ are the same ident and are not unique.

**headers
any additional headers, given as key-value pairs; values can be any objects serializable by YAML; the key “id” is reserved for storing the document ident.
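Because idents are compared after string conversion, it can help to check a batch of idents for collisions before adding them. The sketch below is not part of corpora; normalize_ident and find_colliding_idents are hypothetical helpers that simply assume the conversion behaves like str().

```python
def normalize_ident(ident):
    """Hypothetical helper: mimic the documented string conversion of idents."""
    return str(ident)


def find_colliding_idents(idents):
    """Return (first, later) pairs of idents that map to the same string."""
    seen = {}
    collisions = []
    for ident in idents:
        key = normalize_ident(ident)
        if key in seen:
            collisions.append((seen[key], ident))
        else:
            seen[key] = ident
    return collisions


# 1 and '1' collide once both are converted to strings.
collisions = find_colliding_idents([1, "1", 2, "a"])
```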

Example:

c = Corpus('/tmp/test_corpus')
c.add(u'This is some text', 1, fetched_from_url="http://some.test/place", related_documents=[1,2,3])
c.add(u'Other text', 2, is_interesting=False)

Note

Documents are saved in the order in which they are appended. This means that if you add 3 documents with ids 2, 1, 3, they will be served in that same order when accessed sequentially.

Warning

As you can see, you can add any header to a document. There is no pre-configuration of what can be set as a document header. This is very flexible, but at the same time it can lead to inconsistent headers across the documents in a collection. Make sure that you append the same headers to every document in the corpus, or write your code in a way that deals with the KeyError raised for missing headers.
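One way to tolerate missing headers is to read them with a default instead of indexing directly. The sketch below assumes that headers behave like a dict (which the KeyError mentioned above suggests) and works on any iterable of (headers, text) pairs; the function name interesting_texts is hypothetical.

```python
def interesting_texts(documents, header="is_interesting", default=False):
    """Yield texts whose header is truthy, treating a missing header as default."""
    for headers, text in documents:
        # .get() avoids the KeyError that headers[header] would raise
        # for documents that were added without this header.
        if headers.get(header, default):
            yield text


# Plain tuples stand in for corpus documents here.
sample = [
    ({"id": "1", "fetched_from_url": "http://some.test/place"}, "This is some text"),
    ({"id": "2", "is_interesting": False}, "Other text"),
    ({"id": "3", "is_interesting": True}, "Third text"),
]
selected = list(interesting_texts(sample))
```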

After adding new documents to a corpus, you need to sync the indexes to the filesystem.

Corpus.save_indexes()

Saves all indexes to the appropriate files.

c.save_indexes()

Sequential access to corpus

A typical use of a corpus is to access all documents sequentially, one by one. Corpora supports this with generators.

Corpus.__iter__()

Example:

c = Corpus('/tmp/test_corpus')
for (headers, text) in c:
    pass  # ... some processing

This reads the chunk files sequentially, which should be as fast as possible.
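Since iteration yields (headers, text) pairs, a single-pass computation can be written against any iterable of such pairs. The following is a minimal sketch; word_counts is a hypothetical helper, and the whitespace tokenization is only for illustration.

```python
from collections import Counter


def word_counts(documents):
    """Count lowercase, whitespace-separated words in one sequential pass."""
    counts = Counter()
    for headers, text in documents:
        counts.update(text.lower().split())
    return counts


# Plain tuples stand in for a corpus; a Corpus object could be passed instead.
sample = [
    ({"id": "1"}, "This is some text"),
    ({"id": "2"}, "Other text"),
]
counts = word_counts(sample)
```

Only one document is held in memory at a time, so the same function should scale to corpora that do not fit in RAM.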

Random access to corpus

It is also possible to access a given document by its id.

Corpus.__getitem__(key)

Interface for the get method.

Corpus.get(ident)

Get a document from the corpus by its ident (random access).

Examples:

c = Corpus('/tmp/test_corpus')
print c[1]
print c.get(1)

Both lines will print the same document tuple (if it exists).

Size of corpus

The standard Python len function is used.

Corpus.__len__()

Returns the size of the document collection.

Example:

c = Corpus('/tmp/test_corpus')
print len(c)

Exceptions

exception corpora.Corpus.ExceptionTooBig

Exception raised when a document is too big to fit in a chunk file.

exception corpora.Corpus.ExceptionDuplicate

Exception raised when appending a document with a duplicate id.
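A bulk import might want to skip duplicates rather than stop at the first one. The sketch below is one possible pattern, not part of the library: add_skipping_duplicates is a hypothetical helper that relies only on the documented add and save_indexes methods and on ExceptionDuplicate being reachable as an attribute of the Corpus class. ExceptionTooBig is deliberately left to propagate.

```python
def add_skipping_duplicates(corpus, documents):
    """Add (text, ident, headers) triples; return the idents skipped as duplicates."""
    skipped = []
    for text, ident, headers in documents:
        try:
            corpus.add(text, ident, **headers)
        except corpus.ExceptionDuplicate:
            # The ident is already present in the corpus; remember it and move on.
            skipped.append(ident)
    # Indexes must be synced to the filesystem after adding documents.
    corpus.save_indexes()
    return skipped
```

With a real corpus this could be called as add_skipping_duplicates(Corpus('/tmp/test_corpus'), docs), where docs is a list of (text, ident, headers) triples.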