corpora.hashdictionary – Construct word<->id mappings

This module implements the concept of HashDictionary – a mapping between words and their integer ids. The ids are computed as hash(word) % id_range, the idea being that new words can be represented immediately, without an extra pass through the corpus to collect all the ids first. See http://en.wikipedia.org/wiki/Hashing-Trick .

This means that, unlike plain Dictionary, several words may map to the same id (hash collisions). The word<->id mapping is no longer a bijection.
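
For example, with the default adler32 hash and an id_range of 32000, an id can be computed directly from the word itself, with no lookup table. A minimal sketch of the idea (HashDictionary additionally normalizes tokens to utf-8 bytes before hashing):

    >>> import zlib
    >>>
    >>> def word_to_id(word, id_range=32000):
    ...     # the id is a pure function of the word, so a previously unseen
    ...     # word gets an id immediately, without a pass over the corpus
    ...     return zlib.adler32(word.encode('utf-8')) % id_range
    ...
    >>> word_to_id('computer')   # some id in [0, 32000)
    >>> word_to_id('interface')  # a different word may occasionally collide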

class gensim.corpora.hashdictionary.HashDictionary(documents=None, id_range=32000, myhash=<built-in function adler32>, debug=True)

HashDictionary encapsulates the mapping between normalized words and their integer ids.

Unlike Dictionary, there is no need to build a HashDictionary before using it. Documents can be converted to bag-of-words immediately, with an uninitialized HashDictionary, without seeing the rest of the corpus first.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.

By default, keep track of debug statistics and mappings. If you find yourself running out of memory (or are sure you don’t need the debug info), set debug=False.
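
A minimal usage sketch (the example texts are illustrative):

    >>> from gensim.corpora import HashDictionary
    >>>
    >>> texts = [['human', 'interface', 'computer'],
    ...          ['survey', 'user', 'computer', 'system']]
    >>> dct = HashDictionary(texts)  # ids are assigned by hashing, on the fly
    >>> dct.doc2bow(['graph', 'minors'])  # even unseen words map to ids immediately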

add_documents(documents)

Build dictionary from a collection of documents. Each document is a list of tokens = tokenized and normalized utf-8 encoded strings.

This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.
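
For example (a sketch; document frequency statistics accumulate across calls):

    >>> dct = HashDictionary()  # no corpus needed up front
    >>> dct.add_documents([['cat', 'say', 'meow'], ['dog', 'say', 'woof']])
    >>> dct.add_documents([['cat', 'and', 'dog']])  # statistics keep updating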

doc2bow(document, allow_update=False, return_missing=False)

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string. No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update or self.allow_update is set, then also update dictionary in the process: update overall corpus statistics and document frequencies. For each id appearing in this document, increase its document frequency (self.dfs) by one.
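
For example (a sketch; the concrete ids depend on the hash function and id_range):

    >>> dct = HashDictionary()
    >>> bow = dct.doc2bow(['human', 'computer', 'human'], allow_update=True)
    >>> # bow is a list of (token_id, token_count) 2-tuples; 'human' has count 2
    >>> dct.dfs  # non-empty: document frequencies were updated along the way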

filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

Remove document frequency statistics for tokens that appear in

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of the total corpus size, not an absolute number).

After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n is None).

Note: since HashDictionary’s id range is fixed and doesn’t depend on the number of tokens seen, this doesn’t really “remove” anything. It only clears some supplementary statistics, for easier debugging and a smaller RAM footprint.
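
A sketch of typical usage (the parameter values are illustrative):

    >>> dct = HashDictionary([['human', 'computer'], ['computer', 'system']])
    >>> dct.filter_extremes(no_below=1, no_above=0.5, keep_n=100)
    >>> dct.doc2bow(['human'])  # still works: the hash-based mapping is unchanged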

keys()

Return a list of all token ids.

classmethod load(fname)

Load a previously saved object from file (also see save).

restricted_hash(token)

Calculate the id of the given token. Also keep track of which words were mapped to which ids, for debugging purposes.
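
For example (a sketch; the concrete id depends on the hash function and id_range):

    >>> dct = HashDictionary(id_range=32000)
    >>> token_id = dct.restricted_hash('human')  # deterministic, in [0, id_range)
    >>> dct.doc2bow(['human']) == [(token_id, 1)]  # doc2bow uses the same mapping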

save(fname)

Save the object to file via pickling (also see load).
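
A round-trip sketch (the file path is illustrative):

    >>> dct = HashDictionary([['human', 'computer']])
    >>> dct.save('/tmp/hash_dict.pkl')  # pickles the whole object
    >>> loaded = HashDictionary.load('/tmp/hash_dict.pkl')
    >>> loaded.doc2bow(['human', 'computer'])  # same hash-based mapping as before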

save_as_text(fname)

Save this HashDictionary to a text file, for easier debugging.

The format is: id[TAB]document frequency of this id[TAB]tab-separated set of words in UTF8 that map to this id[NEWLINE].

Note: use save/load to store in binary format instead (pickle).
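
For example (the path is illustrative; with debug=True, each line lists every word seen for that id):

    >>> dct = HashDictionary([['human', 'computer']])
    >>> dct.save_as_text('/tmp/hash_dict.txt')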