models.tfidfmodel – TF-IDF model

class gensim.models.tfidfmodel.TfidfModel(corpus=None, id2word=None, dictionary=None, wlocal=identity, wglobal=df2idf, normalize=True)

Objects of this class realize the transformation from a word-document co-occurrence matrix (integers) into a locally/globally weighted TF-IDF matrix (positive floats).

The main methods are:

  1. the constructor, which calculates inverse document counts for all terms in the training corpus.
  2. the [] method, which transforms a simple count representation into the TfIdf space.

>>> from gensim.models import TfidfModel
>>> tfidf = TfidfModel(corpus)  # corpus: an iterable of bag-of-words documents
>>> print(tfidf[some_doc])  # some_doc: a list of (term_id, count) pairs
>>> tfidf.save('/tmp/foo.tfidf_model')

Model persistence is achieved via its load/save methods.

The model computes tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and then normalizing the resulting documents to unit length. The unnormalized weight of term i in document j, in a corpus of D documents, is:

weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})

or, more generally:

weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)

so you can plug in your own custom wlocal and wglobal functions.
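The general formula can be sketched in plain Python, without gensim, to show how the local and global components combine. The helper names `log2_idf` and `tfidf_weights` and the toy numbers are illustrative assumptions, not part of the gensim API.

```python
import math

def log2_idf(docfreq, totaldocs):
    # Global component from the first formula: log_2(D / document_freq_i).
    return math.log(totaldocs / docfreq, 2)

def tfidf_weights(doc, dfs, totaldocs, wlocal=lambda tf: tf, wglobal=log2_idf):
    # doc: list of (term_id, frequency) pairs.
    # dfs: mapping term_id -> number of documents containing that term.
    return [(term_id, wlocal(freq) * wglobal(dfs[term_id], totaldocs))
            for term_id, freq in doc]

dfs = {0: 1, 1: 2, 2: 4}        # hypothetical document frequencies, D = 4
doc = [(0, 3), (1, 1), (2, 5)]  # hypothetical bag-of-words document

# Identity local weighting (the default described in the text).
weights = tfidf_weights(doc, dfs, totaldocs=4)

# Swapping in a custom local weighting, here the square root of the frequency.
sqrt_weights = tfidf_weights(doc, dfs, totaldocs=4, wlocal=math.sqrt)
```

Term 2 occurs in every document, so its global weight is log_2(4 / 4) = 0 and it drops out regardless of how often it appears in the document.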

Default for wlocal is identity (other options: math.sqrt, math.log1p, ...) and default for wglobal is log_2(total_docs / doc_freq), giving the formula above.

normalize dictates how the final transformed vectors will be normalized. normalize=True means set to unit length (default); False means don’t normalize. You can also set normalize to your own function that accepts and returns a sparse vector.
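Unit-length normalization of a sparse vector, and a custom function of the shape that normalize is described as accepting, might look like the following. This is a standalone sketch; `unit_length` and `l1_normalize` are hypothetical helpers, not gensim functions.

```python
import math

def unit_length(vec):
    # Scale a sparse vector of (term_id, weight) pairs to Euclidean length 1.
    norm = math.sqrt(sum(w * w for _, w in vec))
    if norm == 0.0:
        return list(vec)  # an all-zero vector is returned unchanged
    return [(term_id, w / norm) for term_id, w in vec]

def l1_normalize(vec):
    # Alternative scheme: scale so the absolute weights sum to 1.
    total = sum(abs(w) for _, w in vec)
    if total == 0.0:
        return list(vec)
    return [(term_id, w / total) for term_id, w in vec]

vec = [(0, 3.0), (1, 4.0)]
unit = unit_length(vec)   # Euclidean norm of vec is 5
l1 = l1_normalize(vec)    # absolute weights of vec sum to 7
```

Both functions accept and return a sparse vector, so either could in principle be passed as a custom normalize callable.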

If dictionary is specified, it must be a corpora.Dictionary object and it will be used to directly construct the inverse document frequency mapping (then corpus, if specified, is ignored).

initialize(corpus)

Compute inverse document weights, which will be used to modify term frequencies for documents.

classmethod load(fname)

Load a previously saved object from file (also see save).

save(fname)

Save the object to file via pickling (also see load).
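The save/load pattern can be illustrated with Python's standard pickle module. The sketch below uses a hypothetical stand-in class, `TinyModel`, and pickles only its attribute dictionary; it illustrates the round trip and is not gensim's actual implementation.

```python
import os
import pickle
import tempfile

class TinyModel:
    # Hypothetical stand-in for a trained model that holds an idf mapping.
    def __init__(self, idfs):
        self.idfs = idfs

    def save(self, fname):
        # Write the object's state to a file with pickle.
        with open(fname, "wb") as fout:
            pickle.dump(self.__dict__, fout, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname):
        # Rebuild an object from previously saved state.
        with open(fname, "rb") as fin:
            state = pickle.load(fin)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj

path = os.path.join(tempfile.mkdtemp(), "foo.tfidf_model")
TinyModel({0: 2.0, 1: 1.0}).save(path)
restored = TinyModel.load(path)
```

As with any pickle-based format, files should only be loaded from trusted sources.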

gensim.models.tfidfmodel.df2idf(docfreq, totaldocs, log_base=2.0, add=0.0)

Compute the default inverse document frequency for a term with document frequency docfreq, in a corpus of totaldocs documents:

idf = add + log(totaldocs / docfreq)

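A minimal sketch of this formula in plain Python, assuming the logarithm is taken in base log_base as the signature suggests. The name `df2idf_sketch` is illustrative and the code is not copied from gensim.

```python
import math

def df2idf_sketch(docfreq, totaldocs, log_base=2.0, add=0.0):
    # idf = add + log_{log_base}(totaldocs / docfreq)
    return add + math.log(totaldocs / docfreq, log_base)

idf_rare = df2idf_sketch(1, 8)               # term in 1 of 8 documents
idf_common = df2idf_sketch(8, 8)             # term in every document
idf_smoothed = df2idf_sketch(8, 8, add=1.0)  # add shifts every idf by a constant
```

With add=0.0, a term that occurs in every document gets an idf of zero; a positive add keeps such terms from vanishing entirely.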
gensim.models.tfidfmodel.precompute_idfs(wglobal, dfs, total_docs)

Precompute the inverse document frequency mapping for all terms.
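One plausible reading of this helper, sketched in plain Python: apply wglobal to every entry of a mapping from term id to document frequency. The names `precompute_idfs_sketch` and `log2_idf` and the sample data are assumptions for illustration; gensim's own implementation may differ in detail.

```python
import math

def log2_idf(docfreq, totaldocs):
    # Example global weighting: log_2(total_docs / doc_freq).
    return math.log(totaldocs / docfreq, 2)

def precompute_idfs_sketch(wglobal, dfs, total_docs):
    # dfs maps term_id -> document frequency;
    # the result maps term_id -> global (idf) weight.
    return {term_id: wglobal(df, total_docs) for term_id, df in dfs.items()}

dfs = {0: 1, 1: 2, 2: 4}  # hypothetical document frequencies
idfs = precompute_idfs_sketch(log2_idf, dfs, total_docs=4)
```

Computing the mapping once up front means that transforming a document only needs a dictionary lookup per term.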