tlda.TLDA

class tlda.TLDA(n_topic, alpha_0, n_iter_train, n_iter_test, learning_rate, pca_batch_size=10000, third_order_cumulant_batch=1000, gamma_shape=1.0, smoothing=1e-06, theta=1, ortho_loss_criterion=1000, n_eigenvec=None, random_seed=None)[source]

Class to learn the topic-word distribution from a corpus of documents.
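
Example (a minimal instantiation sketch; the hyperparameter values below are illustrative assumptions, not recommended settings):

import tlda

model = tlda.TLDA(
    n_topic=20,            # number of topics to learn (illustrative)
    alpha_0=0.01,          # topic prior concentration (illustrative value)
    n_iter_train=1000,     # illustrative iteration counts
    n_iter_test=10,
    learning_rate=0.01,
    random_seed=0,
)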

Attributes:
unwhitened_factors

Unwhitened learned factors of shape (n_topic, vocabulary_size)

Methods

fit(X[, order])

Compute the word-topic distribution for the entire dataset at once.

partial_fit(X_batch, batch_index[, save_folder])

Update the word-topic distribution using a batch of documents.

partial_fit_online(X_batch)

Update the word-topic distribution using a batch of documents in a fully online version.

transform([X, predict])

Transform the document-word matrix of a set of documents into a word-topic distribution, and also into a topic-document distribution when predict=True.

fit(X, order=None)[source]

Compute the word-topic distribution for the entire dataset at once. Assumes that the whole dataset and the tensors required to compute its word-topic distribution fit in memory.

Parameters:
X : tensor of shape (self.n_documents, self.vocab)

All documents used to fit the word-topic distribution.
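
Example (a sketch of fitting an in-memory corpus, continuing the instantiation example above; the random counts stand in for a real bag-of-words matrix, and passing a tensorly tensor is an assumption about the accepted input type):

import numpy as np
import tensorly as tl

n_documents, vocab = 5000, 2000
rng = np.random.default_rng(0)

# Stand-in document-word counts; in practice X comes from a vectorized corpus.
X = tl.tensor(rng.poisson(0.1, size=(n_documents, vocab)).astype(float))

model.fit(X)
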
partial_fit(X_batch, batch_index, save_folder=None)[source]

Update the word-topic distribution using a batch of documents. For a given batch, the first and second order cumulants need to be fit once, but the third order cumulant should be fit many times.

Parameters:
X_batch : tensor of shape (batch_size, self.vocab)

batch_index : int

Index of the current batch. This is used to decide whether to update the first and second order moments or only whiten the batch.

save_folder : str, default is None

Folder in which to store the whitened batches. If None, the whitened batches will be recomputed at each iteration instead of being cached.
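
Example (a sketch of batched training, continuing the examples above; the random batches and the exact batch_index convention across repeated passes are assumptions, not part of the documented API):

batch_size, n_batches = 1000, 10

batches = [
    tl.tensor(rng.poisson(0.1, size=(batch_size, vocab)).astype(float))
    for _ in range(n_batches)
]

for _ in range(5):  # repeated passes so the third order cumulant is fit many times
    for i, X_batch in enumerate(batches):
        # batch_index lets the model decide whether to update the first and
        # second moments for this batch or only whiten it; save_folder caches
        # the whitened batches between passes.
        model.partial_fit(X_batch, batch_index=i, save_folder="whitened_batches")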

partial_fit_online(X_batch)[source]

Update the word-topic distribution using a batch of documents in a fully online version. Meant for very large datasets, since we only do one gradient update for each batch in the third order cumulant calculation.

Parameters:
X_batch : tensor of shape (batch_size, self.vocab)
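
Example (a sketch of fully online training on the batches from the previous example; each batch receives a single gradient update for the third order cumulant):

for X_batch in batches:   # e.g., batches streamed from disk
    model.partial_fit_online(X_batch)
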
property unwhitened_factors

Unwhitened learned factors of shape (n_topic, vocabulary_size)

On the first call, this will compute and store the unwhitened factors. Subsequent calls will simply return the stored value.
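
Example (assuming the model above has been fit):

factors = model.unwhitened_factors   # computed and cached on first access
print(factors.shape)                 # (n_topic, vocabulary_size)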

transform(X=None, predict=False)[source]

Transform the document-word matrix of a set of documents into a word-topic distribution, and also into a topic-document distribution when predict=True.

Parameters:
X : tensor of shape (n_documents, self.vocab)

Set of documents for which to predict the topic distribution.

predict : bool

Whether to return both the topic-document distribution and the word-topic distribution, or just the word-topic distribution.
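
Example (a sketch of inference on held-out documents, continuing the examples above; the random counts stand in for real documents, and the exact structure of the return value when predict=True is not specified here):

X_new = tl.tensor(rng.poisson(0.1, size=(100, vocab)).astype(float))

# Word-topic distribution only.
word_topic = model.transform(X_new)

# Word-topic distribution together with the topic-document distribution.
result = model.transform(X_new, predict=True)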