src.models.User2Vec

class src.models.User2Vec(vector_size, min_count, epochs, **kwargs) [source]

Generates vectors for each user in the dataset.

__init__(vector_size, min_count, epochs, **kwargs) [source]
Parameters
documents iterable of TaggedDocument, optional

Input corpus. It can simply be a list of TaggedDocument elements, but for larger corpora, consider an iterable that streams the documents directly from disk or the network. If you don't supply documents (or corpus_file), the model is left uninitialized; use this if you plan to initialize it in some other way.

corpus_file str, optional

Path to a corpus file in LineSentence format. You may use this argument instead of documents to get a performance boost. Only one of the documents or corpus_file arguments needs to be passed (or neither, in which case the model is left uninitialized). Documents' tags are assigned automatically and are equal to the line number, as in TaggedLineDocument.
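As a minimal sketch (assuming User2Vec forwards these keyword arguments to the underlying gensim Doc2Vec model; the user tags and file name are hypothetical), the two ways of supplying a corpus look like this:

    from gensim.models.doc2vec import TaggedDocument
    from src.models import User2Vec

    # In-memory corpus: one TaggedDocument per user, tagged with the user id
    # (hypothetical data for illustration).
    docs = [
        TaggedDocument(words=["likes", "hiking", "and", "coffee"], tags=["user_1"]),
        TaggedDocument(words=["posts", "mostly", "about", "music"], tags=["user_2"]),
    ]
    model = User2Vec(vector_size=100, min_count=1, epochs=20, documents=docs)

    # Or stream a large corpus from disk in LineSentence format
    # (one document per line; tags then default to the line number):
    # model = User2Vec(vector_size=100, min_count=1, epochs=20,
    #                  corpus_file="users_corpus.txt")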

dm {1,0}, optional

Defines the training algorithm. If dm=1, 'distributed memory' (PV-DM) is used. Otherwise, distributed bag of words (PV-DBOW) is employed.

vector_size int, optional

Dimensionality of the feature vectors.

window int, optional

The maximum distance between the current and predicted word within a sentence.

alpha float, optional

The initial learning rate.

min_alpha float, optional

Learning rate will linearly drop to min_alpha as training progresses.

seed int, optional

Seed for the random number generator. Initial vectors for each word are seeded with a hash of the concatenation of word + str(seed). Note that for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization.
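For example (a minimal sketch, assuming seed and workers are passed through **kwargs to the underlying model):

    from src.models import User2Vec

    # Reproducible configuration: pin the RNG seed and train with a single
    # worker thread to remove ordering jitter. PYTHONHASHSEED must be set
    # before the interpreter starts, e.g.:
    #     PYTHONHASHSEED=42 python train.py
    model = User2Vec(vector_size=100, min_count=5, epochs=10, seed=42, workers=1)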

min_count int, optional

Ignores all words with total frequency lower than this.

max_vocab_size int, optional

Limits the RAM during vocabulary building; if there are more unique words than this, then prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit.

sample float, optional

The threshold for configuring which higher-frequency words are randomly downsampled; the useful range is (0, 1e-5).

workers int, optional

Use this many worker threads to train the model (faster training with multicore machines).

epochs int, optional

Number of iterations (epochs) over the corpus.

hs {1,0}, optional

If 1, hierarchical softmax will be used for model training. If set to 0, and negative is non-zero, negative sampling will be used.

negative int, optional

If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5 and 20). If set to 0, no negative sampling is used.

ns_exponent float, optional

The exponent used to shape the negative sampling distribution. A value of 1.0 samples exactly in proportion to the frequencies, 0.0 samples all words equally, while a negative value samples low-frequency words more than high-frequency words. The popular default value of 0.75 was chosen by the original Word2Vec paper. More recently, in https://arxiv.org/abs/1804.04212 , Caselles-Dupré, Lesaint, & Royo-Letelier suggest that other values may perform better for recommendation applications.
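A small worked example (not part of the API) of how the exponent reshapes the sampling distribution for three words with raw frequencies 100, 10 and 1:

    import numpy as np

    freqs = np.array([100.0, 10.0, 1.0])
    for exponent in (1.0, 0.75, 0.0, -0.5):
        weights = freqs ** exponent
        print(exponent, np.round(weights / weights.sum(), 3))
    #  1.0  -> [0.901 0.09  0.009]  proportional to raw frequency
    #  0.75 -> [0.827 0.147 0.026]  the Word2Vec default, slightly flattened
    #  0.0  -> [0.333 0.333 0.333]  all words equally likely
    # -0.5  -> [0.071 0.223 0.706]  rare words sampled more often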

dm_mean {1,0}, optional

If 0, use the sum of the context word vectors. If 1, use the mean. Only applies when dm is used in non-concatenative mode.

dm_concat {1,0}, optional

If 1, use concatenation of context vectors rather than sum/average. Note that concatenation results in a much larger model, as the input is no longer the size of one (sampled or arithmetically combined) word vector, but the size of the tag(s) and all words in the context strung together.

dm_tag_count int, optional

Expected constant number of document tags per document, when using dm_concat mode.

dbow_words {1,0}, optional

If set to 1, trains word vectors (in skip-gram fashion) simultaneously with DBOW doc-vector training; if 0, only trains doc-vectors (faster).
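For illustration (a sketch assuming these gensim-style keyword arguments are forwarded via **kwargs), the two training modes can be configured as follows:

    from src.models import User2Vec

    # PV-DM with averaged context vectors
    pv_dm = User2Vec(vector_size=100, min_count=5, epochs=10, dm=1, dm_mean=1)

    # PV-DBOW that also trains word vectors alongside the doc vectors
    pv_dbow = User2Vec(vector_size=100, min_count=5, epochs=10,
                       dm=0, dbow_words=1)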

trim_rule function, optional

Vocabulary trimming rule; specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count < min_count). Can be None (min_count will be used; see keep_vocab_item()), or a callable that accepts parameters (word, count, min_count) and returns either gensim.utils.RULE_DISCARD, gensim.utils.RULE_KEEP or gensim.utils.RULE_DEFAULT. The rule, if given, is only used to prune the vocabulary during the current method call and is not stored as part of the model. A sketch of such a rule follows the list below.

The input parameters are of the following types:
  • word (str) - the word we are examining

  • count (int) - the word’s frequency count in the corpus

  • min_count (int) - the minimum count threshold.
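A minimal sketch of such a rule (the keep_mentions helper is hypothetical; this assumes trim_rule is forwarded to the underlying vocabulary build):

    from gensim import utils
    from src.models import User2Vec

    def keep_mentions(word, count, min_count):
        # Always keep user-mention tokens; defer everything else to min_count.
        if word.startswith("@"):
            return utils.RULE_KEEP
        return utils.RULE_DEFAULT

    model = User2Vec(vector_size=100, min_count=5, epochs=10,
                     trim_rule=keep_mentions)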

callbacks list of gensim.models.callbacks.CallbackAny2Vec, optional

List of callbacks to be executed at specific stages during training.
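For example (a sketch assuming callbacks are forwarded to gensim's training loop; EpochLogger is a hypothetical callback):

    from gensim.models.callbacks import CallbackAny2Vec
    from src.models import User2Vec

    class EpochLogger(CallbackAny2Vec):
        """Report progress at the end of every training epoch."""
        def __init__(self):
            self.epoch = 0

        def on_epoch_end(self, model):
            self.epoch += 1
            print(f"finished epoch {self.epoch}")

    model = User2Vec(vector_size=100, min_count=5, epochs=10,
                     callbacks=[EpochLogger()])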