Word Embeddings
This section groups papers that discuss word embeddings in the NLP context.
Efficient Estimation of Word Representations in Vector Space (Skip-gram)
Paper Link: Efficient Estimation of Word Representations in Vector Space (Skip-gram)
Citation: Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Reading notes
- The model design uses hierarchical softmax to reduce the number of computations; the vocabulary is represented as a Huffman binary tree.
- The Huffman tree provides an unambiguous binary encoding for the vocabulary in a corpus. The length of the encoding for a word w depends on its frequency in the corpus: high-frequency words get shorter encodings.
- Previous models:
- Feed-Forward NN (Input,Projection,Hidden,Output)
- Recurrent NN (Input,Hidden,Output)
- Proposed Model:
- Continuous bag of words (CBOW): predicts the current word from its surrounding context words.
- Skip-Gram: predicts the surrounding context words from the current word.
- Both are trained in a parallel setting (DistBelief, a distributed computing framework)
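To make the Huffman-tree idea from the notes concrete, here is a minimal sketch of Huffman code construction over word frequencies. It is an illustration of the encoding property (frequent words get shorter codes), not the paper's actual hierarchical-softmax implementation; the example vocabulary and counts are made up.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman binary code for a word-frequency dict.

    More frequent words receive shorter codes, which is why a
    hierarchical softmax over a Huffman tree needs fewer
    node evaluations on average.
    """
    tiebreak = count()  # avoids comparing dicts when frequencies tie
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees; prefix their codes with 0/1.
        merged = {w: "0" + code for w, code in c1.items()}
        merged.update({w: "1" + code for w, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

codes = huffman_codes({"the": 1000, "cat": 50, "sat": 40, "zymurgy": 1})
```

Here `"the"` ends up with a one-bit code while the rare `"zymurgy"` gets the longest one.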
Summary
- In the paper Efficient Estimation of Word Representations in Vector Space, the authors propose two novel models for generating continuous word vector representations: continuous bag of words and skip-gram.
- The design goal of these models is to generate high-dimensional vectors that can be trained on a huge amount of data in a short time; that is, to minimize computational complexity while attaining high accuracy on benchmark tests.
- The authors point out the flaws that keep previous models from reaching these goals:
- Feed-Forward NN
- Recurrent NN
The dominant cost of these models comes from the hidden layer, which is responsible for the model's non-linearity. By dropping it, the proposed models are cheaper but less rich in the relations they can infer.
- CBOW: it works somewhat like an autoencoder. The goal is to optimize the weights of the projection and reprojection matrices so that the error in the word vector representation is minimized.
Input: some continuous word vector representation for each word in the vocabulary.
Output: projection matrix W (DxV) and reprojection matrix U (VxD).
- Skip-Gram: similar to the above, but this time the computation optimizes the reprojection of an input word onto the vector representations of its context words.
- Both models assume an initial word embedding for the vocabulary, one that can be constructed from frequency matrices, for example.
- The analogy task. The model is reported to perform well on this task.
Paris is to France as Berlin is to ???
The authors report that such analogies are encoded in the form of a simple summation of the trained vector representations.
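The vector-offset trick behind the analogy task can be sketched with toy vectors. The embedding values below are hypothetical, chosen only to make the arithmetic visible; real skip-gram vectors have hundreds of dimensions.

```python
import numpy as np

# Toy embeddings (hypothetical values, purely illustrative).
emb = {
    "paris":   np.array([1.0, 0.9, 0.1]),
    "france":  np.array([1.0, 0.1, 0.1]),
    "berlin":  np.array([0.1, 0.9, 1.0]),
    "germany": np.array([0.1, 0.1, 1.0]),
}

def analogy(a, b, c, emb):
    """Solve 'a is to b as c is to ?' via the vector offset b - a + c,
    returning the vocabulary word closest in cosine similarity."""
    target = emb[b] - emb[a] + emb[c]
    return max(
        (w for w in emb if w not in (a, b, c)),
        key=lambda w: np.dot(emb[w], target)
                      / (np.linalg.norm(emb[w]) * np.linalg.norm(target)),
    )

answer = analogy("paris", "france", "berlin", emb)  # 'germany'
```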
Distributed representations of words and phrases and their compositionality
Paper link: Distributed representations of words and phrases and their compositionality
Citation: Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems 26 (2013).
Reading notes
- Follow-up paper from the one above.
- Negative sampling algorithm
- word2vec open-source code
- skip-gram is trained on over 30 billion words and takes a fraction of the training time required by other models.
- Vector compositionality:
w(French) + w(Actress) = [Juliette Binoche, Vanessa Paradis]
w(Czech) + w(currency) = [koruna, check crown]
- The vector embeddings are trained to represent the context distribution of the word. How does this compare with GloVe?
- The analogy task: it is the task proposed by Mikolov to train the skip-gram model. It could be categorized in semantic analogy (Germany:Berlin::France:?) or syntactic analogy (quick:quickly::slow:?). The analogy task can be solved by finding the closest vector to w(Berlin) - w(Germany) + w(France).
- Training the Skip-gram model does not involve dense matrix multiplications. This is a great advantage: a single machine can train on 100 billion words in a single day.
- Phrase vectors. Word representations are limited by their inability to represent idiomatic phrases whose meaning is not a composition of the individual words, e.g. "Boston Globe" (the name of a newspaper, not a combination of the meanings of "Boston" and "Globe"). To capture these, train the model with such phrases identified as single tokens.
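The paper's phrase-detection heuristic scores adjacent word pairs as score(wi, wj) = (count(wi wj) − δ) / (count(wi) · count(wj)), where δ discounts very rare pairs; pairs scoring above a threshold become single tokens. A small sketch of that scoring (the default δ and threshold values here are illustrative choices, not the paper's tuned ones):

```python
from collections import Counter

def find_phrases(tokens, delta=5, threshold=1e-4):
    """Score adjacent word pairs; pairs above the threshold would be
    merged into single tokens like 'boston_globe' before training.

    score(wi, wj) = (count(wi wj) - delta) / (count(wi) * count(wj))
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {
        pair: (n - delta) / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }
    return {pair: s for pair, s in scores.items() if s > threshold}

phrases = find_phrases((["the", "boston", "globe", "said"] * 10)
                       + ["the", "cat"] * 10)
```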
Summary
- This paper is a follow-up to previous work by the authors in which the Skip-Gram NNLM for generating word vector representations was proposed. It presents additional techniques that decrease the Skip-Gram training time while increasing its accuracy.
- Subsampling of frequent words: the idea is that frequent words such as "the" contribute much less to the generation of unique vector representations than rarer words such as "window". Therefore, frequent words can be partially ignored during training.
- Alternative method to hierarchical softmax: negative sampling.
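Both techniques from this summary can be sketched briefly. Subsampling keeps a word with probability min(1, sqrt(t / f(w))) for frequency f(w) and threshold t ≈ 1e-5; negative sampling draws "noise" words from the unigram distribution raised to the 3/4 power. The code below is a simplified illustration, not the word2vec implementation:

```python
import math
import random

def keep_probability(word_count, total_count, t=1e-5):
    """Probability of *keeping* a word occurrence during subsampling.

    Frequent words like 'the' are mostly discarded; rare words are
    always kept (Mikolov et al., 2013).
    """
    f = word_count / total_count
    return min(1.0, math.sqrt(t / f))

def negative_sampler(counts, power=0.75):
    """Return a function drawing k 'negative' words from the unigram
    distribution raised to the 3/4 power, as in the paper."""
    words = list(counts)
    weights = [counts[w] ** power for w in words]
    return lambda k: random.choices(words, weights=weights, k=k)
```

A word making up 10% of the corpus is kept only about 1% of the time, while a hapax is always kept.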
GloVe: Global Vectors for Word Representation
Paper link: Global Vectors for Word Representation
Citation: Pennington, Jeffrey, Richard Socher, and Christopher D. Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
The name Global Vectors highlights the property that global statistics of the corpus are captured directly by the model. In particular, the vector components are defined to reflect the co-occurrence probabilities between pairs of words and contexts.
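The co-occurrence idea leads to GloVe's weighted least-squares objective, J = Σᵢⱼ f(Xᵢⱼ)(wᵢ·w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)², with weighting f(x) = (x/x_max)^α for x < x_max and 1 otherwise. A minimal numpy sketch of evaluating that loss over a co-occurrence matrix (parameter shapes and names are my own, not from the reference code):

```python
import numpy as np

def glove_loss(W, W_ctx, b, b_ctx, X, x_max=100.0, alpha=0.75):
    """Weighted least-squares GloVe objective over co-occurrence counts X.

    J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
    Only nonzero co-occurrence counts contribute to the sum.
    """
    i_idx, j_idx = np.nonzero(X)
    x = X[i_idx, j_idx]
    f = np.where(x < x_max, (x / x_max) ** alpha, 1.0)  # weighting function
    diff = ((W[i_idx] * W_ctx[j_idx]).sum(axis=1)
            + b[i_idx] + b_ctx[j_idx] - np.log(x))
    return float((f * diff ** 2).sum())
```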
Scalability regarding corpus size
The fact that this basic SVD model does not scale well to large corpora lends further evidence to the necessity of the type of weighting scheme proposed in our model.
Training/Optimization algorithm
The authors use AdaGrad (Duchi et al., 2011).
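For reference, the AdaGrad update divides the learning rate per parameter by the root of the accumulated squared gradients, so frequently updated parameters take smaller steps. A minimal sketch (hyperparameter values are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update: per-parameter learning rates shrink as the
    accumulated squared gradient grows (Duchi et al., 2011)."""
    accum += grad ** 2
    theta -= lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Usage: one step minimizing f(x) = x^2, whose gradient is 2x.
theta = np.array([1.0])
accum = np.zeros(1)
theta, accum = adagrad_step(theta, 2 * theta, accum)
```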
Tasks in which GloVe beat state-of-the-art models of the time
- Word analogy
- Word similarity
- Named entity recognition
Comparison against word2vec
The word2vec framework has two models, CBOW and skip-gram, and GloVe outperforms both.
For the same corpus, vocabulary, window size, and training time, GloVe consistently outperforms word2vec. It achieves better results faster, and also obtains the best results irrespective of speed.
Torch implementation
An implementation can be found in the Torch library.
- Paper: Mittens: an Extension of GloVe for Learning Domain-Specialized Representations.
Learned in Translation: Contextualized Word Vectors
Paper link: Learned in Translation: Contextualized Word Vectors
Citation: McCann, Bryan, et al. "Learned in translation: Contextualized word vectors." Advances in neural information processing systems 30 (2017).
The idea is to use an intermediate stage of a machine translation model (the encoder) to transfer learning to other tasks.
The machine translation model used is an attentional encoder-decoder. To train it, sequences of GloVe vectors in the source and target languages are given, and the encoder-decoder's task is to encode the source words and decode them into the target language. The output of the encoder is then used as input to other NLP tasks (transfer learning).
The intuition behind is that the machine translation task needs to take context into account in order to produce results of good quality. The hypothesis here is that the context information is encoded in the encoder output.
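Downstream tasks in the paper consume the concatenation of a word's GloVe vector with the encoder's contextual output, [GloVe(w); CoVe(w)]. A shape-level sketch of that wiring, where `encoder` is a stand-in function (a random linear map here) rather than the trained biLSTM, and the dimensions (300-d GloVe, 600-d contextual states) are assumptions:

```python
import numpy as np

def cove_features(glove_vecs, encoder):
    """Concatenate pretrained GloVe vectors with contextual encoder
    outputs, yielding the [GloVe(w); CoVe(w)] features a downstream
    task would consume.

    `encoder` maps a (seq_len, d) array of GloVe vectors to a
    (seq_len, h) array of contextual states.
    """
    contextual = encoder(glove_vecs)                       # (seq_len, h)
    return np.concatenate([glove_vecs, contextual], axis=1)

# Hypothetical stand-in encoder: a fixed random linear map.
rng = np.random.default_rng(0)
proj = rng.standard_normal((300, 600))
feats = cove_features(rng.standard_normal((5, 300)), lambda x: x @ proj)
```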
- This work appeared around the same time as BERT.
- Transfer Learning:
- Machine Translation Datasets:
- Multi30k (Flickr captions descriptions)
- Cettolo et al. 2015 (TED videos transcriptions)
- WMT 2017 (news and commentary corpus; EU parliament proceedings)