Lecture 4
Retrofitting
The approach we have been studying consists in computing a distributional matrix of words and then generating a vector space from this distribution; that space allows us to solve tasks such as word similarity by computing the distance between vectors.
While this approach is very convenient because such distributional matrices are relatively easy to compile, there may be complex semantic distinctions between words that are not captured by them. An alternative type of resource, such as a graph, can encode these complex relationships. Examples of such rich representations are:
- WordNet
- FrameNet
- The Paraphrase Database (PPDB)
The idea of retrofitting is to start from a word vector space and then modify it using the information contained in these rich relational graphs of lexicons.
To do so, a minimization problem is solved in which the objective asks that the elements of the new vector space stay close to their original counterparts, while vectors that are related according to the lexicon's relational graph are pulled closer to each other.
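This objective can be sketched concretely. Below is a minimal retrofitting loop in the spirit of Faruqui et al. (2015), assuming uniform weights `alpha` (attachment to the original vector) and `beta` (attraction between graph neighbours); the notebook's actual implementation may differ.

```python
import numpy as np

def retrofit(Q_hat, edges, alpha=1.0, beta=1.0, n_iters=10):
    """Retrofitting sketch (after Faruqui et al., 2015).

    Q_hat : dict mapping word -> original vector (np.ndarray)
    edges : dict mapping word -> list of neighbour words in the lexicon graph
    alpha : weight tying each word to its original vector
    beta  : weight pulling connected words towards each other
    """
    Q = {w: v.copy() for w, v in Q_hat.items()}
    for _ in range(n_iters):
        for w, neighbours in edges.items():
            nbrs = [u for u in neighbours if u in Q]
            if not nbrs:
                continue
            # closed-form update of the quadratic objective:
            # q_w = (alpha * q_hat_w + beta * sum_u q_u) / (alpha + beta * |nbrs|)
            num = alpha * Q_hat[w] + beta * sum(Q[u] for u in nbrs)
            Q[w] = num / (alpha + beta * len(nbrs))
    return Q
```

With `alpha > 0`, each vector settles between its original position and its neighbours, which is exactly the trade-off the minimization describes.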
Here are some extra references:
- Mrkšić et al. (2016)
- Lengerich et al. (2018)
Static Representation from Contextual Models
In a static representation, a corpus is analysed and one vector is derived for each word. In a contextual model, we instead have word-token vectors; in other words, the vector for a word is context-dependent. In a contextual model, the vectors for the word good in the two sentences below will be different.
I had good grades this year.
This restaurant serves very good food.
Professor Potts presents the work in a paper by Bommasani et al. (2020) in which contextual vector representations generated by models such as BERT are transformed into static representations.
Daniel's note: I understand why someone would want a static vector representation; but deriving it from a contextual representation seems to defeat the very idea of having contextual models in the first place. Contextual representations are supposed to capture the multiple meanings that some words have, which a static representation struggles to do. Anyway, I should read the work of Bommasani in more depth. At a quick glance, it seems that the resulting static representations perform better than word2vec and GloVe.
To understand what is being done here, I need to read the following papers:
- Attention is all you need (Vaswani et al)
- The Annotated Transformer guide: http://nlp.seas.harvard.edu/2018/04/03/attention.html
- BERT paper: https://aclanthology.org/N19-1423.pdf
- Bommasani paper: https://aclanthology.org/2020.acl-main.431.pdf
Material
Notebook: Retrofitting
It is impressive how much information we can extract and infer from count matrices. But not every relationship can be extracted this way; that is where richer representations, such as the one given by WordNet, come in handy.
Model parameters
We played with the retrofitting model's parameters to get an intuition for its behaviour by analysing extreme cases, such as \(\alpha=0\), in which the model collapses to a single point in the fully connected example. Note that the model expects a directed graph (digraph) as input.
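As a toy check of that \(\alpha=0\) extreme case (a sketch with synchronous mean updates, not the notebook's exact implementation): with no term attracting a vector back to its original position, each vector on a fully connected graph is repeatedly replaced by the mean of its neighbours, so everything collapses to a single point.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 3))  # 5 random "word vectors" in 3 dimensions

for _ in range(50):
    # alpha = 0: nothing pulls a vector back to its original position,
    # so each vector becomes the mean of the other 4 (its neighbours
    # in a fully connected graph)
    Q = (Q.sum(axis=0, keepdims=True) - Q) / (Q.shape[0] - 1)

spread = np.ptp(Q, axis=0).max()  # ~0: all vectors have collapsed
```

Each update shrinks every vector's deviation from the common mean by a factor of 1/(n-1), so the collapse is geometric.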
WordNet
- There is a reference for WordNet schemas in languages other than English.
- There is a reference to a tutorial on NLTK, a Python package that handles WordNet data.
A lemma is defined by a unique pair (word, meaning). The word crane,
for example, has 6 lemmas in WordNet:
- United States writer (1871-1900)
- United States poet (1899-1932)
- a small constellation in the southern hemisphere near Phoenix
- lifts and moves heavy objects; lifting tackle is suspended from a pivoted boom that rotates around a vertical axis
- large long-necked wading bird of marshes and plains in many parts of the world
- stretch (the neck) so as to see better
Each of these meanings is part of a synset. A synset is a set of lemmas that share a meaning, and WordNet records relations between synsets. For example, considering the 4th lemma for crane
(lemma,lifts and moves heavy objects; lifting tackle is suspended from a pivoted boom that rotates around a vertical axis)
its synset in WordNet displays the relations
- direct hyponym
- direct hypernym
In the rest of the notebook, we executed retrofitting on a GloVe matrix with a WordNet graph.
The notebook also makes some connections with graph embeddings. There is a list of papers at the end; in particular, the Lengerich 2017 paper is said to open a number of new opportunities regarding retrofitting.
Notebook: Static Representation From Contextual Models
In order to better understand the terminology and references, I quickly read some sections of the material at http://ufldl.stanford.edu/tutorial/unsupervised/Autoencoders/, namely:
- Softmax regression
- Multi-Layer Neural Networks
- Feature extraction using convolution
- Autoencoders
I believe I have sufficient vocabulary to read the 4 articles listed in the section Static Representation from Contextual Models above.
The idea here is to recover static representations from contextual ones. The examples in the notebook use the contextual vectors generated by the BERT model. As the name suggests, the vector representation of a word varies according to the context given to the model.
BERT model
It accepts a sequence of words and outputs a sequence of vectors, one for each token.
Each row on the left side represents a context. In the work of Bommasani, two approaches are considered: the decontextualized one and the aggregated one.
In the decontextualized approach, each word is given a very unnatural context... itself. That is, the full context consists of a single word. Note that a pooling function may still be applied: BERT (like plenty of other models) uses subword representations, which means a single word may be split into several tokens. In this approach, pooling is applied to the set of vector representations of the word's tokens.
In the aggregated approach, several contexts (full sentences) are selected for each word. The word, of course, must be present in each context (possibly more than once). Once again, the vector representation of each token is computed, and multi-token words are pooled (usually with the mean function). Then a context-wise pooling is done at the end: the pooled representations of the word of interest across all the given contexts are aggregated by another pooling function (again, usually the mean).
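The two strategies can be sketched with placeholder vectors standing in for real BERT outputs (everything here, including the `encode` stand-in and the example contexts, is made up for illustration):

```python
import zlib
import numpy as np

DIM = 4

def encode(tokens):
    """Stand-in for a contextual encoder such as BERT: one vector per
    token position. Real contextual vectors depend on the whole input;
    we fake that by seeding the generator on the full token sequence."""
    rng = np.random.default_rng(zlib.crc32(" ".join(tokens).encode()))
    return [rng.normal(size=DIM) for _ in tokens]

def mean_pool(vectors):
    return np.mean(np.stack(vectors), axis=0)

# Decontextualized: the word alone is the "context". If the word were
# split into several subword tokens, mean_pool would merge them.
w_decontextualized = mean_pool(encode(["good"]))

# Aggregated: pool the word's (subword-pooled) vectors across contexts.
contexts = [
    ["i", "had", "good", "grades", "this", "year"],
    ["this", "restaurant", "serves", "very", "good", "food"],
]
per_context = []
for ctx in contexts:
    vecs = encode(ctx)
    occurrences = [v for tok, v in zip(ctx, vecs) if tok == "good"]
    per_context.append(mean_pool(occurrences))  # pool within the context
w_aggregated = mean_pool(per_context)           # then pool across contexts
```

Note that `per_context[0]` and `per_context[1]` differ: the same word gets different vectors in different contexts, which is exactly the contextual variation the final aggregation averages away.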
This is explained much better on page 2, column 2 of the Bommasani paper.