
Lecture 2

Date: 30 March 2022

Video Lecture

High-level goals and guiding hypothesis

  • Example of a classification problem: given a list of words, classify each word as positive or negative.
  • A point was made about distributional analysis of words. For the problem above, a pretty good model can be devised by only checking how many times the word of interest appears together with the word 'excellent', compared with how many times it appears with the word 'terrible'.
  • This was already stated by linguists in the past. Some of them said that you can tell the meaning of a word by looking at its companions; it also seems true that words with similar distributions are similar in meaning.
  • Big picture of the modelling process: you make several design choices according to your problem, then likely apply a normalization and a dimensionality-reduction step before starting your comparisons. For the comparisons themselves, you have several metrics to choose from.

Matrix designs

  • Word x Word design: number of times the row word co-appeared with the column word. It is a dense matrix with fixed dimensions, once you decide the vocabulary size.
  • Word x Document design: number of times the row word appeared in the column document. It is a sparse matrix.
  • The two matrices above are the most common in the literature, but there is no reason to be constrained to them. In fact, you should do whatever your problem requires.
  • Word x Discourse context: different from the others, this one needs someone to annotate the documents and classify them with a fixed set of labels. In a sense, it is a grouped version of the Word x Document design.
  • Examples of other designs:
    • Adjective x Modified noun
    • Word x Syntactic context
    • Word x Search query
    • Person x Product
    • Word x Person
    • Word x Word x Pattern
    • Verb x Subject x Object
  • Window and scaling: we need a definition of co-occurrence. We can generalize this process by dividing it into two steps:
    • Window: how distant from the word of interest can a word be and still count as a co-occurrence?
    • Scaling: given the distance of a word from the word of interest, what should the value of its co-occurrence be? Should closer words have higher weight?
  • Examples of scaling: flat and 1/n.
  • Larger windows tend to gather more semantic information; smaller windows tend to gather more syntactic information.
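The window-and-scaling step above can be sketched as a small counting routine; `cooccurrence_counts` and its weighting options are illustrative names, not from the lecture:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=4, weighting="1/n"):
    """Accumulate weighted co-occurrence counts within a symmetric window.

    weighting="flat" gives every in-window neighbor weight 1;
    weighting="1/n" gives a neighbor at distance n weight 1/n.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            dist = abs(j - i)
            weight = 1.0 if weighting == "flat" else 1.0 / dist
            counts[(word, tokens[j])] += weight
    return counts

toks = "the movie was excellent truly excellent".split()
counts = cooccurrence_counts(toks, window=2, weighting="1/n")
```

With a larger window and flat scaling, distant (more semantic) neighbors count as much as adjacent (more syntactic) ones.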

Vector Comparison

  • A word about distances: Euclidean, L2-normalization, Cosine, Jaccard, Dice, KL (for probabilities).
  • Only the Euclidean distance (with or without L2-normalization) respects the triangle inequality. One can define an alternative version of the cosine distance that also respects the triangle inequality.
  • The cosine distance between A and B roughly computes the angle between the lines OA and OB, where O is the origin.
  • The distance to be used really depends on what you want to measure.
    • Euclidean favors raw frequency. A high value in a single component may contribute too much to the distance value.
    • L2-normalization: the contribution of each component is normalized, so the effect is more controlled.
    • Cosine: we are really measuring similarity here (look up the Pearson coefficient and related concepts in statistics).
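A small numeric sketch of the three options above (illustrative code, not from the lecture); note how two vectors pointing in the same direction are far apart in raw Euclidean distance but essentially identical after L2-normalization:

```python
import numpy as np

def euclidean(u, v):
    return float(np.linalg.norm(u - v))

def length_normalize(u):
    return u / np.linalg.norm(u)

def cosine_distance(u, v):
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([34.0, 66.0])  # high raw frequencies
v = np.array([17.0, 33.0])  # same direction, half the counts

raw = euclidean(u, v)                                             # large
normalized = euclidean(length_normalize(u), length_normalize(v))  # ~0
cosine = cosine_distance(u, v)                                    # ~0
```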

Basic Reweighting

  • The goal is to deemphasize the mundane and amplify the important, the trustworthy, the unusual.
  • Simple rescaling is not what we want.
  • Ask the question: what distribution does this weighting give me?
  • PMI (Pointwise Mutual Information): if the observed value is greater than expected, the PMI is positive; if smaller than expected, the PMI is negative. The PMI distribution looks like a normal distribution (a distance from the mean).
\[ \begin{align*} pmi(X,i,j) &:= \log\left(\frac{X_{ij}}{expected(X,i,j)}\right) \\[1em] expected(X,i,j) &:= \frac{ rowsum(X,i) \times colsum(X,j) }{ sum(X) } \end{align*} \]
  • \(rowsum(X,i)\) and \(colsum(X,j)\) count all the appearances of words \(w_i\) and \(w_j\). Thinking in probabilistic terms, the PMI computation answers the question: supposing that words \(w_i\) and \(w_j\) are independent, how uncommon is it to find the two of them in the same context?
\[ \begin{align*} P(X_{i,*}) &= \frac{rowsum(X,i)}{sum(X)} \\[1em] P(X_{*,j}) &= \frac{colsum(X,j)}{sum(X)} \\[1em] expected(X,i,j) &= P(X_{i,*})\times P(X_{*,j}) \times sum(X) \\[1em] &= \frac{ rowsum(X,i) \times colsum(X,j) }{ sum(X) } \end{align*} \]
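Following the definitions above, a PMI reweighting of a count matrix can be sketched in a few lines of NumPy (the `positive` flag for PPMI and the zero-count convention are my assumptions, not from the lecture):

```python
import numpy as np

def pmi(X, positive=False):
    """PMI reweighting of a count matrix X (rows = words, cols = contexts)."""
    X = np.asarray(X, dtype=float)
    expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / X.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        out = np.log(X / expected)
    out[~np.isfinite(out)] = 0.0      # convention: pmi = 0 for zero counts
    if positive:
        out = np.maximum(out, 0.0)    # PPMI clips negative values to zero
    return out

M = pmi([[10.0, 0.0], [0.0, 10.0]])
```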

It is worthwhile to recall some statistics concepts!

  • Other weighting schemes: t-test; TF-IDF.
  • TF-IDF: Term Frequency times Inverse Document Frequency. Useful for sparse matrices. Not that useful for dense ones.
    This weighting is tailored to the word-by-document design. It is defined in the following way:
\[ \begin{align*} TF(X,i,j) &= \frac{X_{ij}}{colsum(X,j)}\\[1em] IDF(X,i) &= \log\left(\frac{n}{|\{k : X_{ik} > 0 \}|}\right)\\[1em] TF\text{-}IDF(X,i,j) &= TF(X,i,j) \times IDF(X,i) \end{align*} \]
TF: the frequency of a word in a document.  
IDF: how infrequent the word is across the collection of \(n\) documents. If a word
appears in every document, this value equals zero.
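A minimal NumPy sketch of the TF-IDF weighting described above, assuming every word appears in at least one document (otherwise the IDF denominator would be zero):

```python
import numpy as np

def tfidf(X):
    """TF-IDF reweighting of a word-by-document count matrix X."""
    X = np.asarray(X, dtype=float)
    n_docs = X.shape[1]
    tf = X / X.sum(axis=0)          # term frequency, per document (column)
    df = (X > 0).sum(axis=1)        # number of documents containing each word
    idf = np.log(n_docs / df)
    return tf * idf[:, np.newaxis]

# Word 0 appears in both documents, so its IDF (and TF-IDF) is zero.
M = tfidf([[1.0, 1.0], [2.0, 0.0]])
```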

Material

Notebook: Designs, distance and reweighting

Table of Contents

  • Matrix designs
  • Vector comparison
    • Euclidean
    • Length normalization
    • Cosine distance
    • Cosine distance that's really a distance metric
    • Matching-based methods
    • Summary
  • Distributional neighbors
  • Matrix reweighting
    • Normalization
    • Observed/Expected
    • Pointwise Mutual Information
    • TF-IDF
  • Subword information
  • Visualization

Notebook notes

  • Intuition behind the word-by-word matrix frequencies. Suppose that our vocabulary is the following:

V = ("superb", "outstanding", "magnificent", "wonderful", "terrible", "w_u","w_v", "awful", "horrible", "disgusting")

Then, consider the two vectors below for different replacements of the words "w_u" and "w_v".

\[ \begin{align*} u = (0.25,0.25,0.25,0.25,1,0,0,0,0,0)\\[1em] v = (0,0,0,0,0,1,0.25,0.25,0.25,0.25) \end{align*} \]

They are orthogonal, that is, at the highest possible distance according to the cosine distance (for non-negative vectors).
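We can verify this numerically with the vectors above:

```python
import numpy as np

u = np.array([0.25, 0.25, 0.25, 0.25, 1, 0, 0, 0, 0, 0])
v = np.array([0, 0, 0, 0, 0, 1, 0.25, 0.25, 0.25, 0.25])

dot = float(u @ v)   # 0.0: the vectors are orthogonal
cosine_dist = 1 - dot / (np.linalg.norm(u) * np.linalg.norm(v))  # 1.0
```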

  • Subword information. It consists of representing words by their character n-grams. Two references are listed in the notebook.
    The character 4-grams for the word abandon are the following:
    • aban
    • band
    • ando
    • ndon

One may use subword representations to reduce sparsity or to build vectors for words that are not even present in the original vocabulary. The latter is based on the fact that words in many languages are composed by concatenating prefixes and suffixes together.
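Extracting character n-grams is a one-liner; `char_ngrams` is an illustrative name:

```python
def char_ngrams(word, n=4):
    """Return the character n-grams of a word, in order of appearance."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]
```

For example, `char_ngrams("abandon")` reproduces the four 4-grams listed above.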

Paper reading

Assignment: Word Relatedness

We are given a data set of word pairs that are human-annotated with a relatedness score: the higher the score, the higher the relatedness. Here are some examples,

Word 1      Word 2      Score
lake        water       0.9
liquor      scotch      0.955
decoration  interior    0.72
match       start       0.51
dead        pigeon      0.3
canyon      piano       0.1
love        object      0.01
sandwich    submarine   0.01

Note: these entries are taken from the given data set. In particular, sandwich and submarine are somewhat related if we think about sub sandwiches.

Spearman correlation coefficient

To illustrate the Spearman coefficient, let us start with an example. Consider the table below of word relatedness for some word pairs with human-annotated scores.

word 1    word 2    score
ball      player    0.8
water     fish      0.9
drink     chess     0.12
costume   beach     0.08
vehicle   bike      0.5

We order the elements of the table in ascending order of the score value. This ordering induces a ranking r.

word 1    word 2    score   rank
costume   beach     0.08    1
drink     chess     0.12    2
vehicle   bike      0.5     3
ball      player    0.8     4
water     fish      0.9     5

The rank uniquely represents a word pair. We use the rank to define the rank function, which is monotone.

Now, let us say that we have a model that scores pairs of words according to their relatedness, and that our model gives the predictions below.

word 1    word 2    score   rank
drink     chess     0.10    1
costume   beach     0.15    2
vehicle   bike      0.5     3
water     fish      0.75    4
ball      player    0.82    5

The Spearman correlation coefficient tells us how coherent the ranks of the pairs are with each other. That is, we obtain a Spearman coefficient of 1 if both tables assign the same rank to each pair. On the other hand, we obtain a Spearman coefficient of -1 if the ranks in the tables are exactly opposite.

In other words, the Spearman coefficient computes the Pearson coefficient between the rank values of the word pairs.

\[ r = \rho(R(X),R(Y)) = \frac{cov(R(X),R(Y))}{\sigma(R(X))\sigma(R(Y))} \]
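The definition can be checked on the two tables above. This sketch ignores ties (scipy.stats.spearmanr handles them properly); each entry pairs a word pair's human score with its model score:

```python
import numpy as np

def rank(scores):
    """Rank values starting at 1 (no tie handling in this sketch)."""
    order = np.argsort(scores)
    r = np.empty(len(scores))
    r[order] = np.arange(1, len(scores) + 1)
    return r

def spearman(x, y):
    """Spearman = Pearson correlation between the two rank vectors."""
    return float(np.corrcoef(rank(x), rank(y))[0, 1])

# (costume,beach), (drink,chess), (vehicle,bike), (ball,player), (water,fish)
human = [0.08, 0.12, 0.5, 0.8, 0.9]
model = [0.15, 0.10, 0.5, 0.82, 0.75]
```

Here the model swaps two neighboring ranks at each end of the table, giving a coefficient of 0.8.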

Repeated pairs

The test set has repeated pairs. For example, the pair (bank, money) was scored four times. If we have two instances, we take the mean; if we have more than two, we take the median.
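A small helper capturing this aggregation rule (`aggregate_scores` is an illustrative name, not from the assignment code):

```python
import statistics

def aggregate_scores(scores):
    """Combine repeated annotations: mean for two values, median for more."""
    if len(scores) == 1:
        return scores[0]
    if len(scores) == 2:
        return statistics.mean(scores)
    return statistics.median(scores)
```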

VSMs and the evaluation function

The evaluation function expects a VSM (vector space model). It is not clear which design is expected; I initially assumed it would only work with the word-by-word design.

A VSM, in our context, is a vector representation of a collection of words. The values in the vector components could be anything; they depend on our goals.

After writing the above, I guess the answer to the question in the first paragraph is yes: it will accept any kind of VSM.

Baseline system and random score

A perfectly reasonable baseline system is the random one. Your system must, on average, return better results than random guesses.

Error analysis

We are interested in checking how the human-annotated scores relate to our prediction scores. Since these two measures could be on arbitrarily different scales, a technique that works in general is to compare the rank induced by the human-annotated scores with the rank induced by the prediction scores.

In general terms, we compare the rank of a word pair under the human-annotated scores with the rank of the same pair under the prediction scores.

The error is defined as the difference in the rank values: the higher the difference, the higher the error.

PPMI baseline

Since PPMI is such a standard reweighting technique, reported to improve prediction scores on several word-relatedness problems, it is worth evaluating our system with PPMI reweighting.

Model                      Spearman correlation with human-annotated scores
random                     -0.0006
giga20_cosine-dist          0.2776
giga20_PPMI_cosine-dist     0.586

Gigaword with LSA at different dimensions

We simply compute LSA (SVD-based dimensionality reduction) on the Gigaword data and use the resulting VSM to compute relatedness scores for the vocabulary in the dev set, then compute the Spearman coefficient between these scores and the human-annotated ones in the dev set.

The score here was: 0.545

t-test reweighting

This is another reweighting scheme. Despite the name, it does not seem to be related to Student's t-test. Here is the definition:

\[ ttest(X,i,j) = \frac{P(X,i,j)-P(X,i,*)P(X,*,j)}{\sqrt{(P(X,i,*)P(X,*,j))}} \]
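Under the definition above, the t-test reweighting can be sketched directly in NumPy:

```python
import numpy as np

def ttest_reweight(X):
    """t-test reweighting of a count matrix X, per the formula above."""
    X = np.asarray(X, dtype=float)
    P = X / X.sum()                       # joint probabilities P(X,i,j)
    p_row = P.sum(axis=1, keepdims=True)  # marginals P(X,i,*)
    p_col = P.sum(axis=0, keepdims=True)  # marginals P(X,*,j)
    expected = p_row * p_col
    return (P - expected) / np.sqrt(expected)

M = ttest_reweight([[10.0, 0.0], [0.0, 10.0]])
```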

Pooled BERT representation

Here we implement the Bommasani approach of converting a contextual VSM into a static VSM. We implemented only the decontextualized strategy, in which the word's context is formed by the word itself and nothing else.

The score here was: 0.4

Learned distance functions

We are not constrained to standard distance functions such as Euclidean or cosine. Indeed, any mapping from a pair of vectors to a real number is suitable as a distance function in the context of our work.

In this exercise, we trained a k-nearest neighbors model to use as a distance function. To train the model, we used a pre-computed VSM (giga20) and a split of our dev_set (pairs of words with human-annotated relatedness scores) to train and test the model.

For each pair of words in the training set, we concatenate their VSM representations to derive one single vector. We then use the k-nearest neighbors on this higher dimensional space.
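The concatenate-then-neighbor idea can be sketched with a plain nearest-neighbor average (a simplified stand-in for the trained model; all names are illustrative):

```python
import numpy as np

def knn_predict(train_pairs, train_scores, query_pair, k=3):
    """Predict a relatedness score for a word pair by averaging the scores
    of its k nearest training pairs in the concatenated-vector space."""
    X = np.asarray(train_pairs, dtype=float)
    y = np.asarray(train_scores, dtype=float)
    q = np.asarray(query_pair, dtype=float)
    dists = np.linalg.norm(X - q, axis=1)  # Euclidean distance to each pair
    nearest = np.argsort(dists)[:k]
    return float(y[nearest].mean())
```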

It makes sense when we consider that pairs of related words should be close to each other; the same holds for unrelated pairs. The problem is that our concatenation strategy doubles the dimension of the space, and I believe that most of the points tend to be far apart from each other. I think it could work better if we applied a dimensionality reduction first.

Indeed, the score improved from -0.6 to -0.3. Not very remarkable.

Questions to answer

  1. How does it compare with GloVe?
  2. What are the main characteristics of the model?
  3. Which objective (loss) function is used?
  4. Which machine learning model (concept) is used?
  5. How much time does it take to train?
  6. How much time does it take to get an answer?
  7. How does the model behave with more or less training data?
  8. How does the model compare with competitors?
  9. How could we apply this model in a SH context?
  10. What are the model limitations?