
Lecture 3

Dimensionality Reduction

It seems that by applying dimensionality reduction techniques one can identify high-order co-occurrence among words.

For an example of high-order co-occurrence, consider the words gnarly and wicked. Both are slang: one is used on the East Coast and the other on the West Coast of the United States. Suppose we have a word × document design. Because of the geographic distance, it is natural to expect that gnarly and wicked would not appear in the same document, that is, they would not co-occur. However, they have the same meaning, and both co-occur with awesome.

Dimensionality reduction techniques are supposed to capture this high-order interaction, in which an intermediate word, awesome, makes the connection between two words that have the same meaning but do not co-occur.

Latent Semantic Analysis (LSA)

  • Method from Deerwester et al. (1990).
  • Standard baseline, often very tough to beat.
  • Based on Singular Value Decomposition
  • For any real matrix \(A\) there exists a factorization into matrices \(T\), \(S\), \(D\) such that
\[ A_{m\times n} = T_{m\times m}S_{m\times n}D^T_{n \times n} \]

\(T\) holds the orthogonal column vectors the method outputs for representing the data; \(S\) holds the weights of these column vectors (ordered from most to least important); and \(D\) holds the coefficients one should use to recover the original vectors in \(A\) from the vector space encoded by \(T\) and \(S\).

  • How to choose \(k\)? Usually, the best \(k\) is the one that gives the most representative space with the smallest dimension. One can inspect the singular values in matrix \(S\) and choose the first \(k\) such that at \(k+1\) there is a drastic drop in the singular value. However, in the matrices we handle in this course, the pattern of singular values does not present such a distinguishable drop. In practice, \(k\) is treated as a hyper-parameter to be optimized.

  • Other dimensionality reduction techniques:

    • Principal Component Analysis (PCA)
    • Non-Negative Matrix Factorization (NMF)
    • Probabilistic LSA (PLSA)
    • Latent Dirichlet Allocation (LDA)
    • t-SNE

Take a look at sklearn.decomposition and sklearn.manifold.
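As those module names suggest, a minimal LSA run can be sketched with scikit-learn's TruncatedSVD; the count matrix and the choice of \(k\) below are made up purely for illustration:

```python
# A minimal LSA sketch with scikit-learn's TruncatedSVD, which computes a
# truncated SVD without densifying sparse inputs.
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.RandomState(0)
X = rng.poisson(1.0, size=(100, 40)).astype(float)  # fake 100-term x 40-doc counts

lsa = TruncatedSVD(n_components=10)  # k would be a hyper-parameter in practice
W = lsa.fit_transform(X)             # rows: k-dimensional term vectors (T S, truncated)

print(W.shape)                       # (100, 10)
print(lsa.singular_values_[:3])      # ordered from most to least important
```

Inspecting `lsa.singular_values_` is exactly the "look for a drastic drop" heuristic mentioned above.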

Autoencoders

  • Flexible class of deep learning architectures for learning reduced dimensional representations.
  • Chapter 14 of the Goodfellow et al. (2016) book contains a very deep discussion and plenty of examples.
  • In our problem, it is usually better to feed an already dimension-reduced matrix (e.g., from LSA) as the input.
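A toy autoencoder can be sketched directly in NumPy, just to make the encoder/decoder idea concrete; the architecture, learning rate, and data below are arbitrary, and real work would use a deep-learning framework:

```python
# Sketch of a tiny autoencoder: one tanh hidden layer (encoder) and a linear
# output layer (decoder), trained to reconstruct its own input.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(200, 20)              # 200 examples, 20 features
k = 5                              # size of the learned representation

W1 = rng.randn(20, k) * 0.1; b1 = np.zeros(k)
W2 = rng.randn(k, 20) * 0.1; b2 = np.zeros(20)

lr = 0.1
losses = []
for _ in range(500):
    H = np.tanh(X @ W1 + b1)       # encoder: 20 dims -> k dims
    Xhat = H @ W2 + b2             # decoder: k dims -> 20 dims
    err = Xhat - X
    losses.append(float((err ** 2).mean()))
    # backprop of the mean squared reconstruction error
    g_out = 2 * err / X.size
    gW2 = H.T @ g_out; gb2 = g_out.sum(0)
    g_h = (g_out @ W2.T) * (1 - H ** 2)
    gW1 = X.T @ g_h; gb1 = g_h.sum(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

print(losses[0], losses[-1])       # reconstruction error should drop
```

With the non-linearity removed, this reduces to a linear projection much like the SVD-based methods above, which is the sense in which autoencoders generalize them.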

GloVe

  • It stands for Global Vectors.
  • It first appeared in Pennington et al. (2014).
  • Implementations:
    • torch_glove.py
    • reference implementation in vsm.py
    • super duper fast C implementation by the GloVe team.
  • There is a relation between GloVe and PMI
  • The objective of GloVe is to learn vectors for words such that their dot product is proportional to their log probability of co-occurrence.
\[ \begin{align*} w_i^Tw_k &= \log(P_{ik}) = \log(X_{ik}) - \log(X_{i*} \cdot X_{*k})\\[1em] &= \log\left(\frac{X_{ik}}{\mathrm{expected}(X,i,k)}\right) = \log\left(\frac{P(X_{ik})}{P(X_{i*}) \cdot P(X_{*k})}\right) \end{align*} \]
  • Weighted GloVe (\(\alpha\) is usually set to \(0.75\))
\[ \begin{align*} &\sum_{i,j=1}^{|V|}{f(X_{ij})\left(w_i^Tw_j - \log(X_{ij})\right)^2}\\[1em] &f(x) = \begin{cases} (x/x_{max})^{\alpha} & \text{if $x<x_{max}$}\\ 1 & \text{otherwise} \end{cases} \end{align*} \]
  • The \(X_{ij}\) values are given by the corpus being analyzed. The GloVe objective will find a vector representation for each word in the corpus vocabulary with the property that dot products between word vectors approximate the log of their co-occurrence probability ratios (PMI).
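A minimal sketch of this weighted objective in its simplified, bias-free form, optimized with plain gradient descent on a fabricated symmetric count matrix; the reference implementations use AdaGrad and learn separate word and context vectors plus bias terms:

```python
# Toy GloVe-style fit: make W @ W.T approximate log X under the f(X) weighting.
import numpy as np

def weight(x, x_max=100.0, alpha=0.75):
    # the GloVe weighting function f(x)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

rng = np.random.RandomState(0)
C = rng.poisson(5.0, size=(10, 10))
X = C + C.T + 1.0                  # fake symmetric co-occurrence counts, no zeros
W = rng.randn(10, 4) * 0.1         # one 4-dimensional vector per "word"

f, logX = weight(X), np.log(X)
losses = []
for _ in range(2000):
    diff = W @ W.T - logX                  # w_i^T w_j - log X_ij, for all pairs
    losses.append(float((f * diff ** 2).sum()))
    W -= 0.01 * 4 * (f * diff) @ W         # gradient step (f and diff are symmetric)

print(losses[0], losses[-1])               # the objective should drop sharply
```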

Visualization

A bunch of techniques for plotting our high-dimensional vector space in two or three dimensions. The package sklearn.manifold contains some of these methods. The slides show examples for t-SNE.
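For instance, a quick t-SNE projection with sklearn.manifold; the data here is random and only illustrates the API, and the parameters are arbitrary:

```python
# Project 50-dimensional "word vectors" down to 2-D with t-SNE.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.rand(30, 50)                       # 30 vectors of dimension 50

emb = TSNE(n_components=2, perplexity=5, init="random",
           random_state=0).fit_transform(X)
print(emb.shape)                           # (30, 2) -- ready to scatter-plot
```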

Material

Notebook: Dimensionality Reduction and representation learning

The gnarly and wicked example.

In the notebook we have this toy example illustrating how LSA can help us identify hidden relations between words. Consider this word × document matrix

term d1 d2 d3 d4 d5 d6
gnarly 1 0 1 0 0 0
awesome 1 1 1 1 0 0
wicked 0 1 0 1 0 0
lame 0 0 0 0 1 1
terrible 0 0 0 0 0 1

The LSA truncated at two dimensions gives the following basis vectors. These are the directions of highest variability in the original matrix above.

term k1 k2
gnarly 0.41 0.00
awesome 0.82 0.00
wicked 0.41 0.00
lame 0.00 0.85
terrible 0.00 0.53
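These numbers can be checked with NumPy's exact SVD; column signs are arbitrary in an SVD, so absolute values are taken:

```python
# Reproduce the truncated-LSA table for the toy gnarly/wicked matrix.
import numpy as np

A = np.array([
    [1, 0, 1, 0, 0, 0],   # gnarly
    [1, 1, 1, 1, 0, 0],   # awesome
    [0, 1, 0, 1, 0, 0],   # wicked
    [0, 0, 0, 0, 1, 1],   # lame
    [0, 0, 0, 0, 0, 1],   # terrible
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
T = np.abs(U[:, :2])      # first two basis vectors, signs normalized
print(np.round(T, 2))
```

The printed matrix matches the table above: gnarly and wicked get identical representations even though they never co-occur.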

We notice that LSA, and dimensionality reduction techniques in general, are in a sense grouping terms together. An interpretation for the example above could be: if I had to synthesize the six documents into only two documents, what would they look like?

Another point of view is to think about overfitting. In the original matrix, the documents are so finely subcategorized that we miss some generalizations, like the one for gnarly and wicked.

GloVe

The goal of GloVe is to define a word-by-word matrix such that the dot products of its row and column vectors equal the log probability of co-occurrence (the PMI weighting).

Notice that this is not exactly the same as reweighting the matrix using PMI. The latter uses the original count matrix to reweight the matrix, while GloVe has this information encoded in the learned vectors themselves.
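For contrast, a sketch of plain PMI reweighting of a count matrix; zero counts are simply left at 0 here, whereas positive-PMI variants would also clip negative values:

```python
# PMI reweighting: log of observed count over the count expected under
# independence of the row and column marginals.
import numpy as np

def pmi(X):
    total = X.sum()
    expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / total
    out = np.zeros_like(X, dtype=float)
    nz = X > 0
    out[nz] = np.log(X[nz] / expected[nz])
    return out

X = np.array([[10., 0.], [2., 5.]])
print(np.round(pmi(X), 3))
```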

I wonder if it is possible to compute the GloVe matrix by applying PMI reweighting repeatedly. That is,

\[ \begin{align*} A^{(0)} &= A\\ A^{(t)} &= PMI(A^{(t-1)})\\ G &=? \lim_{t \rightarrow \infty}A^{(t)} \end{align*} \]

On that note, I am not sure how the objective function of GloVe is usually optimized, although I believe it is some sort of gradient descent method (the original paper uses AdaGrad).

Autoencoders

Autoencoders are a kind of generalization of linear-algebra dimensionality reduction techniques to non-linear functions. I really need to read Chapter 14 of the Goodfellow book to understand them better.

Other methods

Several worth-reading references are listed here. It is probably a good idea to get back to them in a second pass through this material.

Paper reading