
Language Modelling

This section groups my reading notes of papers in the subject of Language Modelling.

The mathematics of statistical machine translation

Paper reference: The mathematics of statistical machine translation

Summary

This paper presents five models trained to perform translation between French and English sentences. All models are derived from a general distribution over translation pairs proposed by the authors. The models are explained in increasing order of complexity, and they work together: model 1 is used to estimate parameters for model 2, which are then used to estimate parameters for model 3, and so on.

The paper is written in the form of a tutorial and it is very well written. It is quite dense in some parts, but the article itself is self-contained.

Among the techniques used are

  • Joint distribution and maximum likelihood computation
  • Expectation-Maximization algorithm
  • Viterbi algorithm
  • Lagrange multipliers
Can we use this model to construct word embeddings?

Nowadays translation data abounds. Assuming that we have a good model for translation, can we use it to extrapolate to other tasks? Can we use transfer learning to tackle our conditioning problem?

This translation model is more powerful than static word embeddings because it is dependent on context.

What tasks do we perform during conditioning?

  • Expand abbreviations (e.g. FBI -> Federal Bureau of Investigation)
  • Reduce names to abbreviations (e.g. Federal Bureau of Investigation -> FBI)
  • Make some words optional (e.g. Target [Supermarket])
  • Handle numbers written in numeric and textual form (e.g. 1 -> one)
  • Handle word number variation (e.g. place -> places)
  • Add optional words

How does this model compare with recent deep approaches?

I would like to compare this model with the state-of-the-art deep model for translation and list their advantages and disadvantages. Is the deep model a generalization of this approach?

Let us say that we want to translate a French sentence \(f\) into the corresponding English sentence \(e\). This can be modeled as a maximum likelihood problem: we want to find the English sentence \(e\) with the highest probability given the French sentence \(f\).

The fundamental equation of machine translation is \((1)\)

\[ \begin{align} \operatorname{argmax}_e P(e \mid f) = \operatorname{argmax}_e P(e)P(f \mid e) \end{align} \]
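This equation follows from Bayes' rule: since \(P(f)\) does not depend on \(e\), it can be dropped from the maximization,

\[ \begin{align} \operatorname{argmax}_e P(e \mid f) = \operatorname{argmax}_e \frac{P(e)P(f \mid e)}{P(f)} = \operatorname{argmax}_e P(e)P(f \mid e) \end{align} \]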

The right side of the equation can be interpreted as follows:

  1. \(P(e)\) is the language model probability, that is, the probability that a sentence \(e\) belongs to the class of well-formed English sentences.
  2. \(P(f|e)\) is the translation model probability.

The two factors cooperate. The translation model probability is large for English strings, whether well- or ill-formed, that have the necessary words in them in roughly the right places to explain the French. The language model probability is large for well-formed English strings regardless of their connection to the French. Together, they produce a large probability for well-formed English strings that account well for the French. We cannot achieve this simply by reversing our translation models.

Why can we estimate \(P(f|e)\) adequately but not \(P(e|f)\)?

Let us say that we want to translate the following French sentence into English

Je vais à l'école

I argue that for the purposes of our translation problem it is not important to have a probability estimate for the exact sentence above. We could aggregate the probabilities of all permutations of the above sentence

  1. Je vais à l'école
  2. Je vais l'école à
  3. Je l'école vais à
  4. Je l'école à vais
  5. Je à l'école vais
  6. Je à vais l'école

  7. vais Je à l'école
  ...
  8. l'école vais Je à

and do the same for english sentences in order to estimate \(P(f|e)\). On the other hand, \(P(e)\) should give larger values for well-formed english sentences.

It is much easier to get a good estimate of \(P(f|e)\) when we aggregate sentences (ill-formed and well-formed) than when we consider them separately.

Notation: Alignments and cepts

  1. \(\text{The}_1 \text{ poor}_2 \text{ don't}_3 \text{ have}_4 \text{ any}_5 \text{ money}_6\)
  2. \(\text{Les}_1 \text{ pauvres}_2 \text{ sont}_3 \text{ démunis}_4\)

Alignment \((\mathbf{f} \mid \mathbf{e})\): Les pauvres sont démunis | The(1) poor(2) don't(3,4) have(3,4) any money(3,4)

Cepts:

  1. ( (Les),(The) )
  2. ( (pauvres), (poor) )
  3. ( (sont, démunis), (don't, have, money) )

An alignment is a set of connections between the words of two sentences. Given a French sentence with \(m\) words and an English sentence with \(l\) words, there are \(lm\) possible connections. An alignment is any subset of these connections, so there are \(2^{lm}\) possible alignments.
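As a sanity check, a small Python sketch (with arbitrary toy lengths of my own choosing) can enumerate every subset of connections and confirm the \(2^{lm}\) count:

```python
from itertools import combinations

# Toy sentence lengths (arbitrary): l = 2 words on one side, m = 3 on the other.
l, m = 2, 3
connections = [(i, j) for i in range(l) for j in range(m)]  # all l*m connections

# An alignment is any subset of the set of connections.
alignments = [subset for r in range(len(connections) + 1)
              for subset in combinations(connections, r)]

print(len(alignments), 2 ** (l * m))  # both equal 64
```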

Generalized functions

On page 9 of the article, when describing model 1, the derivative of the likelihood function is computed. The likelihood function is described in terms of a generalized probability density function. That is, we use delta functions to transform a discrete probability mass function into a continuous pdf.

The problem is that I am not able to understand how the derivative is computed. Here are some references that I found useful.

  1. Generalized PDF

About model 1 and derivation of the optimal condition

Model 1 is the simplest of the models exposed in the article. Several independence assumptions are made, and the result is a likelihood expression that is simple enough to be optimized exactly.

The probability of a translation together with an alignment is the following \((2)\)

\[ \begin{align} Pr(\mathbf{f}, \mathbf{a} \mid \mathbf{e}) &= \frac{\epsilon}{(l+1)^m}\prod_{j=1}^{m}t(f_j \mid e_{a_j}) \end{align} \]

Recalling the notation, we have

  • \(\mathbf{f}\) is the French string one wants to translate
  • \(\mathbf{e}\) is the English string that we assume is the translation of \(\mathbf{f}\)
  • The French string has length \(m\), that is: \(\mathbf{f} = f_1,f_2,\cdots,f_m\)
  • The English string has length \(l\), that is: \(\mathbf{e} = e_1,e_2,\cdots,e_l\)
  • An alignment \(\mathbf{a}\) is a sequence of \(m\) numbers \(a_1,\dots,a_m\), each between \(0\) and \(l\), connecting each French word \(f_j\) to the English word \(e_{a_j}\) (position \(0\) stands for the empty cept)

In model 1, we are interested in the probability above summed over all possible alignments. That gives us the expression \((3)\)

\[ \begin{align} Pr(\mathbf{f} \mid \mathbf{e}) &= \frac{\epsilon}{(l+1)^m}\sum_{a_1=0}^{l}\dots\sum_{a_m=0}^{l}\prod_{j=1}^{m}t(f_j \mid e_{a_j}) \end{align} \]

The next step is to maximize the expression above with respect to the translation probabilities \(t(f\mid e)\). The first thing to notice is that the optimization is constrained: for each \(e\), the translation probabilities must sum to one,

\[ \begin{align} \sum_f{t(f \mid e)} = 1 \end{align} \]

The Lagrangian of the expression to be optimized is then equal to \((4)\)

\[ \begin{align} h(t,\lambda) &= \frac{\epsilon}{(l+1)^m}\sum_{a_1=0}^{l}\dots\sum_{a_m=0}^{l}\prod_{j=1}^{m}{t(f_j \mid e_{a_j})} - \sum_e{\lambda_e\big(\sum_{f}t(f \mid e) -1 \big) } \end{align} \]

The maximum will occur at the values of \(t\) and \(\lambda_e\) for which the partial derivatives of the expression above vanish. We regard \(t(\cdot \mid \cdot)\) as a vector whose components are the values \(t(f \mid e)\) for each word pair. Therefore, we need to compute the partial derivative of the expression above with respect to each component. That gives us \((5)\):

\[ \begin{align} \frac{\partial h}{\partial t(f \mid e)} &= \frac{\epsilon}{(l+1)^m}\sum_{a_1=0}^{l}\dots\sum_{a_m=0}^{l}\sum_{j=1}^{m}\delta(f,f_j)\delta(e,e_{a_j})t(f \mid e)^{-1}\prod_{k=1}^{m}{t(f_k \mid e_{a_k})} - \lambda_e \end{align} \]

Notice that, for each alignment,

\[ \begin{align} \frac{\partial}{\partial t(f \mid e)} \prod_{k=1}^{m}{t(f_k \mid e_{a_k})} = \sum_{j=1}^{m}\delta(f,f_j)\delta(e,e_{a_j})t(f \mid e)^{-1}\prod_{k=1}^{m}{t(f_k \mid e_{a_k})} \end{align} \]

Setting \((5)\) equal to zero we obtain \((6)\)

\[ \begin{align} t(f \mid e) &= \lambda_e^{-1}\frac{\epsilon}{(l+1)^m}\sum_{a_1=0}^{l}\dots\sum_{a_m=0}^{l}\sum_{j=1}^{m}\delta(f,f_j)\delta(e,e_{a_j})\prod_{k=1}^{m}{t(f_k \mid e_{a_k})} \end{align} \]

Notice that \(t(f \mid e)\) appears on both sides. This suggests a fixed-point method to obtain \(t(f \mid e)\): starting from any non-zero initial guess, we iteratively evaluate the expression above until it converges. In particular, this iteration is an instance of the Expectation-Maximization method.

Nonetheless, we cannot carry out this plan with the expression above as it is: it requires on the order of \((l+1)^m\) operations, which is infeasible. Fortunately, we can do some manipulation.

First, with the help of equation \((2)\) we rewrite expression \((6)\) to obtain expression \((7)\):

\[ \begin{align} t(f \mid e) &= \lambda_e^{-1}\sum_{\mathbf{a}}Pr(\mathbf{f},\mathbf{a} \mid \mathbf{e})\sum_{j=1}^{m}\delta(f,f_j)\delta(e,e_{a_j}) \end{align} \]

Here we define the concept of count of \(f\). The count of \(f\) is the expected number of times that \(e\) connects to \(f\) in the translation \((\mathbf{f}\mid\mathbf{e})\). By definition we have \((8)\),

\[ \begin{align} c(f\mid e; \mathbf{f},\mathbf{e}) &= \sum_{\mathbf{a}}Pr(\mathbf{a}\mid \mathbf{e},\mathbf{f})\sum_{j=1}^{m}\delta(f,f_j)\delta(e,e_{a_j}) \end{align} \]

Notice that \(Pr(\mathbf{a}\mid \mathbf{e},\mathbf{f}) = \frac{Pr(\mathbf{f,a}\mid \mathbf{e})}{Pr(\mathbf{f}\mid\mathbf{e})}\). For the sake of obtaining a simpler expression, we absorb the value of \(Pr(\mathbf{f}\mid\mathbf{e})\) into \(\lambda_e\). That is (with some abuse of notation), \(\lambda_e\) from now on really represents \(\lambda_e Pr(\mathbf{f}\mid\mathbf{e})\). Notice that this causes us no trouble. With these considerations, we can rewrite \((7)\) to obtain \((9)\)

\[ \begin{align} t(f \mid e) &= \lambda_e^{-1}c(f \mid e;\mathbf{f,e}) \end{align} \]

The real problem at this point is the computation of \((8)\). We are going to recast expression \((8)\) so that its computation becomes feasible.

First of all, notice that we can simplify \((3)\) and rewrite it as \((10)\)

\[ \begin{align} Pr(\mathbf{f} \mid \mathbf{e}) &= \frac{\epsilon}{(l+1)^m}\sum_{a_1=0}^{l}\dots\sum_{a_m=0}^{l}\prod_{j=1}^{m}t(f_j \mid e_{a_j})\\[1em] &= \frac{\epsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}t(f_j\mid e_i) \end{align} \]
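The rearrangement in \((10)\), swapping the sum of products for a product of sums, can be checked numerically. Here is a small sketch where arbitrary random numbers stand in for the probabilities \(t(f_j \mid e_i)\):

```python
import itertools
import math
import random

random.seed(0)
l, m = 3, 2  # English length l (plus the empty cept at position 0), French length m

# Arbitrary random values standing in for the translation probabilities t(f_j | e_i).
t = [[random.random() for _ in range(l + 1)] for _ in range(m)]

# Left-hand side of (10): brute-force sum over all (l+1)^m alignments.
brute = sum(math.prod(t[j][a[j]] for j in range(m))
            for a in itertools.product(range(l + 1), repeat=m))

# Right-hand side of (10): a product of m per-word sums.
factored = math.prod(sum(t[j][i] for i in range(l + 1)) for j in range(m))

assert abs(brute - factored) < 1e-12  # the two forms agree
```

The brute-force side costs \((l+1)^m\) products, while the factored side costs only \(m(l+1)\) additions, which is the whole point of the rewrite.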

To proceed with our derivation, we substitute \((10)\) into the optimization expression \((4)\) and apply the necessary optimization condition once again.

\[ \begin{align} h(t,\lambda) &= \frac{\epsilon}{(l+1)^m}\prod_{j=1}^{m}\sum_{i=0}^{l}t(f_j\mid e_i) - \sum_e\lambda_e\big(\sum_{f}t(f\mid e) -1\big) \end{align} \]
\[ \begin{align} \frac{\partial h}{\partial t(f\mid e)} &= \frac{\epsilon}{(l+1)^m}\sum_{j=1}^{m}\delta(f,f_j)\sum_{k=0}^{l}\delta(e,e_k)\Big(\sum_{i=0}^{l}t(f_j\mid e_i)\Big)^{-1}\prod_{j'=1}^{m}\sum_{i'=0}^{l}t(f_{j'}\mid e_{i'}) - \lambda_e \end{align} \]

Setting the last expression above equal to zero and multiplying both sides by \(t(f \mid e)\) we obtain:

\[ \begin{align} t(f\mid e) &= \frac{\epsilon}{(l+1)^m}\prod_{j'=1}^{m}\sum_{i'=0}^{l}t(f_{j'}\mid e_{i'})\frac{t(f \mid e)}{\sum_{i=0}^{l}t(f \mid e_i)}\sum_{j=1}^{m}\delta(f,f_j)\sum_{k=0}^{l}\delta(e,e_k) \\[1em] &= Pr(\mathbf{f} \mid \mathbf{e})\frac{t(f \mid e)}{\sum_{i=0}^{l}t(f \mid e_i)}\sum_{j=1}^{m}\delta(f,f_j)\sum_{k=0}^{l}\delta(e,e_k) \end{align} \]

Comparing the latter with expressions \((7)\) and \((8)\), we obtain \((11)\)

\[ \begin{align} c(f \mid e;\mathbf{f,e}) &= \frac{t(f \mid e)}{t(f\mid e_0) + \cdots + t(f \mid e_l)}\sum_{j=1}^{m}\delta(f,f_j)\sum_{k=0}^{l}\delta(e,e_k) \end{align} \]

We now have a way to compute the count of \(f\) with \((l+1)\) operations instead of \((l+1)^m\)!

In practice, we estimate \(t(f \mid e)\) by a process called training. Given a corpus of \(S\) translation pairs, the equation we iteratively apply until convergence is \((12)\)

\[ \begin{align} t(f \mid e) &= \lambda_e^{-1}\sum_{s=1}^{S}c(f \mid e; \mathbf{f}^{(s)},\mathbf{e}^{(s)}) \end{align} \]

Notice that \(\lambda_e^{-1}\) plays the role of the probability normalizer. In other words,

\[ \begin{align} \lambda_e &= \sum_f\sum_{s=1}^{S}c(f \mid e; \mathbf{f}^{(s)},\mathbf{e}^{(s)}) \end{align} \]

Here is how the training process goes:

  1. Set a non-zero initial guess for \(t(f \mid e)\).
  2. For each pair \((\mathbf{f}^{(s)}, \mathbf{e}^{(s)})\) in the training set, compute the counts of \(f\).
  3. For each \(e\) that appears in at least one of the \(\mathbf{e}^{(s)}\):
    • Compute \(\lambda_e\)
    • For each \(f\) that appears in at least one \(\mathbf{f}^{(s)}\), use equation \((12)\) to obtain the new value of \(t(f \mid e)\).
  4. Repeat steps 2 and 3 until convergence.
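The whole training loop above can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the paper's implementation: the corpus, the "NULL" token standing for the empty cept, and the fixed iteration count are all arbitrary choices.

```python
from collections import defaultdict

# Toy parallel corpus of (French, English) pairs; "NULL" stands for the
# empty cept e_0. The sentences are made-up examples, not the paper's data.
corpus = [
    ("la maison".split(),  "NULL the house".split()),
    ("la fleur".split(),   "NULL the flower".split()),
    ("une maison".split(), "NULL a house".split()),
]

# Step 1: non-zero (uniform) initial guess for t(f | e).
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(20):  # steps 2-4: alternate counting and re-estimation
    counts = defaultdict(float)   # c(f | e) accumulated over all pairs
    totals = defaultdict(float)   # lambda_e, the normalizer for each e
    for fs, es in corpus:
        for f in fs:
            denom = sum(t[(f, e)] for e in es)  # denominator of equation (11)
            for e in es:
                c = t[(f, e)] / denom           # this pair's contribution to the count
                counts[(f, e)] += c
                totals[e] += c
    for f, e in counts:                         # equation (12)
        t[(f, e)] = counts[(f, e)] / totals[e]
```

After a handful of iterations, \(t(\text{maison} \mid \text{house})\) should clearly dominate the other candidate translations of "house", even though no alignment was ever annotated.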

Model 2

  • Viterbi algorithm. It is used to find the maximum likelihood sequence under a joint distribution. It is not that difficult to understand; the Wikipedia article is quite clear. A nice application example is finding the most likely sequence of hidden states in a Hidden Markov Model. The Viterbi algorithm is a dynamic programming algorithm.
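To make the bullet concrete, here is a minimal Viterbi sketch on a toy HMM. The states, observations, and probabilities are the standard textbook weather example, not anything from the paper:

```python
# Toy HMM: hidden weather states, observed activities.
states = ["Rainy", "Sunny"]
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # V[t][s]: probability of the best state sequence that ends in s at time t.
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = []  # back-pointers used to recover the argmax path
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] * trans[p][s])
            row[s] = V[-1][prev] * trans[prev][s] * emit[s][o]
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # Trace back from the most probable final state.
    best = max(states, key=lambda s: V[-1][s])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["walk", "shop", "clean"]))  # → ['Sunny', 'Rainy', 'Rainy']
```

The dynamic-programming table \(V\) is exactly what makes the search linear in the sequence length instead of exponential in the number of state sequences.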

TBD:

List of tricks:

  • The interpretation of \(\lambda_e\) as a probability normalizer is fundamental to the algorithm computation.
  • The use of delta functions to express the derivatives. In practice, the delta function is a notational device to filter the values to be computed.
  • Interpreting the probability function \(t(f \mid e)\) as a discrete function and representing it as a vector in which each component corresponds to a value \(t(f_j \mid e_k)\).