NLP Reflexions

This section groups papers that tackle general topics in NLP, such as the current state of the art, content aimed at the general public, or the ethics of NLP and AI.

On our best behaviour

Paper link: On Our Best Behaviour
Citation: Levesque, Hector J. "On our best behaviour." Artificial Intelligence 212 (2014): 27-35.

The author shares his thoughts on what intelligence really is and what it takes for a machine to be declared intelligent. The paper also contains a mild critique of the current hype around machine learning techniques and big data.

Winograd schemas

Intelligence is too complex to have an analytical definition. The idea behind the Turing Test is to demonstrate intelligence without the need for a definition. The author argues, though, that the Turing Test is vulnerable to tricks. A common strategy used by bot candidates to pass the Turing Test is to be as evasive as possible in their answers. These chat bots rely heavily on wordplay, jokes, quotations and emotional outbursts.

A more adequate test would be something along the lines of captchas, but with words instead of images. Winograd schemas consist of a short sentence followed by a multiple-choice question. For example,

Joan made sure to thank Susan for the help she had given. Who had given the help? 1. Joan. 2. Susan

In Winograd questions, there is always a way to slightly modify the schema so that the correct answer changes. In the example above, replacing given with received changes the correct answer from Susan to Joan.

The trophy would not fit in the brown suitcase because it was so small. What was so small? 1. trophy 2. suitcase (replacement=big)

Besides replacing a word, we can also simply paste another piece of information into the sentence to change its answer.

The trophy would not fit in the brown suitcase despite the fact that it was so small. What was so small? 1. trophy 2. suitcase (replacement=big)
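The structure of such a test item can be made concrete with a small sketch. The class and field names below are illustrative, not from the paper; the point is that a schema carries one "special" word whose replacement flips the correct answer.

```python
# A minimal sketch (not from the paper) of a Winograd schema as a test item.
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence: str        # contains the "special" word
    question: str
    candidates: tuple    # the two possible referents
    special_word: str    # word whose replacement flips the answer
    replacement: str
    answer: str          # correct answer for the original sentence
    flipped_answer: str  # correct answer after the replacement

    def variant(self) -> str:
        """Return the sentence with the special word replaced."""
        return self.sentence.replace(self.special_word, self.replacement)

schema = WinogradSchema(
    sentence="The trophy would not fit in the brown suitcase because it was so small.",
    question="What was so small?",
    candidates=("trophy", "suitcase"),
    special_word="small",
    replacement="big",
    answer="suitcase",
    flipped_answer="trophy",
)

print(schema.variant())
# The trophy would not fit in the brown suitcase because it was so big.
```

A machine that merely learned surface statistics of the original sentence would be unable to track the flipped answer in the variant, which is what makes the schema hard to game.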

The lure of statistics

Can we engineer a system to produce a desired behaviour with no more errors than people would produce (with confidence level z)? Looking at behaviour this way allows some of the more challenging examples that arise to be ignored when they are not statistically significant. Unfortunately, this can lead us to systems with very impressive performance that are nonetheless idiot savants. We might produce prodigies at chess, face recognition, Jeopardy, and so on, that are completely hopeless outside their area of expertise.

The lesson

The ultimate question of AI cannot be answered solely by expert systems trained on tons of examples, nor by computationally expensive symbolic systems alone; the answer more likely lies in something that integrates both. We need the first to build a knowledge base and the second to reason over it.

Computational Linguistics and Deep Learning

Paper link: Computational Linguistics and Deep Learning
Citation: Manning, Christopher D. "Computational linguistics and deep learning." Computational Linguistics 41.4 (2015): 701-707.

The author criticizes the thoughtless use of machine learning methods without the support of linguistic knowledge, which he refers to as the Kaggle game. The paper is from 2015, and the author does not seem very enthusiastic about deep learning methods applied to natural language processing. Nonetheless, a recent visit to his profile shows him as a leader in deep learning techniques in the area.

Recently, there have been many, many papers showing how systems can be improved by using distributed word representations from “deep learning” approaches, such as word2vec (Mikolov et al. 2013) or GloVe (Pennington, Socher, and Manning 2014). However, this is not actually building Deep Learning models, and I hope in the future that more people focus on the strongly linguistic question of whether we can build meaning composition functions in Deep Learning systems.

He is one of the authors of GloVe.

Initiatives of the author

  • Universal dependencies
  • Abstract Meaning Representation

Ambiguity

[ That kind [of knife]] isn't used much
We are [kind of] hungry
[A [kind [of dense rock]]]
[A [[kind of] dense] rock]

Notice how the sequence kind of takes on different meanings in these phrasings. It can refer specifically to the kind, type, or category of something; but it can also mean sort of.

Linguistic variation and change

Distributional word representations can illustrate linguistic change, as with dog and hound. These words have swapped meanings: earlier, hound was used to identify any kind of canine, while dog was used to identify a particular one.

Other examples are the word gay and the word cell, which nowadays is more associated with the meanings of phone and cordless.

Overall, I think we should feel excited and glad to live in a time when Natural Language Processing is seen as so central to both the further development of machine learning and industry application problems. The future is bright. However, I would encourage everyone to think about problems, architectures, cognitive science, and the details of human language, how it is learned, processed, and how it changes, rather than just chasing state-of-the-art numbers on a benchmark task.

Contextual Word Representations: A Contextual Introduction

Paper link: Contextual Word Representations: A Contextual Introduction
Citation: Smith, Noah A. "Contextual word representations: A contextual introduction." arXiv preprint arXiv:1902.06006 (2019).

The article is meant to introduce NLP to a broad audience. There are not many technical details, and a list of recent advancements and open problems is given. It also serves as a collection of seminal papers in the area.

Discrete words

Not long ago, it was common practice to represent words by integers. Instead of comparing two sequences of characters, one compares two integers, which is fast. Integer-based representations of word types are referred to as discrete representations.
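A toy illustration (not from the paper) of such a discrete representation: each word type gets an arbitrary integer ID, and repeated occurrences of the same type share the same ID.

```python
# Discrete, integer-based vocabulary: equality checks on words become
# fast integer comparisons.
vocab = {}

def word_id(word: str) -> int:
    """Assign the next free integer to unseen words."""
    if word not in vocab:
        vocab[word] = len(vocab)
    return vocab[word]

tokens = "the dog chased the cat".split()
ids = [word_id(t) for t in tokens]
print(ids)  # [0, 1, 2, 0, 3] -- both occurrences of "the" share ID 0
```

Note that the IDs carry no meaning at all: nothing about ID 1 (dog) says it is more similar to ID 3 (cat) than to ID 0 (the), which is exactly the limitation the vector representations below address.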

Word as vectors

The author discusses how each dimension of a vector could hold a class of information. One dimension could indicate the grammatical class; another, the syntactic function; another could indicate a mundane property such as weight, or a class such as food; and so on.

Those word dimensions could be specified by a human or by an automated system. In the latter case, we are really referring to features.
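A hypothetical hand-designed vector along these lines might look as follows; every dimension name and value here is invented purely for illustration.

```python
# Hand-specified word vectors where each dimension holds one class of
# information, as the text describes. All names/values are made up.
FEATURES = ("is_noun", "can_be_subject", "is_food", "weight_kg")

word_vectors = {
    # apple: a noun, not an agent, edible, light
    "apple": (1.0, 0.0, 1.0, 0.2),
    # elephant: a noun, can act as a subject, not food, heavy
    "elephant": (1.0, 1.0, 0.0, 4000.0),
}

def feature(word: str, name: str) -> float:
    """Look up a single named dimension of a word's vector."""
    return word_vectors[word][FEATURES.index(name)]

print(feature("apple", "is_food"))       # 1.0
print(feature("elephant", "weight_kg"))  # 4000.0
```

The appeal of automating this is clear: hand-specifying thousands of dimensions for a full vocabulary does not scale, which motivates the learned features discussed next.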

WordNet

It is an expert-crafted data structure.
Synset: an unordered set whose elements are word representations. Word representations with multiple meanings have one element for each meaning.
WordNet has about 117K synsets. They are linked to each other via a wealth of relations. In the examples below, the relations are encoded as a R b for synset elements a and b:
    • Hyponym: phytoplankton R plant
    • Hypernym: plant R organism
    • Part Meronym: shelf R bookcase
    • Domain Category: plant R botany
Member holonym: plant R plant kingdom
The goal of WordNet is to explicitly state the context of every word in the vocabulary by establishing connections among its elements.
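The a R b notation above can be made concrete with a toy graph (this is not the real WordNet, just the five example relations encoded as triples):

```python
# A toy relation graph encoding the "a R b" examples above as
# (a, relation, b) triples. Not the real WordNet data.
triples = {
    ("phytoplankton", "hyponym", "plant"),
    ("plant", "hypernym", "organism"),
    ("shelf", "part_meronym", "bookcase"),
    ("plant", "domain_category", "botany"),
    ("plant", "member_holonym", "plant kingdom"),
}

def related(a: str, relation: str) -> set:
    """All elements b such that a R b holds for the given relation."""
    return {b for (x, r, b) in triples if x == a and r == relation}

print(related("plant", "hypernym"))  # {'organism'}
print(related("plant", "domain_category"))  # {'botany'}
```

The real WordNet is queried the same way in spirit, only over ~117K synsets instead of five hand-written triples.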

Word as distributional vectors: context as meaning

The idea is to look at the surroundings of a word, compute some statistics on them, and then infer relations between those statistics, with the assumption that the inferred relations will also reflect some sort of similarity between the two elements in each relation.

The Deerwester et al. 1990 paper is the seminal reference for this approach. It basically constructs a word-by-word count matrix, applies some weighting and normalization schemes, and then executes a dimensionality reduction technique. A disadvantage of dimensionality reduction is that it is not clear how to transform the reduced vector back into the original representation, which usually carries some intuition and structure that is easy to reason about.

It has been noticed that vector spaces can also be useful for identifying word analogies. For example, Turney and Pantel 2010 pointed out that the word analogy "man is to woman as king is to queen" could be encoded as

\[ v(man) - v(woman) = v(king) - v(queen) \]
  • word2vec: Mikolov 2013
  • Use WordNet to adjust a previously trained vector embedding (retrofitting, Faruqui 2015).
  • Use a bilingual dictionary to create a single vector space from two language-specific vector spaces. For example, create a single vector space in which the words cucumber and concombre are close to each other.
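The analogy equation above can be tested by arithmetic on the vectors themselves: solve for the missing word as v(king) - v(man) + v(woman) and find its nearest neighbour. The 3-d vectors below are hand-crafted so the arithmetic works out; real embeddings such as word2vec or GloVe would be learned from text.

```python
import numpy as np

# Toy vectors, hand-crafted for illustration only.
v = {
    "man":      np.array([1.0, 0.0, 1.0]),
    "woman":    np.array([1.0, 1.0, 1.0]),
    "king":     np.array([0.0, 0.0, 5.0]),
    "queen":    np.array([0.0, 1.0, 5.0]),
    "cucumber": np.array([3.0, 0.2, 0.1]),  # unrelated distractor
}

def nearest(target, exclude):
    """Word whose vector has the highest cosine similarity to `target`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in v if w not in exclude),
               key=lambda w: cos(v[w], target))

# "man is to woman as king is to ?"  ->  v(king) - v(man) + v(woman)
answer = nearest(v["king"] - v["man"] + v["woman"],
                 exclude={"king", "man", "woman"})
print(answer)  # queen
```

Excluding the query words themselves is standard practice, since the nearest neighbour of the offset vector is often one of the inputs.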

Contextual Word Vectors

The distributional vector approach takes a corpus, does some counting, and then outputs a single vector for each word. Contextual word vectors instead output multiple vectors for the same word depending on the context. Therefore, the word good will have different vector representations in the two phrasings below:

I got good grades this year.
That is a very good restaurant.

The author calls the distributional vectors word-type vectors and the contextual vectors word-token vectors.

The idea of word-token vectors first appeared in Peters et al. 2018, and it can be summarized as:

  1. Compute the word-type vectors using a corpus;
  2. For each word-token, use the word-type vectors of its context as input to a neural network that outputs a unique word-token vector for that word-type in that specific context.
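The two steps above can be sketched minimally. Instead of the neural network used by ELMo-style models, the "contextualizer" here is just an average of the word-type vectors in the sentence; the 2-d vectors are invented. The point is only that the same word type yields different word-token vectors in different contexts:

```python
import numpy as np

# Step 1 stand-in: toy word-type vectors (would come from a corpus).
word_type = {
    "i": np.array([0.1, 0.0]), "got": np.array([0.2, 0.3]),
    "good": np.array([1.0, 1.0]), "grades": np.array([0.0, 0.9]),
    "this": np.array([0.2, 0.1]), "year": np.array([0.0, 0.5]),
    "that": np.array([0.3, 0.1]), "is": np.array([0.2, 0.2]),
    "a": np.array([0.1, 0.1]), "very": np.array([0.4, 0.0]),
    "restaurant": np.array([0.9, 0.2]),
}

def word_token_vector(word, sentence):
    """Step 2 stand-in: combine the type vector with its sentence context."""
    context = np.mean([word_type[w] for w in sentence], axis=0)
    return np.concatenate([word_type[word], context])

s1 = "i got good grades this year".split()
s2 = "that is a very good restaurant".split()

v1 = word_token_vector("good", s1)
v2 = word_token_vector("good", s2)
print(np.allclose(v1, v2))  # False: same word type, different token vectors
```

A real contextualizer is far more expressive than an average, but the interface is the same: type vectors in, one token vector per occurrence out.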

There is a list of papers that seem nice to read. I should create a Rabbit Hole entry for this paper.

NLP Book references

  • Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, third edition, forthcoming. URL https://web.stanford.edu/~jurafsky/slp3/
  • Jacob Eisenstein. Introduction to Natural Language Processing. MIT Press, 2019.