Keet Malin Sugathadasa

What are Word Embeddings?


We use different mechanisms to represent things in the real world. To represent a scenario, we might use an article or a movie. To represent a flower, we might use an image. To represent love, we might use music. All of these are ways of representing objects we encounter every day. Similarly, a word embedding is a representation of a word. The representation lives in a vector space where each word is mapped to a vector. A single vector is meaningless on its own, but vectors gain meaning when compared with one another within the vector space: similar words have similar vectors, and vector similarities are easy to compute. The mapping is simply from each word to a vector of real numbers. Word embeddings are one of the strongest areas of natural language processing, producing vectors whose relative similarities correlate with semantic similarities.

This blog explains the concept of word embeddings, why we use them and where we might use them, with step-by-step examples so that any beginner can get a clear understanding of the concept.

1) Why Word Embeddings?

As mentioned above, a word embedding is a mapping between words and vectors. To understand why we need word embeddings, let's identify the main reasons to use vectors instead of a word's characters for semantic tasks. The two main reasons are as follows.

1) Judging semantic equivalence from characters alone is impractical; vectors do a much better job.

2) Computers can handle numbers much better than strings.

In a piece of text, a word is the simplest unit we can use to identify the semantics of the text. In some simple tasks, we operate directly on a word and its characters. But the word itself, as a sequence of characters, carries little semantic information. Word pairs like dog/dogs, eat/eaten or type/typed are similar in both meaning and spelling, and such similarities can be captured with the Levenshtein distance. However, this is impractical in most cases when trying to identify the similarity between words in a text (e.g. lawyer and court share no character-level similarity, even though they are semantically related), as the sketch below illustrates.
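As a rough illustration (the implementation below is a generic dynamic-programming version, not from the original post), the Levenshtein distance treats type/typed as close but lawyer/court as distant, even though the latter pair is semantically related:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("type", "typed"))    # 1 -> looks "similar"
print(levenshtein("lawyer", "court"))  # 6 -> looks "unrelated" despite the semantic link
```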

Also, asking a computer to perform operations on strings is very costly and numbers do a better job which makes the computations faster and more efficient.

Words as Vectors

There are many different ways to represent words as vectors.

1) A vector can correspond to the documents in which the word occurs

e.g. the vector for "banana" has one dimension per document in the collection, with a non-zero entry for every document that mentions "banana".

2) A vector can correspond to the neighbouring word context

e.g. in the sentence "Yellow banana grows on trees in Africa", the context of "banana" (position 0) is the surrounding words at offsets -1 (Yellow), +1 (grows), +2 (on), +3 (trees), +4 (in) and +5 (Africa).

3) A vector can correspond to the character trigrams in the word

e.g. "banana" → ban, ana, nan, ana

There are many other ways that words can be represented as vectors. The problem with these methods is that, when comparing two vectors to see how similar they are, the "notion of relatedness" depends on which vector representation was chosen for the words, and they do not accurately capture the semantic similarity of the words. The vector representations discussed so far are also very high dimensional (thousands to millions of dimensions) and sparse, as the small sketch below shows.
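As a small sketch of such sparse context-window vectors (the toy corpus and window size below are made-up assumptions), note that each vector has one dimension per vocabulary word and is mostly zeros:

```python
from collections import Counter, defaultdict

# Toy corpus; a real one would have millions of tokens.
corpus = [
    "yellow banana grows on trees in africa".split(),
    "the lawyer argued the case in court".split(),
]

window = 2  # context = neighbours within +/- 2 positions
vocab = sorted({w for sent in corpus for w in sent})
context_counts = defaultdict(Counter)

for sent in corpus:
    for i, target in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                context_counts[target][sent[j]] += 1

# One sparse count vector per word, with |vocab| dimensions (mostly zeros).
banana_vector = [context_counts["banana"][w] for w in vocab]
print(len(vocab), banana_vector)
```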

This is where word embeddings come into play. Now let's have a look at what word embeddings are.

2) What is a Word Embedding?

Word embedding is the term given to a set of language modelling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers. In simple terms, it is a mapping from a word to a d-dimensional vector space. Conceptually, it involves a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. We say lower because, unlike the representations above, word embeddings learn dense, low-dimensional vectors (typically between 50 and 1000 dimensions), while relying on the same distributional intuition that a word is characterised by the company it keeps.

Instead of counting occurrences of neighbouring words, the vectors are now predicted (learned). The algorithms used here follow these simple steps; a minimal sketch of the resulting loop is shown after the list.

1) Assign a random vector for each word in the vocabulary

2) Now, go over the large text corpus, and at every step, observe a target word and its context (neighbors within a window)

3) The target word's vector and the context words' vectors are then updated to bring them close together in the vector space (and therefore increase their similarity score)

4) Other vectors (all of them, or a sample of them) are updated to become less close to the target word

5) After a significant number of observations from a text corpus, the vectors become meaningful, yielding similar vectors to similar words.
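Here is a minimal sketch of this predict-and-update loop, in the spirit of skip-gram with negative sampling; the corpus, dimensions and learning rate are illustrative assumptions, not a faithful word2vec implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus (repeated so the loop has something to learn from);
# a real corpus would be a large body of text.
corpus = "yellow banana grows on trees in africa".split() * 50
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

dim, window, lr, n_neg = 25, 2, 0.05, 3       # illustrative hyper-parameters
# Step 1: assign a random vector to each word (separate target/context tables).
W_target = rng.normal(scale=0.1, size=(len(vocab), dim))
W_context = rng.normal(scale=0.1, size=(len(vocab), dim))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(5):
    # Step 2: go over the corpus, observing each target word and its context.
    for i, word in enumerate(corpus):
        t = idx[word]
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            c = idx[corpus[j]]
            # Step 3: pull the target and the observed context word together.
            v_t, v_c = W_target[t].copy(), W_context[c].copy()
            g = sigmoid(v_t @ v_c) - 1.0
            W_target[t] -= lr * g * v_c
            W_context[c] -= lr * g * v_t
            # Step 4: push a few randomly sampled ("negative") words away.
            # (A real implementation would avoid sampling the true context word.)
            for n in rng.integers(0, len(vocab), size=n_neg):
                v_t, v_n = W_target[t].copy(), W_context[n].copy()
                g = sigmoid(v_t @ v_n)
                W_target[t] -= lr * g * v_n
                W_context[n] -= lr * g * v_t

# Step 5: after enough updates, the rows of W_target serve as the word embeddings.
print(W_target[idx["banana"]][:5])
```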

Some of the popular approaches used for word embeddings are word2vec (by Google) and GloVe (by Stanford).
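For example, assuming the gensim library (version 4.x argument names) is installed, a word2vec model can be trained in a few lines; the toy sentences and parameters below are only for illustration:

```python
from gensim.models import Word2Vec

# Tiny illustrative corpus; a real one would contain millions of sentences.
sentences = [
    "yellow banana grows on trees in africa".split(),
    "the lawyer argued the case in court".split(),
    "the judge ruled in court".split(),
]

# vector_size is the embedding dimension d; sg=1 selects the skip-gram variant.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["court"][:5])                   # first few dimensions of the learned vector
print(model.wv.similarity("lawyer", "court"))  # cosine similarity of the two vectors
```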

Advantages of Word Embeddings

1) The ability to capture semantic similarities in a very low dimensional vector space. Although this is also possible with the high dimensional representations above, computing similarities between such vectors is very inefficient, whereas with word embeddings it is efficient and fast.

2) The most similar words can be found by finding the most similar vectors, as shown below.
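Similarity between dense vectors is usually measured with cosine similarity. Here is a minimal numpy sketch (the four-dimensional vectors below are made up purely to show the computation):

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for similar directions, near 0 for unrelated ones."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up 4-dimensional "embeddings", just to show the computation.
lawyer = np.array([0.9, 0.1, 0.3, 0.0])
court  = np.array([0.8, 0.2, 0.4, 0.1])
banana = np.array([0.0, 0.9, 0.0, 0.8])

print(cosine(lawyer, court))   # high -> semantically related
print(cosine(lawyer, banana))  # low  -> unrelated
```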

The following image, taken from this paper, shows a two-dimensional projection of word2vec vectors for countries and their capitals. The lines between each country and its capital are roughly parallel to one another, showing that the country-capital relationship corresponds to a consistent direction in the vector space.

3) Learning Dense Embeddings

There are two methods that can be used to learn word embeddings as vectors: matrix factorization and neural networks. Let's briefly look at both approaches.

Matrix Factorization

Factorize a word-context matrix, with one row per word and one column per context (e.g. documents, or neighbouring words within a window); the low-rank factors give the embeddings. A rough sketch follows the examples below.

Examples:

LDA (Word to Document mapping)

GloVe (Word to Neighbouring word mapping)
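As an illustration of the matrix-factorization route (a generic truncated-SVD sketch, not the actual GloVe or LDA algorithm; the count matrix is made up):

```python
import numpy as np

# Made-up word-context count matrix: rows = words, columns = context words.
words = ["lawyer", "court", "judge", "banana", "mango"]
counts = np.array([
    [0, 8, 5, 0, 0],
    [8, 0, 6, 0, 0],
    [5, 6, 0, 0, 0],
    [0, 0, 0, 0, 7],
    [0, 0, 0, 7, 0],
], dtype=float)

# Truncated SVD: keep only the top-d singular vectors as dense embeddings.
d = 2
U, S, Vt = np.linalg.svd(counts)
embeddings = U[:, :d] * S[:d]          # one d-dimensional row per word

for w, vec in zip(words, embeddings):
    print(w, np.round(vec, 2))
```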

Neural Networks

A neural network with a bottleneck (hidden) layer, where the input is a word and the output is a prediction of its context (or the other way around); the weights of the bottleneck layer become the word embeddings. A rough sketch of this architecture follows the example below.

Example:

word2vec (Word to Neighbouring word mapping)
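Here is a rough numpy sketch of the bottleneck idea (the vocabulary size and embedding dimension are illustrative assumptions, and the actual training step is omitted):

```python
import numpy as np

vocab_size, dim = 10_000, 100                 # illustrative sizes
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))    # input -> bottleneck weights
W_out = rng.normal(scale=0.1, size=(dim, vocab_size))   # bottleneck -> output weights

word_id = 42                                  # one-hot input, represented by its index
hidden = W_in[word_id]                        # bottleneck layer: the word's embedding
scores = hidden @ W_out                       # one score per possible context word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                          # softmax over the vocabulary

# After training, the rows of W_in are the learned word embeddings.
print(hidden.shape, probs.shape)              # (100,) (10000,)
```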

Usefulness of Word Embeddings

The results produced by word embeddings are very impressive. Looking at a few analogies, we can see the kinds of similarities and relationships they capture.

Example 1:

If we ask the question,

If MAN is to WOMAN, then KING is to?

The answer is QUEEN.

This approach arrives at the answer by selecting a vector which is similar to WOMAN and KING, but not to MAN. (We want a vector related to both woman and king, but not related to man.)

KING + WOMAN - MAN = QUEEN
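With a trained model (for example the gensim model from the earlier sketch, assuming it was trained on a sufficiently large corpus that contains these words), this analogy query is a one-liner using gensim's most_similar:

```python
# Assumes `model` is a trained gensim Word2Vec model covering these words.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)   # on a large corpus, expected to be something like [("queen", 0.7...)]
```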

 

There are a number of applications that benefit from word embeddings. When it comes to natural language processing and semantics, many systems have been re-implemented to use word embeddings, due to the accuracy and performance boost they provide. In future blog posts, let's have a look at how the most popular word embedding technique works: word2vec by Google.
