Keet Malin Sugathadasa

An Introduction on Word2vec


There are many ways to represent the words in a given piece of text so that they can be used for various applications. As discussed in this blog, words can be represented by a sequence of characters, by real numbers, by images and in many other ways. Word2vec is a tool that efficiently analyzes a given word and represents it as a vector in a selected vector space. These representations can then be used for many applications in Natural Language Processing, semantics and other research areas. Word2vec was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has since been further optimized by other researchers, giving it a competitive advantage over other word embedding techniques available in the industry today.

Introduction

Word2vec is a group of related models used to produce word embeddings. It is a two-layer neural network that processes text and is trained to reconstruct the linguistic contexts of words. The word2vec tool takes a text corpus as input and produces a vector space, typically of 100 - 1000 dimensions, with each unique word in the corpus being added to a vocabulary and assigned a vector in the space. Given below is a very high-level view of the word2vec process.

Word2vec is not a deep neural network, but its numerical output can be fed into deep neural networks for further processing. The word vectors (word embeddings) are positioned in the vector space such that words sharing a common context are located in close proximity to one another. The main advantage here is that this requires no human intervention at all.

Word2vec is not just a text parsing system. Its applications and extensions are widely used in many fields such as medicine, law, music and movies. If we look deeper into its applications, even though word2vec is literally about converting words to vectors, the same process can be applied to other data types such as genes, likes, code, playlists or symbols. Words are simply discrete states, just like the data types mentioned above, and what the model looks for is the transitional probabilities between those states (the likelihood that they co-occur). Therefore, similar to word2vec, gene2vec, like2vec and so on are also possible.

How Does it Work?

As mentioned in the introduction, word2vec takes a text corpus as input and produces word vectors (word embeddings) as output. It first constructs a vocabulary of unique words from the text corpus used as the training data set, and then learns vector representations of those words. The resulting word vector files can be used in many Natural Language Processing and Machine Learning applications.
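
As a concrete illustration, here is a minimal sketch of this corpus-in, vectors-out workflow. It assumes the open-source gensim library (version 4.x) and a toy corpus; the original word2vec tool is a C program, so treat this purely as an illustrative example.

# Minimal sketch of the word2vec workflow, assuming gensim 4.x.
from gensim.models import Word2Vec

# Toy "corpus": a list of tokenized sentences (real corpora are far larger).
corpus = [
    ["king", "rules", "the", "kingdom"],
    ["queen", "rules", "the", "kingdom"],
    ["paris", "is", "the", "capital", "of", "france"],
    ["madrid", "is", "the", "capital", "of", "spain"],
    ["rome", "is", "the", "capital", "of", "italy"],
]

# Training builds the vocabulary of unique words, then learns one vector per word.
model = Word2Vec(corpus, vector_size=100, window=5, min_count=1)

# Every word in the vocabulary now has a 100-dimensional vector.
print(model.wv["king"].shape)   # (100,)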

Given enough data and context, it produces an accurate picture of the semantics of each word, based on the training data set (text corpus). Those semantics can be used to establish associations between words, or to cluster documents and classify them by topic. Such clusters can form the basis of search, sentiment analysis and recommendations in fields as diverse as scientific research, legal discovery, e-commerce and customer relationship management.

Word2vec expresses the similarity between words using the cosine of the angle between their vectors (often loosely called the cosine distance). Now let's have a look at how cosine similarity works.

Cosine Similarity

The differences between word vectors can be pictured as the arms of a clock swinging around the origin, with the differences measured in degrees. Similar to ancient navigators gauging the stars with a sextant, we measure the angular distance between words using something called cosine similarity.

In simple terms:

  • 0 degree angle difference = total similarity

  • 90 degree angle difference = no similarity

Have a look at the figure given below. Points with very small angle differences are clustered together in constellations of meaning.

The cosine of the angle between two vectors can be easily calculated using the dot product of the vectors.

If a and b are two vectors with an angle of theta between them, their dot product is:

a · b = ||a|| ||b|| cos(theta)

In other words, the dot product of two vectors a and b is equal to the product of the vectors' norms (their respective lengths) multiplied by the cosine of the angle theta that separates them.

Rearranging, the cosine of the angle can be written as:

cos(theta) = (a · b) / (||a|| ||b||)

So word similarity in terms of cosine similarity can be simply expressed as follows.

  • If cosine similarity = 1 --> angle = 0 degrees --> the two vectors point in the same direction (maximum similarity)

  • If cosine similarity = 0 --> angle = 90 degrees --> the two vectors are perpendicular (no relation to each other)
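
To make the formula above concrete, here is a small sketch in plain NumPy (an assumption of this example, not something the original post uses) that computes the cosine similarity of two vectors exactly as defined above.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])   # points along the x-axis
b = np.array([0.0, 1.0])   # points along the y-axis
c = np.array([2.0, 0.0])   # same direction as a, different length

print(cosine_similarity(a, c))  # 1.0 -> 0 degree angle, maximum similarity
print(cosine_similarity(a, b))  # 0.0 -> 90 degree angle, no similarity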

Example:

The countries France and Spain are very close to each other.

If we enter the word France, the cosine similarities to other words are as given in this table.

As you can see, Spain has the highest cosine similarity, meaning the smallest angle between the vectors for France and Spain, so the two words are very similar. This is also depicted in the map given above. The more we train, the more accurate the vectors get.
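
With a trained model (for example the gensim sketch shown earlier), the same kind of nearest-word lookup can be reproduced; most_similar ranks the vocabulary by cosine similarity to the query word. Meaningful rankings of course require a model trained on a much larger corpus than the toy example, and both query words must be in the vocabulary.

# Assuming `model` is the Word2Vec model trained earlier and both
# "france" and "spain" appear in its vocabulary.
for word, similarity in model.wv.most_similar("france", topn=5):
    print(word, similarity)

# Cosine similarity between two specific words:
print(model.wv.similarity("france", "spain"))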

Algorithms Being Used

A word represented as a vector of numbers is known as a neural word embedding, and this is the basic transformation that word2vec performs. It vectorizes words as numbers so that computers can process natural language with ease.

Word2vec is similar in spirit to an autoencoder. An autoencoder is an artificial neural network used for unsupervised learning of efficient codings; its aim is to learn a representation (encoding) for a set of data, typically for the purpose of dimensionality reduction. But rather than training against the input words through reconstruction, as an autoencoder does, word2vec trains words against the other words that neighbour them in the input corpus.

There are two main learning algorithms in word2vec: Continuous Bag-Of-Words (CBOW) and continuous skip-gram.

In the Continuous Bag-Of-Words (CBOW) architecture, the model predicts the current word from a window of surrounding context words (neighbouring words). Following the bag-of-words assumption, the order of the context words does not influence the prediction.

In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. Nearby context words are weighted more heavily than distant context words.

According to the authors, CBOW is faster while skip-gram is slower, but skip-gram does a better job for infrequent words. For models trained on large corpora with a high number of dimensions, the skip-gram model yields the highest overall accuracy and consistently produces the highest accuracy on semantic relationships, as well as yielding the highest syntactic accuracy in most cases. However, CBOW is less computationally expensive and yields similar accuracy results.
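
The difference between the two architectures is easiest to see in the training pairs they generate. The sketch below is illustrative only (it is not the actual word2vec implementation): it builds (context, target) pairs for CBOW and (target, context) pairs for skip-gram from one sentence with a window size of 2.

def training_pairs(tokens, window=2):
    # Illustrative pair generation for CBOW and skip-gram.
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        # CBOW: the whole context window predicts the current word.
        cbow.append((context, target))
        # Skip-gram: the current word predicts each context word separately.
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram

cbow, skipgram = training_pairs(["the", "cat", "sat", "on", "the", "mat"])
print(cbow[2])       # (['the', 'cat', 'on', 'the'], 'sat')
print(skipgram[:3])  # [('the', 'cat'), ('the', 'sat'), ('cat', 'the')]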

Vector Space of Word Vectors

The vector spaces shown in the examples above are mostly 2- or 3-dimensional. Most word2vec outputs live in d-dimensional vector spaces where d is around 100 to 1000. A well-trained set of word vectors will place similar words close to each other in that space.

The vector operations given below produce accurate results if the model has been trained with sufficient data. Word2vec makes it possible to train models on huge data sets (up to hundreds of billions of words).

Eg:

vector('Paris') - vector('France') + vector('Italy') = vector('Rome')

vector('king') - vector('man') + vector('woman') = vector('queen')
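
In the gensim sketch used earlier (an assumption of these examples, not something the original tool requires), the same analogy arithmetic is exposed through most_similar with positive and negative word lists:

# vector('king') - vector('man') + vector('woman') ~ vector('queen'),
# assuming `model` was trained on a corpus large enough to contain all four words.
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expected to rank "queen" (or a close synonym) first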

Parameters in word2vec Training

The results of word2vec training can be fine-tuned or customized using the following parameters.

1) Training Algorithm

  • Hierarchical softmax: works better for infrequent words

  • Negative sampling: works better for frequent words and with low-dimensional vectors

2) Sub-sampling

High-frequency words often provide little information. Words with a frequency above a certain threshold may be sub-sampled to increase training speed.

3) Dimensionality

Higher dimensionality increases the quality of the word embeddings, but beyond a certain point the marginal gain diminishes. The dimensionality of word2vec vectors is typically set between 100 and 1000.

4) Context Window

The size of the context window determines how many words before and after a given word are included as context words of that word.

  • CBOW --> Recommended window size = 5

  • Skip-gram --> Recommended window size = 10
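
As a rough guide, these four knobs map directly onto constructor arguments of the gensim implementation used in the sketches above (parameter names assume gensim 4.x; the original C tool exposes equivalent command-line flags such as -hs, -negative, -sample, -size and -window).

from gensim.models import Word2Vec

model = Word2Vec(
    corpus,             # the tokenized training corpus from the earlier sketch
    sg=1,               # architecture: 1 = skip-gram, 0 = CBOW
    hs=0, negative=10,  # training algorithm: hierarchical softmax (hs=1) or negative sampling
    sample=1e-5,        # sub-sampling threshold for very frequent words
    vector_size=300,    # dimensionality of the word vectors
    window=10,          # context window size (10 is the suggested skip-gram window)
)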

 
