Word Embedding

This article explains the word embedding functionality of the IndicLP library.

Getting Started

Word embedding has become one of the most crucial aspects of NLP tasks, as it represents the meaning of words in a way computers can understand. Let us first consider the definition of word embedding provided by Stanford's RCpedia:

Word Embeddings are a method to translate a string of text into an N-dimensional vector of real numbers. Many computational methods are not capable of accepting text as input. This is one method of transforming text into a number space that can be used in various computational methods.

In this article we'll be trying to achieve exactly that: embedding words as n-dimensional vectors, using the IndicLP library.

Using Word Embedding

To get started, we first need to import the Embedding class from IndicLP's preprocessing submodule, as shown below.

from indicLP.preprocessing import Embedding

Once we have imported the class, we must create an instance of it with the required language. IndicLP's embedder is built on the Gensim library's Word2Vec model, a standard model for NLP tasks.

embedder = Embedding(lang="ta")
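Conceptually, an embedder of this kind maps each vocabulary word to a fixed-length vector. As a rough illustration only (this is not IndicLP's actual implementation, and ToyEmbedding is a hypothetical name), a minimal wrapper over a word-to-vector dictionary might look like this:

```python
import numpy as np

class ToyEmbedding:
    """Illustrative stand-in for an embedding wrapper (not IndicLP's code)."""

    def __init__(self, vectors):
        # vectors: dict mapping word -> 1-D numpy array
        self.vectors = vectors

    def get_vector(self, word):
        # Raise a clear error for out-of-vocabulary words
        if word not in self.vectors:
            raise KeyError(f"'{word}' not in vocabulary")
        return self.vectors[word]

# Toy 3-dimensional vectors; real models typically use 100-300 dimensions
toy = ToyEmbedding({"cat": np.array([1.0, 0.0, 0.0]),
                    "dog": np.array([0.9, 0.1, 0.0])})
print(toy.get_vector("cat").shape)  # (3,)
```

The real embedder works the same way from the caller's perspective: you hand it a word and receive a NumPy vector of fixed dimensionality.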

Now we can use this embedder object to get the associated NumPy vector for any word, as required by the task. Here we use the get_vector method on a hardcoded string to showcase its use.

word = "உயிர்"

print(embedder.get_vector(word).shape)

The above code gives the output (300,), which is the shape of the NumPy array returned when using the built-in word embedding models.
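Once you have such vectors, a common next step is to compare them with cosine similarity, which scores two vectors by the angle between them. Here is a minimal self-contained sketch with NumPy, using random 300-dimensional toy vectors in place of real embedder output:

```python
import numpy as np

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors; 1.0 means identical direction
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 300-dimensional vectors standing in for get_vector(...) output
rng = np.random.default_rng(0)
u = rng.standard_normal(300)
v = u + 0.1 * rng.standard_normal(300)  # a slightly perturbed copy of u

print(round(cosine_similarity(u, u), 3))  # 1.0
print(cosine_similarity(u, v) > 0.9)      # True: near-duplicates score high
```

With real word vectors, the same function lets you quantify how semantically close two words are.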

Getting the top 10 closest words

Besides providing the vector for processing, IndicLP also provides a function to retrieve the top n words (stems, to be more accurate) most closely associated with the word provided. The get_closest function of the Embedding class achieves this. To use this function you need to provide two input arguments: the word and the number of words needed (default value is 10), as shown below.

word = "உயிர்"
print(embedder.get_closest(word))

This will give us a list of tuples, each containing the stem of a found word and the similarity score associated with it.

[('உடலை', 0.8038842082023621), ('மனிதர்', 0.7657704949378967), ('வாழ', 0.7598057985305786), ('மனம்', 0.7563427686691284), ('துன்பம்', 0.7538397312164307), ('தீய', 0.7489129900932312), ('உள்ளம்', 0.7443488240242004), ('தாம்', 0.7373127937316895), ('மனிதன்', 0.7366601824760437), ('நம்', 0.7319730520248413)]
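The scores above look like cosine similarities, which is how Word2Vec-style models typically rank neighbours. As a sketch of the idea only (IndicLP's get_closest is backed by Gensim, and get_closest_sketch below is a hypothetical name), a top-n lookup over a tiny toy vocabulary could be written like this:

```python
import numpy as np

def get_closest_sketch(word_vec, vocab, n=10):
    """Rank vocabulary words by cosine similarity to word_vec.
    Illustrative only -- not IndicLP's actual implementation."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = [(w, cos(word_vec, v)) for w, v in vocab.items()]
    # Highest similarity first, keep the top n
    return sorted(scores, key=lambda t: t[1], reverse=True)[:n]

# Toy 2-D vocabulary for illustration
vocab = {"a": np.array([1.0, 0.0]),
         "b": np.array([0.8, 0.6]),
         "c": np.array([0.0, 1.0])}
query = np.array([1.0, 0.1])
print(get_closest_sketch(query, vocab, n=2))  # "a" ranks first, then "b"
```

Real models avoid this brute-force scan over the vocabulary by normalising vectors once and using a single matrix multiplication, but the ranking principle is the same.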