Embedding

The Embedding class provides the functions needed to perform word embedding, i.e. to map words to real-valued vectors.

Getting Started

Embedding is an essential step in modern NLP, as most models and algorithms require real-valued numerical data rather than strings such as words and sentences. The Embedding class is built on top of the Gensim library.
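
Because Embedding wraps Gensim, the same queries can be expressed directly against a Gensim KeyedVectors model. The sketch below is illustrative only: the vector file path is a placeholder, not a file shipped with indicLP, and the mapping to Gensim calls is an assumption about the internals.

# coding:utf-8
from gensim.models import KeyedVectors

# Load pretrained word vectors. The path below is a placeholder
# for whatever pretrained model you have on disk.
kv = KeyedVectors.load("ta_word_vectors.kv")

word = "புத்தகம்"

# Rough equivalent of Embedding.get_vector: look up the word's vector.
print(kv.get_vector(word).shape)

# Rough equivalent of Embedding.get_closest: most_similar ranks
# neighbours by cosine similarity; topn controls how many are returned.
print(kv.most_similar(word, topn=10))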

Example
# coding:utf-8
from indicLP.preprocessing import Embedding

language = "ta"  # ISO 639-1 code for Tamil

word = "புத்தகம்"  # "book"
embedder = Embedding(lang=language)

# Each word is mapped to a 300-dimensional vector.
print(embedder.get_vector(word).shape)
# Output is (300,)

# The ten nearest neighbours of the word, with similarity scores.
print(embedder.get_closest(word))
# Output is [('நாவல்', 0.8771955370903015), ('கட்டுர்', 0.8479164242744446), ('கட்டுரை', 0.8334720134735107), ('பதிப்பி', 0.8099671006202698), ('பத்திரிக்கை', 0.793248176574707), ('தலைப்பு', 0.7890440821647644), ('எழுதி', 0.7842222452163696), ('கவிதை', 0.782835841178894), ('நூலை', 0.780312716960907), ('தியாயம்', 0.7802311778068542)]
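
The scores returned by get_closest are cosine similarities between word vectors (the convention of Gensim's most_similar). A score can be recomputed as a sanity check from two get_vector calls; the sketch below uses only the calls shown above plus NumPy.

# coding:utf-8
import numpy as np
from indicLP.preprocessing import Embedding

embedder = Embedding(lang="ta")

# Cosine similarity: dot product of the two L2-normalised vectors.
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

v1 = embedder.get_vector("புத்தகம்")  # "book"
v2 = embedder.get_vector("நாவல்")     # "novel"

# Should be close to the 0.8771... score reported above.
print(cosine_similarity(v1, v2))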

Methods Involved

Following are the methods contained in the Embedding class:

  • Get Vector: A method that returns the embedding vector associated with a given word (see the sketch after this list).
  • Get Closest: A method that returns the n closest words to a given word, along with their similarity scores.
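
Because get_vector returns a NumPy array (as the .shape call above shows), its output can feed directly into downstream models. A common pattern is to average word vectors into a single sentence vector; the sketch below is illustrative and not part of the library's API.

# coding:utf-8
import numpy as np
from indicLP.preprocessing import Embedding

embedder = Embedding(lang="ta")

# Average the word vectors of a tokenised sentence into one
# fixed-size vector that downstream models can consume.
def sentence_vector(tokens):
    vectors = [embedder.get_vector(token) for token in tokens]
    return np.mean(vectors, axis=0)

print(sentence_vector(["புத்தகம்", "நாவல்"]).shape)
# Output is (300,)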

Reference Materials

Following are some reference materials for the Embedding module:

  • Word Embedding: A reference blog post discussing the usage of the Embedding class on hardcoded strings.
  • Stanford RCpedia: NLP slides on word embeddings from Stanford's RCpedia.