NLP: Indic Language Processing
In last week's post, we discussed the need for NLP technologies that account for languages besides English. Continuing with that theme, today we will look at existing toolkits and libraries that let us process Indic languages such as Hindi, Tamil and Telugu.
Several initiatives in this domain are developing corpora, datasets and toolkits to make Indic NLP tasks easier for developers and researchers alike. AI4Bharat is one such open-source community on GitHub that has made great strides in this area.
Indic NLP Library
Indic NLP Library is one such initiative that helps AI engineers develop models and applications for Indic languages. The library provides functionality such as text normalization, transliteration and translation, which form a crucial part of the NLP development cycle. Let's now go through some examples to understand how it is used.
First, we must make sure that our Python environment has the library installed. If it does not, we can install it through pip as shown below.
pip install indic-nlp-library
Let us start with an example of transliteration from Hindi to Tamil. In the code below, we import the transliterator from the Indic NLP Library. We then provide an input text in Hindi and use the transliterator to convert it to Tamil script. The output of the code below is நமஸ்தே.
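from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator
# Hindi input text ("namaste")
hin_text = "नमस्ते"
# Convert from Hindi ("hi") to Tamil ("ta") script
print(UnicodeIndicTransliterator.transliterate(hin_text, "hi", "ta"))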
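To get a feel for why script-to-script transliteration between Indic languages can be done with simple rules, it helps to know that Unicode lays out the major Indic scripts in parallel 128-code-point blocks, so a given consonant or vowel sign sits at roughly the same position in each block. The snippet below is a minimal sketch of that idea using only the Python standard library. The function name `shift_script` and the small language table are made up for this illustration, and this is not the Indic NLP Library's implementation.

```python
# Simplified illustration only; NOT the Indic NLP Library's implementation.
# Start of each script's Unicode block (each block is 128 code points wide).
SCRIPT_BLOCK_START = {
    "hi": 0x0900,  # Devanagari
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
}
BLOCK_SIZE = 0x80


def shift_script(text, src_lang, tgt_lang):
    """Move each source-script character to the same position in the target block."""
    src_start = SCRIPT_BLOCK_START[src_lang]
    tgt_start = SCRIPT_BLOCK_START[tgt_lang]
    out = []
    for ch in text:
        offset = ord(ch) - src_start
        if 0 <= offset < BLOCK_SIZE:
            # Character belongs to the source script: shift it into the target block.
            out.append(chr(tgt_start + offset))
        else:
            # Spaces, digits, Latin letters etc. are left untouched.
            out.append(ch)
    return "".join(out)


print(shift_script("नमस्ते", "hi", "ta"))  # prints நமஸ்தே
```

This naive shift breaks down whenever the target script has no character at a given position (Tamil, for instance, lacks many of the consonants that Devanagari has), which is one reason to rely on a dedicated library rather than a hand-rolled mapping.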
Let us now consider a common preprocessing step in the NLP development cycle: tokenization of a sentence. Here, we use the tokenization module from the library to achieve our goal. The function does a fine job on a wide range of languages, and a few of the outputs are listed below.
- 'क्या कर रहे हो?' gives the output ['क्या', 'कर', 'रहे', 'हो', '?']
- 'எனக்கு என் குழந்தைப் பருவம் நினைவிருக்கிறது' gives the output ['எனக்கு', 'என்', 'குழந்தைப்', 'பருவம்', 'நினைவிருக்கிறது']
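from indicnlp.tokenize import indic_tokenize
# Hindi input sentence ("What are you doing?")
text = 'क्या कर रहे हो?'
# Split the sentence into word and punctuation tokens and print the list
print(indic_tokenize.trivial_tokenize(text))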
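For intuition, a tokenizer of this kind can be approximated by separating punctuation from the neighbouring words and then splitting on whitespace. The snippet below is a rough sketch of that approach with the standard library only. The helper `simple_tokenize` is a hypothetical name introduced here, it is not the library's code, and it will not cover every case the library handles.

```python
import re
import string

# Hypothetical helper for illustration; NOT the Indic NLP Library's implementation.
# ASCII punctuation plus the danda (।) and double danda (॥) used in many Indic scripts.
PUNCTUATION = string.punctuation + "\u0964\u0965"
PUNCT_PATTERN = re.compile("([" + re.escape(PUNCTUATION) + "])")


def simple_tokenize(text):
    """Pad each punctuation mark with spaces, then split on whitespace."""
    padded = PUNCT_PATTERN.sub(r" \1 ", text)
    return padded.split()


print(simple_tokenize("क्या कर रहे हो?"))
```

For the Hindi sentence above, this yields the same five tokens as the first example listed earlier: the four words followed by the question mark.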
iNLTK Library
iNLTK is the Indic-language counterpart of the NLTK library, a standard library for text preprocessing. Like NLTK, iNLTK offers a wide range of natural language processing functionality, along with datasets and corpora to train models on.
The iNLTK library supports a wide range of languages, including Hindi and Bengali, and reports evaluation results such as perplexity and classification accuracy on various datasets. It also accounts for code-mixed languages such as Hinglish and Tanglish, giving it an additional edge when dealing with modern data and lingo.