Transliteration in indicLP

This article explains the transliteration functionality of indicLP.

Getting Started

Transliteration rarely comes up in NLP tasks, because most corpora consist of Latin-script text, so a single script can be taken for granted. That is not the case when we deal with Indic languages, where a corpus often contains text in several scripts. This could be English translations of some words, quotes from other languages, and so on.
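To see how mixed a corpus line actually is, the scripts present in it can be counted before any transliteration is attempted. The snippet below is a minimal sketch that uses only the Python standard library and is not part of indicLP; the helper name script_counts is hypothetical, and taking the first word of each Unicode character name as the script is a simplifying assumption.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters and combining marks per script.

    The script is approximated by the first word of the Unicode
    character name, e.g. "DEVANAGARI", "TAMIL" or "LATIN".
    """
    counts = Counter()
    for ch in text:
        # Keep letters and combining marks (vowel signs are marks, not letters).
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


# A line that mixes Devanagari and Latin text.
print(script_counts("आदमी means man"))
```

A line that reports more than one script is a candidate for transliteration into a single target script.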

Transliteration is the process of converting text from one script to another. Unlike translation, it does not change the words and phrases to suit a new language; the pronunciation of the words stays the same and is only written in the new script.

Transliteration in indicLP

indicLP provides transliteration between all supported languages and the Latin script (English). It is built on the indic-transliteration library, and an example is worked through in this article.

First, we need to import the Transliterate class from the transliterate submodule of indicLP.

from indicLP.transliterate import Transliterate

Let us now create an instance of the Transliterate class that converts Hindi (Devanagari script) to Tamil script. Once the instance is defined, we’ll use it to transliterate the word “आदमी” into Tamil script.

hi2ta = Transliterate("hi","ta")
print(hi2ta.convert("आदमी"))

# Output is ஆதமீ

Let us now try to convert the result back to Devanagari script. For this we’ll use the revert method of the Transliterate class. Notice that reverting does not give us the original string, because there are often multiple characters that are phonetically similar to each other, so the reverse mapping is ambiguous.

text = hi2ta.convert("आदमी")
print(hi2ta.revert(text))

# Output is आधमी
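
The loss of information on the way back can be illustrated with a toy character table. The sketch below does not show how indicLP or indic-transliteration work internally; the table DEVA_TO_TAMIL and the helpers to_tamil and revert_candidates are hypothetical names that cover only the letters needed for this one word. It assumes that the Devanagari letters त, थ, द and ध are all written with the single Tamil letter த.

```python
from collections import defaultdict

# Toy forward table (an assumption for illustration, not indicLP's data):
# four Devanagari consonants share one Tamil letter.
DEVA_TO_TAMIL = {
    "आ": "ஆ",
    "त": "த", "थ": "த", "द": "த", "ध": "த",
    "म": "ம",
    "ी": "ீ",
}


def to_tamil(text):
    """Map each Devanagari character to Tamil, leaving unknown characters as-is."""
    return "".join(DEVA_TO_TAMIL.get(ch, ch) for ch in text)


# Invert the table: each Tamil letter keeps every Devanagari letter that maps to it.
TAMIL_TO_DEVA = defaultdict(list)
for deva, tamil in DEVA_TO_TAMIL.items():
    TAMIL_TO_DEVA[tamil].append(deva)


def revert_candidates(text):
    """Return, per Tamil character, the Devanagari letters it could stand for."""
    return [TAMIL_TO_DEVA.get(ch, [ch]) for ch in text]


tamil = to_tamil("आदमी")
print(tamil)
print(revert_candidates(tamil))
```

In this toy table both आदमी and आधमी map to the same Tamil string, so a reverse step has to choose one of several candidates for த and cannot know which Devanagari letter the original word used.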