Tokenizing Sentences

This article explains how to use the indicLP library to perform tokenization.

Getting Started with indicLP

IndicLP (Indic Language Processing) is a library developed to act as a complete toolkit for programmers and researchers working on NLP projects in Indic languages. One of the most necessary functionalities it supports is therefore tokenization. Let us first consider what tokenization is. Stanford NLP defines tokenization as:

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.
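To make the definition concrete, here is a deliberately naive illustration in plain Python (this is not indicLP’s tokenizer, just the idea): chop an English sentence into word tokens while throwing away the punctuation.

import re

# Naive tokenization: keep runs of word characters, discard punctuation.
sentence = "Hello, world! How are you?"
print(re.findall(r"\w+", sentence))  # ['Hello', 'world', 'How', 'are', 'you']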

indicLP provides a SentencePiece tokenizer for all the supported languages (currently Hindi and Tamil). In this post we’ll discuss this functionality and how to use it in your projects.
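A normalizer is created by passing a language code to TextNormalizer. The Tamil code "ta" is used later in this post; the Hindi code "hi" below is our assumption, so check the library’s documentation before relying on it.

from indicLP.preprocessing import TextNormalizer

normalizerTamil = TextNormalizer("ta")  # Tamil, used in this post
normalizerHindi = TextNormalizer("hi")  # Hindi; "hi" is assumed here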

Tokenization using indicLP

Preprocessing of text is a crucial aspect of NLP, as it makes model development easier by focusing on the necessary aspects of the data instead of the unnecessary details. In the indicLP library, this is done using the TextNormalizer class in the preprocessing submodule.

Let us first import the necessary libraries for our program.

from indicLP.preprocessing import TextNormalizer
import re

We’ll be using TextNormalizer to perform the tokenization, while the re module is used to split the text into sentences. Next, let’s get the string we are going to tokenize. In this example we consider a Tamil string containing multiple sentences.

text = "வணக்கம்! எப்படி இருக்கின்றீர்கள்? நன்றாக இருக்கின்றேன். உங்கள் சொந்த ஊர் எது?"
text = re.split('[।;?\.!]',text)
text = list(filter(None, text))
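At this point text should be a list of four sentence strings (note the leading spaces left by splitting after each punctuation mark):

['வணக்கம்', ' எப்படி இருக்கின்றீர்கள்', ' நன்றாக இருக்கின்றேன்', ' உங்கள் சொந்த ஊர் எது']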

Now that we have our list of sentences, let’s tokenize each one into a list of tokens. For this we first define an instance of the TextNormalizer class with the language set to Tamil and then use its tokenizer method. We enable stemming here, since the stopword removal in the next section assumes stemmed tokens; the unstemmed output is also shown below to make the effect of stemming easier to see.

normalizerTamil = TextNormalizer("ta")  # "ta" is the language code for Tamil

out = normalizerTamil.tokenizer(text, stem = True)
print(out)

One can also use the tokenizer without stemming by passing stem = False, in which case the raw tokens are returned; when stemming is enabled, the words are stemmed based on the language set. The Snowball stemmer has been used to achieve this. Both outputs are shown below for comparison.

Stemmed: [['▁வணக்கம்'], ['▁எ', '▁இர்'], ['▁நல்', '▁இர்'], ['▁உங்', '▁சொ', '▁ஊர்', '▁எது']]

Not Stemmed: [['▁வணக்கம்'], ['▁எப்படி', '▁இருக்கின்றீர்கள்'], ['▁நன்றாக', '▁இருக்கின்றேன்'], ['▁உங்கள்', '▁சொந்த', '▁ஊர்', '▁எது']]
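If you are curious what the Snowball stemmer does on its own, outside indicLP, the sketch below uses the standalone snowballstemmer package (pip install snowballstemmer), which ships a Tamil stemmer; whether indicLP wraps exactly this package internally is our assumption based on the statement above.

import snowballstemmer

# Stem two of the inflected verb forms from the example sentences.
stemmer = snowballstemmer.stemmer("tamil")
print(stemmer.stemWords(["இருக்கின்றீர்கள்", "இருக்கின்றேன்"]))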

Removing Stopwords

Once we have tokenized the sentences and stemmed the words, we should remove the words that do not contribute to the meaning of the text. These words are called stopwords; examples in English are “a”, “the”, and “is”. indicLP provides the method remove_stop_words to remove the stopwords from a given list of tokens.

Continuing with the tokenized sentences, we will now remove the stopwords present in them. Note that the method assumes stemming has already been done, since in Indic languages words often appear in multiple inflected forms.

clean_out = []
for i in out:  # out holds the stemmed tokens from the previous section
    clean_out.append(normalizerTamil.remove_stop_words(i))
print(clean_out)

This removes the stopwords from each sentence, giving us only the meaningful data required for processing.

Original: [['▁வணக்கம்'], ['▁எ', '▁இர்'], ['▁நல்', '▁இர்'], ['▁உங்', '▁சொ', '▁ஊர்', '▁எது']]
Without Stopwords: [['▁வணக்கம்'], [], ['▁நல்'], ['▁உங்', '▁சொ', '▁ஊர்', '▁எது']]
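Putting the whole pipeline together, a small helper like the one below can be handy. This is only a sketch assembled from the calls used above; the helper name clean_tokens and the sentence-splitting pattern are our own choices, not part of indicLP.

from indicLP.preprocessing import TextNormalizer
import re

def clean_tokens(raw_text, lang="ta"):
    # Split into sentences, tokenize with stemming, then drop stopwords.
    normalizer = TextNormalizer(lang)
    sentences = list(filter(None, re.split(r'[।;?.!]', raw_text)))
    tokenized = normalizer.tokenizer(sentences, stem=True)
    return [normalizer.remove_stop_words(sent) for sent in tokenized]

print(clean_tokens("வணக்கம்! எப்படி இருக்கின்றீர்கள்?"))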