Getting Started
The TextNormalizer class provides functions such as a tokenizer, stop-word removal, and a stemmer to normalize corpora before training machine learning models.
Example
```python
# coding:utf-8
from indicLP.preprocessing import TextNormalizer, Embedding
import re

language = "hi"
normalizerHindi = TextNormalizer(language)

# Split the input into sentences on Indic and Western punctuation.
text = " नमस्ते! आप कैसे हैं? मुझे आपकी बहुत याद आयी।"
text = re.split('[।;?\.!]', text)
text = list(filter(None, text))
print(text)
# Output is [' नमस्ते', ' आप कैसे हैं', ' मुझे आपकी बहुत याद आयी']

# Tokenize each sentence; stem=True also stems each token.
out = normalizerHindi.tokenizer(text, stem=True)
print(out)
# Output is [['▁नमस्त'], ['▁आप', '▁कैस', '▁हैं'], ['▁मुझ', '▁आपक', '▁बहुत', '▁याद', '▁आय']]

# Remove stop words from each tokenized sentence.
clean_out = []
for i in out:
    clean_out.append(normalizerHindi.remove_stop_words(i))
print(clean_out)
# Output is [['▁नमस्त'], [], ['▁मुझ', '▁याद', '▁आय']]
```
Methods Involved
Following are the methods contained in the TextNormalizer class:
- Tokenizer: a method that tokenizes a given list of sentences.
- Stem: a method that stems a given string.
- Remove Stop Words: a method that removes stop words from a given list of words.
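To illustrate what stop-word removal does conceptually, here is a minimal, self-contained sketch that does not depend on indicLP. The stop-word set and function below are hypothetical examples for illustration, not indicLP's actual word list or API:

```python
# Illustrative sketch of stop-word removal; the stop-word set below is a
# hypothetical example, not the list indicLP actually uses.
HINDI_STOP_WORDS = {"आप", "कैसे", "हैं", "आपकी", "बहुत"}

def remove_stop_words(tokens):
    """Return the tokens with stop words filtered out, preserving order."""
    return [t for t in tokens if t not in HINDI_STOP_WORDS]

print(remove_stop_words(["मुझे", "आपकी", "बहुत", "याद", "आयी"]))
# → ['मुझे', 'याद', 'आयी']
```

In indicLP itself, as the example above shows, `remove_stop_words` is applied per sentence to the token lists produced by the tokenizer.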
Reference Materials
Following are some reference materials for the Preprocessing module:
- Tokenizing Sentences: a reference blog post demonstrating the use of the TextNormalizer class on a hardcoded string.
- Text Normalization: Jurafsky's Stanford NLP slides on text processing.