Getting Started
The TextNormalizer class provides functions such as a tokenizer, stop-word removal, and a stemmer to normalize corpora before training machine learning models.
Example
```python
# coding:utf-8
from indicLP.preprocessing import TextNormalizer, Embedding
import re

language = "hi"
normalizerHindi = TextNormalizer(language)

# Split the input into sentences on Indic and Western punctuation.
text = " नमस्ते! आप कैसे हैं? मुझे आपकी बहुत याद आयी।"
text = re.split('[।;?\.!]', text)
text = list(filter(None, text))
print(text)
# Output is [' नमस्ते', ' आप कैसे हैं', ' मुझे आपकी बहुत याद आयी']

# Tokenize each sentence; stem=True also stems each token.
out = normalizerHindi.tokenizer(text, stem=True)
print(out)
# Output is [['▁नमस्त'], ['▁आप', '▁कैस', '▁हैं'], ['▁मुझ', '▁आपक', '▁बहुत', '▁याद', '▁आय']]

# Remove stop words from each tokenized sentence.
clean_out = []
for i in out:
    clean_out.append(normalizerHindi.remove_stop_words(i))
print(clean_out)
# Output is [['▁नमस्त'], [], ['▁मुझ', '▁याद', '▁आय']]
```
Methods Involved
Following are the methods contained in the TextNormalizer class:
- Tokenizer: a method that tokenizes a given list of sentences.
- Stem: a method that stems a given string.
- Remove Stop Words: a method that removes stop words from a given list of words.
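To illustrate what stop-word removal does conceptually, here is a minimal, self-contained sketch that does not depend on indicLP. The stop-word set and function below are hypothetical examples for illustration, not indicLP's actual word list or API:

```python
# Illustrative sketch of stop-word removal; the stop-word set below is a
# hypothetical example, not the list indicLP actually uses.
HINDI_STOP_WORDS = {"आप", "कैसे", "हैं", "आपकी", "बहुत"}

def remove_stop_words(tokens):
    """Return the tokens with stop words filtered out, preserving order."""
    return [t for t in tokens if t not in HINDI_STOP_WORDS]

print(remove_stop_words(["मुझे", "आपकी", "बहुत", "याद", "आयी"]))
# → ['मुझे', 'याद', 'आयी']
```

In indicLP itself, as the example above shows, `remove_stop_words` is applied per sentence to the token lists produced by the tokenizer.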
Reference Materials
Following are some reference materials for the Preprocessing module:
- Tokenizing Sentences: a reference blog post demonstrating the use of the TextNormalizer class on a hardcoded string.
- Text Normalization: Jurafsky's Stanford NLP slides on text processing.