Getting Started
The tokenizer function is one of the most essential methods for preprocessing in NLP.
Example
```python
# coding:utf-8
from indicLP.preprocessing import TextNormalizer
import re

language = "hi"
normalizerHindi = TextNormalizer(language)

text = " नमस्ते! आप कैसे हैं? मुझे आपकी बहुत याद आयी।"
text = re.split(r'[।;?\.!]', text)
text = list(filter(None, text))
print(text)
# Output is [' नमस्ते', ' आप कैसे हैं', ' मुझे आपकी बहुत याद आयी']

out = normalizerHindi.tokenizer(text, stem=True)
print(out)
# Output is [['▁नमस्त'], ['▁आप', '▁कैस', '▁हैं'], ['▁मुझ', '▁आपक', '▁बहुत', '▁याद', '▁आय']]

out = normalizerHindi.tokenizer(text, stem=False)
print(out)
# Output is [['▁नमस्ते'], ['▁आप', '▁कैसे', '▁हैं'], ['▁मुझे', '▁आपकी', '▁बहुत', '▁याद', '▁आयी']]
```
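The sentence-splitting step above is plain Python and does not depend on indicLP; it can be wrapped in a small helper for reuse. The function name `split_sentences` is illustrative, not part of the indicLP API:

```python
import re

def split_sentences(text):
    """Split text into sentences on common sentence-ending punctuation,
    including the Devanagari danda (।) as well as ; ? . and !
    (illustrative helper, not part of indicLP)."""
    parts = re.split(r'[।;?\.!]', text)
    # Drop the empty strings left behind by trailing punctuation.
    return [p for p in parts if p]

print(split_sentences(" नमस्ते! आप कैसे हैं? मुझे आपकी बहुत याद आयी।"))
# → [' नमस्ते', ' आप कैसे हैं', ' मुझे आपकी बहुत याद आयी']
```

The resulting list of sentence strings is exactly the shape of input that the tokenizer method expects via its `inp_list` argument.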
Input Arguments
The following input arguments can be provided when using the tokenizer method:
- inp_list: A list of strings, containing the sentences to be tokenized.
- stem: A boolean value determining whether the words should be stemmed. Defaults to True.
Reference Materials
The following are some reference materials for the tokenizer method:
- Tokenizing Sentences: A reference blog post discussing the usage of the TextNormalizer class on a hardcoded string.
- Text Normalization: Jurafsky NLP slides on text processing provided by Stanford.