Getting Started
The tokenizer function is one of the most essential methods for preprocessing in NLP.
Example
```python
# coding:utf-8
from indicLP.preprocessing import TextNormalizer
import re

language = "hi"
normalizerHindi = TextNormalizer(language)

text = " नमस्ते! आप कैसे हैं? मुझे आपकी बहुत याद आयी।"
text = re.split(r'[।;?\.!]', text)
text = list(filter(None, text))
print(text)
# Output is [' नमस्ते', ' आप कैसे हैं', ' मुझे आपकी बहुत याद आयी']

out = normalizerHindi.tokenizer(text, stem=True)
print(out)
# Output is [['▁नमस्त'], ['▁आप', '▁कैस', '▁हैं'], ['▁मुझ', '▁आपक', '▁बहुत', '▁याद', '▁आय']]

out = normalizerHindi.tokenizer(text, stem=False)
print(out)
# Output is [['▁नमस्ते'], ['▁आप', '▁कैसे', '▁हैं'], ['▁मुझे', '▁आपकी', '▁बहुत', '▁याद', '▁आयी']]
```
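The sentence-splitting step above is plain Python and does not depend on indicLP; it can be wrapped in a small helper for reuse. The function name `split_sentences` is illustrative, not part of the indicLP API:

```python
import re

def split_sentences(text):
    """Split text into sentences on common sentence-ending punctuation,
    including the Devanagari danda (।) as well as ; ? . and !
    (illustrative helper, not part of indicLP)."""
    parts = re.split(r'[।;?\.!]', text)
    # Drop the empty strings left behind by trailing punctuation.
    return [p for p in parts if p]

print(split_sentences(" नमस्ते! आप कैसे हैं? मुझे आपकी बहुत याद आयी।"))
# → [' नमस्ते', ' आप कैसे हैं', ' मुझे आपकी बहुत याद आयी']
```

The resulting list of sentence strings is exactly the shape of input that the tokenizer method expects via its `inp_list` argument.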
Input Arguments
The following input arguments can be provided when using the tokenizer method:
- inp_list: A list of strings, containing the sentences to be tokenized.
- stem: A boolean value determining whether the words should be stemmed. Defaults to True.
Reference Materials
The following are some reference materials for the tokenizer method:
- Tokenizing Sentences: A reference blog post discussing the usage of the TextNormalizer class on a hardcoded string.
- Text Normalization: Jurafsky NLP slides on text processing provided by Stanford.