Getting Started
The remove_stop_words method of the TextNormalizer class removes stop words from an input list of tokens.
Example
    # coding: utf-8
    import re

    from indicLP.preprocessing import TextNormalizer

    language = "hi"
    normalizerHindi = TextNormalizer(language)

    text = " नमस्ते! आप कैसे हैं? मुझे आपकी बहुत याद आयी।"

    # Split the text into sentences and drop the empty strings left by the split
    text = re.split(r'[।;?.!]', text)
    text = list(filter(None, text))
    print(text)
    # Output is [' नमस्ते', ' आप कैसे हैं', ' मुझे आपकी बहुत याद आयी']

    # Tokenize each sentence, stemming the tokens
    out = normalizerHindi.tokenizer(text, stem = True)
    print(out)
    # Output is [['▁नमस्त'], ['▁आप', '▁कैस', '▁हैं'], ['▁मुझ', '▁आपक', '▁बहुत', '▁याद', '▁आय']]

    # Remove stop words from each tokenized sentence
    clean_out = []
    for i in out:
        clean_out.append(normalizerHindi.remove_stop_words(i))
    print(clean_out)
    # Output is [['▁नमस्त'], [], ['▁मुझ', '▁याद', '▁आय']]
Input Arguments
The remove_stop_words method takes the following input argument:
- wordList: A list of strings representing the tokenized sentence.
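To make the shape of wordList and of the result concrete, here is a minimal, library-free sketch of stop word filtering on a token list. It is not indicLP's implementation: the function name remove_stop_words_sketch, the SP_MARKER constant, and the tiny HYPOTHETICAL_STOP_WORDS set are all assumptions made for illustration, and the sketch assumes that stop words are compared after stripping the leading "▁" word-boundary marker seen in the tokenizer output.

```python
# Illustrative sketch only; not the indicLP implementation.
SP_MARKER = "▁"  # word-boundary marker seen in the tokenizer output

# Hypothetical stop word set (stemmed forms), chosen only for this example.
HYPOTHETICAL_STOP_WORDS = {"आप", "कैस", "हैं", "आपक", "बहुत"}


def remove_stop_words_sketch(word_list, stop_words):
    """Return the tokens whose marker-stripped form is not a stop word."""
    cleaned = []
    for token in word_list:
        # Strip the leading marker before comparing against the stop word set
        bare = token.lstrip(SP_MARKER)
        if bare not in stop_words:
            cleaned.append(token)
    return cleaned


tokens = ["▁मुझ", "▁आपक", "▁बहुत", "▁याद", "▁आय"]
cleaned = remove_stop_words_sketch(tokens, HYPOTHETICAL_STOP_WORDS)
print(cleaned)
```

The input is a flat list of strings for a single sentence, so a list of tokenized sentences has to be processed one sentence at a time. A sentence made up entirely of stop words yields an empty list.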
Reference Materials
The following reference materials are relevant to the remove_stop_words method:
- Tokenizing Sentences: A reference blog post discussing the use of the TextNormalizer class on a hardcoded string.
- Dropping common terms: Stop Words: Stanford's NLP notes on stop word removal.