Preprocessing

indicLP's Preprocessing module provides tools for common NLP preparation steps on Indic-language text: tokenizing, stopword removal, stemming and word embedding.

Getting Started

The Preprocessing library contains the tools needed to clean input text before passing it to a model.

Example
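from indicLP.preprocessing import TextNormalizer, Embedding

# Language code ("ta" is the ISO 639-1 code for Tamil)
language = "ta"

# Create a normalizer and an embedder for the chosen language
normalizerTamil = TextNormalizer(language)
embedder = Embedding(lang=language)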

Classes Contained

The preprocessing module contains the following classes:

  • TextNormalizer: Class to perform tasks such as tokenizing, stopword removal and stemming.
  • Embedding: Class for performing word embedding and finding closely associated words.
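
To make the three normalization steps named above concrete, here is a minimal plain-Python sketch of tokenizing, stopword removal and stemming. It does not use indicLP and is not how TextNormalizer is implemented: the function names, the two-word stopword set and the single-suffix stemming rule are all assumptions chosen for illustration only.

```python
# Illustrative only: none of these names are part of indicLP's API.
STOPWORDS = {"ஒரு", "மற்றும்"}  # tiny hand-picked sample: "a/one", "and"
SUFFIXES = ("கள்",)              # Tamil plural marker, used as a toy stemming rule


def tokenize(text):
    """Split on whitespace; a real tokenizer also handles punctuation."""
    return text.split()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def stem(token, suffixes=SUFFIXES):
    """Strip the first matching suffix, never reducing a token to nothing."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def normalize(text, stopwords=STOPWORDS, suffixes=SUFFIXES):
    """Run the three steps in order: tokenize, filter, stem."""
    tokens = remove_stopwords(tokenize(text), stopwords)
    return [stem(t, suffixes) for t in tokens]


sentence = "ஒரு நாய் மற்றும் பறவைகள்"  # "a dog and birds"
result = normalize(sentence)
print(result)
```

A single suffix rule like this breaks on most real words, which is why a maintained, language-aware normalizer is preferable to hand-rolled rules.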

Reference Materials

The following reference materials cover the Preprocessing module:

  • Tokenizing Sentences: A reference blog post discussing usage of the TextNormalizer class on a hardcoded string.
  • Word Embedding: A reference blog post discussing usage of the Embedding class on a hardcoded string.
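
As background for the word embedding material, "closely associated words" are commonly found by ranking vectors by cosine similarity. The sketch below shows that idea in plain Python. It is not the Embedding class's implementation, which may use a different measure; the helper names are hypothetical and the three-dimensional vectors are invented purely so the example runs.

```python
import math

# Made-up toy vectors, NOT real embeddings and NOT produced by indicLP.
TOY_VECTORS = {
    "நாய்": [0.90, 0.80, 0.10],      # dog
    "பூனை": [0.85, 0.75, 0.20],      # cat
    "பறவை": [0.60, 0.90, 0.30],      # bird
    "புத்தகம்": [0.10, 0.20, 0.95],  # book
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def closest_words(word, vectors, k=2):
    """Return the k words most similar to `word`, best match first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


neighbours = closest_words("நாய்", TOY_VECTORS)
for other, score in neighbours:
    print(other, round(score, 3))
```

With real embeddings the vectors have hundreds of dimensions and the vocabulary is large, so libraries typically precompute norms or use an index rather than scanning every word as this sketch does.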