NLP: Stemming and Lemmatization
In last week’s post we discussed the ambiguities that can be present in a sentence. Today, we are going to discuss steps that can be taken while processing a sentence to get better results. To get started, let us consider the following sentences as an example:
“She eats an apple every day.”
“She was eating an apple when I called.”
The words “eats” and “eating” are used in the above sentences. Often, while processing a sentence, knowing the root word is enough to understand its meaning. For “eats” and “eating”, the root word is “eat”. Today we are going to discuss two common methods through which this reduction is achieved in NLP.
NLTK Library
Let us first discuss the library we are going to use for this blog’s activities. NLTK, which stands for Natural Language Toolkit, is a Python library that helps us process and work with natural (human) language. The NLTK library can perform a wide range of operations such as tokenizing, stemming, classification, parsing, tagging, and semantic reasoning.
Furthermore, the NLTK library provides a user-friendly interface to over 50 corpora and lexical resources such as the Gutenberg Corpus and WordNet. You can install NLTK, if it isn’t already available on your system, using the pip command shown below:
pip install nltk
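Note that NLTK’s corpora and models are downloaded separately from the library itself. The lemmatization example later in this post relies on the WordNet data, which you can fetch with nltk.download. This is a one-time setup; the extra omw-1.4 resource is an assumption that applies only to some newer NLTK versions:

import nltk

# One-time setup: fetch the WordNet data used by the lemmatizer later in this post.
nltk.download('wordnet')
nltk.download('omw-1.4')  # some newer NLTK versions need this alongside WordNet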
Stemming
Let us first understand the process of stemming. Stemming is the process of reducing a word to its root form. For instance, the words “running”, “run”, and “runs” are all variations of the same root word “run”. However, the stem that is generated need not be an actual, meaningful word. For instance, the words “dance” and “dancing” both have the stem “danc”, which is not a meaningful word.
The NLTK library ships with various built-in stemmers such as the Porter and Snowball stemmers. We will be using the Porter stemmer in the examples. It applies a series of suffix-removal rules to identify the stem/root word.
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Stem each word and print it alongside its stem.
words = ["eating", "eat", "ate", "eats"]
stems = [stemmer.stem(w) for w in words]
for word, stem in zip(words, stems):
    print(word + "\t" + stem)
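Running this prints output along these lines, with each stem next to the original word:

eating	eat
eat	eat
ate	ate
eats	eat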
Notice that while the words “eating”, “eat”, and “eats” all give the stem “eat”, “ate” does not, since mere suffix removal is not sufficient to relate it to “eat”.
Lemmatization
Lemmatization, on the other hand, identifies a meaningful root word known as the lemma. One of the commonly used lemmatizers in the NLTK library is the WordNet lemmatizer. WordNet is a large, freely available lexical database that establishes structured semantic relationships between words in the English language.
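To get a feel for what WordNet stores, we can query it directly through NLTK’s corpus reader. The snippet below is a small illustrative sketch (it assumes the WordNet data downloaded earlier) that prints a couple of the “synsets”, i.e. groups of words sharing a meaning, for “eat”:

from nltk.corpus import wordnet

# Each synset is a group of words that share one meaning.
for synset in wordnet.synsets("eat")[:2]:
    print(synset.name(), "-", synset.definition())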
The WordNet lemmatizer takes two parameters: the word itself and its context, represented by the POS tag (verb, noun, etc.) under which the word appears in our sentence; if no tag is supplied, it defaults to noun. Sometimes the lemma may change based on the context in which the word appears.
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# Lemmatize each word as a verb (pos="v") and print it alongside its lemma.
words = ["eating", "eat", "ate", "eats"]
lemmas = [wordnet_lemmatizer.lemmatize(w, pos="v") for w in words]
for word, lemma in zip(words, lemmas):
    print(word + "\t" + lemma)
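Running this, the output is:

eating	eat
eat	eat
ate	eat
eats	eat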
While lemmatizing, we can notice that all four words are converted to “eat”. This is because, unlike stemming, here we are using WordNet to identify the relationship between “ate” and “eat” when they appear as verbs.
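To see why the POS tag matters, consider a word whose lemma depends on how it is used. As a small illustrative sketch (the word “leaves” is my own example, not from the snippets above), the same word lemmatizes differently as a noun than as a verb:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The same surface form yields different lemmas depending on the POS tag.
print(lemmatizer.lemmatize("leaves", pos="n"))  # leaf  (as in the leaves of a tree)
print(lemmatizer.lemmatize("leaves", pos="v"))  # leave (as in "she leaves early")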