Tutorial on Lemmatization using NLTK

Lemmatization is one of the foundational steps in Natural Language Processing (NLP). It converts a word into its base or dictionary form, which simplifies text analysis because different inflections of the same word are counted together. The Natural Language Toolkit (NLTK) offers a straightforward way to perform lemmatization. In this tutorial, you’ll learn how to lemmatize words with NLTK.

Prerequisites

  • Python installed on your system
  • Basic Python programming knowledge
  • Familiarity with NLP concepts is beneficial but not essential

What is Lemmatization?

Lemmatization is the process of converting a word into its dictionary form, known as its lemma. For example, the lemma of the verb “running” is “run”, and the lemma of the noun “geese” is “goose”.

Installing NLTK

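If you haven’t installed NLTK yet, you can do so using pip by running pip install nltk. For platform-specific steps, see Installing NLTK on Windows, Mac and Linux. The lemmatizer also needs the WordNet corpus, which you can download once from Python with nltk.download("wordnet").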

Performing Lemmatization

To perform lemmatization in NLTK, you’ll need to import the WordNetLemmatizer class from the nltk.stem module.

Python
from nltk.stem import WordNetLemmatizer

Basic Example

Here’s a simple example to get you started:

Python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("running"))  # Output: 'running' (unchanged: words are treated as nouns by default)
print(lemmatizer.lemmatize("geese"))  # Output: 'goose'

Lemmatization with POS tags

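Part-of-Speech (POS) tags can also be used to improve lemmatization. By default, lemmatize() treats every word as a noun, which is why “running” was returned unchanged in the basic example. Passing the pos argument tells the lemmatizer how to read the word. The accepted values are "n" (noun), "v" (verb), "a" (adjective), "r" (adverb) and "s" (satellite adjective).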

Python
print(lemmatizer.lemmatize("running", pos="v"))  # Output: 'run'

Examples

Python
# Without POS tag: read as a plural noun
print(lemmatizer.lemmatize("increases"))  # Output: 'increase'

# With POS tag: read as a third-person verb form
print(lemmatizer.lemmatize("increases", pos="v"))  # Output: 'increase'

FAQs

Q. Is lemmatization better than stemming?

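A. Both methods have their own use cases. Lemmatization is more accurate because it returns valid dictionary words, but it is slower since it relies on a vocabulary lookup. Stemming is faster but less precise: it strips suffixes by rule and can produce non-words, such as “studi” for “studies”.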

Q. Can I use lemmatization for languages other than English?

A. NLTK’s WordNetLemmatizer works with English only, because it relies on the English WordNet database. For other languages, libraries such as spaCy and Stanza provide lemmatizers.

Q. Do I always need to provide POS tags for lemmatization?

A. No, but providing POS tags can improve the accuracy of the lemmatization process. Without a tag, every word is treated as a noun.

Q. What are POS tags?

A. POS tags are labels that indicate the part-of-speech (such as noun, verb, adjective) of each word in a text.

Q. How do I find out the POS tag of a word?

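A. You can use NLTK’s pos_tag function to tag a list of tokens with part-of-speech labels. It returns Penn Treebank tags such as NN or VBG, which need to be mapped to WordNet’s tags (n, v, a, r) before you pass them to lemmatize().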