Natural Language Toolkit (NLTK) is a powerful Python library for working with human language data, that is, text. One of the basic tasks in natural language processing (NLP) is Part-of-Speech (POS) tagging, which labels each word in a sentence with its grammatical category, such as noun, verb, adverb, or adjective. This article walks through the steps for POS tagging with NLTK.
Table of Contents
- Prerequisites for POS Tagging using NLTK
- Installing NLTK
- Importing NLTK and Downloading Resources
- Tokenization
- POS Tagging
- Understanding POS Tags
- Example Code
- NLTK POS Taggers
- FAQs for POS Tagging Using NLTK
Prerequisites for POS Tagging using NLTK
- Basic understanding of Python
- Familiarity with NLP concepts is helpful but not mandatory
Installing NLTK
First, you need to install the NLTK library. On Windows, Mac and Linux you can install it using pip by running: pip install nltk
Importing NLTK and Downloading Resources
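Import NLTK and download the resources it needs: the Punkt tokenizer models and the averaged perceptron POS tagger model. In recent NLTK releases (3.9 and later) these resources are named 'punkt_tab' and 'averaged_perceptron_tagger_eng'; if you get a LookupError, download the resource named in the error message.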
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
Tokenization
The first step before POS tagging is to tokenize the text into words and sentences.
from nltk.tokenize import word_tokenize, sent_tokenize
text = "NLTK is a leading platform for building Python programs to work with human language data."
sentences = sent_tokenize(text)
# Tokenizing the first sentence into words
words = word_tokenize(sentences[0])
print(words)
# Output
'''['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']'''
POS Tagging
After tokenization, pass the list of words to nltk.pos_tag, which returns one tag per token.
# POS tagging
tagged_words = nltk.pos_tag(words)
print(tagged_words)
# Output
'''[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]'''
Understanding POS Tags
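The output is a list of (word, tag) tuples: the first element of each tuple is the token, and the second is its POS tag from the Penn Treebank tagset. For example, NNP is a singular proper noun, VBZ is a third-person singular present-tense verb, and DT is a determiner.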
Example:
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ...]
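Because the result is an ordinary Python list of tuples, it can be unpacked and summarized with standard-library tools. The sketch below hard-codes the tagged output shown in the POS Tagging step rather than calling NLTK. The TAG_DESCRIPTIONS dictionary is a hand-written, partial glossary of Penn Treebank tags introduced here for illustration; it is not part of NLTK.

```python
from collections import Counter

# Tagged output copied from the POS Tagging step above (no NLTK call needed here)
tagged_words = [
    ('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'),
    ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'),
    ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
    ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.'),
]

# Hand-written, partial glossary of Penn Treebank tags (illustrative, not an NLTK API)
TAG_DESCRIPTIONS = {
    'NNP': 'proper noun, singular',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'VB': 'verb, base form',
    'VBZ': 'verb, 3rd person singular present',
    'VBG': 'verb, gerund or present participle',
    'JJ': 'adjective',
    'DT': 'determiner',
    'IN': 'preposition or subordinating conjunction',
    'TO': 'to',
    '.': 'sentence-final punctuation',
}

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged_words)

# Print each tag with its count and a short description
for tag, count in tag_counts.most_common():
    description = TAG_DESCRIPTIONS.get(tag, 'not in this glossary')
    print(f"{tag:4} {count}  {description}")
```

For the full tagset, NLTK's own documentation is the authoritative source; the glossary above only covers the tags that appear in this sentence.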
Example Code
Here’s an example that puts it all together:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Sample text
text = "NLTK is a leading platform for building Python programs to work with human language data."
# Tokenizing sentences
sentences = sent_tokenize(text)
# Tokenizing words and POS tagging
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)
    print(tagged_words)
NLTK POS Taggers
In Natural Language Processing (NLP), Part-of-Speech (POS) tagging is the task of identifying the grammatical categories to which the words in a sentence belong (e.g., nouns, verbs, adjectives, etc.). The NLTK library provides various taggers that can be employed for this purpose.
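Default POS Tagger (Averaged Perceptron Tagger)
The averaged perceptron tagger is the default POS tagger behind nltk.pos_tag in current NLTK releases; it replaced the maximum entropy (Maxent) tagger that older releases used. Here’s how you can use it: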
import nltk
from nltk.tokenize import word_tokenize
# Sample sentence
sentence = "The cat is chasing the mouse"
# Tokenize the sentence into words
words = word_tokenize(sentence)
# Perform POS tagging
tagged_words = nltk.pos_tag(words)
print(tagged_words)
Rule-Based POS Taggers
RegexpTagger
You can use regular expressions to build a rule-based POS tagger.
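from nltk.tag import RegexpTagger
from nltk.tokenize import word_tokenize
# Define the patterns; they are tried in order and the first match wins,
# so the catch-all pattern must come last
patterns = [(r'.*ing$', 'VBG'), (r'.*ed$', 'VBD'), (r'.*es$', 'VBZ'), (r'^-?[0-9]+$', 'CD'), (r'.*', 'NN')]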
# Initialize the tagger
regexp_tagger = RegexpTagger(patterns)
# Tag the sentence
tagged_words = regexp_tagger.tag(word_tokenize("He is running and eating."))
print(tagged_words)
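Pattern order matters in a RegexpTagger: each token is checked against the patterns from first to last, and the first match decides the tag. The sketch below illustrates this with a shortened version of the pattern list above. It splits the sentence with str.split() instead of word_tokenize so that it runs without downloading tokenizer data; the sample sentence is made up for illustration.

```python
from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),      # gerunds such as "eating"
    (r'.*ed$', 'VBD'),       # simple past such as "walked"
    (r'^-?[0-9]+$', 'CD'),   # whole numbers
    (r'.*', 'NN'),           # catch-all, deliberately last
]

# Whitespace split keeps the example free of downloaded resources
tokens = "He walked 3 miles before eating".split()

# Catch-all last: the specific rules get a chance to match first
ordered = RegexpTagger(patterns).tag(tokens)
print(ordered)

# Catch-all first: it matches every token, so no other rule is ever reached
shadowed = RegexpTagger([patterns[-1]] + patterns[:-1]).tag(tokens)
print(shadowed)
```

With the catch-all moved to the front, every token comes back as NN, which is why the broadest pattern belongs at the end of the list.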
N-Gram Taggers
UnigramTagger
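The UnigramTagger assigns each word the tag it most frequently carried in the training data, ignoring the surrounding context. Words that never appeared in the training data are tagged None unless a backoff tagger is supplied.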
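import nltk
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize
# Download the Penn Treebank sample used as training data
nltk.download('treebank')
# Load the first 3000 tagged sentences as training data
train_sents = treebank.tagged_sents()[:3000]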
# Initialize the tagger
unigram_tagger = UnigramTagger(train_sents)
# Tag the sentence
tagged_words = unigram_tagger.tag(word_tokenize("This is a simple sentence."))
print(tagged_words)
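To see the lookup behaviour of a unigram tagger in isolation, here is a minimal sketch that trains on two hand-labelled sentences instead of the Treebank sample. The toy_train data and the toy_unigram name are made up for illustration, and the tokens are passed as a plain list so no tokenizer data is needed.

```python
from nltk.tag import UnigramTagger

# Tiny hand-labelled training set (hypothetical, stands in for the Treebank sample)
toy_train = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
]

toy_unigram = UnigramTagger(toy_train)

# 'the' and 'cat' were seen in training; 'purrs' was not
result = toy_unigram.tag(['the', 'cat', 'purrs'])
print(result)
```

The unseen word 'purrs' is tagged None, because a unigram tagger without a backoff has nothing to fall back on for words outside its training vocabulary.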
BigramTagger and TrigramTagger
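These taggers work similarly to the UnigramTagger, but they also consider context: the BigramTagger looks at the tag of the previous word, and the TrigramTagger at the tags of the two previous words. Used on their own, they return None for any word-and-context pair not seen in training, so they are usually combined with a backoff tagger. The code reuses train_sents from the UnigramTagger example.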
from nltk.tag import BigramTagger, TrigramTagger
# Initialize the taggers
bigram_tagger = BigramTagger(train_sents)
trigram_tagger = TrigramTagger(train_sents)
# Tag the sentence
tagged_words_bigram = bigram_tagger.tag(word_tokenize("This is another sentence."))
tagged_words_trigram = trigram_tagger.tag(word_tokenize("This is another sentence."))
print("Bigram:", tagged_words_bigram)
print("Trigram:", tagged_words_trigram)
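A bigram tagger used on its own can fail in a cascade: once one word gets the tag None, the next word's context contains None, which was never seen in training either. The sketch below shows this on a hand-labelled toy training set; toy_train and toy_bigram are hypothetical names introduced for illustration.

```python
from nltk.tag import BigramTagger

# Tiny hand-labelled training set (hypothetical, for illustration only)
toy_train = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
]

toy_bigram = BigramTagger(toy_train)

# Every (previous tag, word) pair here occurred in training
seen = toy_bigram.tag(['the', 'cat', 'sleeps'])
print(seen)

# 'a' never occurred, so it gets None; the following words then have
# an unseen context (previous tag None) and get None as well
unseen = toy_bigram.tag(['a', 'cat', 'sleeps'])
print(unseen)
```

This is the main reason n-gram taggers are normally chained with simpler taggers through the backoff parameter.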
Combining Taggers
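You can also combine different taggers to improve accuracy by chaining them with the backoff parameter: when a tagger cannot tag a word, it hands the word to its backoff tagger. Here the bigram tagger falls back to the unigram tagger, which falls back to a default tagger that tags every word as NN.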
from nltk.tag import DefaultTagger
# Create a sequence of taggers with backoff
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)
# Use the combined tagger
tagged_words = t2.tag(word_tokenize("Yet another example sentence."))
print(tagged_words)
Each of these taggers has its strengths and weaknesses, and the best choice depends on your specific application and requirements.
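One way to choose between taggers is to measure per-token accuracy on tagged sentences that were held out from training. The sketch below does this for a bigram tagger alone and for the backoff chain described above. The toy_train and toy_test sentences are hand-labelled and hypothetical, and token_accuracy is a helper written for this example rather than an NLTK function; on real data you would hold out part of a corpus such as the Treebank sample instead.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Hand-labelled toy data (hypothetical); real experiments need far more sentences
toy_train = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
]
toy_test = [
    [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('bird', 'NN'), ('sings', 'VBZ')],
]

def token_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total

# Bigram tagger with no backoff
bigram_only = BigramTagger(toy_train)

# Backoff chain: bigram -> unigram -> default 'NN'
t0 = DefaultTagger('NN')
t1 = UnigramTagger(toy_train, backoff=t0)
t2 = BigramTagger(toy_train, backoff=t1)

acc_bigram_only = token_accuracy(bigram_only, toy_test)
acc_backoff = token_accuracy(t2, toy_test)
print(f"bigram only:   {acc_bigram_only:.2f}")
print(f"backoff chain: {acc_backoff:.2f}")
```

On this toy data the backoff chain scores higher than the bigram tagger alone, because unseen words fall through to the default NN tag instead of being left as None.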
FAQs for POS Tagging Using NLTK
Q. What are the most common POS tags?
A. Among the most frequently seen Penn Treebank tags are NN (noun, singular or mass), NNP (proper noun, singular), VB (verb, base form), and JJ (adjective).
Q. Can NLTK tag POS in languages other than English?
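A. Yes, but the support for languages other than English is limited. The pre-trained model behind nltk.pos_tag targets English (a Russian model is also available via lang='rus'); for other languages you typically train a tagger such as UnigramTagger on a tagged corpus in that language.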
Q. What do I do if I encounter an error during installation?
A. Check that you have a stable internet connection and that pip is installed and up to date (python -m pip install --upgrade pip).
Q. What are the prerequisites for using NLTK for POS tagging?
A. Basic Python programming knowledge is enough, although a familiarity with natural language processing can be helpful.
Q. Can I use POS tagging for sentiment analysis?
A. While POS tagging itself won’t give you the sentiment, it can be a crucial preprocessing step for more advanced sentiment analysis algorithms.
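As a rough illustration of that preprocessing step, the sketch below keeps only adjectives and adverbs, which often carry opinion words. The tagged_review list is hand-tagged for this example in the (word, tag) format that nltk.pos_tag returns; it is not actual NLTK output, and opinion_candidates is a hypothetical helper, not an NLTK function.

```python
# Hand-tagged example in the (word, tag) format that nltk.pos_tag returns;
# these tags were written by hand for illustration, not produced by NLTK
tagged_review = [
    ('The', 'DT'), ('battery', 'NN'), ('is', 'VBZ'), ('surprisingly', 'RB'),
    ('good', 'JJ'), ('but', 'CC'), ('the', 'DT'), ('screen', 'NN'),
    ('is', 'VBZ'), ('dim', 'JJ'), ('.', '.'),
]

def opinion_candidates(tagged_tokens):
    """Keep adjectives (JJ, JJR, JJS) and adverbs (RB, RBR, RBS)."""
    return [word for word, tag in tagged_tokens if tag.startswith(('JJ', 'RB'))]

candidates = opinion_candidates(tagged_review)
print(candidates)
```

The filtered words could then be passed to a sentiment lexicon or classifier; the filtering itself makes no judgment about polarity.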