NLTK POS (Part-of-Speech) Taggers: POS Tagging Using NLTK

The Natural Language Toolkit (NLTK) is a Python library for working with human language data, that is, text. One of the basic tasks in natural language processing is Part-of-Speech (POS) tagging, which labels each word in a sentence with its grammatical category, such as noun, verb, adverb, or adjective. This article walks through the steps for POS tagging using NLTK.

Prerequisites for POS Tagging using NLTK

  • Basic understanding of Python
  • Familiarity with NLP concepts is helpful but not mandatory

Installing NLTK

First, install the NLTK library. On Windows, macOS, and Linux you can install it with pip by running: pip install nltk

Importing NLTK and Downloading Resources

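Import NLTK and download the resources the examples need: the Punkt tokenizer models and the averaged perceptron tagger model.

Python
import nltk

# Tokenizer models used by sent_tokenize and word_tokenize
nltk.download('punkt')
# Model used by nltk.pos_tag
nltk.download('averaged_perceptron_tagger')
# Recent NLTK releases (3.9 and later) look for these resource names instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')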

Tokenization

The first step before POS tagging is to tokenize the text into words and sentences.

Python
from nltk.tokenize import word_tokenize, sent_tokenize

text = "NLTK is a leading platform for building Python programs to work with human language data."
sentences = sent_tokenize(text)

# Tokenizing the first sentence into words
words = word_tokenize(sentences[0])

print(words)

# Output
'''['NLTK', 'is', 'a', 'leading', 'platform', 'for', 'building', 'Python', 'programs', 'to', 'work', 'with', 'human', 'language', 'data', '.']'''

POS Tagging

After tokenization, you can proceed to tagging.

Python
# POS tagging
tagged_words = nltk.pos_tag(words)

print(tagged_words)
# Output
'''[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'), ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]'''

Understanding POS Tags

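The output is a list of tuples, where the first element is the word and the second is its POS tag. By default, nltk.pos_tag uses tags from the Penn Treebank tagset: for example, NNP is a singular proper noun, VBZ is a verb in the third person singular present, and DT is a determiner.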

Example:

Python
[('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'), ('platform', 'NN'), ...]
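
To make the tags easier to read, the short sketch below pairs each tag from the output above with a plain-English description. The descriptions follow the Penn Treebank tagset. The TAG_DESCRIPTIONS dictionary and the describe helper are illustrative names, not part of NLTK, and they cover only the tags that appear in this example.

```python
# Tagged tokens copied from the pos_tag output shown earlier
tagged_words = [('NLTK', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('leading', 'VBG'),
                ('platform', 'NN'), ('for', 'IN'), ('building', 'VBG'), ('Python', 'NNP'),
                ('programs', 'NNS'), ('to', 'TO'), ('work', 'VB'), ('with', 'IN'),
                ('human', 'JJ'), ('language', 'NN'), ('data', 'NNS'), ('.', '.')]

# Illustrative glossary (not an NLTK object); only the tags used above are listed
TAG_DESCRIPTIONS = {
    'NNP': 'proper noun, singular',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'VB': 'verb, base form',
    'VBZ': 'verb, 3rd person singular present',
    'VBG': 'verb, gerund or present participle',
    'DT': 'determiner',
    'IN': 'preposition or subordinating conjunction',
    'TO': 'to',
    'JJ': 'adjective',
    '.': 'sentence-final punctuation',
}


def describe(tagged):
    """Return (word, tag, description) triples for a list of (word, tag) pairs."""
    return [(word, tag, TAG_DESCRIPTIONS.get(tag, 'unknown tag')) for word, tag in tagged]


for word, tag, meaning in describe(tagged_words):
    print(f'{word:10} {tag:4} {meaning}')
```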

Example Code

Here’s an example that puts it all together:

Python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

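# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# Recent NLTK releases (3.9 and later) look for these resource names instead
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')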

# Sample text
text = "NLTK is a leading platform for building Python programs to work with human language data."

# Tokenizing sentences
sentences = sent_tokenize(text)

# Tokenizing words and POS tagging
for sentence in sentences:
    words = word_tokenize(sentence)
    tagged_words = nltk.pos_tag(words)
    print(tagged_words)

NLTK POS Taggers

In Natural Language Processing (NLP), Part-of-Speech (POS) tagging is the task of identifying the grammatical categories to which the words in a sentence belong (e.g., nouns, verbs, adjectives, etc.). The NLTK library provides various taggers that can be employed for this purpose.

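Default POS Tagger (Averaged Perceptron Tagger)

The averaged perceptron tagger is the default POS tagger behind nltk.pos_tag. Older NLTK releases used a maximum entropy (Maxent) tagger as the default instead. Here’s how you can use the default tagger: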

Python
import nltk
from nltk.tokenize import word_tokenize

# Sample sentence
sentence = "The cat is chasing the mouse"

# Tokenize the sentence into words
words = word_tokenize(sentence)

# Perform POS tagging
tagged_words = nltk.pos_tag(words)

print(tagged_words)

Rule-Based POS Taggers

RegexpTagger

You can use regular expressions to build a rule-based POS tagger.

Python
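from nltk.tag import RegexpTagger
from nltk.tokenize import word_tokenize

# Define the patterns; they are tried in order and the first match wins
patterns = [
    (r'.*ing$', 'VBG'),      # gerunds and present participles
    (r'.*ed$', 'VBD'),       # simple past
    (r'.*es$', 'VBZ'),       # 3rd person singular present
    (r'^-?[0-9]+$', 'CD'),   # whole numbers
    (r'.*', 'NN'),           # fallback: tag everything else as a noun
]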

# Initialize the tagger
regexp_tagger = RegexpTagger(patterns)

# Tag the sentence
tagged_words = regexp_tagger.tag(word_tokenize("He is running and eating."))

print(tagged_words)
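
RegexpTagger tries the patterns in order and uses the first one that matches, so the catch-all pattern has to come last. The sketch below illustrates this with a hand-picked token list (a plain Python list is used instead of word_tokenize so that no extra downloads are needed). The variable names are illustrative.

```python
from nltk.tag import RegexpTagger

tokens = ['running', 'walked', '42', 'cat']

# Specific patterns first, catch-all last
ordered = RegexpTagger([
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'^-?[0-9]+$', 'CD'),
    (r'.*', 'NN'),
])

# Catch-all first: it matches every token, so the later pattern never fires
catch_all_first = RegexpTagger([
    (r'.*', 'NN'),
    (r'.*ing$', 'VBG'),
])

ordered_tags = ordered.tag(tokens)
shadowed_tags = catch_all_first.tag(tokens)

print(ordered_tags)
print(shadowed_tags)
```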

N-Gram Taggers

UnigramTagger

The Unigram Tagger assigns each word the tag it carried most often in the training data, without looking at the surrounding words.

Python
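import nltk
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize

# The Penn Treebank sample corpus must be downloaded once
nltk.download('treebank')

# Load some training data
train_sents = treebank.tagged_sents()[:3000]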

# Initialize the tagger
unigram_tagger = UnigramTagger(train_sents)

# Tag the sentence
tagged_words = unigram_tagger.tag(word_tokenize("This is a simple sentence."))

print(tagged_words)
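
A UnigramTagger can only tag words it saw during training. The sketch below trains on a tiny hand-made corpus (invented for illustration, so no corpus download is needed) to show that an unseen word comes back tagged as None.

```python
from nltk.tag import UnigramTagger

# Tiny hand-made training set, for illustration only
toy_train = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
]

toy_unigram = UnigramTagger(toy_train)

# 'purrs' never appears in toy_train, so the tagger has no tag for it
result = toy_unigram.tag(['the', 'cat', 'purrs'])
print(result)
```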

BigramTagger and TrigramTagger

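These taggers work similarly to the UnigramTagger, but they also consider the context in which a word appears: the BigramTagger looks at the tag of the previous word, and the TrigramTagger looks at the tags of the previous two words.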

Python
from nltk.tag import BigramTagger, TrigramTagger

# Initialize the taggers (train_sents is the training data loaded in the UnigramTagger example)
bigram_tagger = BigramTagger(train_sents)
trigram_tagger = TrigramTagger(train_sents)

# Tag the sentence
tagged_words_bigram = bigram_tagger.tag(word_tokenize("This is another sentence."))
tagged_words_trigram = trigram_tagger.tag(word_tokenize("This is another sentence."))

print("Bigram:", tagged_words_bigram)
print("Trigram:", tagged_words_trigram)
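
Because a BigramTagger looks up the pair (previous tag, current word), one unseen word can derail the rest of the sentence: the unknown word gets None, and the next word's context then contains None, which was never seen in training either. A minimal sketch on a hand-made toy corpus (invented for illustration, not taken from the Treebank data):

```python
from nltk.tag import BigramTagger

# Tiny hand-made training set, for illustration only
toy_train = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')],
]

toy_bigram = BigramTagger(toy_train)

# Every (previous tag, word) pair here was seen in training
seen = toy_bigram.tag(['the', 'cat', 'sleeps'])

# 'bird' is unknown, so it and every word after it are left untagged
unseen = toy_bigram.tag(['the', 'bird', 'sleeps'])

print(seen)
print(unseen)
```

This is one reason n-gram taggers are usually given a backoff tagger.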

Combining Taggers

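You can also combine different taggers through the backoff parameter: each tagger handles the cases it knows and passes the rest to the next tagger in the chain, which usually improves accuracy.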

Python
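from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Create a sequence of taggers with backoff
# (train_sents is the training data loaded in the UnigramTagger example)
t0 = DefaultTagger('NN')  # last resort: tag every word as a noun
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)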

# Use the combined tagger
tagged_words = t2.tag(word_tokenize("Yet another example sentence."))

print(tagged_words)

Each of these taggers has its strengths and weaknesses, and the best choice depends on your specific application and requirements.
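
One way to compare taggers is to measure token-level accuracy on tagged sentences that were held out of training. The sketch below does this on a tiny hand-made corpus (invented for illustration; real comparisons need far more data). The token_accuracy helper is a hypothetical name written for this example, not an NLTK function.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Tiny hand-made corpus, for illustration only
train = [
    [('the', 'DT'), ('cat', 'NN'), ('sleeps', 'VBZ')],
    [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')],
    [('a', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')],
]
held_out = [
    [('a', 'DT'), ('cat', 'NN'), ('barks', 'VBZ')],
    [('the', 'DT'), ('bird', 'NN'), ('sleeps', 'VBZ')],
]


def token_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]
        predicted = tagger.tag(words)
        for (_, predicted_tag), (_, gold_tag) in zip(predicted, gold):
            correct += predicted_tag == gold_tag
            total += 1
    return correct / total


# A bigram tagger on its own versus the same tagger with a backoff chain
bigram_only = BigramTagger(train)
backoff_chain = BigramTagger(
    train, backoff=UnigramTagger(train, backoff=DefaultTagger('NN'))
)

bigram_only_acc = token_accuracy(bigram_only, held_out)
backoff_chain_acc = token_accuracy(backoff_chain, held_out)

print('Bigram only:  ', bigram_only_acc)
print('Backoff chain:', backoff_chain_acc)
```

On this toy data the backoff chain scores higher because the unknown word "bird" falls through to the default tag instead of leaving the rest of the sentence untagged.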

FAQs for POS Tagging Using NLTK

Q. What are the most common POS tags?

A. Tags you will see most often in NLTK output come from the Penn Treebank tagset and include NN (singular noun), NNP (singular proper noun), VB (verb, base form), and JJ (adjective).

Q. Can NLTK tag POS in languages other than English?

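A. Yes, but the support for languages other than English is limited. nltk.pos_tag accepts a lang argument, and the pretrained models cover English and Russian; for other languages you can train one of the taggers described above on a tagged corpus in that language.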

Q. What do I do if I encounter an error during installation?

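A. Check that you have a stable internet connection and that pip is installed for the Python interpreter you are using. Running python -m pip install nltk ties the installation to that interpreter.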

Q. What are the prerequisites for using NLTK for POS tagging?

A. Basic Python programming knowledge is enough, although a familiarity with natural language processing can be helpful.

Q. Can I use POS tagging for sentiment analysis?

A. While POS tagging itself won’t give you the sentiment, it can be a crucial preprocessing step for more advanced sentiment analysis algorithms.
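
For instance, one simple preprocessing idea is to keep only adjectives and adverbs, which often carry opinion words, before passing text to a sentiment model. The sketch below works on a hand-tagged token list in the format returned by nltk.pos_tag. The sentence, its tags, and the opinion_candidates helper are illustrative assumptions, not part of NLTK.

```python
# Hand-tagged example in the (word, tag) format that nltk.pos_tag returns
tagged_review = [('The', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('surprisingly', 'RB'),
                 ('good', 'JJ'), ('but', 'CC'), ('too', 'RB'), ('long', 'JJ'), ('.', '.')]


def opinion_candidates(tagged):
    """Keep words whose Penn Treebank tag marks an adjective (JJ*) or adverb (RB*)."""
    return [word for word, tag in tagged if tag.startswith(('JJ', 'RB'))]


candidates = opinion_candidates(tagged_review)
print(candidates)
```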