Text Classification using NLTK and Support Vector Machines

Text classification is a common task in natural language processing (NLP), where the goal is to assign text to predefined classes. In this tutorial, we walk through text classification with a Support Vector Machine (SVM), using Python’s Natural Language Toolkit (NLTK) for the data and scikit-learn for the model.

Introduction to Support Vector Machines

Support Vector Machines (SVMs) are supervised learning models used mainly for classification tasks. They work well with high-dimensional data, which makes them particularly suited to text classification, where every distinct word can become a feature. An SVM looks for the hyperplane that separates the classes with the widest possible margin.
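
To make the idea of a separating hyperplane concrete, here is a minimal sketch on made-up 2-D points (the data and the variable names are assumptions for illustration, not part of the movie review example). With a linear kernel, scikit-learn exposes the hyperplane's normal vector and offset as coef_ and intercept_.

```python
import numpy as np
from sklearn import svm

# Toy 2-D data: two linearly separable clusters (made up for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel makes the separating hyperplane directly readable
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
print("hyperplane: {:.2f}*x1 + {:.2f}*x2 + {:.2f} = 0".format(w[0], w[1], b))
print("support vectors:", clf.support_vectors_)

# New points are classified by the sign of w.x + b
predictions = clf.predict([[0.5, 0.5], [3.5, 3.5]])
print(predictions)
```

Only the support vectors, the training points closest to the boundary, determine where the hyperplane ends up.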


Setting up the Environment

First, let’s install the necessary packages.

Shell
pip install nltk
pip install scikit-learn

Now import the required libraries:

Python
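import nltk
import random
from nltk.corpus import movie_reviews
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm

# Fetch the movie reviews corpus (skipped if it is already downloaded)
nltk.download('movie_reviews')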

Data Preparation

For this example, we’ll use the movie reviews dataset available in NLTK. This dataset contains 2,000 movie reviews categorized as positive or negative.

Python
# Load the dataset
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the data
random.shuffle(documents)

Feature Extraction

We will convert the text data into numerical features using the Bag-of-Words model.

Python
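# Count how often each lowercased word occurs in the whole corpus
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Use the 2000 most frequent words as features
# (slicing list(all_words) would not sort by frequency)
word_features = [word for word, _ in all_words.most_common(2000)]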

# Function to create feature vector
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Create feature sets
featuresets = [(document_features(d), c) for (d, c) in documents]
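
The same steps can be tried on a handful of made-up documents without downloading the corpus. This is only a sketch: toy_docs, vocabulary and presence_features are hypothetical names, and collections.Counter stands in for nltk.FreqDist, which is a Counter subclass.

```python
from collections import Counter

# Three tiny tokenized "documents" (made up for illustration)
toy_docs = [
    ["great", "plot", "great", "cast"],
    ["dull", "plot"],
    ["great", "fun"],
]

# Count word frequencies over the whole toy corpus
counts = Counter(word for doc in toy_docs for word in doc)

# Keep the 2 most frequent words as the vocabulary
vocabulary = [word for word, _ in counts.most_common(2)]


def presence_features(tokens, vocabulary):
    """Map a token list to {'contains(word)': bool} over the vocabulary."""
    present = set(tokens)
    return {"contains({})".format(word): (word in present) for word in vocabulary}


toy_features = [presence_features(doc, vocabulary) for doc in toy_docs]
for features in toy_features:
    print(features)
```

Each document becomes a dictionary with one boolean per vocabulary word, so the representation records whether a word is present, not how often it occurs.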

Building the SVM Model

Let’s split the data into training and testing sets and build the SVM model.

Python
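# Split the raw (words, label) documents: 100 for testing, the rest for training.
# CountVectorizer needs the review text itself, not the feature dictionaries,
# whose keys are identical for every document.
train_set, test_set = documents[100:], documents[:100]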

# Prepare the data for sklearn
train_texts = [" ".join(d) for (d, c) in train_set]
train_labels = [c for (d, c) in train_set]

vectorizer = CountVectorizer()
train_vectors = vectorizer.fit_transform(train_texts)

# Create the SVM model
classifier = svm.SVC()
classifier.fit(train_vectors, train_labels)
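
An alternative route is to train on feature dictionaries like the ones built in the Feature Extraction step. NLTK ships a wrapper, nltk.classify.scikitlearn.SklearnClassifier, that accepts (features, label) pairs and handles vectorization and label encoding itself. The sketch below uses made-up dictionaries (toy_train and wrapped are hypothetical names) so it runs without the corpus.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn import svm

# Toy labeled featuresets in the same shape document_features() produces
toy_train = [
    ({"contains(great)": True, "contains(dull)": False}, "pos"),
    ({"contains(great)": True, "contains(dull)": False}, "pos"),
    ({"contains(great)": False, "contains(dull)": True}, "neg"),
    ({"contains(great)": False, "contains(dull)": True}, "neg"),
]

# The wrapper turns the dictionaries into a matrix and encodes the labels
wrapped = SklearnClassifier(svm.SVC(kernel="linear"))
wrapped.train(toy_train)

# Classify a single, unseen feature dictionary
toy_prediction = wrapped.classify(
    {"contains(great)": True, "contains(dull)": False})
print(toy_prediction)
```

With the real data, the list of (features, label) pairs from the Feature Extraction step could be split and passed to train() in the same way.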

Evaluation

Let’s evaluate the model on the test set.

Python
# Prepare the test data
test_texts = [" ".join(d) for (d, c) in test_set]
test_labels = [c for (d, c) in test_set]
test_vectors = vectorizer.transform(test_texts)

# Evaluate the model
accuracy = classifier.score(test_vectors, test_labels)
print(f"Accuracy: {accuracy * 100:.2f}%")
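
Accuracy alone can hide how the errors are distributed across classes. As a sketch, the snippet below computes precision, recall and F1 with sklearn.metrics on made-up labels (y_true and y_pred are illustrative values, not results from the movie review model).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up gold labels and predictions, for illustration only
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]

acc = accuracy_score(y_true, y_pred)

# Treat "pos" as the positive class for the binary metrics
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label="pos")

print("accuracy:", acc)
print("precision:", precision, "recall:", recall, "f1:", f1)
```

For the tutorial's model, the predictions would come from classifier.predict(test_vectors) and the gold labels from test_labels.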

FAQs for Text Classification using NLTK and Support Vector Machines

Q. What are Support Vector Machines?
A. Support Vector Machines (SVMs) are supervised learning algorithms used mainly for classification tasks. They work by finding a hyperplane that best separates the data into different classes.

Q. Why use SVM for text classification?
A. SVMs work well with high-dimensional data, making them well-suited for text classification problems where the feature space is large.

Q. How do I choose the features for my text data?
A. Common methods include using Bag-of-Words, TF-IDF, or word embeddings. In this tutorial, we used the Bag-of-Words model.

Q. How do I evaluate my model?
A. Common metrics for evaluation include accuracy, precision, recall, and F1 score. In this tutorial, we used accuracy as our evaluation metric.

Q. Can I use other kernels in the SVM?
A. Yes, SVMs support various kernel functions such as linear, polynomial, and radial basis function (RBF). You choose one through the kernel parameter when initializing the SVC class, for example svm.SVC(kernel='linear'); the default is RBF.
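
To see the effect of the kernel choice, here is a small sketch on XOR-style toy data, which no single straight line can separate (X_xor, y_xor and kernel_scores are hypothetical names, and the scores are training accuracy only).

```python
from sklearn import svm

# XOR-style toy data: not separable by a single straight line
X_xor = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_xor = [0, 0, 1, 1]

kernel_scores = {}
for kernel in ("linear", "poly", "rbf"):
    model = svm.SVC(kernel=kernel)  # the kernel is chosen at construction time
    model.fit(X_xor, y_xor)
    # Accuracy on the training points themselves, just to compare kernels
    kernel_scores[kernel] = model.score(X_xor, y_xor)

print(kernel_scores)
```

A linear kernel cannot classify all four points correctly here, whereas non-linear kernels have the flexibility to do so. For text data with many features, a linear kernel is often a reasonable first choice, so it is worth comparing kernels on a held-out set.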