Text Classification

Word2Vec Tutorial: A Text Classification Model

Word2Vec is a group of related models that are used to produce word embeddings, which are dense vector representations of words. These models are shallow, two-layer neural networks trained to reconstruct linguistic contexts of words. Word2Vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space.

Word2Vec Model

Key features and concepts of Word2Vec include:

  1. Continuous Bag-of-Words (CBOW) Model: This model predicts a target word from its surrounding context words (the ‘bag of words’). It averages the context words’ vectors and uses this as a prediction for the target word in the training process.
  2. Skip-Gram Model: Contrary to CBOW, Skip-Gram predicts context words from a target word. It uses the target word’s vector to predict the surrounding context words. This model tends to work better with small datasets and is good at capturing more distant word relationships.
  3. Semantic Information: Word2Vec models capture the semantic meaning of words. Words that are used in similar contexts tend to have vectors that are close to each other in the vector space.
  4. Vector Operations: The vector representations can be manipulated in meaningful ways. For example, vector calculations like “vector(‘king’) – vector(‘man’) + vector(‘woman’)” can result in a vector very close to “queen” (see the short gensim sketch after this list).
  5. Training and Efficiency: Word2Vec uses a sliding window approach to train the model, looking at a window of words at a time. The training process is efficient because it uses techniques such as hierarchical softmax or negative sampling.
  6. Applications: Word2Vec embeddings are used in many natural language processing applications such as sentiment analysis, machine translation, and named entity recognition.
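
As a quick illustration of point 4, the snippet below loads a small set of pretrained vectors through gensim’s downloader and runs the classic king/man/woman analogy. The model name glove-wiki-gigaword-50 and the download step are illustrative assumptions; any pretrained KeyedVectors object works the same way.

Python
import gensim.downloader as api

# Load a small set of pretrained word vectors (assumed model name; downloads on first use)
vectors = api.load('glove-wiki-gigaword-50')

# vector('king') - vector('man') + vector('woman') should land near vector('queen')
print(vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))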

The success of Word2Vec has been a major driving force behind the popularity of word embeddings in natural language processing, providing a way to capture semantic relationships between words and use them in various machine learning models.

Building a text classification model with Word2Vec involves several steps, from data preprocessing to model training and evaluation. Here’s a step-by-step tutorial, complete with detailed code examples.

Step 1: Install Necessary Libraries

You’ll need libraries like gensim for Word2Vec, scikit-learn for the classifier and metrics, and nltk for text preprocessing.

For more on NLTK, read: Introduction to NLTK: Your Gateway to Natural Language Processing

Python
!pip install gensim scikit-learn nltk

Step 2: Import Libraries

Python
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
import numpy as np
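
Note: on recent NLTK releases, word_tokenize also relies on the punkt_tab resource. If the preprocessing step later raises a LookupError mentioning punkt_tab, the extra download below should fix it (this is an addition for newer NLTK versions, not part of the original setup).

Python
import nltk
# Only required on newer NLTK versions; skip if 'punkt' alone works for you
nltk.download('punkt_tab')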

Step 3: Data Preparation

For this tutorial, let’s assume you have a dataset data with two columns: text (the documents) and label (the class labels). Replace this with your actual dataset.

We will use the “IMDB Movie Review Dataset“. The dataset can be downloaded from the link: IMDB_Dataset.csv. We assume that the dataset has been saved locally (the code below opens it as IMDB Dataset.csv). The dataset has a few thousand reviews, and each review is stored in the following two columns:

review: review text

sentiment: positive/negative

Python
import csv

# Read the CSV file into a list of [review, sentiment] rows
data = []
with open('IMDB Dataset.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        data.append(row)

data.pop(0)  # remove the header row: ['review', 'sentiment']

data[0]  # check the first element of the data: ['.......', 'positive']
Python
# separate the review texts from the sentiment labels
texts = [item[0] for item in data]
labels = [item[1] for item in data]

Step 4: Text Preprocessing

Tokenize the text and remove stopwords.

Python
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercase the text and split it into tokens
    tokens = word_tokenize(text.lower())
    # Drop common English stopwords
    filtered_tokens = [w for w in tokens if w not in stop_words]
    return filtered_tokens

processed_texts = [preprocess_text(text) for text in texts]
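
A quick sanity check on a made-up sentence (the example text below is ours, not from the dataset) shows what the preprocessing produces. Note that punctuation tokens are kept, since only stopwords are filtered out.

Python
# Hypothetical example input
preprocess_text("This movie was not as good as I expected!")
# Expected output (roughly): ['movie', 'good', 'expected', '!']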

Step 5: Create Word2Vec Model

Train a Word2Vec model on your dataset.

Python
word2vec_model = Word2Vec(processed_texts, vector_size=100, window=5, min_count=1, workers=4)
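
Once training finishes, the model’s word vectors can be inspected directly. The lookups below assume the word ‘movie’ made it into the vocabulary, which is almost certain for an IMDB corpus but still an assumption.

Python
# Each vocabulary word now has a 100-dimensional vector
print(word2vec_model.wv['movie'].shape)  # (100,)

# Words used in similar contexts end up close together in the vector space
print(word2vec_model.wv.most_similar('movie', topn=5))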

Step 6: Feature Engineering

Convert text data into numerical data using Word2Vec vectors.

Python
def document_vector(doc):
    # Keep only in-vocabulary words (key_to_index is a dict, so membership tests are fast)
    doc = [word for word in doc if word in word2vec_model.wv.key_to_index]
    if not doc:
        # Document has no in-vocabulary words left: fall back to a zero vector
        return np.zeros(word2vec_model.vector_size)
    # Average the word vectors to get a single fixed-length vector per document
    return np.mean(word2vec_model.wv[doc], axis=0)

X = np.array([document_vector(text) for text in processed_texts])
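
Each review is now a single 100-dimensional vector, one row per document, which you can confirm with a quick shape check:

Python
print(X.shape)  # (number_of_reviews, 100)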

Step 7: Prepare Training and Test Sets

Split the data into training and testing sets.

Python
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

Step 8: Train a Classifier

Train a classifier, like a RandomForest, on the training data.

Python
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

Step 9: Evaluate the Model

Evaluate the model on the test set.

Python
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
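
To classify a brand-new review with the trained pipeline, apply the same preprocessing and averaging before calling predict. The review text below is an invented example, not taken from the dataset.

Python
# Hypothetical new review
new_review = "A wonderful film with great acting and a touching story."
new_vector = document_vector(preprocess_text(new_review)).reshape(1, -1)
print(clf.predict(new_vector))  # e.g. ['positive']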

Conclusion

This tutorial provides a basic approach to text classification using Word2Vec for feature extraction. You can improve this model by experimenting with different preprocessing techniques, Word2Vec parameters, and classification algorithms. Remember to handle edge cases, like documents with no words after preprocessing, in a real-world scenario.