Topic modeling is a form of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. NLTK (the Natural Language Toolkit) is a powerful Python library for working with human language data (text). However, it’s important to note that NLTK by itself doesn’t provide a direct way to do topic modeling. Instead, it assists in preprocessing the text data to make it suitable for topic modeling, which can then be performed using another library like gensim. In this tutorial, we will go through the preprocessing steps using NLTK and then demonstrate how to perform topic modeling using gensim.
Step 1: Install NLTK and Gensim
Before we start, you need to install NLTK. Please refer to Installing NLTK on Windows, Mac and Linux.
Then install gensim if you haven’t already. You can install it using pip:
pip install gensim
Step 2: Import Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import gensim
from gensim import corpora
Step 3: Prepare Your Documents
For this tutorial, let’s assume we have a set of documents about different types of animals.
doc1 = "Cats are natural hunters."
doc2 = "Dogs are very loyal companions."
doc3 = "Goldfish are common pets."
# Combine documents into a list
docs = [doc1, doc2, doc3]
Step 4: Preprocessing
Here, we will tokenize the documents, remove stop words, and lemmatize the remaining words.
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
stop = set(stopwords.words('english'))
lemma = WordNetLemmatizer()
def preprocess(text):
    # Tokenize and lowercase
    tokens = nltk.word_tokenize(text.lower())
    # Remove stopwords and non-alphabetic tokens, then lemmatize
    tokens = [lemma.lemmatize(word) for word in tokens if word.isalpha() and word not in stop]
    return tokens
# Preprocess the documents
doc_clean = [preprocess(doc) for doc in docs]
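To see what this pipeline produces without needing the NLTK downloads, here is a minimal stdlib-only sketch of the same idea (lowercase, keep alphabetic tokens, drop stop words). The tiny `STOP` set and the skipped lemmatization are simplifications for illustration, not what NLTK actually uses:

```python
import re

# A tiny hand-picked stop-word set standing in for NLTK's full English list
STOP = {"are", "very", "a", "the"}

def simple_preprocess(text):
    # Lowercase and keep runs of alphabetic characters only
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words (no lemmatization in this sketch)
    return [t for t in tokens if t not in STOP]

docs = ["Cats are natural hunters.", "Dogs are very loyal companions."]
print([simple_preprocess(d) for d in docs])
# → [['cats', 'natural', 'hunters'], ['dogs', 'loyal', 'companions']]
```

The real `preprocess` above additionally lemmatizes, so "cats" would become "cat" and "hunters" would become "hunter".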
Step 5: Prepare Document-Term Matrix
Next, we create a document-term matrix using the gensim library.
# Creating the term dictionary of our corpus, where every unique term is assigned an index
dictionary = corpora.Dictionary(doc_clean)
# Converting list of documents (corpus) into Document Term Matrix using the dictionary
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]
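Conceptually, `doc2bow` maps each token to an integer id and counts its occurrences, yielding `(term_id, count)` pairs per document. The following stdlib-only imitation (not gensim's actual implementation) sketches that behavior:

```python
from collections import Counter

def build_dictionary(docs_tokens):
    # Assign each unique term a stable integer id, similar in spirit to corpora.Dictionary
    terms = sorted({t for doc in docs_tokens for t in doc})
    return {term: i for i, term in enumerate(terms)}

def doc2bow(doc_tokens, dictionary):
    # Count tokens and return sorted (term_id, count) pairs
    counts = Counter(dictionary[t] for t in doc_tokens if t in dictionary)
    return sorted(counts.items())

doc_clean = [["cat", "natural", "hunter"], ["dog", "loyal", "companion"]]
dictionary = build_dictionary(doc_clean)
print([doc2bow(doc, dictionary) for doc in doc_clean])
# → [[(0, 1), (3, 1), (5, 1)], [(1, 1), (2, 1), (4, 1)]]
```

Each pair `(0, 1)` reads as "term with id 0 appears once in this document"; repeated words simply get a higher count.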
Step 6: Create the LDA Model
Now we can create an LDA model using gensim. LDA, or Latent Dirichlet Allocation, is a popular algorithm for topic modeling.
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel
# Running and Training LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3, id2word=dictionary, passes=50)
Step 7: Results
Finally, we can view the topics in our documents.
print(ldamodel.print_topics(num_topics=3, num_words=3))
Each entry corresponds to a topic, listing its top terms with their weights. A higher weight means the term is more important to that topic.
Detailed Example and Analysis
Imagine you have executed the above steps with a larger set of documents related to animals. The LDA might output topics like:
[(0, '0.167*"cat" + 0.083*"hunter" + 0.083*"natural"'),
(1, '0.125*"dog" + 0.125*"loyal" + 0.125*"companion"'),
(2, '0.154*"goldfish" + 0.154*"pet" + 0.154*"common"')]
Here we see that the model has found three topics. In a more detailed analysis, we could infer:
- Topic 0 might be about cats and hunting.
- Topic 1 seems to relate to dogs and companionship.
- Topic 2 appears to discuss goldfish as pets.
These topics make sense given our original documents.
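gensim returns each topic as a formatted string like those above (for structured output, `show_topics(formatted=False)` is also available). If you do want to work with the formatted strings, a small hypothetical parser for the `weight*"term"` format shown above could look like this:

```python
import re

def parse_topic(topic_str):
    # Split a string like '0.167*"cat" + 0.083*"hunter"' into (term, weight) pairs
    return [(term, float(w))
            for w, term in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)]

print(parse_topic('0.167*"cat" + 0.083*"hunter" + 0.083*"natural"'))
# → [('cat', 0.167), ('hunter', 0.083), ('natural', 0.083)]
```

This makes it easy to, say, keep only terms above a weight threshold when labeling topics.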
FAQs
Q. What is topic modeling?
A. Topic modeling is a type of statistical modeling for discovering the abstract topics that occur in a collection of documents. It’s used in natural language processing to classify and organize large volumes of textual information.
Q. Why do we need to preprocess text data?
A. Preprocessing helps in cleaning and preparing text data for modeling by removing noise and irrelevant words, such as punctuation and stop words, so the model can focus on meaningful terms.