Text summarization is an application of Natural Language Processing (NLP) that condenses a text document into a shorter summary while preserving its key information. In this tutorial, we will walk through a basic extractive text summarization using NLTK, Python's Natural Language Toolkit.
Getting Started with NLTK
Before we start, you need to install NLTK. Please refer to Installing NLTK on Windows, Mac and Linux.
Once installed, you may need to download certain NLTK resources:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
The punkt resource provides the pre-trained models used for sentence and word tokenization, and stopwords supplies the lists of common words we will filter out of our text.
Text Summarization Steps
1. Import Required Libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
2. Prepare the Text
For this example, let’s use a simple paragraph to summarize:
text = """Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."""
3. Tokenize the Text
Split the text into sentences and words:
sentences = sent_tokenize(text) # Tokenize sentences
4. Frequency Table of Words
Create a frequency table for the most frequent words, excluding the stopwords:
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

# Count occurrences of each lowercased, non-stopword token
freqTable = dict()
for word in words:
    word = word.lower()
    if word not in stopWords:
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1
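The same frequency table can also be built with collections.Counter from the standard library. The token list and stopword set below are small hand-written stand-ins for what word_tokenize and stopwords.words("english") would produce:

```python
from collections import Counter

# Hypothetical tokens standing in for word_tokenize(text)
words = ["NLP", "is", "a", "field", "of", "study", ".", "NLP", "is", "useful", "."]
stop_words = {"is", "a", "of"}  # tiny stand-in for the NLTK English stopword list

# Count every lowercased token that is not a stopword
freq_table = Counter(w.lower() for w in words if w.lower() not in stop_words)

print(freq_table["nlp"])  # 2
```

Counter behaves like a dict, so the rest of the tutorial works unchanged with either version.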
5. Scoring Sentences
Score each sentence based on the number of frequent words it contains:
sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        # Note: this is a substring check on the lowercased sentence
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
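Be aware that "word in sentence.lower()" is a substring match, so "art" would also match inside "part". A minimal sketch of the scoring loop on hypothetical data:

```python
# Hypothetical inputs standing in for the tutorial's freqTable and sentences
freq_table = {"language": 3, "computer": 2}
sentences = ["Language models run on a computer.", "This sentence mentions neither."]

sentence_value = {}
for sentence in sentences:
    for word, freq in freq_table.items():
        # Substring match: "language" is found inside "Language models..."
        if word in sentence.lower():
            sentence_value[sentence] = sentence_value.get(sentence, 0) + freq

print(sentence_value)  # {'Language models run on a computer.': 5}
```

The second sentence contains no frequent word, so it never enters the dictionary; this is why the summary step later checks "sentence in sentenceValue" before looking up a score.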
6. Average Score Calculation
Calculate the average score for the sentences:
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average score of a sentence from the original text
average = int(sumValues / len(sentenceValue))
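The same average can be written more idiomatically with the built-in sum(); note that int() truncates the result, matching the tutorial's code. The scores below are hypothetical:

```python
# Hypothetical sentence scores standing in for sentenceValue
sentence_value = {"s1": 5, "s2": 2, "s3": 4}

# sum of scores divided by number of scored sentences, truncated to an int
average = int(sum(sentence_value.values()) / len(sentence_value))

print(average)  # 3  (11 / 3 = 3.67, truncated)
```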
7. Generating the Summary
Create the summary by adding sentences whose score exceeds 1.2 times the average:
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence
8. Displaying the Summary
Let’s print out our summary:
print(summary)
For our example text, the output might look something like this:
It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.
FAQs: Text Summarization Using NLTK
Q. What is NLTK?
A. NLTK, or Natural Language Toolkit, is a library in Python for working with human language data (texts). It provides easy-to-use interfaces to over 50 corpora and lexical resources and a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
Q. Why do we remove stopwords in text summarization?
A. Stopwords are common words that are filtered out before processing because they carry little meaning on their own. They include words like 'is', 'and', 'the'. Removing them lets the frequency counts focus on the informative words in the text.
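For illustration, here is stopword filtering with a small hand-picked set; NLTK's English list (stopwords.words("english")) is considerably larger:

```python
# Small hand-picked stopword set; NLTK's English list is much longer
stop_words = {"the", "is", "and", "a"}

tokens = ["the", "cat", "is", "on", "the", "mat", "and", "happy"]
content_words = [t for t in tokens if t not in stop_words]

print(content_words)  # ['cat', 'on', 'mat', 'happy']
```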
Q. How do you determine the importance of a sentence?
A. The importance of a sentence is often determined by the occurrence of high-frequency words. If a sentence contains more high-frequency words, it is considered more important, which is a basic approach. More sophisticated methods may involve machine learning and understanding of the context.
Q. Can NLTK be used for complex text summarization?
A. NLTK provides the basic tools for text processing and can be used to build a simple text summarizer. However, for more complex tasks involving understanding the context and semantics of the text, additional tools and advanced algorithms like deep learning are typically used.
Q. Is there a difference between extractive and abstractive summarization?
A. Yes, extractive summarization involves selecting phrases and sentences from the source document to make up the new summary. In contrast, abstractive summarization involves understanding the main ideas and expressing them in new ways.