
Text Summarization using NLTK: A Tutorial

Text summarization is an application of Natural Language Processing (NLP) that condenses a text document into a shorter summary while preserving its key information. In this tutorial, we will walk through a basic extractive summarization pipeline using Python's Natural Language Toolkit (NLTK).


Getting Started with NLTK

Before we start, you need to install NLTK. Please refer to Installing NLTK on Windows, Mac and Linux.

Once installed, you may need to download certain NLTK resources:

Python
import nltk

nltk.download('punkt')      # Punkt models for sentence tokenization
nltk.download('stopwords')  # list of common English stopwords

These resources support sentence tokenization and provide the stopword list we will filter out of our text.

Text Summarization Steps

1. Import Required Libraries

Python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

2. Prepare the Text

For this example, let’s use a simple paragraph to summarize:

Python
text = """Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."""

3. Tokenize the Text

Split the text into sentences (we will tokenize the words in the next step):

Python
sentences = sent_tokenize(text)  # Tokenize sentences
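
If you want to sanity-check the split, inspect the result; note that the exact sentence count depends on how the punkt model handles abbreviations such as "i.e.":

Python
print(len(sentences))  # number of sentences punkt detected
print(sentences[0])    # first sentence of the text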

4. Frequency Table of Words

Create a frequency table of the words in the text, excluding stopwords:

Python
stopWords = set(stopwords.words("english"))
words = word_tokenize(text)

# Count how often each non-stopword appears in the text
freqTable = dict()
for word in words:
    word = word.lower()
    if word not in stopWords:
        if word in freqTable:
            freqTable[word] += 1
        else:
            freqTable[word] = 1
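
As an aside, the same table can be built more concisely with collections.Counter from Python's standard library; a minimal, equivalent sketch:

Python
from collections import Counter

# Equivalent frequency table built with Counter (a dict subclass)
freqTable = Counter(
    word.lower() for word in words if word.lower() not in stopWords
)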

5. Scoring Sentences

Score each sentence by summing the frequencies of the frequent words it contains:

Python
sentenceValue = dict()

for sentence in sentences:
    for word, freq in freqTable.items():
        # Note: this is a substring check, so "art" would also match "part"
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
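
Because the substring check can over-count, you may prefer exact word matches; a minimal alternative sketch that tokenizes each sentence first:

Python
# Alternative: score with exact token matches rather than substrings
sentenceValue = dict()
for sentence in sentences:
    sentenceWords = set(word_tokenize(sentence.lower()))
    score = sum(freq for word, freq in freqTable.items() if word in sentenceWords)
    if score > 0:
        sentenceValue[sentence] = score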

6. Average Score Calculation

Calculate the average score for the sentences:

Python
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]

# Average value of a sentence from the original text
average = int(sumValues / len(sentenceValue))

7. Generating the Summary

Create the summary by adding sentences whose score is above 1.2 times the average:

Python
summary = ''
for sentence in sentences:
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        summary += " " + sentence

8. Displaying the Summary

Let’s print out our summary:

Python
print(summary)

For our example text, the output might look something like this:

Output
It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.
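
For convenience, here is the entire pipeline from the steps above as a single runnable script (the counting logic is shortened slightly with dict.get):

Python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

text = """Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate speech. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves."""

# Tokenize into sentences and words
sentences = sent_tokenize(text)
words = word_tokenize(text)

# Build the word frequency table, skipping stopwords
stopWords = set(stopwords.words("english"))
freqTable = dict()
for word in words:
    word = word.lower()
    if word not in stopWords:
        freqTable[word] = freqTable.get(word, 0) + 1

# Score each sentence by the frequencies of the words it contains
sentenceValue = dict()
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            sentenceValue[sentence] = sentenceValue.get(sentence, 0) + freq

# Keep sentences scoring above 1.2 times the average
average = int(sum(sentenceValue.values()) / len(sentenceValue))
summary = ''
for sentence in sentences:
    if sentence in sentenceValue and sentenceValue[sentence] > 1.2 * average:
        summary += " " + sentence

print(summary)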

FAQs: Text Summarization using NLTK

Q. What is NLTK?
A. NLTK, or Natural Language Toolkit, is a library in Python for working with human language data (texts). It provides easy-to-use interfaces to over 50 corpora and lexical resources and a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
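
As a quick illustration, here is a minimal sketch of a few of these tools; note that nltk.pos_tag needs an extra resource beyond what this tutorial uses (named averaged_perceptron_tagger in most NLTK versions, averaged_perceptron_tagger_eng in the newest ones):

Python
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

nltk.download('averaged_perceptron_tagger')  # needed for pos_tag

tokens = word_tokenize("NLTK makes processing texts easy.")
print(nltk.pos_tag(tokens))                       # part-of-speech tagging
print([PorterStemmer().stem(t) for t in tokens])  # stemming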

Q. Why do we remove stopwords in text summarization?
A. Stopwords are words which are filtered out before processing because they are considered to be irrelevant for the meaning of the text. They include words like ‘is’, ‘and’, ‘the’, etc. Removing them helps in focusing on the important information.
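
For example, filtering a short sentence against NLTK's English stopword list keeps only the content words; a minimal sketch:

Python
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopWords = set(stopwords.words("english"))
tokens = word_tokenize("This is an example of a simple sentence")
print([t for t in tokens if t.lower() not in stopWords])
# ['example', 'simple', 'sentence']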

Q. How do you determine the importance of a sentence?
A. In a basic approach like the one in this tutorial, the importance of a sentence is determined by the occurrence of high-frequency words: a sentence containing more high-frequency words is considered more important. More sophisticated methods may involve machine learning and an understanding of the context.

Q. Can NLTK be used for complex text summarization?
A. NLTK provides the basic tools for text processing and can be used to build a simple text summarizer. However, for more complex tasks involving understanding the context and semantics of the text, additional tools and advanced algorithms like deep learning are typically used.

Q. Is there a difference between extractive and abstractive summarization?
A. Yes, extractive summarization involves selecting phrases and sentences from the source document to make up the new summary. In contrast, abstractive summarization involves understanding the main ideas and expressing them in new ways.