FastText: A Guide to Text Representation Models for NLP

Introduction

In the ever-evolving domain of natural language processing (NLP), efficient and robust text representation is crucial. Enter FastText, an extension of the Word2Vec model, developed by researchers at Facebook AI Research (FAIR). FastText not only learns vector representations for words but also for sub-word units, making it exceptional in handling rare words, morphologically rich languages, and even misspellings.

Technical Overview

Sub-word Information

Unlike Word2Vec, FastText represents each word as a bag of character n-grams in addition to the word itself. For example, the word “apple” with n=3 would be represented as <ap, app, ppl, ple, le>, including the boundary symbols < and > to denote the beginning and end of the word. This method allows the model to capture the internal structure of words.
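
To make the n-gram idea concrete, here is a minimal Python sketch of boundary-marked trigram extraction; it mirrors the example above and is not fastText's internal implementation:

Python
# Extract boundary-marked character n-grams, as in the <apple> example.
def char_ngrams(word, n=3):
    wrapped = "<" + word + ">"  # add the < and > boundary symbols
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

print(char_ngrams("apple"))  # ['<ap', 'app', 'ppl', 'ple', 'le>']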

Model Architecture

FastText can use both the Continuous Bag of Words (CBOW) and Skip-Gram models, similar to Word2Vec. The key difference is that FastText treats each word as a bag of character n-grams, so the word embeddings are composed of the sum of these character n-gram embeddings.
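
The composition step can be sketched with a toy table. This is illustrative only: real fastText hashes n-grams into a fixed-size bucket table rather than storing a dictionary, but the summation is the same idea:

Python
import numpy as np

# Toy embedding table: one random vector per trigram of "<apple>",
# plus the whole word itself.
dim = 4
rng = np.random.default_rng(0)
ngrams = ["<ap", "app", "ppl", "ple", "le>", "<apple>"]
table = {g: rng.normal(size=dim) for g in ngrams}

# The word embedding is the sum of its n-gram embeddings.
word_vector = np.sum([table[g] for g in ngrams], axis=0)
print(word_vector)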

Training

FastText uses hierarchical softmax or negative sampling, like Word2Vec, for training efficiency. The inclusion of sub-word information leads to better performance, especially for languages with rich morphology.
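
In the official Python package, the loss and the character n-gram range are exposed as training parameters. A minimal sketch, assuming data.txt is a placeholder path to a plain-text corpus:

Python
import fasttext

# Skip-gram with negative sampling ('ns'); pass loss='hs' for
# hierarchical softmax. minn/maxn set the character n-gram range.
model = fasttext.train_unsupervised("data.txt", model="skipgram",
                                    loss="ns", minn=3, maxn=6, dim=100)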

Features and Advantages

  1. Handling Rare Words: FastText can compute representations for words not seen during training (out-of-vocabulary words) by summing the vectors of their sub-parts (see the sketch after this list).
  2. Morphological Richness: It’s exceptionally useful for agglutinative languages (like Turkish or Finnish), where the meaning of words can be altered with affixes.
  3. Robustness to Misspellings: Due to its sub-word approach, it can handle misspellings or variations in words.
  4. Multilingual Capabilities: FastText provides pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using CBOW with position-weights.
  5. Efficiency: Despite its detailed approach, FastText is efficient in both training and inference.
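
A quick sketch of the out-of-vocabulary behavior from point 1, assuming model.bin is any previously trained fastText model file:

Python
import fasttext

model = fasttext.load_model("model.bin")  # placeholder model file
# A misspelled, likely unseen word still gets a vector, composed
# from the vectors of its character n-grams.
vec = model.get_word_vector("appple")
print(vec.shape)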

Applications

FastText’s applications span a wide range of NLP tasks:

  1. Text Classification: Its ability to understand sub-word information makes it suitable for tasks like sentiment analysis and topic categorization.
  2. Word Representations: Useful in tasks requiring semantic understanding, like machine translation and named entity recognition.
  3. Language Identification: FastText’s robustness to misspellings and morphological richness makes it ideal for identifying the language of a given text (see the sketch after this list).
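
For point 3, fastText ships an official pre-trained language-identification model, lid.176.bin, downloadable from https://fasttext.cc/docs/en/language-identification.html. A minimal sketch:

Python
import fasttext

# Predict the two most likely languages for a sentence.
lid_model = fasttext.load_model("lid.176.bin")
print(lid_model.predict("Bonjour tout le monde", k=2))
# e.g. (('__label__fr', '__label__en'), array([...]))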

Libraries and Tools

FastText is open-source and its official implementation is in C++. However, Python wrappers are also available for ease of use. PyPI hosts the fasttext Python package.

Code Example

Package installation

Install the Python package for FastText:

Bash
pip install fasttext

Getting and preparing the data

To start the tutorial, we need labeled data to train our supervised classifier. In this tutorial, we are interested in building a classifier that automatically recognizes the topic of a Stack Exchange question about cooking. Let’s download examples of questions from the cooking section of Stack Exchange, along with their associated tags:

Bash
wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz && tar xvzf cooking.stackexchange.tar.gz
# view some data
head cooking.stackexchange.txt

Output

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there

In the text file cooking.stackexchange.txt, each line contains a list of labels followed by the corresponding text. Every label starts with the __label__ prefix, which is how fastText distinguishes labels from words. The model is then trained to predict the labels given the words in the document.

To prepare the data for training the classifier, we need to split it into training and validation sets. We will use the validation set to evaluate how well the learned classifier performs on new data.

The dataset contains 15404 examples. Let’s split it into a training set of 12404 examples and a validation set of 3000 examples:

Bash
head -n 12404 'cooking.stackexchange.txt' > 'cooking.train'
tail -n 3000 'cooking.stackexchange.txt' > 'cooking.valid'

Training the Classifier Model

Now we can train the classifier using the Python package:

Python
import fasttext

# train the classifier
model = fasttext.train_supervised(input="cooking.train")
# save the model
model.save_model("model_cooking.bin")
# predict the top 3 labels for a new question
model.predict("Which dish is best to eat with fruit salad ?", k=3)

Output

Python
(('__label__food-safety', '__label__baking', '__label__equipment'),
 array([0.02835034, 0.0257651 , 0.0180293 ]))
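
Before improving the model, it is worth measuring it. model.test returns the number of validation examples along with precision and recall at k (1 by default):

Python
# Evaluate on the held-out validation set.
n_examples, precision, recall = model.test("cooking.valid")
print(n_examples, precision, recall)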

Training the Classifier with Preprocessed Text

A bit of preprocessing, separating punctuation from words and lowercasing the text, often improves the classifier’s accuracy:

Bash
cat cooking.stackexchange.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > cooking.preprocessed.txt
head -n 12404 cooking.preprocessed.txt > cooking.train
tail -n 3000 cooking.preprocessed.txt > cooking.valid

Now train the model again on the preprocessed data using the Python package:

Python
# Train the model with the preprocessed text
model = fasttext.train_supervised(input="cooking.train")
# get the prediction
model.predict("Why not put knives in the dishwasher?")

Output

Python
(('__label__food-safety',), array([0.10605929]))
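
A natural next step from the official tutorial is hyperparameter tuning; on this dataset, more epochs, a higher learning rate, and word bigrams typically help:

Python
# Retrain with common tuning knobs: more epochs, a higher learning
# rate, and word bigrams as additional features.
model = fasttext.train_supervised(input="cooking.train",
                                  lr=1.0, epoch=25, wordNgrams=2)
print(model.test("cooking.valid"))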

Conclusion

FastText offers an innovative approach to word representation, addressing some of the limitations of previous models like Word2Vec. Its ability to understand sub-word elements and handle a variety of linguistic challenges makes it a powerful tool in the NLP toolkit.

References

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606.
  • FastText. (n.d.). Retrieved from https://fasttext.cc/

FAQs for FastText Text Representation

Here are some frequently asked questions about FastText text representation:

Q. What is FastText text representation?

FastText text representation is a method for converting words and text into numerical vectors. It is an extension of the Word2Vec model that not only learns vector representations for words but also for sub-word units, such as character n-grams.

Q. How does FastText represent words?

FastText represents words as a bag of character n-grams in addition to the word itself. For example, the word “apple” with n=3 would be represented as <ap, app, ppl, ple, le>, where < and > represent the beginning and end of the word. This approach allows the model to capture the internal structure of words.

Q. What is the advantage of using sub-word information in FastText?

The inclusion of sub-word information in FastText has several advantages. It helps in handling rare words by computing representations for out-of-vocabulary words based on the vectors of their sub-parts. It is particularly useful for languages with rich morphology, where words can be altered with affixes. Additionally, FastText can handle misspellings or variations in words due to its sub-word approach.

Q. How is FastText trained?

FastText training utilizes hierarchical softmax or negative sampling, similar to Word2Vec. It can use both the Continuous Bag of Words (CBOW) and Skip-Gram models. The main difference is that FastText treats each word as a bag of character n-grams, and the word embeddings are composed of the sum of these character n-gram embeddings.

Q. What are the applications of FastText text representation?

FastText has a wide range of applications in natural language processing (NLP). Some of the common applications include:

  • Text classification: FastText is suitable for tasks like sentiment analysis and topic categorization due to its ability to understand sub-word information.
  • Word representations: It is useful in tasks requiring semantic understanding, such as machine translation and named entity recognition.
  • Language identification: FastText’s robustness to misspellings and morphological richness makes it ideal for identifying the language of a given text.

Q. Are there pre-trained models available for FastText?

Yes, FastText provides pre-trained word vectors for 157 languages, trained on Common Crawl and Wikipedia using CBOW with position-weights. These pre-trained models can be used as a starting point in various NLP tasks.
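
As a sketch, the vectors can be fetched and loaded through the fasttext.util helpers; note the English model is several gigabytes once unpacked:

Python
import fasttext
import fasttext.util

# Download the English vectors (saved as cc.en.300.bin), then load them.
fasttext.util.download_model("en", if_exists="ignore")
ft = fasttext.load_model("cc.en.300.bin")
# Optionally shrink from 300 to 100 dimensions to save memory.
fasttext.util.reduce_model(ft, 100)
print(ft.get_word_vector("hello").shape)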

Q. What are the programming languages and tools available for FastText?

FastText is an open-source project with an official implementation in C++. However, Python wrappers are also available for ease of use. The fasttext Python package can be installed via PyPI.

This blog post provides a comprehensive overview of FastText models, covering their technical aspects, features, and practical applications. FastText’s unique approach to incorporating sub-word information into word representations makes it a versatile and powerful tool in natural language processing.