Transformers

Document classification using RoBERTa Model

RoBERTa, which stands for “Robustly Optimized BERT Pretraining Approach,” is a natural language processing (NLP) model developed by Facebook AI. It is an enhanced version of BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking model introduced by Google. RoBERTa keeps BERT’s architecture but changes the pretraining recipe: it trains longer on more data with larger batches, drops the next-sentence prediction objective, and uses dynamic masking. The following is a step-by-step tutorial for document classification using RoBERTa:

Step 1: Choose a Dataset

We’ll stick with the “20 Newsgroups” dataset, as mentioned earlier.

Step 2: Environment Setup

Install the necessary Python libraries:

  • TensorFlow
  • Transformers (by Hugging Face)
  • Scikit-learn

You can install these using pip:

Terminal
pip install tensorflow transformers scikit-learn

Step 3: Data Preparation

  1. Load the Dataset:
    Use scikit-learn to load the 20 Newsgroups dataset as before.
  2. Preprocess the Data:
    Apply basic text cleaning to remove unnecessary characters if needed; the snippet below loads the raw posts for two categories.
Python
from sklearn.datasets import fetch_20newsgroups

# Keep two of the 20 categories so the task is binary classification
categories = ['comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
  3. Tokenize and Encode:
    Use the RoBERTa tokenizer from Hugging Face:
Python
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# Truncate to the model's maximum length and pad each split to its longest sequence
train_encodings = tokenizer(newsgroups_train.data, truncation=True, padding=True, return_tensors='tf')
test_encodings = tokenizer(newsgroups_test.data, truncation=True, padding=True, return_tensors='tf')
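
Item 2 above mentions basic text cleaning, but the loading code leaves the posts untouched. A minimal sketch of such a step follows, assuming that "unnecessary characters" means email addresses, leading quote markers, and runs of whitespace. The helper name clean_text and the specific patterns are illustrative choices, not part of the original tutorial:

```python
import re

# Hypothetical helper: the patterns below are assumptions about what counts as noise.
EMAIL_RE = re.compile(r"\S+@\S+")
QUOTE_RE = re.compile(r"^\s*>+\s?", flags=re.MULTILINE)
WHITESPACE_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Remove email addresses and leading quote markers, then collapse whitespace."""
    text = EMAIL_RE.sub(" ", text)
    text = QUOTE_RE.sub("", text)
    return WHITESPACE_RE.sub(" ", text).strip()

raw_post = "From: someone@example.com\n> quoted line\nThe  scan   looks fine.\n"
print(clean_text(raw_post))
```

If used, it would run before tokenization, for example as [clean_text(doc) for doc in newsgroups_train.data]. Whether this kind of cleaning helps a pretrained tokenizer is something to verify on the validation data rather than assume.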

Step 4: Prepare the Dataset

In TensorFlow, you can use tf.data.Dataset to pair the encoded inputs with their labels:

Python
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    newsgroups_train.target
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    newsgroups_test.target
))

Step 5: Load the Model

Load a TensorFlow-compatible RoBERTa model:

Python
from transformers import TFRobertaForSequenceClassification

model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=len(categories))

Step 6: Training the Model

Set up the training with TensorFlow:

  1. Define the optimizer, loss function, and the metrics.
  2. Compile the model.
  3. Fit the model on the training data.
  4. Save the model after training (see Step 9).
Python
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
Python
batch_size = 16
epochs = 3

train_dataset = train_dataset.shuffle(200).batch(batch_size)
test_dataset = test_dataset.batch(batch_size)

history = model.fit(train_dataset, epochs=epochs, validation_data=test_dataset)
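
The from_logits=True argument in the loss above matters because the classification head outputs raw scores rather than probabilities. As an illustration only, the per-example quantity that sparse categorical cross-entropy computes from logits can be written in plain Python. The function name and the example logits are made up for this sketch:

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: log-sum-exp of the logits minus the true-class logit."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical scores for comp.graphics (index 0) and sci.med (index 1)
example_logits = [2.0, 0.5]
loss_correct = sparse_ce_from_logits(example_logits, label=0)
loss_wrong = sparse_ce_from_logits(example_logits, label=1)
print(f"true class 0: {loss_correct:.4f}, true class 1: {loss_wrong:.4f}")
```

Keras averages this value over each batch. The same logits give a small loss when the higher-scoring class is the true one and a larger loss otherwise, which is the signal the optimizer follows during fine-tuning.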

Step 7: Evaluation

  1. Model Performance: Evaluate the model on the test dataset using the evaluate method.
  2. Analysis: Analyze the performance and identify areas of improvement.
Python
evaluation_results = model.evaluate(test_dataset)
print(f"Test Loss: {evaluation_results[0]}, Test Accuracy: {evaluation_results[1]}")
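
For the analysis step, overall accuracy can hide class-specific errors. One possible breakdown, shown here in plain Python with made-up labels, counts (true, predicted) pairs and derives per-class precision and recall. In practice the predicted labels would come from taking the argmax of the model's logits on the test set, and scikit-learn's classification_report offers a similar summary. The function name below is hypothetical:

```python
from collections import Counter

def per_class_report(y_true, y_pred, class_names):
    """Return {class_name: (precision, recall)} from parallel label lists."""
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    report = {}
    for idx, name in enumerate(class_names):
        true_pos = pairs[(idx, idx)]
        predicted = sum(c for (t, p), c in pairs.items() if p == idx)
        actual = sum(c for (t, p), c in pairs.items() if t == idx)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        report[name] = (precision, recall)
    return report

# Made-up labels for illustration; not results from the tutorial's model.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
report = per_class_report(y_true, y_pred, ['comp.graphics', 'sci.med'])
for name, (precision, recall) in report.items():
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

A class with low recall points to documents of that class being assigned elsewhere, which is a natural place to start looking at misclassified examples.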

Step 8: Inference

Use the trained model to predict classes of new documents.

Python
sample_text = [
    ["this computer has 16GB RAM"],
    ["The lungs and respiratory system allow us to breathe"],
    ["interconnected computing devices that can exchange data and share resources with each other."],
]

for sample in sample_text:
    sample_encoding = tokenizer(sample, truncation=True, padding=True, return_tensors='tf')
    sample_prediction = model(sample_encoding)
    predicted_class = tf.argmax(sample_prediction.logits, axis=1).numpy()[0]
    print(f"Predicted class: {predicted_class}")
    
# Output categories = ['comp.graphics', 'sci.med']
"""
Predicted class: 0
Predicted class: 1
Predicted class: 0
"""

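The printed indices refer to positions in the categories list. A small sketch of turning one row of logits into a readable label with a confidence score follows, using plain Python and hypothetical logit values. With the real model, the row would be something like sample_prediction.logits[0].numpy(); the function name is an illustrative choice:

```python
import math

categories = ['comp.graphics', 'sci.med']

def label_with_confidence(logits, class_names):
    """Apply softmax to the logits and return the top class name and its probability."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]

# Hypothetical logits for one document; not output from the trained model.
name, confidence = label_with_confidence([-1.2, 2.3], categories)
print(f"Predicted class: {name} ({confidence:.2%})")
```

Reporting the probability alongside the label makes it easier to spot borderline predictions, such as out-of-domain text that belongs to neither category.
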
Step 9: Saving and Loading the Model

Use TensorFlow’s built-in save method to write the trained model to disk in the SavedModel format:

Python
model.save('MyModelSequenceClassification', save_format='tf')
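
The step title also mentions loading. One commonly used alternative to model.save is the Hugging Face save_pretrained / from_pretrained pair, which writes the weights and configuration to a directory and rebuilds the model from it. The sketch below assumes the model and tokenizer objects from the earlier steps; the function name and the directory name are hypothetical:

```python
def save_and_reload(model, tokenizer, output_dir='roberta_newsgroups'):
    """Save a fine-tuned model and its tokenizer, then load both back from disk."""
    # Imported here so that defining the function does not require the libraries.
    from transformers import RobertaTokenizer, TFRobertaForSequenceClassification

    model.save_pretrained(output_dir)      # writes the configuration and TF weights
    tokenizer.save_pretrained(output_dir)  # writes the vocabulary and merges files

    reloaded_model = TFRobertaForSequenceClassification.from_pretrained(output_dir)
    reloaded_tokenizer = RobertaTokenizer.from_pretrained(output_dir)
    return reloaded_model, reloaded_tokenizer
```

A call such as save_and_reload(model, tokenizer) would return objects that can be passed to the inference loop from Step 8 in a fresh session, without retraining.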

Conclusion

This is a simplified example of how to create a document classification model with RoBERTa using TensorFlow. Real-world scenarios might require more advanced data preprocessing, model tuning, and evaluation strategies. Also, consider exploring TensorFlow’s more advanced features for better performance and efficiency.