RoBERTa, which stands for “Robustly Optimized BERT Pretraining Approach,” is a natural language processing (NLP) model developed by Facebook AI. It is an enhanced version of BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking model introduced by Google. RoBERTa keeps BERT's architecture but changes the pretraining recipe: it trains longer on more data with larger batches, uses dynamic masking, and drops the next-sentence-prediction objective. The following is a step-by-step tutorial for document classification using RoBERTa:
Step 1: Choose a Dataset
We’ll stick with the “20 Newsgroups” dataset, as mentioned earlier.
Step 2: Environment Setup
Install the necessary Python libraries:
- TensorFlow
- Transformers (by Hugging Face)
- Scikit-learn
You can install these using pip:
pip install tensorflow transformers scikit-learn
Step 3: Data Preparation
- Load the Dataset:
Use scikit-learn to load the 20 Newsgroups dataset as before.
- Preprocess the Data:
Basic text cleaning to remove unnecessary characters.
from sklearn.datasets import fetch_20newsgroups
categories = ['comp.graphics', 'sci.med']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
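The loading code above does not itself perform the text cleaning that this step mentions. Here is a minimal sketch of one way to do it using only the standard library. The `clean_text` helper and its rules (dropping e-mail addresses, stripping control characters, collapsing whitespace) are illustrative assumptions, not part of the original tutorial. Cleaning is kept light because the RoBERTa tokenizer already copes with case and punctuation.

```python
import re

def clean_text(text: str) -> str:
    """Lightly normalize a raw newsgroup post (hypothetical helper)."""
    # Drop e-mail addresses, which mostly identify the poster rather than the topic.
    text = re.sub(r"\S+@\S+", " ", text)
    # Replace control characters other than tab and newline with spaces.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "From: alice@example.com\n\nThe   GPU renders\tpolygons."
print(clean_text(sample))  # From: The GPU renders polygons.
```

You could apply it with a list comprehension such as `[clean_text(doc) for doc in newsgroups_train.data]` before tokenizing. scikit-learn's `fetch_20newsgroups` also accepts `remove=('headers', 'footers', 'quotes')`, which strips metadata the model could otherwise latch onto.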
- Tokenize and Encode:
Use the RoBERTa tokenizer from Hugging Face:
from transformers import RobertaTokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
train_encodings = tokenizer(newsgroups_train.data, truncation=True, padding=True, return_tensors='tf')
test_encodings = tokenizer(newsgroups_test.data, truncation=True, padding=True, return_tensors='tf')
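To make `truncation=True` and `padding=True` more concrete, here is a toy sketch in plain Python of what they do to a batch of token ID lists. It does not call the real tokenizer: `pad_and_truncate` is a hypothetical function, the token IDs are made up, and the pad ID of 1 is an assumption (check `tokenizer.pad_token_id` for the actual value).

```python
from typing import Dict, List

def pad_and_truncate(batch: List[List[int]], max_length: int,
                     pad_id: int = 1) -> Dict[str, List[List[int]]]:
    """Toy version of truncation plus pad-to-longest on lists of token IDs."""
    # Truncation: cut every sequence down to at most max_length tokens.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend shorter sequences to the longest one left in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

encoded = pad_and_truncate([[0, 713, 2], [0, 713, 16, 10, 2]], max_length=4)
print(encoded["input_ids"])       # [[0, 713, 2, 1], [0, 713, 16, 10]]
print(encoded["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```

The real tokenizer returns the same two keys, `input_ids` and `attention_mask`, but it also keeps the end-of-sequence token when truncating, which this sketch does not.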
Step 4: Prepare the Dataset
In TensorFlow, you can use tf.data.Dataset to handle the data:
import tensorflow as tf
train_dataset = tf.data.Dataset.from_tensor_slices((
dict(train_encodings),
newsgroups_train.target
))
test_dataset = tf.data.Dataset.from_tensor_slices((
dict(test_encodings),
newsgroups_test.target
))
![image-2 Document classification using RoBERTa Model](https://codeblockhub.com/wp-content/uploads/2023/12/image-2-1024x585.png)
Step 5: Load the Model
Load a TensorFlow-compatible RoBERTa model:
from transformers import TFRobertaForSequenceClassification
model = TFRobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=len(categories))
Step 6: Training the Model
Set up the training with TensorFlow:
- Define the optimizer, loss function, and the metrics.
- Compile the model.
- Fit the model on the training data.
- Save the model after training.
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
batch_size = 16
epochs = 3
train_dataset = train_dataset.shuffle(200).batch(batch_size)
test_dataset = test_dataset.batch(batch_size)
history = model.fit(train_dataset, epochs=epochs, validation_data=test_dataset)
Step 7: Evaluation
- Model Performance: Evaluate the model on the test dataset using the evaluate method.
- Analysis: Analyze the performance and identify areas of improvement.
evaluation_results = model.evaluate(test_dataset)
print(f"Test Loss: {evaluation_results[0]}, Test Accuracy: {evaluation_results[1]}")
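For the analysis step, a single accuracy number can hide one class being systematically mistaken for the other. The sketch below builds a confusion matrix and per-class precision and recall in plain Python. The `confusion_matrix` and `precision_recall` helpers and the sample labels are illustrative assumptions. In practice `y_true` would be `newsgroups_test.target` and `y_pred` the argmax of the model's logits over the test set.

```python
from typing import List, Tuple

def confusion_matrix(y_true: List[int], y_pred: List[int],
                     num_classes: int) -> List[List[int]]:
    """matrix[t][p] counts examples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def precision_recall(matrix: List[List[int]], cls: int) -> Tuple[float, float]:
    """Precision and recall for one class, returning 0.0 when undefined."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)  # column sum: everything predicted as cls
    actual = sum(matrix[cls])                    # row sum: everything that really is cls
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall

# Made-up labels for illustration: 0 = comp.graphics, 1 = sci.med.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=2)
print(cm)  # [[2, 1], [1, 2]]
for cls in range(2):
    precision, recall = precision_recall(cm, cls)
    print(f"class {cls}: precision={precision:.2f}, recall={recall:.2f}")
```

If you prefer not to hand-roll this, scikit-learn's `sklearn.metrics.classification_report` produces a similar per-class summary.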
Step 8: Inference
Use the trained model to predict classes of new documents.
sample_text = [["this computer has 16GB RAM"], ["The lungs and respiratory system allow us to breathe"],
               ['interconnected computing devices that can exchange data and share resources with each other.']]
for sample in sample_text:
    # Each sample is a one-element list, so the tokenizer treats it as a batch of one document.
    sample_encoding = tokenizer(sample, truncation=True, padding=True, return_tensors='tf')
    sample_prediction = model(sample_encoding)
    # Take the index of the highest logit as the predicted class.
    predicted_class = tf.argmax(sample_prediction.logits, axis=1).numpy()[0]
    print(f"Predicted class: {predicted_class}")
# Output categories = ['comp.graphics', 'sci.med']
"""
Predicted class: 0
Predicted class: 1
Predicted class: 0
"""
Step 9: Saving and Loading the Model
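Use the save_pretrained and from_pretrained methods that Hugging Face models provide to save the model together with its configuration and to load it again later:
model.save_pretrained('MyModelSequenceClassification')
model = TFRobertaForSequenceClassification.from_pretrained('MyModelSequenceClassification')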
Conclusion
This is a simplified example of how to create a document classification model with RoBERTa using TensorFlow. Real-world scenarios might require more advanced data preprocessing, model tuning, and evaluation strategies. Also, consider exploring TensorFlow’s more advanced features for better performance and efficiency.