In the evolving landscape of natural language processing (NLP), T5 (Text-To-Text Transfer Transformer) has emerged as a remarkably versatile model. Fine-tuning it for specific tasks can unleash its full potential, making this a crucial skill for AI enthusiasts and professionals. This article delves into fine-tuning the T5 Transformer model, specifically for the task of generating tags from Stack Overflow questions.
Using a combination of question titles and content, we'll explore how to tailor the T5 model to excel at this task. Our focus on fine-tuning T5 aims to offer insights and practical guidance for those looking to enhance their NLP applications.
The trained T5 Transformer model produces impressive results even with just a few epochs of training. Scroll down to the results section to take a quick look.
Why Do We Need Automated Tag Generation?
- Efficiency in Information Retrieval: With the sheer volume of questions and data on platforms like Stack Overflow, manually tagging each post is impractical. Automated tag generation speeds up the process, ensuring that questions are quickly and accurately categorized.
- Enhanced Search and Filtering: Proper tagging improves the searchability of questions. Users can easily find relevant information based on specific tags, making knowledge acquisition more streamlined and efficient.
- Consistency in Tagging: Manual tagging can lead to inconsistencies due to subjective interpretations. An automated system, trained on a vast dataset, can provide uniformity in tagging, reducing ambiguity and improving the overall quality of data categorization.
- Learning and Adaptation: AI models like T5 can learn from new data and adapt over time. This continuous learning process means that the tagging system evolves, capturing the latest trends and terminologies in the tech world.
The T5 (Text-To-Text Transfer Transformer) Model
T5, or Text-To-Text Transfer Transformer, was developed by Google. It adopts a unique approach where every NLP task is framed as a text-to-text problem. This means inputs and outputs are always treated as text strings, irrespective of their nature. This universal framework allows T5 to handle a wide range of tasks without needing task-specific architectures.
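As a quick illustration of this text-to-text framing, here is a minimal sketch using the pretrained t5-small checkpoint. The task prefixes shown are the ones T5 was pre-trained with, and the exact outputs will vary:
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

# Every task is just "prefixed text in, text out".
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: T5 treats every NLP task as a text-to-text problem, so one "
    "architecture can handle translation, summarization, and more.",
]
for prompt in prompts:
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids
    output_ids = model.generate(input_ids, max_length=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))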
The Architecture of T5
The architecture of the T5 model is based on the original Transformer model, which uses an encoder-decoder structure. However, T5 introduces several key modifications:
- Unified Text-to-Text Framework: T5 processes all tasks, whether translation, summarization, or question answering, in the same manner – by converting them into a text-to-text format.
- Pre-training Tasks: T5 is pre-trained on a diverse text corpus using unsupervised learning, which involves tasks like masking and shuffling words to help the model understand context and language structure better.
- Size Variants: T5 comes in several sizes – from T5-Small (around 60 million parameters) up to T5-11B – offering flexibility in terms of computational resources and application needs.
Tasks Handled by T5
T5 is capable of handling a broad spectrum of NLP tasks, such as:
- Text Summarization: Condensing a long piece of text into a concise summary.
- Question Answering: Providing answers to questions based on context given.
- Language Translation: Translating text from one language to another.
- Text Classification: Categorizing text into predefined classes.
One of the most remarkable aspects of T5 is its ability to adapt to virtually any text-based task, making it a powerful tool in the hands of NLP practitioners.
Fine Tuning T5 For Stack Overflow Tag Generation
Fine-tuning T5 using Hugging Face Transformers involves several steps:
- First, we need to prepare the tag generation dataset, which we will discuss in detail below.
- Second, we need to tokenize the dataset using the T5 tokenizer.
- Third, the T5 Transformer model needs to be initialized.
- Next, we need to define the training arguments.
- Then we can start the process of fine-tuning T5.
Now, let's start with the most interesting part – the code for fine-tuning the T5 model from Hugging Face Transformers for tag generation.
A Brief About the Training Dataset
We will be using the 60k Stack Overflow Questions with Quality Rating dataset from Kaggle. It contains several attributes for over 60,000 Stack Overflow questions, including the question title, the question body, and the quality rating of the question. Additionally, each question has a Tags column indicating the tags that have been assigned to it.
For example, the following question about JavaScript, jQuery, and React has 5 tags associated with it.
Similarly, we have tags for all 60,000 questions. Since there are more than 9,000 distinct tags, treating this as a multi-label classification problem with a BERT-like model would be impractical. Hence, we take a Generative AI approach instead.
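As a quick sanity check of these numbers, once the dataset has been downloaded and extracted (see the sections below), a few lines of pandas show the raw tag format and the number of distinct tags. This is a rough sketch, assuming the Title and Tags column names described above:
import pandas as pd

df = pd.read_csv('input/train.csv')
print(df[['Title', 'Tags']].head())

# Tags are stored as a single string per question, e.g. "<javascript><jquery><reactjs>".
all_tags = df['Tags'].str.findall(r'<(.*?)>').explode()
print('Number of distinct tags:', all_tags.nunique())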
Installing the Necessary Libraries
Let’s start with installing all the necessary libraries.
!pip install -U transformers
!pip install -U datasets
!pip install tensorboard
!pip install sentencepiece
!pip install accelerate
Here we install:
- transformers: gives us access to all the Transformer-based Hugging Face models.
- datasets: the Hugging Face Datasets library, used to load and preprocess the training data.
- tensorboard: used for logging and visualizing the training curves.
- sentencepiece: the tokenization library that the T5 tokenizer relies on.
- accelerate: the accelerate library from Hugging Face automates the training of Transformers across different hardware types and also handles multi-GPU training.
Download and Extract the Dataset
The following code downloads and extracts the 60k Stack Overflow Question dataset.
!wget "https://www.dropbox.com/scl/fi/525gv6tmdi3n32mipo6mr/input.zip?rlkey=5jdsxahphk2ped5wxbxnv0n4y&dl=1" -O input.zip
!unzip input.zip
The dataset will be extracted into a folder named input.
Import Statements
The following code handles the import of all the necessary libraries.
import torch
from transformers import (
T5Tokenizer,
T5ForConditionalGeneration,
TrainingArguments,
Trainer
)
from datasets import load_dataset
Most of the above imports have been explained in the previous article where we fine-tuned the BERT model. Our new import statements are explained as follows:
- T5Tokenizer: This class is part of the Transformers library. The T5Tokenizer is responsible for converting text into a format the T5 model can understand, typically involving converting words into numerical tokens.
- T5ForConditionalGeneration: This is a specific class from the Transformers library that provides the T5 model architecture specifically configured for tasks like translation, summarization, or question answering – where the model generates text based on a given input. However, for us, it will be generating tags based on specific questions.
Training and Dataset Configurations
Choosing the right configuration and hyperparameters is one of the most important aspects of any deep learning fine-tuning process. The following are the configurations we will use along the way.
MODEL = 't5-small'
BATCH_SIZE = 48
NUM_PROCS = 16
EPOCHS = 10
OUT_DIR = 'results_t5small'
MAX_LENGTH = 256 # Maximum context length to consider while preparing dataset.
Hugging Face provides several versions of the T5 model based on size. For this problem, the T5-Small model is a good starting point. It contains around 60 million parameters, which is just right for experimentation.
The fine-tuning process shown here was carried out on an RTX 3090 GPU with a 32-core Ryzen processor. To facilitate faster training, we use a batch size of 48 and 16 parallel processes for tokenization. You can change these according to your hardware. Furthermore, we will train the T5 model for 10 epochs on the tag generation dataset.
MAX_LENGTH is an important hyperparameter here. It is the maximum context length to use and is set to 256, meaning that at most 256 tokens from each input text are considered during dataset preparation. Although a larger context length offers better results, it also demands more GPU memory and longer training times.
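If you are unsure whether 256 tokens is enough, a quick check is to tokenize a few raw questions and look at their lengths before committing to a value. This is a minimal sketch; it loads the same t5-small tokenizer that we load formally in the tokenization section below:
from transformers import T5Tokenizer
import pandas as pd

check_tokenizer = T5Tokenizer.from_pretrained(MODEL)
df = pd.read_csv('input/train.csv')

# Token counts of the combined title + body for the first few questions.
for _, row in df.head(5).iterrows():
    text = f"{row['Title']} {row['Body']}"
    print(len(check_tokenizer(text).input_ids), 'tokens')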
Preparing the Tag Generation Dataset
The first part of preparing the tag generation dataset for fine-tuning T5 involves loading the dataset. As the data is present locally on disk, a few things need to be taken into account.
dataset_train = load_dataset(
    'csv',
    data_files='input/train.csv',
    split='train'
)
dataset_valid = load_dataset(
    'csv',
    data_files='input/valid.csv',
    split='train'
)
As the dataset is in CSV format, the first argument to the load_dataset function is the format. The second argument, data_files, is the path to the training and validation CSV files respectively.
One important point to note is the split argument, which is set to train in both cases. When loading local datasets, it is important to provide this argument. Although one file is the training dataset and the other is the validation dataset, we can safely set both splits to train. Later in the training process, each dataset will be used for its intended purpose: one to create the training data loader and the other to create the validation data loader.
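As a side note, the same result can be achieved in a single call by passing a dictionary to data_files, which makes the split names explicit. Here is a small alternative sketch using the same datasets API:
from datasets import load_dataset

# Equivalent alternative: name the splits explicitly in one call.
dataset = load_dataset(
    'csv',
    data_files={'train': 'input/train.csv', 'validation': 'input/valid.csv'}
)
dataset_train = dataset['train']
dataset_valid = dataset['validation']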
Printing both datasets gives us the following information.
As expected, the training and validation sets contain 45,000 and 15,000 samples respectively. Our primary interest is in the Title, Body, and Tags columns.
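If you want to reproduce this inspection, printing the dataset objects and one raw Tags value looks roughly like this (the exact tag string depends on the sample):
print(dataset_train)
print(dataset_valid)

# Peek at the raw tag string of the first training sample, e.g. "<python><pandas>".
print(dataset_train[0]['Tags'])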
Tokenizing the Dataset
The next step is tokenizing the dataset, where the text will be split according to a tokenization algorithm and the split text will be converted to numbers.
Let’s load the tokenizer for the T5 model first.
tokenizer = T5Tokenizer.from_pretrained(MODEL)
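To see what the tokenizer actually produces, here is a tiny example; the exact IDs and SentencePiece pieces shown depend on the text:
sample = "How do I merge two dictionaries in Python?"
encoded = tokenizer(sample)
print(encoded.input_ids)                                    # numerical token IDs
print(tokenizer.convert_ids_to_tokens(encoded.input_ids))   # SentencePiece pieces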
Next, we have a preprocessing function.
# Function to convert text data into model inputs and targets.
def preprocess_function(examples):
    inputs = [f"assign tag: {title} {body}" for (title, body) in zip(examples['Title'], examples['Body'])]
    model_inputs = tokenizer(
        inputs,
        max_length=MAX_LENGTH,
        truncation=True,
        padding='max_length'
    )

    # Set up the tokenizer for targets.
    # Convert tags like "<javascript><jquery>" into "javascript jquery".
    cleaned_tag = [' '.join(''.join(tag.split('<')).split('>')[:-1]) for tag in examples['Tags']]
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            cleaned_tag,
            max_length=MAX_LENGTH,
            truncation=True,
            padding='max_length'
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
The preprocess_function transforms the raw text data into a structured format suitable for model input. It takes the titles and bodies of Stack Overflow posts and formats them into a standard input structure for the T5 model. It also processes the tags to be used as the target output during training.
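The slightly cryptic cleaned_tag expression simply strips the angle brackets from the raw tag string and joins the individual tags with spaces. On a concrete value it behaves like this:
tag = '<javascript><jquery><reactjs>'
cleaned = ' '.join(''.join(tag.split('<')).split('>')[:-1])
print(cleaned)  # javascript jquery reactjs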
Then, we apply the tokenization to the training and validation datasets using the map method.
# Apply the function to the whole dataset
tokenized_train = dataset_train.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)
tokenized_valid = dataset_valid.map(
    preprocess_function,
    batched=True,
    num_proc=NUM_PROCS
)
By the end of this step, the data is fully tokenized and in the right format for training the T5 model.
Preparing the T5 Model for Fine-Tuning
The code below focuses on preparing the T5 model for the fine-tuning process.
model = T5ForConditionalGeneration.from_pretrained(MODEL)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)
print(f"{total_trainable_params:,} training parameters.")
The T5ForConditionalGeneration class is designed for tasks that involve generating text based on an input, which is exactly what is needed for generating tags from Stack Overflow questions. Along with initializing the model, we also transfer it to the GPU if one is available.
Fine Tuning T5 Model
To start fine tuning T5, we need to initialize the training arguments and trainer.
training_args = TrainingArguments(
    output_dir=OUT_DIR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=OUT_DIR,
    logging_steps=10,
    evaluation_strategy='steps',
    save_steps=500,
    eval_steps=500,
    load_best_model_at_end=True,
    save_total_limit=5,
    report_to='tensorboard',
    learning_rate=0.0001,
    fp16=True,
    dataloader_num_workers=4
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
)
While fine-tuning BERT, we saw what each training argument and the Trainer class do. Let's clarify a few points here:
- We train the model in mixed precision mode with fp16=True, which reduces GPU memory usage and gives a slightly faster training time.
- The number of workers for the data loaders has been set to 4 for faster, parallel processing of the data.
The Trainer object is then initialized with the model, training arguments, and datasets. The Trainer handles the training and evaluation loops, effectively managing the process of fine-tuning the model on the training dataset and evaluating its performance on the validation dataset.
Finally, we can begin fine-tuning T5 for tag generation.
history = trainer.train()
After every 500 steps, a model checkpoint will be saved and evaluation will be performed.
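If training gets interrupted, the Trainer can resume from the most recent of these checkpoints instead of starting over. A small sketch using the standard Trainer API:
# Resume from the latest checkpoint saved in OUT_DIR.
history = trainer.train(resume_from_checkpoint=True)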
This is the result that we get from training the T5 model.
The validation loss at the end of 9000 steps is 0.046514, which is the lowest loss observed during training.
Here are the loss graphs from the Tensorboard logs.
Clearly, the loss kept decreasing until the end of training, so we can use the last saved model for inference.
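The logs live in the directory we passed as logging_dir, so the graphs can be reproduced locally, either from a terminal or directly inside a notebook:
# From a terminal: tensorboard --logdir results_t5small
# Or inside a Jupyter notebook:
%load_ext tensorboard
%tensorboard --logdir results_t5small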
Before we move on to the inference step, let's also save the tokenizer so that we can load it later.
tokenizer.save_pretrained(OUT_DIR)
Inference using the Trained T5 Model
After fine-tuning the T5 Transformer model, we are all set to carry out inference.
In the following code block, we download and unzip a few Stack Overflow questions for inference.
!wget "https://www.dropbox.com/scl/fi/9brsjizymq5zvqi7hff09/inference_data.zip?rlkey=ukmdy5egmdld80r5hhmsja78v&dl=1" -O inference_data.zip
!unzip inference_data.zip
To make the inference section standalone, we import the necessary packages.
from transformers import T5ForConditionalGeneration, T5Tokenizer
import os
Before inference, the trained model and tokenizer need to be loaded from disk.
model_path = 'results_t5small/checkpoint-9000/' # the path where you saved your model
model = T5ForConditionalGeneration.from_pretrained(model_path)
tokenizer = T5Tokenizer.from_pretrained('results_t5small')
We write a simple helper function for inference.
def do_correction(text, model, tokenizer):
    input_text = f"assign tag: {text}"
    inputs = tokenizer.encode(
        input_text,
        return_tensors='pt',
        max_length=256,
        padding='max_length',
        truncation=True
    )

    # Generate the tag token IDs using beam search.
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,  # Beam search with 5 beams; `num_beams=1` would mean greedy decoding.
        early_stopping=True
    )

    # Decode the generated IDs back into a tag string.
    corrected_sentence = tokenizer.decode(
        corrected_ids[0],
        skip_special_tokens=True
    )
    return corrected_sentence
This function prepends the assign tag: prefix to each Stack Overflow title and question. This prefix is the same as the one used during training dataset preparation. It tells the model that it needs to assign tags based on the text we provide.
In the final step, we loop over all files in the inference_data directory and carry out inference.
for file in os.listdir('inference_data/'):
    with open(f"inference_data/{file}", 'r') as f:
        sentence = f.read()
    corrected_sentence = do_correction(sentence, model, tokenizer)
    print(f"QUERY: {sentence}\nTAGS: {corrected_sentence}")
    print('-'*100)
Below are the truncated text, predicted tags, and ground truth tags.
Do you find the result compelling? If so, go through the detailed write-up above to know how we achieved this.
Our verdict
The model performs remarkably well in almost all cases. In every example apart from the second, it recovers at least some of the correct tags from the ground truth set. This shows the power of the T5 model for such a real-life use case.
Conclusion
In this article, we fine-tuned the T5 Transformer model for Stack Overflow tag generation. It is one of the real-life use cases where we can use Transformer based language models for automating a task.
We started with the dataset and model preparation, and moved on to the detailed procedure of training. In the end, we also ran inference to check its performance on real-world samples. Although we trained the T5-Small model, it performed exceptionally well. Of course, training a larger model should improve performance even more.
Please share in the comments if you’ve trained either the T5-Base or T5-Large model and noticed improved performance.