Fine Tuning Demo | Open Source Model with Conversational Reddit Data
Step 1: Accessing and Setting Up a Hugging Face Model in Google Colab
Find our Python Google Colab demo script here: Python Reddit Comment Generation Demo
Welcome to the first step in fine-tuning an open-source large language model (LLM) using Google Colab! Below, you'll find detailed instructions on how to get started with a model from Hugging Face and set it up in your Colab environment.
1.1 Accessing Hugging Face Models
1. Visit Hugging Face: Head over to the Hugging Face Models page, where you can explore a wide range of open-source models. For this guide, let's use two examples:
   - GPT-2: A small yet capable language model suited to a wide range of text generation tasks.
   - DistilBERT: A distilled version of BERT that is faster and lighter, making it ideal for text classification and other NLP tasks.
2. Choose a Model: Click on the model you want to fine-tune. For example, if you choose GPT-2, you'll be taken to the model's page, where you can find more details about its architecture, usage, and available versions.
1.2 Setting Up Google Colab
1. Open Google Colab: If you don't already have Google Colab open, start by going to Google Colab and signing in with your Google account.
2. Create a New Notebook: In Colab, click "File" and then "New notebook" to create a new Python notebook.
3. Install the Hugging Face Transformers Library: In the first cell of your notebook, enter the following code to install the necessary libraries, then run the cell by pressing Shift + Enter:
!pip install transformers
!pip install torch
4. Import the Necessary Libraries: After the installation is complete, import the libraries you'll need. If you're using GPT-2, you'll specifically need these:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
For DistilBERT, use the following:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertForSequenceClassification.from_pretrained(model_name)
5. Connect to Google Drive: You'll need to connect your Google Drive to load the dataset for fine-tuning later. Run the following code to mount your Google Drive:
from google.colab import drive
drive.mount('/content/drive')
After running this cell, a link will appear, prompting you to authorize Google Colab to access your Google Drive. Follow the instructions to grant permission.
For models that require authentication, you'll need to provide an API token when loading the model in your Google Colab environment. Here's how you can do it:
1. Create or Log in to Your Hugging Face Account: Visit Hugging Face and either create a new account or log in to your existing account.
2. Generate an API Token: After logging in, go to your account settings by clicking on your profile icon in the top right corner, then select "Settings." In the "Access Tokens" section, click "New token" to generate a new API token. You can name it anything you like and select the scope (usually "read").
3. Authenticate in Google Colab: In your Google Colab notebook, before you load the model, run the following code to set up authentication, replacing "your_hugging_face_api_token" with the token you generated in the previous step:
from huggingface_hub import login

login(token="your_hugging_face_api_token")
4. Load the Model: After logging in, you can load the model as usual. The authentication will allow you to access models that require it.
For models that don't require authentication, like GPT-2 or DistilBERT, you can load them directly without this step. However, if your customers are working with restricted models, you’ll need to guide them through the authentication process as described above.
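As a quick illustration, once login() has run, a gated checkpoint loads with the same from_pretrained calls as any public model. The model name below is only a hypothetical example of a gated repository; substitute whichever restricted model you've been granted access to:
from transformers import AutoTokenizer, AutoModelForCausalLM

gated_model = "meta-llama/Llama-2-7b-hf"  # hypothetical example; requires approved access
tokenizer = AutoTokenizer.from_pretrained(gated_model)
model = AutoModelForCausalLM.from_pretrained(gated_model)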
Step 2: Preparing Your Dataset and Fine-Tuning the Model
Now that you have your model set up in Google Colab, it's time to prepare your dataset and fine-tune the model. In this example, we'll be using a dataset of Reddit comment trees, focusing on the AggregatedText and CommentText columns, and using the top 5% of comments by like count for fine-tuning.
2.1 Loading Your Dataset into Google Colab
1. Upload Your Dataset to Google Drive: Ensure your dataset is uploaded to your Google Drive, and note its exact path (e.g., /content/drive/MyDrive/RedditData.csv).
2. Load the Dataset in Colab: Use the following code to load your dataset into a Pandas DataFrame, replacing '/content/drive/MyDrive/RedditData.csv' with the actual path to your dataset in Google Drive:
import pandas as pd

data_path = '/content/drive/MyDrive/RedditData.csv'
df = pd.read_csv(data_path)
2.2 Filtering the Top Comments
1. Calculate the Top 5% by Like Count: First, calculate the threshold that determines the top 5% of comments by like count. This gives you the like count that separates the top 5% from the rest:
like_threshold = df['like_count'].quantile(0.95)
2. Filter the Dataset: Next, filter the dataset to include only the comments in the top 5%:
top_comments_df = df[df['like_count'] >= like_threshold]
3. Check the Result: Verify that your DataFrame now contains only the top 5% of comments:
print(top_comments_df.head())
2.3 Preparing the Data for Fine-Tuning
1. Prepare Inputs for the Model: For fine-tuning, you'll use AggregatedText as the input and CommentText as the target for prediction:
inputs = top_comments_df['AggregatedText'].tolist()
targets = top_comments_df['CommentText'].tolist()
2. Tokenize the Data: Use your tokenizer to convert the text into tokens that the model can understand. Note that GPT-2's tokenizer has no padding token by default, so assign one (reusing the end-of-sequence token) before padding:
tokenizer.pad_token = tokenizer.eos_token

input_tokens = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True)
target_tokens = tokenizer(targets, return_tensors="pt", padding=True, truncation=True)
2.4 Fine-Tuning the Model
1. Set Up the Trainer: Use the Hugging Face Trainer class to handle the fine-tuning process. Trainer expects a dataset of training examples rather than a raw tensor of token IDs, so first build a small training dataset as sketched below, then configure the trainer.
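Here is a minimal sketch of that dataset-building step, assuming a causal-LM setup in which each example is the prompt (AggregatedText) concatenated with its target comment (CommentText), separated by the EOS token; the concatenation scheme and max_length are assumptions, not something the original specifies:
# Concatenate each prompt with its target comment; for causal LM training,
# the labels mirror the input IDs (with padding masked out of the loss).
train_texts = [p + tokenizer.eos_token + t for p, t in zip(inputs, targets)]
encodings = tokenizer(train_texts, padding=True, truncation=True, max_length=512)

# Trainer accepts any object with __len__ and __getitem__ that yields dicts,
# so a plain list of examples works with the default data collator.
train_dataset = [
    {
        "input_ids": ids,
        "attention_mask": mask,
        # Mask padding positions with -100 so they are ignored by the loss.
        "labels": [tok if m == 1 else -100 for tok, m in zip(ids, mask)],
    }
    for ids, mask in zip(encodings["input_ids"], encodings["attention_mask"])
]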
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=8,   # batch size for training
    save_steps=10_000,               # save a checkpoint every 10,000 steps
    save_total_limit=2,              # only keep the last 2 checkpoints
)

trainer = Trainer(
    model=model,                  # the instantiated 🤗 Transformers model to be trained
    args=training_args,           # training arguments, defined above
    train_dataset=train_dataset,  # the training dataset built above
)
2. Start Training: Begin the fine-tuning process by running:
trainer.train()
This process might take some time depending on your dataset and model size.
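The original walkthrough doesn't show an explicit save step, but Step 3 reloads the model from './results', so a small addition like this is assumed in between:
# Save the fine-tuned weights and the tokenizer so they can be reloaded later.
trainer.save_model('./results')
tokenizer.save_pretrained('./results')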
2.5 Evaluating the Model
1. Generate Predictions: After training, you can generate predictions with the fine-tuned model to see how well it produces popular comments. Generating for a small sample keeps this step fast, and max_new_tokens caps the length of the generated continuation:
sample_ids = input_tokens['input_ids'][:5].to(model.device)
predictions = model.generate(sample_ids, max_new_tokens=50)
for i, prediction in enumerate(predictions):
    print(f"Original: {targets[i]}")
    print(f"Predicted: {tokenizer.decode(prediction, skip_special_tokens=True)}")
2. Assess Performance: Compare the predicted comments to the original comments to assess the model's performance; one simple way to quantify the comparison is sketched below.
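Here's a minimal sketch of one way to put a number on that comparison, using plain token overlap between each target and its prediction. The metric choice is an assumption; the original doesn't prescribe one, and for a real evaluation you'd likely reach for BLEU, ROUGE, or a learned metric:
def token_overlap(reference, candidate):
    """Fraction of the reference's unique tokens that appear in the candidate."""
    ref_tokens = set(reference.lower().split())
    cand_tokens = set(candidate.lower().split())
    return len(ref_tokens & cand_tokens) / max(len(ref_tokens), 1)

decoded = [tokenizer.decode(p, skip_special_tokens=True) for p in predictions]
scores = [token_overlap(t, d) for t, d in zip(targets, decoded)]
print(f"Mean token overlap: {sum(scores) / len(scores):.3f}")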
Step 3: Testing the Fine-Tuned Model on Real-Time Reddit Data
Now that you've fine-tuned your model to predict popular comments, it's time to put it to the test with real-time Reddit data. In this step, we'll walk through how to use your fine-tuned model to generate a prediction for a real-time comment string.
3.1 Setting Up the Real-Time Prediction Environment
1. Import Necessary Libraries: First, make sure you've imported all the libraries you need:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
-
Load Your Fine-Tuned Model:
-
If your model is saved in a directory after fine-tuning, load it like this:
model_name_or_path = './results'
(Replace./results
with the actual path where your model is saved.)tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
-
3.2 Preparing the Real-Time Comment for Prediction
1. Input the Real-Time Comment: You'll need a string input representing the comment for which you want to generate a prediction.
2. Create the Input for the Model: Tokenize the real-time comment to prepare it for the model, as sketched below.
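The tokenization itself isn't shown in the original at this point; a minimal version consistent with the variable names used in 3.3 (real_time_comment and input_tokens) would be the following, where the example string is purely hypothetical:
# Any input string works here; this one is illustrative only.
real_time_comment = "User 1: Anyone else watching the launch tonight?"
input_tokens = tokenizer(real_time_comment, return_tensors="pt")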
3.3 Generating the Prediction
1. Generate the Prediction: Use the fine-tuned model to generate a prediction based on the real-time comment input:
predicted_tokens = model.generate(input_tokens['input_ids'], max_new_tokens=50)
predicted_comment = tokenizer.decode(predicted_tokens[0], skip_special_tokens=True)
2. Output the Predicted Comment: Print the predicted comment, which represents what your model thinks would be a viral or popular response:
print(f"Original Comment: {real_time_comment}")
print(f"Predicted Popular Comment: {predicted_comment}")
3.4 Setting Up a Simple Interface (Optional)
1. Creating a Function for Repeated Use: To streamline the process, you can create a function that takes any input string and returns the predicted popular comment:
def generate_viral_comment(input_string):
    input_tokens = tokenizer(input_string, return_tensors="pt")
    predicted_tokens = model.generate(input_tokens['input_ids'], max_new_tokens=50)
    return tokenizer.decode(predicted_tokens[0], skip_special_tokens=True)

# Example usage
real_time_comment = "User 1: Faux Patriotism. User 2: The perfect metaphor for Fox News? User 3: Maybe for the Republican Party in general as well."
predicted_comment = generate_viral_comment(real_time_comment)
print(f"Original Comment: {real_time_comment}")
print(f"Predicted Popular Comment: {predicted_comment}")
2. Creating a Simple User Interface: If you'd like to create a simple text interface for others to use, you can add a text input and a button in a Jupyter Notebook or Colab:
from IPython.display import display
import ipywidgets as widgets

def on_button_click(b):
    input_string = text_box.value
    predicted_comment = generate_viral_comment(input_string)
    print(f"Original Comment: {input_string}")
    print(f"Predicted Popular Comment: {predicted_comment}")

text_box = widgets.Text(
    value='Type your comment here...',
    description='Comment:',
)
generate_button = widgets.Button(
    description='Generate Prediction',
)
generate_button.on_click(on_button_click)
display(text_box, generate_button)
This code creates a text box where users can input their comment, plus a button that, when clicked, displays the predicted popular comment.
3.5 Testing and Iteration
1. Test with Different Comments: Try testing with various real-time Reddit comments to see how well your model predicts viral responses; a small batch-testing loop is sketched below.
2. Refine Based on Feedback: Use the results to further fine-tune the model if necessary, or adjust your data preprocessing steps to improve prediction accuracy.
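As a quick illustration, you can reuse generate_viral_comment from 3.4 to run a handful of test strings in one pass; the comments below are hypothetical examples, not part of the original dataset:
# Hypothetical test inputs; swap in real Reddit threads you want to probe.
test_comments = [
    "User 1: What's the best way to learn Python?",
    "User 1: This game was rigged. User 2: Every year, same story.",
]
for comment in test_comments:
    print(f"Input: {comment}")
    print(f"Predicted: {generate_viral_comment(comment)}\n")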