Fine Tuning Demo | Open Source Model with Conversational Reddit Data

Step 1: Accessing and Setting Up a Hugging Face Model in Google Colab

Find our Python Google Colab demo script here: Python Reddit Comment Generation Demo

Welcome to the first step in fine-tuning an open-source large language model (LLM) using Google Colab! Below, you'll find detailed instructions on how to get started with a model from Hugging Face and set it up in your Colab environment.

1.1 Accessing Hugging Face Models

  1. Visit Hugging Face: Head over to the Hugging Face Models page, where you can explore a wide range of open-source models. For this guide, let's use two examples:

    • GPT-2: A small yet powerful language model suitable for various text generation tasks.
    • DistilBERT: A distilled version of BERT that is faster and lighter, making it ideal for text classification and other NLP tasks.
  2. Choose a Model: Click on the model you want to fine-tune. For example, if you choose GPT-2, you'll be taken to the model's page, where you can find more details about its architecture, usage, and available versions.

1.2 Setting Up Google Colab

  1. Open Google Colab: If you don’t already have Google Colab open, you can start by going to Google Colab. Sign in with your Google account.

  2. Create a New Notebook: In Colab, click on “File” and then “New notebook” to create a new Python notebook.

  3. Install the Hugging Face Transformers Library:

    • In the first cell of your notebook, enter the following code to install the necessary libraries:

      !pip install transformers

      !pip install torch

    • Run the cell by pressing Shift + Enter.

  4. Import the Necessary Libraries:

    • After the installation is complete, import the libraries you'll need:

      from transformers import AutoTokenizer, AutoModelForCausalLM

    • If you're using GPT-2, load the tokenizer and model like this:

      model_name = "gpt2"

      tokenizer = AutoTokenizer.from_pretrained(model_name)

      model = AutoModelForCausalLM.from_pretrained(model_name)

    • For DistilBERT, use the following:

      from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

      model_name = "distilbert-base-uncased"

      tokenizer = DistilBertTokenizer.from_pretrained(model_name)

      model = DistilBertForSequenceClassification.from_pretrained(model_name)

  5. Connect to Google Drive:

    • You’ll need to connect your Google Drive to load the dataset for fine-tuning later. Run the following code to mount your Google Drive:

      from google.colab import drive

      drive.mount('/content/drive')

    • After running this cell, a link will appear, prompting you to authorize Google Colab to access your Google Drive. Follow the instructions to grant permission.
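
Before moving on, it's worth a quick check that the model and tokenizer actually load and can generate text. A minimal sketch using the GPT-2 example above (the prompt string is arbitrary):

      # Quick sanity check: generate a short continuation with the freshly loaded model
      prompt = "Reddit threads are"
      encoded = tokenizer(prompt, return_tensors="pt")
      output = model.generate(encoded['input_ids'], max_new_tokens=20, pad_token_id=tokenizer.eos_token_id)
      print(tokenizer.decode(output[0], skip_special_tokens=True))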

 

1.3 Authenticating for Models That Require It

For models that require authentication, you'll need to provide an API token when loading the model in your Google Colab environment. Here's how you can do it:

  1. Create or Log in to Your Hugging Face Account:

    • Visit Hugging Face and either create a new account or log in to your existing account.
  2. Generate an API Token:

    • After logging in, go to your account settings by clicking on your profile icon in the top right corner, then select "Settings."
    • In the "Access Tokens" section, click on "New token" to generate a new API token. You can name it anything you like and select the scope (usually "read").
  3. Authenticate in Google Colab:

    • In your Google Colab notebook, before you load the model, run the following code to set up authentication:

      from huggingface_hub import login
      login(token="your_hugging_face_api_token")
    • Replace "your_hugging_face_api_token" with the token you generated in the previous step.

  4. Load the Model:

    • After logging in, you can load the model as usual. The authentication will allow you to access models that require it.

For models that don't require authentication, like GPT-2 or DistilBERT, you can load them directly without this step. However, if your customers are working with restricted models, you’ll need to guide them through the authentication process as described above.
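
Once login() has run, gated checkpoints load with the same from_pretrained calls as before; the Hub client picks up your stored token automatically. A short sketch in which the repo id is purely illustrative, so substitute a gated model you have actually been granted access to (and that fits in Colab memory):

      # Illustrative only: replace with a gated model you have access to
      gated_model_name = "meta-llama/Llama-2-7b-hf"
      gated_tokenizer = AutoTokenizer.from_pretrained(gated_model_name)
      gated_model = AutoModelForCausalLM.from_pretrained(gated_model_name)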

Step 2: Preparing Your Dataset and Fine-Tuning the Model

Now that you have your model set up in Google Colab, it's time to prepare your dataset and fine-tune the model. In this example, we'll be using a dataset with Reddit comment trees, focusing on the AggregatedText and CommentText columns, and using the top 5% of comments by like count for fine-tuning.

2.1 Loading Your Dataset into Google Colab

  1. Upload Your Dataset to Google Drive:

    • Ensure your dataset is uploaded to your Google Drive. You should know the exact path to your dataset (e.g., /content/drive/MyDrive/RedditData.csv).
  2. Load the Dataset in Colab:

    • Use the following code to load your dataset into a Pandas DataFrame:

      import pandas as pd

      data_path = '/content/drive/MyDrive/RedditData.csv'

      df = pd.read_csv(data_path)

    • Replace '/content/drive/MyDrive/RedditData.csv' with the actual path to your dataset in Google Drive.
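
The rest of this guide assumes the file contains AggregatedText, CommentText, and like_count columns (your export may name them differently). A quick check before moving on:

      # Confirm the columns used later in this guide are present
      print(df.columns.tolist())
      print(len(df), "rows loaded")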

2.2 Filtering the Top Comments

  1. Calculate the Top 5% by Like Count:

    • First, calculate the threshold to determine the top 5% of comments by like count:

      like_threshold = df['like_count'].quantile(0.95)

    • This will give you the like count that separates the top 5% from the rest.

  2. Filter the Dataset:

    • Next, filter the dataset to include only the comments that are in the top 5%:

      top_comments_df = df[df['like_count'] >= like_threshold]

  3. Check the Result:

    • Verify that your DataFrame now contains only the top 5% of comments:

      print(top_comments_df.head())
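
It can also help to confirm how much data survives the 5% cut, since very small training sets tend to overfit:

      # How many comments made the top-5% cut
      print(f"{len(top_comments_df)} of {len(df)} comments kept for fine-tuning")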

2.3 Preparing the Data for Fine-Tuning

  1. Prepare Inputs for the Model:

    • For fine-tuning, you'll want to use the AggregatedText as input and the CommentText as the target for prediction:

      inputs = top_comments_df['AggregatedText'].tolist()

      targets = top_comments_df['CommentText'].tolist()

  2. Tokenize the Data:

    • Use your tokenizer to convert the text into tokens that the model can understand. GPT-2's tokenizer has no padding token by default, so assign one before padding:

      tokenizer.pad_token = tokenizer.eos_token

      input_tokens = tokenizer(inputs, return_tensors="pt", padding=True, truncation=True)

      target_tokens = tokenizer(targets, return_tensors="pt", padding=True, truncation=True)
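
A note on how these tensors are used: GPT-2 is a causal language model, so the Trainer in the next section learns from one token stream per example rather than from separate input/target pairs. If you want the model to explicitly learn the mapping from AggregatedText to CommentText, one common approach is to concatenate each pair into a single training string before tokenizing. A minimal sketch, using the EOS token as a separator (any consistent delimiter works):

      # Sketch: join each thread (prompt) with its top comment (target) into one sequence
      train_texts = [
          f"{prompt}{tokenizer.eos_token}{target}"
          for prompt, target in zip(inputs, targets)
      ]
      train_encodings = tokenizer(train_texts, return_tensors="pt", padding=True, truncation=True, max_length=512)

If you go this route, pass train_encodings instead of input_tokens to the training setup in the next step.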

2.4 Fine-Tuning the Model

  1. Set Up the Trainer:

    • Use the Hugging Face Trainer class to handle the fine-tuning process. The Trainer expects a dataset that returns one example per index, so wrap the tokenized batch in a small torch Dataset and let a data collator build the labels:

      import torch
      from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

      class RedditDataset(torch.utils.data.Dataset):  # minimal wrapper: one tokenized example per index
          def __init__(self, encodings):
              self.encodings = encodings
          def __len__(self):
              return self.encodings['input_ids'].size(0)
          def __getitem__(self, idx):
              return {key: val[idx] for key, val in self.encodings.items()}

      train_dataset = RedditDataset(input_tokens)  # or RedditDataset(train_encodings) if you built the concatenated examples

      training_args = TrainingArguments(
          output_dir='./results',          # output directory
          num_train_epochs=3,              # number of training epochs
          per_device_train_batch_size=8,   # batch size for training
          save_steps=10_000,               # save a checkpoint every 10,000 steps
          save_total_limit=2,              # only keep the last 2 checkpoints
      )
      trainer = Trainer(
          model=model,                     # the instantiated 🤗 Transformers model to be trained
          args=training_args,              # training arguments, defined above
          train_dataset=train_dataset,     # training dataset
          data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # copies input_ids to labels for causal LM
      )
  2. Start Training:

    • Begin the fine-tuning process by running:

      trainer.train()

      # Save the fine-tuned model and tokenizer so they can be reloaded in Step 3
      trainer.save_model('./results')
      tokenizer.save_pretrained('./results')

    • This process might take some time depending on your dataset and model size.
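
Fine-tuning is far faster on a GPU runtime. In Colab you can switch via Runtime > Change runtime type and select a GPU; the Trainer uses it automatically when one is available. A quick check:

      import torch
      print(torch.cuda.is_available())  # True when a GPU runtime is active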

2.5 Evaluating the Model

  1. Generate Predictions:

    • After training, you can generate predictions with the fine-tuned model to see how well it produces popular comments. Generating for every row would be slow, so spot-check a few threads, and use max_new_tokens so long inputs aren't cut off by the overall length limit:

      # Spot-check the first few threads rather than the full dataset
      predictions = model.generate(
          input_tokens['input_ids'][:5],
          attention_mask=input_tokens['attention_mask'][:5],
          max_new_tokens=50,
          pad_token_id=tokenizer.eos_token_id,
      )

      for i, prediction in enumerate(predictions):
          print(f"Original: {targets[i]}")
          print(f"Predicted: {tokenizer.decode(prediction, skip_special_tokens=True)}")

  2. Assess Performance:

    • Compare the predicted comments to the original comments to assess the model’s performance.
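
Eyeballing a few examples is a reasonable first pass. If you want a rough quantitative signal, an overlap metric such as ROUGE is a common choice; a minimal sketch, assuming the evaluate and rouge_score packages are installed (!pip install evaluate rouge_score):

      # Rough overlap between generated text and the actual top comments.
      # Note: generate() output includes the prompt tokens, so these scores are only indicative.
      import evaluate

      rouge = evaluate.load("rouge")
      decoded_preds = [tokenizer.decode(p, skip_special_tokens=True) for p in predictions]
      scores = rouge.compute(predictions=decoded_preds, references=targets[:len(decoded_preds)])
      print(scores)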

Step 3: Testing the Fine-Tuned Model on Real-Time Reddit Data

Now that you've fine-tuned your model to predict popular comments, it's time to put it to the test with real-time Reddit data. In this step, we'll walk through how to use your fine-tuned model to generate a prediction for a real-time comment string.

3.1 Setting Up the Real-Time Prediction Environment

  1. Import Necessary Libraries:

    • First, make sure you've imported all the libraries you need:

      import torch

      from transformers import AutoTokenizer, AutoModelForCausalLM

  2. Load Your Fine-Tuned Model:

    • If your model is saved in a directory after fine-tuning, load it like this:

      model_name_or_path = './results'  # replace with the actual path where your model is saved

      tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

      model = AutoModelForCausalLM.from_pretrained(model_name_or_path)

3.2 Preparing the Real-Time Comment for Prediction

  1. Input the Real-Time Comment:

    • You'll need a string input representing the comment thread you want to respond to. Store it in a variable named real_time_comment (a complete example string appears in section 3.4 below).

  2. Create the Input for the Model:

    • Tokenize the real-time comment to prepare it for the model:

      input_tokens = tokenizer(real_time_comment, return_tensors="pt")

3.3 Generating the Prediction

  1. Generate the Prediction:

    • Use the fine-tuned model to generate a prediction based on the real-time comment input:

      predicted_tokens = model.generate(input_tokens['input_ids'], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

      predicted_comment = tokenizer.decode(predicted_tokens[0], skip_special_tokens=True)

  2. Output the Predicted Comment:

    • Print the predicted comment, which represents what your model thinks would be a viral or popular response:

      print(f"Original Comment: {real_time_comment}")

      print(f"Predicted Popular Comment: {predicted_comment}")

3.4 Setting Up a Simple Interface (Optional)

  1. Creating a Function for Repeated Use:

    • To streamline the process, you can create a function that takes any input string and returns the predicted popular comment:

      def generate_viral_comment(input_string):
          input_tokens = tokenizer(input_string, return_tensors="pt")
          predicted_tokens = model.generate(input_tokens['input_ids'], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
          return tokenizer.decode(predicted_tokens[0], skip_special_tokens=True)

      # Example usage
      real_time_comment = "User 1: Faux Patriotism. User 2: The perfect metaphor for Fox News? User 3: Maybe for the Republican Party in general as well."
      predicted_comment = generate_viral_comment(real_time_comment)
      print(f"Original Comment: {real_time_comment}")
      print(f"Predicted Popular Comment: {predicted_comment}")
  2. Creating a Simple User Interface:

    • If you’d like to create a simple text interface for others to use, you could add a text input and a button in a Jupyter Notebook or Colab:

      from IPython.display import display
      import ipywidgets as widgets

      # Called whenever the button is clicked: read the text box and print a prediction
      def on_button_click(b):
          input_string = text_box.value
          predicted_comment = generate_viral_comment(input_string)
          print(f"Original Comment: {input_string}")
          print(f"Predicted Popular Comment: {predicted_comment}")

      text_box = widgets.Text(
          value='Type your comment here...',
          description='Comment:',
      )
      generate_button = widgets.Button(
          description='Generate Prediction',
      )

      generate_button.on_click(on_button_click)
      display(text_box, generate_button)
    • This code will create a text box where users can input their comment and a button that, when clicked, will display the predicted popular comment.

3.5 Testing and Iteration

  1. Test with Different Comments:

    • Try testing with various real-time Reddit comments to see how well your model predicts viral responses.
  2. Refine Based on Feedback:

    • Use the results to further fine-tune the model if necessary or adjust your data preprocessing steps to improve prediction accuracy.
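
For quick, repeatable testing, a small loop over a handful of sample threads works well. A minimal sketch; the strings below are placeholders for whatever comment threads you want to try:

      # Placeholder test threads: substitute real comment strings you want to try
      test_comments = [
          "User 1: placeholder thread one.",
          "User 1: placeholder thread two. User 2: placeholder reply.",
      ]
      for comment in test_comments:
          print(f"Original Comment: {comment}")
          print(f"Predicted Popular Comment: {generate_viral_comment(comment)}")
          print("-" * 40)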