# Lightweight Fine-Tuning Project

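This notebook makes the following choices:

* PEFT technique: LoRA (Low-Rank Adaptation)
* Model: GPT-2 (`gpt2`) with a sequence classification head, loaded in 8-bit via bitsandbytes
* Evaluation approach: accuracy on a held-out 20% test split, computed with the Hugging Face `Trainer`
* Fine-tuning dataset: `sms_spam`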

## Loading and Evaluating a Foundation Model

The cells below load the chosen pre-trained Hugging Face model, its tokenizer, and the dataset, and then check the model's predictions before fine-tuning.

# 1. Installation

Below we install the necessary packages for this notebook.

In [1]:
# You will need to choose "Kernel > Restart Kernel" from the menu after executing this cell

!pip install evaluate scikit-learn "datasets==3.2.0" bitsandbytes

Defaulting to user installation because normal site-packages is not writeable
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Collecting scikit-learn
  Downloading scikit_learn-1.6.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.5 MB)
Collecting datasets==3.2.0
  Downloading datasets-3.2.0-py3-none-any.whl (480 kB)
Collecting requests>=2.32.2
  Downloading requests-2.32.3-py3-none-any.whl (64 kB)
Collecting huggingface-hub>=0.23.0

# 2. Imports

In this section, we import the libraries and modules we will need.

In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from datasets import load_dataset
import pandas as pd
import torch
import numpy as np

from peft import LoraConfig, TaskType, get_peft_model, AutoPeftModelForSequenceClassification
from transformers import DataCollatorWithPadding, TrainingArguments, Trainer, BitsAndBytesConfig
from huggingface_hub import login

# 3. Tokenizer Initialization

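We load the GPT-2 tokenizer and set its maximum sequence length to 1024 tokens, which matches GPT-2's context window. GPT-2 has no dedicated pad token, so we reuse the EOS token for padding.

In [3]:
MODEL_NAME = "gpt2"

# `model_max_length` is the tokenizer argument that caps the sequence length
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, model_max_length=1024)
tokenizer.pad_token = tokenizer.eos_token  # Set pad token to the EOS token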



# 4. Load and Preview Dataset

We load the **sms_spam** dataset and split it into training and test sets. We then display the raw dataset for inspection.

In [4]:
dataset_name = "sms_spam"

raw_datasets = load_dataset(dataset_name, split="train").train_test_split(
    test_size=0.2, shuffle=True, seed=23
)

# Display basic dataset info
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 4459
    })
    test: Dataset({
        features: ['sms', 'label'],
        num_rows: 1115
    })
})

# 5. Tokenize the Dataset

Here, we define a `tokenize_func` function and apply it to the dataset. We remove the original 'sms' column to keep the dataset clean.

In [5]:
def tokenize_func(element):
    return tokenizer(
        element['sms'],
        padding="max_length",
        truncation=True
    )

splits = ["train", "test"]
tokenized_datasets = {}

for split in splits:
    tokenized_datasets[split] = raw_datasets[split].map(
        tokenize_func,
        batched=True,
        remove_columns=["sms"]
    )

# Inspect the resulting tokenized dataset
tokenized_datasets

{'train': Dataset({
     features: ['label', 'input_ids', 'attention_mask'],
     num_rows: 4459
 }),
 'test': Dataset({
     features: ['label', 'input_ids', 'attention_mask'],
     num_rows: 1115
 })}
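
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a single example. The helper `pad_and_truncate` is hypothetical (it is not part of the tokenizer API), and the pad id 50256 is assumed to be GPT-2's EOS token id, which this notebook reuses as the pad token.

```python
PAD_ID = 50256  # assumed: GPT-2's EOS token id, reused here as the pad token


def pad_and_truncate(token_ids, max_length, pad_id=PAD_ID):
    """Hypothetical helper mimicking padding="max_length" with truncation=True."""
    ids = list(token_ids)[:max_length]   # truncation: drop tokens past max_length
    n_real = len(ids)
    n_pad = max_length - n_real          # padding: fill the rest with pad_id
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * n_real + [0] * n_pad,
    }


# A three-token message padded to a (shortened, illustrative) length of 8
example = pad_and_truncate([18690, 986, 198], max_length=8)
print(example["input_ids"])
print(example["attention_mask"])
```

The real tokenizer pads to its configured maximum length rather than 8; the short length here only keeps the output readable.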

# 6. Quick Look at Tokenized Data

Convert the first few examples to a DataFrame to see how they look after tokenization.

In [6]:
df = pd.DataFrame(tokenized_datasets["train"][:5])
df

| | label | input_ids | attention_mask |
|---|---|---|---|
| 0 | 1 | [25383, 534, 5175, 838, 285, 9998, 30, 10133, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... |
| 1 | 0 | [7594, 220, 1222, 2528, 26, 2, 5, 13655, 26, 8... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, ... |
| 2 | 0 | [19926, 314, 423, 6497, 510, 257, 14507, 393, ... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, ... |
| 3 | 0 | [18565, 306, 8508, 319, 428, 1323, 290, 340, 1... | [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, ... |
| 4 | 0 | [18690, 986, 198, 50256, 50256, 50256, 50256, ... | [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


# 7. Load Pretrained Model

We load a GPT-2 based model configured for sequence classification with 2 labels (spam / not spam).

We also quantize the model to 8 bits with the bitsandbytes library (`load_in_8bit=True`) to reduce its memory footprint.

In [7]:
quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    quantization_config=quantization_config,
    torch_dtype="auto",
    num_labels=2,
    id2label={0: "not spam", 1: "spam"},
    label2id={"not spam": 0, "spam": 1},
    max_position_embeddings=1024,
    use_safetensors=True
)

# GPT-2 was trained without a pad token, so we align the model config with the tokenizer's pad token.
model.config.pad_token_id = tokenizer.eos_token_id
print(f"memory footprint {model.get_memory_footprint()}")
print(model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


memory footprint 255544368
GPT2ForSequenceClassification(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Linear8bitLt(in_features=768, out_features=2304, bias=True)
          (c_proj): Linear8bitLt(in_features=768, out_features=768, bias=True)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Linear8bitLt(in_features=768, out_features=3072, bias=True)
          (c_proj): Linear8bitLt(in_features=3072, out_features=768, bias=True)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f
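
As a rough intuition for what 8-bit quantization does to a weight vector, the sketch below applies plain absmax quantization in pure Python. This is a simplified illustration under my own assumptions, not the bitsandbytes implementation, which is more involved; the function names and the sample weights are made up.

```python
def quantize_absmax(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.03, 0.49]          # made-up weights for illustration
codes, scale = quantize_absmax(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)      # small integers that fit in one byte each
print(max_error)  # rounding error, bounded by half a quantization step
```

Each weight is stored in one byte instead of two or four, at the cost of a small rounding error per weight.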

# 8. Testing Inference Before Fine-Tuning

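Let's create a helper function `run_inference` and test it on two texts: a harmless joke and a spam-like message. Because the classification head is still randomly initialized (see the warning above), predictions at this stage are not meaningful.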

In [8]:
def run_inference(text, model):
    # Tokenize
    encoded_prompt = tokenizer(
        text,
        truncation=True,
        return_tensors="pt"
    )
    # Move the inputs to the model's device (which may be a GPU)
    input_ids = encoded_prompt["input_ids"].to(model.device)
    attention_mask = encoded_prompt["attention_mask"].to(model.device)

    # Run inference
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    # Predict and print
    predicted_class_idx = torch.argmax(logits, dim=-1).item()
    predicted_label = model.config.id2label[predicted_class_idx]

    print(f"Text: {text}")
    print(f"Predicted label: {predicted_label}")

text = "Two muffins are sitting in an oven. One muffin turns to the other and asks “Is it just me, or is it hot in here?” The other muffin says OH MY GOD A TALKING MUFFIN!!!"
run_inference(text, model)
print("\n")
text = "Amazon is sending you a refunding of $32.64. Please reply with your bank account and routing number to receive your refund."
run_inference(text, model)



Text: Two muffins are sitting in an oven. One muffin turns to the other and asks “Is it just me, or is it hot in here?” The other muffin says OH MY GOD A TALKING MUFFIN!!!
Predicted label: not spam


Text: Amazon is sending you a refunding of $32.64. Please reply with your bank account and routing number to receive your refund.
Predicted label: not spam
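
The step from logits to a label can be sketched without any model at all. The snippet below is a minimal pure-Python illustration of the argmax that `run_inference` performs, with a softmax added only to show the implied probabilities. The logits are made up, and `classify` is a hypothetical helper name.

```python
import math

ID2LABEL = {0: "not spam", 1: "spam"}  # same mapping as passed to the model above


def classify(logits, id2label=ID2LABEL):
    """Return the predicted label and softmax probabilities for one example."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return id2label[best], probs


label, probs = classify([0.3, -0.1])  # made-up logits, not real model output
print(label, probs)
```

Note that the softmax does not change which class wins; taking the argmax of the raw logits, as the notebook does, gives the same label.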


# 9. Configure LoRA for GPT-2

We define a LoRA configuration, specifying how to adapt GPT-2 using LoRA (Low-Rank Adaptation).

The **rank** and **alpha** values are chosen empirically. We start with _r = 8_ and _alpha = 32_, which scales the low-rank update by alpha / r = 4. Because our model is relatively small and the training set is limited, these values are a reasonable balance between adapter capacity and the number of trainable parameters.

In [9]:
lora_config = LoraConfig(
    r=8,  # Rank number
    lora_alpha=32,  # Scaling factor
    task_type=TaskType.SEQ_CLS,  # Sequence Classification Task
    fan_in_fan_out=True  # GPT-2 requires this. Source: https://stackoverflow.com/questions/78122986/struggling-with-hugging-face-peft
)
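
To show what `r` and `lora_alpha` control, here is a minimal pure-Python sketch of the standard LoRA forward pass, `h = W x + (alpha / r) * B (A x)`, on tiny made-up matrices. The helper names are hypothetical, and the sketch uses r = 2 and alpha = 8 so that the scaling factor matches the notebook's 32 / 8 = 4.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, alpha):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)               # frozen pretrained weights
    update = matvec(B, matvec(A, x))  # low-rank path through the rank-r bottleneck
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # 2 x 3 frozen weight (made up)
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # r x 3 down-projection
B_zero = [[0.0, 0.0], [0.0, 0.0]]        # B starts at zero, so the adapter is a no-op
B_trained = [[0.5, 0.0], [0.0, 0.25]]    # pretend values after training
x = [1.0, 2.0, 3.0]

before = lora_forward(W, A, B_zero, x, r=2, alpha=8)
after = lora_forward(W, A, B_trained, x, r=2, alpha=8)
print(before, after)
```

Only `A` and `B` are trained; `W` stays frozen, which is why the trainable parameter count is so small.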

# 10. Integrate LoRA with the Base Model

We apply the LoRA configuration to our loaded GPT-2 model.

In [10]:
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

trainable params: 297,984 || all params: 124,737,792 || trainable%: 0.23888830740245906
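
The reported count can be roughly reconstructed by hand. The arithmetic below assumes LoRA is attached only to the `c_attn` projection (768 to 2304) in each of the 12 blocks, which is consistent with the module printouts in this notebook but is still an assumption about the adapter's target modules.

```python
HIDDEN = 768        # GPT-2 hidden size
C_ATTN_OUT = 2304   # output features of c_attn (query, key, value stacked)
N_LAYERS = 12
RANK = 8

# Each adapted layer adds A (RANK x HIDDEN) and B (C_ATTN_OUT x RANK)
lora_per_layer = RANK * HIDDEN + C_ATTN_OUT * RANK
lora_total = N_LAYERS * lora_per_layer

reported_trainable = 297_984
reported_total = 124_737_792

remainder = reported_trainable - lora_total
percent = 100 * reported_trainable / reported_total

print(lora_per_layer, lora_total, remainder, percent)
```

Under that assumption the LoRA matrices account for 294,912 parameters. The remaining 3,072 presumably belong to the 768 x 2 classification head, which PEFT keeps trainable for sequence classification tasks.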


# 11. Create Data Collator

We create a data collator to pad the tokenized examples to a fixed length for uniformity.

In [11]:
data_collator = DataCollatorWithPadding(
    tokenizer=tokenizer,
    padding="max_length"
)

# 12. Set Up Training Arguments

Here we define the training parameters, including batch size, learning rate, and number of epochs.

See the [docs](https://huggingface.co/docs/transformers/en/training#training-hyperparameters) for more info.

In [12]:
training_args = TrainingArguments(
    output_dir="peft_model",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    evaluation_strategy="steps",
    logging_steps=100,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    weight_decay=0.01,
    warmup_steps=50,
    lr_scheduler_type="cosine",
    learning_rate=5e-4,
    save_steps=100,
    fp16=True,  # Only to be used with GPU
    push_to_hub=False,
    report_to="none"
)
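
A few consequences of these settings can be worked out by hand. The sketch below assumes a single device and the 4,459 training rows shown earlier; it computes the effective batch size, an approximate number of optimizer steps, and a learning rate curve with linear warmup followed by cosine decay to zero. The function `lr_at` is a hypothetical re-implementation for illustration, and the `Trainer`'s exact step count may differ by a step depending on the library version.

```python
import math

TRAIN_ROWS = 4459
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4
WARMUP_STEPS = 50
BASE_LR = 5e-4

# Gradients from 4 micro-batches of 2 are accumulated before each optimizer step
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
total_steps = math.ceil(TRAIN_ROWS / effective_batch)  # approximate, one epoch


def lr_at(step):
    """Linear warmup to BASE_LR, then cosine decay to zero (illustrative)."""
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, total_steps)
for step in (0, 25, 50, 300, total_steps):
    print(step, lr_at(step))
```

With `logging_steps=100` and roughly 558 steps in the epoch, this is consistent with evaluation being reported at steps 100 through 500.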

# 13. Define Metrics

We define a simple accuracy metric for evaluating our model on the spam classification task.

In [13]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

# 14. Initialize Trainer

Hugging Face's `Trainer` simplifies the training loop. Here we specify:
- Our LoRA-adapted model
- Training arguments
- Tokenizer
- Training and evaluation datasets
- Data collator
- Metrics for evaluation

In [14]:
trainer = Trainer(
    model=peft_model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# 15. Train and Evaluate the Model

In [15]:
trainer.train()
trainer.evaluate()

You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 100 | 0.5555 | 0.093548 | 0.982063 |
| 200 | 0.0929 | 0.077017 | 0.983857 |
| 300 | 0.0605 | 0.083342 | 0.98565 |
| 400 | 0.062 | 0.070049 | 0.98565 |
| 500 | 0.0474 | 0.070088 | 0.986547 |


Checkpoint destination directory peft_model/checkpoint-100 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory peft_model/checkpoint-200 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory peft_model/checkpoint-300 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory peft_model/checkpoint-400 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory peft_model/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.


{'eval_loss': 0.06869951635599136,
 'eval_accuracy': 0.9865470852017937,
 'eval_runtime': 88.8715,
 'eval_samples_per_second': 12.546,
 'eval_steps_per_second': 6.279,
 'epoch': 1.0}
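
The evaluation numbers can be sanity-checked with a little arithmetic. The sketch below uses only values printed in this notebook (1,115 test rows, an eval batch size of 2, and the metrics above) to translate the accuracy into a count of misclassified messages.

```python
import math

TEST_ROWS = 1115
EVAL_BATCH = 2
eval_accuracy = 0.9865470852017937
eval_runtime = 88.8715
eval_steps_per_second = 6.279

correct = round(eval_accuracy * TEST_ROWS)       # correctly classified messages
errors = TEST_ROWS - correct                     # misclassified messages
eval_batches = math.ceil(TEST_ROWS / EVAL_BATCH) # batches needed to cover the test set

print(correct, errors, eval_batches)
```

In other words, the fine-tuned model misclassifies 15 of the 1,115 held-out messages, and the reported throughput matches the 558 evaluation batches.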

# 16. Save the Model

We save our LoRA-adapted GPT-2 model locally.

In [16]:
peft_model.save_pretrained("peft_model")

# 17. Load the Fine-Tuned Model

We can reload the fine-tuned model from the saved weights to confirm everything works.

In [17]:
from peft import AutoPeftModelForSequenceClassification
peft_model = AutoPeftModelForSequenceClassification.from_pretrained("peft_model")
print(peft_model)

Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at gpt2 and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


PeftModelForSequenceClassification(
  (base_model): LoraModel(
    (model): GPT2ForSequenceClassification(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Linear(
                in_features=768, out_features=2304, bias=True
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=768, out_features=8, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=8, out_features=2304, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()

# 18. Test the Final Model on Example Texts

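We test our saved and reloaded model on both a "spam" and a "non-spam" message. The reloaded model prints the default names `LABEL_1` and `LABEL_0` because the `id2label` mapping was not passed again on reload; `LABEL_1` corresponds to "spam" and `LABEL_0` to "not spam".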

In [18]:
text_spam = "Amazon is sending you a refunding of $32.64. Please reply with your bank account and routing number to receive your refund."
text_no_spam = "Two muffins are sitting in an oven. One muffin turns to the other and asks “Is it just me, or is it hot in here?” The other muffin says OH MY GOD A TALKING MUFFIN!!!"

run_inference(text_spam, peft_model)
print("\n")
run_inference(text_no_spam, peft_model)

Text: Amazon is sending you a refunding of $32.64. Please reply with your bank account and routing number to receive your refund.
Predicted label: LABEL_1


Text: Two muffins are sitting in an oven. One muffin turns to the other and asks “Is it just me, or is it hot in here?” The other muffin says OH MY GOD A TALKING MUFFIN!!!
Predicted label: LABEL_0
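
If readable names are wanted without reloading the model, the default `LABEL_<index>` names can be translated after the fact. The helper below is a hypothetical post-processing sketch in pure Python; passing the same `id2label` and `label2id` arguments when reloading the model would likely be the cleaner fix, but that is not verified in this notebook.

```python
ID2LABEL = {0: "not spam", 1: "spam"}  # mapping used when the base model was first loaded


def readable_label(raw_label, id2label=ID2LABEL):
    """Translate a default 'LABEL_<index>' name into a human-readable label."""
    prefix = "LABEL_"
    index = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and index.isdigit():
        # Unknown indices fall back to the raw name rather than raising
        return id2label.get(int(index), raw_label)
    return raw_label  # already readable, or not in the expected format


print(readable_label("LABEL_1"))
print(readable_label("LABEL_0"))
```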


# 19. (Optional) Push Model to Hugging Face Hub

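I think making the model publicly available is good practice. In the run recorded below, the push failed with a 401 error because the login credentials were rejected, so a valid Hugging Face access token is needed for this step to succeed.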

In [19]:
login()
peft_model.push_to_hub("jonathanbenavides/peft-model-gpt2-sms-spam-bitsandbytes-8bits")

HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-677a9b22-5b17384404e2d12a33c9ac8a;42e1e3ac-32aa-4bd0-8da4-eaaf58a1cbdb)

Invalid username or password.