---
title: Gh Action
emoji: 🐨
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
license: mit
---

## Overview

Link to Application

This project demonstrates fine-tuning of the LLaMA 3.2 3B model using the QLoRA strategy. With LoRA, the model's original weights are frozen and small "adapter" matrices are trained instead. This drastically reduces the number of trainable parameters, which matters when starting from LLMs with billions of parameters. QLoRA extends LoRA by quantizing the frozen base weights into a compressed low-precision format, reducing the memory footprint of the entire process.
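To illustrate the parameter savings, here is a minimal sketch of the LoRA counting argument for a single weight matrix (the 3072×3072 dimensions and rank 16 are illustrative assumptions, not the exact LLaMA 3.2 3B configuration):

```python
# Sketch of the LoRA parameter-count argument (illustrative numbers,
# not the exact LLaMA 3.2 3B dimensions).

def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning trains the whole d_out x d_in matrix W;
    LoRA freezes W and trains two small adapter matrices
    A (rank x d_in) and B (d_out x rank), so the update is W + B @ A.
    """
    full = d_out * d_in
    lora = rank * d_in + d_out * rank
    return full, lora

full, lora = lora_param_counts(3072, 3072, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
# → full: 9,437,184  lora: 98,304  ratio: 96x
```

A roughly 96x reduction per projection is why LoRA-style methods make billion-parameter models trainable on a single GPU.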

In this lab, the LLaMA 3.2 3B model is fine-tuned on the FineTome-100k dataset, which contains instruction (or question) and answer pairs. The fine-tuned model is served in a Gradio app as a chatbot. The chatbot was repurposed as an AI teaching assistant that helps students answer questions and explains machine learning concepts.
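A minimal sketch of how such a Gradio chat app could be wired up; `generate_reply` is a hypothetical stand-in for the actual fine-tuned model call, which is not shown here:

```python
# Sketch of the chatbot glue code; `generate_reply` is a stand-in
# for the real fine-tuned model call (hypothetical, not the app's code).

def generate_reply(message: str) -> str:
    # Placeholder: the real app would run the fine-tuned LLaMA model here.
    return f"(model answer to: {message})"

def respond(message, history):
    """Gradio ChatInterface-style handler: `history` holds the prior turns.
    It is ignored here because the app keeps little conversational context."""
    return generate_reply(message)

if __name__ == "__main__":
    # In the real app this handler would be passed to
    # gr.ChatInterface(respond).launch() to serve the chatbot.
    print(respond("What is overfitting?", []))
```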

To expand the chatbot's usage, we built a quiz that helps students test their understanding of machine learning. The quiz is generated by the fine-tuned model: it creates ten multiple-choice questions along with their answers. The user selects an option and receives feedback on whether the answer is correct.
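One way the quiz text could be parsed, assuming the model is prompted to emit a fixed `Q:` / `A)`–`D)` / `Answer:` layout (the exact format used in the app is an assumption):

```python
import re

# Hypothetical parser for quiz text in a fixed format the model is
# prompted to follow; the exact layout used by the app is an assumption.
QUESTION_RE = re.compile(
    r"Q:\s*(?P<question>.+?)\n"
    r"A\)\s*(?P<a>.+?)\n"
    r"B\)\s*(?P<b>.+?)\n"
    r"C\)\s*(?P<c>.+?)\n"
    r"D\)\s*(?P<d>.+?)\n"
    r"Answer:\s*(?P<answer>[ABCD])"
)

def parse_quiz(text: str):
    """Extract (question, options, correct_letter) tuples from model output."""
    quiz = []
    for m in QUESTION_RE.finditer(text):
        options = [m.group(k) for k in ("a", "b", "c", "d")]
        quiz.append((m.group("question"), options, m.group("answer")))
    return quiz

sample = (
    "Q: Which loss is typical for binary classification?\n"
    "A) MSE\nB) Cross-entropy\nC) Hinge only\nD) MAE\n"
    "Answer: B\n"
)
print(parse_quiz(sample))
```

Pinning the format in the prompt and parsing with a strict pattern is what makes the "consistent format" requirement mentioned under Challenges tractable.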

## Evaluation

We evaluated the fine-tuned model on the ARC Challenge, a benchmark designed to test reasoning and logic rather than memorization. For an AI TA, the ability to reason and provide guided, instructive answers is more important than pure accuracy, which is why we started with this dataset. We used 500 questions from the dataset for practical runtime considerations.
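Scoring a multiple-choice benchmark like ARC reduces to comparing the predicted answer letter against the gold label; a sketch of the scoring step (dataset loading and model inference are assumed, not shown):

```python
# Simple multiple-choice scorer; dataset access and model inference
# are assumed to happen elsewhere.

def choice_accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly.

    `predictions` and `gold` are equal-length sequences of answer
    letters such as "A".."D"; normalization handles stray case/whitespace.
    """
    assert len(predictions) == len(gold)
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)

# Toy example with 4 questions, 3 answered correctly:
print(choice_accuracy(["A", "c", "B", "D"], ["A", "C", "B", "A"]))  # → 0.75
```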

We also evaluated both models on the TruthfulQA dataset, which aligns more closely with the FineTome data by emphasizing factual and instruction-following responses.

| Model          | ARC Accuracy | TruthfulQA |
|----------------|--------------|------------|
| Fine-tuned     | 51%          | 78%        |
| Baseline LLaMA | 67%          | 53%        |

The drop in accuracy on ARC is expected: fine-tuning encourages the model to follow instructions, which can slightly compromise raw reasoning performance. However, on the simpler question-and-answer format of TruthfulQA, the fine-tuned model performed much better than the baseline, so the trade-off depends on what you want to prioritize.

## Challenges

  1. Memory constraints: saving models to the GGUF format with Unsloth frequently ran out of RAM. We experimented with various model sizes until one could be saved successfully.

  2. Quiz generation and parsing: Crafting a prompt that reliably produced questions, options, and answers in a consistent format was challenging. Correct parsing was essential to ensure the quiz behaved as expected.

  3. Model output variability: The model’s output varied across prompts. While we accounted for most observed formats, some questions may have been correctly answered but incorrectly parsed, potentially underestimating model performance.

## Improvements

  1. The chatbot keeps little context of previous conversations: we do not feed previous answers back into the prompt because inference was too slow. This fits a question-answer interaction, since context from earlier answers is usually not important. We could build a smart way of detecting whether the user is asking a follow-up question and, only in those cases, add the previous answer as context to the prompt sent to the LLM. This would dynamically toggle context injection on and off and improve the user experience.

  2. The quiz is generated by an LLM that is far from 100% accurate in reasoning and answering questions. To make the quiz more valuable, we should give feedback to the LLM when the answer it marks as correct is in fact wrong (or when other options are also correct). We looked for an ML-themed dataset for further fine-tuning but could not find one; such a dataset would greatly improve the quality of the generated quizzes.