---
title: Gh Action
emoji: 🐨
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 6.0.0
app_file: app.py
pinned: false
license: mit
---

## Overview

Link to Application

This project demonstrates fine-tuning of the LLaMA 3.2 3B model using the QLoRA strategy. With LoRA, the model's original weights are frozen and small "adapter" matrices are trained instead. This drastically reduces the number of trainable parameters, which matters when starting from LLMs with billions of parameters. QLoRA extends LoRA by quantizing the frozen base weights into a compressed low-precision format, reducing the memory footprint of the entire process.
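To illustrate the parameter savings, here is a minimal sketch of the LoRA counting argument for a single weight matrix (the 3072×3072 dimensions and rank 16 are illustrative assumptions, not the exact LLaMA 3.2 3B configuration):

```python
# Sketch of the LoRA parameter-count argument (illustrative numbers,
# not the exact LLaMA 3.2 3B dimensions).

def lora_param_counts(d_in: int, d_out: int, rank: int):
    """Return (full, lora) trainable-parameter counts for one weight matrix.

    Full fine-tuning trains the whole d_out x d_in matrix W;
    LoRA freezes W and trains two small adapter matrices
    A (rank x d_in) and B (d_out x rank), so the update is W + B @ A.
    """
    full = d_out * d_in
    lora = rank * d_in + d_out * rank
    return full, lora

full, lora = lora_param_counts(3072, 3072, 16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
# → full: 9,437,184  lora: 98,304  ratio: 96x
```

A roughly 96x reduction per projection is why LoRA-style methods make billion-parameter models trainable on a single GPU.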

In this lab, the LLaMA 3.2 3B model is fine-tuned on the FineTome-100k dataset, which contains instruction (or question) and answer pairs. The fine-tuned model is served in a Gradio app as a chatbot. The chatbot was repurposed as an AI teaching assistant that helps students answer questions and explains machine learning concepts.
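A minimal sketch of how such a Gradio chat app could be wired up; `generate_reply` is a hypothetical stand-in for the actual fine-tuned model call, which is not shown here:

```python
# Sketch of the chatbot glue code; `generate_reply` is a stand-in
# for the real fine-tuned model call (hypothetical, not the app's code).

def generate_reply(message: str) -> str:
    # Placeholder: the real app would run the fine-tuned LLaMA model here.
    return f"(model answer to: {message})"

def respond(message, history):
    """Gradio ChatInterface-style handler: `history` holds the prior turns.
    It is ignored here because the app keeps little conversational context."""
    return generate_reply(message)

if __name__ == "__main__":
    # In the real app this handler would be passed to
    # gr.ChatInterface(respond).launch() to serve the chatbot.
    print(respond("What is overfitting?", []))
```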

To expand the chatbot's usage, we built a quiz that helps students test their understanding of machine learning. The quiz is generated by the fine-tuned model: it creates ten multiple-choice questions along with their answers. The user selects an option and receives feedback on whether the answer is correct.
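One way the quiz text could be parsed, assuming the model is prompted to emit a fixed `Q:` / `A)`–`D)` / `Answer:` layout (the exact format used in the app is an assumption):

```python
import re

# Hypothetical parser for quiz text in a fixed format the model is
# prompted to follow; the exact layout used by the app is an assumption.
QUESTION_RE = re.compile(
    r"Q:\s*(?P<question>.+?)\n"
    r"A\)\s*(?P<a>.+?)\n"
    r"B\)\s*(?P<b>.+?)\n"
    r"C\)\s*(?P<c>.+?)\n"
    r"D\)\s*(?P<d>.+?)\n"
    r"Answer:\s*(?P<answer>[ABCD])"
)

def parse_quiz(text: str):
    """Extract (question, options, correct_letter) tuples from model output."""
    quiz = []
    for m in QUESTION_RE.finditer(text):
        options = [m.group(k) for k in ("a", "b", "c", "d")]
        quiz.append((m.group("question"), options, m.group("answer")))
    return quiz

sample = (
    "Q: Which loss is typical for binary classification?\n"
    "A) MSE\nB) Cross-entropy\nC) Hinge only\nD) MAE\n"
    "Answer: B\n"
)
print(parse_quiz(sample))
```

Pinning the format in the prompt and parsing with a strict pattern is what makes the "consistent format" requirement mentioned under Challenges tractable.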

## Evaluation

We evaluated the fine-tuned model on the ARC Challenge, a benchmark designed to test reasoning and logic rather than memorization. For an AI TA, the ability to reason and provide guided, instructive answers is more important than pure accuracy, which is why we started with this dataset. We used 500 questions from the dataset for practical runtime considerations.
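Scoring a multiple-choice benchmark like ARC reduces to comparing the predicted answer letter against the gold label; a sketch of the scoring step (dataset loading and model inference are assumed, not shown):

```python
# Simple multiple-choice scorer; dataset access and model inference
# are assumed to happen elsewhere.

def choice_accuracy(predictions, gold):
    """Fraction of multiple-choice questions answered correctly.

    `predictions` and `gold` are equal-length sequences of answer
    letters such as "A".."D"; normalization handles stray case/whitespace.
    """
    assert len(predictions) == len(gold)
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return correct / len(gold)

# Toy example with 4 questions, 3 answered correctly:
print(choice_accuracy(["A", "c", "B", "D"], ["A", "C", "B", "A"]))  # → 0.75
```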

We also evaluated both models on the TruthfulQA dataset, which aligns more closely with the FineTome data by emphasizing factual and instruction-following responses.

| Model          | ARC Accuracy | TruthfulQA |
|----------------|--------------|------------|
| Fine-tuned     | 51%          | 78%        |
| Baseline LLaMA | 67%          | 53%        |

The drop in accuracy on ARC is expected: fine-tuning encourages the model to follow instructions, which can slightly compromise raw reasoning performance. However, on the simpler question-and-answer format of TruthfulQA, the fine-tuned model performed much better than the baseline, so the trade-off depends on what you want to prioritize.

## Challenges

  1. Memory constraints: saving models to the GGUF format with Unsloth frequently ran out of RAM. We experimented with various model sizes until one could be saved successfully.

  2. Quiz generation and parsing: Crafting a prompt that reliably produced questions, options, and answers in a consistent format was challenging. Correct parsing was essential to ensure the quiz behaved as expected.

  3. Model output variability: The model’s output varied across prompts. While we accounted for most observed formats, some questions may have been correctly answered but incorrectly parsed, potentially underestimating model performance.

## Improvements

  1. The chatbot keeps little context of previous conversations: we do not feed previous answers back into the prompt because inference was too slow. This fits a question-answer interaction, since context from earlier answers is usually not important. We could build a smart way of detecting whether the user is asking a follow-up question and, only in those cases, add the previous answer as context to the prompt sent to the LLM. This would dynamically toggle context injection on and off and improve the user experience.

  2. The quiz is generated by an LLM that is far from 100% accurate in reasoning and answering questions. To make the quiz more valuable, we should give feedback to the LLM when the answer it marks as correct is in fact wrong (or when other options are also correct). We looked for an ML-themed dataset for further fine-tuning but could not find one; such a dataset would greatly improve the quality of the generated quizzes.