Update README.md

The finetuning method used was LoRA.
LoRA is a moderate-intervention model editor that seems perfect for my task.
It is computationally efficient, preserves the knowledge of the base model well, and produces small adapter files, which means the latency of the model is minimally impacted.
This is perfect for an AI tutor, since students these days need answers immediately or they move on to other things.
They also tend to go down rabbit holes, so while this model is specifically trained as a statistics tutor, keeping the base model's knowledge for when they explore those rabbit holes can be important.
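Below is a minimal sketch of what this kind of LoRA setup can look like with Hugging Face's PEFT library. The base model name and every hyperparameter shown are illustrative placeholders, not the values actually used for this model.

```python
# Minimal LoRA finetuning setup sketch using Hugging Face PEFT.
# Base model name and hyperparameters below are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct"  # placeholder base model
)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension; small r keeps adapter files small
    lora_alpha=32,                        # scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```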
## Evaluation
The metrics used to evaluate this model are the mmlu_high_school_statistics, minerva_math, and race benchmarks. The BERT benchmarks are also reported.
The mmlu_high_school_statistics benchmark is the main statistics-knowledge benchmark, minerva_math serves as a general math benchmark to ensure that general math knowledge was not lost in gaining statistics knowledge, and the race benchmark serves to establish that the model does not suffer from catastrophic forgetting.
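For reference, here is a hedged sketch of how these three benchmarks could be run with EleutherAI's lm-evaluation-harness; the checkpoint path is a placeholder.

```python
import lm_eval

# Run the three reported benchmarks against a local finetuned checkpoint.
results = lm_eval.simple_evaluate(
    model="hf",                                       # Hugging Face model backend
    model_args="pretrained=path/to/finetuned-model",  # placeholder checkpoint path
    tasks=["mmlu_high_school_statistics", "minerva_math", "race"],
)
print(results["results"])  # per-task metrics
```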
As can be seen from the reported results, this new model does not improve on the base model's scores on any of the evaluation metrics.
Not improving on the minerva_math and race benchmarks should be expected, and the fact that they did not decrease is a good sign.
Not improving on mmlu_high_school_statistics is a bit disheartening.
However, the benchmark score is one thing; how the model actually performs is another.
In testing the base model, the model frequently made math errors and, though it "answered" the question, left out key pieces required for the AP Exam.
The finetuned model fixes these errors and gives answers, with fewer mistakes, in a format much more suitable for the AP Exam (see the Expected Output Format section).
This may be due to the specific nature of the objective of the model itself (inference problems for the AP test) versus the broad nature of the benchmark (all of high school statistics in a non-AP setting).
So, while the benchmark scores do not indicate success, the model does perform better in real-world scenarios, indicating the finetuning was a success.
The model was compared to Llama-3.2-3B-Instruct and Mistral7B-Instruct-v0.2 and shows superior metrics on mmlu_high_school_statistics and minerva_math while having a comparable race metric.