bnolton/AP_Stat_Inference_Helper

bnolton committed on Dec 2, 2025

Commit 81649ea (verified) · 1 Parent(s): bacb539

Update README.md

Files changed (1)

README.md +33 -10


@@ -11,19 +11,38 @@ base_model:
 ## Data
 ## Methodology
 ## Evaluation
-| Model                    | mmlu_hs_stats | miverva_math | race | bert_prec | bert_recall | bert_f1 |
-|--------------------------|---------------|--------------|------|-----------|-------------|---------|
-| AP_Stat_Inference_Helper | 0.72          | 0.45         | 0.32 | 0.75      | 0.85        | 0.80    |
-| Qwen                     | 0.72          | 0.45         | 0.32 | 0.75      | 0.85        | 0.80    |
-| Llama                    | 0.30          | 0.29         | 0.38 | x.xx      | x.xx        | x.xx    |
-| Mistral                  | 0.46          | 0.09         | 0.38 | x.xx      | x.xx        | x.xx    |
 ## Usage and Intended Use
@@ -68,7 +87,11 @@ Since the p-value is less than the significance level of 0.05, we reject the nul
 There is sufficient evidence at the 0.05 significance level to conclude that the response rate in this situation is greater than the previous rate of 40%. The investigator's theory that people would respond at a higher rate when the distributor was stigmatized appears to be supported by the data.
 ## Limitations
-The dataset for this model is solely focus on the inference procedures for the AP Statistics class. This model did not improve on the metrics, however it did improve the format of the answers to the questions asked.
 # Model Card for Model ID

 ## Data
+The dataset used in training, created by the owner of this model, consists of 1014 question-answer pairs on the topic of inference for the AP Statistics exam.
+Specifically, the problems cover confidence intervals and significance tests for one- and two-sample means and proportions (no chi-square or inference for slope).
+These problems were taken from three textbooks (two paid copies, one open source).
+There are 928 free-response questions and 86 multiple-choice questions.
+This dataset was split 80/20 between a training set and a validation set using a random seed of 42.
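
A minimal sketch of how an 80/20 split with a fixed seed could be reproduced. The card does not say which library or shuffling routine was used, so Python's `random` module, the `split_dataset` helper, and the placeholder `pairs` list are assumptions for illustration; the actual membership of each split will differ.

```python
import random


def split_dataset(pairs, train_fraction=0.8, seed=42):
    """Shuffle a copy of `pairs` with a fixed seed, then cut it into train/validation lists."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the 1014 question-answer pairs (hypothetical).
pairs = [{"id": i} for i in range(1014)]
train, val = split_dataset(pairs)
print(len(train), len(val))  # 811 203
```

With 1014 pairs, an 80/20 cut gives 811 training and 203 validation examples, and rerunning with the same seed returns the same split.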
 ## Methodology
+The finetuning method used was LoRA.
+LoRA is a moderate-intervention finetuning method that seems well suited to my task.
+It is computationally efficient, preserves the knowledge of the base model well, and has smaller file sizes, which means the latency of the model is minimally impacted.
+This is ideal for an AI tutor, since students these days need answers immediately or they move on to other things.
+They also tend to go down rabbit holes, so while I'm specifically training this as a statistics tutor, keeping the base model's knowledge when they explore those rabbit holes can be important.
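
To make the efficiency claim concrete, here is a rough sketch of why LoRA adapters are small. LoRA leaves a weight matrix of shape `d_out x d_in` frozen and trains two low-rank factors, `B` (`d_out x r`) and `A` (`r x d_in`). The layer size and rank below are hypothetical; the card does not report the rank or target modules used for this model.

```python
def full_param_count(d_out, d_in):
    """Trainable parameters when the whole weight matrix is finetuned."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Trainable parameters for LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
print(full, lora, lora / full)  # 16777216 65536 0.00390625
```

Under these assumed sizes, the adapter trains well under 1% of the parameters in that layer, which is why the saved adapter files stay small.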
 ## Evaluation
+The metrics used to evaluate this model are the mmlu_high_school_statistics, minerva_math, and race benchmarks. The BERT metrics (precision, recall, and F1) are also reported.
+The mmlu_high_school_statistics benchmark is the main statistics knowledge benchmark, minerva_math serves as a general math benchmark to ensure that general math knowledge was not lost while gaining statistics knowledge, and the race benchmark serves to establish that the model does not suffer from catastrophic forgetting.
+As can be seen, the new model does not improve on the scores of the base model on any of the evaluation metrics.
+Not improving on the minerva_math and race benchmarks should be expected, and the fact that they did not decrease is a good sign.
+Not improving on mmlu_high_school_statistics is a bit disheartening.
+However, the benchmark score is one thing; how the model actually performs is another.
+In testing, the base model frequently made math errors and, though it "answered" the question, left off key pieces required for the AP Exam.
+The finetuned model fixes these errors and gives answers in a format much more suitable for the AP Exam, with fewer mistakes (see Expected Output Format).
+This may be due to the specific focus of the model itself (inference problems for the AP test) versus the broad nature of the benchmark (all of high school statistics in a non-AP setting).
+So, while the benchmark scores do not indicate success, the model does perform better in real-world scenarios, indicating the finetuning was a success.
+The model was compared to Llama-3.2-3B-Instruct and Mistral7B-Instruct-v0.2 and shows superior metrics on mmlu_high_school_statistics and minerva_math while having a comparable race metric.
+| Model                    | mmlu_high_school_statistics | minerva_math | race | bert_prec | bert_recall | bert_f1 |
+|--------------------------|-----------------------------|--------------|------|-----------|-------------|---------|
+| AP_Stat_Inference_Helper | 0.72                        | 0.45         | 0.32 | 0.75      | 0.85        | 0.80    |
+| Qwen3-4B-Instruct-2507   | 0.72                        | 0.45         | 0.32 | 0.75      | 0.85        | 0.80    |
+| Llama-3.2-3B-Instruct    | 0.30                        | 0.29         | 0.38 | x.xx      | x.xx        | x.xx    |
+| Mistral7B-Instruct-v0.2  | 0.46                        | 0.09         | 0.38 | x.xx      | x.xx        | x.xx    |
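
As a quick consistency check on the table, F1 is conventionally the harmonic mean of precision and recall. The sketch below assumes the `bert_*` columns follow that convention. If the scores were averaged per example, the reported F1 need not equal the harmonic mean of the averaged precision and recall exactly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for AP_Stat_Inference_Helper in the table.
f1 = f1_score(0.75, 0.85)
print(round(f1, 2))  # 0.8
```

The harmonic mean of 0.75 and 0.85 is about 0.797, which rounds to the 0.80 shown in the `bert_f1` column.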
 ## Usage and Intended Use
 There is sufficient evidence at the 0.05 significance level to conclude that the response rate in this situation is greater than the previous rate of 40%. The investigator's theory that people would respond at a higher rate when the distributor was stigmatized appears to be supported by the data.
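
For reference, the procedure behind a conclusion like the one above is a right-tailed one-sample z-test for a proportion. The sketch below uses only the null value of 0.40 from the example. The sample size and success count are hypothetical stand-ins, since the original counts are not shown in this excerpt, and `one_proportion_z_test` is an illustrative helper, not part of the model.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Right-tailed one-sample z-test for a proportion (H0: p = p0, Ha: p > p0)."""
    # Large counts condition: expected successes and failures under H0 are both at least 10.
    if n * p0 < 10 or n * (1 - p0) < 10:
        raise ValueError("Large counts condition is not met.")
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail area of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical data: 50 responses out of 100, tested against the previous rate of 40%.
z, p_value = one_proportion_z_test(successes=50, n=100, p0=0.40)
print(round(z, 2), round(p_value, 3))
```

With these made-up counts the p-value falls below 0.05, so the null hypothesis would be rejected, mirroring the structure of the conclusion above.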
 ## Limitations
+The dataset for this model is solely focused on the inference procedures for the AP Statistics class.
+The specific inference procedures are confidence intervals and significance tests for one- and two-sample means and proportions (no chi-square or inference for slope).
+While the AP test would require drawing a curve for some of these problems, this model is text only.
+The model may use some terms that are being phased out, because the source problems in the dataset were published before the AP Statistics rework (for example: independence instead of 10% check and normality instead of large counts condition).
 # Model Card for Model ID