pfost-bit committed on
Commit 29d7722 · verified · 1 Parent(s): 26afa61

Update README.md

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -75,6 +75,7 @@ For this model's evaluation I used three metrics that are common in natural lang
 * BERT
 * ROUGE
 * BLEU
+ * HellaSwag

 The primary evaluation is the BERT score, a way to calculate the similarity between two text inputs. BERT aims to assess semantic similarity: it compares the actual forecast against the generated forecast to see whether they carry similar semantic meaning. A higher BERT score is better.

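For concreteness, here is a minimal sketch of how a BERT score like the one described above might be computed with the Hugging Face `evaluate` library. The forecast strings and the `lang="en"` setting are placeholders; the actual evaluation script is not part of this commit.

```python
# Sketch: BERTScore between a generated forecast and the actual forecast.
# The strings below are made-up placeholders, not real model output.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["Clean 3-4 ft sets this morning with light offshore wind."]
references = ["Expect 3-4 ft waves early, glassy conditions until midday."]

results = bertscore.compute(predictions=predictions, references=references, lang="en")
print(f"BERTScore F1: {results['f1'][0]:.4f}")  # higher is better
```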
@@ -82,6 +83,8 @@ ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to see if

 BLEU (Bilingual Evaluation Understudy) measures how many words from the generated text also appear in the human-written reference text. This should show whether the model is picking up on the "surfer lingo"; a higher BLEU score is better.

+ In order to address catastrophic forgetting, I also ran the HellaSwag benchmark to see how the model performs before and after training on the custom task.
+
 I chose two models of a similar size from large AI labs as benchmarks.

 | | QWEN-4B-Instruct-2507 | Llama-3.2-3B-Instruct | google/gemma-2-2b-it | SurfMine |
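Similarly, a hedged sketch of how the ROUGE and BLEU metrics described above might be computed with the same `evaluate` library; again the forecast strings are placeholders rather than the project's real evaluation data.

```python
# Sketch: ROUGE and BLEU between generated and reference forecasts.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["Fun waist-high peelers early before the onshore wind fills in."]
references = ["Waist-high surf in the morning, turning choppy as wind picks up."]

rouge_scores = rouge.compute(predictions=predictions, references=references)
# BLEU expects one or more reference texts per prediction.
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(f"ROUGE-1: {rouge_scores['rouge1']:.4f}")  # unigram overlap with the reference
print(f"BLEU:    {bleu_scores['bleu']:.4f}")     # n-gram precision ("surfer lingo" overlap)
```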
@@ -89,9 +92,12 @@ I chose two models of a similar size from large AI labs as benchmarks.
 | BERT | **.8215** | .8141 | .8201 | **.8717** |
 | ROUGE | **.1097** | .1053 | .1075 | **.2074** |
 | BLEU | .0051 | .0032 | **.0059** | **.0702** |
+ | HellaSwag | .40 | .40 | .40 | .35 |

 SurfMine does better on the BERT, ROUGE, and BLEU metrics when compared to the base model as well as the chosen benchmark models.

+ However, we do see some degradation on the HellaSwag `acc_none` metric, indicating reduced generality. This is to be expected, and future changes to the training pipeline might mitigate it. Since this model is not intended to be used as a general-purpose model, the loss of generality is not the end of the world.
+

 ## Usage and Intended Uses
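For the HellaSwag row, a minimal sketch using EleutherAI's lm-evaluation-harness (`pip install lm-eval`), whose `acc,none` output corresponds to the `acc_none` figure referenced above. The model id is a placeholder; the exact harness invocation is not shown in this commit.

```python
# Sketch: zero-shot HellaSwag accuracy with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/SurfMine",  # placeholder model id
    tasks=["hellaswag"],
    num_fewshot=0,
)

# The harness reports plain accuracy under the key "acc,none".
print(results["results"]["hellaswag"]["acc,none"])
```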