pfost-bit committed on
Commit 29d7722 · verified · 1 Parent(s): 26afa61

Update README.md

Files changed (1)
  1. README.md +6 -0
README.md CHANGED
@@ -75,6 +75,7 @@ For this model's evaluation I used three metrics that are common in natural lang
 * BERT
 * ROUGE
 * BLEU
+ * HellaSwag

 The primary evaluation is the BERT score, a way to calculate the similarity between two text inputs. BERT aims to assess semantic similarity: it compares the actual forecast against the generated forecast to see whether they carry similar semantic meaning. A higher BERT score is better.

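For concreteness, here is a minimal sketch of how a BERT score like the one described above might be computed with the Hugging Face `evaluate` library. The forecast strings and the `lang="en"` setting are placeholders; the actual evaluation script is not part of this commit.

```python
# Sketch: BERTScore between a generated forecast and the actual forecast.
# The strings below are made-up placeholders, not real model output.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["Clean 3-4 ft sets this morning with light offshore wind."]
references = ["Expect 3-4 ft waves early, glassy conditions until midday."]

results = bertscore.compute(predictions=predictions, references=references, lang="en")
print(f"BERTScore F1: {results['f1'][0]:.4f}")  # higher is better
```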
@@ -82,6 +83,8 @@ ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is used to see if

 BLEU (Bilingual Evaluation Understudy) measures how many words from the generated text also appear in the human-written reference text. This should show whether the model is picking up on the "surfer lingo"; a higher BLEU score is better.

+ In order to address catastrophic forgetting, I also ran the HellaSwag benchmark to see how the model performs before and after training on the custom task.
+
 I chose two models of a similar size from large AI labs as benchmarks.

 | | QWEN-4B-Instruct-2507 | Llama-3.2-3B-Instruct | google/gemma-2-2b-it | SurfMine |
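Similarly, a hedged sketch of how the ROUGE and BLEU metrics described above might be computed with the same `evaluate` library; again the forecast strings are placeholders rather than the project's real evaluation data.

```python
# Sketch: ROUGE and BLEU between generated and reference forecasts.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["Fun waist-high peelers early before the onshore wind fills in."]
references = ["Waist-high surf in the morning, turning choppy as wind picks up."]

rouge_scores = rouge.compute(predictions=predictions, references=references)
# BLEU expects one or more reference texts per prediction.
bleu_scores = bleu.compute(predictions=predictions, references=[[r] for r in references])

print(f"ROUGE-1: {rouge_scores['rouge1']:.4f}")  # unigram overlap with the reference
print(f"BLEU:    {bleu_scores['bleu']:.4f}")     # n-gram precision ("surfer lingo" overlap)
```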
@@ -89,9 +92,12 @@ I chose two models of a similar size from large AI labs as benchmarks.
 | BERT | **.8215** | .8141 | .8201 | **.8717** |
 | ROUGE | **.1097** | .1053 | .1075 | **.2074** |
 | BLEU | .0051 | .0032 | **.0059** | **.0702** |
+ | HellaSwag | .40 | .40 | .40 | .35 |

 SurfMine does better on the BERT, ROUGE, and BLEU metrics when compared to the base model as well as the chosen benchmark models.

+ However, we do see some degradation on the HellaSwag `acc_none` metric, indicating reduced generality. This is to be expected, and future changes to the training pipeline might mitigate it. Since this model is not intended to be used as a general-purpose model, the loss of generality is not the end of the world.
+

 ## Usage and Intended Uses
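For the HellaSwag row, a minimal sketch using EleutherAI's lm-evaluation-harness (`pip install lm-eval`), whose `acc,none` output corresponds to the `acc_none` figure referenced above. The model id is a placeholder; the exact harness invocation is not shown in this commit.

```python
# Sketch: zero-shot HellaSwag accuracy with lm-evaluation-harness.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/SurfMine",  # placeholder model id
    tasks=["hellaswag"],
    num_fewshot=0,
)

# The harness reports plain accuracy under the key "acc,none".
print(results["results"]["hellaswag"]["acc,none"])
```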