rrrr66254 committed
Commit 1827d56 · verified · 1 Parent(s): 47f5f99

Update README.md

Files changed (1)
  1. README.md +17 -6
README.md CHANGED
@@ -4,6 +4,8 @@ language:
 - en
 metrics:
 - bertscore
+- bleu
+- rouge
 base_model:
 - facebook/bart-base
 ---
@@ -120,16 +122,25 @@ This model does not explicitly disaggregate results by demographic group, signer
 
 #### Metrics
 
-- **Primary metric**: BERTScore (F1)
+- **Primary metric**: BERTScore (F1), BLEU, and ROUGE
 - **Model selection**: Best checkpoint based on highest validation BERTScore-F1
-- BERTScore is preferred for this task due to its alignment with semantic quality over token-level exactness (e.g., BLEU or ROUGE)
+- BERTScore is used to evaluate semantic alignment, while BLEU and ROUGE provide additional insight into surface-level n-gram overlap. All metrics were evaluated using the same held-out set of 500 gloss-reference pairs.
 
 ### Results
 
-After 2 epochs of training, the model achieved:
+After 2 epochs of training, the model achieved the following on the 500-pair evaluation set:
 
-- **BERTScore-F1**: 0.83 on held-out evaluation set of 500 gloss-reference pairs
-- Qualitative inspection confirms that most outputs are fluent and contextually aligned, though some suffer from missing function words or incorrect verb tenses.
+- **BERTScore-F1**: 0.83
+- **BLEU Scores**:
+  - BLEU-1: 0.7063
+  - BLEU-2: 0.6175
+  - BLEU-3: 0.5479
+  - BLEU-4: 0.4821
+- **ROUGE Scores**:
+  - ROUGE-1: 0.7587
+  - ROUGE-2: 0.5874
+  - ROUGE-L: 0.7312
+- Qualitative inspection shows that most model outputs are fluent and contextually accurate. Common errors include omission of function words and minor verb tense mismatches.
 
 
 #### Summary
@@ -142,4 +153,4 @@ This model demonstrates strong potential for gloss-to-English translation, with
 
 ## Model Card Contact
 
-- rrrr66254@gmail.com
+- rrrr66254@gmail.com
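The updated card reports BERTScore-F1 alongside BLEU-1 through BLEU-4 and ROUGE-1/2/L, all computed on the same held-out set of 500 gloss-reference pairs. The commit does not include the evaluation code, so the snippet below is only a minimal sketch of how such scores could be reproduced with the Hugging Face `evaluate` library; the placeholder `predictions`/`references` lists and the use of cumulative BLEU (max n-gram order 1 to 4) are assumptions, not details taken from the model card.

```python
# Minimal sketch (assumed, not the author's evaluation script): scoring
# gloss-to-English outputs with BERTScore, BLEU, and ROUGE via the
# Hugging Face `evaluate` library.
import evaluate

# Placeholders standing in for the 500 gloss-reference pairs of the held-out set.
predictions = ["the weather is nice today"]       # hypothetical model outputs
references = ["the weather is very nice today"]   # hypothetical gold English sentences

# BERTScore (primary metric): average the per-sentence F1 values.
bertscore = evaluate.load("bertscore")
bs = bertscore.compute(predictions=predictions, references=references, lang="en")
bertscore_f1 = sum(bs["f1"]) / len(bs["f1"])

# BLEU-1..4, taken here as cumulative BLEU with max n-gram order 1..4
# (the card does not say which BLEU convention was used).
bleu = evaluate.load("bleu")
bleu_scores = {
    f"BLEU-{n}": bleu.compute(
        predictions=predictions,
        references=[[r] for r in references],  # one reference per prediction
        max_order=n,
    )["bleu"]
    for n in range(1, 5)
}

# ROUGE-1, ROUGE-2, ROUGE-L (F-measure, the library default).
rouge = evaluate.load("rouge")
rouge_scores = rouge.compute(predictions=predictions, references=references)

print({
    "BERTScore-F1": round(bertscore_f1, 4),
    **{k: round(v, 4) for k, v in bleu_scores.items()},
    "ROUGE-1": round(rouge_scores["rouge1"], 4),
    "ROUGE-2": round(rouge_scores["rouge2"], 4),
    "ROUGE-L": round(rouge_scores["rougeL"], 4),
})
```

Note that BERTScore is averaged over per-sentence F1 values here; whether the reported BLEU-n figures are cumulative scores or individual n-gram precisions is not stated in the card, so the cumulative convention above is only one plausible reading.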