update accuracy: normalized answer comparison result
README.md
CHANGED
@@ -27,19 +27,20 @@ This model is a fine-tuned version of [microsoft/MiniLM-L12-H384-uncased](https:

 It achieves the following results on the evaluation set:
 - Loss: 1.4653
-- Exact Match Accuracy:
+- Exact Match Accuracy: 62.95%

 ## Evaluation Notes

 #### Issues with Exact Match Evaluation
 Several correct predictions were incorrectly marked as false negatives due to strict exact-match criteria being sensitive to minor differences in tokenization, formatting, or span boundaries:

-- Predicted: `
-- Predicted: `
-- Predicted: `
+- Predicted: `schrodinger equation` → Rejected (expected: `schrödinger equation`)
+- Predicted: `feynman diagrams` → Rejected (expected: `feynman`)
+- Predicted: `electromagnetic force` → Rejected (expected: `electromagnetic`)
+- Predicted: `45 000 pounds` → Rejected (expected: `45000 pounds`)

 #### Overall Performance
-- Exact-match accuracy: **>
+- Exact-match accuracy: **>63%**
 - The model frequently generates high-quality and semantically correct answer spans even when exact-match evaluation penalizes them.
 - Primary limitation: performance drops on questions requiring deep domain-specific knowledge, largely attributable to the model's relatively small size and limited parameter capacity.
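The rejected predictions listed in the README change fall into two groups: formatting differences (diacritics, digit grouping) and span-boundary differences (an extra or missing word). The sketch below illustrates one way a normalized comparison could treat them. It is not the author's evaluation script: the function names `normalize_answer`, `exact_match`, and `token_f1` are hypothetical, the lowercase/punctuation/article steps follow the common SQuAD-style convention, and the diacritic folding and digit-group joining are added assumptions chosen to cover the examples shown.

```python
import re
import string
import unicodedata
from collections import Counter


def normalize_answer(text: str) -> str:
    """Normalize an answer string before comparison (illustrative assumption)."""
    # Decompose accented characters and drop combining marks: "ö" -> "o".
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower()
    # Heuristic: join space-separated digit groups, "45 000" -> "45000".
    text = re.sub(r"(?<=\d)\s+(?=\d{3}\b)", "", text)
    # Drop ASCII punctuation, then English articles, then extra whitespace.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when both strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, which gives partial credit for span-boundary mismatches."""
    pred = normalize_answer(prediction).split()
    ref = normalize_answer(reference).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("schrodinger equation", "schrödinger equation"))  # True
    print(exact_match("45 000 pounds", "45000 pounds"))                 # True
    print(exact_match("feynman diagrams", "feynman"))                   # False
    print(round(token_f1("feynman diagrams", "feynman"), 2))            # 0.67
```

Under these assumptions, normalization recovers the diacritic and digit-grouping cases, while the span-boundary cases (`feynman diagrams`, `electromagnetic force`) still fail exact match and only receive partial credit from a token-level F1. The digit-joining rule is a heuristic and could merge unrelated adjacent numbers, so it would need checking against the actual dataset.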