Upload folder using huggingface_hub
README.md
CHANGED
@@ -46,10 +46,13 @@ Both datasets were converted to a unified `messages` format compatible with Qwen
 
 | Metric | Method | Few-shot | Score | Std Error |
 |--------|--------|----------|-------|-----------|
-| exact_match | flexible-extract | 5 |
-| exact_match | strict-match | 5 |
+| exact_match | flexible-extract | 5 | **34.12%** | ±1.31% |
+| exact_match | strict-match | 5 | **33.59%** | ±1.30% |
 
 - **Baseline** (Qwen2.5-0.5B-Instruct): 34.42% (flexible-extract), 31.69% (strict-match)
+- **Improvement**:
+  - Flexible-extract: Comparable performance (34.12% vs 34.42%)
+  - Strict-match: **+1.90% improvement** (33.59% vs 31.69%)
 - **Note**: This model was fine-tuned on a curated dataset mixture of 47,473 samples to improve mathematical reasoning capabilities
 
 ### Evaluation Details
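The improvement figures added in this change are differences in percentage points between the fine-tuned scores and the baseline. A minimal sketch of that arithmetic follows, using only the numbers quoted in the README. The dictionary and function names are illustrative and not part of the repository. The baseline's standard error is not reported, so the ratio below compares each difference only to the fine-tuned run's own standard error and should not be read as a significance test.

```python
# Scores copied from the README table and baseline line, in percent.
finetuned = {"flexible-extract": 34.12, "strict-match": 33.59}
baseline = {"flexible-extract": 34.42, "strict-match": 31.69}
# Standard errors of the fine-tuned scores, in percentage points.
stderr = {"flexible-extract": 1.31, "strict-match": 1.30}


def compare(method):
    """Return (difference in percentage points, difference / std error)."""
    delta = round(finetuned[method] - baseline[method], 2)
    ratio = round(delta / stderr[method], 2)
    return delta, ratio


for method in finetuned:
    delta, ratio = compare(method)
    print(f"{method}: {delta:+.2f} pp ({ratio:+.2f} standard errors)")
```

Under these numbers, the strict-match gain of 1.90 percentage points is roughly one and a half times the reported standard error, while the flexible-extract difference of 0.30 points is well inside it.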