Update README.md
added evaluation metric.

README.md (CHANGED)
@@ -26,6 +26,26 @@ This model is a LoRA (Low-Rank Adaptation) fine-tuned version of **Qwen2.5-1.5B-
 
 ---
 
+## Evaluation on MATH-500 Benchmark
+
+Following the sampling-based Pass@1 methodology inspired by [DeepSeek R1](https://arxiv.org/abs/2501.12948), we evaluated the model with the settings below:
+
+| Parameter | Value |
+|------------------|---------|
+| **Dataset** | `HuggingFaceH4/MATH-500` |
+| **Temperature** | `0.6` |
+| **Top_p** | `0.95` |
+| **Num_samples** | `16` per question |
+
+### Results
+
+- **At-least-one-correct Rate:** **54.60%** (273 out of 500 questions)
+
+*This metric represents the percentage of questions with at least one correct solution among multiple generated attempts.*
+
+---
+
 ## How to Use
 
 ### Example Python Script
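The at-least-one-correct rate the commit adds can be sketched in a few lines. This is a minimal illustration, not the evaluation script behind the commit: the function name and the per-question boolean layout are assumptions, and the toy data is invented.

```python
def at_least_one_correct_rate(results):
    """Fraction of questions where any sampled answer is correct.

    `results` maps each question ID to a list of booleans, one per
    generated sample (e.g. 16 samples per question, as in the table).
    """
    solved = sum(1 for samples in results.values() if any(samples))
    return solved / len(results)


# Toy example: 4 questions, 4 samples each (hypothetical data).
demo = {
    "q1": [False, True, False, False],   # solved at least once
    "q2": [False, False, False, False],  # never solved
    "q3": [True, True, True, True],      # solved
    "q4": [False, False, True, False],   # solved
}
print(f"{at_least_one_correct_rate(demo):.2%}")  # 3 of 4 questions -> 75.00%
```

Applied to the numbers above, 273 solved questions out of 500 gives the reported 54.60%.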