# Qwen2.5-7B-ReasonMed-cot-123k

Full fine-tune of Qwen/Qwen2.5-7B on the CoTMed variant of ReasonMed: chain-of-thought reasoning followed by the final answer, without `<think>` tags. Trained on 123K samples (one third of the full dataset) for 3 epochs.
Training code: https://github.com/Chen-Jie7/NLP_project
## Training data

lingshu-medical-mllm/ReasonMed — CoTMed.json variant. Outputs are free-form CoT reasoning concluding with the answer.
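For illustration, a minimal sketch of turning one CoTMed-style record into a prompt/response SFT pair. The field names `question` and `cot_response` are assumptions — check CoTMed.json for the actual schema:

```python
def to_sft_example(record):
    """Convert a CoTMed-style record into a prompt/response pair.

    Field names ('question', 'cot_response') are illustrative, not the
    verified CoTMed.json schema. The target text is free-form reasoning
    that ends with the answer, with no <think> delimiters.
    """
    return {
        "prompt": record["question"],
        "response": record["cot_response"],  # reasoning + final answer, no tags
    }

example = to_sft_example({
    "question": "Which vitamin deficiency causes scurvy?",
    "cot_response": "Scurvy results from impaired collagen synthesis... "
                    "The answer is vitamin C.",
})
```

The key point of this variant is that the response field mixes reasoning and answer in one untagged stream, unlike the `reason` variant below.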
## Three-way format comparison

All three models were trained with identical hyperparameters on the same 123K samples and evaluated via loglikelihood MCQ scoring.
| Variant | Output format | Total accuracy (%) |
|---|---|---|
| reason | `<think>CoT</think>` + response | 65.8 |
| cot (this model) | CoT without tags | 65.0 |
| response | Direct answer only | 63.8 |
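Loglikelihood MCQ scoring picks the option whose tokens the model assigns the highest total log-probability when appended to the question. A minimal sketch of the selection step, operating on precomputed per-token log-probs (obtaining them from the model is omitted):

```python
def score_choices(choice_logprobs, length_normalize=False):
    """Pick the answer option with the highest log-likelihood.

    choice_logprobs: dict mapping each option to the list of per-token
    log-probabilities the model assigns to that option's tokens when
    conditioned on the question prompt.
    """
    scores = {}
    for choice, logprobs in choice_logprobs.items():
        total = sum(logprobs)
        # Optional length normalization avoids penalizing longer options.
        scores[choice] = total / len(logprobs) if length_normalize else total
    return max(scores, key=scores.get), scores

# Toy example with made-up log-probs: "B" wins with the highest sum.
pred, scores = score_choices({
    "A": [-2.1, -1.5],
    "B": [-0.4, -0.6],
    "C": [-1.9, -2.2],
})
```

Whether the harness used here applies length normalization is an assumption left open; both conventions are common in MCQ evaluation.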
## Evaluation
| Benchmark | Ours (123K samples) | Paper (370K samples) |
|---|---|---|
| MedQA | 61.0 | 66.9 |
| MedMCQA (val) | 60.4 | 65.1 |
| PubMedQA | 75.9 | 82.0 |
| MMLU-Anatomy | 74.8 | 75.6 |
| MMLU-Clinical-Knowledge | 78.1 | 79.3 |
| MMLU-College-Biology | 81.9 | 79.2 |
| MMLU-College-Medicine | 68.8 | 73.4 |
| MMLU-Medical-Genetics | 84.0 | 85.0 |
| MMLU-Professional-Medicine | 79.0 | 80.9 |
| Total | 65.0 | 69.6 |
## Training hyperparameters

- learning_rate: 1e-05
- effective batch size: 128 (8 GPUs × 4 per-device × 4 gradient accumulation)
- num_epochs: 3.0
- lr_scheduler: cosine, warmup_ratio 0.1
- precision: bf16
- deepspeed: ZeRO stage 2
- hardware: 8× H200, ~6.5h
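The hyperparameters above map roughly onto a LLaMA-Factory SFT config. A minimal sketch (the dataset name, DeepSpeed config path, and `cutoff_len` are assumptions; see the linked training repo for the actual config):

```yaml
### model
model_name_or_path: Qwen/Qwen2.5-7B

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: ds_z2_config.json   # ZeRO stage 2 (path is an assumption)

### dataset
dataset: reasonmed_cot         # hypothetical name for the CoTMed subset
cutoff_len: 4096               # assumption

### training
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
```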
## Framework versions

- Transformers 4.57.6
- PyTorch 2.10.0+cu128
- LLaMA-Factory 0.9.5