---
base_model:
- Qwen/Qwen2.5-3B
datasets:
- MegaScience/MegaScience
language:
- en
license: apache-2.0
metrics:
- accuracy
pipeline_tag: text-generation
library_name: transformers
---

# [MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning](https://huggingface.co/papers/2507.16812)

This repository contains `Qwen2.5-3B-MegaScience`, one of the models trained as part of the MegaScience project.

For the official code, data processing pipeline, and evaluation system, please refer to the [MegaScience GitHub repository](https://github.com/GAIR-NLP/lm-open-science-evaluation).

## Qwen2.5-3B-MegaScience

### Usage

You can use this model with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MegaScience/Qwen2.5-3B-MegaScience"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",  # load weights in the checkpoint's native precision
    device_map="auto",   # place the model on the available device(s)
)

# Build a chat-formatted prompt with the model's chat template
prompt = "The capital of France is"
messages = [
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=20)
# Decode only the newly generated tokens, skipping the prompt
output_ids = generated_ids[0][model_inputs.input_ids.shape[1]:]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```
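
For batched or higher-throughput inference, the model can also be served with [vLLM](https://github.com/vllm-project/vllm). The snippet below is a minimal sketch rather than official MegaScience tooling; the question text and sampling settings are illustrative assumptions.

```python
# Minimal vLLM sketch (assumptions: vLLM is installed; the prompt and
# sampling settings are illustrative, not official MegaScience defaults).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "MegaScience/Qwen2.5-3B-MegaScience"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)

# Apply the same chat template as in the transformers example above
messages = [{"role": "user", "content": "State the second law of thermodynamics."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```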

### Training Recipe

- **LR**: 5e-6
- **LR Schedule**: Cosine
- **Batch Size**: 512
- **Max Length**: 4,096
- **Warmup Ratio**: 0.05
- **Epochs**: 3
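
These hyperparameters correspond to a standard supervised fine-tuning setup. The sketch below shows one hypothetical way to express them with TRL's `SFTTrainer`; it is not the authors' training code, and the dataset field handling, GPU count, batch-size split, and precision are assumptions.

```python
# Hypothetical SFT sketch using TRL; NOT the official MegaScience training code.
# Only the hyperparameter values come from the recipe above.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumption: the dataset columns may need mapping to the trainer's expected format
dataset = load_dataset("MegaScience/MegaScience", split="train")

config = SFTConfig(
    output_dir="qwen2.5-3b-megascience-sft",
    learning_rate=5e-6,            # LR
    lr_scheduler_type="cosine",    # LR Schedule
    warmup_ratio=0.05,             # Warmup Ratio
    num_train_epochs=3,            # Epochs
    max_seq_length=4096,           # Max Length (named max_length in recent TRL releases)
    # Global batch size 512, assuming e.g. 8 GPUs x 8 per device x 8 accumulation steps
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    bf16=True,                     # assumption: bf16 mixed-precision training
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B",       # the base model listed in this card
    train_dataset=dataset,
    args=config,
)
trainer.train()
```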

### Evaluation Results

<div style="display: flex; justify-content: left; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/616bfc2b40e2f69baa1c7add/abIVZ2XB9D-o-TCyvOkDE.png" alt="Evaluation results" style="width:80%;">
</div>

<div style="display: flex; justify-content: left; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/616bfc2b40e2f69baa1c7add/xFTJ7nevc3S4UYJxUS7ue.png" alt="Evaluation results" style="width:80%;">
</div>

### More about MegaScience

<div style="display: flex; justify-content: left; gap: 20px;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/616bfc2b40e2f69baa1c7add/VogIpBbjfNxXFP9DfVMms.png" alt="MegaScience data pipeline" style="width:100%;">
</div>

## Citation

Check out our [paper](https://arxiv.org/abs/2507.16812) for more details. If you use our dataset or find our work useful, please cite:

```bibtex
@article{fan2025megascience,
  title={MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning},
  author={Fan, Run-Ze and Wang, Zengzhi and Liu, Pengfei},
  year={2025},
  journal={arXiv preprint arXiv:2507.16812},
  url={https://arxiv.org/abs/2507.16812}
}
```