---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-4B
pipeline_tag: text-generation
language:
- bn
- en
tags:
- math
- bengali
- reasoning
- grpo
- curriculum-learning
datasets:
- dipta007/Ganit
---

# GanitLLM-4B_CGRPO

[arXiv](https://arxiv.org/)
[Dataset](https://huggingface.co/datasets/dipta007/Ganit)
[Collection](https://huggingface.co/collections/dipta007/ganitllm)

## Highlights

**GanitLLM-4B_CGRPO** is a Bengali mathematical reasoning model trained with Curriculum-GRPO directly on the base model (without SFT). This variant achieves the highest raw accuracy but reasons primarily in English. Key results:

- **+13.2 accuracy points** on the Bn-MGSM benchmark (69.2 → 82.4)
- **+8.0 accuracy points** on the Bn-MSVAMP benchmark (70.5 → 78.5)
- **14.94% Bengali reasoning** (close to the base model's 14.79%)
- **10.5% fewer words** in generated solutions (943 → 844 words on average)

> **Note**: This model achieves high accuracy but does not reason in Bengali. For Bengali reasoning, use [GanitLLM-4B_SFT_CGRPO](https://huggingface.co/dipta007/GanitLLM-4B_SFT_CGRPO) instead.

## Model Overview

| Property | Value |
|----------|-------|
| **Model Type** | Causal Language Model |
| **Base Model** | Qwen/Qwen3-4B |
| **Parameters** | 4B |
| **Training** | Curriculum-GRPO (no SFT) |
| **Context Length** | 4,096 tokens |
| **Language** | Bengali, English |

## Training Details

This model was trained with a single-stage pipeline:

1. **Curriculum-GRPO**: Reinforcement learning with difficulty-aware sampling directly on the base model using GANIT-RLVR (~7.3k examples)
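
This card does not describe how the difficulty-aware sampling is implemented. As a rough illustration only, and not the authors' procedure, one simple curriculum sorts examples by a difficulty score and widens the sampling pool from easy to hard as training progresses. The `difficulty` field, the linear pacing, and both function names below are hypothetical.

```python
import random


def curriculum_pool(examples, progress, min_fraction=0.2):
    """Return the easiest slice of `examples` available at this point in training.

    `progress` runs from 0.0 (start) to 1.0 (end). The pool grows linearly
    from `min_fraction` of the data to all of it (an assumed pacing schedule).
    """
    ranked = sorted(examples, key=lambda ex: ex["difficulty"])
    fraction = min(1.0, min_fraction + (1.0 - min_fraction) * progress)
    cutoff = max(1, int(round(fraction * len(ranked))))
    return ranked[:cutoff]


def sample_batch(examples, progress, batch_size, rng):
    """Draw a batch uniformly (with replacement) from the current pool."""
    pool = curriculum_pool(examples, progress)
    return [rng.choice(pool) for _ in range(batch_size)]


# Toy data: ten problems whose (hypothetical) difficulty equals their id.
toy_examples = [{"id": i, "difficulty": i} for i in range(10)]
early_batch = sample_batch(toy_examples, progress=0.0, batch_size=4, rng=random.Random(0))
late_batch = sample_batch(toy_examples, progress=1.0, batch_size=4, rng=random.Random(0))
```

Early in training the batch is drawn only from the two easiest toy problems; by the end, every problem is eligible.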

### Reward Functions
- **Format Reward**: Checks that the output follows the `<think>` and `<answer>` tag structure
- **Correctness Reward**: +2.0 for a Bengali answer match, +1.0 for an English answer match
- **Bengali Reasoning Reward**: Rewards reasoning in which more than 80% of the text is Bengali
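
The three rewards listed above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the training code: the card gives only the +2.0/+1.0 correctness values and the 80% threshold, so the 1.0 value for the other two rewards, the exact tag pattern, the reading of "Bengali answer" as an answer written in Bengali numerals, and the character-based Bengali ratio are all assumptions.

```python
import re

# Assumed layout: one <think> block followed by one <answer> block.
FORMAT_RE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)
BN_TO_ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")


def format_reward(completion: str) -> float:
    """1.0 (assumed value) when the tag structure is present, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0


def correctness_reward(completion: str, gold: str) -> float:
    """+2.0 for a correct answer in Bengali numerals, +1.0 in English numerals."""
    match = FORMAT_RE.match(completion)
    if not match:
        return 0.0
    answer = match.group(2).strip()
    if answer.translate(BN_TO_ASCII) != gold.strip().translate(BN_TO_ASCII):
        return 0.0
    uses_bengali_digits = any("০" <= ch <= "৯" for ch in answer)
    return 2.0 if uses_bengali_digits else 1.0


def bengali_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the Bengali Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    bengali = sum(1 for ch in letters if "\u0980" <= ch <= "\u09ff")
    return bengali / len(letters)


def bengali_reasoning_reward(completion: str, threshold: float = 0.8) -> float:
    """1.0 (assumed value) when the <think> text is more than 80% Bengali."""
    match = FORMAT_RE.match(completion)
    if not match:
        return 0.0
    return 1.0 if bengali_ratio(match.group(1)) > threshold else 0.0


bn_completion = "<think>১২ থেকে ৫ বাদ দিলে ৭ থাকে।</think><answer>৭</answer>"
en_completion = "<think>12 minus 5 is 7</think><answer>7</answer>"
```

With these definitions, `bn_completion` earns the format, the +2.0 correctness, and the Bengali reasoning rewards, while `en_completion` earns the format reward and only +1.0 for correctness.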

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dipta007/GanitLLM-4B_CGRPO"

# Load the tokenizer and model; dtype and device placement are picked automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# "A shop has 12 apples. If 5 apples are sold, how many apples will be left?"
problem = "একটি দোকানে ১২টি আপেল আছে। যদি ৫টি আপেল বিক্রি হয়, তাহলে কতটি আপেল বাকি থাকবে?"

prompt = f"""A conversation takes place between the user and the assistant. The user asks a question, and the assistant solves the problem. Please reason step by step in Bengali, and put your final answer in the <answer> </answer> tags.

Question: {problem}"""

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# do_sample=True makes the temperature setting take effect explicitly.
generated_ids = model.generate(**model_inputs, max_new_tokens=2048, do_sample=True, temperature=0.7)
# Keep only the newly generated tokens (drop the prompt).
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
response = tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
```
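
Because the prompt asks for the final answer inside `<answer> </answer>` tags, a small helper can pull it out of the decoded response. The sketch below is one possible way to do this; the name `extract_answer` and the choice to normalize Bengali digits to ASCII are illustrative assumptions, not part of the model's tooling.

```python
import re

# Map Bengali digits to ASCII so answers can be compared as plain strings.
BN_TO_ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")


def extract_answer(response: str):
    """Return the text of the last <answer>...</answer> pair, or None if absent."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if not matches:
        return None
    return matches[-1].strip().translate(BN_TO_ASCII)


sample_response = "১২ - ৫ = ৭। <answer> ৭ </answer>"
parsed = extract_answer(sample_response)
```

Taking the last match guards against responses that mention the tags earlier in the reasoning.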

### Using vLLM

```bash
vllm serve dipta007/GanitLLM-4B_CGRPO --max-model-len 4096
```
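
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server listens on vLLM's default address, `http://localhost:8000`; the helper names `build_request` and `solve` are hypothetical, and the sampling settings simply mirror the Quickstart.

```python
import json
import urllib.request

# Assumption: `vllm serve` was started locally with its default host and port.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "dipta007/GanitLLM-4B_CGRPO"

INSTRUCTION = (
    "A conversation takes place between the user and the assistant. "
    "The user asks a question, and the assistant solves the problem. "
    "Please reason step by step in Bengali, and put your final answer "
    "in the <answer> </answer> tags."
)


def build_request(problem: str, max_tokens: int = 2048, temperature: float = 0.7):
    """Build the POST request for the chat completions endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [
            {"role": "user", "content": f"{INSTRUCTION}\n\nQuestion: {problem}"}
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def solve(problem: str) -> str:
    """Send the problem to the running server and return the model's reply."""
    with urllib.request.urlopen(build_request(problem)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `solve(problem)` with a Bengali problem string returns the generated solution text. The call fails with a connection error if no server is listening at `BASE_URL`.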

## Performance

| Model | Bn-MGSM | Bn-MSVAMP | Avg. Words | Bengali % |
|-------|---------|-----------|------------|-----------|
| Qwen3-4B (base) | 69.20 | 70.50 | 943 | 14.79% |
| **GanitLLM-4B_CGRPO** | **82.40** | **78.50** | **844** | **14.94%** |
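
The headline deltas follow directly from the two rows of the table. As a quick check, the snippet below recomputes them; the dictionary keys are just illustrative names.

```python
# Values copied from the performance table.
base = {"bn_mgsm": 69.20, "bn_msvamp": 70.50, "avg_words": 943}
cgrpo = {"bn_mgsm": 82.40, "bn_msvamp": 78.50, "avg_words": 844}

# Absolute accuracy gains in percentage points.
mgsm_gain = round(cgrpo["bn_mgsm"] - base["bn_mgsm"], 1)
msvamp_gain = round(cgrpo["bn_msvamp"] - base["bn_msvamp"], 1)

# Relative reduction in average solution length, in percent.
word_reduction = round(
    100 * (base["avg_words"] - cgrpo["avg_words"]) / base["avg_words"], 1
)

print(mgsm_gain, msvamp_gain, word_reduction)  # prints: 13.2 8.0 10.5
```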

## Related Models

| Model | Parameters | Training | Link |
|-------|------------|----------|------|
| GanitLLM-4B_SFT_CGRPO | 4B | SFT + CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-4B_SFT_CGRPO) |
| GanitLLM-4B_SFT_GRPO | 4B | SFT + GRPO | [Link](https://huggingface.co/dipta007/GanitLLM-4B_SFT_GRPO) |
| **GanitLLM-4B_CGRPO** | 4B | CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-4B_CGRPO) |
| GanitLLM-1.7B_CGRPO | 1.7B | CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-1.7B_CGRPO) |
| GanitLLM-0.6B_CGRPO | 0.6B | CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-0.6B_CGRPO) |

## Citation

```bibtex
will be updated
```

## License

This model is released under the Apache 2.0 License.