	---
	library_name: transformers
	license: apache-2.0
	base_model: Qwen/Qwen3-4B
	pipeline_tag: text-generation
	language:
	- bn
	- en
	tags:
	- math
	- bengali
	- reasoning
	- grpo
	- curriculum-learning
	datasets:
	- dipta007/Ganit
	---

	# GanitLLM-4B_CGRPO

	[![Paper](https://img.shields.io/badge/arXiv-Paper-red)](https://arxiv.org/)
	[![Dataset](https://img.shields.io/badge/HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/dipta007/Ganit)
	[![Models](https://img.shields.io/badge/HuggingFace-Models-orange)](https://huggingface.co/collections/dipta007/ganitllm)

	## Highlights

	GanitLLM-4B_CGRPO is a Bengali mathematical reasoning model trained with Curriculum-GRPO directly on the base model (without SFT). This variant achieves the highest raw accuracy but reasons primarily in English. Key results:

	- +13.2 accuracy on Bn-MGSM benchmark (69.2 → 82.4)
	- +8.0 accuracy on Bn-MSVAMP benchmark (70.5 → 78.5)
	- 14.94% of reasoning text in Bengali (base model: 14.79%)
	- 10.5% fewer words in generated solutions (943 → 844 words on average)

	> Note: This model achieves high accuracy but does not reason in Bengali. For Bengali reasoning, use [GanitLLM-4B_SFT_CGRPO](https://huggingface.co/dipta007/GanitLLM-4B_SFT_CGRPO) instead.

	## Model Overview

	| Property | Value |
	|----------|-------|
	| Model Type | Causal Language Model |
	| Base Model | Qwen/Qwen3-4B |
	| Parameters | 4B |
	| Training | Curriculum-GRPO (no SFT) |
	| Context Length | 4,096 tokens |
	| Language | Bengali, English |

	## Training Details

	This model was trained with a single-stage pipeline:

	1. Curriculum-GRPO: Reinforcement learning with difficulty-aware sampling directly on the base model using GANIT-RLVR (~7.3k examples)
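This card does not spell out how difficulty-aware sampling is scheduled. The sketch below shows one hypothetical way to bias batches toward easier problems early in training and harder ones later. The per-example `difficulties` scores, the `progress` variable, and the `sharpness` weighting are all assumptions made for illustration, not the procedure used to train this model.

```python
import random

def curriculum_weights(difficulties, progress, sharpness=4.0):
    """Weight each example by how close its difficulty is to the current target.

    difficulties: hypothetical per-example scores in [0, 1] (0 = easiest).
    progress: fraction of training completed in [0, 1], used as the target difficulty.
    sharpness: hypothetical knob; larger values concentrate sampling near the target.
    """
    return [1.0 / (1.0 + sharpness * abs(d - progress)) for d in difficulties]

def sample_batch(examples, difficulties, progress, batch_size, rng=random):
    """Draw a batch (with replacement) using the curriculum weights."""
    weights = curriculum_weights(difficulties, progress)
    return rng.choices(examples, weights=weights, k=batch_size)
```

With `progress=0.0` the easiest examples receive the largest weights; as `progress` approaches 1.0 the weighting shifts toward the hardest ones.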

	### Reward Functions
	- Format Reward: Validates `<think>` and `<answer>` tag structure
	- Correctness Reward: +2.0 for Bengali answer match, +1.0 for English match
	- Bengali Reasoning Reward: Rewards reasoning in which more than 80% of the text is Bengali
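The three rewards above could be sketched roughly as follows. This is an illustrative reading of the list, not the training code. It assumes that a "Bengali answer match" means the answer is correct and written with Bengali digits, that the format and Bengali-reasoning rewards are worth 1.0 each, and that the Bengali share is measured over Bengali-block and ASCII-letter characters inside `<think>`. All function names are hypothetical.

```python
import re

BN_DIGITS = "০১২৩৪৫৬৭৮৯"
BN_TO_ASCII = str.maketrans(BN_DIGITS, "0123456789")
FORMAT_RE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)

def format_reward(completion):
    """1.0 if the completion is exactly one <think> block followed by one <answer> block."""
    return 1.0 if FORMAT_RE.match(completion) else 0.0

def correctness_reward(completion, gold):
    """+2.0 for a correct answer in Bengali digits, +1.0 for a correct answer in ASCII digits."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    answer = m.group(1).strip()
    if answer.translate(BN_TO_ASCII) != gold.translate(BN_TO_ASCII):
        return 0.0
    return 2.0 if any(ch in BN_DIGITS for ch in answer) else 1.0

def bengali_ratio(text):
    """Share of Bengali-block characters among Bengali-block plus ASCII-letter characters."""
    bengali = sum(1 for ch in text if "\u0980" <= ch <= "\u09ff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    total = bengali + latin
    return bengali / total if total else 0.0

def bengali_reasoning_reward(completion, threshold=0.8):
    """1.0 if more than `threshold` of the reasoning text is Bengali, else 0.0."""
    m = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if bengali_ratio(m.group(1)) > threshold else 0.0
```

Under these assumptions, a completion that reasons in English but gives the right number still earns the format reward and the +1.0 correctness reward.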

	## Quickstart

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "dipta007/GanitLLM-4B_CGRPO"

	# Load the tokenizer and model; dtype and device placement are chosen automatically.
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto",
	)

	problem = "একটি দোকানে ১২টি আপেল আছে। যদি ৫টি আপেল বিক্রি হয়, তাহলে কতটি আপেল বাকি থাকবে?"

	prompt = f"""A conversation takes place between the user and the assistant. The user asks a question, and the assistant solves the problem. Please reason step by step in Bengali, and put your final answer in the <answer> </answer> tags.

	Question: {problem}"""

	# Wrap the prompt in the chat template and tokenize it.
	messages = [{"role": "user", "content": prompt}]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# Sample up to 2048 new tokens; do_sample=True makes the temperature setting explicit.
	generated_ids = model.generate(
	    **model_inputs, max_new_tokens=2048, do_sample=True, temperature=0.7
	)
	# Drop the prompt tokens and decode only the newly generated part.
	output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
	response = tokenizer.decode(output_ids, skip_special_tokens=True)
	print(response)
	```
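Because the prompt asks for the final answer inside `<answer>` tags, the decoded response can be post-processed to pull out just that value. The helper below is a minimal sketch. The name `extract_answer` and the choice to map Bengali digits to ASCII digits are assumptions for illustration and are not part of the model's API.

```python
import re

# Map Bengali digits to ASCII so that "৭" and "7" compare equal.
BN_TO_ASCII = str.maketrans("০১২৩৪৫৬৭৮৯", "0123456789")

def extract_answer(response):
    """Return the content of the last <answer>...</answer> block, or None if absent."""
    matches = re.findall(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if not matches:
        return None
    return matches[-1].strip().translate(BN_TO_ASCII)
```

Calling `extract_answer(response)` on the Quickstart output returns the answer string, or `None` when the model did not emit the tags.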

	### Using vLLM

	```bash
	vllm serve dipta007/GanitLLM-4B_CGRPO --max-model-len 4096
	```
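Once the server is running, it exposes an OpenAI-compatible HTTP API. The sketch below queries it using only the standard library. It assumes the server is reachable at vLLM's default `http://localhost:8000`; the helper names `build_payload` and `ask` are hypothetical, and the prompt text is copied from the Quickstart.

```python
import json
import urllib.request

MODEL = "dipta007/GanitLLM-4B_CGRPO"
BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port

def build_payload(problem, max_tokens=2048, temperature=0.7):
    """Build a chat-completions request body using the Quickstart prompt."""
    prompt = (
        "A conversation takes place between the user and the assistant. "
        "The user asks a question, and the assistant solves the problem. "
        "Please reason step by step in Bengali, and put your final answer "
        "in the <answer> </answer> tags.\n\n"
        f"Question: {problem}"
    )
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def ask(problem):
    """POST the request to the running server and return the generated text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(problem)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With `--max-model-len 4096`, the prompt and `max_tokens` together must fit within 4,096 tokens, so long problems may need a smaller `max_tokens`.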

	## Performance

	| Model | Bn-MGSM | Bn-MSVAMP | Avg. Words | Bengali % |
	|-------|---------|-----------|------------|-----------|
	| Qwen3-4B (base) | 69.20 | 70.50 | 943 | 14.79% |
	| GanitLLM-4B_CGRPO | 82.40 | 78.50 | 844 | 14.94% |

	## Related Models

	| Model | Parameters | Training | Link |
	|-------|------------|----------|------|
	| GanitLLM-4B_SFT_CGRPO | 4B | SFT + CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-4B_SFT_CGRPO) |
	| GanitLLM-4B_SFT_GRPO | 4B | SFT + GRPO | [Link](https://huggingface.co/dipta007/GanitLLM-4B_SFT_GRPO) |
	| GanitLLM-4B_CGRPO | 4B | CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-4B_CGRPO) |
	| GanitLLM-1.7B_CGRPO | 1.7B | CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-1.7B_CGRPO) |
	| GanitLLM-0.6B_CGRPO | 0.6B | CGRPO | [Link](https://huggingface.co/dipta007/GanitLLM-0.6B_CGRPO) |

	## Citation

	```bibtex
	will be updated
	```

	## License

	This model is released under the Apache 2.0 License.