---
library_name: transformers
license: gemma
license_link: https://ai.google.dev/gemma/terms
pipeline_tag: text-generation
tags:
- math
- reasoning
- computational-graph
- bangla
- low-resource
- distractor-aware
- grpo
- reinforcement-learning
- small-model
base_model:
- google/gemma-3-4b-it
language:
- bn
- en
datasets:
- dipta007/dagger
- dipta007/DistractMath-Bn
---

# DAGGER-4B-GRPO

<a href="https://arxiv.org/abs/XXXX.XXXXX" target="_blank">
  <img alt="arXiv" src="https://img.shields.io/badge/arXiv-XXXX.XXXXX-b31b1b" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://github.com/dipta007/dagger" target="_blank">
  <img alt="GitHub" src="https://img.shields.io/badge/GitHub-Code-black" style="display: inline-block; vertical-align: middle;"/>
</a>

## Model Description

**DAGGER-4B-GRPO** is trained with GRPO (Group Relative Policy Optimization) directly from the base Gemma-3-4B-Instruct model, with no SFT initialization. This ablation model demonstrates how critical SFT initialization is for smaller models.

## Model Overview

| Attribute | Value |
|-----------|-------|
| Base Model | Gemma-3-4B-Instruct |
| Training | GRPO (from base, no SFT) |
| Parameters | 4B |
| LoRA Rank | 64 |

## Performance

Accuracy (%) with and without distractor augmentation:

| Dataset | Original | +Distractor |
|---------|----------|-------------|
| MGSM | 29.2 | 13.1 |
| MSVAMP | 57.1 | 29.3 |

### Critical Finding: SFT Initialization Effect

| Initialization | MGSM | MGSM (+D) | MSVAMP (+D) |
|----------------|------|-----------|-------------|
| Base → GRPO | 29.2 | 13.1 | 29.3 |
| **SFT → GRPO** | **54.8** | **31.4** | **42.9** |

**Key Insight**: At 4B scale, GRPO without SFT struggles to learn reliable graph generation. SFT provides essential scaffolding:
- **+25.6 points** on MGSM
- **+18.3 points** on MGSM (+Distractor)
- **+13.6 points** on MSVAMP (+Distractor)

This SFT-initialization effect is much more pronounced at 4B than in the 12B variants.

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "dipta007/dagger-4B_GRPO"

# Load the tokenizer and model (dtype and device chosen automatically)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# "Mina has 100 pens. Each pen costs 5 taka. What is the total cost of all the pens?"
question = "মিনার কাছে ১০০টি কলম আছে। প্রতিটি কলমের দাম ৫ টাকা। সব কলমের মোট দাম কত?"

messages = [
    {"role": "system", "content": "You are an expert Bangla Math Reasoner. Solve by constructing a Computational Graph."},
    {"role": "user", "content": question},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)
print(response)
```
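
The model is expected to emit its reasoning as a computational graph, and this GRPO-only ablation often produces malformed output (see Limitations), so it helps to parse defensively. Below is a minimal, hypothetical sketch: the exact output schema is defined in the GitHub repo, and the JSON-extraction heuristic here is an assumption, not part of the released code.

```python
import json
import re

def extract_graph(response: str):
    """Best-effort extraction of a JSON computational graph from model output.

    Assumes the graph is emitted as a JSON object somewhere in the response;
    returns None when no well-formed JSON is found (a common failure mode
    for this GRPO-only ablation).
    """
    # Greedy match from the first "{" to the last "}" in the response
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

graph = extract_graph(response)
if graph is None:
    print("No valid computational graph found in the response.")
```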

## Training Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | Gemma-3-4B-Instruct (no SFT) |
| LoRA Rank / Alpha | 64 / 128 |
| Global Batch Size | 32 |
| Generations per Prompt | 8 |
| Loss Type | BNPO |

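For reference, here is a minimal sketch of how these hyperparameters might map onto a TRL + PEFT setup. This is an illustrative assumption, not the project's actual training script (which lives in the GitHub repo): the reward function, the per-device/accumulation split of the global batch, and the dataset preprocessing are all placeholders.

```python
# Hypothetical sketch: mapping the table above onto TRL's GRPOTrainer.
# Everything not listed in the table is a placeholder assumption.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

peft_config = LoraConfig(r=64, lora_alpha=128, task_type="CAUSAL_LM")

args = GRPOConfig(
    output_dir="dagger-4b-grpo",
    num_generations=8,               # generations sampled per prompt (table)
    per_device_train_batch_size=8,   # assumed split yielding a global batch of 32
    gradient_accumulation_steps=1,   # placeholder
    loss_type="bnpo",                # BNPO variant of the GRPO loss (table)
)

def graph_reward(completions, **kwargs):
    # Placeholder: the real reward presumably scores the generated
    # computational graph (structural validity, final-answer correctness).
    return [0.0 for _ in completions]

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=graph_reward,
    args=args,
    # Assumes a dataset with a "prompt" column, e.g. the distractor-augmented set
    train_dataset=load_dataset("dipta007/DistractMath-Bn", split="train"),
    peft_config=peft_config,
)
trainer.train()
```
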
## When to Use This Model

- **Ablation studies**: Understanding the contribution of SFT initialization at small scale
- **Research**: Studying capacity requirements for GRPO-only training
- **NOT recommended for production**: Use dagger-4B_SFT_GRPO instead

## Limitations

- **Low accuracy**: Struggles to generate valid computational graphs
- **High failure rate**: Often produces malformed JSON or incorrect graph structures
- **Poor distractor handling**: Collapses to 13.1% on distractor-augmented MGSM

## Recommendation

For 4B models, always use SFT initialization before GRPO:
- [dagger-4B_SFT_GRPO](https://huggingface.co/dipta007/dagger-4B_SFT_GRPO) improves distractor-augmented MGSM by +18.3 points (13.1 → 31.4)

## Related Models

| Model | Training | MGSM (+D) |
|-------|----------|-----------|
| **dagger-4B_GRPO** (this model) | Base → GRPO | 13.1 |
| [dagger-4B_SFT](https://huggingface.co/dipta007/dagger-4B_SFT) | SFT | 25.1 |
| [dagger-4B_SFT_GRPO](https://huggingface.co/dipta007/dagger-4B_SFT_GRPO) | SFT → GRPO | **31.4** |


## Citation

```bibtex
will be updated
```