---
language:
- en
license: apache-2.0
library_name: peft
base_model: google/gemma-4-e4b-it
tags:
- gemma4
- unsloth
- lora
- qlora
- fine-tuning
- hackathon
- gemma-4-good-hackathon
- kaggle
datasets:
- mlabonne/FineTome-100k
pipeline_tag: text-generation
---

# Gemma 4 E4B Fine-Tuned with Unsloth QLoRA

**Competition:** [The Gemma 4 Good Hackathon](https://www.kaggle.com/competitions/gemma-4-good-hackathon) on Kaggle  
**Tracks:** Unsloth ($10K prize) + Impact Tracks  
**Framework:** [Unsloth](https://unsloth.ai) — 2x faster fine-tuning  
**Base Model:** [google/gemma-4-e4b-it](https://huggingface.co/google/gemma-4-e4b-it) (E4B: ~4B effective of 6.4B total params, instruction-tuned)

## Highlights

- **99.6% training loss reduction** — from 2.916 (baseline) to **0.0115** (final)
- **5 epochs** of QLoRA fine-tuning on 10,000 high-quality samples
- **Only 2.29% of parameters trained** (146.8M / 6.4B) via rank-stabilized LoRA
- **12 hours total training** on a single NVIDIA L4 GPU (24GB)
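
The headline ratios above follow directly from the figures quoted in this card; a quick sanity check:

```python
# Sanity-check the "2.29% of parameters trained" figure from the card:
# trainable LoRA adapter parameters over the total parameter count.
trainable = 146.8e6   # LoRA adapter parameters
total = 6.4e9         # total model parameters

fraction = trainable / total
print(f"{100 * fraction:.2f}% of parameters trained")  # prints "2.29% of parameters trained"
```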

## How to Use

### With Unsloth (Recommended)
```python
from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "bradduy/Any2AnyModels",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastModel.for_inference(model)

messages = [
    {"role": "user", "content": "Explain how renewable energy helps developing communities"}
]

inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True,
)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

### With Transformers + PEFT
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Passing load_in_4bit directly is deprecated; use a quantization config instead
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-e4b-it",
    device_map="auto",
    quantization_config=bnb_config,
)
model = PeftModel.from_pretrained(base_model, "bradduy/Any2AnyModels")
tokenizer = AutoTokenizer.from_pretrained("bradduy/Any2AnyModels")
```

## Training Details

### Method

We used **Unsloth's QLoRA** implementation with **rank-stabilized LoRA (RSLoRA)** for parameter-efficient fine-tuning. The most impactful finding was that **multi-epoch training sharply reduces training loss** with each additional pass over the data.

### Configuration

| Parameter | Value |
|-----------|-------|
| Base Model | `google/gemma-4-e4b-it` (4B params) |
| Quantization | 4-bit QLoRA via bitsandbytes |
| LoRA Rank | 64 |
| LoRA Alpha | 64 |
| RSLoRA | Enabled (rank-stabilized scaling) |
| Learning Rate | 7e-5 |
| LR Scheduler | Cosine |
| Epochs | 5 |
| Dataset Size | 10,000 samples |
| Effective Batch Size | 8 (1 × 8 grad accumulation) |
| Weight Decay | 0.01 |
| Warmup Steps | 50 |
| Total Steps | 6,250 |
| Max Seq Length | 2048 |
| Optimizer | AdamW 8-bit |
| Seed | 3407 |
| Response Masking | `train_on_responses_only` enabled |
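
The step count in the table follows from the other hyperparameters; a quick check:

```python
# Derive the total optimizer steps from the hyperparameters in the table.
import math

samples = 10_000
per_device_batch = 1
grad_accum = 8
epochs = 5

effective_batch = per_device_batch * grad_accum          # 8
steps_per_epoch = math.ceil(samples / effective_batch)   # 1,250
total_steps = steps_per_epoch * epochs
print(total_steps)  # prints 6250
```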

### Dataset

- **Source:** [mlabonne/FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k)
- **Samples Used:** 10,000 (first 10k)
- **Format:** Multi-turn chat conversations
- **Chat Template:** Gemma 4 native (`role: "model"`, not `"assistant"`)
- **Masking:** Only model responses contribute to loss (instruction tokens masked)
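
The `"model"`-not-`"assistant"` role convention above amounts to a simple mapping. A minimal sketch, assuming FineTome's ShareGPT-style `conversations` field (`{"from": "human"/"gpt", "value": ...}` turns; field names are an assumption, not confirmed by this card):

```python
# Hypothetical mapping from ShareGPT-style turns to Gemma-style messages,
# where the assistant role is named "model" rather than "assistant".
ROLE_MAP = {"human": "user", "gpt": "model", "system": "system"}

def to_gemma_messages(conversations):
    """Convert ShareGPT-style turns to Gemma-style role/content messages."""
    return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conversations]

sample = [
    {"from": "human", "value": "What is QLoRA?"},
    {"from": "gpt", "value": "QLoRA fine-tunes a 4-bit quantized base model with LoRA adapters."},
]
print(to_gemma_messages(sample))
```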

### Hardware

- **GPU:** NVIDIA L4 (24GB VRAM)
- **RAM:** 32GB
- **Training Time:** ~12 hours (with checkpoint resume)
- **GPU Memory Used:** ~14.8GB during training

## Experiment Journey

We ran **8 systematic experiments** to find the optimal configuration:

| Exp | LoRA r | Epochs | Samples | LR | Train Loss | Key Finding |
|-----|--------|--------|---------|-----|-----------|-------------|
| 01 | 16 | 0.13 | 3k | 2e-4 | 2.916 | Baseline |
| 02 | 32 | 0.24 | 5k | 2e-4 | 1.725 | Higher rank helps (+41%) |
| 03 | 64+RSLoRA | 0.20 | 10k | 2e-4 | 1.460 | RSLoRA + more data (+50%) |
| 04 | 64+RSLoRA | 0.40 | 20k | 1e-4 | ~1.05 | Lower LR improves convergence |
| 05 | 128+RSLoRA | 0.40 | 20k | 5e-5 | 1.134 | r=128 slower than r=64 |
| 06 | 64+RSLoRA | 3 | 10k | 1e-4 | ~0.30 | **Multi-epoch is transformative** |
| 07 | 128+RSLoRA | 3 | 10k | 1e-4 | ~0.59 | r=64 > r=128 for multi-epoch |
| **08** | **64+RSLoRA** | **5** | **10k** | **7e-5** | **0.0115** | **5 epochs = 99.6% reduction** |
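
The percentage gains in the table are relative loss reductions versus the experiment 01 baseline, rounded to the nearest percent:

```python
# Relative training-loss reduction vs. the experiment 01 baseline (2.916),
# using the loss values from the table above.
baseline = 2.916
losses = {"exp02": 1.725, "exp03": 1.460, "exp08": 0.0115}

for name, loss in losses.items():
    reduction = 100 * (1 - loss / baseline)
    print(f"{name}: {reduction:.1f}% below baseline")
```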

### The Multi-Epoch Discovery

The single most impactful finding: **each additional epoch delivers a dramatic, consistent loss reduction:**

```
Epoch 1: loss ~0.90  (learning the patterns)
Epoch 2: loss ~0.60  (reinforcing knowledge)
Epoch 3: loss ~0.30  (deep memorization)
Epoch 4: loss ~0.10  (fine polishing)
Epoch 5: loss ~0.01  (near-perfect fitting)
```

This pattern was consistent across experiments 06, 07, and 08. The loss drops happen at each epoch boundary as the model sees the training data again.

### Other Key Insights

1. **r=64 with RSLoRA is the sweet spot** — r=128 converges slower and provides no benefit in multi-epoch settings
2. **Lower LR (7e-5) stabilizes long training** — higher LR (2e-4) causes instability after epoch 2
3. **`train_on_responses_only` is essential** — masks user/system tokens so the model only learns from responses
4. **Checkpoint saving every 250 steps** — long CUDA runs crash from memory fragmentation; resume from checkpoints solved this
5. **10k high-quality samples > 20k samples** for multi-epoch — quality over quantity when doing multiple passes
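
The resume logic behind insight 4 can be sketched as follows, assuming the trainer writes checkpoints under `output_dir` as `checkpoint-<step>` (the default Hugging Face Trainer layout; the helper name is ours, not part of any library):

```python
# Find the highest-step checkpoint so training can resume after a crash.
import os

def latest_checkpoint(output_dir: str):
    """Return the path of the highest-step checkpoint, or None if absent."""
    ckpts = [d for d in os.listdir(output_dir) if d.startswith("checkpoint-")]
    if not ckpts:
        return None
    # Compare by step number, not lexicographically (checkpoint-1000 > checkpoint-500)
    newest = max(ckpts, key=lambda d: int(d.rsplit("-", 1)[-1]))
    return os.path.join(output_dir, newest)

# After a crash, resume instead of restarting from step 0:
# trainer.train(resume_from_checkpoint=latest_checkpoint("outputs"))
```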

## Training Pipeline

Built entirely with [Unsloth](https://unsloth.ai):

```python
from unsloth import FastModel
from trl import SFTTrainer, SFTConfig
from unsloth.chat_templates import get_chat_template, train_on_responses_only

# 1. Load 4-bit quantized model
model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
    max_seq_length=2048, load_in_4bit=True,
)

# 2. Apply LoRA adapters (r=64, RSLoRA)
model = FastModel.get_peft_model(model,
    finetune_vision_layers=False, finetune_language_layers=True,
    finetune_attention_modules=True, finetune_mlp_modules=True,
    r=64, lora_alpha=64, lora_dropout=0, bias="none",
    random_state=3407, use_rslora=True,
)

# 3. Setup Gemma 4 chat template
tokenizer = get_chat_template(tokenizer, chat_template="gemma-4")

# 4. Train with response-only masking
trainer = SFTTrainer(model=model, tokenizer=tokenizer, train_dataset=dataset,
    args=SFTConfig(
        per_device_train_batch_size=1, gradient_accumulation_steps=8,
        learning_rate=7e-5, num_train_epochs=5, lr_scheduler_type="cosine",
        warmup_steps=50, weight_decay=0.01, optim="adamw_8bit",
        save_strategy="steps", save_steps=250, save_total_limit=3,
    ),
)
trainer = train_on_responses_only(trainer,
    instruction_part="<start_of_turn>user\n", response_part="<start_of_turn>model\n",
)
trainer.train()
```

## Reproduce Training

```bash
git clone https://github.com/bradduy/Any2AnyModels
cd Any2AnyModels
pip install unsloth

python scripts/train.py \
  --model unsloth/gemma-4-E4B-it-unsloth-bnb-4bit \
  --load-4bit --lora-rank 64 --use-rslora \
  --dataset mlabonne/FineTome-100k --max-samples 10000 \
  --num-epochs 5 --learning-rate 7e-5 --grad-accum 8 \
  --weight-decay 0.01 --warmup-steps 50 --scheduler cosine \
  --save-steps 250 --save-total-limit 3
```

## Limitations

- Fine-tuned on English-only data (FineTome-100k)
- Optimized for instruction following, not domain-specific tasks
- 4B parameter model — larger models (26B, 31B) would perform better but require more VRAM
- Training loss ≠ downstream task performance; the model should be evaluated on specific benchmarks

## Acknowledgments

- **Google DeepMind** for the [Gemma 4](https://blog.google/technology/developers/gemma-4/) model family
- **[Unsloth](https://unsloth.ai)** for making QLoRA fine-tuning 2x faster and memory efficient
- **[Kaggle](https://www.kaggle.com)** for hosting the Gemma 4 Good Hackathon
- **[mlabonne](https://huggingface.co/mlabonne)** for the FineTome-100k dataset

## License

Apache 2.0 (same as Gemma 4)