	---
	language:
	- en
	license: apache-2.0
	library_name: peft
	base_model: google/gemma-4-e4b-it
	tags:
	- gemma4
	- unsloth
	- lora
	- qlora
	- fine-tuning
	- hackathon
	- gemma-4-good-hackathon
	- kaggle
	datasets:
	- mlabonne/FineTome-100k
	pipeline_tag: text-generation
	---

	# Gemma 4 E4B Fine-Tuned with Unsloth QLoRA

	Competition: [The Gemma 4 Good Hackathon](https://www.kaggle.com/competitions/gemma-4-good-hackathon) on Kaggle
	Tracks: Unsloth ($10K prize) + Impact Tracks
	Framework: [Unsloth](https://unsloth.ai) — 2x faster fine-tuning
	Base Model: [google/gemma-4-e4b-it](https://huggingface.co/google/gemma-4-e4b-it) (4B params, instruction-tuned)

	## Highlights

	- 99.6% lower training loss: from 2.916 (Exp 01 baseline) to 0.0115 (Exp 08 final)
	- 5 epochs of QLoRA fine-tuning on 10,000 high-quality samples
	- Only 2.29% of parameters trained (146.8M / 6.4B) via rank-stabilized LoRA
	- About 12 hours of total training on a single NVIDIA L4 GPU (24GB)
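
The two headline percentages follow directly from numbers reported elsewhere in this card. A quick sanity check using only those values (the variable names are illustrative):

```python
# Values copied from this model card.
baseline_loss = 2.916        # Exp 01 training loss
final_loss = 0.0115          # Exp 08 training loss
trainable_params = 146.8e6   # LoRA parameters that receive gradients
total_params = 6.4e9         # total parameter count reported in the card

loss_reduction = (baseline_loss - final_loss) / baseline_loss
trainable_fraction = trainable_params / total_params

print(f"loss reduction: {loss_reduction:.1%}")
print(f"trainable fraction: {trainable_fraction:.2%}")
```

Note that the baseline and final losses come from different runs (Exp 01 and Exp 08), so the reduction compares configurations rather than the start and end of a single run.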

	## How to Use

	### With Unsloth (Recommended)
	```python
	from unsloth import FastModel

	model, tokenizer = FastModel.from_pretrained(
	    "bradduy/Any2AnyModels",
	    max_seq_length=2048,
	    load_in_4bit=True,
	)
	FastModel.for_inference(model)

	messages = [
	    {"role": "user", "content": "Explain how renewable energy helps developing communities"}
	]

	inputs = tokenizer.apply_chat_template(
	    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
	).to("cuda")

	outputs = model.generate(
	    input_ids=inputs,
	    max_new_tokens=512,
	    temperature=0.7,
	    do_sample=True,
	)
	print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
	```

	### With Transformers + PEFT
	```python
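	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
	from peft import PeftModel

	# Load the base model in 4-bit. Passing load_in_4bit directly to
	# from_pretrained is deprecated in favor of a quantization config.
	base_model = AutoModelForCausalLM.from_pretrained(
	    "google/gemma-4-e4b-it",
	    device_map="auto",
	    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
	)
	# Attach the LoRA adapter weights from this repository.
	model = PeftModel.from_pretrained(base_model, "bradduy/Any2AnyModels")
	tokenizer = AutoTokenizer.from_pretrained("bradduy/Any2AnyModels")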
	```

	## Training Details

	### Method

	We used Unsloth's QLoRA implementation with rank-stabilized LoRA (RSLoRA) for parameter-efficient fine-tuning. The main empirical finding was that training loss fell sharply with each additional pass over the data.
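
To make the RSLoRA choice concrete: standard LoRA scales the adapter update by `lora_alpha / r`, while rank-stabilized LoRA scales it by `lora_alpha / sqrt(r)`. The sketch below only evaluates those two factors for the rank and alpha used here (`r=64`, `lora_alpha=64`). The helper name `lora_scaling` is illustrative and is not part of Unsloth or PEFT.

```python
import math

def lora_scaling(r: int, lora_alpha: float, use_rslora: bool) -> float:
    """Return the multiplier applied to the low-rank update B @ A."""
    if use_rslora:
        return lora_alpha / math.sqrt(r)  # rank-stabilized scaling
    return lora_alpha / r                 # standard LoRA scaling

standard = lora_scaling(r=64, lora_alpha=64, use_rslora=False)
stabilized = lora_scaling(r=64, lora_alpha=64, use_rslora=True)
print(standard, stabilized)
```

With these settings the RSLoRA multiplier is 8 rather than 1, so the adapter update is weighted more heavily than under standard scaling at the same rank and alpha.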

	### Configuration

	| Parameter | Value |
	|-----------|-------|
	| Base Model | `google/gemma-4-e4b-it` (4B params) |
	| Quantization | 4-bit QLoRA via bitsandbytes |
	| LoRA Rank | 64 |
	| LoRA Alpha | 64 |
	| RSLoRA | Enabled (rank-stabilized scaling) |
	| Learning Rate | 7e-5 |
	| LR Scheduler | Cosine |
	| Epochs | 5 |
	| Dataset Size | 10,000 samples |
	| Effective Batch Size | 8 (1 × 8 grad accumulation) |
	| Weight Decay | 0.01 |
	| Warmup Steps | 50 |
	| Total Steps | 6,250 |
	| Max Seq Length | 2048 |
	| Optimizer | AdamW 8-bit |
	| Seed | 3407 |
	| Response Masking | `train_on_responses_only` enabled |
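
The step count and learning-rate shape implied by this table can be reconstructed in a few lines. This is a sketch under assumptions: a single GPU, no dropped partial batches, and the usual linear-warmup-then-cosine-decay form that decays to zero at the final step. `lr_at` is a hypothetical helper, not the trainer's own code.

```python
import math

samples, epochs = 10_000, 5
per_device_batch, grad_accum = 1, 8
peak_lr, warmup_steps = 7e-5, 50

effective_batch = per_device_batch * grad_accum
steps_per_epoch = samples // effective_batch
total_steps = steps_per_epoch * epochs

def lr_at(step: int) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, steps_per_epoch, total_steps)
for step in (0, 50, 3150, 6250):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions, 10,000 samples at an effective batch size of 8 give 1,250 optimizer steps per epoch, which matches the 6,250 total steps listed for 5 epochs.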

	### Dataset

	- Source: [mlabonne/FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k)
	- Samples Used: 10,000 (first 10k)
	- Format: Multi-turn chat conversations
	- Chat Template: Gemma 4 native (`role: "model"`, not `"assistant"`)
	- Masking: Only model responses contribute to loss (instruction tokens masked)
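
As an illustration of the role naming above, the sketch below converts one ShareGPT-style conversation into role/content messages that use `"model"` for the responder. It assumes the dataset stores turns as `{"from": ..., "value": ...}` with `human`/`gpt`/`system` speakers. The `to_gemma_messages` helper and the role mapping are illustrative and are not taken from the training script.

```python
# Assumed mapping from ShareGPT speaker tags to Gemma-style roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "model"}

def to_gemma_messages(conversation: list[dict]) -> list[dict]:
    """Convert ShareGPT-style turns into role/content chat messages."""
    messages = []
    for turn in conversation:
        speaker = turn["from"]
        if speaker not in ROLE_MAP:
            raise ValueError(f"unexpected speaker tag: {speaker!r}")
        messages.append({"role": ROLE_MAP[speaker], "content": turn["value"]})
    return messages

example = [
    {"from": "human", "value": "What is QLoRA?"},
    {"from": "gpt", "value": "A way to fine-tune a 4-bit quantized model with LoRA adapters."},
]
messages = to_gemma_messages(example)
print(messages)
```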

	### Hardware

	- GPU: NVIDIA L4 (24GB VRAM)
	- RAM: 32GB
	- Training Time: ~12 hours (with checkpoint resume)
	- GPU Memory Used: ~14.8GB during training

	## Experiment Journey

	We ran 8 systematic experiments to find the optimal configuration:

	| Exp | LoRA r | Epochs | Samples | LR | Train Loss | Key Finding |
	|-----|--------|--------|---------|-----|-----------|-------------|
	| 01 | 16 | 0.13 | 3k | 2e-4 | 2.916 | Baseline |
	| 02 | 32 | 0.24 | 5k | 2e-4 | 1.725 | Higher rank helps (41% lower loss than Exp 01) |
	| 03 | 64+RSLoRA | 0.20 | 10k | 2e-4 | 1.460 | RSLoRA + more data (50% lower loss than Exp 01) |
	| 04 | 64+RSLoRA | 0.40 | 20k | 1e-4 | ~1.05 | Lower LR improves convergence |
	| 05 | 128+RSLoRA | 0.40 | 20k | 5e-5 | 1.134 | r=128 slower than r=64 |
	| 06 | 64+RSLoRA | 3 | 10k | 1e-4 | ~0.30 | Multi-epoch is transformative |
	| 07 | 128+RSLoRA | 3 | 10k | 1e-4 | ~0.59 | r=64 > r=128 for multi-epoch |
	| 08 | 64+RSLoRA | 5 | 10k | 7e-5 | 0.0115 | 5 epochs = 99.6% reduction |

	### The Multi-Epoch Discovery

	The single most impactful finding was that each additional epoch delivered a large, consistent drop in training loss:

	```
	Epoch 1: loss ~0.90 (learning the patterns)
	Epoch 2: loss ~0.60 (reinforcing knowledge)
	Epoch 3: loss ~0.30 (deep memorization)
	Epoch 4: loss ~0.10 (fine polishing)
	Epoch 5: loss ~0.01 (near-perfect fitting)
	```

	The same stepwise pattern appeared in experiments 06, 07, and 08: the loss drops at each epoch boundary, when the model starts seeing the same training examples again.

	### Other Key Insights

	1. r=64 with RSLoRA is the sweet spot: r=128 converges more slowly and provides no benefit in multi-epoch settings
	2. A lower LR (7e-5) stabilizes long training: a higher LR (2e-4) caused instability after epoch 2
	3. `train_on_responses_only` is essential: it masks user/system tokens so the model only learns from responses
	4. Saving a checkpoint every 250 steps is worth it: long CUDA runs crashed from memory fragmentation, and resuming from the latest checkpoint worked around this
	5. 10k high-quality samples beat 20k samples for multi-epoch training: quality over quantity when doing multiple passes
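
Insight 4 relies on resuming from the most recent checkpoint after a crash. A minimal sketch of locating it, assuming the trainer writes `checkpoint-<step>` directories into an output folder (the Hugging Face `Trainer` naming convention). `latest_checkpoint` and the `outputs` path are hypothetical names used only for illustration.

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = pattern.match(child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), str(child)
    return best_path

resume_path = latest_checkpoint("outputs")
# With a configured trainer, training would then continue via:
# trainer.train(resume_from_checkpoint=resume_path)
print(resume_path)
```

The step numbers are compared as integers because a plain string sort would rank `checkpoint-500` above `checkpoint-1000`.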

	## Training Pipeline

	Built entirely with [Unsloth](https://unsloth.ai):

	```python
	from unsloth import FastModel
	from trl import SFTTrainer, SFTConfig
	from unsloth.chat_templates import get_chat_template, train_on_responses_only

	# 1. Load 4-bit quantized model
	model, tokenizer = FastModel.from_pretrained(
	    "unsloth/gemma-4-E4B-it-unsloth-bnb-4bit",
	    max_seq_length=2048, load_in_4bit=True,
	)

	# 2. Apply LoRA adapters (r=64, RSLoRA)
	model = FastModel.get_peft_model(
	    model,
	    finetune_vision_layers=False, finetune_language_layers=True,
	    finetune_attention_modules=True, finetune_mlp_modules=True,
	    r=64, lora_alpha=64, lora_dropout=0, bias="none",
	    random_state=3407, use_rslora=True,
	)

	# 3. Setup Gemma 4 chat template
	tokenizer = get_chat_template(tokenizer, chat_template="gemma-4")

	# 4. Train with response-only masking
	# `dataset` is the first 10k FineTome-100k samples rendered with the chat
	# template above; its preparation is not shown in this snippet.
	trainer = SFTTrainer(
	    model=model, tokenizer=tokenizer, train_dataset=dataset,
	    args=SFTConfig(
	        per_device_train_batch_size=1, gradient_accumulation_steps=8,
	        learning_rate=7e-5, num_train_epochs=5, lr_scheduler_type="cosine",
	        warmup_steps=50, weight_decay=0.01, optim="adamw_8bit",
	        save_strategy="steps", save_steps=250, save_total_limit=3,
	    ),
	)
	# Mask everything except the model turns so only responses contribute to the loss.
	trainer = train_on_responses_only(
	    trainer,
	    instruction_part="<|turn>user\n", response_part="<|turn>model\n",
	)
	trainer.train()
	```
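
To illustrate what response-only masking does to the training labels, here is a simplified sketch. It assumes the common convention that label `-100` is ignored by the cross-entropy loss, and it works on toy integer token IDs with made-up marker sequences. `mask_non_response` is a hypothetical helper and not Unsloth's `train_on_responses_only` implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss

def find_all(tokens, marker):
    """Return the start index of every occurrence of marker in tokens."""
    n = len(marker)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == marker]

def mask_non_response(input_ids, instruction_marker, response_marker):
    """Build labels that keep only the tokens inside response turns."""
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_marker)
    for start in find_all(input_ids, response_marker):
        begin = start + len(response_marker)  # first token after the marker
        # The response runs until the next instruction turn or the end.
        later = [i for i in instruction_starts if i >= begin]
        end = later[0] if later else len(input_ids)
        labels[begin:end] = input_ids[begin:end]
    return labels

# Toy IDs: [1, 2] marks a user turn, [1, 3] marks a model turn.
ids = [1, 2, 10, 11, 1, 3, 20, 21, 1, 2, 12, 1, 3, 22]
labels = mask_non_response(ids, [1, 2], [1, 3])
print(labels)
```

Only the tokens that follow a model-turn marker keep their IDs as labels. The user tokens and the turn markers themselves are set to `-100`, so they contribute nothing to the loss.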

	## Reproduce Training

	```bash
	git clone https://github.com/bradduy/Any2AnyModels
	cd Any2AnyModels
	pip install unsloth

	python scripts/train.py \
	    --model unsloth/gemma-4-E4B-it-unsloth-bnb-4bit \
	    --load-4bit --lora-rank 64 --use-rslora \
	    --dataset mlabonne/FineTome-100k --max-samples 10000 \
	    --num-epochs 5 --learning-rate 7e-5 --grad-accum 8 \
	    --weight-decay 0.01 --warmup-steps 50 --scheduler cosine \
	    --save-steps 250 --save-total-limit 3
	```

	## Limitations

	- Fine-tuned on English-only data (FineTome-100k)
	- Optimized for instruction following, not domain-specific tasks
	- 4B parameter model — larger models (26B, 31B) would perform better but require more VRAM
	- Training loss ≠ downstream task performance; the model should be evaluated on specific benchmarks

	## Acknowledgments

	- Google DeepMind for the [Gemma 4](https://blog.google/technology/developers/gemma-4/) model family
	- [Unsloth](https://unsloth.ai) for making QLoRA fine-tuning 2x faster and memory efficient
	- [Kaggle](https://www.kaggle.com) for hosting the Gemma 4 Good Hackathon
	- [mlabonne](https://huggingface.co/mlabonne) for the FineTome-100k dataset

	## License

	Apache 2.0 (same as Gemma 4)