Add comprehensive README with Colab instructions

c7771e0 verified about 2 months ago

5.12 kB

	# 🧠 Mini Coding Agent - Fine-tuned Gemma-3-1B-IT

	A small coding assistant (~1B parameters) built by fine-tuning Gemma-3-1B-IT on coding instruction datasets. Think of it as a tiny Claude Code you can run on a free Google Colab T4 GPU.

	## Model Details

	\| Property \| Value \|
	\|---\|---\|
	\| Base Model \| `google/gemma-3-1b-it` \|
	\| Parameters \| ~1B (actual ~1.3B) \|
	\| Training Method \| LoRA (Low-Rank Adaptation) + 4-bit Quantization \|
	\| Trainable Parameters \| ~1.5% of total \|
	\| Dataset \| `ise-uiuc/Magicoder-OSS-Instruct-75K` or `nvidia/OpenCodeInstruct` \|
	\| VRAM Usage \| ~6-10GB peak (fits on Colab T4) \|
	\| Training Time \| ~30-60 min for 50K samples, 2 epochs \|

	## Why These Choices?

	- Gemma-3-1B-IT: The smallest official Gemma model. Already instruction-tuned, so it understands chat format.
	- LoRA: Only trains adapter layers (~20M params), keeping VRAM low while still learning coding patterns.
	- 4-bit (NF4) Quantization: Cuts memory by ~4x with minimal quality loss.
	- Magicoder Dataset: Proven recipe (arxiv:2312.02120) using real open-source code snippets as seeds — better than raw code pairs.
	- OpenCodeInstruct: Higher quality synthetic data with unit tests (arxiv:2504.04030). Use a subset for Colab.

	## Quick Start in Google Colab

	### Step 1: Setup

	```python
	!pip install -q transformers trl peft datasets accelerate bitsandbytes huggingface_hub
	```

	### Step 2: Authenticate

	```python
	from huggingface_hub import notebook_login
	notebook_login()
	```

	> IMPORTANT: Visit https://huggingface.co/google/gemma-3-1b-it and ACCEPT the license before training!

	### Step 3: Change Runtime to GPU

	Go to Runtime > Change runtime type > T4 GPU

	### Step 4: Run Training

	Download and run [`train_colab.py`](./train_colab.py):

	```python
	# In a Colab cell:
	!wget https://huggingface.co/Abhay557/gemma-mini-code-agent/raw/main/train_colab.py
	!python train_colab.py
	```

	Or copy-paste the contents of `train_colab.py` directly into a Colab cell.

	### Step 5: Chat with your Agent

	After training, use the built-in `chat_with_agent()` function from the script, or download [`inference.py`](./inference.py):

	```python
	!wget https://huggingface.co/Abhay557/gemma-mini-code-agent/raw/main/inference.py
	!python inference.py
	```

	## Configurable Parameters

	Edit these in `train_colab.py` before running:

	\| Param \| Default \| Description \|
	\|---\|---\|---\|
	\| `MAX_SAMPLES` \| 50000 \| Dataset subset size (reduce for faster runs) \|
	\| `NUM_EPOCHS` \| 2 \| Training epochs \|
	\| `LEARNING_RATE` \| 5e-5 \| LoRA learning rate \|
	\| `LORA_R` \| 16 \| LoRA rank \|
	\| `LORA_ALPHA` \| 32 \| LoRA scaling \|
	\| `MAX_SEQ_LENGTH` \| 1024 \| Max tokens per sequence \|
	\| `GRAD_ACCUM` \| 16 \| Gradient accumulation steps \|

	## Datasets

	\| Dataset \| Size \| Best For \| Paper \|
	\|---\|---\|---\|---\|
	\| [`ise-uiuc/Magicoder-OSS-Instruct-75K`](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K) \| 75K \| Quick experiments, proven recipe \| [arxiv:2312.02120](https://arxiv.org/abs/2312.02120) \|
	\| [`nvidia/OpenCodeInstruct`](https://huggingface.co/datasets/nvidia/OpenCodeInstruct) \| 5M \| Best quality, use subset for Colab \| [arxiv:2504.04030](https://arxiv.org/abs/2504.04030) \|

	To switch datasets, change the `DATASET_NAME` variable in the script.

	## Expected Results

	This won't match Claude Code (that's ~100B+ params), but it can:
	- ✅ Write small Python functions
	- ✅ Explain algorithms
	- ✅ Debug simple code
	- ✅ Answer basic coding interview questions

	Benchmarks on similar 1B models fine-tuned with these datasets:
	- HumanEval: ~30-40% pass@1 (base model: ~10-15%)
	- MBPP: ~35-45% pass@1

	## Pushing to Hugging Face Hub

	After training, uncomment these lines in the script:

	```python
	# merged_model.push_to_hub("YOUR_USERNAME/gemma-3-1b-code-agent")
	# tokenizer.push_to_hub("YOUR_USERNAME/gemma-3-1b-code-agent")
	```

	## Troubleshooting

	\| Issue \| Fix \|
	\|---\|---\|
	\| OOM error \| Reduce `MAX_SEQ_LENGTH` to 512 or `MAX_SAMPLES` to 10000 \|
	\| Training too slow \| Reduce `MAX_SAMPLES` to 10000, reduce `NUM_EPOCHS` to 1 \|
	\| Gemma license error \| Visit the model page and click "Accept" \|
	\| `prepare_model_for_kbit_training` import error \| Make sure `peft` is up to date: `!pip install -U peft` \|

	## Architecture

	```
	Base: google/gemma-3-1b-it (Gemma3ForCausalLM)
	├── 26 layers
	├── 1152 hidden size
	├── 4 attention heads
	└── 262k vocab

	+ LoRA adapters (r=16, alpha=32)
	├── q_proj, k_proj, v_proj, o_proj
	├── gate_proj, up_proj, down_proj
	└── ~20M trainable params

	+ 4-bit NF4 quantization
	└── ~3.5GB model footprint
	```

	## License

	- Base model: [Gemma License](https://ai.google.dev/gemma/terms)
	- This fine-tune: MIT
	- Datasets: Check respective dataset pages

	## Citation

	If you use this, cite the base papers:

	```bibtex
	@article{gemma3_2025,
	title={Gemma 3 Technical Report},
	author={Google DeepMind},
	year={2025}
	}

	@article{magicoder_2024,
	title={Magicoder: Source Code is All You Need},
	author={Wei, Yuxiang and others},
	journal={arXiv:2312.02120},
	year={2024}
	}
	```