Abhay557's picture
Add comprehensive README with Colab instructions
c7771e0 verified
|
Raw
History Blame Contribute Delete
5.12 kB
# 🧠 Mini Coding Agent - Fine-tuned Gemma-3-1B-IT
A small coding assistant (~1B parameters) built by fine-tuning **Gemma-3-1B-IT** on coding instruction datasets. Think of it as a tiny Claude Code you can run on a free Google Colab T4 GPU.
## Model Details
| Property | Value |
|---|---|
| **Base Model** | `google/gemma-3-1b-it` |
| **Parameters** | ~1B (actual ~1.3B) |
| **Training Method** | LoRA (Low-Rank Adaptation) + 4-bit Quantization |
| **Trainable Parameters** | ~1.5% of total |
| **Dataset** | `ise-uiuc/Magicoder-OSS-Instruct-75K` or `nvidia/OpenCodeInstruct` |
| **VRAM Usage** | ~6-10GB peak (fits on Colab T4) |
| **Training Time** | ~30-60 min for 50K samples, 2 epochs |
## Why These Choices?
- **Gemma-3-1B-IT**: The smallest official Gemma model. Already instruction-tuned, so it understands chat format.
- **LoRA**: Only trains adapter layers (~20M params), keeping VRAM low while still learning coding patterns.
- **4-bit (NF4) Quantization**: Cuts memory by ~4x with minimal quality loss.
- **Magicoder Dataset**: Proven recipe (arxiv:2312.02120) using real open-source code snippets as seeds β€” better than raw code pairs.
- **OpenCodeInstruct**: Higher quality synthetic data with unit tests (arxiv:2504.04030). Use a subset for Colab.
## Quick Start in Google Colab
### Step 1: Setup
```python
!pip install -q transformers trl peft datasets accelerate bitsandbytes huggingface_hub
```
### Step 2: Authenticate
```python
from huggingface_hub import notebook_login
notebook_login()
```
> **IMPORTANT**: Visit https://huggingface.co/google/gemma-3-1b-it and **ACCEPT the license** before training!
### Step 3: Change Runtime to GPU
Go to **Runtime > Change runtime type > T4 GPU**
### Step 4: Run Training
Download and run [`train_colab.py`](./train_colab.py):
```python
# In a Colab cell:
!wget https://huggingface.co/Abhay557/gemma-mini-code-agent/raw/main/train_colab.py
!python train_colab.py
```
Or copy-paste the contents of `train_colab.py` directly into a Colab cell.
### Step 5: Chat with your Agent
After training, use the built-in `chat_with_agent()` function from the script, or download [`inference.py`](./inference.py):
```python
!wget https://huggingface.co/Abhay557/gemma-mini-code-agent/raw/main/inference.py
!python inference.py
```
## Configurable Parameters
Edit these in `train_colab.py` before running:
| Param | Default | Description |
|---|---|---|
| `MAX_SAMPLES` | 50000 | Dataset subset size (reduce for faster runs) |
| `NUM_EPOCHS` | 2 | Training epochs |
| `LEARNING_RATE` | 5e-5 | LoRA learning rate |
| `LORA_R` | 16 | LoRA rank |
| `LORA_ALPHA` | 32 | LoRA scaling |
| `MAX_SEQ_LENGTH` | 1024 | Max tokens per sequence |
| `GRAD_ACCUM` | 16 | Gradient accumulation steps |
## Datasets
| Dataset | Size | Best For | Paper |
|---|---|---|---|
| [`ise-uiuc/Magicoder-OSS-Instruct-75K`](https://huggingface.co/datasets/ise-uiuc/Magicoder-OSS-Instruct-75K) | 75K | Quick experiments, proven recipe | [arxiv:2312.02120](https://arxiv.org/abs/2312.02120) |
| [`nvidia/OpenCodeInstruct`](https://huggingface.co/datasets/nvidia/OpenCodeInstruct) | 5M | Best quality, use subset for Colab | [arxiv:2504.04030](https://arxiv.org/abs/2504.04030) |
To switch datasets, change the `DATASET_NAME` variable in the script.
## Expected Results
This won't match Claude Code (that's ~100B+ params), but it can:
- βœ… Write small Python functions
- βœ… Explain algorithms
- βœ… Debug simple code
- βœ… Answer basic coding interview questions
Benchmarks on similar 1B models fine-tuned with these datasets:
- **HumanEval**: ~30-40% pass@1 (base model: ~10-15%)
- **MBPP**: ~35-45% pass@1
## Pushing to Hugging Face Hub
After training, uncomment these lines in the script:
```python
# merged_model.push_to_hub("YOUR_USERNAME/gemma-3-1b-code-agent")
# tokenizer.push_to_hub("YOUR_USERNAME/gemma-3-1b-code-agent")
```
## Troubleshooting
| Issue | Fix |
|---|---|
| OOM error | Reduce `MAX_SEQ_LENGTH` to 512 or `MAX_SAMPLES` to 10000 |
| Training too slow | Reduce `MAX_SAMPLES` to 10000, reduce `NUM_EPOCHS` to 1 |
| Gemma license error | Visit the model page and click "Accept" |
| `prepare_model_for_kbit_training` import error | Make sure `peft` is up to date: `!pip install -U peft` |
## Architecture
```
Base: google/gemma-3-1b-it (Gemma3ForCausalLM)
β”œβ”€β”€ 26 layers
β”œβ”€β”€ 1152 hidden size
β”œβ”€β”€ 4 attention heads
└── 262k vocab
+ LoRA adapters (r=16, alpha=32)
β”œβ”€β”€ q_proj, k_proj, v_proj, o_proj
β”œβ”€β”€ gate_proj, up_proj, down_proj
└── ~20M trainable params
+ 4-bit NF4 quantization
└── ~3.5GB model footprint
```
## License
- Base model: [Gemma License](https://ai.google.dev/gemma/terms)
- This fine-tune: MIT
- Datasets: Check respective dataset pages
## Citation
If you use this, cite the base papers:
```bibtex
@article{gemma3_2025,
title={Gemma 3 Technical Report},
author={Google DeepMind},
year={2025}
}
@article{magicoder_2024,
title={Magicoder: Source Code is All You Need},
author={Wei, Yuxiang and others},
journal={arXiv:2312.02120},
year={2024}
}
```