Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use my-ai-stack/Stack-2-9-finetuned with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use my-ai-stack/Stack-2-9-finetuned with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "my-ai-stack/Stack-2-9-finetuned"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/my-ai-stack/Stack-2-9-finetuned

SGLang

How to use my-ai-stack/Stack-2-9-finetuned with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "my-ai-stack/Stack-2-9-finetuned" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "my-ai-stack/Stack-2-9-finetuned" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
```
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
```

Stack-2-9-finetuned

File size: 8,360 Bytes

49ffe54

# Google Colab Training Guide for Stack 2.9

This guide walks through training Stack 2.9 Pattern Memory LoRA adapters using **free Google Colab** T4 GPUs.

---

## ⚡ Quick Start (3-5 hours)

1. **Open Colab**: https://colab.research.google.com/
2. **Upload** `colab_train_stack29.ipynb`
3. **Runtime → Change runtime type → GPU (T4)**
4. **Run all cells sequentially**

That's it! The notebook handles everything.

---

## 📋 Prerequisites

- Google account (for Colab)
- Basic understanding of notebook execution
- (Optional) Google Drive for persistent storage

---

## 🎯 What This Covers

1. **Setting up the environment** on Colab
2. **Mounting Google Drive** to keep your data between sessions
3. **Installing dependencies** (PyTorch, Transformers, PEFT, etc.)
4. **Preparing training data** (either full or mini dataset)
5. **Training LoRA adapter** on Qwen2.5-Coder-7B (or 32B if you have A100)
6. **Merging adapter** with base model
7. **Testing inference** with the trained model
8. **Exporting to Hugging Face Hub** (optional)

---

## ⏱️ Estimated Timings (T4 GPU)

| Step | Duration |
|------|----------|
| Environment setup | 5-10 min |
| Data preparation | 2-5 min (using mini dataset) / 30-60 min (full dataset) |
| Training (2 epochs, 7B) | 3-5 hours |
| Adapter merging | 2-3 min |
| Inference testing | 1-2 min |
| **Total** | **~4-6 hours** |

**Note:** Colab free tier has ~12 hour runtime limit. Training fits within this.

---

## 💾 Storage Strategy

### Option A: Google Drive (Recommended for persistence)

```python
from google.colab import drive
drive.mount('/content/drive')
# Data stored in /content/drive/MyDrive/stack-2.9/
```

**Pros:** Data persists after runtime disconnect, no re-upload needed.

### Option B: Local Colab storage (ephemeral)

```bash
# Data stored in /content/stack-2.9/
# Lost when runtime disconnects (~12 hours max)
```

**Use for:** Quick experiments, one-off training runs.

---

## 🧠 Memory Optimization for T4 (15GB VRAM)

The provided `train_config_colab.yaml` is tuned specifically for T4:

- **Base model**: `Qwen/Qwen2.5-Coder-7B` (4-bit ≈ 4.5GB)
- **Context length**: 8192 (instead of 131072)
- **Batch size**: 1 (with gradient accumulation 16)
- **LoRA rank**: 16 (instead of 64)
- **4-bit quantization**: `load_in_4bit=True`
- **8-bit optimizer**: `paged_adamw_8bit`
- **Gradient checkpointing**: Enabled
- **BF16 precision**: Enabled

**Total expected VRAM usage**: ~10-12GB (leaves headroom)

---

## 🛠️ Step-by-Step Instructions

### 1. Notebook Setup

Open `colab_train_stack29.ipynb` in Colab. It contains pre-filled cells with:

- Dependency installation
- Drive mounting (optional)
- Clone repo / upload data
- Copy training config
- Run training
- Merge adapter
- Test inference

### 2. Install Dependencies

The notebook installs:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.0 peft==0.10.0 accelerate bitsandbytes==0.43.0 datasets pyyaml
```

Takes ~5 minutes.

### 3. Prepare Training Data

**For quick prototyping** (recommended first run):

```bash
python scripts/create_mini_dataset.py --size 5000 --output data_mini/train_mini.jsonl
```

This creates a 5K stratified sample in ~30 seconds.

**For full training:**

Download your existing `training-data/final/train.jsonl` to Colab (upload to Drive or local).

### 4. Prepare Configuration

Copy the Colab-optimized config:

```bash
cp stack_2_9_training/train_config_colab.yaml stack_2_9_training/train_config.yaml
```

Or edit `train_config.yaml` directly to match the Colab settings.

### 5. Run Training

```bash
cd stack-2.9-training
python -m stack_2_9_training.train_lora --config train_config.yaml
```

**Monitor progress:**

- Watch `nvidia-smi` in a separate cell: `!nvidia-smi --loop=5`
- Training logs show loss per step
- Checkpoints saved every 500 steps to `./adapters/`

**Expected output:**
```
Train loss: 1.234
Step 100/2000 - loss 1.234
...
Training completed. Model saved to ./adapters/
```

### 6. Merge Adapter

After training finishes:

```bash
python -m stack_2_9_training.merge_adapter --base-model Qwen/Qwen2.5-Coder-7B
```

Output: `./model_final/` with full model + tokenizer.

Takes 2-3 minutes.

### 7. Test Inference

Quick test:

```python
from stack_2_9_eval.model_client import create_model_client

# Point to your merged model
client = create_model_client(
    provider="ollama",  # or use direct HF pipeline
    model="./model_final"
)

result = client.generate("Write a Python function to reverse a string")
print(result.text)
```

For production use, serve via vLLM or Hugging Face TGI.

---

## 🚨 Troubleshooting OOM (Out of Memory)

If you get CUDA OOM errors, try these fixes **in order**:

### 1. Reduce sequence length
Edit `train_config_colab.yaml`:
```yaml
training:
  max_seq_length: 4096  # instead of 8192
```

### 2. Reduce batch size further
```yaml
training:
  per_device_train_batch_size: 1  # already 1
  gradient_accumulation_steps: 32  # increase to 32 (slower but less memory)
```

### 3. Disable gradient checkpointing (memory vs speed trade-off)
```yaml
training:
  gradient_checkpointing: false  # uses more memory but faster
```

### 4. Lower LoRA rank
```yaml
peft:
  r: 8  # or even 4
  lora_alpha: 16
```

### 5. Switch to CPU (last resort)
Very slow (days), but works:
```yaml
model:
  load_in_4bit: false  # CPU cannot handle 4-bit quantization well
```

---

## 📊 Expected Performance

On **Colab T4 (free)** with 7B model:

| Metric | Value |
|--------|-------|
| Training time (2 epochs, 5K examples) | ~3-4 hours |
| Training time (2 epochs, 50K examples) | ~12-18 hours |
| VRAM usage | 10-12 GB |
| Disk space needed | 5-10 GB (model + checkpoints) |
| Inference throughput | ~15-25 tokens/sec |

---

## ☁️ Upgrading to A100 (Colab Pro)

If you have **Colab Pro** with A100 (40GB):

1. Change model in config:
   ```yaml
   model:
     name: "Qwen/Qwen2.5-Coder-32B"
   ```

2. Increase context:
   ```yaml
   tokenizer:
     model_max_length: 32768
   ```

3. Increase batch size:
   ```yaml
   training:
     per_device_train_batch_size: 4
     gradient_accumulation_steps: 4
   ```

4. Training time for 50K examples: ~6-8 hours

---

## 📤 Exporting to Hugging Face Hub

After merging, push to HF:

```python
from huggingface_hub import HfApi

api = HfApi(token="your-hf-token")
api.upload_folder(
    folder_path="./model_final",
    repo_id="your-org/stack-2.9-7b-lora",
    repo_type="model"
)
```

Then update `TOGETHER_AI.md` with your model ID.

---

## 🔄 Resuming Interrupted Training

Colab can disconnect unexpectedly. Use checkpointing:

1. Check if checkpoint exists: `ls -la adapters_colab/checkpoint-*`
2. To resume, add to config:
   ```yaml
   training:
     resume_from_checkpoint: "./adapters_colab/checkpoint-XXX"
   ```
   Or pass CLI arg:
   ```bash
   python -m stack_2_9_training.train_lora --config train_config.yaml --resume_from_checkpoint ./adapters_colab/checkpoint-XXX
   ```

---

## 🧪 Quick Validation Before Full Training

Run a mini training to verify setup:

```bash
python scripts/create_mini_dataset.py --size 100  # 100 examples
python -m stack_2_9_training.train_lora --config train_config_colab.yaml --num_train_epochs 1
```

Should take 15-30 minutes and give you a sense of whether training works.

---

## 📁 Files in This Package

- `COLAB_TRAINING.md` - This guide
- `colab_train_stack29.ipynb` - Ready-to-run Colab notebook
- `train_config_colab.yaml` - Optimized config for T4/7B
- `scripts/create_mini_dataset.py` - Create 5K sample dataset
- `stack_2_9_training/` - Training package (prepare_data, train_lora, merge_adapter)

---

## 🆘 Getting Help

- **Colab issues**: Check Google Colab documentation
- **CUDA OOM**: Reduce `max_seq_length` to 4096, increase `gradient_accumulation_steps`
- **Training crashes**: Ensure you have enough disk space (at least 10GB free)
- **Slow training**: Verify `bf16` is enabled (T4 supports it), check `nvidia-smi` for GPU utilization

---

## ✅ Ready to Go!

The Colab notebook is pre-configured and ready to execute. Just open it, select **GPU runtime**, and run all cells.

**Expected outcome:** Trained LoRA adapter in `./adapters_colab/`, merged model in `./model_final/`, ready for evaluation and Hugging Face publication.