walidsobhie-code

refactor: Squeeze folders further - cleaner structure

65888d5 about 2 months ago

Google Colab Training Guide for Stack 2.9

This guide walks through training Stack 2.9 Pattern Memory LoRA adapters using free Google Colab T4 GPUs.

⚡ Quick Start (3-5 hours)

1. Open Colab: https://colab.research.google.com/
2. Upload colab_train_stack29.ipynb
3. Runtime → Change runtime type → GPU (T4)
4. Run all cells sequentially

That's it! The notebook handles everything.

📋 Prerequisites

Google account (for Colab)
Basic understanding of notebook execution
(Optional) Google Drive for persistent storage

🎯 What This Covers

Setting up the environment on Colab
Mounting Google Drive to keep your data between sessions
Installing dependencies (PyTorch, Transformers, PEFT, etc.)
Preparing training data (either full or mini dataset)
Training LoRA adapter on Qwen2.5-Coder-7B (or 32B if you have A100)
Merging adapter with base model
Testing inference with the trained model
Exporting to Hugging Face Hub (optional)

⏱️ Estimated Timings (T4 GPU)

Step	Duration
Environment setup	5-10 min
Data preparation	2-5 min (using mini dataset) / 30-60 min (full dataset)
Training (2 epochs, 7B)	3-5 hours
Adapter merging	2-3 min
Inference testing	1-2 min
Total	~4-6 hours

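Note: The Colab free tier has a runtime limit of roughly 12 hours. A 7B run on the mini dataset fits within it; a full 50K-example run (~12-18 hours on a T4) does not fit in a single session and has to be resumed from a checkpoint.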
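
As a quick sanity check, the per-step ranges in the table can be summed directly. The sketch below is plain arithmetic over the mini-dataset figures; the names `STEP_MINUTES` and `total_minutes` are illustrative and not part of the training package.

```python
# Per-step duration ranges in minutes, copied from the table above
# (mini-dataset figure used for data preparation).
STEP_MINUTES = {
    "environment_setup": (5, 10),
    "data_preparation": (2, 5),
    "training": (3 * 60, 5 * 60),
    "adapter_merging": (2, 3),
    "inference_testing": (1, 2),
}

COLAB_LIMIT_HOURS = 12  # approximate free-tier session limit quoted above


def total_minutes(steps):
    """Return (low, high) totals in minutes across all steps."""
    low = sum(lo for lo, _ in steps.values())
    high = sum(hi for _, hi in steps.values())
    return low, high


low, high = total_minutes(STEP_MINUTES)
print(f"Estimated total: {low / 60:.1f}-{high / 60:.1f} hours")
print("Fits in one session:", high / 60 <= COLAB_LIMIT_HOURS)
```

The raw sum comes out slightly below the table's ~4-6 hour total, which leaves some slack for downloads and restarts.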

💾 Storage Strategy

Option A: Google Drive (Recommended for persistence)

from google.colab import drive
drive.mount('/content/drive')
# Data stored in /content/drive/MyDrive/stack-2.9/

Pros: Data persists after runtime disconnect, no re-upload needed.

Option B: Local Colab storage (ephemeral)

# Data stored in /content/stack-2.9/
# Lost when runtime disconnects (~12 hours max)

Use for: Quick experiments, one-off training runs.
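
One way to switch between the two options automatically is to check whether Drive is mounted and fall back to local storage otherwise. This is a minimal sketch, not code from the notebook; `pick_data_dir` is a hypothetical helper, and the default paths are the ones quoted in the comments above.

```python
from pathlib import Path


def pick_data_dir(drive_root="/content/drive/MyDrive",
                  local_root="/content",
                  project="stack-2.9"):
    """Return the Drive-backed project directory if Drive is mounted,
    otherwise an ephemeral directory under local Colab storage."""
    drive = Path(drive_root)
    base = drive if drive.is_dir() else Path(local_root)
    data_dir = base / project
    data_dir.mkdir(parents=True, exist_ok=True)  # create on first use
    return data_dir


# In a Colab cell, after the optional drive.mount(...) call:
# data_dir = pick_data_dir()
```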

🧠 Memory Optimization for T4 (15GB VRAM)

The provided train_config_colab.yaml is tuned specifically for T4:

Base model: Qwen/Qwen2.5-Coder-7B (4-bit ≈ 4.5GB)
Context length: 8192 (instead of 131072)
Batch size: 1 (with gradient accumulation 16)
LoRA rank: 16 (instead of 64)
4-bit quantization: load_in_4bit=True
8-bit optimizer: paged_adamw_8bit
Gradient checkpointing: Enabled
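BF16 precision: Enabled in the config (note that the T4 is a Turing GPU without native BF16 support; if training is slow or fails on precision, switch to FP16)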

Total expected VRAM usage: ~10-12GB (leaves headroom)
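
The batch size and gradient accumulation settings above combine into an effective batch size of 16. The sketch below works through that arithmetic and a rough optimizer-step count for the 5K mini dataset; it assumes one GPU and no sequence packing, and the exact count depends on how the trainer handles the final partial batch.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Examples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


def optimizer_steps(num_examples, epochs, per_device_batch, grad_accum_steps):
    """Approximate number of optimizer steps for a full run."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps)
    return math.ceil(num_examples / batch) * epochs


print(effective_batch_size(1, 16))      # 16
print(optimizer_steps(5000, 2, 1, 16))  # 626
```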

🛠️ Step-by-Step Instructions

1. Notebook Setup

Open colab_train_stack29.ipynb in Colab. It contains pre-filled cells with:

Dependency installation
Drive mounting (optional)
Clone repo / upload data
Copy training config
Run training
Merge adapter
Test inference

2. Install Dependencies

The notebook installs:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.0 peft==0.10.0 accelerate bitsandbytes==0.43.0 datasets pyyaml

Takes ~5 minutes.

3. Prepare Training Data

For quick prototyping (recommended first run):

python scripts/create_mini_dataset.py --size 5000 --output data_mini/train_mini.jsonl

This creates a 5K stratified sample in ~30 seconds.
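
For readers unfamiliar with stratified sampling, the idea can be sketched as follows. This is not the implementation in scripts/create_mini_dataset.py; the grouping field `category` and the helper names are assumptions made for illustration.

```python
import json
import random
from collections import defaultdict


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def stratified_sample(records, size, key="category", seed=0):
    """Sample up to `size` records, keeping each group's share of the
    sample roughly proportional to its share of the full dataset."""
    total = len(records)
    if size >= total:
        return list(records)
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(key, "unknown")].append(rec)
    sample = []
    for name in sorted(groups):
        members = groups[name]
        # Every group gets at least one example, so rare groups survive
        quota = max(1, round(size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    rng.shuffle(sample)
    return sample[:size]
```

Because quotas are rounded per group, the result can be slightly smaller than `size` when there are many groups.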

For full training:

Upload your existing training-data/final/train.jsonl to Colab, either to Google Drive or to local Colab storage.

4. Prepare Configuration

Copy the Colab-optimized config:

cp stack_2_9_training/train_config_colab.yaml stack_2_9_training/train_config.yaml

Or edit train_config.yaml directly to match the Colab settings.

5. Run Training

cd stack-2.9-training
python -m stack_2_9_training.train_lora --config train_config.yaml

Monitor progress:

Watch nvidia-smi in a separate cell: !nvidia-smi --loop=5
Training logs show loss per step
Checkpoints saved every 500 steps to ./adapters/

Expected output:

Train loss: 1.234
Step 100/2000 - loss 1.234
...
Training completed. Model saved to ./adapters/
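
If you want to track progress programmatically, the step lines can be parsed with a regular expression. The sketch below assumes the log format matches the sample output above exactly; the real training script may log differently, and `parse_progress` is a hypothetical helper.

```python
import re

# Matches lines such as "Step 100/2000 - loss 1.234"
STEP_RE = re.compile(r"Step (\d+)/(\d+) - loss (\d+(?:\.\d+)?)")


def parse_progress(line):
    """Return step, total, loss and completed fraction, or None if the
    line is not a step line."""
    m = STEP_RE.search(line)
    if m is None:
        return None
    step, total, loss = int(m.group(1)), int(m.group(2)), float(m.group(3))
    return {"step": step, "total": total, "loss": loss, "fraction": step / total}


info = parse_progress("Step 100/2000 - loss 1.234")
print(f"{info['fraction']:.0%} done, loss {info['loss']}")  # 5% done, loss 1.234
```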

6. Merge Adapter

After training finishes:

python -m stack_2_9_training.merge_adapter --base-model Qwen/Qwen2.5-Coder-7B

Output: ./model_final/ with full model + tokenizer.

Takes 2-3 minutes.

7. Test Inference

Quick test:

from stack_2_9_eval.model_client import create_model_client

# Point to your merged model
client = create_model_client(
    provider="ollama",  # or use direct HF pipeline
    model="./model_final"
)

result = client.generate("Write a Python function to reverse a string")
print(result.text)

For production use, serve via vLLM or Hugging Face TGI.

🚨 Troubleshooting OOM (Out of Memory)

If you get CUDA OOM errors, try these fixes in order:

1. Reduce sequence length

Edit train_config_colab.yaml:

training:
  max_seq_length: 4096  # instead of 8192

2. Confirm the batch size is 1

training:
  per_device_train_batch_size: 1  # already 1, the minimum
  gradient_accumulation_steps: 32  # optional: raises the effective batch size to 32; slower per optimizer step, no change in peak memory

3. Make sure gradient checkpointing is enabled (trades speed for memory)

training:
  gradient_checkpointing: true  # recomputes activations in the backward pass; slower, but uses less memory

4. Lower LoRA rank

peft:
  r: 8  # or even 4
  lora_alpha: 16

5. Switch to CPU (last resort)

Very slow (potentially days), and it needs enough system RAM to hold the unquantized model:

model:
  load_in_4bit: false  # CPU cannot handle 4-bit quantization well
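
The memory-reducing fixes in this section (shorter sequences, then a lower LoRA rank) can be applied one at a time until training fits. The sketch below generates those fallback configs from a plain dict. It is illustrative only: the key names follow the YAML snippets above, `oom_fallbacks` is a hypothetical helper, and keeping `lora_alpha` at twice the rank is an assumption extrapolated from the `r: 8` / `lora_alpha: 16` example.

```python
import copy


def oom_fallbacks(config):
    """Yield progressively lighter copies of `config`, one change per step.
    The input dict is left untouched."""
    cfg = copy.deepcopy(config)
    # Fix 1: cap the sequence length at 4096
    if cfg["training"]["max_seq_length"] > 4096:
        cfg["training"]["max_seq_length"] = 4096
        yield copy.deepcopy(cfg)
    # Fix 4: lower the LoRA rank to 8, then 4 (alpha = 2 * r is an assumption)
    for rank in (8, 4):
        if cfg["peft"]["r"] > rank:
            cfg["peft"]["r"] = rank
            cfg["peft"]["lora_alpha"] = 2 * rank
            yield copy.deepcopy(cfg)


base = {"training": {"max_seq_length": 8192}, "peft": {"r": 16, "lora_alpha": 32}}
for step in oom_fallbacks(base):
    print(step["training"]["max_seq_length"], step["peft"]["r"])
```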

📊 Expected Performance

On Colab T4 (free) with 7B model:

Metric	Value
Training time (2 epochs, 5K examples)	~3-4 hours
Training time (2 epochs, 50K examples)	~12-18 hours
VRAM usage	10-12 GB
Disk space needed	5-10 GB (model + checkpoints)
Inference throughput	~15-25 tokens/sec

☁️ Upgrading to A100 (Colab Pro)

If you have Colab Pro with A100 (40GB):

Change model in config:

model:
  name: "Qwen/Qwen2.5-Coder-32B"

Increase context:
```
tokenizer:
  model_max_length: 32768
```

Increase batch size:

training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4

Training time for 50K examples: ~6-8 hours
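
A back-of-the-envelope estimate of weight memory shows why the 32B model needs the A100. The sketch below counts weights only, using nominal parameter counts of 7B and 32B (an approximation); real usage is higher because of quantization metadata, layers kept in higher precision, activations, and optimizer state.

```python
def weight_memory_gb(num_params_billion, bits_per_param):
    """Approximate memory for model weights alone, in GB (10**9 bytes)."""
    return num_params_billion * bits_per_param / 8


for name, params in (("7B", 7), ("32B", 32)):
    print(f"{name}: 4-bit ~ {weight_memory_gb(params, 4):.1f} GB, "
          f"16-bit ~ {weight_memory_gb(params, 16):.1f} GB")
```

At 4 bits the 32B weights come to roughly 16 GB before overhead, which exceeds what a T4 can hold but leaves room on a 40GB A100.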

📤 Exporting to Hugging Face Hub

After merging, push to HF:

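import os

from huggingface_hub import HfApi

# Read the token from the environment instead of hard-coding it
api = HfApi(token=os.environ["HF_TOKEN"])
repo_id = "your-org/stack-2.9-7b-lora"

# Create the repo if it does not exist yet, then upload the merged model
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./model_final",
    repo_id=repo_id,
    repo_type="model",
)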

Then update TOGETHER_AI.md with your model ID.

🔄 Resuming Interrupted Training

Colab can disconnect unexpectedly. Use checkpointing:

Check if checkpoint exists: ls -la adapters_colab/checkpoint-*

To resume, add to config:

training:
  resume_from_checkpoint: "./adapters_colab/checkpoint-XXX"

Or pass CLI arg:

python -m stack_2_9_training.train_lora --config train_config.yaml --resume_from_checkpoint ./adapters_colab/checkpoint-XXX
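
Rather than filling in checkpoint-XXX by hand, you can pick the newest checkpoint by step number. This is a minimal sketch assuming checkpoints are directories named checkpoint-<step>, as the `ls` pattern above suggests; `latest_checkpoint` is a hypothetical helper and not part of the training package.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for path in root.glob("checkpoint-*"):
        m = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare steps numerically: as strings, "500" would sort after "1000"
        if m and path.is_dir() and int(m.group(1)) > best_step:
            best_step, best_path = int(m.group(1)), path
    return best_path


print(latest_checkpoint("./adapters_colab"))
```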

🧪 Quick Validation Before Full Training

Run a mini training to verify setup:

python scripts/create_mini_dataset.py --size 100  # 100 examples
python -m stack_2_9_training.train_lora --config train_config_colab.yaml --num_train_epochs 1

This should take 15-30 minutes and confirms that the pipeline runs end to end before you commit to a full run.

📁 Files in This Package

COLAB_TRAINING.md - This guide
colab_train_stack29.ipynb - Ready-to-run Colab notebook
train_config_colab.yaml - Optimized config for T4/7B
scripts/create_mini_dataset.py - Create a sampled mini dataset (5K examples in the example above)
stack_2_9_training/ - Training package (prepare_data, train_lora, merge_adapter)

🆘 Getting Help

Colab issues: Check Google Colab documentation
CUDA OOM: Reduce max_seq_length to 4096, keep gradient checkpointing enabled, then lower the LoRA rank
Training crashes: Ensure you have enough disk space (at least 10GB free)
Slow training: Verify that mixed precision is enabled (the T4 has no native bf16 support, so fp16 is usually the right choice there) and check nvidia-smi for GPU utilization
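
The disk-space check can be run before training starts. The sketch below uses only the Python standard library; the 10 GB threshold comes from the list above, and the helper names are illustrative.

```python
import shutil


def free_disk_gb(path="."):
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 1024**3


def has_enough_disk(path=".", min_free_gb=10):
    """True if at least `min_free_gb` GiB are free at `path`."""
    return free_disk_gb(path) >= min_free_gb


print(f"Free disk: {free_disk_gb():.1f} GiB, enough for training: {has_enough_disk()}")
```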

✅ Ready to Go!

The Colab notebook is pre-configured and ready to execute. Just open it, select GPU runtime, and run all cells.

Expected outcome: Trained LoRA adapter in ./adapters_colab/, merged model in ./model_final/, ready for evaluation and Hugging Face publication.