Google Colab Training Guide for Stack 2.9
This guide walks through training Stack 2.9 Pattern Memory LoRA adapters using free Google Colab T4 GPUs.
⚡ Quick Start (3-5 hours)
- Open Colab: https://colab.research.google.com/
- Upload colab_train_stack29.ipynb
- Runtime → Change runtime type → GPU (T4)
- Run all cells sequentially
That's it! The notebook handles everything.
📋 Prerequisites
- Google account (for Colab)
- Basic understanding of notebook execution
- (Optional) Google Drive for persistent storage
🎯 What This Covers
- Setting up the environment on Colab
- Mounting Google Drive to keep your data between sessions
- Installing dependencies (PyTorch, Transformers, PEFT, etc.)
- Preparing training data (either full or mini dataset)
- Training LoRA adapter on Qwen2.5-Coder-7B (or 32B if you have A100)
- Merging adapter with base model
- Testing inference with the trained model
- Exporting to Hugging Face Hub (optional)
⏱️ Estimated Timings (T4 GPU)
| Step | Duration |
|---|---|
| Environment setup | 5-10 min |
| Data preparation | 2-5 min (using mini dataset) / 30-60 min (full dataset) |
| Training (2 epochs, 7B) | 3-5 hours |
| Adapter merging | 2-3 min |
| Inference testing | 1-2 min |
| Total | ~4-6 hours |
Note: The Colab free tier limits a session to roughly 12 hours. The run timed above fits within that limit; longer runs should rely on checkpoints so they can be resumed.
💾 Storage Strategy
Option A: Google Drive (Recommended for persistence)
from google.colab import drive
drive.mount('/content/drive')
# Data stored in /content/drive/MyDrive/stack-2.9/
Pros: Data persists after runtime disconnect, no re-upload needed.
Option B: Local Colab storage (ephemeral)
# Data stored in /content/stack-2.9/
# Lost when runtime disconnects (~12 hours max)
Use for: Quick experiments, one-off training runs.
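The choice between the two options can also be made at runtime. The following is a minimal sketch, assuming Drive is mounted at /content/drive as in Option A; the helper name resolve_data_dir is hypothetical and not part of the notebook.

```python
import os


def resolve_data_dir(drive_root="/content/drive/MyDrive",
                     local_root="/content",
                     project="stack-2.9"):
    """Return the project directory on Drive if it is mounted, else on local disk.

    Hypothetical helper: the notebook may organise its paths differently.
    """
    # Drive counts as mounted when its root directory exists.
    root = drive_root if os.path.isdir(drive_root) else local_root
    return os.path.join(root, project)


if __name__ == "__main__":
    data_dir = resolve_data_dir()
    print(f"Using data directory: {data_dir}")
```

Falling back to local storage keeps the notebook runnable when Drive is not mounted, at the cost of losing data on disconnect.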
🔧 Memory Optimization for T4 (15GB VRAM)
The provided train_config_colab.yaml is tuned specifically for T4:
- Base model: Qwen/Qwen2.5-Coder-7B (4-bit ≈ 4.5GB)
- Context length: 8192 (instead of 131072)
- Batch size: 1 (with gradient accumulation 16)
- LoRA rank: 16 (instead of 64)
- 4-bit quantization: load_in_4bit=True
- 8-bit optimizer: paged_adamw_8bit
- Gradient checkpointing: Enabled
- BF16 precision: Enabled
Total expected VRAM usage: ~10-12GB (leaves headroom)
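The weight figure in the list above can be sanity-checked with simple arithmetic. This is a rough sketch, assuming about 7.6 billion parameters for the 7B model (an assumption, not a number from this guide). It ignores quantization metadata, layers kept in higher precision, activations, and optimizer state, so the raw result comes out somewhat below the 4.5GB quoted.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Raw size of the model weights in GB (10**9 bytes) at a given precision."""
    return n_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    n_params = 7.6e9  # assumed parameter count for the 7B model
    print(f"4-bit weights:  {weight_memory_gb(n_params, 4):.1f} GB")
    print(f"16-bit weights: {weight_memory_gb(n_params, 16):.1f} GB")
```

At 16 bits the weights alone would take roughly the whole T4, which is why 4-bit loading is part of the configuration.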
🛠️ Step-by-Step Instructions
1. Notebook Setup
Open colab_train_stack29.ipynb in Colab. It contains pre-filled cells with:
- Dependency installation
- Drive mounting (optional)
- Clone repo / upload data
- Copy training config
- Run training
- Merge adapter
- Test inference
2. Install Dependencies
The notebook installs:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.0 peft==0.10.0 accelerate bitsandbytes==0.43.0 datasets pyyaml
Takes ~5 minutes.
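After installation, it can be worth confirming that the pinned versions are the ones actually in the environment, since Colab images ship with their own preinstalled packages. A minimal sketch using only the standard library; the pins are copied from the pip command above and the helper name check_pins is illustrative.

```python
from importlib import metadata

# Pins copied from the pip command above.
PINNED = {
    "transformers": "4.40.0",
    "peft": "0.10.0",
    "bitsandbytes": "0.43.0",
}


def check_pins(pins):
    """Return {package: (expected, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    """
    problems = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            problems[name] = (expected, found)
    return problems


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```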
3. Prepare Training Data
For quick prototyping (recommended first run):
python scripts/create_mini_dataset.py --size 5000 --output data_mini/train_mini.jsonl
This creates a 5K stratified sample in ~30 seconds.
For full training:
Upload your existing training-data/final/train.jsonl to Colab, either to Drive or to local storage.
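For readers curious what a stratified sample involves, here is a minimal sketch. It is not the code in scripts/create_mini_dataset.py. It assumes each JSONL record carries a category field (a hypothetical name) and samples each category in proportion to its share of the data.

```python
import json
import random
from collections import defaultdict


def stratified_sample(records, size, key="category", seed=0):
    """Pick about `size` records, keeping each group's share of the data.

    `key` names the field used for grouping; "category" is an assumed
    field name, not one confirmed by the Stack 2.9 data format.
    """
    if not records:
        return []
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[rec.get(key, "unknown")].append(rec)

    total = len(records)
    sample = []
    for members in groups.values():
        # Proportional quota, with at least one record per group.
        quota = max(1, round(size * len(members) / total))
        sample.extend(rng.sample(members, min(quota, len(members))))
    rng.shuffle(sample)
    return sample[:size]


def read_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

A fixed seed keeps the mini dataset reproducible across Colab sessions.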
4. Prepare Configuration
Copy the Colab-optimized config:
cp stack_2_9_training/train_config_colab.yaml stack_2_9_training/train_config.yaml
Or edit train_config.yaml directly to match the Colab settings.
5. Run Training
cd stack-2.9-training
python -m stack_2_9_training.train_lora --config train_config.yaml
Monitor progress:
- Watch nvidia-smi in a separate cell: !nvidia-smi --loop=5
- Training logs show loss per step
- Checkpoints saved every 500 steps to ./adapters/
Expected output:
Train loss: 1.234
Step 100/2000 - loss 1.234
...
Training completed. Model saved to ./adapters/
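If the logs follow the "Step N/M - loss X" shape shown above (an assumption; the real format may differ), the loss curve can be pulled out of a saved log with a short script. The names below are illustrative.

```python
import re

# Matches lines such as "Step 100/2000 - loss 1.234".
STEP_RE = re.compile(r"Step\s+(\d+)/(\d+)\s+-\s+loss\s+([0-9]*\.?[0-9]+)")


def parse_losses(log_text):
    """Return a list of (step, total_steps, loss) tuples found in the log."""
    return [
        (int(step), int(total), float(loss))
        for step, total, loss in STEP_RE.findall(log_text)
    ]


if __name__ == "__main__":
    sample = "Train loss: 1.234\nStep 100/2000 - loss 1.234\n"
    for step, total, loss in parse_losses(sample):
        print(f"{step / total:.0%} done, loss {loss:.3f}")
```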
6. Merge Adapter
After training finishes:
python -m stack_2_9_training.merge_adapter --base-model Qwen/Qwen2.5-Coder-7B
Output: ./model_final/ with full model + tokenizer.
Takes 2-3 minutes.
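What a merge step typically does can be sketched with the PEFT API. This is not the code in stack_2_9_training.merge_adapter. The function name and default paths are illustrative, and it assumes the adapter was saved in standard PEFT format.

```python
def merge_lora_adapter(base_model="Qwen/Qwen2.5-Coder-7B",
                       adapter_dir="./adapters",
                       output_dir="./model_final"):
    """Fold LoRA weights into the base model and save a standalone copy."""
    # Imported lazily so this module loads even where the GPU stack is absent.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load the base model unquantized, in half precision, so the LoRA
    # deltas can be added to the original weight matrices.
    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.float16)
    model = PeftModel.from_pretrained(base, adapter_dir)

    # merge_and_unload() returns the base architecture with adapters applied.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)

    # Ship the tokenizer alongside the weights.
    AutoTokenizer.from_pretrained(base_model).save_pretrained(output_dir)
    return output_dir
```

After merging, the output directory can be loaded like any other Transformers checkpoint, without PEFT installed.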
7. Test Inference
Quick test:
from stack_2_9_eval.model_client import create_model_client
# Point to your merged model
client = create_model_client(
provider="ollama", # or use direct HF pipeline
model="./model_final"
)
result = client.generate("Write a Python function to reverse a string")
print(result.text)
For production use, serve via vLLM or Hugging Face TGI.
🚨 Troubleshooting OOM (Out of Memory)
If you get CUDA OOM errors, try these fixes in order:
1. Reduce sequence length
Edit train_config_colab.yaml:
training:
  max_seq_length: 4096 # instead of 8192
2. Keep the per-device batch size at its minimum
training:
  per_device_train_batch_size: 1 # already 1; cannot go lower
  gradient_accumulation_steps: 32 # optional: doubles the effective batch size; does not reduce peak memory
3. Keep gradient checkpointing enabled (memory vs speed trade-off)
training:
  gradient_checkpointing: true # saves memory at some cost in speed; setting this to false uses more memory
4. Lower LoRA rank
peft:
  r: 8 # or even 4
  lora_alpha: 16
5. Switch to CPU (last resort)
Very slow (days), but works:
model:
  load_in_4bit: false # CPU cannot handle 4-bit quantization well
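The sequence-length and LoRA-rank fixes can also be applied programmatically between retries. A minimal sketch operating on a plain dict that mirrors the YAML keys shown above; the helper name and the floor values are illustrative assumptions, not part of the training package.

```python
import copy


def shrink_for_oom(config, min_seq_len=1024, min_rank=4):
    """Return a copy of `config` with one memory-saving step applied.

    Halves max_seq_length first; once that reaches `min_seq_len`, halves the
    LoRA rank (scaling lora_alpha to keep their ratio). Returns None when
    nothing is left to shrink.
    """
    new = copy.deepcopy(config)
    training, peft = new["training"], new["peft"]

    if training["max_seq_length"] > min_seq_len:
        training["max_seq_length"] = max(min_seq_len, training["max_seq_length"] // 2)
        return new
    if peft["r"] > min_rank:
        ratio = peft["lora_alpha"] / peft["r"]
        peft["r"] = max(min_rank, peft["r"] // 2)
        peft["lora_alpha"] = int(peft["r"] * ratio)
        return new
    return None
```

Returning a copy leaves the original settings untouched, so a run that succeeds can be compared against the configuration that failed.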
📊 Expected Performance
On Colab T4 (free) with 7B model:
| Metric | Value |
|---|---|
| Training time (2 epochs, 5K examples) | ~3-4 hours |
| Training time (2 epochs, 50K examples) | ~12-18 hours |
| VRAM usage | 10-12 GB |
| Disk space needed | 5-10 GB (model + checkpoints) |
| Inference throughput | ~15-25 tokens/sec |
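To relate these timings to the Colab session limit, it helps to count optimizer steps and multiply by a measured time per step. A sketch under stated assumptions: the defaults mirror the T4 batch settings in this guide, the step count is approximate, and the seconds-per-step figure is something you measure from your own logs, not a value from this guide.

```python
import math


def optimizer_steps(n_examples, epochs, per_device_batch=1, grad_accum=16):
    """Approximate number of optimizer steps for a run."""
    examples_per_step = per_device_batch * grad_accum
    return math.ceil(n_examples / examples_per_step) * epochs


def fits_in_session(n_steps, seconds_per_step, limit_hours=12.0):
    """True if the run should finish within one Colab session."""
    return n_steps * seconds_per_step / 3600 <= limit_hours


if __name__ == "__main__":
    steps = optimizer_steps(5000, epochs=2)
    print(f"5K examples, 2 epochs: {steps} optimizer steps")
    # 20 s/step is a placeholder; substitute your own measurement.
    print("Fits in one session:", fits_in_session(steps, seconds_per_step=20))
```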
☁️ Upgrading to A100 (Colab Pro)
If you have Colab Pro with A100 (40GB):
Change model in config:
model:
  name: "Qwen/Qwen2.5-Coder-32B"
Increase context:
tokenizer:
  model_max_length: 32768
Increase batch size:
training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 4
Training time for 50K examples: ~6-8 hours
📤 Exporting to Hugging Face Hub
After merging, push to HF:
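from huggingface_hub import HfApi
api = HfApi(token="your-hf-token")
# Create the target repo first; upload_folder fails if it does not exist.
api.create_repo(repo_id="your-org/stack-2.9-7b-lora", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./model_final",
    repo_id="your-org/stack-2.9-7b-lora",
    repo_type="model",
)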
Then update TOGETHER_AI.md with your model ID.
🔄 Resuming Interrupted Training
Colab can disconnect unexpectedly. Use checkpointing:
- Check if checkpoint exists:
ls -la adapters_colab/checkpoint-*
- To resume, add to config:
training:
  resume_from_checkpoint: "./adapters_colab/checkpoint-XXX"
Or pass CLI arg:
python -m stack_2_9_training.train_lora --config train_config.yaml --resume_from_checkpoint ./adapters_colab/checkpoint-XXX
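Picking the checkpoint by hand is error-prone because a plain alphabetical sort puts checkpoint-1000 before checkpoint-500. A small sketch that selects the highest step number, assuming the checkpoint-<step> directory naming implied by the ls pattern above; the helper name is illustrative.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir="./adapters_colab"):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for path in root.glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        # Compare step numbers numerically, not as strings.
        if match and path.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


if __name__ == "__main__":
    ckpt = latest_checkpoint()
    print(f"Resume from: {ckpt}" if ckpt else "No checkpoint found; starting fresh.")
```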
🧪 Quick Validation Before Full Training
Run a mini training to verify setup:
python scripts/create_mini_dataset.py --size 100 # 100 examples
python -m stack_2_9_training.train_lora --config train_config_colab.yaml --num_train_epochs 1
This should take 15-30 minutes and confirms that the pipeline runs end to end before you commit to a full run.
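An even cheaper check before the trial run is to confirm that the dataset file parses. A sketch that only verifies each line is a JSON object; it makes no assumption about which fields the trainer expects, and the helper name is illustrative.

```python
import json


def validate_jsonl(path, max_errors=5):
    """Return (n_valid, errors), where errors lists (line_number, message)."""
    n_valid, errors = 0, []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
                if not isinstance(record, dict):
                    raise ValueError("expected a JSON object")
                n_valid += 1
            except ValueError as exc:  # JSONDecodeError subclasses ValueError
                errors.append((lineno, str(exc)))
                if len(errors) >= max_errors:
                    break
    return n_valid, errors
```

Stopping after a handful of errors keeps the check fast on large files while still pointing at the first broken lines.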
📁 Files in This Package
- COLAB_TRAINING.md - This guide
- colab_train_stack29.ipynb - Ready-to-run Colab notebook
- train_config_colab.yaml - Optimized config for T4/7B
- scripts/create_mini_dataset.py - Create 5K sample dataset
- stack_2_9_training/ - Training package (prepare_data, train_lora, merge_adapter)
🆘 Getting Help
- Colab issues: Check Google Colab documentation
- CUDA OOM: Reduce max_seq_length to 4096 or lower the LoRA rank
- Training crashes: Ensure you have enough disk space (at least 10GB free)
- Slow training: Check nvidia-smi for GPU utilization. The T4 has no native BF16 support, so if bf16 causes errors or slowdowns, try fp16 instead
✅ Ready to Go!
The Colab notebook is pre-configured and ready to execute. Just open it, select GPU runtime, and run all cells.
Expected outcome: Trained LoRA adapter in ./adapters_colab/, merged model in ./model_final/, ready for evaluation and Hugging Face publication.