# Google Colab Training Guide for Stack 2.9

This guide walks through training Stack 2.9 Pattern Memory LoRA adapters using **free Google Colab** T4 GPUs.

---

## Quick Start (3-5 hours)

1. **Open Colab**: https://colab.research.google.com/
2. **Upload** `colab_train_stack29.ipynb`
3. **Runtime → Change runtime type → GPU (T4)**
4. **Run all cells sequentially**

That's it! The notebook handles everything.

---

## Prerequisites

- Google account (for Colab)
- Basic understanding of notebook execution
- (Optional) Google Drive for persistent storage

---

## What This Covers

1. **Setting up the environment** on Colab
2. **Mounting Google Drive** to keep your data between sessions
3. **Installing dependencies** (PyTorch, Transformers, PEFT, etc.)
4. **Preparing training data** (either full or mini dataset)
5. **Training LoRA adapter** on Qwen2.5-Coder-7B (or 32B if you have A100)
6. **Merging adapter** with base model
7. **Testing inference** with the trained model
8. **Exporting to Hugging Face Hub** (optional)

---

## Estimated Timings (T4 GPU)

| Step | Duration |
|------|----------|
| Environment setup | 5-10 min |
| Data preparation | 2-5 min (mini dataset) / 30-60 min (full dataset) |
| Training (2 epochs, 7B) | 3-5 hours |
| Adapter merging | 2-3 min |
| Inference testing | 1-2 min |
| **Total** | **~4-6 hours** |

**Note:** The Colab free tier has a runtime limit of ~12 hours. A 2-epoch run on the 5K mini dataset fits within this; longer runs need checkpointing so they can be resumed.

---

## Storage Strategy

### Option A: Google Drive (Recommended for persistence)

```python
from google.colab import drive
drive.mount('/content/drive')
# Data stored in /content/drive/MyDrive/stack-2.9/
```

**Pros:** Data persists after runtime disconnect, no re-upload needed.

### Option B: Local Colab storage (ephemeral)

```bash
# Data stored in /content/stack-2.9/
# Lost when runtime disconnects (~12 hours max)
```

**Use for:** Quick experiments, one-off training runs.
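
The choice between the two options can be made automatically. The sketch below is not part of the notebook: `choose_data_dir` is a hypothetical helper, and it assumes Drive is considered mounted whenever `/content/drive/MyDrive` exists. Call `choose_data_dir()` once after the optional `drive.mount(...)` cell and use the returned path for data and checkpoints.

```python
import os

# Paths taken from the two storage options above.
DRIVE_ROOT = "/content/drive/MyDrive"  # exists only after drive.mount(...)
DRIVE_DATA_DIR = "/content/drive/MyDrive/stack-2.9"
LOCAL_DATA_DIR = "/content/stack-2.9"


def choose_data_dir(drive_root=DRIVE_ROOT, drive_dir=DRIVE_DATA_DIR,
                    local_dir=LOCAL_DATA_DIR):
    """Return the persistent Drive folder if Drive is mounted, else the local one.

    Hypothetical helper: the target directory is created if it is missing.
    """
    target = drive_dir if os.path.isdir(drive_root) else local_dir
    os.makedirs(target, exist_ok=True)
    return target
```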

---

## Memory Optimization for T4 (15GB VRAM)

The provided `train_config_colab.yaml` is tuned specifically for T4:

- **Base model**: `Qwen/Qwen2.5-Coder-7B` (~4.5GB in 4-bit)
- **Context length**: 8192 (instead of 131072)
- **Batch size**: 1 (with gradient accumulation 16)
- **LoRA rank**: 16 (instead of 64)
- **4-bit quantization**: `load_in_4bit=True`
- **8-bit optimizer**: `paged_adamw_8bit`
- **Gradient checkpointing**: Enabled
- **BF16 precision**: Enabled (the T4 has no native BF16 support; switch to FP16 if training is slow or fails to start)

**Total expected VRAM usage**: ~10-12GB (leaves headroom)
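
A rough back-of-the-envelope check of the weight footprint, as a sketch only. The parameter count of 7.6 billion is an assumption for a "7B" Qwen2.5 model, and the estimate covers raw weights only. Quantization constants and layers left unquantized add overhead, which plausibly accounts for the gap between the raw 4-bit figure and the ~4.5GB quoted above. Activations, LoRA weights and optimizer state come on top.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Raw weight storage in GB (1 GB = 1e9 bytes), ignoring all overhead."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7.6e9  # assumption: approximate size of the 7B base model

if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```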

---

## Step-by-Step Instructions

### 1. Notebook Setup

Open `colab_train_stack29.ipynb` in Colab. It contains pre-filled cells with:

- Dependency installation
- Drive mounting (optional)
- Clone repo / upload data
- Copy training config
- Run training
- Merge adapter
- Test inference

### 2. Install Dependencies

The notebook installs:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.40.0 peft==0.10.0 accelerate bitsandbytes==0.43.0 datasets pyyaml
```

Takes ~5 minutes.
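
Colab images ship with their own package versions, so it can be worth confirming that the pins actually took effect before starting a long run. A minimal sketch using only the standard library; `find_version_mismatches` is a hypothetical helper, and the pins are copied from the install command above.

```python
from importlib import metadata

# Versions pinned by the install command above.
PINNED = {"transformers": "4.40.0", "peft": "0.10.0", "bitsandbytes": "0.43.0"}


def find_version_mismatches(pinned=PINNED):
    """Return {package: (expected, installed_or_None)} for every unmet pin."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_version_mismatches().items():
        print(f"{name}: expected {expected}, found {installed}")
```

An empty result means all three pins match. A mismatch usually means the runtime needs a restart after installation.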

### 3. Prepare Training Data

**For quick prototyping** (recommended first run):

```bash
python scripts/create_mini_dataset.py --size 5000 --output data_mini/train_mini.jsonl
```

This creates a 5K stratified sample in ~30 seconds.

**For full training:**

Upload your existing `training-data/final/train.jsonl` to Colab (to Drive or to local storage).
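
Uploads to Colab occasionally truncate, so a quick sanity check of the JSONL file before training can save a wasted session. The sketch below is a hypothetical helper, not part of the training package. It assumes only that each non-blank line should be a JSON object and makes no assumption about field names.

```python
import json


def validate_jsonl(path):
    """Count valid JSON-object lines and list the line numbers that are not."""
    valid, bad_lines = 0, []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are ignored
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
                continue
            if isinstance(record, dict):
                valid += 1
            else:
                bad_lines.append(lineno)  # valid JSON, but not an object
    return valid, bad_lines
```

For example, `validate_jsonl("data_mini/train_mini.jsonl")` should report 5000 valid records and no bad lines for the mini dataset created above.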

### 4. Prepare Configuration

Copy the Colab-optimized config:

```bash
cp stack_2_9_training/train_config_colab.yaml stack_2_9_training/train_config.yaml
```

Or edit `train_config.yaml` directly to match the Colab settings.

### 5. Run Training

```bash
cd stack-2.9-training
python -m stack_2_9_training.train_lora --config train_config.yaml
```

**Monitor progress:**

- Check GPU memory with `!nvidia-smi`. Colab executes one cell at a time, so a looping `!nvidia-smi --loop=5` cell only produces output while no other cell is running.
- Training logs show loss per step
- Checkpoints saved every 500 steps to `./adapters/`

**Expected output:**

```
Train loss: 1.234
Step 100/2000 - loss 1.234
...
Training completed. Model saved to ./adapters/
```
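
To track progress programmatically, the step lines can be parsed. This is a sketch that assumes the log format matches the sample output above (`Step 100/2000 - loss 1.234`); the real trainer output may differ, in which case the pattern needs adjusting. `parse_step_line` is a hypothetical helper.

```python
import re

# Pattern derived from the sample line "Step 100/2000 - loss 1.234".
STEP_RE = re.compile(r"Step (\d+)/(\d+) - loss ([0-9]*\.?[0-9]+)")


def parse_step_line(line):
    """Return step, total, loss and progress for a step line, else None."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    step, total = int(match.group(1)), int(match.group(2))
    loss = float(match.group(3))
    return {"step": step, "total": total, "loss": loss, "progress": step / total}
```

Lines that do not match, such as the summary `Train loss: 1.234`, return `None` and can be skipped.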

### 6. Merge Adapter

After training finishes:

```bash
python -m stack_2_9_training.merge_adapter --base-model Qwen/Qwen2.5-Coder-7B
```

Output: `./model_final/` with full model + tokenizer.

Takes 2-3 minutes.

### 7. Test Inference

Quick test:

```python
from stack_2_9_eval.model_client import create_model_client

# Point to your merged model
client = create_model_client(
    provider="ollama",  # or use direct HF pipeline
    model="./model_final"
)
result = client.generate("Write a Python function to reverse a string")
print(result.text)
```

For production use, serve via vLLM or Hugging Face TGI.
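
If the project's model client is not available, the merged model can also be loaded directly with Transformers. The sketch below is one possible smoke test, not the project's own method. It assumes `./model_final` holds a standard Transformers checkpoint and that `accelerate` is installed, which `device_map="auto"` requires. The function names are hypothetical. On a T4 the 16-bit 7B model sits close to the VRAM limit, so some layers may be offloaded and generation can be slow.

```python
def strip_prompt(output_ids, prompt_length):
    """Drop the prompt tokens that generate() echoes at the start of its output."""
    return output_ids[prompt_length:]


def generate_with_merged_model(prompt, model_dir="./model_final", max_new_tokens=128):
    """Load the merged checkpoint and return the completion for one prompt."""
    # Imported lazily so this module can be imported without the GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    prompt_length = inputs["input_ids"].shape[-1]
    completion_ids = strip_prompt(output[0], prompt_length)
    return tokenizer.decode(completion_ids, skip_special_tokens=True)
```

Calling `generate_with_merged_model("Write a Python function to reverse a string")` mirrors the quick test above.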

---

## Troubleshooting OOM (Out of Memory)

If you get CUDA OOM errors, try these fixes **in order**:

### 1. Reduce sequence length

Edit `train_config_colab.yaml`:

```yaml
training:
  max_seq_length: 4096  # instead of 8192
```

### 2. Confirm the batch size is 1

The per-device batch size is already at its minimum. Gradient accumulation sets the effective batch size and does not change peak memory, so raising it will not fix an OOM error.

```yaml
training:
  per_device_train_batch_size: 1  # already 1
  gradient_accumulation_steps: 16  # affects the effective batch size, not peak memory
```

### 3. Keep gradient checkpointing enabled (memory vs speed trade-off)

```yaml
training:
  gradient_checkpointing: true  # slower, but uses less memory; disabling it raises memory use
```

### 4. Lower LoRA rank

```yaml
peft:
  r: 8  # or even 4
  lora_alpha: 16
```

### 5. Switch to CPU (last resort)

Very slow (days), and it only works if the unquantized model fits in system RAM:

```yaml
model:
  load_in_4bit: false  # CPU cannot handle 4-bit quantization well
```
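
The memory-saving changes can be applied cumulatively, one at a time, until training fits. The sketch below is a hypothetical helper that works on a plain dictionary mirroring the YAML keys used in this section. The ladder covers a shorter sequence length, batch size 1 with gradient checkpointing on, and a lower LoRA rank. How the real trainer reads its config is an assumption.

```python
import copy

# Assumed starting point, mirroring the T4 settings described in this guide.
BASE_CONFIG = {
    "training": {
        "max_seq_length": 8192,
        "per_device_train_batch_size": 1,
        "gradient_checkpointing": True,
    },
    "peft": {"r": 16},
}

# Each entry: (description, config section, overrides to apply).
OOM_STEPS = [
    ("halve the sequence length", "training", {"max_seq_length": 4096}),
    ("force batch size 1 with gradient checkpointing", "training",
     {"per_device_train_batch_size": 1, "gradient_checkpointing": True}),
    ("lower the LoRA rank", "peft", {"r": 8, "lora_alpha": 16}),
]


def oom_fallbacks(config):
    """Yield (description, new_config) pairs, each adding one more change."""
    current = copy.deepcopy(config)
    for description, section, overrides in OOM_STEPS:
        current = copy.deepcopy(current)  # keep earlier results untouched
        current.setdefault(section, {}).update(overrides)
        yield description, current
```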

---

## Expected Performance

On **Colab T4 (free)** with the 7B model:

| Metric | Value |
|--------|-------|
| Training time (2 epochs, 5K examples) | ~3-4 hours |
| Training time (2 epochs, 50K examples) | ~12-18 hours |
| VRAM usage | 10-12 GB |
| Disk space needed | 5-10 GB (model + checkpoints) |
| Inference throughput | ~15-25 tokens/sec |

A 50K-example run exceeds a single free-tier session (~12 hours), so plan to resume it from a checkpoint.

---

## Upgrading to A100 (Colab Pro)

If you have **Colab Pro** with A100 (40GB):

1. Change model in config:

   ```yaml
   model:
     name: "Qwen/Qwen2.5-Coder-32B"
   ```

2. Increase context:

   ```yaml
   tokenizer:
     model_max_length: 32768
   ```

3. Increase batch size:

   ```yaml
   training:
     per_device_train_batch_size: 4
     gradient_accumulation_steps: 4
   ```

4. Training time for 50K examples: ~6-8 hours
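
The A100 batch settings keep the same effective batch size as the T4 config (4 × 4 versus 1 × 16), so the number of optimizer steps per epoch stays the same and only the throughput changes. A small sketch of the arithmetic follows. The step count is approximate because trainers differ in how they round the final partial batch, and both function names are hypothetical.

```python
import math


def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus=1):
    """Examples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps(num_examples, epochs, batch):
    """Approximate number of optimizer updates over the whole run."""
    return math.ceil(num_examples / batch) * epochs


if __name__ == "__main__":
    t4 = effective_batch_size(1, 16)   # T4 config from this guide
    a100 = effective_batch_size(4, 4)  # A100 settings above
    print("T4:", t4, "A100:", a100)
    print("Steps for 5K examples, 2 epochs:", optimizer_steps(5000, 2, t4))
```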

---

## Exporting to Hugging Face Hub

After merging, push to HF:

```python
from huggingface_hub import HfApi

api = HfApi(token="your-hf-token")  # placeholder: use a token with write access
repo_id = "your-org/stack-2.9-7b-lora"

# upload_folder needs an existing repo; exist_ok makes this a no-op if it is already there
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./model_final",
    repo_id=repo_id,
    repo_type="model",
)
```

Then update `TOGETHER_AI.md` with your model ID.
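
Pasting a token into a notebook cell makes it easy to leak when the notebook is shared. One alternative, sketched here as a suggestion, is to read it from the `HF_TOKEN` environment variable and pass the result to `HfApi(token=...)`. `get_hf_token` is a hypothetical helper.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Read the Hugging Face token from the environment, failing loudly if unset."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable before uploading.")
    return token
```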

---

## Resuming Interrupted Training

Colab can disconnect unexpectedly. Use checkpointing:

1. Check if a checkpoint exists: `ls -la adapters_colab/checkpoint-*`
2. To resume, add to config:

   ```yaml
   training:
     resume_from_checkpoint: "./adapters_colab/checkpoint-XXX"
   ```

   Or pass CLI arg:

   ```bash
   python -m stack_2_9_training.train_lora --config train_config.yaml --resume_from_checkpoint ./adapters_colab/checkpoint-XXX
   ```
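
Picking the checkpoint to resume from can be automated. The sketch below assumes checkpoints are directories named `checkpoint-<step>` inside the output directory, as the `ls` pattern above suggests. `latest_checkpoint` is a hypothetical helper. Steps are compared numerically because a plain string sort would rank `checkpoint-500` above `checkpoint-1500`.

```python
import os
import re

CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```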

---

## Quick Validation Before Full Training

Run a mini training to verify setup:

```bash
python scripts/create_mini_dataset.py --size 100  # 100 examples
python -m stack_2_9_training.train_lora --config train_config_colab.yaml --num_train_epochs 1
```

This should take 15-30 minutes and confirms that the pipeline runs end to end before you commit to a full run.

---

## Files in This Package

- `COLAB_TRAINING.md` - This guide
- `colab_train_stack29.ipynb` - Ready-to-run Colab notebook
- `train_config_colab.yaml` - Optimized config for T4/7B
- `scripts/create_mini_dataset.py` - Create 5K sample dataset
- `stack_2_9_training/` - Training package (prepare_data, train_lora, merge_adapter)

---

## Getting Help

- **Colab issues**: Check Google Colab documentation
- **CUDA OOM**: Reduce `max_seq_length` to 4096 and keep gradient checkpointing enabled
- **Training crashes**: Ensure you have enough disk space (at least 10GB free)
- **Slow training**: Verify mixed precision is enabled (the T4 natively supports FP16, not BF16) and check `nvidia-smi` for GPU utilization

---

## Ready to Go!

The Colab notebook is pre-configured and ready to execute. Just open it, select **GPU runtime**, and run all cells.

**Expected outcome:** Trained LoRA adapter in `./adapters_colab/`, merged model in `./model_final/`, ready for evaluation and Hugging Face publication.