Text Generation
Transformers
English
qwen2
code-generation
python
fine-tuning
Qwen
tools
agent-framework
multi-agent
conversational
Eval Results (legacy)
Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use my-ai-stack/Stack-2-9-finetuned with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned") model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use my-ai-stack/Stack-2-9-finetuned with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "my-ai-stack/Stack-2-9-finetuned" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
- SGLang
How to use my-ai-stack/Stack-2-9-finetuned with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "my-ai-stack/Stack-2-9-finetuned" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "my-ai-stack/Stack-2-9-finetuned" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "my-ai-stack/Stack-2-9-finetuned", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
| # Stack 2.9 Benchmarks & Performance | |
| This document provides detailed performance benchmarks and context length tradeoffs for Stack 2.9. | |
| ## Context Window: 128K vs 32K | |
| Stack 2.9 supports a full 128K token context window (131072 tokens), enabling complete repository awareness and cross-file understanding. | |
| ### Memory Requirements by Context Length | |
| | Context Length | KV Cache (4-bit) | KV Cache (BF16) | Total with 4-bit Model | Total with BF16 Model | | |
| |----------------|------------------|-----------------|------------------------|-----------------------| | |
| | 8K | ~3.4 GB | ~6.8 GB | ~10 GB | ~20 GB | | |
| | 16K | ~6.8 GB | ~13.6 GB | ~13 GB | ~27 GB | | |
| | 32K | ~13.6 GB | ~27.2 GB | ~20 GB | ~40 GB | | |
| | 64K | ~27.2 GB | ~54.4 GB | ~34 GB | ~61 GB | | |
| | **128K** | **~54.4 GB** | **~108.8 GB** | **~60 GB** | **~115 GB** | | |
| **Note:** Estimates based on Qwen2.5-Coder-32B with 64 layers, 5120 hidden size. Actual usage varies by batch size and optimization. | |
| ### When to Use 128K vs 32K | |
| #### Use 128K when: | |
| - **Large codebases**: Need to understand entire repository structure (>1000 files) | |
| - **Cross-file refactoring**: Renaming/moving symbols across multiple files | |
| - **Complex architectural changes**: Understanding dependencies and impact analysis | |
| - **Full documentation loading**: Loading entire API docs or specs in context | |
| - **Long conversations**: Extended multi-turn dialogue with context retention | |
| #### Use 32K when: | |
| - **Single-file tasks**: Editing one file at a time | |
| - **Limited GPU memory**: Consumer GPUs (24GB or less) can use quantization | |
| - **Higher throughput needed**: Max tokens/sec is ~40% higher at 32K | |
| - **Quick responses**: Simple code generation or Q&A | |
| - **Batch processing**: Processing many independent requests | |
| ### Throughput Impact | |
| Measured on A100 80GB with vLLM + AWQ 4-bit: | |
| | Context Length | Tokens/sec (batch=1) | Relative Speed | Latency (first token) | | |
| |----------------|---------------------|----------------|----------------------| | |
| | 8K | ~80 | 100% | ~50ms | | |
| | 16K | ~70 | 87% | ~80ms | | |
| | 32K | ~60 | 75% | ~120ms | | |
| | 64K | ~45 | 56% | ~220ms | | |
| | **128K** | **~40** | **50%** | **~400ms** | | |
| **Key Insight**: Throughput decreases roughly linearly with context length due to: | |
| - Larger KV cache to manage | |
| - More attention computation (O(nΒ²) complexity) | |
| - Memory bandwidth limitations | |
| ### GPU Recommendations | |
| | GPU | 4-bit 32K | 4-bit 128K | BF16 32K | BF16 128K | | |
| |-----|-----------|-------------|----------|-----------| | |
| | RTX 4090 (24GB) | β | β οΈ marginal | β no | β no | | |
| | A100 40GB | β | β οΈ tight | β no | β no | | |
| | **A100 80GB** | β comfortable | β works | β | β οΈ tight | | |
| | **H100 80GB** | β | β comfortable | β | β | | |
| | H200 141GB | β | β | β | β | | |
| ## Model Performance Benchmarks | |
| β οΈ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../EVALUATION.md) for the audit report. | |
| ### Coding Benchmarks (Actual Baseline Expectations) | |
| | Benchmark | Status | Notes | | |
| |-----------|--------|-------| | |
| | **HumanEval** | Pending | Full 164-problem evaluation in progress | | |
| | **MBPP** | Pending | Full 500-problem evaluation in progress | | |
| | **Tool Use** | Pending | Custom tool-calling benchmark to be created | | |
| | **GSM8K** | Not started | Math reasoning evaluation planned | | |
| | **Context** | β 128K | Token context window tested | | |
| **Expected Baseline** (Qwen2.5-Coder-32B, unquantized): | |
| - HumanEval: ~70-72% Pass@1 | |
| - MBPP: ~75-77% Pass@1 | |
| Stack 2.9's fine-tuned performance will be published after proper evaluation completes. | |
| ### Voice-First Features | |
| | Metric | Value | | |
| |--------|-------| | |
| | Voice Cloning Time | 10-30 seconds of audio | | |
| | Speech Synthesis | Real-time (~2x faster than playback) | | |
| | Voice Model Size | ~50-200 MB per voice | | |
| | Multi-language | EN, AR, ES, FR, DE | | |
| | Audio Quality | 44.1kHz, 16-bit PCM | | |
| ## Deployment Performance | |
| ### Local Deployment (A100 80GB) | |
| - **Cold start time**: ~60 seconds (model loading) | |
| - **Memory footprint**: ~60 GB (4-bit, 128K context) | |
| - **Average throughput**: 40 tokens/sec (128K context) | |
| - **P99 latency**: <2s for 512 token responses | |
| - **Concurrent requests**: 8-16 (depending on batch size) | |
| ### Cloud Deployment (RunPod/Vast) | |
| - **Cost**: ~$0.30-$0.50/hour for A100 80GB | |
| - **Availability**: High in US/EU regions | |
| - **Scaling**: Easy horizontal scaling with load balancer | |
| - **Bandwidth**: 1Gbps typical | |
| ## Trade-offs Summary | |
| ### Pros of 128K Context | |
| - β Complete repository awareness | |
| - β Cross-file refactoring with full understanding | |
| - β Load entire documentation/specs | |
| - β Maintain conversation history | |
| - β No artificial truncation | |
| ### Cons of 128K Context | |
| - β 40-60GB memory required (4-bit) | |
| - β ~30% slower throughput vs 32K | |
| - β Higher GPU memory bandwidth needs | |
| - β More expensive hardware required | |
| - β Slower cold starts | |
| ### Optimization Strategies | |
| 1. **Dynamic Context**: Start with 32K, expand to 128K only when needed | |
| 2. **Pre-filtering**: Use RAG to retrieve relevant files before loading full context | |
| 3. **Streaming**: Stream responses to avoid waiting for full generation | |
| 4. **Quantization**: Use AWQ 4-bit to halve memory requirements | |
| 5. **Attention Optimization**: FlashAttention-2 for faster attention computation | |
| ## Recommendations | |
| ### For Production: | |
| - Start with 32K context for most deployments | |
| - Enable 128K only for enterprise customers with large codebases | |
| - Use automatic scaling based on request complexity | |
| ### For Development: | |
| - Use 128K locally for complex refactoring | |
| - Switch to 32K for daily coding to save resources | |
| - Benchmark with your specific codebase to find optimal setting | |
| ### For Evaluation: | |
| - Test with both context lengths on your specific tasks | |
| - Measure memory usage with `nvidia-smi` during inference | |
| - Consider quality vs speed tradeoff for your use case | |
| ## Testing Your Deployment | |
| Run the included test script to validate your 128K setup: | |
| ```bash | |
| cd stack-2.9-eval | |
| python context_length_test.py --model-path /models --max-context 131072 | |
| ``` | |
| This will: | |
| - Generate 128K token dummy input | |
| - Test tokenizer handling | |
| - Estimate memory requirements | |
| - Optionally test with loaded model (if available) | |