Instructions for using my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Use the sections below to get started.
- Libraries
- Transformers
How to use my-ai-stack/Stack-2-9-finetuned with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use my-ai-stack/Stack-2-9-finetuned with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "my-ai-stack/Stack-2-9-finetuned"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "my-ai-stack/Stack-2-9-finetuned",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```
- SGLang
How to use my-ai-stack/Stack-2-9-finetuned with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "my-ai-stack/Stack-2-9-finetuned" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "my-ai-stack/Stack-2-9-finetuned",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "my-ai-stack/Stack-2-9-finetuned" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "my-ai-stack/Stack-2-9-finetuned",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
```

- Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
```bash
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
```
Stack 2.9 Benchmarks & Performance
This document provides detailed performance benchmarks and context length tradeoffs for Stack 2.9.
Context Window: 128K vs 32K
Stack 2.9 supports a full 128K token context window (131072 tokens), enabling complete repository awareness and cross-file understanding.
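The window you actually get depends on how the serving engine is configured. As a minimal sketch (assuming vLLM, as in the deployment section above), the offline Python API caps the window with `max_model_len`; the same limit is set with `--max-model-len` when using `vllm serve`:

```python
# Sketch: selecting the context window with vLLM's Python API. Assumes vLLM is
# installed and the GPU has enough memory for the chosen window (see tables below).
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-ai-stack/Stack-2-9-finetuned",
    max_model_len=32_768,          # raise to 131_072 for full 128K prompts
    gpu_memory_utilization=0.90,   # fraction of GPU memory vLLM may reserve
)

params = SamplingParams(max_tokens=128, temperature=0.2)
out = llm.generate(["Write a Python function that reverses a string."], params)
print(out[0].outputs[0].text)
```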
Memory Requirements by Context Length
| Context Length | KV Cache (4-bit) | KV Cache (BF16) | Total with 4-bit Model | Total with BF16 Model |
|---|---|---|---|---|
| 8K | ~3.4 GB | ~6.8 GB | ~10 GB | ~20 GB |
| 16K | ~6.8 GB | ~13.6 GB | ~13 GB | ~27 GB |
| 32K | ~13.6 GB | ~27.2 GB | ~20 GB | ~40 GB |
| 64K | ~27.2 GB | ~54.4 GB | ~34 GB | ~61 GB |
| 128K | ~54.4 GB | ~108.8 GB | ~60 GB | ~115 GB |
Note: Estimates are based on Qwen2.5-Coder-32B (64 layers, hidden size 5120). Actual usage varies with batch size and runtime optimizations.
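For rough capacity planning, a KV cache grows linearly with token count: two tensors (K and V) per layer, each holding one vector per token. The sketch below is a generic estimator with assumed parameters, not the exact method behind the table above; grouped-query attention, paged-cache overhead, and batch size can move real numbers substantially in either direction.

```python
# Generic KV-cache size estimator (illustrative assumptions, not the table's exact method).
# kv_width is roughly hidden_size for full multi-head attention and much smaller for
# grouped-query attention, so substitute your deployment's real values before trusting it.
def kv_cache_gib(tokens: int, layers: int = 64, kv_width: int = 5120,
                 bytes_per_elem: float = 2.0, batch: int = 1) -> float:
    """2 tensors (K and V) per layer, kv_width elements per token, per sequence."""
    return 2 * layers * kv_width * bytes_per_elem * tokens * batch / 1024**3

# Usage: compare bytes_per_elem=2.0 (BF16) against a quantized cache at your own
# context length and batch size before committing to hardware.
```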
When to Use 128K vs 32K
Use 128K when:
- Large codebases: Need to understand entire repository structure (>1000 files)
- Cross-file refactoring: Renaming/moving symbols across multiple files
- Complex architectural changes: Understanding dependencies and impact analysis
- Full documentation loading: Loading entire API docs or specs in context
- Long conversations: Extended multi-turn dialogue with context retention
Use 32K when:
- Single-file tasks: Editing one file at a time
- Limited GPU memory: consumer GPUs (24 GB or less) can run the model with 4-bit quantization at 32K
- Higher throughput needed: throughput at 32K is roughly 50% higher than at 128K (~60 vs ~40 tokens/sec; see the throughput table below)
- Quick responses: Simple code generation or Q&A
- Batch processing: Processing many independent requests
Throughput Impact
Measured on A100 80GB with vLLM + AWQ 4-bit:
| Context Length | Tokens/sec (batch=1) | Relative Speed | Latency (first token) |
|---|---|---|---|
| 8K | ~80 | 100% | ~50ms |
| 16K | ~70 | 87% | ~80ms |
| 32K | ~60 | 75% | ~120ms |
| 64K | ~45 | 56% | ~220ms |
| 128K | ~40 | 50% | ~400ms |
Key Insight: Throughput decreases roughly linearly with context length due to:
- Larger KV cache to manage
- More attention computation (O(n²) complexity)
- Memory bandwidth limitations
GPU Recommendations
| GPU | 4-bit 32K | 4-bit 128K | BF16 32K | BF16 128K |
|---|---|---|---|---|
| RTX 4090 (24GB) | ✅ | ⚠️ marginal | ❌ no | ❌ no |
| A100 40GB | ✅ | ⚠️ tight | ❌ no | ❌ no |
| A100 80GB | ✅ comfortable | ✅ works | ✅ | ⚠️ tight |
| H100 80GB | ✅ | ✅ comfortable | ✅ | ✅ |
| H200 141GB | ✅ | ✅ | ✅ | ✅ |
Model Performance Benchmarks
⚠️ Evaluation Status: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been removed pending proper verification. See EVALUATION.md for the audit report.
Coding Benchmarks (Actual Baseline Expectations)
| Benchmark | Status | Notes |
|---|---|---|
| HumanEval | Pending | Full 164-problem evaluation in progress |
| MBPP | Pending | Full 500-problem evaluation in progress |
| Tool Use | Pending | Custom tool-calling benchmark to be created |
| GSM8K | Not started | Math reasoning evaluation planned |
| Context | ✅ 128K | 128K-token context window tested |
Expected Baseline (Qwen2.5-Coder-32B, unquantized):
- HumanEval: ~70-72% Pass@1
- MBPP: ~75-77% Pass@1
Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
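For context on the metric itself: Pass@1 is typically computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), where n samples are generated per problem and c of them pass the unit tests. A minimal sketch with made-up counts:

```python
# Unbiased pass@k estimator: pass@k = E[1 - C(n - c, k) / C(n, k)] over problems,
# with n samples per problem and c of them passing the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: 10 samples per problem, per-problem correct counts.
correct_counts = [3, 0, 10, 5, 1, 0, 8]
score = sum(pass_at_k(10, c, k=1) for c in correct_counts) / len(correct_counts)
print(f"pass@1 = {score:.3f}")
```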
Voice-First Features
| Metric | Value |
|---|---|
| Voice Cloning Time | 10-30 seconds of audio |
| Speech Synthesis | Real-time (~2x faster than playback) |
| Voice Model Size | ~50-200 MB per voice |
| Multi-language | EN, AR, ES, FR, DE |
| Audio Quality | 44.1kHz, 16-bit PCM |
Deployment Performance
Local Deployment (A100 80GB)
- Cold start time: ~60 seconds (model loading)
- Memory footprint: ~60 GB (4-bit, 128K context)
- Average throughput: 40 tokens/sec (128K context)
- P99 latency: <2s for 512 token responses
- Concurrent requests: 8-16 (depending on batch size)
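To check these figures on your own hardware, you can time the OpenAI-compatible endpoint directly. A minimal sketch, assuming a local vLLM server on port 8000 as configured earlier and the `openai` Python client installed:

```python
# Time-to-first-token and rough decode rate against a local OpenAI-compatible server.
# Assumes `pip install openai` and a server at localhost:8000 (vLLM, as set up above).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
first_token_at = None
chunks = 0
stream = client.chat.completions.create(
    model="my-ai-stack/Stack-2-9-finetuned",
    messages=[{"role": "user", "content": "Explain what a KV cache is in two sentences."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # each streamed chunk is roughly one token for vLLM

if first_token_at is not None:
    total = time.perf_counter() - start
    print(f"first token after {(first_token_at - start) * 1000:.0f} ms")
    print(f"~{chunks / (total - (first_token_at - start)):.1f} tokens/sec during decode")
```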
Cloud Deployment (RunPod/Vast)
- Cost: ~$0.30-$0.50/hour for A100 80GB
- Availability: High in US/EU regions
- Scaling: Easy horizontal scaling with load balancer
- Bandwidth: 1Gbps typical
Trade-offs Summary
Pros of 128K Context
- ✅ Complete repository awareness
- ✅ Cross-file refactoring with full understanding
- ✅ Load entire documentation/specs
- ✅ Maintain conversation history
- ✅ No artificial truncation
Cons of 128K Context
- ❌ 40-60 GB memory required (4-bit)
- ❌ ~30% slower throughput vs 32K
- ❌ Higher GPU memory bandwidth needs
- ❌ More expensive hardware required
- ❌ Slower cold starts
Optimization Strategies
- Dynamic Context: Start with 32K and expand to 128K only when needed (see the routing sketch after this list)
- Pre-filtering: Use RAG to retrieve relevant files before loading full context
- Streaming: Stream responses to avoid waiting for full generation
- Quantization: Use AWQ 4-bit to halve memory requirements
- Attention Optimization: FlashAttention-2 for faster attention computation
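To illustrate the Dynamic Context idea concretely, a thin router can count prompt tokens and send only genuinely long requests to a 128K deployment. A hedged sketch; the endpoint names and safety margin below are assumptions, not part of the Stack 2.9 tooling:

```python
# Sketch of dynamic context selection: route short prompts to a 32K deployment and
# long ones to a 128K deployment. Endpoint URLs and the margin are hypothetical.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")

ENDPOINTS = {
    32_768: "http://ctx32k.internal:8000/v1",    # hypothetical 32K deployment
    131_072: "http://ctx128k.internal:8000/v1",  # hypothetical 128K deployment
}

def pick_endpoint(prompt: str, max_new_tokens: int = 1024, margin: int = 512) -> str:
    """Choose the smallest deployment whose window fits prompt + generation + margin."""
    needed = len(tokenizer.encode(prompt)) + max_new_tokens + margin
    for window in sorted(ENDPOINTS):
        if needed <= window:
            return ENDPOINTS[window]
    raise ValueError(f"Prompt needs ~{needed} tokens, which exceeds the 128K window")

print(pick_endpoint("Refactor the following module:\n" + "x = 1\n" * 200))
```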
Recommendations
For Production:
- Start with 32K context for most deployments
- Enable 128K only for enterprise customers with large codebases
- Use automatic scaling based on request complexity
For Development:
- Use 128K locally for complex refactoring
- Switch to 32K for daily coding to save resources
- Benchmark with your specific codebase to find optimal setting
For Evaluation:
- Test with both context lengths on your specific tasks
- Measure memory usage with `nvidia-smi` during inference
- Consider the quality vs. speed trade-off for your use case
Testing Your Deployment
Run the included test script to validate your 128K setup:
```bash
cd stack-2.9-eval
python context_length_test.py --model-path /models --max-context 131072
```
This will:
- Generate 128K token dummy input
- Test tokenizer handling
- Estimate memory requirements
- Optionally test with loaded model (if available)
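If you only want a quick tokenizer-level sanity check before running the full script, something like the following works without a GPU. It is a simplified stand-in, not `context_length_test.py` itself:

```python
# Tokenizer-only smoke test for the 128K window: build a long synthetic input,
# trim it to the target budget, and confirm the tokenizer round-trips it.
from transformers import AutoTokenizer

MAX_CONTEXT = 131_072
tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")

# Repeat a code-like snippet until we exceed the budget, then trim to the window.
chunk = "def handler(event, context):\n    return {'status': 200}\n"
ids = tokenizer.encode(chunk * 10_000)[:MAX_CONTEXT]

print(f"tokenizer.model_max_length: {tokenizer.model_max_length}")
print(f"synthetic input length:     {len(ids)} tokens")
print(f"round-trip non-empty:       {len(tokenizer.decode(ids)) > 0}")
```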