
# Stack 2.9 Benchmarks & Performance

This document provides detailed performance benchmarks and context length tradeoffs for Stack 2.9.

## Context Window: 128K vs 32K

Stack 2.9 supports a full 128K token context window (131072 tokens), enabling complete repository awareness and cross-file understanding.

### Memory Requirements by Context Length

| Context Length | KV Cache (4-bit) | KV Cache (BF16) | Total with 4-bit Model | Total with BF16 Model |
|----------------|------------------|-----------------|------------------------|-----------------------|
| 8K             | ~3.4 GB          | ~6.8 GB         | ~10 GB                 | ~20 GB                |
| 16K            | ~6.8 GB          | ~13.6 GB        | ~13 GB                 | ~27 GB                |
| 32K            | ~13.6 GB         | ~27.2 GB        | ~20 GB                 | ~40 GB                |
| 64K            | ~27.2 GB         | ~54.4 GB        | ~34 GB                 | ~61 GB                |
| **128K**       | **~54.4 GB**     | **~108.8 GB**   | **~60 GB**             | **~115 GB**           |

**Note:** Estimates based on Qwen2.5-Coder-32B with 64 layers, 5120 hidden size. Actual usage varies by batch size and optimization.
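KV cache size grows linearly with context length, which is why each doubling of context doubles the KV columns above. A minimal sketch of the usual estimate follows. The layer count and hidden size come from the note above; the head counts (40 query heads, 8 key/value heads) are assumptions made for illustration, not values stated in this document, so the output should not be expected to reproduce the table exactly.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_elem, batch_size=1):
    """Bytes needed to cache keys and values for `seq_len` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


GIB = 1024 ** 3

# From the note above: 64 layers, hidden size 5120.
NUM_LAYERS = 64
HIDDEN_SIZE = 5120

# Assumptions for illustration only (not stated in this document).
NUM_QUERY_HEADS = 40
NUM_KV_HEADS = 8
HEAD_DIM = HIDDEN_SIZE // NUM_QUERY_HEADS  # 128

if __name__ == "__main__":
    contexts = [("8K", 8192), ("32K", 32768), ("128K", 131072)]
    for label, tokens in contexts:
        # 2 bytes per element for BF16.
        size = kv_cache_bytes(tokens, NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
        print(f"{label:>4}: {size / GIB:.1f} GiB KV cache (BF16, batch=1)")
```

Changing `NUM_KV_HEADS`, `bytes_per_elem`, or `batch_size` shows how strongly the estimate depends on the attention layout, cache precision, and batch size.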

### When to Use 128K vs 32K

#### Use 128K when:
- **Large codebases**: Need to understand entire repository structure (>1000 files)
- **Cross-file refactoring**: Renaming/moving symbols across multiple files
- **Complex architectural changes**: Understanding dependencies and impact analysis
- **Full documentation loading**: Loading entire API docs or specs in context
- **Long conversations**: Extended multi-turn dialogue with context retention

#### Use 32K when:
- **Single-file tasks**: Editing one file at a time
- **Limited GPU memory**: Consumer GPUs (24 GB or less) can fit 32K only with 4-bit quantization (~20 GB total)
- **Higher throughput needed**: Tokens/sec is ~50% higher at 32K than at 128K (~60 vs ~40)
- **Quick responses**: Simple code generation or Q&A
- **Batch processing**: Processing many independent requests

### Throughput Impact

Measured on A100 80GB with vLLM + AWQ 4-bit:

| Context Length | Tokens/sec (batch=1) | Relative Speed | Latency (first token) |
|----------------|---------------------|----------------|----------------------|
| 8K             | ~80                 | 100%           | ~50ms                |
| 16K            | ~70                 | 87%            | ~80ms                |
| 32K            | ~60                 | 75%            | ~120ms               |
| 64K            | ~45                 | 56%            | ~220ms               |
| **128K**       | **~40**             | **50%**        | **~400ms**           |

**Key Insight**: Throughput falls as context grows, roughly halving between 8K and 128K, due to:
- Larger KV cache to manage
- More attention computation (O(n²) over the prompt during prefill, O(n) per generated token)
- Memory bandwidth limitations
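The relative figures can be recomputed directly from the throughput table. The sketch below copies the tokens/sec column from the table; the helper names are illustrative.

```python
# Tokens/sec at batch=1, copied from the throughput table above.
TOKENS_PER_SEC = {"8K": 80, "16K": 70, "32K": 60, "64K": 45, "128K": 40}


def relative_speed(context, baseline="8K"):
    """Throughput at `context` as a fraction of the baseline throughput."""
    return TOKENS_PER_SEC[context] / TOKENS_PER_SEC[baseline]


def percent_faster(fast, slow):
    """How much higher throughput is at `fast` than at `slow`, in percent."""
    return (TOKENS_PER_SEC[fast] / TOKENS_PER_SEC[slow] - 1) * 100


def percent_slower(slow, fast):
    """How much lower throughput is at `slow` than at `fast`, in percent."""
    return (1 - TOKENS_PER_SEC[slow] / TOKENS_PER_SEC[fast]) * 100


if __name__ == "__main__":
    for context in TOKENS_PER_SEC:
        print(f"{context:>4}: {relative_speed(context) * 100:.1f}% of 8K speed")
    print(f"32K vs 128K: {percent_faster('32K', '128K'):.0f}% faster")
    print(f"128K vs 32K: {percent_slower('128K', '32K'):.0f}% slower")
```

With these numbers, 32K is 50% faster than 128K, which is the same as saying 128K is about 33% slower than 32K.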

### GPU Recommendations

| GPU | 4-bit 32K | 4-bit 128K | BF16 32K | BF16 128K |
|-----|-----------|-------------|----------|-----------|
| RTX 4090 (24GB) | ✅ | ❌ no | ❌ no | ❌ no |
| A100 40GB | ✅ | ❌ no | ❌ no | ❌ no |
| **A100 80GB** | ✅ comfortable | ✅ works | ✅ | ❌ no |
| **H100 80GB** | ✅ | ✅ comfortable | ✅ | ❌ no |
| H200 141GB | ✅ | ✅ | ✅ | ✅ |
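A single-GPU fit can be checked against the totals in the memory requirements table. The sketch below is one way to do that; the 90% "tight" threshold is an assumption chosen for illustration, and the totals are the approximate figures from that table.

```python
# Approximate totals (GB) from the memory requirements table.
TOTAL_GB = {
    ("4-bit", "32K"): 20,
    ("4-bit", "128K"): 60,
    ("bf16", "32K"): 40,
    ("bf16", "128K"): 115,
}


def classify_fit(required_gb, gpu_gb, tight_fraction=0.9):
    """Return "no", "tight", or "ok" for a single GPU.

    `tight_fraction` is an illustrative threshold: anything above that
    share of GPU memory is treated as leaving too little headroom.
    """
    if required_gb >= gpu_gb:
        return "no"
    if required_gb > tight_fraction * gpu_gb:
        return "tight"
    return "ok"


if __name__ == "__main__":
    gpus = {"RTX 4090": 24, "A100 40GB": 40, "A100 80GB": 80, "H200": 141}
    for name, memory_gb in gpus.items():
        row = {
            f"{precision} {context}": classify_fit(total, memory_gb)
            for (precision, context), total in TOTAL_GB.items()
        }
        print(name, row)
```

Multi-GPU tensor parallelism or CPU offloading would change the picture; this check covers only a model held entirely on one device.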

## Model Performance Benchmarks

⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../EVALUATION.md) for the audit report.

### Coding Benchmarks (Status and Baseline Expectations)

| Benchmark | Status | Notes |
|-----------|--------|-------|
| **HumanEval** | Pending | Full 164-problem evaluation in progress |
| **MBPP** | Pending | Full 500-problem evaluation in progress |
| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
| **GSM8K** | Not started | Math reasoning evaluation planned |
| **Context length** | ✅ 128K | 128K-token context window tested |

**Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
- HumanEval: ~70-72% Pass@1
- MBPP: ~75-77% Pass@1

Stack 2.9's fine-tuned performance will be published after proper evaluation completes.

### Voice-First Features

| Metric | Value |
|--------|-------|
| Voice Cloning Input | 10-30 seconds of reference audio |
| Speech Synthesis | Real-time (~2x faster than playback) |
| Voice Model Size | ~50-200 MB per voice |
| Multi-language | EN, AR, ES, FR, DE |
| Audio Quality | 44.1kHz, 16-bit PCM |

## Deployment Performance

### Local Deployment (A100 80GB)

- **Cold start time**: ~60 seconds (model loading)
- **Memory footprint**: ~60 GB (4-bit, 128K context)
- **Average throughput**: 40 tokens/sec (128K context)
- **P99 latency**: <2s for 512 token responses
- **Concurrent requests**: 8-16 (depending on batch size)

### Cloud Deployment (RunPod/Vast)

- **Cost**: ~$0.30-$0.50/hour for A100 80GB
- **Availability**: High in US/EU regions
- **Scaling**: Easy horizontal scaling with load balancer
- **Bandwidth**: 1Gbps typical
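Hourly price and throughput combine into a cost per generated token. The sketch below assumes a single request stream (batch=1) at the throughput figures quoted in this document; batching several requests would lower the cost per token.

```python
def cost_per_million_tokens(hourly_usd, tokens_per_sec):
    """USD per one million generated tokens for a single request stream."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Price range and 128K throughput as quoted in this document.
    for hourly in (0.30, 0.50):
        cost = cost_per_million_tokens(hourly, tokens_per_sec=40)
        print(f"${hourly:.2f}/hour at 40 tok/s: ${cost:.2f} per 1M tokens")
```

At 40 tokens/sec a GPU produces 144,000 tokens per hour, so the quoted price range works out to roughly $2.08 to $3.47 per million tokens.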

## Trade-offs Summary

### Pros of 128K Context
- ✅ Complete repository awareness
- ✅ Cross-file refactoring with full understanding
- ✅ Load entire documentation/specs
- ✅ Maintain conversation history
- ✅ No artificial truncation

### Cons of 128K Context
- ❌ ~60 GB memory required (4-bit)
- ❌ ~33% slower throughput vs 32K (~40 vs ~60 tokens/sec)
- ❌ Higher GPU memory bandwidth needs
- ❌ More expensive hardware required
- ❌ Slower cold starts

### Optimization Strategies

1. **Dynamic Context**: Start with 32K, expand to 128K only when needed
2. **Pre-filtering**: Use RAG to retrieve relevant files before loading full context
3. **Streaming**: Stream responses to avoid waiting for full generation
4. **Quantization**: Use AWQ 4-bit to halve memory requirements
5. **Attention Optimization**: FlashAttention-2 for faster attention computation
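The dynamic context strategy can be sketched as a small routing function that picks the smallest context tier able to hold the prompt plus the requested output. The function name and tier values are illustrative and not part of any Stack 2.9 API.

```python
# Context tiers in tokens: 32K and 128K, as discussed in this document.
CONTEXT_TIERS = (32_768, 131_072)


def choose_context_window(prompt_tokens, max_new_tokens, tiers=CONTEXT_TIERS):
    """Return the smallest tier that fits the prompt plus the output budget."""
    needed = prompt_tokens + max_new_tokens
    for tier in sorted(tiers):
        if needed <= tier:
            return tier
    raise ValueError(
        f"request needs {needed} tokens, largest tier is {max(tiers)}"
    )


if __name__ == "__main__":
    print(choose_context_window(10_000, 512))   # small request
    print(choose_context_window(90_000, 2_048)) # large request
```

A deployment could use the returned tier to route the request to a 32K or a 128K server instance, so that the larger KV cache is paid for only when a request needs it.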

## Recommendations

### For Production:
- Start with 32K context for most deployments
- Enable 128K only for enterprise customers with large codebases
- Use automatic scaling based on request complexity

### For Development:
- Use 128K locally for complex refactoring
- Switch to 32K for daily coding to save resources
- Benchmark with your specific codebase to find optimal setting

### For Evaluation:
- Test with both context lengths on your specific tasks
- Measure memory usage with `nvidia-smi` during inference
- Consider quality vs speed tradeoff for your use case
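Memory readings from `nvidia-smi` can be collected programmatically while a request is running. The sketch below is one way to do it; the helper names are illustrative, and it assumes `nvidia-smi` is on the `PATH`.

```python
import subprocess


def parse_memory_csv(text):
    """Parse 'used, total' lines (MiB) into a list of (used, total) tuples."""
    readings = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Return one (used_mib, total_mib) tuple per visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)


if __name__ == "__main__":
    for index, (used, total) in enumerate(query_gpu_memory()):
        print(f"GPU {index}: {used} / {total} MiB ({used / total:.0%})")
```

Calling `query_gpu_memory()` in a loop during inference and keeping the maximum gives a peak-usage figure to compare across the two context lengths.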

## Testing Your Deployment

Run the included test script to validate your 128K setup:

```bash
cd stack-2.9-eval
python context_length_test.py --model-path /models --max-context 131072
```

This will:
- Generate a 128K-token dummy input
- Test tokenizer handling
- Estimate memory requirements
- Optionally run the test against a loaded model (if available)