# Stack 2.9 Benchmarks & Performance
This document provides detailed performance benchmarks and context length tradeoffs for Stack 2.9.
## Context Window: 128K vs 32K
Stack 2.9 supports a full 128K token context window (131072 tokens), enabling complete repository awareness and cross-file understanding.
### Memory Requirements by Context Length
| Context Length | KV Cache (4-bit) | KV Cache (BF16) | Total with 4-bit Model | Total with BF16 Model |
|----------------|------------------|-----------------|------------------------|-----------------------|
| 8K | ~3.4 GB | ~6.8 GB | ~10 GB | ~20 GB |
| 16K | ~6.8 GB | ~13.6 GB | ~13 GB | ~27 GB |
| 32K | ~13.6 GB | ~27.2 GB | ~20 GB | ~40 GB |
| 64K | ~27.2 GB | ~54.4 GB | ~34 GB | ~61 GB |
| **128K** | **~54.4 GB** | **~108.8 GB** | **~60 GB** | **~115 GB** |
**Note:** Estimates based on Qwen2.5-Coder-32B with 64 layers, 5120 hidden size. Actual usage varies by batch size and optimization.
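For back-of-the-envelope planning with other models or precisions, the KV-cache term can be estimated from the attention configuration. The sketch below is a rough formula rather than the exact method behind the table: the layer, head, and dimension values are assumptions you should read from your model's `config.json`, and models using grouped-query attention (as Qwen2.5-Coder does) store far fewer KV heads, so real usage can be substantially lower than a full multi-head estimate.

```python
# Rough KV-cache estimate: 2 (K and V) * layers * kv_heads * head_dim
# * context_len * bytes per element. All parameter values below are
# assumptions; take the real ones from your model's config.json.
def kv_cache_gib(context_len: int, num_layers: int, num_kv_heads: int,
                 head_dim: int, bytes_per_elem: float) -> float:
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 1024**3

# Example: a 64-layer model with 8 KV heads of dim 128 (GQA), BF16 cache, 128K context.
print(f"~{kv_cache_gib(131_072, 64, 8, 128, 2.0):.1f} GiB KV cache")
```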
### When to Use 128K vs 32K
#### Use 128K when:
- **Large codebases**: Need to understand entire repository structure (>1000 files)
- **Cross-file refactoring**: Renaming/moving symbols across multiple files
- **Complex architectural changes**: Understanding dependencies and impact analysis
- **Full documentation loading**: Loading entire API docs or specs in context
- **Long conversations**: Extended multi-turn dialogue with context retention
#### Use 32K when:
- **Single-file tasks**: Editing one file at a time
- **Limited GPU memory**: Consumer GPUs (24GB or less) can use quantization
- **Higher throughput needed**: At 32K, tokens/sec is roughly 50% higher than at 128K (see the throughput table below)
- **Quick responses**: Simple code generation or Q&A
- **Batch processing**: Processing many independent requests
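Either profile is a load-time setting rather than a different model. As a rough illustration (not the stack's actual launch configuration), a vLLM deployment might switch between the two modes like this; the model path and AWQ checkpoint name are placeholders:

```python
from vllm import LLM, SamplingParams

# 32K profile: lower memory footprint, higher throughput for single-file work.
llm = LLM(
    model="/models/stack-2.9-awq",   # placeholder path to an AWQ 4-bit checkpoint
    quantization="awq",
    max_model_len=32_768,            # switch to 131_072 for the full 128K profile
    gpu_memory_utilization=0.90,
)

params = SamplingParams(max_tokens=256, temperature=0.2)
result = llm.generate(["# Write a function that parses a CSV line\n"], params)
print(result[0].outputs[0].text)
```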
### Throughput Impact
Measured on A100 80GB with vLLM + AWQ 4-bit:
| Context Length | Tokens/sec (batch=1) | Relative Speed | Latency (first token) |
|----------------|---------------------|----------------|----------------------|
| 8K | ~80 | 100% | ~50ms |
| 16K | ~70 | 87% | ~80ms |
| 32K | ~60 | 75% | ~120ms |
| 64K | ~45 | 56% | ~220ms |
| **128K** | **~40** | **50%** | **~400ms** |
**Key Insight**: Throughput degrades steadily as context length grows due to:
- A larger KV cache to manage
- More attention computation (O(n²) for prefill, O(n) per generated token)
- Memory bandwidth limitations
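The figures above come from the authors' A100 setup; on your own hardware, a quick probe like the sketch below (reusing an `llm` instance loaded as in the earlier example) yields a comparable tokens/sec number. It is a rough single-request measurement, not the benchmark harness behind the table.

```python
import time

from vllm import SamplingParams

def tokens_per_second(llm, prompt: str, max_tokens: int = 512) -> float:
    """Time one greedy generation and report decoded tokens per second."""
    params = SamplingParams(max_tokens=max_tokens, temperature=0.0)
    start = time.perf_counter()
    output = llm.generate([prompt], params)[0].outputs[0]
    elapsed = time.perf_counter() - start
    return len(output.token_ids) / elapsed
```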
### GPU Recommendations
| GPU | 4-bit 32K | 4-bit 128K | BF16 32K | BF16 128K |
|-----|-----------|-------------|----------|-----------|
| RTX 4090 (24GB) | ✅ | ❌ no | ❌ no | ❌ no |
| A100 40GB | ✅ | ❌ no | ❌ no | ❌ no |
| **A100 80GB** | ✅ comfortable | ✅ works | ✅ | ⚠️ multi-GPU only |
| **H100 80GB** | ✅ | ✅ comfortable | ✅ | ⚠️ multi-GPU only |
| H200 141GB | ✅ | ✅ | ✅ | ✅ |
**Note:** Per the memory table above, 4-bit at 128K needs ~60 GB total and BF16 at 128K ~115 GB, so neither fits on a 24 GB or 40 GB card, and BF16 at 128K requires tensor parallelism across two or more 80 GB GPUs.
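The memory-table totals can also drive an automatic choice of context length at startup. The cutoffs in this sketch are the table's rough totals for the 4-bit model, not measured limits:

```python
import torch

def pick_max_context() -> int:
    """Choose a context-length profile from currently free GPU memory."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gib = free_bytes / 1024**3
    if free_gib >= 60:     # ~60 GB total for 4-bit at 128K (memory table above)
        return 131_072
    if free_gib >= 20:     # ~20 GB total for 4-bit at 32K
        return 32_768
    return 8_192
```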
## Model Performance Benchmarks
⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../EVALUATION.md) for the audit report.
### Coding Benchmarks (Actual Baseline Expectations)
| Benchmark | Status | Notes |
|-----------|--------|-------|
| **HumanEval** | Pending | Full 164-problem evaluation in progress |
| **MBPP** | Pending | Full 500-problem evaluation in progress |
| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
| **GSM8K** | Not started | Math reasoning evaluation planned |
| **Context** | ✅ 128K | 128K token context window tested |
**Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
- HumanEval: ~70-72% Pass@1
- MBPP: ~75-77% Pass@1
Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
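For reference, HumanEval and MBPP Pass@1 scores are conventionally computed with the unbiased pass@k estimator from the original HumanEval paper (Chen et al., 2021); the sketch below shows that calculation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn per problem, c of them passing the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single greedy sample per problem (n = k = 1), pass@1 reduces to the
# fraction of problems whose sample passes all unit tests.
```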
### Voice-First Features
| Metric | Value |
|--------|-------|
| Voice Cloning | 10-30 seconds of reference audio |
| Speech Synthesis | Real-time (~2x faster than playback) |
| Voice Model Size | ~50-200 MB per voice |
| Multi-language | EN, AR, ES, FR, DE |
| Audio Quality | 44.1kHz, 16-bit PCM |
## Deployment Performance
### Local Deployment (A100 80GB)
- **Cold start time**: ~60 seconds (model loading)
- **Memory footprint**: ~60 GB (4-bit, 128K context)
- **Average throughput**: 40 tokens/sec (128K context)
- **P99 latency**: <2s for 512 token responses
- **Concurrent requests**: 8-16 (depending on batch size)
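To check the P99 figure against your own deployment, a simple client-side probe is usually enough. The sketch below assumes the model is served behind an OpenAI-compatible HTTP endpoint (vLLM provides one); the URL and model name are placeholders.

```python
import statistics
import time

import requests

URL = "http://localhost:8000/v1/completions"   # placeholder endpoint
PAYLOAD = {"model": "stack-2.9", "prompt": "def quicksort(arr):", "max_tokens": 512}

latencies = []
for _ in range(100):
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    latencies.append(time.perf_counter() - start)

latencies.sort()
p99 = latencies[int(0.99 * (len(latencies) - 1))]
print(f"median={statistics.median(latencies):.2f}s  p99={p99:.2f}s")
```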
### Cloud Deployment (RunPod/Vast)
- **Cost**: ~$0.30-$0.50/hour for A100 80GB
- **Availability**: High in US/EU regions
- **Scaling**: Easy horizontal scaling with load balancer
- **Bandwidth**: 1Gbps typical
## Trade-offs Summary
### Pros of 128K Context
- ✅ Complete repository awareness
- ✅ Cross-file refactoring with full understanding
- ✅ Load entire documentation/specs
- ✅ Maintain conversation history
- ✅ No artificial truncation
### Cons of 128K Context
- ❌ 40-60GB memory required (4-bit)
- ❌ ~30% slower throughput vs 32K
- ❌ Higher GPU memory bandwidth needs
- ❌ More expensive hardware required
- ❌ Slower cold starts
### Optimization Strategies
1. **Dynamic Context**: Start with 32K, expand to 128K only when needed (see the routing sketch after this list)
2. **Pre-filtering**: Use RAG to retrieve relevant files before loading full context
3. **Streaming**: Stream responses to avoid waiting for full generation
4. **Quantization**: Use AWQ 4-bit to halve memory requirements
5. **Attention Optimization**: FlashAttention-2 for faster attention computation
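A concrete form of strategy 1 is to count the prompt's tokens first and route only over-length requests to the 128K profile. The checkpoint ID below is the upstream Qwen tokenizer used as a placeholder; substitute your deployment's own.

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; use your deployment's actual checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

def choose_context(prompt: str, generation_budget: int = 2_048) -> int:
    """Return 32_768 when the request fits the small profile, otherwise 131_072."""
    n_prompt_tokens = len(tokenizer.encode(prompt))
    return 32_768 if n_prompt_tokens + generation_budget <= 32_768 else 131_072
```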
## Recommendations
### For Production:
- Start with 32K context for most deployments
- Enable 128K only for enterprise customers with large codebases
- Use automatic scaling based on request complexity
### For Development:
- Use 128K locally for complex refactoring
- Switch to 32K for daily coding to save resources
- Benchmark with your specific codebase to find optimal setting
### For Evaluation:
- Test with both context lengths on your specific tasks
- Measure memory usage with `nvidia-smi` during inference (a polling sketch follows this list)
- Consider quality vs speed tradeoff for your use case
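One way to follow the `nvidia-smi` recommendation is to poll it while a representative workload runs and record the peak. This is a rough helper, not a profiler:

```python
import subprocess
import time

def peak_gpu_memory_mib(duration_s: float = 60.0, interval_s: float = 1.0) -> int:
    """Poll nvidia-smi and return the peak memory.used (MiB) seen on any GPU."""
    peak = 0
    deadline = time.time() + duration_s
    while time.time() < deadline:
        out = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            text=True,
        )
        peak = max([peak] + [int(line) for line in out.splitlines() if line.strip()])
        time.sleep(interval_s)
    return peak
```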
## Testing Your Deployment
Run the included test script to validate your 128K setup:
```bash
cd stack-2.9-eval
python context_length_test.py --model-path /models --max-context 131072
```
This will:
- Generate 128K token dummy input
- Test tokenizer handling
- Estimate memory requirements
- Optionally test with loaded model (if available)
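If you only want a quick standalone spot-check without the repo script, something along these lines builds an approximately 128K-token prompt and confirms the tokenizer handles it; the checkpoint name is a placeholder for your own.

```python
from transformers import AutoTokenizer

MAX_CONTEXT = 131_072
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")  # placeholder

# Repeat a small code chunk until the prompt is near the 128K-token limit.
chunk = "def add(a, b):\n    return a + b\n\n"
tokens_per_chunk = len(tokenizer.encode(chunk))
text = chunk * (MAX_CONTEXT // tokens_per_chunk)

n_tokens = len(tokenizer.encode(text))
print(f"built {n_tokens} tokens; "
      f"{'fits within' if n_tokens <= MAX_CONTEXT else 'exceeds'} the {MAX_CONTEXT}-token window")
```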