Stack-2-9-finetuned / stack /docs /guides /EVALUATION.md

walidsobhie-code

refactor: Squeeze folders further - cleaner structure

65888d5 about 2 months ago


	# Evaluation Audit & Methodology

	Status: Under Independent Verification

	## Critical Findings

	After a comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:

	### 1. Incomplete Test Sets

	- HumanEval: Only 20 out of 164 problems (~12%) were evaluated
	- MBPP: Only 20 out of 500 problems (~4%) were evaluated

	The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore not representative of full benchmark performance.
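A rough sketch of why a 20-problem subset cannot stand in for the full benchmark. It assumes, purely for illustration, that the 20 problems were a random sample and that each problem is an independent pass/fail trial; neither assumption is confirmed by the audit. The helper names `coverage` and `binomial_standard_error` are hypothetical and not part of the repository.

```python
import math


def coverage(evaluated: int, total: int) -> float:
    """Fraction of the benchmark that was actually evaluated."""
    return evaluated / total


def binomial_standard_error(pass_rate: float, n_problems: int) -> float:
    """Standard error of a pass rate measured on n independent problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


if __name__ == "__main__":
    # (benchmark, problems evaluated, full size, claimed pass rate)
    claims = [("HumanEval", 20, 164, 0.768), ("MBPP", 20, 500, 0.823)]
    for name, evaluated, total, claimed in claims:
        subset_se = binomial_standard_error(claimed, evaluated)
        full_se = binomial_standard_error(claimed, total)
        print(
            f"{name}: coverage={coverage(evaluated, total):.1%}, "
            f"standard error on subset={subset_se:.3f}, on full set={full_se:.3f}"
        )
```

Under these assumptions the uncertainty on a 20-problem run is several times larger than on the full set, so the subset scores would not be comparable to published full-benchmark numbers even if real inference had been run.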

	### 2. Missing Model Inference

	Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:

	- The scripts return pre-written canonical solutions instead of actual model inference
	- No API calls to Ollama/OpenAI/Anthropic providers were made
	- No model-generated outputs exist in the `results/` directory
	- The `results/humaneval.json` file reports a 0% failure rate, produced by a broken run

	Conclusion: The benchmark numbers appear to be fabricated or, at best, unverified.
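One way an auditor might detect this failure mode is to compare each stored output against the dataset's reference solution. The sketch below is an assumption about how such a check could look, not code from the repository: the record key `generation` is hypothetical, while `canonical_solution` is the field name HumanEval uses for its reference solutions.

```python
from typing import Dict, Iterable


def normalize(code: str) -> str:
    """Drop blank lines and trailing whitespace so formatting differences are ignored."""
    return "\n".join(line.rstrip() for line in code.splitlines() if line.strip())


def is_canonical_copy(generation: str, canonical_solution: str) -> bool:
    """True when the stored output is the reference solution, up to whitespace."""
    return normalize(generation) == normalize(canonical_solution)


def canonical_copy_rate(records: Iterable[Dict[str, str]]) -> float:
    """Share of records whose output matches the reference solution.

    Each record is assumed (hypothetically) to carry the keys
    'generation' and 'canonical_solution'.
    """
    records = list(records)
    if not records:
        return 0.0
    copies = sum(
        is_canonical_copy(r["generation"], r["canonical_solution"]) for r in records
    )
    return copies / len(records)
```

A copy rate close to 1.0 across a whole run would be a strong hint that the script echoed reference solutions instead of querying a model.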

	### 3. Tool Use Benchmark Unimplemented

	The claimed 94.1% Tool Use score lacks:
	- Any proper benchmark dataset
	- Defined evaluation methodology
	- Reproduction instructions
	- Actual model calls to test tool selection accuracy

	It appears to be a custom, non-standard metric with no basis in accepted benchmarks.

	---

	## Proper Evaluation Framework

	We have built a new, rigorous evaluation infrastructure:

	### Official Datasets

	```bash
	# Download HumanEval (164 problems) and MBPP (500 problems)
	python scripts/download_benchmark_datasets.py --data-dir ./data
	```

	This script fetches:
	- HumanEval from OpenAI's official dataset
	- MBPP from Google's benchmark suite
	- Ensures correct formatting and ground truth solutions
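After downloading, the problem counts can be sanity-checked before any evaluation is run. The sketch below assumes the datasets are stored as JSON Lines (one problem per line); the source does not state the on-disk format, and `count_jsonl_records` and `check_dataset_file` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Iterable

# Problem counts stated in this document.
EXPECTED_COUNTS = {"humaneval": 164, "mbpp": 500}


def count_jsonl_records(lines: Iterable[str]) -> int:
    """Count non-blank lines, parsing each one to confirm it is valid JSON."""
    count = 0
    for line in lines:
        if line.strip():
            json.loads(line)  # raises json.JSONDecodeError on malformed input
            count += 1
    return count


def check_dataset_file(path: Path, benchmark: str) -> bool:
    """True when the file holds exactly the expected number of problems."""
    with path.open(encoding="utf-8") as handle:
        return count_jsonl_records(handle) == EXPECTED_COUNTS[benchmark]
```

For example, `check_dataset_file(Path("./data/humaneval.jsonl"), "humaneval")` would return `False` for a 20-problem subset; the file name here is illustrative only.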

	### Unified Evaluation Runner

	`stack-2.9-eval/run_proper_evaluation.py` provides:

	```bash
	python stack-2.9-eval/run_proper_evaluation.py \
	    --benchmark humaneval \
	    --provider ollama \
	    --model qwen2.5-coder:32b \
	    --k-samples 100 \
	    --output-dir ./results/humaneval_run
	```

	Features:
	- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
	- Proper `pass@k` calculation with confidence intervals
	- Per-problem detailed logs (JSON)
	- Reproducible random sampling (seeds)
	- Parallel evaluation (configurable workers)
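For reference, the standard unbiased `pass@k` estimator introduced with HumanEval can be written in a few lines. This is a minimal sketch of that formula, 1 - C(n-c, k) / C(n, k), where n is the number of samples generated for a problem and c the number that pass; whether `run_proper_evaluation.py` computes it exactly this way is not confirmed by this document, and `mean_pass_at_k` is a hypothetical helper name.

```python
import math
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget being estimated.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k draw contains a passing sample.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)
```

With k = 1 the estimator reduces to c / n, so generating many samples per problem mainly reduces the variance of the pass@1 estimate.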

	### Evaluation Checklist

	To ensure transparency, every proper evaluation must:

	1. ✅ Use the full official benchmark (164 HumanEval, 500 MBPP)
	2. ✅ Call real model inference via `model_client.py`
	3. ✅ Generate at least 100 samples per problem (`--k-samples 100`) for pass@1 estimation
	4. ✅ Store all generation outputs for audit
	5. ✅ Compute standard deviation and confidence intervals
	6. ✅ Publish full JSON logs to `results/` directory
	7. ✅ Document exact model version, quantization, and provider settings
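Item 5 can be met in several ways; one common choice is a percentile bootstrap over problems. The sketch below is an illustration under that assumption, not the method the repository is known to use. It takes one score per problem (for example that problem's pass@1) and resamples problems with replacement using a fixed seed so the interval is reproducible.

```python
import random
import statistics
from typing import Dict, Sequence


def bootstrap_summary(
    per_problem_scores: Sequence[float],
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> Dict[str, float]:
    """Mean, sample standard deviation, and percentile-bootstrap interval."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(per_problem_scores)
    # Resample problems with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(per_problem_scores, k=n))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1.0 - alpha) * n_resamples))]
    return {
        "mean": statistics.fmean(per_problem_scores),
        "stdev": statistics.stdev(per_problem_scores),
        "ci_lower": lower,
        "ci_upper": upper,
    }
```

Reporting the interval alongside the headline number makes it clear how much of a difference between two models could be sampling noise.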

	---

	## Current Status

	The previously claimed scores have been removed from README.md and BENCHMARKS.md. They are replaced with:

	| Benchmark | Status | Notes |
	|-----------|--------|-------|
	| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
	| MBPP | Pending verification | Full 500-problem evaluation setup ready |
	| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
	| GSM8K | Not started | Math reasoning evaluation planned |

	Expected baseline (Qwen2.5-Coder-32B):
	- HumanEval: ~70-72% Pass@1
	- MBPP: ~75-77% Pass@1

	Stack 2.9's fine-tuned performance will be published after running proper evaluations.

	---

	## What Changed

	- Created `scripts/download_benchmark_datasets.py` for official datasets
	- Created `stack-2.9-eval/run_proper_evaluation.py` unified runner
	- Created `stack-2.9-eval/test_evaluation_setup.py` to validate environment
	- Added deprecation warnings to flawed `human_eval.py`, `mbpp_eval.py`, `tool_use_eval.py`
	- Updated README.md, BENCHMARKS.md, website pages to remove false claims

	---

	## How to Publish Verified Scores

	1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
	2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100 --output-dir ./results/humaneval_run`
	3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
	4. Update README.md with actual numbers once verified
	5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility
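Before step 4, a small gate can confirm that a run actually covers the full benchmark with enough samples per problem. This is a sketch under stated assumptions: the log schema is not documented here, so the record keys `task_id` and `generations` are hypothetical, as is the function name `find_publication_blockers`.

```python
from typing import Dict, List, Sequence

# Full benchmark sizes stated in this document.
EXPECTED_PROBLEMS = {"humaneval": 164, "mbpp": 500}


def find_publication_blockers(
    records: Sequence[Dict],
    benchmark: str,
    min_samples: int = 100,
) -> List[str]:
    """Return reasons a run should not be published; an empty list means none found.

    Each record is assumed to hold a 'task_id' and a list of 'generations'.
    """
    blockers = []
    task_ids = [record["task_id"] for record in records]
    unique_ids = set(task_ids)

    if len(unique_ids) != len(task_ids):
        blockers.append("duplicate task ids in results")
    expected = EXPECTED_PROBLEMS[benchmark]
    if len(unique_ids) != expected:
        blockers.append(f"{len(unique_ids)} problems evaluated, expected {expected}")

    # Problems with too few stored generations cannot support the pass@1 estimate.
    short = [r["task_id"] for r in records if len(r.get("generations", [])) < min_samples]
    if short:
        blockers.append(f"{len(short)} problems have fewer than {min_samples} generations")
    return blockers
```

A 20-problem run like the one described in the audit would be rejected by the problem-count check alone.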

	Do NOT publish the previously claimed percentages. They are invalid.