Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use my-ai-stack/Stack-2-9-finetuned with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use my-ai-stack/Stack-2-9-finetuned with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "my-ai-stack/Stack-2-9-finetuned"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/my-ai-stack/Stack-2-9-finetuned

SGLang

How to use my-ai-stack/Stack-2-9-finetuned with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "my-ai-stack/Stack-2-9-finetuned" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "my-ai-stack/Stack-2-9-finetuned" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
```
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
```

Stack-2-9-finetuned / stack /eval /HUMAN_EVAL_PLAN.md

walidsobhie-code

refactor: Squeeze folders further - cleaner structure

65888d5 about 2 months ago

preview code

raw

history blame contribute delete

6.46 kB

Stack 2.9 HumanEval Evaluation Plan

Status: Pending GPU availability | Last Updated: 2026-04-01

This document provides complete instructions for running HumanEval benchmark evaluation on Stack 2.9.

Quick Start (When GPU Available)

# 1. Navigate to eval directory
cd /Users/walidsobhi/.openclaw/workspace/stack-2.9/stack-2.9-eval

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run quick test (1 sample)
python3 -m benchmarks.human_eval --max-problems 1 --provider ollama

# 4. Run full evaluation (20 problems - current dataset)
python3 -m benchmarks.human_eval --max-problems 20 --provider ollama

# 5. For full 164-problem benchmark, download dataset first
# See "Full HumanEval Dataset" section below

Hardware Requirements

Minimum

GPU: NVIDIA RTX 4090 (24GB VRAM) with 4-bit quantization
RAM: 64GB system memory
Storage: 50GB free space

This Machine (Insufficient)

GPU: Apple Silicon (M-series) - no CUDA support
RAM: 16-24GB unified memory
Status: Cannot run 32B model inference

Software Setup

Ubuntu/Debian

# Install CUDA (if not already installed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install cuda-toolkit-12-1

# Install Python dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install vllm transformers human-eval openai

macOS (Intel/NVIDIA only)

# Install Python 3.10+
brew install python@3.11

# Create venv
python3.11 -m venv .venv
source .venv/bin/activate

# Install dependencies (CPU-only, will be slow)
pip install torch transformers human-eval
# Note: vLLM requires CUDA - not available on macOS

Running the Evaluation

Option 1: Using Built-in Benchmark (Current)

The repo has a simplified 20-problem dataset built into benchmarks/human_eval.py:

cd /Users/walidsobhi/.openclaw/workspace/stack-2.9/stack-2.9-eval

# With Ollama
python3 -m benchmarks.human_eval \
  --provider ollama \
  --model qwen2.5-coder:32b \
  --max-problems 20

# With OpenAI
export OPENAI_API_KEY=your-key-here
python3 -m benchmarks.human_eval \
  --provider openai \
  --model gpt-4o \
  --max-problems 20

# With Anthropic
export ANTHROPIC_API_KEY=your-key-here
python3 -m benchmarks.human_eval \
  --provider anthropic \
  --model claude-sonnet-4-20250514 \
  --max-problems 20

Option 2: Full HumanEval Dataset (164 Problems)

# Clone human-eval repository
git clone https://github.com/openai/human-eval.git
cd human-eval

# Install
pip install -e .

# Create evaluation script
cat > eval_full.py << 'EOF'
import human_eval
from human_eval.data import write_jsonl, read_jsonl
from human_eval.evaluator import evaluate

# Load problems
problems = read_jsonl("data/HumanEval.jsonl.gz")

# Generate completions (using your model)
# ... generation code ...

# Evaluate
results = evaluate("examples.jsonl")
print(f"Pass@1: {results['pass_at_1']}")
EOF

Option 3: Using vLLM (Fastest)

# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct \
  --dtype half \
  --tensor-parallel-size 2

# In another terminal, run evaluation
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "def add(x, y):\n    \"\"\"\n    Add two numbers.\n    \"\"\"\n    pass",
    "max_tokens": 256
  }'

Interpreting Results

Expected Output Format

{
  "pass_at_1": 14,
  "pass_at_3": 17,
  "pass_at_5": 18,
  "total_cases": 20,
  "accuracy": 0.70,
  "benchmark": "HumanEval",
  "model": "qwen2.5-coder:32b",
  "results": [
    {"task_id": 1, "passed": true, "error": null},
    {"task_id": 2, "passed": false, "error": "AssertionError"}
  ]
}

Score Interpretation

Pass@1	Rating	Notes
< 50%	Poor	Model struggles with basic functions
50-70%	Fair	Basic competency, some gaps
70-80%	Good	Solid coding ability
80-90%	Excellent	Strong code generation
> 90%	Outstanding	Near-human performance

Expected Scores for Stack 2.9

Model	Pass@1	Pass@10	Pass@100
Qwen2.5-Coder-32B (baseline)	76.8%	~85%	~93%
Stack 2.9 (estimated)	78-82%	86-90%	93-95%

Troubleshooting

Out of Memory (OOM)

CUDA out of memory: Tried to allocate 40GB

Solutions:

Use quantization: --quantization awq or 4-bit
Reduce batch size: --batch-size 1
Use smaller model: Try 7B or 14B variant
Enable gradient checkpointing

vLLM Errors

ValueError: Invalid model architecture

Solutions:

Update vLLM: pip install -U vllm
Check model support: https://docs.vllm.ai/en/latest/models/supported_models.html
Use HuggingFace backend instead

Dataset Download Issues

HTTP 404: Not Found

Solutions:

Check URL: https://github.com/openai/human-eval/raw/main/data/HumanEval.jsonl.gz
Use mirror: https://huggingface.co/datasets/openai/human-eval

Slow Inference

Tokens/second: < 5

Solutions:

Use A100/H100 GPU (10x faster than consumer cards)
Enable FlashAttention: --enforce-eager not set
Increase batch size for throughput testing

Success Checklist

Before reporting results, verify:

At least 20 problems evaluated
Pass@1 calculated correctly (passed/total)
Results saved to JSON file
Model name documented
Temperature and settings recorded
Baseline comparison available (Qwen2.5-Coder-32B)

Output Files

After running, these files should be created:

stack-2.9-eval/results/
├── humaneval.json          # Final results
├── humaneval_raw.json      # Raw model outputs
├── humaneval_errors.json   # Failed attempts with errors
└── humaneval_log.txt       # Execution log

Contact

For issues or questions:

GitHub: https://github.com/my-ai-stack/stack-2.9/issues
Docs: See stack-2.9-eval/README.md

Note: This machine cannot run the evaluation due to lack of NVIDIA GPU. Estimated results are based on Qwen2.5-Coder-32B published benchmarks.

my-ai-stack
/

Stack-2-9-finetuned

Stack 2.9 HumanEval Evaluation Plan

Quick Start (When GPU Available)

Hardware Requirements

Recommended

Minimum

This Machine (Insufficient)

Software Setup

Ubuntu/Debian

macOS (Intel/NVIDIA only)

Running the Evaluation

Option 1: Using Built-in Benchmark (Current)

Option 2: Full HumanEval Dataset (164 Problems)

Option 3: Using vLLM (Fastest)

Interpreting Results

Expected Output Format

Score Interpretation

Expected Scores for Stack 2.9

Troubleshooting

Out of Memory (OOM)

vLLM Errors

Dataset Download Issues

Slow Inference

Success Checklist

Output Files

Contact