| # Evaluation Audit & Methodology | |
| **Status:** Under Independent Verification | |
| ## Critical Findings | |
| After a comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified: | |
| ### 1. Incomplete Test Sets | |
| - **HumanEval**: Only **20 out of 164 problems** (~12%) were evaluated | |
| - **MBPP**: Only **20 out of 500 problems** (~4%) were evaluated | |
| The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore **not representative** of full benchmark performance. | |
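To put the subset sizes in perspective, the sketch below computes benchmark coverage and a rough 95% margin of error for a score measured on only 20 problems. It assumes the 20 problems were a simple random sample of the benchmark, which the audit does not establish, and it uses the normal approximation for a proportion. The function name `subset_summary` is illustrative and not part of the repository.

```python
import math


def subset_summary(evaluated: int, total: int, reported_score: float) -> dict:
    """Coverage of a benchmark subset and a rough 95% margin on its score."""
    coverage = evaluated / total
    # Normal-approximation standard error of a proportion measured on
    # `evaluated` problems (assumes a random sample; see the note above).
    std_err = math.sqrt(reported_score * (1.0 - reported_score) / evaluated)
    return {"coverage": coverage, "margin_95": 1.96 * std_err}


# Figures taken from the findings above: 20 problems each, claimed scores.
humaneval = subset_summary(evaluated=20, total=164, reported_score=0.768)
mbpp = subset_summary(evaluated=20, total=500, reported_score=0.823)

print(f"HumanEval: coverage {humaneval['coverage']:.1%}, margin ±{humaneval['margin_95']:.1%}")
print(f"MBPP: coverage {mbpp['coverage']:.1%}, margin ±{mbpp['margin_95']:.1%}")
```

Even under the favorable random-sample assumption, the margin is well over ten percentage points on each benchmark, so a 20-problem score cannot separate the fine-tuned model from its baseline.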
| ### 2. Missing Model Inference | |
| Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed: | |
| - The scripts return **pre-written canonical solutions** instead of actual model inference | |
| - No API calls to Ollama/OpenAI/Anthropic providers were made | |
| - No model-generated outputs exist in the `results/` directory | |
| - The `results/humaneval.json` file reports a 0% failure rate, which came from a broken run | |
| **Conclusion:** The benchmark numbers appear to be fabricated or, at best, unverified. | |
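One way to catch this failure mode automatically is to compare stored "generations" against the reference solutions: a match rate near 100% suggests the script echoed the canonical answers instead of calling a model. The following is a minimal sketch. The function names and the toy records are hypothetical, and the real log format of the evaluation scripts is not assumed.

```python
def normalize(code: str) -> str:
    """Drop blank lines and trailing whitespace so formatting noise cannot hide a match."""
    return "\n".join(line.rstrip() for line in code.strip().splitlines() if line.strip())


def canonical_match_rate(generations: dict[str, str], canonical: dict[str, str]) -> float:
    """Fraction of shared problems whose stored output equals the reference solution."""
    shared = [task_id for task_id in generations if task_id in canonical]
    if not shared:
        raise ValueError("no overlapping task ids to compare")
    matches = sum(normalize(generations[t]) == normalize(canonical[t]) for t in shared)
    return matches / len(shared)


# Toy records for illustration only; these are not real benchmark entries.
canonical = {
    "task/0": "    return a + b\n",
    "task/1": "    return sorted(xs)\n",
}
suspicious = dict(canonical)  # "outputs" identical to the references
genuine = {
    "task/0": "    return b + a\n",
    "task/1": "    xs.sort()\n    return xs\n",
}

suspicious_rate = canonical_match_rate(suspicious, canonical)
genuine_rate = canonical_match_rate(genuine, canonical)
print(suspicious_rate, genuine_rate)
```

A high match rate is a red flag, not proof: a strong model can reproduce short reference solutions verbatim, so flagged runs still need manual review.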
| ### 3. Tool Use Benchmark Unimplemented | |
| The claimed 94.1% Tool Use score lacks: | |
| - Any proper benchmark dataset | |
| - Defined evaluation methodology | |
| - Reproduction instructions | |
| - Actual model calls to test tool selection accuracy | |
| It appears to be a custom, non-standard metric with no basis in accepted benchmarks. | |
| --- | |
| ## Proper Evaluation Framework | |
| We have built a new, rigorous evaluation infrastructure: | |
| ### Official Datasets | |
| ```bash | |
| # Download HumanEval (164 problems) and MBPP (500 problems) | |
| python scripts/download_benchmark_datasets.py --data-dir ./data | |
| ``` | |
| This script fetches: | |
| - HumanEval from OpenAI's official dataset | |
| - MBPP from Google's benchmark suite | |
| - Ensures correct formatting and ground truth solutions | |
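After downloading, it is worth confirming that each dataset contains the full problem count before any run starts. The sketch below assumes the datasets are stored as JSON Lines files, one problem per line. That layout, the file name, and the helper names are assumptions for illustration, since the output format of `download_benchmark_datasets.py` is not described here.

```python
import json
import tempfile
from pathlib import Path

# Problem counts stated in this document.
EXPECTED_COUNTS = {"humaneval": 164, "mbpp": 500}


def count_problems(path: Path) -> int:
    """Count non-empty JSON Lines records; malformed JSON raises an error."""
    count = 0
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)
                count += 1
    return count


def check_dataset(name: str, path: Path) -> bool:
    """True only when the file holds the full benchmark."""
    return count_problems(path) == EXPECTED_COUNTS[name]


def write_demo(path: Path, n_problems: int) -> None:
    """Write placeholder records (hypothetical schema) for the demonstration."""
    rows = (json.dumps({"task_id": f"demo/{i}"}) for i in range(n_problems))
    path.write_text("\n".join(rows) + "\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "humaneval.jsonl"  # hypothetical file name
    write_demo(demo_file, 20)
    truncated_ok = check_dataset("humaneval", demo_file)
    write_demo(demo_file, 164)
    full_ok = check_dataset("humaneval", demo_file)

print(truncated_ok, full_ok)
```

A check like this would have rejected the earlier 20-problem subsets before any score was reported.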
| ### Unified Evaluation Runner | |
| `stack-2.9-eval/run_proper_evaluation.py` is the unified runner. Example invocation: | |
| ```bash | |
| python stack-2.9-eval/run_proper_evaluation.py \ | |
| --benchmark humaneval \ | |
| --provider ollama \ | |
| --model qwen2.5-coder:32b \ | |
| --k-samples 100 \ | |
| --output-dir ./results/humaneval_run | |
| ``` | |
| Features: | |
| - Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter) | |
| - Proper `pass@k` calculation with confidence intervals | |
| - Per-problem detailed logs (JSON) | |
| - Reproducible random sampling (seeds) | |
| - Parallel evaluation (configurable workers) | |
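For reference, the `pass@k` metric is commonly computed with the unbiased estimator introduced alongside HumanEval: for a problem with `n` samples of which `c` pass, `pass@k = 1 - C(n - c, k) / C(n, k)`, averaged over all problems. The sketch below implements that formula. Whether `run_proper_evaluation.py` uses exactly this form is an assumption, and the sample counts in the example are made up.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k draw contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts of passing samples for three problems, n = 100 samples each.
example_counts = [37, 0, 100]
print(benchmark_pass_at_k(example_counts, n=100, k=1))
print(benchmark_pass_at_k(example_counts, n=100, k=10))
```

With `k = 1` the estimator reduces to `c / n`, so drawing 100 samples per problem mainly tightens the pass@1 estimate without changing what is measured.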
| ### Evaluation Checklist | |
| To ensure transparency, every proper evaluation must: | |
| 1. ✅ Use full official benchmark (164 HumanEval, 500 MBPP) | |
| 2. ✅ Call real model inference via `model_client.py` | |
| 3. ✅ Run with k≥100 samples for pass@1 estimation | |
| 4. ✅ Store all generation outputs for audit | |
| 5. ✅ Compute standard deviation and confidence intervals | |
| 6. ✅ Publish full JSON logs to `results/` directory | |
| 7. ✅ Document exact model version, quantization, and provider settings | |
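Item 5 can be met in several ways. One simple option is a seeded bootstrap over per-problem pass rates, sketched below. This is an illustrative approach, not necessarily the method used by the runner. The function name is hypothetical, and the scores are synthetic placeholders, not measured results.

```python
import random
import statistics


def bootstrap_ci(scores, n_resamples=2000, confidence=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap over problems."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample problems with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lower_idx = int((1 - confidence) / 2 * n_resamples)
    upper_idx = int((1 + confidence) / 2 * n_resamples) - 1
    return statistics.fmean(scores), means[lower_idx], means[upper_idx]


# Synthetic per-problem pass rates for 164 problems (placeholder values only).
synthetic_scores = [0.0, 0.25, 0.5, 1.0] * 41

mean, low, high = bootstrap_ci(synthetic_scores)
spread = statistics.stdev(synthetic_scores)
print(f"mean={mean:.3f}  std={spread:.3f}  95% CI=[{low:.3f}, {high:.3f}]")
```

Recording the seed next to the reported interval lets anyone regenerate the same numbers from the published JSON logs.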
| --- | |
| ## Current Status | |
| The previously claimed scores have been **removed** from README.md and BENCHMARKS.md. They are replaced with: | |
| | Benchmark | Status | Notes | | |
| |-----------|--------|-------| | |
| | HumanEval | Pending verification | Full 164-problem evaluation setup ready | | |
| | MBPP | Pending verification | Full 500-problem evaluation setup ready | | |
| | Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required | | |
| | GSM8K | Not started | Math reasoning evaluation planned | | |
| Expected baseline (Qwen2.5-Coder-32B): | |
| - HumanEval: ~70-72% Pass@1 | |
| - MBPP: ~75-77% Pass@1 | |
| Stack 2.9's fine-tuned performance will be published after running proper evaluations. | |
| --- | |
| ## What Changed | |
| - Created `scripts/download_benchmark_datasets.py` for official datasets | |
| - Created `stack-2.9-eval/run_proper_evaluation.py` unified runner | |
| - Created `stack-2.9-eval/test_evaluation_setup.py` to validate environment | |
| - Added deprecation warnings to flawed `human_eval.py`, `mbpp_eval.py`, `tool_use_eval.py` | |
| - Updated README.md, BENCHMARKS.md, website pages to remove false claims | |
| --- | |
| ## How to Publish Verified Scores | |
| 1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data` | |
| 2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100 --output-dir ./results/humaneval_run` | |
| 3. Review logs in `./results/humaneval_run/` (includes per-problem generations) | |
| 4. Update README.md with actual numbers once verified | |
| 5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility | |
| **Do NOT publish** the previously claimed percentages. They are invalid. | |