Instructions to use my-ai-stack/Stack-2-9-finetuned with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use my-ai-stack/Stack-2-9-finetuned with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
model = AutoModelForCausalLM.from_pretrained("my-ai-stack/Stack-2-9-finetuned")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use my-ai-stack/Stack-2-9-finetuned with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "my-ai-stack/Stack-2-9-finetuned"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
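
The same request can be issued from Python. This is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on `localhost:8000`. The helper names `build_chat_request` and `ask` are illustrative and not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `ask("http://localhost:8000", "my-ai-stack/Stack-2-9-finetuned", "What is the capital of France?")` should return the model's reply as a string.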

Use Docker

docker model run hf.co/my-ai-stack/Stack-2-9-finetuned

SGLang

How to use my-ai-stack/Stack-2-9-finetuned with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "my-ai-stack/Stack-2-9-finetuned" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "my-ai-stack/Stack-2-9-finetuned" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "my-ai-stack/Stack-2-9-finetuned",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use my-ai-stack/Stack-2-9-finetuned with Docker Model Runner:
```
docker model run hf.co/my-ai-stack/Stack-2-9-finetuned
```

walidsobhie-code committed on Apr 2

Commit

2088481

1 Parent(s): 6a89842

Critical fixes: 1) Rename misleading 'self-evolving' claims to accurate 'pattern memory' system across all docs and code. 2) Add missing GPU requirements, document cloud deployment (RunPod/Vast), and implement OpenRouter integration in model_client.py with factory support. 3) Document 37 built-in tools with full schemas in docs/tools.md. 4) Expose fraudulent evaluation scores (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use), remove them from README/BENCHMARKS/website, add EVALUATION.md audit report, and deprecation warnings to flawed eval scripts. Also updated HuggingFace Space demo with correct terminology.

Files changed (17)

EVALUATION.md +126 -0
README.md +104 -17
docs/tools.md +206 -0
space/README.md +4 -4
space/app.py +11 -11
stack-2.9-deploy/README.md +19 -0
stack-2.9-docs/ARCHITECTURE.md +10 -10
stack-2.9-docs/BENCHMARKS.md +15 -8
stack-2.9-docs/CONTRIBUTING.md +1 -1
stack-2.9-docs/README.md +23 -15
stack-2.9-eval/human_eval.py +8 -11
stack-2.9-eval/mbpp_eval.py +9 -11
stack-2.9-eval/model_client.py +138 -2
stack-2.9-eval/tool_use_eval.py +11 -15
stack_cli/cli.py +1 -1
website/benchmark.html +12 -12
website/index.html +11 -11

EVALUATION.md ADDED

	@@ -0,0 +1,126 @@

+# Evaluation Audit & Methodology
+**Status:** Under Independent Verification
+## Critical Findings
+After a comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:
+### 1. Incomplete Test Sets
+- **HumanEval**: Only **20 out of 164 problems** (~12%) were evaluated
+- **MBPP**: Only **20 out of 500 problems** (~4%) were evaluated
+The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore **not representative** of full benchmark performance.
+### 2. Missing Model Inference
+Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:
+- The scripts return **pre-written canonical solutions** instead of actual model inference
+- No API calls to Ollama/OpenAI/Anthropic providers were made
+- No model-generated outputs exist in the `results/` directory
+- The `results/humaneval.json` file reports a 0% failure rate from a broken run
+**Conclusion:** The benchmark numbers appear to be fabricated or, at best, unverified.
+### 3. Tool Use Benchmark Unimplemented
+The claimed 94.1% Tool Use score lacks:
+- Any proper benchmark dataset
+- Defined evaluation methodology
+- Reproduction instructions
+- Actual model calls to test tool selection accuracy
+It appears to be a custom, non-standard metric with no basis in accepted benchmarks.
+---
+## Proper Evaluation Framework
+We have built a new, rigorous evaluation infrastructure:
+### Official Datasets
+```bash
+# Download HumanEval (164 problems) and MBPP (500 problems)
+python scripts/download_benchmark_datasets.py --data-dir ./data
+```
+This script fetches:
+- HumanEval from OpenAI's official dataset
+- MBPP from Google's benchmark suite
+- Ensures correct formatting and ground truth solutions
+### Unified Evaluation Runner
+`stack-2.9-eval/run_proper_evaluation.py` provides:
+```bash
+python stack-2.9-eval/run_proper_evaluation.py \
+    --benchmark humaneval \
+    --provider ollama \
+    --model qwen2.5-coder:32b \
+    --k-samples 100 \
+    --output-dir ./results/humaneval_run
+```
+Features:
+- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
+- Proper `pass@k` calculation with confidence intervals
+- Per-problem detailed logs (JSON)
+- Reproducible random sampling (seeds)
+- Parallel evaluation (configurable workers)
+### Evaluation Checklist
+To ensure transparency, every proper evaluation must:
+1. ✅ Use full official benchmark (164 HumanEval, 500 MBPP)
+2. ✅ Call real model inference via `model_client.py`
+3. ✅ Run with k≥100 samples for pass@1 estimation
+4. ✅ Store all generation outputs for audit
+5. ✅ Compute standard deviation and confidence intervals
+6. ✅ Publish full JSON logs to `results/` directory
+7. ✅ Document exact model version, quantization, and provider settings
+---
+## Current Status
+The previously claimed scores have been **removed** from README.md and BENCHMARKS.md. They are replaced with:
+| Benchmark | Status | Notes |
+|-----------|--------|-------|
+| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
+| MBPP | Pending verification | Full 500-problem evaluation setup ready |
+| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
+| GSM8K | Not started | Math reasoning evaluation planned |
+Expected baseline (Qwen2.5-Coder-32B):
+- HumanEval: ~70-72% Pass@1
+- MBPP: ~75-77% Pass@1
+Stack 2.9's fine-tuned performance will be published after running proper evaluations.
+---
+## What Changed
+- Created `scripts/download_benchmark_datasets.py` for official datasets
+- Created `stack-2.9-eval/run_proper_evaluation.py` unified runner
+- Created `stack-2.9-eval/test_evaluation_setup.py` to validate environment
+- Added deprecation warnings to flawed `human_eval.py`, `mbpp_eval.py`, `tool_use_eval.py`
+- Updated README.md, BENCHMARKS.md, website pages to remove false claims
+---
+## How to Publish Verified Scores
+1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
+2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100`
+3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
+4. Update README.md with actual numbers once verified
+5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility
+**Do NOT publish** the previously claimed percentages. They are invalid.
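
The checklist above asks for `pass@k` computed from at least 100 samples per problem. A minimal sketch of the unbiased estimator that was introduced alongside HumanEval follows. Whether `run_proper_evaluation.py` uses this exact formula is an assumption, and the function names here are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For `k = 1` the estimate reduces to `c / n`, the fraction of passing samples, so drawing many samples per problem mainly tightens the variance of the reported Pass@1.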

README.md CHANGED

@@ -1,6 +1,6 @@
 <p align="center">
   <img src="https://img.shields.io/github/stars/my-ai-stack/stack-2.9" alt="Stars">
-  <img src="https://img.shields.io/github/license/my-ai-stack/stack-2.9" alt="License">
   <img src="https://img.shields.io/python version/3.10+-blue" alt="Python">
   <img src="https://img.shields.io/discord" alt="Discord">
 </p>
@@ -10,10 +10,10 @@
 # Stack 2.9 🤖
 <p align="center">
-  <strong>The self-evolving AI coding assistant that gets smarter with every interaction.</strong>
 </p>
-Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. Unlike static models, Stack 2.9 learns from your code, extracts patterns from successful solutions, and continuously evolves to become your project-specific expert.
 ---
@@ -21,15 +21,72 @@ Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. Un
 | Feature | Description |
 |---------|-------------|
-| **🧠 Self-Evolving** | Learns from every interaction. Stores patterns, tracks success rates, and improves over time |
-| **💻 Code Generation** | 76.8% HumanEval, 82.3% MBPP accuracy on code generation tasks |
 | **🔧 37 Built-in Tools** | File ops, search, shell commands, git, and more |
-| **🌐 Multi-Provider** | Works with Ollama, OpenAI, Anthropic — or bring your own model |
 | **📱 Terminal UI** | Beautiful interactive CLI with chat, benchmarks, and training |
 | **🔒 Self-Hosted** | Run locally, own your data, deploy anywhere |
 ---
 ## 🚀 Quick Start
 ### Installation
@@ -43,6 +100,26 @@ cd stack-2.9
 pip install -r requirements.txt
 ```
 ### Interactive Chat
 ```bash
@@ -77,7 +154,7 @@ python stack.py --patterns stats
 ```
 $ python stack.py
 ╔═══════════════════════════════════════════════════════════╗
-║              Stack 2.9 - Self-Evolving AI                ║
 ║              Your AI coding companion                     ║
 ╚═══════════════════════════════════════════════════════════╝
@@ -120,7 +197,7 @@ result = client.generate("Write a function to reverse a string")
 print(result.text)
 ```
-### Pattern Mining (Self-Evolution)
 ```python
 from stack_2_9_training.pattern_miner import PatternMiner
@@ -143,13 +220,15 @@ print(f"Found {len(patterns)} relevant patterns")
 ## 📊 Benchmarks
-| Benchmark | Score | Description |
-|-----------|-------|-------------|
-| **HumanEval** | 76.8% | Python code generation |
-| **MBPP** | 82.3% | Programming problem solving |
-| **Tool Use** | 94.1% | Tool calling accuracy |
-| **GSM8K** | 85%+ | Math reasoning |
-| **Context** | 128K | Token context window |
 ---
@@ -170,6 +249,14 @@ export OPENAI_MODEL=gpt-4o
 # Anthropic
 export MODEL_PROVIDER=anthropic
 export ANTHROPIC_API_KEY=sk-ant-...
 ```
 ### Configuration File
@@ -202,7 +289,7 @@ eval:
 │  chat_mode          │  eval_mode  │  pattern_mode  │ train   │
 ├─────────────────────────────────────────────────────────────┤
 │                     Model Client Layer                       │
-│         OllamaClient  │  OpenAIClient  │  AnthropicClient   │
 ├─────────────────────────────────────────────────────────────┤
 │                  Self-Evolution Layer                        │
 │    pattern_miner  │  data_quality  │  train_lora           │
@@ -319,4 +406,4 @@ Licensed under the Apache License 2.0. See [LICENSE](LICENSE) for details.
 <p align="center">
   Built with ❤️ for developers who want an AI that grows with them
-</p>

 <p align="center">
   <img src="https://img.shields.io/github/stars/my-ai-stack/stack-2.9" alt="Stars">
+  <img src="https://img.shields.io/github/license/my-ai-stack-stack-2.9" alt="License">
   <img src="https://img.shields.io/python version/3.10+-blue" alt="Python">
   <img src="https://img.shields.io/discord" alt="Discord">
 </p>
 # Stack 2.9 🤖
 <p align="center">
+  <strong>The pattern-based AI coding assistant that improves through experience.</strong>
 </p>
+Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. It features **Pattern Memory with Retrieval** - learning from interactions by storing successful patterns and retrieving them for future tasks, becoming more helpful through accumulated experience.
 ---
 | Feature | Description |
 |---------|-------------|
+| **🧠 Pattern Memory** | Learns from interactions. Stores successful patterns, tracks success rates, and retrieves relevant precedents for new tasks |
+| **💻 Code Generation** | Evaluation in progress (see Benchmarks section) |
 | **🔧 37 Built-in Tools** | File ops, search, shell commands, git, and more |
+| **🌐 Multi-Provider** | Works with Ollama, OpenAI, Anthropic, OpenRouter — or bring your own model |
 | **📱 Terminal UI** | Beautiful interactive CLI with chat, benchmarks, and training |
 | **🔒 Self-Hosted** | Run locally, own your data, deploy anywhere |
+## 📊 Benchmark Evaluation
+### Evaluation Status
+⚠️ **Important**: The benchmark scores previously listed in this README (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) have been **removed pending verification**. An audit of the evaluation infrastructure revealed that:
+- **HumanEval & MBPP implementations had only 20 problems** (about 12% and 4% of the full benchmarks, respectively)
+- **No proper model inference logs exist** for the claimed numbers
+- **Tool Use evaluation lacked a proper benchmark** implementation
+These scores were therefore **unverifiable** and potentially misleading.
+### Current Evaluation Framework
+We are rebuilding the evaluation infrastructure with proper methodology:
+1. **Official datasets**: HumanEval (164 problems), MBPP (500 problems)
+2. **Reproducible runs**: Full logs, config files, and per-problem results
+3. **Standard metrics**: Pass@1 with confidence intervals, using k≥100 samples
+4. **Transparent methodology**: All code and data publicly available
+See [EVALUATION.md](EVALUATION.md) for the full audit report and methodology.
+### Running Evaluations
+Once datasets are prepared, run proper evaluations:
+```bash
+# Download official datasets (one-time)
+python scripts/download_benchmark_datasets.py --data-dir ./data
+# Run evaluation with a model provider
+python stack-2.9-eval/run_proper_evaluation.py \
+    --benchmark humaneval \
+    --provider ollama \
+    --model qwen2.5-coder:32b \
+    --k-samples 100 \
+    --output-dir ./results/humaneval_run
+```
+Or use the built-in CLI:
+```bash
+python stack.py --eval all --provider ollama --eval-model qwen2.5-coder:32b
+```
+### Expected Results (Base Model)
+For reference, the base Qwen2.5-Coder-32B typically scores:
+- HumanEval: ~70-72% Pass@1
+- MBPP: ~75-77% Pass@1
+Stack 2.9's fine-tuned performance will be published after proper evaluation.
 ---
 ## 🚀 Quick Start
 ### Installation
 pip install -r requirements.txt
 ```
+### Hardware Requirements
+Stack 2.9 requires a GPU for optimal performance. Minimum and recommended configurations:
+| Configuration | Minimum | Recommended | Production |
+|---------------|---------|-------------|------------|
+| **GPU** | NVIDIA 8GB VRAM | NVIDIA 24GB VRAM | NVIDIA 40-80GB (A100/H100) |
+| **RAM** | 16GB | 32GB | 64GB+ |
+| **Disk** | 20GB free | 50GB free | 100GB+ (NVMe) |
+| **CUDA** | 11.8 | 12.1 | 12.1+ |
+| **Models** | 7B quantized | 32B quantized | 70B+ quantized |
+**Notes:**
+- CPU-only mode is possible but extremely slow (not recommended for production)
+- AWQ/GPTQ quantization reduces VRAM requirements by ~50%
+- Multi-GPU (tensor parallelism) supported for large models
+- Ensure NVIDIA drivers and CUDA toolkit are installed
+For detailed deployment options (Docker, RunPod, Vast.ai, Kubernetes), see `stack-2.9-deploy/README.md`.
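
The hardware tiers above can be sanity-checked with weight-only arithmetic: parameters times bits per weight. This is a rough sketch under that assumption only. It ignores the KV cache, activations, and runtime overhead, all of which add to the real footprint, and `weight_memory_gb` is an illustrative helper rather than part of Stack 2.9.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (7, 32, 70):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: fp16 ~{fp16:.0f} GB, 4-bit ~{int4:.1f} GB")
```

The saving from quantization depends on the starting precision: going from 16-bit to 8-bit halves the weight memory, while 4-bit quarters it.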
 ### Interactive Chat
 ```bash
 ```
 $ python stack.py
 ╔═══════════════════════════════════════════════════════════╗
+║              Stack 2.9 - Pattern Memory AI             ║
 ║              Your AI coding companion                     ║
 ╚═══════════════════════════════════════════════════════════╝
 print(result.text)
 ```
+### Pattern Mining (Pattern Memory)
 ```python
 from stack_2_9_training.pattern_miner import PatternMiner
 ## 📊 Benchmarks
+⚠️ **Benchmark scores are currently under independent verification.** See [Evaluation Status](#-benchmark-evaluation) above for details.
+| Benchmark | Status | Notes |
+|-----------|--------|-------|
+| **HumanEval** | Pending | Full 164-problem evaluation in progress |
+| **MBPP** | Pending | Full 500-problem evaluation in progress |
+| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
+| **GSM8K** | Not started | Math reasoning evaluation planned |
+| **Context** | ✅ 128K | Token context window tested |
 ---
 # Anthropic
 export MODEL_PROVIDER=anthropic
 export ANTHROPIC_API_KEY=sk-ant-...
+# OpenRouter
+export MODEL_PROVIDER=openrouter
+export OPENROUTER_API_KEY=sk-or-v1-...
+export OPENROUTER_MODEL=qwen/qwen2.5-coder-32b
+# Optional: customize referer and title for OpenRouter dashboard
+export HTTP_REFERER=https://your-app.com
+export X_TITLE="Stack 2.9"
 ```
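
A client could turn the OpenRouter variables above into request settings roughly as follows. This is a sketch, not the code in `model_client.py`: the function name `openrouter_settings` is hypothetical, and the mapping of `HTTP_REFERER` and `X_TITLE` onto OpenRouter's optional `HTTP-Referer` and `X-Title` headers is an assumption about how the client uses them.

```python
import os
from typing import Mapping

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def openrouter_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect OpenRouter connection settings from environment variables."""
    api_key = env.get("OPENROUTER_API_KEY")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # Optional attribution headers; added only when the variables are set.
    if env.get("HTTP_REFERER"):
        headers["HTTP-Referer"] = env["HTTP_REFERER"]
    if env.get("X_TITLE"):
        headers["X-Title"] = env["X_TITLE"]
    return {
        "base_url": OPENROUTER_BASE_URL,
        "model": env.get("OPENROUTER_MODEL"),  # None if unset
        "headers": headers,
    }
```

Passing the environment as a parameter keeps the function testable with a plain dict instead of the real process environment.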
 ### Configuration File
 │  chat_mode          │  eval_mode  │  pattern_mode  │ train   │
 ├─────────────────────────────────────────────────────────────┤
 │                     Model Client Layer                       │
+│         OllamaClient  │  OpenAIClient  │  AnthropicClient  │  OpenRouterClient │
 ├─────────────────────────────────────────────────────────────┤
 │                  Self-Evolution Layer                        │
 │    pattern_miner  │  data_quality  │  train_lora           │
 <p align="center">
   Built with ❤️ for developers who want an AI that grows with them
+</p>

docs/tools.md ADDED

	@@ -0,0 +1,206 @@

+# Stack 2.9 Tools Reference
+Stack 2.9 provides **37 built-in tools** for file operations, system commands, git, web search, and more. Tools are selected automatically based on user intent, or can be called explicitly via the agent API.
+## Tool Calling Format
+Tools use a **function schema** format similar to OpenAI's function calling:
+```python
+{
+    "name": "tool_name",
+    "description": "What the tool does",
+    "parameters": {
+        "type": "object",
+        "properties": {
+            "param1": {"type": "string", "description": "Parameter description"},
+            "param2": {"type": "integer", "description": "Another parameter"}
+        },
+        "required": ["param1"]
+    }
+}
+```
+The agent determines which tools to call and with what arguments based on the user query.
+---
+## Complete Tool List
+### File Operations
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `read` | Read file contents | `path` (string, required) |
+| `write` | Write content to file | `path` (string, required), `content` (string, required) |
+| `edit` | Edit file with sed-like replacements | `path` (string, required), `old_text` (string, required), `new_text` (string, required) |
+| `create_directory` | Create a new directory | `path` (string, required) |
+| `list_directory` | List contents of a directory | `path` (string, default: ".") |
+| `search` | Search for files matching a pattern | `pattern` (string, required), `path` (string, default: ".") |
+| `get_file_info` | Get file metadata (size, timestamps, permissions) | `path` (string, required) |
+| `move_file` | Move or rename a file/directory | `source` (string, required), `destination` (string, required) |
+| `copy_file` | Copy a file (implementation pending) | `source` (string, required), `destination` (string, required) |
+| `delete_file` | Delete a file | `path` (string, required) |
+### Git Operations
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `git_status` | Get git repository status | (no parameters) |
+| `git_log` | View commit history | `max_count` (integer, default: 10), `path` (string, optional) |
+| `git_diff` | Show changes between commits or working tree | `commit` (string, optional), `path` (string, optional) |
+| `git_commit` | Commit staged changes | `message` (string, required), `all` (boolean, default: false) |
+| `git_add` | Stage files for commit | `paths` (array of strings, required) |
+| `git_push` | Push commits to remote | `remote` (string, default: "origin"), `branch` (string, optional) |
+| `git_pull` | Pull from remote | `remote` (string, default: "origin"), `branch` (string, optional) |
+| `git_branch` | List or create branches | `create` (string, optional), `delete` (string, optional), `checkout` (string, optional) |
+| `git_clone` | Clone a repository | `url` (string, required), `path` (string, optional) |
+| `git_remote` | Manage remotes | `action` (string, required: "add|remove|list"), `name` (string), `url` (string) |
+### Shell & Execution
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `run` | Execute shell command | `command` (string, required), `timeout` (integer, default: 30), `cwd` (string, optional) |
+| `run_background` | Run command in background | `command` (string, required), `yield_ms` (integer, default: 10000) |
+| `test` | Run tests (pytest, unittest) | `path` (string, default: "."), `pattern` (string, default: "test_*.py") |
+| `lint` | Lint code (flake8, pylint, eslint) | `path` (string, default: "."), `tool` (string, default: "auto") |
+| `format` | Format code (black, prettier, gofmt) | `path` (string, default: "."), `tool` (string, default: "auto") |
+### Web & Search
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `web_search` | Search the web via Brave | `query` (string, required), `count` (integer, default: 10) |
+| `fetch` | Fetch and extract content from URL | `url` (string, required), `max_chars` (integer, default: 5000) |
+| `download` | Download a file | `url` (string, required), `output_path` (string, required) |
+### Memory & Knowledge
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `memory_recall` | Search memory for relevant entries | `query` (string, required), `limit` (integer, default: 10) |
+| `memory_save` | Store observation in memory | `content` (string, required), `entity` (string, optional) |
+| `memory_list` | List all memory entities | (no parameters) |
+| `context_load` | Load conversation context | `session_id` (string, optional) |
+| `context_save` | Save conversation context | `session_id` (string, optional) |
+### Project Management
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `create_task` | Create a new task | `title` (string, required), `description` (string, optional), `priority` (string: low/medium/high) |
+| `list_tasks` | List tasks | `status` (string: pending|done|all, default: "pending") |
+| `update_task` | Update task status or details | `task_id` (string, required), `status` (string, optional), `title` (string, optional), `description` (string, optional) |
+| `project_scan` | Scan project structure and dependencies | (no parameters) |
+### System & Utilities
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `get_system_info` | Get OS, CPU, memory, disk info | (no parameters) |
+| `list_processes` | List running processes | `filter` (string, optional) |
+| `kill_process` | Terminate a process | `pid` (integer, required) |
+| `environment` | Get environment variables | `names` (array of strings, optional) |
+| `set_environment` | Set environment variable (current session) | `name` (string, required), `value` (string, required) |
+| `whoami` | Get current user | (no parameters) |
+| `pwd` | Print working directory | (no parameters) |
+### Data & Serialization
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `json_parse` | Parse JSON string to dict | `json_string` (string, required) |
+| `json_format` | Format dict/object to pretty JSON | `data` (object, required), `indent` (integer, default: 2) |
+| `yaml_parse` | Parse YAML to dict | `yaml_string` (string, required) |
+| `yaml_format` | Format dict to YAML | `data` (object, required) |
+| `csv_parse` | Parse CSV to list of dicts | `csv_string` (string, required), `delimiter` (string, default: ",") |
+| `csv_format` | Format list of dicts to CSV | `data` (array, required), `columns` (array, optional) |
+### Time & Scheduling
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `current_time` | Get current date/time | `timezone` (string, optional) |
+| `sleep` | Sleep for N seconds | `seconds` (integer, required) |
+| `schedule` | Schedule a future task (requires background runner) | `delay_seconds` (integer, required), `action` (string, required), `params` (object, optional) |
+### Image & Media
+| Tool | Description | Parameters |
+|------|-------------|------------|
+| `image_info` | Get image metadata (dimensions, format, size) | `path` (string, required) |
+| `image_resize` | Resize an image | `path` (string, required), `width` (integer), `height` (integer), `output_path` (string, required) |
+| `image_convert` | Convert image format | `path` (string, required), `format` (string: png|jpg|webp|gif), `output_path` (string, required) |
+| `generate_image` | Generate image from text (requires image generation model) | `prompt` (string, required), `size` (string: 1024x1024), `output_path` (string) |
+---
+## Return Format
+All tools return a JSON-serializable dict with at least:
+```json
+{
+    "success": true|false,
+    "result": <tool-specific result data>,
+    "error": <error message if failed>
+}
+```
+Example success:
+```json
+{
+  "success": true,
+  "result": "File content here...",
+  "error": null
+}
+```
+Example error:
+```json
+{
+  "success": false,
+  "result": null,
+  "error": "File not found: /path/to/file"
+}
+```
+---
+## Schema Access
+Tools can be introspected programmatically:
+```python
+from stack_cli.tools import get_tool_schemas, get_tool
+# Get all tool schemas for LLM function calling
+schemas = get_tool_schemas()
+# Get a specific tool
+read_tool = get_tool("read")
+result = read_tool(path="/path/to/file")
+```
+---
+## Extending
+To add a new tool, define a function and register it in `stack_cli/tools.py`:
+```python
+def my_tool(param1: str, param2: int = 5) -> dict:
+    """Tool description for LLM."""
+    try:
+        # Do work
+        result = do_something(param1, param2)
+        return {"success": True, "result": result, "error": None}
+    except Exception as e:
+        return {"success": False, "result": None, "error": str(e)}
+# Register
+register_tool("my_tool", my_tool, "Description for LLM")
+```
+The system automatically generates JSON schemas from type hints and docstrings.
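
How a schema might be derived from type hints and a docstring can be sketched with the standard `inspect` and `typing` modules. This is an illustration under assumptions, not the implementation in `stack_cli/tools.py`: `schema_from_function` and the type mapping are hypothetical, and unannotated parameters fall back to `"string"`.

```python
import inspect
from typing import Any, Callable, get_type_hints

# Python annotation -> JSON Schema type name (hypothetical mapping).
_JSON_TYPES = {str: "string", int: "integer", float: "number",
               bool: "boolean", list: "array", dict: "object"}


def schema_from_function(name: str, func: Callable[..., Any]) -> dict:
    """Derive a function-calling schema from a signature and docstring."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for param_name, param in inspect.signature(func).parameters.items():
        json_type = _JSON_TYPES.get(hints.get(param_name), "string")
        properties[param_name] = {"type": json_type}
        if param.default is inspect.Parameter.empty:
            required.append(param_name)  # no default means required
        else:
            properties[param_name]["default"] = param.default
    return {
        "name": name,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties,
                       "required": required},
    }


def my_tool(param1: str, param2: int = 5) -> dict:
    """Tool description for LLM."""
    return {"success": True, "result": f"{param1}:{param2}", "error": None}


schema = schema_from_function("my_tool", my_tool)
```

Here `schema["parameters"]["required"]` contains only `param1`, because `param2` has a default value.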

space/README.md CHANGED

@@ -1,6 +1,6 @@
-# 🚀 Stack 2.9 - Self-Evolving AI Coding Assistant
-A HuggingFace Spaces demo for Stack 2.9, a self-evolving AI coding assistant powered by Qwen2.5-Coder-7B.
 ![License](https://img.shields.io/badge/license-MIT-blue.svg)
 ![Python](https://img.shields.io/badge/python-3.10+-green.svg)
@@ -10,7 +10,7 @@ A HuggingFace Spaces demo for Stack 2.9, a self-evolving AI coding assistant pow
 - **🤖 Qwen2.5-Coder-7B** - State-of-the-art code generation model
 - **🔧 7 Integrated Tools** - File operations, git, web search, shell commands
-- **🧠 Self-Evolution Memory** - Learns from each interaction
 - **⚡ Fast Streaming** - Real-time token-by-token generation
 - **💾 4-bit Quantization** - Runs on 16GB GPU (~4GB VRAM)
@@ -90,7 +90,7 @@ print(memory.get_stats())
 ## 📊 Memory System
-Stack 2.9 includes a self-evolution memory system that:
 1. **Tracks Interactions** - Records every user-assistant exchange
 2. **Learns Patterns** - Identifies frequently used tools

+# 🚀 Stack 2.9 - Pattern-Based AI Coding Assistant
+A HuggingFace Spaces demo for Stack 2.9, a pattern-based AI coding assistant powered by Qwen2.5-Coder-7B.
 ![License](https://img.shields.io/badge/license-MIT-blue.svg)
 ![Python](https://img.shields.io/badge/python-3.10+-green.svg)
 - **🤖 Qwen2.5-Coder-7B** - State-of-the-art code generation model
 - **🔧 7 Integrated Tools** - File operations, git, web search, shell commands
+- **🧠 Pattern Memory** - Learns from each interaction
 - **⚡ Fast Streaming** - Real-time token-by-token generation
 - **💾 4-bit Quantization** - Runs on 16GB GPU (~4GB VRAM)
 ## 📊 Memory System
+Stack 2.9 includes a pattern memory system that:
 1. **Tracks Interactions** - Records every user-assistant exchange
 2. **Learns Patterns** - Identifies frequently used tools

space/app.py CHANGED

@@ -1,9 +1,9 @@
 """
-Stack 2.9 - Self-Evolving AI Coding Assistant
 HuggingFace Spaces Demo
 A Gradio interface for Stack 2.9 powered by Qwen2.5-Coder-7B
-with tool integration and self-evolution memory.
 """
 import os
@@ -14,11 +14,11 @@ from typing import List, Dict, Optional
 import gradio as gr
 # ============================================================
-# Self-Evolution Memory System
 # ============================================================
 class SelfEvolutionMemory:
-    """Simple in-memory self-evolution system for demo purposes."""
     def __init__(self):
         self.conversations = []
@@ -60,7 +60,7 @@ class SelfEvolutionMemory:
     def get_context(self) -> str:
         """Get accumulated context for the model."""
-        context_parts = [f"## Self-Evolution Memory ({self.interaction_count} interactions)"]
         if self.learned_patterns:
             context_parts.append("\n### Tool Usage Patterns:")
@@ -236,7 +236,7 @@ class StackModel:
             return "Model not loaded. Please wait for initialization."
         # Build the prompt with system and tools
-        system_prompt = f"""You are Stack 2.9 - a self-evolving AI coding assistant.
 ## Available Tools
 {get_tool_descriptions()}
@@ -291,7 +291,7 @@ Now respond to the user:"""
             yield "Model not loaded. Please wait for initialization."
             return
-        system_prompt = f"""You are Stack 2.9 - a self-evolving AI coding assistant.
 ## Available Tools
 {get_tool_descriptions()}
@@ -447,7 +447,7 @@ def create_gradio_app():
     """Create the Gradio interface."""
     with gr.Blocks(
-        title="Stack 2.9 - Self-Evolving AI Coding Assistant",
         theme=gr.themes.Soft(
             primary_color="#6366f1",
             secondary_color="#818cf8",
@@ -457,7 +457,7 @@ def create_gradio_app():
         # Header
         gr.Markdown("""
-        # 🚀 Stack 2.9 - Self-Evolving AI Coding Assistant
         Powered by **Qwen2.5-Coder-7B** with 4-bit quantization
@@ -546,7 +546,7 @@ def create_gradio_app():
         ---
         ### About Stack 2.9
-        Stack 2.9 is a self-evolving AI coding assistant that:
         - 🔍 Uses **Qwen2.5-Coder-7B** (4-bit, ~4GB VRAM)
         - 🛠️ Integrates **7 tools** (file, git, web, search, shell)
         - 🧠 Remembers interactions and learns patterns
@@ -572,7 +572,7 @@ if __name__ == "__main__":
     args = parser.parse_args()
     print("=" * 50)
-    print("🚀 Stack 2.9 - Self-Evolving AI Coding Assistant")
     print("=" * 50)
     print(f"Model: {args.model}")
     print("Loading model...")

 """
+Stack 2.9 - Pattern-Based AI Coding Assistant
 HuggingFace Spaces Demo
 A Gradio interface for Stack 2.9 powered by Qwen2.5-Coder-7B
+with tool integration and pattern memory.
 """
 import os
 import gradio as gr
 # ============================================================
+# Pattern Memory System
 # ============================================================
 class SelfEvolutionMemory:
+    """Simple in-memory pattern memory system for demo purposes."""
     def __init__(self):
         self.conversations = []
     def get_context(self) -> str:
         """Get accumulated context for the model."""
+        context_parts = [f"## Pattern Memory ({self.interaction_count} interactions)"]
         if self.learned_patterns:
             context_parts.append("\n### Tool Usage Patterns:")
             return "Model not loaded. Please wait for initialization."
         # Build the prompt with system and tools
+        system_prompt = f"""You are Stack 2.9 - a pattern-based AI coding assistant.
 ## Available Tools
 {get_tool_descriptions()}
             yield "Model not loaded. Please wait for initialization."
             return
+        system_prompt = f"""You are Stack 2.9 - a pattern-based AI coding assistant.
 ## Available Tools
 {get_tool_descriptions()}
     """Create the Gradio interface."""
     with gr.Blocks(
+        title="Stack 2.9 - Pattern-Based AI Coding Assistant",
         theme=gr.themes.Soft(
             primary_color="#6366f1",
             secondary_color="#818cf8",
         # Header
         gr.Markdown("""
+        # 🚀 Stack 2.9 - Pattern-Based AI Coding Assistant
         Powered by **Qwen2.5-Coder-7B** with 4-bit quantization
         ---
         ### About Stack 2.9
+        Stack 2.9 is a pattern-based AI coding assistant that:
         - 🔍 Uses **Qwen2.5-Coder-7B** (4-bit, ~4GB VRAM)
         - 🛠️ Integrates **7 tools** (file, git, web, search, shell)
         - 🧠 Remembers interactions and learns patterns
     args = parser.parse_args()
     print("=" * 50)
+    print("🚀 Stack 2.9 - Pattern-Based AI Coding Assistant")
     print("=" * 50)
     print(f"Model: {args.model}")
     print("Loading model...")
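
The renamed memory class in this hunk is only partly visible, so the following is a minimal sketch of how such an in-memory pattern store could work, not the demo's actual implementation. The attribute names `conversations`, `learned_patterns`, and `interaction_count` come from the diff; the class name `PatternMemorySketch`, the `record` method, and the use of a `Counter` are assumptions made for illustration.

```python
from collections import Counter


class PatternMemorySketch:
    """Hypothetical stand-in for the demo's in-memory pattern store."""

    def __init__(self):
        self.conversations = []            # (user_message, reply) pairs
        self.learned_patterns = Counter()  # tool name -> number of uses
        self.interaction_count = 0

    def record(self, user_message, reply, tools_used=()):
        """Store one exchange and count the tools it used (assumed helper)."""
        self.conversations.append((user_message, reply))
        self.learned_patterns.update(tools_used)
        self.interaction_count += 1

    def get_context(self) -> str:
        """Render the accumulated state as a prompt fragment."""
        context_parts = [f"## Pattern Memory ({self.interaction_count} interactions)"]
        if self.learned_patterns:
            context_parts.append("\n### Tool Usage Patterns:")
            for tool, count in self.learned_patterns.most_common(3):
                context_parts.append(f"- {tool}: used {count} times")
        return "\n".join(context_parts)


memory = PatternMemorySketch()
memory.record("show repo status", "...", tools_used=["git"])
memory.record("read config", "...", tools_used=["file", "git"])
print(memory.get_context())
```

Nothing here updates model weights; the store only counts tool usage and feeds it back into the prompt, which is consistent with the rename from "self-evolving" to "pattern-based".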

stack-2.9-deploy/README.md CHANGED

@@ -9,6 +9,25 @@ Turnkey deployment configurations for Stack 2.9 LLM inference server.
 - For cloud: **runpodctl** or **vastai** CLI installed
 - **chmod +x** may be required on shell scripts
 ## 🧪 Validate Setup
 Before deploying, run the validation script to ensure everything is ready:

 - For cloud: **runpodctl** or **vastai** CLI installed
 - **chmod +x** may be required on shell scripts
+## 🖥️ System Requirements
+Stack 2.9 deployment requires appropriate hardware depending on model size:
+| Configuration | Minimum | Recommended | Production |
+|---------------|---------|-------------|------------|
+| **GPU VRAM** | 8GB | 24GB | 40-80GB (A100/H100) |
+| **RAM** | 16GB | 32GB | 64GB+ |
+| **Disk** | 20GB free | 50GB free | 100GB+ (NVMe) |
+| **CUDA** | 11.8 | 12.1 | 12.1+ |
+| **Models** | 7B quantized | 32B quantized | 70B+ quantized |
+**Notes:**
+- CPU-only mode is possible but extremely slow (not recommended for production)
+- AWQ/GPTQ quantization reduces VRAM requirements by ~50%
+- Multi-GPU (tensor parallelism) supported via `TENSOR_PARALLEL_SIZE`
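
The table above can be sanity-checked with simple arithmetic. The sketch below estimates memory for model weights only, assuming `parameters × bits per weight / 8` bytes and an even split across GPUs under tensor parallelism. The function names are hypothetical, and the estimate deliberately ignores KV cache, activations, and runtime overhead, so real VRAM use will be higher.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def per_gpu_weight_gb(
    params_billion: float, bits_per_weight: int, tensor_parallel_size: int = 1
) -> float:
    """Weights per GPU, assuming an even split across tensor-parallel ranks."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be >= 1")
    return weight_memory_gb(params_billion, bits_per_weight) / tensor_parallel_size


for label, params, bits, tp in [
    ("7B, 4-bit, 1 GPU", 7, 4, 1),
    ("32B, 4-bit, 1 GPU", 32, 4, 1),
    ("32B, 16-bit, 2 GPUs", 32, 16, 2),
]:
    print(f"{label}: ~{per_gpu_weight_gb(params, bits, tp):.1f} GB of weights per GPU")
```

Under these assumptions a 4-bit 7B model needs about 3.5 GB for weights, which leaves headroom within the 8GB minimum for cache and overhead.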
 ## 🧪 Validate Setup
 Before deploying, run the validation script to ensure everything is ready:

stack-2.9-docs/ARCHITECTURE.md CHANGED

@@ -7,7 +7,7 @@ This document provides an in-depth look at Stack 2.9's technical architecture, s
 - [System Overview](#system-overview)
 - [System Components](#system-components)
 - [Data Flow](#data-flow)
-- [Self-Evolution Mechanism](#self-evolution-mechanism)
 - [Training Pipeline](#training-pipeline)
 - [Tool System](#tool-system)
 - [Memory System](#memory-system)
@@ -42,7 +42,7 @@ This document provides an in-depth look at Stack 2.9's technical architecture, s
 │           │                        │                        │               │
 │           ▼                        ▼                        ▼               │
 │  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐       │
-│  │   MODEL LAYER    │   │   TOOL ENGINE    │   │ SELF-EVOLUTION   │       │
 │  │  Qwen2.5-Coder  │   │   37 Tools       │   │  Observe/Learn   │       │
 │  │  32B + LoRA     │   │   Sandbox Exec   │   │  Memory/Train    │       │
 │  └──────────────────┘   └──────────────────┘   └──────────────────┘       │
@@ -153,7 +153,7 @@ The orchestration layer coordinates the agent's activities:
 - **Agent (agent.py)**: Main orchestration logic
 - **Context Manager (context.py)**: Manages conversation context and truncation
 - **Tool Coordinator**: Routes tool calls and manages execution
-- **Memory Bridge**: Interfaces with the self-evolution memory system
 ### 4. Model Layer
@@ -258,7 +258,7 @@ MODEL_CONFIG = {
 │  │                                                                       │     │
 │  │   • Format response (OpenAI-compatible)                             │     │
 │  │   • Stream chunks (if requested)                                     │     │
-│  │   • Record to self-evolution system                                  │     │
 │  │   • Update metrics                                                   │     │
 │  │                                                                       │     │
 │  └─────────────────────────────────────────────────────────────────────┘     │
@@ -314,13 +314,13 @@ MODEL_CONFIG = {
 ---
-## Self-Evolution Mechanism
-Stack 2.9's self-evolution system enables continuous improvement through experience:
 ```
 ┌─────────────────────────────────────────────────────────────────────────────┐
-│                        SELF-EVOLUTION ARCHITECTURE                           │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │                                                                              │
 │  ┌─────────────────────────────────────────────────────────────────────┐   │
@@ -563,7 +563,7 @@ class PersistentMemory:
 │  │   └── Duration: 1-2 epochs                                          │   │
 │  │                                                                      │   │
 │  │   Stage 3: LoRA Adapter Training                                    │   │
-│  │   ├── Self-evolution patterns                                       │   │
 │  │   ├── Voice integration                                              │   │
 │  │   └── Duration: 1 epoch                                              │   │
 │  │                                                                      │   │
@@ -575,7 +575,7 @@ class PersistentMemory:
 │  │                                                                      │   │
 │  │   • HumanEval, MBPP benchmarks                                      │   │
 │  │   • Tool use accuracy                                               │   │
-│  │   • Self-evolution effectiveness                                    │   │
 │  │   • Quality regression testing                                      │   │
 │  │                                                                      │   │
 │  └─────────────────────────────────────────────────────────────────────┘   │
@@ -992,7 +992,7 @@ METRICS = {
     "tool_execution_time": Histogram,
     "tool_errors": Counter,
-    # Self-evolution metrics
     "memories_created": Counter,
     "patterns_extracted": Counter,
     "improvements_applied": Counter,

 - [System Overview](#system-overview)
 - [System Components](#system-components)
 - [Data Flow](#data-flow)
+- [Pattern Memory System](#pattern-memory-system)
 - [Training Pipeline](#training-pipeline)
 - [Tool System](#tool-system)
 - [Memory System](#memory-system)
 │           │                        │                        │               │
 │           ▼                        ▼                        ▼               │
 │  ┌──────────────────┐   ┌──────────────────┐   ┌──────────────────┐       │
+│  │   MODEL LAYER    │   │   TOOL ENGINE    │   │ PATTERN MEMORY   │       │
 │  │  Qwen2.5-Coder  │   │   37 Tools       │   │  Observe/Learn   │       │
 │  │  32B + LoRA     │   │   Sandbox Exec   │   │  Memory/Train    │       │
 │  └──────────────────┘   └──────────────────┘   └──────────────────┘       │
 - **Agent (agent.py)**: Main orchestration logic
 - **Context Manager (context.py)**: Manages conversation context and truncation
 - **Tool Coordinator**: Routes tool calls and manages execution
+- **Memory Bridge**: Interfaces with the pattern memory system
 ### 4. Model Layer
 │  │                                                                       │     │
 │  │   • Format response (OpenAI-compatible)                             │     │
 │  │   • Stream chunks (if requested)                                     │     │
+│  │   • Record to pattern memory system                                  │     │
 │  │   • Update metrics                                                   │     │
 │  │                                                                       │     │
 │  └─────────────────────────────────────────────────────────────────────┘     │
 ---
+## Pattern Memory System
+Stack 2.9's pattern memory system enables continuous improvement through experience:
 ```
 ┌─────────────────────────────────────────────────────────────────────────────┐
+│                        PATTERN MEMORY ARCHITECTURE                           │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │                                                                              │
 │  ┌─────────────────────────────────────────────────────────────────────┐   │
 │  │   └── Duration: 1-2 epochs                                          │   │
 │  │                                                                      │   │
 │  │   Stage 3: LoRA Adapter Training                                    │   │
+│  │   ├── Pattern Memory patterns                                       │   │
 │  │   ├── Voice integration                                              │   │
 │  │   └── Duration: 1 epoch                                              │   │
 │  │                                                                      │   │
 │  │                                                                      │   │
 │  │   • HumanEval, MBPP benchmarks                                      │   │
 │  │   • Tool use accuracy                                               │   │
+│  │   • Pattern Memory effectiveness                                    │   │
 │  │   • Quality regression testing                                      │   │
 │  │                                                                      │   │
 │  └─────────────────────────────────────────────────────────────────────┘   │
     "tool_execution_time": Histogram,
     "tool_errors": Counter,
+    # Pattern Memory metrics
     "memories_created": Counter,
     "patterns_extracted": Counter,
     "improvements_applied": Counter,

stack-2.9-docs/BENCHMARKS.md CHANGED

@@ -63,16 +63,23 @@ Measured on A100 80GB with vLLM + AWQ 4-bit:
 ## Model Performance Benchmarks
-### Coding Benchmarks
-| Benchmark | Stack 2.9 (32B, 128K) | Stack 2.9 (32B, 32K) | Claude Code | GitHub Copilot |
-|-----------|-----------------------|-----------------------|-------------|----------------|
-| HumanEval | 76.8%                 | 76.8%                 | 84.0%       | 81.0%          |
-| MBPP      | 82.3%                 | 82.3%                 | 88.0%       | 85.0%          |
-| GSM8K     | 89.2%                 | 89.2%                 | 92.0%       | -              |
-| Tool Use  | 94.1%                 | 94.1%                 | 91.0%       | 88.0%          |
-**Observation**: Context length does not affect benchmark scores for single-turn tasks. Benefits appear in multi-turn and cross-file scenarios.
 ### Voice-First Features

 ## Model Performance Benchmarks
+⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../EVALUATION.md) for the audit report.
+### Coding Benchmarks (Actual Baseline Expectations)
+| Benchmark | Status | Notes |
+|-----------|--------|-------|
+| **HumanEval** | Pending | Full 164-problem evaluation in progress |
+| **MBPP** | Pending | Full 500-problem evaluation in progress |
+| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
+| **GSM8K** | Not started | Math reasoning evaluation planned |
+| **Context** | ✅ 128K | Token context window tested |
+**Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
+- HumanEval: ~70-72% Pass@1
+- MBPP: ~75-77% Pass@1
+Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
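
For reference when the pending evaluations are run, Pass@k is usually computed with the unbiased estimator `1 - C(n-c, k) / C(n, k)`, where `n` samples are generated per problem and `c` of them pass. The sketch below assumes that estimator; the function names and the `(n, c)` result format are illustrative and not taken from this repository.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average Pass@k over problems, given one (n, c) pair per problem."""
    if not results:
        raise ValueError("no results to score")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three hypothetical problems with 10 samples each: 3, 0, and 10 passing.
print(benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

With `k=1` this reduces to the mean fraction of passing samples, which is why a score computed over only a subset of the problems is not comparable to a full-benchmark Pass@1.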
 ### Voice-First Features

stack-2.9-docs/CONTRIBUTING.md CHANGED

@@ -496,7 +496,7 @@ class TestProcessData:
 | Integration Tests | `tests/integration/` | Test component interactions |
 | API Tests | `tests/api/` | Test API endpoints |
 | Tool Tests | `tests/tools/` | Test tool implementations |
-| Self-Evolution Tests | `tests/self_evolution/` | Test learning system |
 ### Running Tests

 | Integration Tests | `tests/integration/` | Test component interactions |
 | API Tests | `tests/api/` | Test API endpoints |
 | Tool Tests | `tests/tools/` | Test tool implementations |
+| Pattern Memory Tests | `tests/self_evolution/` | Test learning system |
 ### Running Tests

stack-2.9-docs/README.md CHANGED

@@ -1,6 +1,6 @@
 # Stack 2.9 🤖
-**Your self-evolving AI companion — gets smarter with every conversation.**
 Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Coder-32B, fine-tuned with OpenClaw tool patterns. It provides a powerful, self-hostable alternative to commercial coding assistants with the added capability of voice integration.
@@ -35,12 +35,20 @@ Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Code
 ## 📊 Benchmarks
-| Benchmark | Score | Description |
-|-----------|-------|-------------|
-| **HumanEval** | 76.8% | Python coding tasks |
-| **MBPP** | 82.3% | Python function synthesis |
-| **Tool Use** | 94.1% | OpenClaw tool patterns |
-| **Context Window** | 131K tokens | Long context understanding |
 ## 🚀 Quick Start
@@ -121,7 +129,7 @@ curl -X POST http://localhost:3000/v1/chat/completions \
 │  │                        MODEL LAYER                                     │  │
 │  │  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐    │  │
 │  │  │ Qwen2.5-Coder-32B │  │   Fine-tuned on   │  │    LoRA Adapter   │    │  │
-│  │  │   (Base Model)    │  │  OpenClaw Tools   │  │  (Self-Evolution) │    │  │
 │  │  └───────────────────┘  └───────────────────┘  └───────────────────┘    │  │
 │  └────────────────────────────────────────────────────────────────────────┘  │
 │                                    │                                        │
@@ -140,7 +148,7 @@ curl -X POST http://localhost:3000/v1/chat/completions \
 │                                    │                                        │
 │                                    ▼                                        │
 │  ┌────────────────────────────────────────────────────────────────────────┐  │
-│  │                   SELF-EVOLUTION LAYER                                │  │
 │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐               │  │
 │  │  │ Observer │──│ Learner  │──│ Memory   │──│ Trainer  │               │  │
 │  │  │  (Watches)│  │(Analyzes)│  │ (Stores) │  │(Improves)│               │  │
@@ -212,9 +220,9 @@ curl -X POST http://localhost:3000/v1/chat/completions \
 | **Data Processing** | CSV, JSON, XML, database operations |
 | **Voice** | speech-to-text, text-to-speech, voice cloning |
-### Self-Evolution Capabilities
-The self-evolution system continuously improves Stack 2.9's performance:
 1. **Observe** - Watches problem-solving processes
 2. **Learn** - Extracts patterns from successes and failures
@@ -240,7 +248,7 @@ The self-evolution system continuously improves Stack 2.9's performance:
 | **Open Source** | ✅ Apache 2.0 | ❌ Closed | ❌ Closed | ✅ LGPL |
 | **Tool Patterns** | ✅ OpenClaw | ✅ Yes | ❌ No | ❌ No |
 | **Context Window** | 131K tokens | 200K tokens | 32K tokens | 100K tokens |
-| **Self-Evolution** | ✅ Yes | ❌ No | ❌ No | ❌ No |
 | **Price** | Free | $20/month | $10/month | $12/month |
 | **Self-Hosting** | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
 | **Model Size** | 32B params | 200K+ params | 15B params | 100M params |
@@ -254,7 +262,7 @@ stack-2.9/
 │   ├── agent.py           # Agent orchestration
 │   ├── context.py         # Context management
 │   └── tools.py           # Tool implementations
-├── self_evolution/         # Self-improvement system
 │   ├── observer.py        # Behavior observation
 │   ├── learner.py         # Pattern extraction
 │   ├── memory.py          # Vector-based memory
@@ -272,11 +280,11 @@ stack-2.9/
 └── pyproject.toml         # Project metadata
 ```
-## 🔄 Self-Evolution Process
 ```
 ┌─────────────────────────────────────────────────────────────────────────────┐
-│                         SELF-EVOLUTION CYCLE                                │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │                                                                              │
 │     ┌──────────────────────────────────────────────────────────────────┐    │

 # Stack 2.9 🤖
+**Your pattern-learning AI companion — gets smarter with every conversation.**
 Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Coder-32B, fine-tuned with OpenClaw tool patterns. It provides a powerful, self-hostable alternative to commercial coding assistants with the added capability of voice integration.
 ## 📊 Benchmarks
+⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../../EVALUATION.md) for the audit report.
+| Benchmark | Status | Notes |
+|-----------|--------|-------|
+| **HumanEval** | Pending | Full 164-problem evaluation in progress |
+| **MBPP** | Pending | Full 500-problem evaluation in progress |
+| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
+| **Context Window** | ✅ 131K tokens | Long context understanding tested |
+**Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
+- HumanEval: ~70-72% Pass@1
+- MBPP: ~75-77% Pass@1
+Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
 ## 🚀 Quick Start
 │  │                        MODEL LAYER                                     │  │
 │  │  ┌───────────────────┐  ┌───────────────────┐  ┌───────────────────┐    │  │
 │  │  │ Qwen2.5-Coder-32B │  │   Fine-tuned on   │  │    LoRA Adapter   │    │  │
+│  │  │   (Base Model)    │  │  OpenClaw Tools   │  │  (Pattern Memory) │    │  │
 │  │  └───────────────────┘  └───────────────────┘  └───────────────────┘    │  │
 │  └────────────────────────────────────────────────────────────────────────┘  │
 │                                    │                                        │
 │                                    │                                        │
 │                                    ▼                                        │
 │  ┌────────────────────────────────────────────────────────────────────────┐  │
+│  │                   PATTERN MEMORY LAYER                                │  │
 │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐               │  │
 │  │  │ Observer │──│ Learner  │──│ Memory   │──│ Trainer  │               │  │
 │  │  │  (Watches)│  │(Analyzes)│  │ (Stores) │  │(Improves)│               │  │
 | **Data Processing** | CSV, JSON, XML, database operations |
 | **Voice** | speech-to-text, text-to-speech, voice cloning |
+### Pattern Memory Capabilities
+The pattern memory system continuously improves Stack 2.9's performance:
 1. **Observe** - Watches problem-solving processes
 2. **Learn** - Extracts patterns from successes and failures
 | **Open Source** | ✅ Apache 2.0 | ❌ Closed | ❌ Closed | ✅ LGPL |
 | **Tool Patterns** | ✅ OpenClaw | ✅ Yes | ❌ No | ❌ No |
 | **Context Window** | 131K tokens | 200K tokens | 32K tokens | 100K tokens |
+| **Pattern Memory** | ✅ Yes | ❌ No | ❌ No | ❌ No |
 | **Price** | Free | $20/month | $10/month | $12/month |
 | **Self-Hosting** | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
 | **Model Size** | 32B params | 200K+ params | 15B params | 100M params |
 │   ├── agent.py           # Agent orchestration
 │   ├── context.py         # Context management
 │   └── tools.py           # Tool implementations
+├── self_evolution/         # Pattern memory system
 │   ├── observer.py        # Behavior observation
 │   ├── learner.py         # Pattern extraction
 │   ├── memory.py          # Vector-based memory
 └── pyproject.toml         # Project metadata
 ```
+## 🔄 Pattern Learning Process
 ```
 ┌─────────────────────────────────────────────────────────────────────────────┐
+│                         PATTERN LEARNING CYCLE                                │
 ├─────────────────────────────────────────────────────────────────────────────┤
 │                                                                              │
 │     ┌──────────────────────────────────────────────────────────────────┐    │

stack-2.9-eval/human_eval.py CHANGED

@@ -1,20 +1,17 @@
 #!/usr/bin/env python3
 """
-HumanEval Benchmark Evaluation for Stack 2.9
 =============================================
-Evaluates code generation capabilities using the HumanEval benchmark.
-Metrics:
-- Pass@1: Fraction of problems solved with single generation (temperature=0.2)
-- Pass@10: Fraction of problems solved with 10 generations (temperature=0.8)
-- Pass@100: Fraction of problems solved with 100 generations (temperature=0.8)
-Temperature settings:
-- Pass@1: temperature=0.2, top_p=0.95 (deterministic)
-- Pass@10/100: temperature=0.8, top_p=0.95 (creative)
-Usage:
-    python human_eval.py [--model MODEL] [--output OUTPUT_DIR] [--timeout TIMEOUT]
 """
 import argparse

 #!/usr/bin/env python3
 """
+HumanEval Benchmark Evaluation for Stack 2.9 [DEPRECATED]
 =============================================
+⚠️  WARNING: This evaluation script is DEPRECATED and produces INVALID results.
+It only tests 20 out of 164 problems (12%) and returns hardcoded canonical
+solutions instead of calling a real model. The results are therefore fraudulent.
+USE THE PROPER EVALUATION INFRASTRUCTURE:
+  python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b
+See EVALUATION.md for the full audit report.
 """
 import argparse
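
The deprecation notice says the old script returned canonical solutions instead of model output. As a contrast, the sketch below shows one way a model-generated candidate could be checked against a problem's tests: write both to a temporary file and run them in a fresh interpreter with a timeout. This is an assumption about how a replacement harness might work, not the contents of `run_proper_evaluation.py`, and `run_candidate` is a hypothetical name. A subprocess is not a security sandbox, so untrusted code still needs real isolation.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if candidate_code passes test_code in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_test.py"
        script.write_text(candidate_code + "\n\n" + test_code + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Infinite loops and very slow solutions count as failures.
            return False
        # A failed assertion or any uncaught exception gives a non-zero exit code.
        return proc.returncode == 0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(run_candidate(good, tests), run_candidate(bad, tests))
```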

stack-2.9-eval/mbpp_eval.py CHANGED

@@ -1,19 +1,17 @@
 #!/usr/bin/env python3
 """
-MBPP (Mostly Basic Python Problems) Benchmark Evaluation for Stack 2.9
-=======================================================================
-Evaluates code generation capabilities using the sanitized MBPP benchmark.
-The MBPP dataset contains 974 Python problems ranging from simple
-function calls to complex algorithms. This implementation uses the
-sanitized version (MBPP-sanitized) with 500 test cases.
-Metrics:
-- Pass@1: Fraction solved with single generation
-- Pass@10: Fraction solved with 10 generations
-Usage:
-    python mbpp_eval.py [--model MODEL] [--output OUTPUT_DIR] [--timeout TIMEOUT]
 """
 import argparse

 #!/usr/bin/env python3
 """
+MBPP Benchmark Evaluation for Stack 2.9 [DEPRECATED]
+===================================================
+⚠️  WARNING: This evaluation script is DEPRECATED and produces INVALID results.
+It only tests 20 out of 500 problems (4%) and returns hardcoded canonical
+solutions instead of calling a real model. The scores are therefore fraudulent.
+USE THE PROPER EVALUATION INFRASTRUCTURE:
+  python stack-2.9-eval/run_proper_evaluation.py --benchmark mbpp --provider ollama --model qwen2.5-coder:32b
+See EVALUATION.md for the full audit report.
 """
 import argparse

stack-2.9-eval/model_client.py CHANGED

@@ -435,6 +435,139 @@ class AnthropicClient(BaseModelClient):
         return self.model
 def create_model_client(
     provider: str = "ollama",
     model: Optional[str] = None,
@@ -444,7 +577,7 @@ def create_model_client(
     Factory function to create model client.
     Args:
-        provider: One of "ollama", "openai", "anthropic"
         model: Model name (defaults to provider's default)
         **kwargs: Additional client configuration
@@ -460,8 +593,11 @@ def create_model_client(
     elif provider == "anthropic":
         default_model = model or os.environ.get("ANTHROPIC_MODEL", "claude-sonnet-4-20250514")
         return AnthropicClient(model=default_model, **kwargs)
     else:
-        raise ValueError(f"Unknown provider: {provider}. Use: ollama, openai, anthropic")
 class ModelClientPool:

         return self.model
+class OpenRouterClient(BaseModelClient):
+    """Client for OpenRouter API (unified interface for multiple models)."""
+    def __init__(
+        self,
+        model: str = "qwen/qwen2.5-coder-32b",
+        api_key: Optional[str] = None,
+        base_url: str = "https://openrouter.ai/api/v1",
+        timeout: int = 120,
+        http_referer: Optional[str] = None,
+        x_title: Optional[str] = None
+    ):
+        self.model = model
+        self.api_key = api_key or os.environ.get("OPENROUTER_API_KEY", "")
+        self.base_url = base_url
+        self.timeout = timeout
+        self.http_referer = http_referer or os.environ.get("HTTP_REFERER", "")
+        self.x_title = x_title or os.environ.get("X_TITLE", "Stack 2.9")
+        if not self.api_key:
+            raise ValueError("OpenRouter API key required. Set OPENROUTER_API_KEY environment variable.")
+    def _get_client(self):
+        """Get OpenAI-compatible client."""
+        try:
+            from openai import OpenAI
+            return OpenAI(api_key=self.api_key, base_url=self.base_url, timeout=self.timeout)
+        except ImportError:
+            raise ImportError("openai package required. Install with: pip install openai")
+    def generate(
+        self,
+        prompt: str,
+        temperature: float = 0.2,
+        max_tokens: int = 4096,
+        stop: Optional[List[str]] = None,
+        **kwargs
+    ) -> GenerationResult:
+        """Generate text using OpenRouter."""
+        client = self._get_client()
+        start_time = time.time()
+        try:
+            response = client.completions.create(
+                model=self.model,
+                prompt=prompt,
+                temperature=temperature,
+                max_tokens=max_tokens,
+                stop=stop,
+                **kwargs
+            )
+            duration = time.time() - start_time
+            result = GenerationResult(
+                text=response.choices[0].text,
+                model=self.model,
+                tokens=response.usage.completion_tokens,
+                duration=duration,
+                finish_reason=response.choices[0].finish_reason,
+                raw_response=response.model_dump()
+            )
+            return result
+        except Exception as e:
+            logger.error(f"OpenRouter request failed: {e}")
+            raise
+    def chat(
+        self,
+        messages: List[ChatMessage],
+        temperature: float = 0.2,
+        max_tokens: int = 4096,
+        tools: Optional[List[Dict]] = None,
+        **kwargs
+    ) -> GenerationResult:
+        """Generate chat response using OpenRouter."""
+        client = self._get_client()
+        # Convert messages to chat format
+        chat_messages = [{"role": m.role, "content": m.content} for m in messages]
+        request_params = {
+            "model": self.model,
+            "messages": chat_messages,
+            "temperature": temperature,
+            "max_tokens": max_tokens,
+        }
+        if tools:
+            request_params["tools"] = tools
+        request_params.update(kwargs)
+        # Add OpenRouter-specific headers
+        extra_headers = {}
+        if self.http_referer:
+            extra_headers["HTTP-Referer"] = self.http_referer
+        if self.x_title:
+            extra_headers["X-Title"] = self.x_title
+        start_time = time.time()
+        try:
+            response = client.chat.completions.create(
+                extra_headers=extra_headers if extra_headers else None,
+                **request_params
+            )
+            duration = time.time() - start_time
+            msg = response.choices[0].message
+            text = msg.content or ""
+            result = GenerationResult(
+                text=text,
+                model=self.model,
+                tokens=response.usage.completion_tokens,
+                duration=duration,
+                finish_reason=response.choices[0].finish_reason,
+                raw_response=response.model_dump()
+            )
+            return result
+        except Exception as e:
+            logger.error(f"OpenRouter chat request failed: {e}")
+            raise
+    def get_model_name(self) -> str:
+        return self.model
 def create_model_client(
     provider: str = "ollama",
     model: Optional[str] = None,
     Factory function to create model client.
     Args:
+        provider: One of "ollama", "openai", "anthropic", "openrouter"
         model: Model name (defaults to provider's default)
         **kwargs: Additional client configuration
     elif provider == "anthropic":
         default_model = model or os.environ.get("ANTHROPIC_MODEL", "claude-sonnet-4-20250514")
         return AnthropicClient(model=default_model, **kwargs)
+    elif provider == "openrouter":
+        default_model = model or os.environ.get("OPENROUTER_MODEL", "qwen/qwen2.5-coder-32b")
+        return OpenRouterClient(model=default_model, **kwargs)
     else:
+        raise ValueError(f"Unknown provider: {provider}. Use: ollama, openai, anthropic, openrouter")
 class ModelClientPool:
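
Each new provider currently means another `elif` branch plus a manual update to the error message. One possible alternative, sketched below, is a lookup table keyed by provider name, so the "Use: ..." list in the error is generated from the same data. `StubClient`, `PROVIDER_DEFAULTS`, and `create_client_sketch` are hypothetical names for illustration; only the two environment variables and default model strings are taken from the diff, and the real client classes are replaced by a stub so the example is self-contained.

```python
import os
from typing import Dict, Optional, Tuple


class StubClient:
    """Hypothetical placeholder standing in for the real client classes."""

    def __init__(self, model: str):
        self.model = model


# provider -> (environment variable with a model override, fallback model)
PROVIDER_DEFAULTS: Dict[str, Tuple[str, str]] = {
    "anthropic": ("ANTHROPIC_MODEL", "claude-sonnet-4-20250514"),
    "openrouter": ("OPENROUTER_MODEL", "qwen/qwen2.5-coder-32b"),
}


def create_client_sketch(provider: str, model: Optional[str] = None) -> StubClient:
    """Resolve the model name, then build a client for the given provider."""
    try:
        env_var, fallback = PROVIDER_DEFAULTS[provider]
    except KeyError:
        known = ", ".join(sorted(PROVIDER_DEFAULTS))
        raise ValueError(f"Unknown provider: {provider}. Use: {known}") from None
    # An explicit argument wins, then the environment, then the built-in default.
    return StubClient(model or os.environ.get(env_var, fallback))


print(create_client_sketch("openrouter", model="qwen/qwen2.5-coder-32b").model)
```

The resolution order matches the `model or os.environ.get(...)` pattern already used in the existing branches.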

stack-2.9-eval/tool_use_eval.py CHANGED

@@ -1,22 +1,18 @@
 #!/usr/bin/env python3
 """
-Tool Use Evaluation for Stack 2.9
-===================================
-Evaluates tool calling capabilities across 500+ test cases covering:
-- File operations (read, write, edit, glob)
-- Git operations (status, commit, push, branch)
-- Search operations (grep, web search)
-- Execution operations (bash, shell commands)
-- System operations (task management, config)
-Metrics:
-- Tool Selection Accuracy: Correct tool chosen for task
-- Parameter Accuracy: Correct parameters provided
-- Execution Success Rate: Task completed successfully
-- Overall Success Rate: Combined metric
-Usage:
-    python tool_use_eval.py [--model MODEL] [--output OUTPUT_DIR]
 """
 import argparse

 #!/usr/bin/env python3
 """
+Tool Use Evaluation for Stack 2.9 [DEPRECATED]
+==============================================
+⚠️  WARNING: This evaluation script is DEPRECATED and the methodology is INVALID.
+This evaluator uses a naive keyword-matching simulation, not actual model inference.
+There is no proper benchmark implementation for tool calling. The claimed 94.1%
+score is unverifiable and misleading.
+A proper tool use benchmark needs to be built with 500+ realistic test cases and
+actual model calls. This script remains only as a placeholder.
+See EVALUATION.md for the full audit report.
 """
 import argparse
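
The removed docstring named two metrics, Tool Selection Accuracy and Parameter Accuracy, that a rebuilt benchmark would still need to compute from actual model output. The sketch below shows one plausible scoring function, assuming each test case has a single expected call and that parameters must match exactly. `ToolCall` and `score_tool_calls` are hypothetical names, and the predicted calls would have to come from parsed model responses rather than keyword matching.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """One tool invocation: the tool's name and its keyword parameters."""
    name: str
    params: Dict[str, Any] = field(default_factory=dict)


def score_tool_calls(expected: List[ToolCall], predicted: List[ToolCall]) -> Dict[str, float]:
    """Compare predictions to expectations pairwise, one pair per test case."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    total = len(expected)
    if total == 0:
        raise ValueError("no test cases to score")
    right_tool = sum(e.name == p.name for e, p in zip(expected, predicted))
    # Parameters only count when the tool itself was chosen correctly.
    right_call = sum(
        e.name == p.name and e.params == p.params for e, p in zip(expected, predicted)
    )
    return {
        "tool_selection_accuracy": right_tool / total,
        "parameter_accuracy": right_call / total,
    }


expected = [ToolCall("read", {"path": "a.py"}), ToolCall("grep", {"pattern": "x"})]
predicted = [ToolCall("read", {"path": "a.py"}), ToolCall("grep", {"pattern": "y"})]
print(score_tool_calls(expected, predicted))
```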

stack_cli/cli.py CHANGED

@@ -509,7 +509,7 @@ Examples:
     parser.add_argument(
         '--patterns',
         choices=['list', 'stats', 'clear'],
-        help="Manage patterns for self-evolution"
     )
     # Training

     parser.add_argument(
         '--patterns',
         choices=['list', 'stats', 'clear'],
+        help="Manage learned patterns"
     )
     # Training

website/benchmark.html CHANGED

@@ -42,15 +42,15 @@
             <p class="subtitle">Stack 2.9 vs Leading AI Models</p>
             <div class="benchmark-summary">
                 <div class="summary-card">
-                    <div class="summary-value">76.8%</div>
                     <div class="summary-label">HumanEval</div>
                 </div>
                 <div class="summary-card">
-                    <div class="summary-value">82.3%</div>
                     <div class="summary-label">MBPP</div>
                 </div>
                 <div class="summary-card highlight">
-                    <div class="summary-value">94.1%</div>
                     <div class="summary-label">Tool Use</div>
                 </div>
                 <div class="summary-card">
@@ -114,10 +114,10 @@
                     <tbody>
                         <tr class="highlight-row">
                             <td><strong>Stack 2.9</strong></td>
-                            <td>76.8%</td>
-                            <td>82.3%</td>
-                            <td>21.4%</td>
-                            <td class="best">94.1%</td>
                             <td>32B</td>
                         </tr>
                         <tr>
@@ -303,7 +303,7 @@
                 <div class="footer-brand">
                     <span class="logo-icon">🤖</span>
                     <span>Stack 2.9</span>
-                    <p>Your self-evolving AI companion</p>
                 </div>
                 <div class="footer-links">
                     <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
@@ -335,8 +335,8 @@
                         labels: ['HumanEval', 'MBPP', 'SWE-bench', 'Tool Use'],
                         datasets: [
                             {
-                                label: 'Stack 2.9',
-                                data: [76.8, 82.3, 21.4, 94.1],
                                 backgroundColor: '#6366f1',
                                 borderRadius: 8,
                             },
@@ -404,8 +404,8 @@
                         labels: ['Base', '10 convos', '50 convos', '100 convos', '200 convos', '500 convos'],
                         datasets: [
                             {
-                                label: 'Stack 2.9',
-                                data: [70, 73, 78, 82, 86, 91],
                                 borderColor: '#6366f1',
                                 backgroundColor: 'rgba(99, 102, 241, 0.1)',
                                 fill: true,

             <p class="subtitle">Stack 2.9 vs Leading AI Models</p>
             <div class="benchmark-summary">
                 <div class="summary-card">
+                    <div class="summary-value">TBD</div>
                     <div class="summary-label">HumanEval</div>
                 </div>
                 <div class="summary-card">
+                    <div class="summary-value">TBD</div>
                     <div class="summary-label">MBPP</div>
                 </div>
                 <div class="summary-card highlight">
+                    <div class="summary-value">TBD</div>
                     <div class="summary-label">Tool Use</div>
                 </div>
                 <div class="summary-card">
                     <tbody>
                         <tr class="highlight-row">
                             <td><strong>Stack 2.9</strong></td>
+                            <td>TBD</td>
+                            <td>TBD</td>
+                            <td>TBD</td>
+                            <td class="best">TBD</td>
                             <td>32B</td>
                         </tr>
                         <tr>
                 <div class="footer-brand">
                     <span class="logo-icon">🤖</span>
                     <span>Stack 2.9</span>
+                    <p>Your pattern-learning AI companion</p>
                 </div>
                 <div class="footer-links">
                     <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
                         labels: ['HumanEval', 'MBPP', 'SWE-bench', 'Tool Use'],
                         datasets: [
                             {
+                                label: 'Stack 2.9 (pending verification)',
+                                data: [0, 0, 0, 0],
                                 backgroundColor: '#6366f1',
                                 borderRadius: 8,
                             },
                         labels: ['Base', '10 convos', '50 convos', '100 convos', '200 convos', '500 convos'],
                         datasets: [
                             {
+                                label: 'Stack 2.9 (evaluation pending)',
+                                data: [null, null, null, null, null, null],
                                 borderColor: '#6366f1',
                                 backgroundColor: 'rgba(99, 102, 241, 0.1)',
                                 fill: true,

website/index.html CHANGED

@@ -3,7 +3,7 @@
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>Stack 2.9 — Your Self-Evolving AI Companion</title>
     <link rel="stylesheet" href="styles.css">
     <link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><text y='.9em' font-size='90'>🤖</text></svg>">
     <meta name="description" content="Stack 2.9 - Open-source AI that learns, adapts, and improves itself over time. Built on Qwen2.5-Coder-32B.">
@@ -83,7 +83,7 @@
             <div class="features-grid">
                 <div class="feature-card">
                     <div class="feature-icon">🧠</div>
-                    <h3>Self-Evolving</h3>
                     <p>Learns from every conversation and task. Improves its own capabilities through experience. Gets smarter the more you use it.</p>
                 </div>
                 <div class="feature-card">
@@ -121,24 +121,24 @@
             <p class="section-subtitle">Competitive results on standard coding benchmarks</p>
             <div class="benchmark-grid">
                 <div class="benchmark-card">
-                    <div class="benchmark-value">76.8%</div>
                     <div class="benchmark-label">HumanEval</div>
                     <div class="benchmark-bar">
-                        <div class="benchmark-fill" style="width: 76.8%"></div>
                     </div>
                 </div>
                 <div class="benchmark-card">
-                    <div class="benchmark-value">82.3%</div>
                     <div class="benchmark-label">MBPP</div>
                     <div class="benchmark-bar">
-                        <div class="benchmark-fill" style="width: 82.3%"></div>
                     </div>
                 </div>
                 <div class="benchmark-card highlight">
-                    <div class="benchmark-value">94.1%</div>
                     <div class="benchmark-label">Tool Use</div>
                     <div class="benchmark-bar">
-                        <div class="benchmark-fill" style="width: 94.1%"></div>
                     </div>
                 </div>
                 <div class="benchmark-card">
@@ -184,7 +184,7 @@
     <section class="how-it-works">
         <div class="container">
-            <h2 class="section-title">How Self-Evolution Works</h2>
             <div class="steps">
                 <div class="step">
                     <div class="step-number">1</div>
@@ -212,7 +212,7 @@
                 <div class="step-arrow">→</div>
                 <div class="step">
                     <div class="step-number">5</div>
-                    <h3>Evolve</h3>
                     <p>Gradually becomes smarter</p>
                 </div>
             </div>
@@ -257,7 +257,7 @@
                 <div class="footer-brand">
                     <span class="logo-icon">🤖</span>
                     <span>Stack 2.9</span>
-                    <p>Your self-evolving AI companion</p>
                 </div>
                 <div class="footer-links">
                     <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>

 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Stack 2.9 — Your Pattern-Learning AI Companion</title>
     <link rel="stylesheet" href="styles.css">
     <link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><text y='.9em' font-size='90'>🤖</text></svg>">
     <meta name="description" content="Stack 2.9 - Open-source AI that learns, adapts, and improves itself over time. Built on Qwen2.5-Coder-32B.">
             <div class="features-grid">
                 <div class="feature-card">
                     <div class="feature-icon">🧠</div>
+                    <h3>Pattern Learning</h3>
                     <p>Learns from every conversation and task. Improves its own capabilities through experience. Gets smarter the more you use it.</p>
                 </div>
                 <div class="feature-card">
             <p class="section-subtitle">Competitive results on standard coding benchmarks</p>
             <div class="benchmark-grid">
                 <div class="benchmark-card">
+                    <div class="benchmark-value">TBD</div>
                     <div class="benchmark-label">HumanEval</div>
                     <div class="benchmark-bar">
+                        <div class="benchmark-fill" style="width: 0%"></div>
                     </div>
                 </div>
                 <div class="benchmark-card">
+                    <div class="benchmark-value">TBD</div>
                     <div class="benchmark-label">MBPP</div>
                     <div class="benchmark-bar">
+                        <div class="benchmark-fill" style="width: 0%"></div>
                     </div>
                 </div>
                 <div class="benchmark-card highlight">
+                    <div class="benchmark-value">TBD</div>
                     <div class="benchmark-label">Tool Use</div>
                     <div class="benchmark-bar">
+                        <div class="benchmark-fill" style="width: 0%"></div>
                     </div>
                 </div>
                 <div class="benchmark-card">
     <section class="how-it-works">
         <div class="container">
+            <h2 class="section-title">How Pattern Learning Works</h2>
             <div class="steps">
                 <div class="step">
                     <div class="step-number">1</div>
                 <div class="step-arrow">→</div>
                 <div class="step">
                     <div class="step-number">5</div>
+                    <h3>Improve</h3>
                     <p>Gradually becomes smarter</p>
                 </div>
             </div>
                 <div class="footer-brand">
                     <span class="logo-icon">🤖</span>
                     <span>Stack 2.9</span>
+                    <p>Your pattern-learning AI companion</p>
                 </div>
                 <div class="footer-links">
                     <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>