walidsobhie-code commited on
Commit
2088481
Β·
1 Parent(s): 6a89842

Critical fixes: 1) Rename misleading 'self-evolving' claims to accurate 'pattern memory' system across all docs and code. 2) Add missing GPU requirements, document cloud deployment (RunPod/Vast), and implement OpenRouter integration in model_client.py with factory support. 3) Document 37 built-in tools with full schemas in docs/tools.md. 4) Expose fraudulent evaluation scores (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use), remove them from README/BENCHMARKS/website, add EVALUATION.md audit report, and deprecation warnings to flawed eval scripts. Also updated HuggingFace Space demo with correct terminology.

Browse files
EVALUATION.md ADDED
@@ -0,0 +1,126 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Evaluation Audit & Methodology
2
+
3
+ **Status:** Under Independent Verification
4
+
5
+ ## Critical Findings
6
+
7
+ After comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:
8
+
9
+ ### 1. Incomplete Test Sets
10
+
11
+ - **HumanEval**: Only **20 out of 164 problems** (~12%) were evaluated
12
+ - **MBPP**: Only **20 out of 500 problems** (~4%) were evaluated
13
+
14
+ The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore **not representative** of full benchmark performance.
15
+
16
+ ### 2. Missing Model Inference
17
+
18
+ Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:
19
+
20
+ - The scripts return **pre-written canonical solutions** instead of actual model inference
21
+ - No API calls to Ollama/OpenAI/Anthropic providers were made
22
+ - No model-generated outputs exist in the `results/` directory
23
+ - The `results/humaneval.json` file contains 0% failure rate from a broken run
24
+
25
+ **Conclusion:** The benchmark numbers appear to be fabricated or at best, unverified.
26
+
27
+ ### 3. Tool Use Benchmark Unimplemented
28
+
29
+ The claimed 94.1% Tool Use score lacks:
30
+ - Any proper benchmark dataset
31
+ - Defined evaluation methodology
32
+ - Reproduction instructions
33
+ - Actual model calls to test tool selection accuracy
34
+
35
+ It appears to be a custom, non-standard metric with no basis in accepted benchmarks.
36
+
37
+ ---
38
+
39
+ ## Proper Evaluation Framework
40
+
41
+ We have built a new, rigorous evaluation infrastructure:
42
+
43
+ ### Official Datasets
44
+
45
+ ```bash
46
+ # Download HumanEval (164 problems) and MBPP (500 problems)
47
+ python scripts/download_benchmark_datasets.py --data-dir ./data
48
+ ```
49
+
50
+ This script fetches:
51
+ - HumanEval from OpenAI's official dataset
52
+ - MBPP from Google'sbenchmark suite
53
+ - Ensures correct formatting and ground truth solutions
54
+
55
+ ### Unified Evaluation Runner
56
+
57
+ `stack-2.9-eval/run_proper_evaluation.py` provides:
58
+
59
+ ```bash
60
+ python stack_2_9_eval/run_proper_evaluation.py \
61
+ --benchmark humaneval \
62
+ --provider ollama \
63
+ --model qwen2.5-coder:32b \
64
+ --k-samples 100 \
65
+ --output-dir ./results/humaneval_run
66
+ ```
67
+
68
+ Features:
69
+ - Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
70
+ - Proper `pass@k` calculation with confidence intervals
71
+ - Per-problem detailed logs (JSON)
72
+ - Reproducible random sampling (seeds)
73
+ - Parallel evaluation (configurable workers)
74
+
75
+ ### Evaluation Checklist
76
+
77
+ To ensure transparency, every proper evaluation must:
78
+
79
+ 1. βœ… Use full official benchmark (164 HumanEval, 500 MBPP)
80
+ 2. βœ… Call real model inference via `model_client.py`
81
+ 3. βœ… Run with kβ‰₯100 samples for pass@1 estimation
82
+ 4. βœ… Store all generation outputs for audit
83
+ 5. βœ… Compute standard deviation and confidence intervals
84
+ 6. βœ… Publish full JSON logs to `results/` directory
85
+ 7. βœ… Document exact model version, quantization, and provider settings
86
+
87
+ ---
88
+
89
+ ## Current Status
90
+
91
+ The previously claimed scores have been **removed** from README.md and BENCHMARKS.md. They are replaced with:
92
+
93
+ | Benchmark | Status | Notes |
94
+ |-----------|--------|-------|
95
+ | HumanEval | Pending verification | Full 164-problem evaluation setup ready |
96
+ | MBPP | Pending verification | Full 500-problem evaluation setup ready |
97
+ | Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
98
+ | GSM8K | Not started | Math reasoning evaluation planned |
99
+
100
+ Expected baseline (Qwen2.5-Coder-32B):
101
+ - HumanEval: ~70-72% Pass@1
102
+ - MBPP: ~75-77% Pass@1
103
+
104
+ Stack 2.9's fine-tuned performance will be published after running proper evaluations.
105
+
106
+ ---
107
+
108
+ ## What Changed
109
+
110
+ - Created `scripts/download_benchmark_datasets.py` for official datasets
111
+ - Created `stack-2.9-eval/run_proper_evaluation.py` unified runner
112
+ - Created `stack-2.9-eval/test_evaluation_setup.py` to validate environment
113
+ - Added deprecation warnings to flawed `human_eval.py`, `mbpp_eval.py`, `tool_use_eval.py`
114
+ - Updated README.md, BENCHMARKS.md, website pages to remove false claims
115
+
116
+ ---
117
+
118
+ ## How to Publish Verified Scores
119
+
120
+ 1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
121
+ 2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100`
122
+ 3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
123
+ 4. Update README.md with actual numbers once verified
124
+ 5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility
125
+
126
+ **Do NOT publish** the previously claimed percentages. They are invalid.
README.md CHANGED
@@ -1,6 +1,6 @@
1
  <p align="center">
2
  <img src="https://img.shields.io/github/stars/my-ai-stack/stack-2.9" alt="Stars">
3
- <img src="https://img.shields.io/github/license/my-ai-stack/stack-2.9" alt="License">
4
  <img src="https://img.shields.io/python version/3.10+-blue" alt="Python">
5
  <img src="https://img.shields.io/discord" alt="Discord">
6
  </p>
@@ -10,10 +10,10 @@
10
  # Stack 2.9 πŸ€–
11
 
12
  <p align="center">
13
- <strong>The self-evolving AI coding assistant that gets smarter with every interaction.</strong>
14
  </p>
15
 
16
- Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. Unlike static models, Stack 2.9 learns from your code, extracts patterns from successful solutions, and continuously evolves to become your project-specific expert.
17
 
18
  ---
19
 
@@ -21,15 +21,72 @@ Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. Un
21
 
22
  | Feature | Description |
23
  |---------|-------------|
24
- | **🧠 Self-Evolving** | Learns from every interaction. Stores patterns, tracks success rates, and improves over time |
25
- | **πŸ’» Code Generation** | 76.8% HumanEval, 82.3% MBPP accuracy on code generation tasks |
26
  | **πŸ”§ 37 Built-in Tools** | File ops, search, shell commands, git, and more |
27
- | **🌐 Multi-Provider** | Works with Ollama, OpenAI, Anthropic β€” or bring your own model |
28
  | **πŸ“± Terminal UI** | Beautiful interactive CLI with chat, benchmarks, and training |
29
  | **πŸ”’ Self-Hosted** | Run locally, own your data, deploy anywhere |
30
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
  ---
32
 
 
 
33
  ## πŸš€ Quick Start
34
 
35
  ### Installation
@@ -43,6 +100,26 @@ cd stack-2.9
43
  pip install -r requirements.txt
44
  ```
45
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
46
  ### Interactive Chat
47
 
48
  ```bash
@@ -77,7 +154,7 @@ python stack.py --patterns stats
77
  ```
78
  $ python stack.py
79
  ╔═══════════════════════════════════════════════════════════╗
80
- β•‘ Stack 2.9 - Self-Evolving AI β•‘
81
  β•‘ Your AI coding companion β•‘
82
  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
83
 
@@ -120,7 +197,7 @@ result = client.generate("Write a function to reverse a string")
120
  print(result.text)
121
  ```
122
 
123
- ### Pattern Mining (Self-Evolution)
124
 
125
  ```python
126
  from stack_2_9_training.pattern_miner import PatternMiner
@@ -143,13 +220,15 @@ print(f"Found {len(patterns)} relevant patterns")
143
 
144
  ## πŸ“Š Benchmarks
145
 
146
- | Benchmark | Score | Description |
147
- |-----------|-------|-------------|
148
- | **HumanEval** | 76.8% | Python code generation |
149
- | **MBPP** | 82.3% | Programming problem solving |
150
- | **Tool Use** | 94.1% | Tool calling accuracy |
151
- | **GSM8K** | 85%+ | Math reasoning |
152
- | **Context** | 128K | Token context window |
 
 
153
 
154
  ---
155
 
@@ -170,6 +249,14 @@ export OPENAI_MODEL=gpt-4o
170
  # Anthropic
171
  export MODEL_PROVIDER=anthropic
172
  export ANTHROPIC_API_KEY=sk-ant-...
 
 
 
 
 
 
 
 
173
  ```
174
 
175
  ### Configuration File
@@ -202,7 +289,7 @@ eval:
202
  β”‚ chat_mode β”‚ eval_mode β”‚ pattern_mode β”‚ train β”‚
203
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
204
  β”‚ Model Client Layer β”‚
205
- β”‚ OllamaClient β”‚ OpenAIClient β”‚ AnthropicClient β”‚
206
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
207
  β”‚ Self-Evolution Layer β”‚
208
  β”‚ pattern_miner β”‚ data_quality β”‚ train_lora β”‚
@@ -319,4 +406,4 @@ Licensed under the Apache License 2.0. See [LICENSE](LICENSE) for details.
319
 
320
  <p align="center">
321
  Built with ❀️ for developers who want an AI that grows with them
322
- </p>
 
1
  <p align="center">
2
  <img src="https://img.shields.io/github/stars/my-ai-stack/stack-2.9" alt="Stars">
3
+ <img src="https://img.shields.io/github/license/my-ai-stack-stack-2.9" alt="License">
4
  <img src="https://img.shields.io/python version/3.10+-blue" alt="Python">
5
  <img src="https://img.shields.io/discord" alt="Discord">
6
  </p>
 
10
  # Stack 2.9 πŸ€–
11
 
12
  <p align="center">
13
+ <strong>The pattern-based AI coding assistant that improves through experience.</strong>
14
  </p>
15
 
16
+ Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. It features **Pattern Memory with Retrieval** - learning from interactions by storing successful patterns and retrieving them for future tasks, becoming more helpful through accumulated experience.
17
 
18
  ---
19
 
 
21
 
22
  | Feature | Description |
23
  |---------|-------------|
24
+ | **🧠 Pattern Memory** | Learns from interactions. Stores successful patterns, tracks success rates, and retrieves relevant precedents for new tasks |
25
+ | **πŸ’» Code Generation** | Evaluation in progress (see Benchmarks section) |
26
  | **πŸ”§ 37 Built-in Tools** | File ops, search, shell commands, git, and more |
27
+ | **🌐 Multi-Provider** | Works with Ollama, OpenAI, Anthropic, OpenRouter β€” or bring your own model |
28
  | **πŸ“± Terminal UI** | Beautiful interactive CLI with chat, benchmarks, and training |
29
  | **πŸ”’ Self-Hosted** | Run locally, own your data, deploy anywhere |
30
 
31
+ ## πŸ“Š Benchmark Evaluation
32
+
33
+ ### Evaluation Status
34
+
35
+ ⚠️ **Important**: The benchmark scores previously listed in this README (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) have been **removed pending verification**. An audit of the evaluation infrastructure revealed that:
36
+
37
+ - **HumanEval & MBPP implementations had only 20 problems** (1-4% of full benchmarks)
38
+ - **No proper model inference logs exist** for the claimed numbers
39
+ - **Tool Use evaluation lacked a proper benchmark** implementation
40
+
41
+ These scores were therefore **unverifiable** and potentially misleading.
42
+
43
+ ### Current Evaluation Framework
44
+
45
+ We are rebuilding the evaluation infrastructure with proper methodology:
46
+
47
+ 1. **Official datasets**: HumanEval (164 problems), MBPP (500 problems)
48
+ 2. **Reproducible runs**: Full logs, config files, and per-problem results
49
+ 3. **Standard metrics**: Pass@1 with confidence intervals, using kβ‰₯100 samples
50
+ 4. **Transparent methodology**: All code and data publicly available
51
+
52
+ See [EVALUATION.md](EVALUATION.md) for the full audit report and methodology.
53
+
54
+ ### Running Evaluations
55
+
56
+ Once datasets are prepared, run proper evaluations:
57
+
58
+ ```bash
59
+ # Download official datasets (one-time)
60
+ python scripts/download_benchmark_datasets.py --data-dir ./data
61
+
62
+ # Run evaluation with a model provider
63
+ python stack_2_9_eval/run_proper_evaluation.py \
64
+ --benchmark humaneval \
65
+ --provider ollama \
66
+ --model qwen2.5-coder:32b \
67
+ --k-samples 100 \
68
+ --output-dir ./results/humaneval_run
69
+ ```
70
+
71
+ Or use the built-in CLI:
72
+
73
+ ```bash
74
+ python stack.py --eval all --provider ollama --eval-model qwen2.5-coder:32b
75
+ ```
76
+
77
+ ### Expected Results (Base Model)
78
+
79
+ For reference, the base Qwen2.5-Coder-32B typically scores:
80
+
81
+ - HumanEval: ~70-72% Pass@1
82
+ - MBPP: ~75-77% Pass@1
83
+
84
+ Stack 2.9's fine-tuned performance will be published after proper evaluation.
85
+
86
  ---
87
 
88
+
89
+
90
  ## πŸš€ Quick Start
91
 
92
  ### Installation
 
100
  pip install -r requirements.txt
101
  ```
102
 
103
+ ### Hardware Requirements
104
+
105
+ Stack 2.9 requires a GPU for optimal performance. Minimum and recommended configurations:
106
+
107
+ | Configuration | Minimum | Recommended | Production |
108
+ |---------------|---------|-------------|------------|
109
+ | **GPU** | NVIDIA 8GB VRAM | NVIDIA 24GB VRAM | NVIDIA 40-80GB (A100/H100) |
110
+ | **RAM** | 16GB | 32GB | 64GB+ |
111
+ | **Disk** | 20GB free | 50GB free | 100GB+ (NVMe) |
112
+ | **CUDA** | 11.8 | 12.1 | 12.1+ |
113
+ | **Models** | 7B quantized | 32B quantized | 70B+ quantized |
114
+
115
+ **Notes:**
116
+ - CPU-only mode is possible but extremely slow (not recommended for production)
117
+ - AWQ/GPTQ quantization reduces VRAM requirements by ~50%
118
+ - Multi-GPU (tensor parallelism) supported for large models
119
+ - Ensure NVIDIA drivers and CUDA toolkit are installed
120
+
121
+ For detailed deployment options (Docker, RunPod, Vast.ai, Kubernetes), see `stack-2.9-deploy/README.md`.
122
+
123
  ### Interactive Chat
124
 
125
  ```bash
 
154
  ```
155
  $ python stack.py
156
  ╔═══════════════════════════════════════════════════════════╗
157
+ β•‘ Stack 2.9 - Pattern Memory AI β•‘
158
  β•‘ Your AI coding companion β•‘
159
  β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
160
 
 
197
  print(result.text)
198
  ```
199
 
200
+ ### Pattern Mining (Pattern Memory)
201
 
202
  ```python
203
  from stack_2_9_training.pattern_miner import PatternMiner
 
220
 
221
  ## πŸ“Š Benchmarks
222
 
223
+ ⚠️ **Benchmark scores are currently under independent verification.** See [Evaluation Status](#-benchmark-evaluation) above for details.
224
+
225
+ | Benchmark | Status | Notes |
226
+ |-----------|--------|-------|
227
+ | **HumanEval** | Pending | Full 164-problem evaluation in progress |
228
+ | **MBPP** | Pending | Full 500-problem evaluation in progress |
229
+ | **Tool Use** | Pending | Custom tool-calling benchmark to be created |
230
+ | **GSM8K** | Not started | Math reasoning evaluation planned |
231
+ | **Context** | βœ… 128K | Token context window tested |
232
 
233
  ---
234
 
 
249
  # Anthropic
250
  export MODEL_PROVIDER=anthropic
251
  export ANTHROPIC_API_KEY=sk-ant-...
252
+
253
+ # OpenRouter
254
+ export MODEL_PROVIDER=openrouter
255
+ export OPENROUTER_API_KEY=sk-or-v1-...
256
+ export OPENROUTER_MODEL=qwen/qwen2.5-coder-32b
257
+ # Optional: customize referer and title for OpenRouter dashboard
258
+ export HTTP_REFERER=https://your-app.com
259
+ export X_TITLE="Stack 2.9"
260
  ```
261
 
262
  ### Configuration File
 
289
  β”‚ chat_mode β”‚ eval_mode β”‚ pattern_mode β”‚ train β”‚
290
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
291
  β”‚ Model Client Layer β”‚
292
+ β”‚ OllamaClient β”‚ OpenAIClient β”‚ AnthropicClient β”‚ OpenRouterClient β”‚
293
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
294
  β”‚ Self-Evolution Layer β”‚
295
  β”‚ pattern_miner β”‚ data_quality β”‚ train_lora β”‚
 
406
 
407
  <p align="center">
408
  Built with ❀️ for developers who want an AI that grows with them
409
+ </p>
docs/tools.md ADDED
@@ -0,0 +1,206 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Stack 2.9 Tools Reference
2
+
3
+ Stack 2.9 provides **37 built-in tools** for file operations, system commands, git, web search, and more. Tools are selected automatically based on user intent, or can be called explicitly via the agent API.
4
+
5
+ ## Tool Calling Format
6
+
7
+ Tools use a **function schema** format similar to OpenAI's function calling:
8
+
9
+ ```python
10
+ {
11
+ "name": "tool_name",
12
+ "description": "What the tool does",
13
+ "parameters": {
14
+ "type": "object",
15
+ "properties": {
16
+ "param1": {"type": "string", "description": "Parameter description"},
17
+ "param2": {"type": "integer", "description": "Another parameter"}
18
+ },
19
+ "required": ["param1"]
20
+ }
21
+ }
22
+ ```
23
+
24
+ The agent determines which tools to call and with what arguments based on the user query.
25
+
26
+ ---
27
+
28
+ ## Complete Tool List
29
+
30
+ ### File Operations
31
+
32
+ | Tool | Description | Parameters |
33
+ |------|-------------|------------|
34
+ | `read` | Read file contents | `path` (string, required) |
35
+ | `write` | Write content to file | `path` (string, required), `content` (string, required) |
36
+ | `edit` | Edit file with sed-like replacements | `path` (string, required), `old_text` (string, required), `new_text` (string, required) |
37
+ | `create_directory` | Create a new directory | `path` (string, required) |
38
+ | `list_directory` | List contents of a directory | `path` (string, default: ".") |
39
+ | `search` | Search for files matching a pattern | `pattern` (string, required), `path` (string, default: ".") |
40
+ | `get_file_info` | Get file metadata (size, timestamps, permissions) | `path` (string, required) |
41
+ | `move_file` | Move or rename a file/directory | `source` (string, required), `destination` (string, required) |
42
+ | `copy_file` | Copy a file (implementation pending) | `source` (string, required), `destination` (string, required) |
43
+ | `delete_file` | Delete a file | `path` (string, required) |
44
+
45
+ ### Git Operations
46
+
47
+ | Tool | Description | Parameters |
48
+ |------|-------------|------------|
49
+ | `git_status` | Get git repository status | (no parameters) |
50
+ | `git_log` | View commit history | `max_count` (integer, default: 10), `path` (string, optional) |
51
+ | `git_diff` | Show changes between commits or working tree | `commit` (string, optional), `path` (string, optional) |
52
+ | `git_commit` | Commit staged changes | `message` (string, required), `all` (boolean, default: false) |
53
+ | `git_add` | Stage files for commit | `paths` (array of strings, required) |
54
+ | `git_push` | Push commits to remote | `remote` (string, default: "origin"), `branch` (string, optional) |
55
+ | `git_pull` | Pull from remote | `remote` (string, default: "origin"), `branch` (string, optional) |
56
+ | `git_branch` | List or create branches | `create` (string, optional), `delete` (string, optional), `checkout` (string, optional) |
57
+ | `git_clone` | Clone a repository | `url` (string, required), `path` (string, optional) |
58
+ | `git_remote` | Manage remotes | `action` (string, required: "add|remove|list"), `name` (string), `url` (string) |
59
+
60
+ ### Shell & Execution
61
+
62
+ | Tool | Description | Parameters |
63
+ |------|-------------|------------|
64
+ | `run` | Execute shell command | `command` (string, required), `timeout` (integer, default: 30), `cwd` (string, optional) |
65
+ | `run_background` | Run command in background | `command` (string, required), `yield_ms` (integer, default: 10000) |
66
+ | `test` | Run tests (pytest, unittest) | `path` (string, default: "."), `pattern` (string, default: "test_*.py") |
67
+ | `lint` | Lint code (flake8, pylint, eslint) | `path` (string, default: "."), `tool` (string, default: "auto") |
68
+ | `format` | Format code (black, prettier, gofmt) | `path` (string, default: "."), `tool` (string, default: "auto") |
69
+
70
+ ### Web & Search
71
+
72
+ | Tool | Description | Parameters |
73
+ |------|-------------|------------|
74
+ | `web_search` | Search the web via Brave | `query` (string, required), `count` (integer, default: 10) |
75
+ | `fetch` | Fetch and extract content from URL | `url` (string, required), `max_chars` (integer, default: 5000) |
76
+ | `download` | Download a file | `url` (string, required), `output_path` (string, required) |
77
+
78
+ ### Memory & Knowledge
79
+
80
+ | Tool | Description | Parameters |
81
+ |------|-------------|------------|
82
+ | `memory_recall` | Search memory for relevant entries | `query` (string, required), `limit` (integer, default: 10) |
83
+ | `memory_save` | Store observation in memory | `content` (string, required), `entity` (string, optional) |
84
+ | `memory_list` | List all memory entities | (no parameters) |
85
+ | `context_load` | Load conversation context | `session_id` (string, optional) |
86
+ | `context_save` | Save conversation context | `session_id` (string, optional) |
87
+
88
+ ### Project Management
89
+
90
+ | Tool | Description | Parameters |
91
+ |------|-------------|------------|
92
+ | `create_task` | Create a new task | `title` (string, required), `description` (string, optional), `priority` (string: low/medium/high) |
93
+ | `list_tasks` | List tasks | `status` (string: pending|done|all, default: "pending") |
94
+ | `update_task` | Update task status or details | `task_id` (string, required), `status` (string, optional), `title` (string, optional), `description` (string, optional) |
95
+ | `project_scan` | Scan project structure and dependencies | (no parameters) |
96
+
97
+ ### System & Utilities
98
+
99
+ | Tool | Description | Parameters |
100
+ |------|-------------|------------|
101
+ | `get_system_info` | Get OS, CPU, memory, disk info | (no parameters) |
102
+ | `list_processes` | List running processes | `filter` (string, optional) |
103
+ | `kill_process` | Terminate a process | `pid` (integer, required) |
104
+ | `environment` | Get environment variables | `names` (array of strings, optional) |
105
+ | `set_environment` | Set environment variable (current session) | `name` (string, required), `value` (string, required) |
106
+ | `whoami` | Get current user | (no parameters) |
107
+ | `pwd` | Print working directory | (no parameters) |
108
+
109
+ ### Data & Serialization
110
+
111
+ | Tool | Description | Parameters |
112
+ |------|-------------|------------|
113
+ | `json_parse` | Parse JSON string to dict | `json_string` (string, required) |
114
+ | `json_format` | Format dict/object to pretty JSON | `data` (object, required), `indent` (integer, default: 2) |
115
+ | `yaml_parse` | Parse YAML to dict | `yaml_string` (string, required) |
116
+ | `yaml_format` | Format dict to YAML | `data` (object, required) |
117
+ | `csv_parse` | Parse CSV to list of dicts | `csv_string` (string, required), `delimiter` (string, default: ",") |
118
+ | `csv_format` | Format list of dicts to CSV | `data` (array, required), `columns` (array, optional) |
119
+
120
+ ### Time & Scheduling
121
+
122
+ | Tool | Description | Parameters |
123
+ |------|-------------|------------|
124
+ | `current_time` | Get current date/time | `timezone` (string, optional) |
125
+ | `sleep` | Sleep for N seconds | `seconds` (integer, required) |
126
+ | `schedule` | Schedule a future task (requires background runner) | `delay_seconds` (integer, required), `action` (string, required), `params` (object, optional) |
127
+
128
+ ### Image & Media
129
+
130
+ | Tool | Description | Parameters |
131
+ |------|-------------|------------|
132
+ | `image_info` | Get image metadata (dimensions, format, size) | `path` (string, required) |
133
+ | `image_resize` | Resize an image | `path` (string, required), `width` (integer), `height` (integer), `output_path` (string, required) |
134
+ | `image_convert` | Convert image format | `path` (string, required), `format` (string: png|jpg|webp|gif), `output_path` (string, required) |
135
+ | `generate_image` | Generate image from text (requires image generation model) | `prompt` (string, required), `size` (string: 1024x1024), `output_path` (string) |
136
+
137
+ ---
138
+
139
+ ## Return Format
140
+
141
+ All tools return a JSON-serializable dict with at least:
142
+
143
+ ```json
144
+ {
145
+ "success": true|false,
146
+ "result": <tool-specific result data>,
147
+ "error": <error message if failed>
148
+ }
149
+ ```
150
+
151
+ Example success:
152
+ ```json
153
+ {
154
+ "success": true,
155
+ "result": "File content here...",
156
+ "error": null
157
+ }
158
+ ```
159
+
160
+ Example error:
161
+ ```json
162
+ {
163
+ "success": false,
164
+ "result": null,
165
+ "error": "File not found: /path/to/file"
166
+ }
167
+ ```
168
+
169
+ ---
170
+
171
+ ## Schema Access
172
+
173
+ Tools can be introspected programmatically:
174
+
175
+ ```python
176
+ from stack_cli.tools import get_tool_schemas, get_tool
177
+
178
+ # Get all tool schemas for LLM function calling
179
+ schemas = get_tool_schemas()
180
+
181
+ # Get a specific tool
182
+ read_tool = get_tool("read")
183
+ result = read_tool(path="/path/to/file")
184
+ ```
185
+
186
+ ---
187
+
188
+ ## Extending
189
+
190
+ To add a new tool, define a function and register it in `stack_cli/tools.py`:
191
+
192
+ ```python
193
+ def my_tool(param1: str, param2: int = 5) -> dict:
194
+ """Tool description for LLM."""
195
+ try:
196
+ # Do work
197
+ result = do_something(param1, param2)
198
+ return {"success": True, "result": result}
199
+ except Exception as e:
200
+ return {"success": False, "error": str(e)}
201
+
202
+ # Register
203
+ register_tool("my_tool", my_tool, "Description for LLM")
204
+ ```
205
+
206
+ The system automatically generates JSON schemas from type hints and docstrings.
space/README.md CHANGED
@@ -1,6 +1,6 @@
1
- # πŸš€ Stack 2.9 - Self-Evolving AI Coding Assistant
2
 
3
- A HuggingFace Spaces demo for Stack 2.9, a self-evolving AI coding assistant powered by Qwen2.5-Coder-7B.
4
 
5
  ![License](https://img.shields.io/badge/license-MIT-blue.svg)
6
  ![Python](https://img.shields.io/badge/python-3.10+-green.svg)
@@ -10,7 +10,7 @@ A HuggingFace Spaces demo for Stack 2.9, a self-evolving AI coding assistant pow
10
 
11
  - **πŸ€– Qwen2.5-Coder-7B** - State-of-the-art code generation model
12
  - **πŸ”§ 7 Integrated Tools** - File operations, git, web search, shell commands
13
- - **🧠 Self-Evolution Memory** - Learns from each interaction
14
  - **⚑ Fast Streaming** - Real-time token-by-token generation
15
  - **πŸ’Ύ 4-bit Quantization** - Runs on 16GB GPU (~4GB VRAM)
16
 
@@ -90,7 +90,7 @@ print(memory.get_stats())
90
 
91
  ## πŸ“Š Memory System
92
 
93
- Stack 2.9 includes a self-evolution memory system that:
94
 
95
  1. **Tracks Interactions** - Records every user-assistant exchange
96
  2. **Learns Patterns** - Identifies frequently used tools
 
1
+ # πŸš€ Stack 2.9 - Pattern-Based AI Coding Assistant
2
 
3
+ A HuggingFace Spaces demo for Stack 2.9, a pattern-based AI coding assistant powered by Qwen2.5-Coder-7B.
4
 
5
  ![License](https://img.shields.io/badge/license-MIT-blue.svg)
6
  ![Python](https://img.shields.io/badge/python-3.10+-green.svg)
 
10
 
11
  - **πŸ€– Qwen2.5-Coder-7B** - State-of-the-art code generation model
12
  - **πŸ”§ 7 Integrated Tools** - File operations, git, web search, shell commands
13
+ - **🧠 Pattern Memory** - Learns from each interaction
14
  - **⚑ Fast Streaming** - Real-time token-by-token generation
15
  - **πŸ’Ύ 4-bit Quantization** - Runs on 16GB GPU (~4GB VRAM)
16
 
 
90
 
91
  ## πŸ“Š Memory System
92
 
93
+ Stack 2.9 includes a pattern memory system that:
94
 
95
  1. **Tracks Interactions** - Records every user-assistant exchange
96
  2. **Learns Patterns** - Identifies frequently used tools
space/app.py CHANGED
@@ -1,9 +1,9 @@
1
  """
2
- Stack 2.9 - Self-Evolving AI Coding Assistant
3
  HuggingFace Spaces Demo
4
 
5
  A Gradio interface for Stack 2.9 powered by Qwen2.5-Coder-7B
6
- with tool integration and self-evolution memory.
7
  """
8
 
9
  import os
@@ -14,11 +14,11 @@ from typing import List, Dict, Optional
14
  import gradio as gr
15
 
16
  # ============================================================
17
- # Self-Evolution Memory System
18
  # ============================================================
19
 
20
  class SelfEvolutionMemory:
21
- """Simple in-memory self-evolution system for demo purposes."""
22
 
23
  def __init__(self):
24
  self.conversations = []
@@ -60,7 +60,7 @@ class SelfEvolutionMemory:
60
 
61
  def get_context(self) -> str:
62
  """Get accumulated context for the model."""
63
- context_parts = [f"## Self-Evolution Memory ({self.interaction_count} interactions)"]
64
 
65
  if self.learned_patterns:
66
  context_parts.append("\n### Tool Usage Patterns:")
@@ -236,7 +236,7 @@ class StackModel:
236
  return "Model not loaded. Please wait for initialization."
237
 
238
  # Build the prompt with system and tools
239
- system_prompt = f"""You are Stack 2.9 - a self-evolving AI coding assistant.
240
 
241
  ## Available Tools
242
  {get_tool_descriptions()}
@@ -291,7 +291,7 @@ Now respond to the user:"""
291
  yield "Model not loaded. Please wait for initialization."
292
  return
293
 
294
- system_prompt = f"""You are Stack 2.9 - a self-evolving AI coding assistant.
295
 
296
  ## Available Tools
297
  {get_tool_descriptions()}
@@ -447,7 +447,7 @@ def create_gradio_app():
447
  """Create the Gradio interface."""
448
 
449
  with gr.Blocks(
450
- title="Stack 2.9 - Self-Evolving AI Coding Assistant",
451
  theme=gr.themes.Soft(
452
  primary_color="#6366f1",
453
  secondary_color="#818cf8",
@@ -457,7 +457,7 @@ def create_gradio_app():
457
 
458
  # Header
459
  gr.Markdown("""
460
- # πŸš€ Stack 2.9 - Self-Evolving AI Coding Assistant
461
 
462
  Powered by **Qwen2.5-Coder-7B** with 4-bit quantization
463
 
@@ -546,7 +546,7 @@ def create_gradio_app():
546
  ---
547
  ### About Stack 2.9
548
 
549
- Stack 2.9 is a self-evolving AI coding assistant that:
550
  - πŸ” Uses **Qwen2.5-Coder-7B** (4-bit, ~4GB VRAM)
551
  - πŸ› οΈ Integrates **7 tools** (file, git, web, search, shell)
552
  - 🧠 Remembers interactions and learns patterns
@@ -572,7 +572,7 @@ if __name__ == "__main__":
572
  args = parser.parse_args()
573
 
574
  print("=" * 50)
575
- print("πŸš€ Stack 2.9 - Self-Evolving AI Coding Assistant")
576
  print("=" * 50)
577
  print(f"Model: {args.model}")
578
  print("Loading model...")
 
1
  """
2
+ Stack 2.9 - Pattern-Based AI Coding Assistant
3
  HuggingFace Spaces Demo
4
 
5
  A Gradio interface for Stack 2.9 powered by Qwen2.5-Coder-7B
6
+ with tool integration and pattern memory.
7
  """
8
 
9
  import os
 
14
  import gradio as gr
15
 
16
  # ============================================================
17
+ # Pattern Memory System
18
  # ============================================================
19
 
20
  class SelfEvolutionMemory:
21
+ """Simple in-memory pattern memory system for demo purposes."""
22
 
23
  def __init__(self):
24
  self.conversations = []
 
60
 
61
  def get_context(self) -> str:
62
  """Get accumulated context for the model."""
63
+ context_parts = [f"## Pattern Memory ({self.interaction_count} interactions)"]
64
 
65
  if self.learned_patterns:
66
  context_parts.append("\n### Tool Usage Patterns:")
 
236
  return "Model not loaded. Please wait for initialization."
237
 
238
  # Build the prompt with system and tools
239
+ system_prompt = f"""You are Stack 2.9 - a pattern-based AI coding assistant.
240
 
241
  ## Available Tools
242
  {get_tool_descriptions()}
 
291
  yield "Model not loaded. Please wait for initialization."
292
  return
293
 
294
+ system_prompt = f"""You are Stack 2.9 - a pattern-based AI coding assistant.
295
 
296
  ## Available Tools
297
  {get_tool_descriptions()}
 
447
  """Create the Gradio interface."""
448
 
449
  with gr.Blocks(
450
+ title="Stack 2.9 - Pattern-Based AI Coding Assistant",
451
  theme=gr.themes.Soft(
452
  primary_color="#6366f1",
453
  secondary_color="#818cf8",
 
457
 
458
  # Header
459
  gr.Markdown("""
460
+ # πŸš€ Stack 2.9 - Pattern-Based AI Coding Assistant
461
 
462
  Powered by **Qwen2.5-Coder-7B** with 4-bit quantization
463
 
 
546
  ---
547
  ### About Stack 2.9
548
 
549
+ Stack 2.9 is a pattern-based AI coding assistant that:
550
  - πŸ” Uses **Qwen2.5-Coder-7B** (4-bit, ~4GB VRAM)
551
  - πŸ› οΈ Integrates **7 tools** (file, git, web, search, shell)
552
  - 🧠 Remembers interactions and learns patterns
 
572
  args = parser.parse_args()
573
 
574
  print("=" * 50)
575
+ print("πŸš€ Stack 2.9 - Pattern-Based AI Coding Assistant")
576
  print("=" * 50)
577
  print(f"Model: {args.model}")
578
  print("Loading model...")
stack-2.9-deploy/README.md CHANGED
@@ -9,6 +9,25 @@ Turnkey deployment configurations for Stack 2.9 LLM inference server.
9
  - For cloud: **runpodctl** or **vastai** CLI installed
10
  - **chmod +x** may be required on shell scripts
11
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
12
  ## πŸ§ͺ Validate Setup
13
 
14
  Before deploying, run the validation script to ensure everything is ready:
 
9
  - For cloud: **runpodctl** or **vastai** CLI installed
10
  - **chmod +x** may be required on shell scripts
11
 
12
+ ## πŸ–₯️ System Requirements
13
+
14
+ Stack 2.9 deployment requires appropriate hardware depending on model size:
15
+
16
+ | Configuration | Minimum | Recommended | Production |
17
+ |---------------|---------|-------------|------------|
18
+ | **GPU VRAM** | 8GB | 24GB | 40-80GB (A100/H100) |
19
+ | **RAM** | 16GB | 32GB | 64GB+ |
20
+ | **Disk** | 20GB free | 50GB free | 100GB+ (NVMe) |
21
+ | **CUDA** | 11.8 | 12.1 | 12.1+ |
22
+ | **Models** | 7B quantized | 32B quantized | 70B+ quantized |
23
+
24
+ **Notes:**
25
+ - CPU-only mode is possible but extremely slow (not recommended for production)
26
+ - AWQ/GPTQ quantization reduces VRAM requirements by ~50%
27
+ - Multi-GPU (tensor parallelism) supported via `TENSOR_PARALLEL_SIZE`
28
+
29
+ ## πŸ§ͺ Validate Setup
30
+
31
  ## πŸ§ͺ Validate Setup
32
 
33
  Before deploying, run the validation script to ensure everything is ready:
stack-2.9-docs/ARCHITECTURE.md CHANGED
@@ -7,7 +7,7 @@ This document provides an in-depth look at Stack 2.9's technical architecture, s
7
  - [System Overview](#system-overview)
8
  - [System Components](#system-components)
9
  - [Data Flow](#data-flow)
10
- - [Self-Evolution Mechanism](#self-evolution-mechanism)
11
  - [Training Pipeline](#training-pipeline)
12
  - [Tool System](#tool-system)
13
  - [Memory System](#memory-system)
@@ -42,7 +42,7 @@ This document provides an in-depth look at Stack 2.9's technical architecture, s
42
  β”‚ β”‚ β”‚ β”‚ β”‚
43
  β”‚ β–Ό β–Ό β–Ό β”‚
44
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
45
- β”‚ β”‚ MODEL LAYER β”‚ β”‚ TOOL ENGINE β”‚ β”‚ SELF-EVOLUTION β”‚ β”‚
46
  β”‚ β”‚ Qwen2.5-Coder β”‚ β”‚ 37 Tools β”‚ β”‚ Observe/Learn β”‚ β”‚
47
  β”‚ β”‚ 32B + LoRA β”‚ β”‚ Sandbox Exec β”‚ β”‚ Memory/Train β”‚ β”‚
48
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
@@ -153,7 +153,7 @@ The orchestration layer coordinates the agent's activities:
153
  - **Agent (agent.py)**: Main orchestration logic
154
  - **Context Manager (context.py)**: Manages conversation context and truncation
155
  - **Tool Coordinator**: Routes tool calls and manages execution
156
- - **Memory Bridge**: Interfaces with the self-evolution memory system
157
 
158
  ### 4. Model Layer
159
 
@@ -258,7 +258,7 @@ MODEL_CONFIG = {
258
  β”‚ β”‚ β”‚ β”‚
259
  β”‚ β”‚ β€’ Format response (OpenAI-compatible) β”‚ β”‚
260
  β”‚ β”‚ β€’ Stream chunks (if requested) β”‚ β”‚
261
- β”‚ β”‚ β€’ Record to self-evolution system β”‚ β”‚
262
  β”‚ β”‚ β€’ Update metrics β”‚ β”‚
263
  β”‚ β”‚ β”‚ β”‚
264
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
@@ -314,13 +314,13 @@ MODEL_CONFIG = {
314
 
315
  ---
316
 
317
- ## Self-Evolution Mechanism
318
 
319
- Stack 2.9's self-evolution system enables continuous improvement through experience:
320
 
321
  ```
322
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
323
- β”‚ SELF-EVOLUTION ARCHITECTURE β”‚
324
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
325
  β”‚ β”‚
326
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
@@ -563,7 +563,7 @@ class PersistentMemory:
563
  β”‚ β”‚ └── Duration: 1-2 epochs β”‚ β”‚
564
  β”‚ β”‚ β”‚ β”‚
565
  β”‚ β”‚ Stage 3: LoRA Adapter Training β”‚ β”‚
566
- β”‚ β”‚ β”œβ”€β”€ Self-evolution patterns β”‚ β”‚
567
  β”‚ β”‚ β”œβ”€β”€ Voice integration β”‚ β”‚
568
  β”‚ β”‚ └── Duration: 1 epoch β”‚ β”‚
569
  β”‚ β”‚ β”‚ β”‚
@@ -575,7 +575,7 @@ class PersistentMemory:
575
  β”‚ β”‚ β”‚ β”‚
576
  β”‚ β”‚ β€’ HumanEval, MBPP benchmarks β”‚ β”‚
577
  β”‚ β”‚ β€’ Tool use accuracy β”‚ β”‚
578
- β”‚ β”‚ β€’ Self-evolution effectiveness β”‚ β”‚
579
  β”‚ β”‚ β€’ Quality regression testing β”‚ β”‚
580
  β”‚ β”‚ β”‚ β”‚
581
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
@@ -992,7 +992,7 @@ METRICS = {
992
  "tool_execution_time": Histogram,
993
  "tool_errors": Counter,
994
 
995
- # Self-evolution metrics
996
  "memories_created": Counter,
997
  "patterns_extracted": Counter,
998
  "improvements_applied": Counter,
 
7
  - [System Overview](#system-overview)
8
  - [System Components](#system-components)
9
  - [Data Flow](#data-flow)
10
+ - [Pattern Memory System](#pattern-memory-system)
11
  - [Training Pipeline](#training-pipeline)
12
  - [Tool System](#tool-system)
13
  - [Memory System](#memory-system)
 
42
  β”‚ β”‚ β”‚ β”‚ β”‚
43
  β”‚ β–Ό β–Ό β–Ό β”‚
44
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
45
+ β”‚ β”‚ MODEL LAYER β”‚ β”‚ TOOL ENGINE β”‚ β”‚ PATTERN MEMORY β”‚ β”‚
46
  β”‚ β”‚ Qwen2.5-Coder β”‚ β”‚ 37 Tools β”‚ β”‚ Observe/Learn β”‚ β”‚
47
  β”‚ β”‚ 32B + LoRA β”‚ β”‚ Sandbox Exec β”‚ β”‚ Memory/Train β”‚ β”‚
48
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
 
153
  - **Agent (agent.py)**: Main orchestration logic
154
  - **Context Manager (context.py)**: Manages conversation context and truncation
155
  - **Tool Coordinator**: Routes tool calls and manages execution
156
+ - **Memory Bridge**: Interfaces with the pattern memory memory system
157
 
158
  ### 4. Model Layer
159
 
 
258
  β”‚ β”‚ β”‚ β”‚
259
  β”‚ β”‚ β€’ Format response (OpenAI-compatible) β”‚ β”‚
260
  β”‚ β”‚ β€’ Stream chunks (if requested) β”‚ β”‚
261
+ β”‚ β”‚ β€’ Record to pattern memory system β”‚ β”‚
262
  β”‚ β”‚ β€’ Update metrics β”‚ β”‚
263
  β”‚ β”‚ β”‚ β”‚
264
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
 
314
 
315
  ---
316
 
317
+ ## Pattern Memory System
318
 
319
+ Stack 2.9's pattern memory system enables continuous improvement through experience:
320
 
321
  ```
322
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
323
+ β”‚ PATTERN MEMORY ARCHITECTURE β”‚
324
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
325
  β”‚ β”‚
326
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
 
563
  β”‚ β”‚ └── Duration: 1-2 epochs β”‚ β”‚
564
  β”‚ β”‚ β”‚ β”‚
565
  β”‚ β”‚ Stage 3: LoRA Adapter Training β”‚ β”‚
566
+ β”‚ β”‚ β”œβ”€β”€ Pattern Memory patterns β”‚ β”‚
567
  β”‚ β”‚ β”œβ”€β”€ Voice integration β”‚ β”‚
568
  β”‚ β”‚ └── Duration: 1 epoch β”‚ β”‚
569
  β”‚ β”‚ β”‚ β”‚
 
575
  β”‚ β”‚ β”‚ β”‚
576
  β”‚ β”‚ β€’ HumanEval, MBPP benchmarks β”‚ β”‚
577
  β”‚ β”‚ β€’ Tool use accuracy β”‚ β”‚
578
+ β”‚ β”‚ β€’ Pattern Memory effectiveness β”‚ β”‚
579
  β”‚ β”‚ β€’ Quality regression testing β”‚ β”‚
580
  β”‚ β”‚ β”‚ β”‚
581
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
 
992
  "tool_execution_time": Histogram,
993
  "tool_errors": Counter,
994
 
995
+ # Pattern Memory metrics
996
  "memories_created": Counter,
997
  "patterns_extracted": Counter,
998
  "improvements_applied": Counter,
stack-2.9-docs/BENCHMARKS.md CHANGED
@@ -63,16 +63,23 @@ Measured on A100 80GB with vLLM + AWQ 4-bit:
63
 
64
  ## Model Performance Benchmarks
65
 
66
- ### Coding Benchmarks
67
 
68
- | Benchmark | Stack 2.9 (32B, 128K) | Stack 2.9 (32B, 32K) | Claude Code | GitHub Copilot |
69
- |-----------|-----------------------|-----------------------|-------------|----------------|
70
- | HumanEval | 76.8% | 76.8% | 84.0% | 81.0% |
71
- | MBPP | 82.3% | 82.3% | 88.0% | 85.0% |
72
- | GSM8K | 89.2% | 89.2% | 92.0% | - |
73
- | Tool Use | 94.1% | 94.1% | 91.0% | 88.0% |
74
 
75
- **Observation**: Context length does not affect benchmark scores for single-turn tasks. Benefits appear in multi-turn and cross-file scenarios.
 
 
 
 
 
 
 
 
 
 
 
 
76
 
77
  ### Voice-First Features
78
 
 
63
 
64
  ## Model Performance Benchmarks
65
 
66
+ ⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../EVALUATION.md) for the audit report.
67
 
68
+ ### Coding Benchmarks (Actual Baseline Expectations)
 
 
 
 
 
69
 
70
+ | Benchmark | Status | Notes |
71
+ |-----------|--------|-------|
72
+ | **HumanEval** | Pending | Full 164-problem evaluation in progress |
73
+ | **MBPP** | Pending | Full 500-problem evaluation in progress |
74
+ | **Tool Use** | Pending | Custom tool-calling benchmark to be created |
75
+ | **GSM8K** | Not started | Math reasoning evaluation planned |
76
+ | **Context** | βœ… 128K | Token context window tested |
77
+
78
+ **Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
79
+ - HumanEval: ~70-72% Pass@1
80
+ - MBPP: ~75-77% Pass@1
81
+
82
+ Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
83
 
84
  ### Voice-First Features
85
 
stack-2.9-docs/CONTRIBUTING.md CHANGED
@@ -496,7 +496,7 @@ class TestProcessData:
496
  | Integration Tests | `tests/integration/` | Test component interactions |
497
  | API Tests | `tests/api/` | Test API endpoints |
498
  | Tool Tests | `tests/tools/` | Test tool implementations |
499
- | Self-Evolution Tests | `tests/self_evolution/` | Test learning system |
500
 
501
  ### Running Tests
502
 
 
496
  | Integration Tests | `tests/integration/` | Test component interactions |
497
  | API Tests | `tests/api/` | Test API endpoints |
498
  | Tool Tests | `tests/tools/` | Test tool implementations |
499
+ | Pattern Memory Tests | `tests/self_evolution/` | Test learning system |
500
 
501
  ### Running Tests
502
 
stack-2.9-docs/README.md CHANGED
@@ -1,6 +1,6 @@
1
  # Stack 2.9 πŸ€–
2
 
3
- **Your self-evolving AI companion β€” gets smarter with every conversation.**
4
 
5
  Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Coder-32B, fine-tuned with OpenClaw tool patterns. It provides a powerful, self-hostable alternative to commercial coding assistants with the added capability of voice integration.
6
 
@@ -35,12 +35,20 @@ Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Code
35
 
36
  ## πŸ“Š Benchmarks
37
 
38
- | Benchmark | Score | Description |
39
- |-----------|-------|-------------|
40
- | **HumanEval** | 76.8% | Python coding tasks |
41
- | **MBPP** | 82.3% | Python function synthesis |
42
- | **Tool Use** | 94.1% | OpenClaw tool patterns |
43
- | **Context Window** | 131K tokens | Long context understanding |
 
 
 
 
 
 
 
 
44
 
45
  ## πŸš€ Quick Start
46
 
@@ -121,7 +129,7 @@ curl -X POST http://localhost:3000/v1/chat/completions \
121
  β”‚ β”‚ MODEL LAYER β”‚ β”‚
122
  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
123
  β”‚ β”‚ β”‚ Qwen2.5-Coder-32B β”‚ β”‚ Fine-tuned on β”‚ β”‚ LoRA Adapter β”‚ β”‚ β”‚
124
- β”‚ β”‚ β”‚ (Base Model) β”‚ β”‚ OpenClaw Tools β”‚ β”‚ (Self-Evolution) β”‚ β”‚ β”‚
125
  β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
126
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
127
  β”‚ β”‚ β”‚
@@ -140,7 +148,7 @@ curl -X POST http://localhost:3000/v1/chat/completions \
140
  β”‚ β”‚ β”‚
141
  β”‚ β–Ό β”‚
142
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
143
- β”‚ β”‚ SELF-EVOLUTION LAYER β”‚ β”‚
144
  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
145
  β”‚ β”‚ β”‚ Observer │──│ Learner │──│ Memory │──│ Trainer β”‚ β”‚ β”‚
146
  β”‚ β”‚ β”‚ (Watches)β”‚ β”‚(Analyzes)β”‚ β”‚ (Stores) β”‚ β”‚(Improves)β”‚ β”‚ β”‚
@@ -212,9 +220,9 @@ curl -X POST http://localhost:3000/v1/chat/completions \
212
  | **Data Processing** | CSV, JSON, XML, database operations |
213
  | **Voice** | speech-to-text, text-to-speech, voice cloning |
214
 
215
- ### Self-Evolution Capabilities
216
 
217
- The self-evolution system continuously improves Stack 2.9's performance:
218
 
219
  1. **Observe** - Watches problem-solving processes
220
  2. **Learn** - Extracts patterns from successes and failures
@@ -240,7 +248,7 @@ The self-evolution system continuously improves Stack 2.9's performance:
240
  | **Open Source** | βœ… Apache 2.0 | ❌ Closed | ❌ Closed | βœ… LGPL |
241
  | **Tool Patterns** | βœ… OpenClaw | βœ… Yes | ❌ No | ❌ No |
242
  | **Context Window** | 131K tokens | 200K tokens | 32K tokens | 100K tokens |
243
- | **Self-Evolution** | βœ… Yes | ❌ No | ❌ No | ❌ No |
244
  | **Price** | Free | $20/month | $10/month | $12/month |
245
  | **Self-Hosting** | βœ… Yes | ❌ No | ❌ No | βœ… Yes |
246
  | **Model Size** | 32B params | 200K+ params | 15B params | 100M params |
@@ -254,7 +262,7 @@ stack-2.9/
254
  β”‚ β”œβ”€β”€ agent.py # Agent orchestration
255
  β”‚ β”œβ”€β”€ context.py # Context management
256
  β”‚ └── tools.py # Tool implementations
257
- β”œβ”€β”€ self_evolution/ # Self-improvement system
258
  β”‚ β”œβ”€β”€ observer.py # Behavior observation
259
  β”‚ β”œβ”€β”€ learner.py # Pattern extraction
260
  β”‚ β”œβ”€β”€ memory.py # Vector-based memory
@@ -272,11 +280,11 @@ stack-2.9/
272
  └── pyproject.toml # Project metadata
273
  ```
274
 
275
- ## πŸ”„ Self-Evolution Process
276
 
277
  ```
278
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
279
- β”‚ SELF-EVOLUTION CYCLE β”‚
280
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
281
  β”‚ β”‚
282
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
 
1
  # Stack 2.9 πŸ€–
2
 
3
+ **Your pattern-learning AI companion β€” gets smarter with every conversation.**
4
 
5
  Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Coder-32B, fine-tuned with OpenClaw tool patterns. It provides a powerful, self-hostable alternative to commercial coding assistants with the added capability of voice integration.
6
 
 
35
 
36
  ## πŸ“Š Benchmarks
37
 
38
+ ⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../../EVALUATION.md) for the audit report.
39
+
40
+ | Benchmark | Status | Notes |
41
+ |-----------|--------|-------|
42
+ | **HumanEval** | Pending | Full 164-problem evaluation in progress |
43
+ | **MBPP** | Pending | Full 500-problem evaluation in progress |
44
+ | **Tool Use** | Pending | Custom tool-calling benchmark to be created |
45
+ | **Context Window** | βœ… 131K tokens | Long context understanding tested |
46
+
47
+ **Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
48
+ - HumanEval: ~70-72% Pass@1
49
+ - MBPP: ~75-77% Pass@1
50
+
51
+ Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
52
 
53
  ## πŸš€ Quick Start
54
 
 
129
  β”‚ β”‚ MODEL LAYER β”‚ β”‚
130
  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
131
  β”‚ β”‚ β”‚ Qwen2.5-Coder-32B β”‚ β”‚ Fine-tuned on β”‚ β”‚ LoRA Adapter β”‚ β”‚ β”‚
132
+ β”‚ β”‚ β”‚ (Base Model) β”‚ β”‚ OpenClaw Tools β”‚ β”‚ (Pattern Memory) β”‚ β”‚ β”‚
133
  β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚
134
  β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
135
  β”‚ β”‚ β”‚
 
148
  β”‚ β”‚ β”‚
149
  β”‚ β–Ό β”‚
150
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
151
+ β”‚ β”‚ PATTERN MEMORY LAYER β”‚ β”‚
152
  β”‚ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ β”‚
153
  β”‚ β”‚ β”‚ Observer │──│ Learner │──│ Memory │──│ Trainer β”‚ β”‚ β”‚
154
  β”‚ β”‚ β”‚ (Watches)β”‚ β”‚(Analyzes)β”‚ β”‚ (Stores) β”‚ β”‚(Improves)β”‚ β”‚ β”‚
 
220
  | **Data Processing** | CSV, JSON, XML, database operations |
221
  | **Voice** | speech-to-text, text-to-speech, voice cloning |
222
 
223
+ ### Pattern Memory Capabilities
224
 
225
+ The pattern memory system continuously improves Stack 2.9's performance:
226
 
227
  1. **Observe** - Watches problem-solving processes
228
  2. **Learn** - Extracts patterns from successes and failures
 
248
  | **Open Source** | βœ… Apache 2.0 | ❌ Closed | ❌ Closed | βœ… LGPL |
249
  | **Tool Patterns** | βœ… OpenClaw | βœ… Yes | ❌ No | ❌ No |
250
  | **Context Window** | 131K tokens | 200K tokens | 32K tokens | 100K tokens |
251
+ | **Pattern Memory** | βœ… Yes | ❌ No | ❌ No | ❌ No |
252
  | **Price** | Free | $20/month | $10/month | $12/month |
253
  | **Self-Hosting** | βœ… Yes | ❌ No | ❌ No | βœ… Yes |
254
  | **Model Size** | 32B params | 200K+ params | 15B params | 100M params |
 
262
  β”‚ β”œβ”€β”€ agent.py # Agent orchestration
263
  β”‚ β”œβ”€β”€ context.py # Context management
264
  β”‚ └── tools.py # Tool implementations
265
+ β”œβ”€β”€ self_evolution/ # Pattern memory system
266
  β”‚ β”œβ”€β”€ observer.py # Behavior observation
267
  β”‚ β”œβ”€β”€ learner.py # Pattern extraction
268
  β”‚ β”œβ”€β”€ memory.py # Vector-based memory
 
280
  └── pyproject.toml # Project metadata
281
  ```
282
 
283
+ ## πŸ”„ Pattern Learning Process
284
 
285
  ```
286
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
287
+ β”‚ PATTERN LEARNING CYCLE β”‚
288
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
289
  β”‚ β”‚
290
  β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
stack-2.9-eval/human_eval.py CHANGED
@@ -1,20 +1,17 @@
1
  #!/usr/bin/env python3
2
  """
3
- HumanEval Benchmark Evaluation for Stack 2.9
4
  =============================================
5
- Evaluates code generation capabilities using the HumanEval benchmark.
6
 
7
- Metrics:
8
- - Pass@1: Fraction of problems solved with single generation (temperature=0.2)
9
- - Pass@10: Fraction of problems solved with 10 generations (temperature=0.8)
10
- - Pass@100: Fraction of problems solved with 100 generations (temperature=0.8)
11
 
12
- Temperature settings:
13
- - Pass@1: temperature=0.2, top_p=0.95 (deterministic)
14
- - Pass@10/100: temperature=0.8, top_p=0.95 (creative)
15
 
16
- Usage:
17
- python human_eval.py [--model MODEL] [--output OUTPUT_DIR] [--timeout TIMEOUT]
 
 
18
  """
19
 
20
  import argparse
 
1
  #!/usr/bin/env python3
2
  """
3
+ HumanEval Benchmark Evaluation for Stack 2.9 [DEPRECATED]
4
  =============================================
 
5
 
6
+ ⚠️ WARNING: This evaluation script is DEPRECATED and produces INVALID results.
 
 
 
7
 
8
+ It only tests 20 out of 164 problems (12%) and returns hardcoded canonical
9
+ solutions instead of calling a real model. The results are therefore fraudulent.
 
10
 
11
+ USE THE PROPER EVALUATION INFRASTRUCTURE:
12
+ python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b
13
+
14
+ See EVALUATION.md for the full audit report.
15
  """
16
 
17
  import argparse
stack-2.9-eval/mbpp_eval.py CHANGED
@@ -1,19 +1,17 @@
1
  #!/usr/bin/env python3
2
  """
3
- MBPP (Mostly Basic Python Problems) Benchmark Evaluation for Stack 2.9
4
- =======================================================================
5
- Evaluates code generation capabilities using the sanitized MBPP benchmark.
6
 
7
- The MBPP dataset contains 974 Python problems ranging from simple
8
- function calls to complex algorithms. This implementation uses the
9
- sanitized version (MBPP-santized) with 500 test cases.
10
 
11
- Metrics:
12
- - Pass@1: Fraction solved with single generation
13
- - Pass@10: Fraction solved with 10 generations
14
 
15
- Usage:
16
- python mbpp_eval.py [--model MODEL] [--output OUTPUT_DIR] [--timeout TIMEOUT]
 
 
17
  """
18
 
19
  import argparse
 
1
  #!/usr/bin/env python3
2
  """
3
+ MBPP Benchmark Evaluation for Stack 2.9 [DEPRECATED]
4
+ ===================================================
 
5
 
6
+ ⚠️ WARNING: This evaluation script is DEPRECATED and produces INVALID results.
 
 
7
 
8
+ It only tests 20 out of 500 problems (4%) and returns hardcoded canonical
9
+ solutions instead of calling a real model. The scores are therefore fraudulent.
 
10
 
11
+ USE THE PROPER EVALUATION INFRASTRUCTURE:
12
+ python stack-2.9-eval/run_proper_evaluation.py --benchmark mbpp --provider ollama --model qwen2.5-coder:32b
13
+
14
+ See EVALUATION.md for the full audit report.
15
  """
16
 
17
  import argparse
stack-2.9-eval/model_client.py CHANGED
@@ -435,6 +435,139 @@ class AnthropicClient(BaseModelClient):
435
  return self.model
436
 
437
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
438
  def create_model_client(
439
  provider: str = "ollama",
440
  model: Optional[str] = None,
@@ -444,7 +577,7 @@ def create_model_client(
444
  Factory function to create model client.
445
 
446
  Args:
447
- provider: One of "ollama", "openai", "anthropic"
448
  model: Model name (defaults to provider's default)
449
  **kwargs: Additional client configuration
450
 
@@ -460,8 +593,11 @@ def create_model_client(
460
  elif provider == "anthropic":
461
  default_model = model or os.environ.get("ANTHROPIC_MODEL", "claude-sonnet-4-20250514")
462
  return AnthropicClient(model=default_model, **kwargs)
 
 
 
463
  else:
464
- raise ValueError(f"Unknown provider: {provider}. Use: ollama, openai, anthropic")
465
 
466
 
467
  class ModelClientPool:
 
435
  return self.model
436
 
437
 
438
+ class OpenRouterClient(BaseModelClient):
439
+ """Client for OpenRouter API (unified interface for multiple models)."""
440
+
441
+ def __init__(
442
+ self,
443
+ model: str = "qwen/qwen2.5-coder-32b",
444
+ api_key: Optional[str] = None,
445
+ base_url: str = "https://openrouter.ai/api/v1",
446
+ timeout: int = 120,
447
+ http_referer: Optional[str] = None,
448
+ x_title: Optional[str] = None
449
+ ):
450
+ self.model = model
451
+ self.api_key = api_key or os.environ.get("OPENROUTER_API_KEY", "")
452
+ self.base_url = base_url
453
+ self.timeout = timeout
454
+ self.http_referer = http_referer or os.environ.get("HTTP_REFERER", "")
455
+ self.x_title = x_title or os.environ.get("X_TITLE", "Stack 2.9")
456
+
457
+ if not self.api_key:
458
+ raise ValueError("OpenRouter API key required. Set OPENROUTER_API_KEY environment variable.")
459
+
460
+ def _get_client(self):
461
+ """Get OpenAI-compatible client."""
462
+ try:
463
+ from openai import OpenAI
464
+ return OpenAI(api_key=self.api_key, base_url=self.base_url, timeout=self.timeout)
465
+ except ImportError:
466
+ raise ImportError("openai package required. Install with: pip install openai")
467
+
468
+ def generate(
469
+ self,
470
+ prompt: str,
471
+ temperature: float = 0.2,
472
+ max_tokens: int = 4096,
473
+ stop: Optional[List[str]] = None,
474
+ **kwargs
475
+ ) -> GenerationResult:
476
+ """Generate text using OpenRouter."""
477
+ client = self._get_client()
478
+
479
+ start_time = time.time()
480
+
481
+ try:
482
+ response = client.completions.create(
483
+ model=self.model,
484
+ prompt=prompt,
485
+ temperature=temperature,
486
+ max_tokens=max_tokens,
487
+ stop=stop,
488
+ **kwargs
489
+ )
490
+
491
+ duration = time.time() - start_time
492
+
493
+ result = GenerationResult(
494
+ text=response.choices[0].text,
495
+ model=self.model,
496
+ tokens=response.usage.completion_tokens,
497
+ duration=duration,
498
+ finish_reason=response.choices[0].finish_reason,
499
+ raw_response=response.model_dump()
500
+ )
501
+
502
+ return result
503
+ except Exception as e:
504
+ logger.error(f"OpenRouter request failed: {e}")
505
+ raise
506
+
507
+ def chat(
508
+ self,
509
+ messages: List[ChatMessage],
510
+ temperature: float = 0.2,
511
+ max_tokens: int = 4096,
512
+ tools: Optional[List[Dict]] = None,
513
+ **kwargs
514
+ ) -> GenerationResult:
515
+ """Generate chat response using OpenRouter."""
516
+ client = self._get_client()
517
+
518
+ # Convert messages to chat format
519
+ chat_messages = [{"role": m.role, "content": m.content} for m in messages]
520
+
521
+ request_params = {
522
+ "model": self.model,
523
+ "messages": chat_messages,
524
+ "temperature": temperature,
525
+ "max_tokens": max_tokens,
526
+ }
527
+
528
+ if tools:
529
+ request_params["tools"] = tools
530
+
531
+ request_params.update(kwargs)
532
+
533
+ # Add OpenRouter-specific headers
534
+ extra_headers = {}
535
+ if self.http_referer:
536
+ extra_headers["HTTP-Referer"] = self.http_referer
537
+ if self.x_title:
538
+ extra_headers["X-Title"] = self.x_title
539
+
540
+ start_time = time.time()
541
+
542
+ try:
543
+ response = client.chat.completions.create(
544
+ extra_headers=extra_headers if extra_headers else None,
545
+ **request_params
546
+ )
547
+
548
+ duration = time.time() - start_time
549
+
550
+ msg = response.choices[0].message
551
+ text = msg.content or ""
552
+
553
+ result = GenerationResult(
554
+ text=text,
555
+ model=self.model,
556
+ tokens=response.usage.completion_tokens,
557
+ duration=duration,
558
+ finish_reason=response.choices[0].finish_reason,
559
+ raw_response=response.model_dump()
560
+ )
561
+
562
+ return result
563
+ except Exception as e:
564
+ logger.error(f"OpenRouter chat request failed: {e}")
565
+ raise
566
+
567
+ def get_model_name(self) -> str:
568
+ return self.model
569
+
570
+
571
  def create_model_client(
572
  provider: str = "ollama",
573
  model: Optional[str] = None,
 
577
  Factory function to create model client.
578
 
579
  Args:
580
+ provider: One of "ollama", "openai", "anthropic", "openrouter"
581
  model: Model name (defaults to provider's default)
582
  **kwargs: Additional client configuration
583
 
 
593
  elif provider == "anthropic":
594
  default_model = model or os.environ.get("ANTHROPIC_MODEL", "claude-sonnet-4-20250514")
595
  return AnthropicClient(model=default_model, **kwargs)
596
+ elif provider == "openrouter":
597
+ default_model = model or os.environ.get("OPENROUTER_MODEL", "qwen/qwen2.5-coder-32b")
598
+ return OpenRouterClient(model=default_model, **kwargs)
599
  else:
600
+ raise ValueError(f"Unknown provider: {provider}. Use: ollama, openai, anthropic, openrouter")
601
 
602
 
603
  class ModelClientPool:
stack-2.9-eval/tool_use_eval.py CHANGED
@@ -1,22 +1,18 @@
1
  #!/usr/bin/env python3
2
  """
3
- Tool Use Evaluation for Stack 2.9
4
- ===================================
5
- Evaluates tool calling capabilities across 500+ test cases covering:
6
- - File operations (read, write, edit, glob)
7
- - Git operations (status, commit, push, branch)
8
- - Search operations (grep, web search)
9
- - Execution operations (bash, shell commands)
10
- - System operations (task management, config)
11
 
12
- Metrics:
13
- - Tool Selection Accuracy: Correct tool chosen for task
14
- - Parameter Accuracy: Correct parameters provided
15
- - Execution Success Rate: Task completed successfully
16
- - Overall Success Rate: Combined metric
17
 
18
- Usage:
19
- python tool_use_eval.py [--model MODEL] [--output OUTPUT_DIR]
 
 
 
 
 
 
20
  """
21
 
22
  import argparse
 
1
  #!/usr/bin/env python3
2
  """
3
+ Tool Use Evaluation for Stack 2.9 [DEPRECATED]
4
+ ==============================================
 
 
 
 
 
 
5
 
6
+ ⚠️ WARNING: This evaluation script is DEPRECATED and the methodology is INVALID.
 
 
 
 
7
 
8
+ This evaluator uses a naive keyword-matching simulation, not actual model inference.
9
+ There is no proper benchmark implementation for tool calling. The claimed 94.1%
10
+ score is unverifiable and misleading.
11
+
12
+ A proper tool use benchmark needs to be built with 500+ realistic test cases and
13
+ actual model calls. This script remains only as a placeholder.
14
+
15
+ See EVALUATION.md for the full audit report.
16
  """
17
 
18
  import argparse
stack_cli/cli.py CHANGED
@@ -509,7 +509,7 @@ Examples:
509
  parser.add_argument(
510
  '--patterns',
511
  choices=['list', 'stats', 'clear'],
512
- help="Manage patterns for self-evolution"
513
  )
514
 
515
  # Training
 
509
  parser.add_argument(
510
  '--patterns',
511
  choices=['list', 'stats', 'clear'],
512
+ help="Manage learned patterns"
513
  )
514
 
515
  # Training
website/benchmark.html CHANGED
@@ -42,15 +42,15 @@
42
  <p class="subtitle">Stack 2.9 vs Leading AI Models</p>
43
  <div class="benchmark-summary">
44
  <div class="summary-card">
45
- <div class="summary-value">76.8%</div>
46
  <div class="summary-label">HumanEval</div>
47
  </div>
48
  <div class="summary-card">
49
- <div class="summary-value">82.3%</div>
50
  <div class="summary-label">MBPP</div>
51
  </div>
52
  <div class="summary-card highlight">
53
- <div class="summary-value">94.1%</div>
54
  <div class="summary-label">Tool Use</div>
55
  </div>
56
  <div class="summary-card">
@@ -114,10 +114,10 @@
114
  <tbody>
115
  <tr class="highlight-row">
116
  <td><strong>Stack 2.9</strong></td>
117
- <td>76.8%</td>
118
- <td>82.3%</td>
119
- <td>21.4%</td>
120
- <td class="best">94.1%</td>
121
  <td>32B</td>
122
  </tr>
123
  <tr>
@@ -303,7 +303,7 @@
303
  <div class="footer-brand">
304
  <span class="logo-icon">πŸ€–</span>
305
  <span>Stack 2.9</span>
306
- <p>Your self-evolving AI companion</p>
307
  </div>
308
  <div class="footer-links">
309
  <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
@@ -335,8 +335,8 @@
335
  labels: ['HumanEval', 'MBPP', 'SWE-bench', 'Tool Use'],
336
  datasets: [
337
  {
338
- label: 'Stack 2.9',
339
- data: [76.8, 82.3, 21.4, 94.1],
340
  backgroundColor: '#6366f1',
341
  borderRadius: 8,
342
  },
@@ -404,8 +404,8 @@
404
  labels: ['Base', '10 convos', '50 convos', '100 convos', '200 convos', '500 convos'],
405
  datasets: [
406
  {
407
- label: 'Stack 2.9',
408
- data: [70, 73, 78, 82, 86, 91],
409
  borderColor: '#6366f1',
410
  backgroundColor: 'rgba(99, 102, 241, 0.1)',
411
  fill: true,
 
42
  <p class="subtitle">Stack 2.9 vs Leading AI Models</p>
43
  <div class="benchmark-summary">
44
  <div class="summary-card">
45
+ <div class="summary-value">TBD</div>
46
  <div class="summary-label">HumanEval</div>
47
  </div>
48
  <div class="summary-card">
49
+ <div class="summary-value">TBD</div>
50
  <div class="summary-label">MBPP</div>
51
  </div>
52
  <div class="summary-card highlight">
53
+ <div class="summary-value">TBD</div>
54
  <div class="summary-label">Tool Use</div>
55
  </div>
56
  <div class="summary-card">
 
114
  <tbody>
115
  <tr class="highlight-row">
116
  <td><strong>Stack 2.9</strong></td>
117
+ <td>TBD</td>
118
+ <td>TBD</td>
119
+ <td>TBD</td>
120
+ <td class="best">TBD</td>
121
  <td>32B</td>
122
  </tr>
123
  <tr>
 
303
  <div class="footer-brand">
304
  <span class="logo-icon">πŸ€–</span>
305
  <span>Stack 2.9</span>
306
+ <p>Your pattern-learning AI companion</p>
307
  </div>
308
  <div class="footer-links">
309
  <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
 
335
  labels: ['HumanEval', 'MBPP', 'SWE-bench', 'Tool Use'],
336
  datasets: [
337
  {
338
+ label: 'Stack 2.9 (pending verification)',
339
+ data: [0, 0, 0, 0],
340
  backgroundColor: '#6366f1',
341
  borderRadius: 8,
342
  },
 
404
  labels: ['Base', '10 convos', '50 convos', '100 convos', '200 convos', '500 convos'],
405
  datasets: [
406
  {
407
+ label: 'Stack 2.9 (evaluation pending)',
408
+ data: [null, null, null, null, null, null],
409
  borderColor: '#6366f1',
410
  backgroundColor: 'rgba(99, 102, 241, 0.1)',
411
  fill: true,
website/index.html CHANGED
@@ -3,7 +3,7 @@
3
  <head>
4
  <meta charset="UTF-8">
5
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
- <title>Stack 2.9 β€” Your Self-Evolving AI Companion</title>
7
  <link rel="stylesheet" href="styles.css">
8
  <link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><text y='.9em' font-size='90'>πŸ€–</text></svg>">
9
  <meta name="description" content="Stack 2.9 - Open-source AI that learns, adapts, and improves itself over time. Built on Qwen2.5-Coder-32B.">
@@ -83,7 +83,7 @@
83
  <div class="features-grid">
84
  <div class="feature-card">
85
  <div class="feature-icon">🧠</div>
86
- <h3>Self-Evolving</h3>
87
  <p>Learns from every conversation and task. Improves its own capabilities through experience. Gets smarter the more you use it.</p>
88
  </div>
89
  <div class="feature-card">
@@ -121,24 +121,24 @@
121
  <p class="section-subtitle">Competitive results on standard coding benchmarks</p>
122
  <div class="benchmark-grid">
123
  <div class="benchmark-card">
124
- <div class="benchmark-value">76.8%</div>
125
  <div class="benchmark-label">HumanEval</div>
126
  <div class="benchmark-bar">
127
- <div class="benchmark-fill" style="width: 76.8%"></div>
128
  </div>
129
  </div>
130
  <div class="benchmark-card">
131
- <div class="benchmark-value">82.3%</div>
132
  <div class="benchmark-label">MBPP</div>
133
  <div class="benchmark-bar">
134
- <div class="benchmark-fill" style="width: 82.3%"></div>
135
  </div>
136
  </div>
137
  <div class="benchmark-card highlight">
138
- <div class="benchmark-value">94.1%</div>
139
  <div class="benchmark-label">Tool Use</div>
140
  <div class="benchmark-bar">
141
- <div class="benchmark-fill" style="width: 94.1%"></div>
142
  </div>
143
  </div>
144
  <div class="benchmark-card">
@@ -184,7 +184,7 @@
184
 
185
  <section class="how-it-works">
186
  <div class="container">
187
- <h2 class="section-title">How Self-Evolution Works</h2>
188
  <div class="steps">
189
  <div class="step">
190
  <div class="step-number">1</div>
@@ -212,7 +212,7 @@
212
  <div class="step-arrow">β†’</div>
213
  <div class="step">
214
  <div class="step-number">5</div>
215
- <h3>Evolve</h3>
216
  <p>Gradually becomes smarter</p>
217
  </div>
218
  </div>
@@ -257,7 +257,7 @@
257
  <div class="footer-brand">
258
  <span class="logo-icon">πŸ€–</span>
259
  <span>Stack 2.9</span>
260
- <p>Your self-evolving AI companion</p>
261
  </div>
262
  <div class="footer-links">
263
  <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
 
3
  <head>
4
  <meta charset="UTF-8">
5
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Stack 2.9 β€” Your Pattern-Learning AI Companion</title>
7
  <link rel="stylesheet" href="styles.css">
8
  <link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><text y='.9em' font-size='90'>πŸ€–</text></svg>">
9
  <meta name="description" content="Stack 2.9 - Open-source AI that learns, adapts, and improves itself over time. Built on Qwen2.5-Coder-32B.">
 
83
  <div class="features-grid">
84
  <div class="feature-card">
85
  <div class="feature-icon">🧠</div>
86
+ <h3>Pattern Learning</h3>
87
  <p>Learns from every conversation and task. Improves its own capabilities through experience. Gets smarter the more you use it.</p>
88
  </div>
89
  <div class="feature-card">
 
121
  <p class="section-subtitle">Competitive results on standard coding benchmarks</p>
122
  <div class="benchmark-grid">
123
  <div class="benchmark-card">
124
+ <div class="benchmark-value">TBD</div>
125
  <div class="benchmark-label">HumanEval</div>
126
  <div class="benchmark-bar">
127
+ <div class="benchmark-fill" style="width: 0%"></div>
128
  </div>
129
  </div>
130
  <div class="benchmark-card">
131
+ <div class="benchmark-value">TBD</div>
132
  <div class="benchmark-label">MBPP</div>
133
  <div class="benchmark-bar">
134
+ <div class="benchmark-fill" style="width: 0%"></div>
135
  </div>
136
  </div>
137
  <div class="benchmark-card highlight">
138
+ <div class="benchmark-value">TBD</div>
139
  <div class="benchmark-label">Tool Use</div>
140
  <div class="benchmark-bar">
141
+ <div class="benchmark-fill" style="width: 0%"></div>
142
  </div>
143
  </div>
144
  <div class="benchmark-card">
 
184
 
185
  <section class="how-it-works">
186
  <div class="container">
187
+ <h2 class="section-title">How Pattern Learning Works</h2>
188
  <div class="steps">
189
  <div class="step">
190
  <div class="step-number">1</div>
 
212
  <div class="step-arrow">β†’</div>
213
  <div class="step">
214
  <div class="step-number">5</div>
215
+ <h3>Improve</h3>
216
  <p>Gradually becomes smarter</p>
217
  </div>
218
  </div>
 
257
  <div class="footer-brand">
258
  <span class="logo-icon">πŸ€–</span>
259
  <span>Stack 2.9</span>
260
+ <p>Your pattern-learning AI companion</p>
261
  </div>
262
  <div class="footer-links">
263
  <a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>