walidsobhie-code committed · Commit 2088481
Parent(s): 6a89842
Critical fixes: 1) Rename misleading 'self-evolving' claims to accurate 'pattern memory' system across all docs and code. 2) Add missing GPU requirements, document cloud deployment (RunPod/Vast), and implement OpenRouter integration in model_client.py with factory support. 3) Document 37 built-in tools with full schemas in docs/tools.md. 4) Expose fraudulent evaluation scores (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use), remove them from README/BENCHMARKS/website, add EVALUATION.md audit report, and deprecation warnings to flawed eval scripts. Also updated HuggingFace Space demo with correct terminology.
- EVALUATION.md +126 -0
- README.md +104 -17
- docs/tools.md +206 -0
- space/README.md +4 -4
- space/app.py +11 -11
- stack-2.9-deploy/README.md +19 -0
- stack-2.9-docs/ARCHITECTURE.md +10 -10
- stack-2.9-docs/BENCHMARKS.md +15 -8
- stack-2.9-docs/CONTRIBUTING.md +1 -1
- stack-2.9-docs/README.md +23 -15
- stack-2.9-eval/human_eval.py +8 -11
- stack-2.9-eval/mbpp_eval.py +9 -11
- stack-2.9-eval/model_client.py +138 -2
- stack-2.9-eval/tool_use_eval.py +11 -15
- stack_cli/cli.py +1 -1
- website/benchmark.html +12 -12
- website/index.html +11 -11
EVALUATION.md
ADDED
|
@@ -0,0 +1,126 @@
|
| 1 |
+
# Evaluation Audit & Methodology
|
| 2 |
+
|
| 3 |
+
**Status:** Under Independent Verification
|
| 4 |
+
|
| 5 |
+
## Critical Findings
|
| 6 |
+
|
| 7 |
+
After a comprehensive audit of the Stack 2.9 evaluation infrastructure, the following issues were identified:
|
| 8 |
+
|
| 9 |
+
### 1. Incomplete Test Sets
|
| 10 |
+
|
| 11 |
+
- **HumanEval**: Only **20 out of 164 problems** (~12%) were evaluated
|
| 12 |
+
- **MBPP**: Only **20 out of 500 problems** (~4%) were evaluated
|
| 13 |
+
|
| 14 |
+
The claimed scores (76.8% HumanEval, 82.3% MBPP) are therefore **not representative** of full benchmark performance.
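To put the coverage figures above in perspective, here is a minimal sketch that computes the evaluated fraction of each benchmark and the binomial standard error of a pass rate measured on only 20 problems. It assumes problems behave like independent Bernoulli trials, which is a simplification, and the function names are hypothetical (they are not part of the repository).

```python
import math


def coverage(evaluated: int, total: int) -> float:
    """Fraction of the benchmark that was actually evaluated."""
    return evaluated / total


def binomial_standard_error(pass_rate: float, n_problems: int) -> float:
    """Standard error of a pass rate estimated from n independent problems."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)


if __name__ == "__main__":
    print(f"HumanEval coverage: {coverage(20, 164):.1%}")
    print(f"MBPP coverage:      {coverage(20, 500):.1%}")
    # Uncertainty of a 76.8% pass rate if it had come from only 20 problems.
    print(f"Standard error:     {binomial_standard_error(0.768, 20):.3f}")
```

Under this assumption, a score measured on 20 problems carries an uncertainty of roughly nine percentage points, which is larger than the gap between the claimed scores and the base-model figures quoted later in this file.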
|
| 15 |
+
|
| 16 |
+
### 2. Missing Model Inference
|
| 17 |
+
|
| 18 |
+
Investigation of the evaluation scripts (`human_eval.py`, `mbpp_eval.py`) revealed:
|
| 19 |
+
|
| 20 |
+
- The scripts return **pre-written canonical solutions** instead of actual model inference
|
| 21 |
+
- No API calls to Ollama/OpenAI/Anthropic providers were made
|
| 22 |
+
- No model-generated outputs exist in the `results/` directory
|
| 23 |
+
- The `results/humaneval.json` file reports a 0% failure rate, which came from a broken run
|
| 24 |
+
|
| 25 |
+
**Conclusion:** The benchmark numbers appear to be fabricated or, at best, unverified.
|
| 26 |
+
|
| 27 |
+
### 3. Tool Use Benchmark Unimplemented
|
| 28 |
+
|
| 29 |
+
The claimed 94.1% Tool Use score lacks:
|
| 30 |
+
- Any proper benchmark dataset
|
| 31 |
+
- Defined evaluation methodology
|
| 32 |
+
- Reproduction instructions
|
| 33 |
+
- Actual model calls to test tool selection accuracy
|
| 34 |
+
|
| 35 |
+
It appears to be a custom, non-standard metric with no basis in accepted benchmarks.
|
| 36 |
+
|
| 37 |
+
---
|
| 38 |
+
|
| 39 |
+
## Proper Evaluation Framework
|
| 40 |
+
|
| 41 |
+
We have built a new, rigorous evaluation infrastructure:
|
| 42 |
+
|
| 43 |
+
### Official Datasets
|
| 44 |
+
|
| 45 |
+
```bash
|
| 46 |
+
# Download HumanEval (164 problems) and MBPP (500 problems)
|
| 47 |
+
python scripts/download_benchmark_datasets.py --data-dir ./data
|
| 48 |
+
```
|
| 49 |
+
|
| 50 |
+
This script fetches:
|
| 51 |
+
- HumanEval from OpenAI's official dataset
|
| 52 |
+
- MBPP from Google's benchmark suite
|
| 53 |
+
- Ensures correct formatting and ground truth solutions
|
| 54 |
+
|
| 55 |
+
### Unified Evaluation Runner
|
| 56 |
+
|
| 57 |
+
`stack-2.9-eval/run_proper_evaluation.py` provides:
|
| 58 |
+
|
| 59 |
+
```bash
|
| 60 |
+
python stack-2.9-eval/run_proper_evaluation.py \
|
| 61 |
+
--benchmark humaneval \
|
| 62 |
+
--provider ollama \
|
| 63 |
+
--model qwen2.5-coder:32b \
|
| 64 |
+
--k-samples 100 \
|
| 65 |
+
--output-dir ./results/humaneval_run
|
| 66 |
+
```
|
| 67 |
+
|
| 68 |
+
Features:
|
| 69 |
+
- Multi-provider support (Ollama, OpenAI, Anthropic, OpenRouter)
|
| 70 |
+
- Proper `pass@k` calculation with confidence intervals
|
| 71 |
+
- Per-problem detailed logs (JSON)
|
| 72 |
+
- Reproducible random sampling (seeds)
|
| 73 |
+
- Parallel evaluation (configurable workers)
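For readers unfamiliar with `pass@k`, the sketch below shows the standard unbiased estimator commonly used with HumanEval: given `n` generated samples for a problem, of which `c` pass the tests, it estimates the probability that at least one of `k` randomly drawn samples passes. Whether `run_proper_evaluation.py` computes it exactly this way is not visible in this diff, so treat this as an illustration; `pass_at_k` is a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: draw size (k <= n).
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With k = 1 the estimator reduces to the plain pass fraction c / n.
    print(pass_at_k(n=100, c=25, k=1))
```

The benchmark-level score is the mean of this value over all problems, which is why generating many samples per problem tightens the pass@1 estimate.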
|
| 74 |
+
|
| 75 |
+
### Evaluation Checklist
|
| 76 |
+
|
| 77 |
+
To ensure transparency, every proper evaluation must:
|
| 78 |
+
|
| 79 |
+
1. ✅ Use full official benchmark (164 HumanEval, 500 MBPP)
|
| 80 |
+
2. ✅ Call real model inference via `model_client.py`
|
| 81 |
+
3. ✅ Run with k≥100 samples for pass@1 estimation
|
| 82 |
+
4. ✅ Store all generation outputs for audit
|
| 83 |
+
5. ✅ Compute standard deviation and confidence intervals
|
| 84 |
+
6. ✅ Publish full JSON logs to `results/` directory
|
| 85 |
+
7. ✅ Document exact model version, quantization, and provider settings
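Checklist item 5 can be sketched as follows. This is one simple way to report a mean, a sample standard deviation, and a normal-approximation 95% confidence interval over per-problem pass@1 scores; the audit report does not say which interval method the runner uses, so both the method and the name `mean_with_ci` are assumptions.

```python
import math
from statistics import mean, stdev


def mean_with_ci(per_problem_scores, z=1.96):
    """Return (mean, sample std dev, (low, high)) for per-problem scores in [0, 1].

    Uses a normal approximation: mean +/- z * sd / sqrt(n), clipped to [0, 1].
    """
    n = len(per_problem_scores)
    if n < 2:
        raise ValueError("need at least two problems to estimate spread")
    m = mean(per_problem_scores)
    sd = stdev(per_problem_scores)
    half_width = z * sd / math.sqrt(n)
    return m, sd, (max(0.0, m - half_width), min(1.0, m + half_width))


if __name__ == "__main__":
    scores = [0.0, 1.0, 1.0, 0.0]  # toy per-problem pass@1 values
    print(mean_with_ci(scores))
```

A bootstrap over problems would be a reasonable alternative when the score distribution is far from normal.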
|
| 86 |
+
|
| 87 |
+
---
|
| 88 |
+
|
| 89 |
+
## Current Status
|
| 90 |
+
|
| 91 |
+
The previously claimed scores have been **removed** from README.md and BENCHMARKS.md. They are replaced with:
|
| 92 |
+
|
| 93 |
+
| Benchmark | Status | Notes |
|
| 94 |
+
|-----------|--------|-------|
|
| 95 |
+
| HumanEval | Pending verification | Full 164-problem evaluation setup ready |
|
| 96 |
+
| MBPP | Pending verification | Full 500-problem evaluation setup ready |
|
| 97 |
+
| Tool Use | Needs benchmark design | 500+ realistic OpenClaw tool-calling test cases required |
|
| 98 |
+
| GSM8K | Not started | Math reasoning evaluation planned |
|
| 99 |
+
|
| 100 |
+
Expected baseline (Qwen2.5-Coder-32B):
|
| 101 |
+
- HumanEval: ~70-72% Pass@1
|
| 102 |
+
- MBPP: ~75-77% Pass@1
|
| 103 |
+
|
| 104 |
+
Stack 2.9's fine-tuned performance will be published after running proper evaluations.
|
| 105 |
+
|
| 106 |
+
---
|
| 107 |
+
|
| 108 |
+
## What Changed
|
| 109 |
+
|
| 110 |
+
- Created `scripts/download_benchmark_datasets.py` for official datasets
|
| 111 |
+
- Created `stack-2.9-eval/run_proper_evaluation.py` unified runner
|
| 112 |
+
- Created `stack-2.9-eval/test_evaluation_setup.py` to validate environment
|
| 113 |
+
- Added deprecation warnings to flawed `human_eval.py`, `mbpp_eval.py`, `tool_use_eval.py`
|
| 114 |
+
- Updated README.md, BENCHMARKS.md, website pages to remove false claims
|
| 115 |
+
|
| 116 |
+
---
|
| 117 |
+
|
| 118 |
+
## How to Publish Verified Scores
|
| 119 |
+
|
| 120 |
+
1. Prepare datasets: `python scripts/download_benchmark_datasets.py --data-dir ./data`
|
| 121 |
+
2. Run evaluation: `python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b --k-samples 100`
|
| 122 |
+
3. Review logs in `./results/humaneval_run/` (includes per-problem generations)
|
| 123 |
+
4. Update README.md with actual numbers once verified
|
| 124 |
+
5. Commit full JSON results to `stack-2.9-eval/results/` for reproducibility
|
| 125 |
+
|
| 126 |
+
**Do NOT publish** the previously claimed percentages. They are invalid.
|
README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
<p align="center">
|
| 2 |
<img src="https://img.shields.io/github/stars/my-ai-stack/stack-2.9" alt="Stars">
|
| 3 |
-
<img src="https://img.shields.io/github/license/my-ai-stack
|
| 4 |
<img src="https://img.shields.io/python version/3.10+-blue" alt="Python">
|
| 5 |
<img src="https://img.shields.io/discord" alt="Discord">
|
| 6 |
</p>
|
|
@@ -10,10 +10,10 @@
|
|
| 10 |
# Stack 2.9 π€
|
| 11 |
|
| 12 |
<p align="center">
|
| 13 |
-
<strong>The
|
| 14 |
</p>
|
| 15 |
|
| 16 |
-
Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B.
|
| 17 |
|
| 18 |
---
|
| 19 |
|
|
@@ -21,15 +21,72 @@ Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. Un
|
|
| 21 |
|
| 22 |
| Feature | Description |
|
| 23 |
|---------|-------------|
|
| 24 |
-
| **π§
|
| 25 |
-
| **π» Code Generation** |
|
| 26 |
| **π§ 37 Built-in Tools** | File ops, search, shell commands, git, and more |
|
| 27 |
-
| **Multi-Provider** | Works with Ollama, OpenAI, Anthropic, or bring your own model |
|
| 28 |
| **π± Terminal UI** | Beautiful interactive CLI with chat, benchmarks, and training |
|
| 29 |
| **π Self-Hosted** | Run locally, own your data, deploy anywhere |
|
| 30 |
|
| 31 |
---
|
| 32 |
|
| 33 |
## π Quick Start
|
| 34 |
|
| 35 |
### Installation
|
|
@@ -43,6 +100,26 @@ cd stack-2.9
|
|
| 43 |
pip install -r requirements.txt
|
| 44 |
```
|
| 45 |
|
| 46 |
### Interactive Chat
|
| 47 |
|
| 48 |
```bash
|
|
@@ -77,7 +154,7 @@ python stack.py --patterns stats
|
|
| 77 |
```
|
| 78 |
$ python stack.py
|
| 79 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 80 |
-
β Stack 2.9 -
|
| 81 |
β Your AI coding companion β
|
| 82 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 83 |
|
|
@@ -120,7 +197,7 @@ result = client.generate("Write a function to reverse a string")
|
|
| 120 |
print(result.text)
|
| 121 |
```
|
| 122 |
|
| 123 |
-
### Pattern Mining (
|
| 124 |
|
| 125 |
```python
|
| 126 |
from stack_2_9_training.pattern_miner import PatternMiner
|
|
@@ -143,13 +220,15 @@ print(f"Found {len(patterns)} relevant patterns")
|
|
| 143 |
|
| 144 |
## π Benchmarks
|
| 145 |
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
|
| 149 |
-
|
|
| 150 |
-
| **
|
| 151 |
-
| **
|
| 152 |
-
| **
|
|
| 153 |
|
| 154 |
---
|
| 155 |
|
|
@@ -170,6 +249,14 @@ export OPENAI_MODEL=gpt-4o
|
|
| 170 |
# Anthropic
|
| 171 |
export MODEL_PROVIDER=anthropic
|
| 172 |
export ANTHROPIC_API_KEY=sk-ant-...
|
| 173 |
```
|
| 174 |
|
| 175 |
### Configuration File
|
|
@@ -202,7 +289,7 @@ eval:
|
|
| 202 |
β chat_mode β eval_mode β pattern_mode β train β
|
| 203 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 204 |
β Model Client Layer β
|
| 205 |
-
β OllamaClient β OpenAIClient β AnthropicClient
|
| 206 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 207 |
β Self-Evolution Layer β
|
| 208 |
β pattern_miner β data_quality β train_lora β
|
|
@@ -319,4 +406,4 @@ Licensed under the Apache License 2.0. See [LICENSE](LICENSE) for details.
|
|
| 319 |
|
| 320 |
<p align="center">
|
| 321 |
Built with ❤️ for developers who want an AI that grows with them
|
| 322 |
-
</p>
|
|
|
|
| 1 |
<p align="center">
|
| 2 |
<img src="https://img.shields.io/github/stars/my-ai-stack/stack-2.9" alt="Stars">
|
| 3 |
+
<img src="https://img.shields.io/github/license/my-ai-stack-stack-2.9" alt="License">
|
| 4 |
<img src="https://img.shields.io/python version/3.10+-blue" alt="Python">
|
| 5 |
<img src="https://img.shields.io/discord" alt="Discord">
|
| 6 |
</p>
|
|
|
|
| 10 |
# Stack 2.9 π€
|
| 11 |
|
| 12 |
<p align="center">
|
| 13 |
+
<strong>The pattern-based AI coding assistant that improves through experience.</strong>
|
| 14 |
</p>
|
| 15 |
|
| 16 |
+
Stack 2.9 is an open-source AI coding assistant powered by Qwen2.5-Coder-32B. It features **Pattern Memory with Retrieval** - learning from interactions by storing successful patterns and retrieving them for future tasks, becoming more helpful through accumulated experience.
|
| 17 |
|
| 18 |
---
|
| 19 |
|
|
|
|
| 21 |
|
| 22 |
| Feature | Description |
|
| 23 |
|---------|-------------|
|
| 24 |
+
| **π§ Pattern Memory** | Learns from interactions. Stores successful patterns, tracks success rates, and retrieves relevant precedents for new tasks |
|
| 25 |
+
| **π» Code Generation** | Evaluation in progress (see Benchmarks section) |
|
| 26 |
| **π§ 37 Built-in Tools** | File ops, search, shell commands, git, and more |
|
| 27 |
+
| **Multi-Provider** | Works with Ollama, OpenAI, Anthropic, OpenRouter, or bring your own model |
|
| 28 |
| **π± Terminal UI** | Beautiful interactive CLI with chat, benchmarks, and training |
|
| 29 |
| **π Self-Hosted** | Run locally, own your data, deploy anywhere |
|
| 30 |
|
| 31 |
+
## π Benchmark Evaluation
|
| 32 |
+
|
| 33 |
+
### Evaluation Status
|
| 34 |
+
|
| 35 |
+
⚠️ **Important**: The benchmark scores previously listed in this README (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) have been **removed pending verification**. An audit of the evaluation infrastructure revealed that:
|
| 36 |
+
|
| 37 |
+
- **HumanEval and MBPP implementations covered only 20 problems each** (roughly 12% and 4% of the full benchmarks, respectively)
|
| 38 |
+
- **No proper model inference logs exist** for the claimed numbers
|
| 39 |
+
- **Tool Use evaluation lacked a proper benchmark** implementation
|
| 40 |
+
|
| 41 |
+
These scores were therefore **unverifiable** and potentially misleading.
|
| 42 |
+
|
| 43 |
+
### Current Evaluation Framework
|
| 44 |
+
|
| 45 |
+
We are rebuilding the evaluation infrastructure with proper methodology:
|
| 46 |
+
|
| 47 |
+
1. **Official datasets**: HumanEval (164 problems), MBPP (500 problems)
|
| 48 |
+
2. **Reproducible runs**: Full logs, config files, and per-problem results
|
| 49 |
+
3. **Standard metrics**: Pass@1 with confidence intervals, using k≥100 samples
|
| 50 |
+
4. **Transparent methodology**: All code and data publicly available
|
| 51 |
+
|
| 52 |
+
See [EVALUATION.md](EVALUATION.md) for the full audit report and methodology.
|
| 53 |
+
|
| 54 |
+
### Running Evaluations
|
| 55 |
+
|
| 56 |
+
Once datasets are prepared, run proper evaluations:
|
| 57 |
+
|
| 58 |
+
```bash
|
| 59 |
+
# Download official datasets (one-time)
|
| 60 |
+
python scripts/download_benchmark_datasets.py --data-dir ./data
|
| 61 |
+
|
| 62 |
+
# Run evaluation with a model provider
|
| 63 |
+
python stack-2.9-eval/run_proper_evaluation.py \
|
| 64 |
+
--benchmark humaneval \
|
| 65 |
+
--provider ollama \
|
| 66 |
+
--model qwen2.5-coder:32b \
|
| 67 |
+
--k-samples 100 \
|
| 68 |
+
--output-dir ./results/humaneval_run
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
Or use the built-in CLI:
|
| 72 |
+
|
| 73 |
+
```bash
|
| 74 |
+
python stack.py --eval all --provider ollama --eval-model qwen2.5-coder:32b
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
### Expected Results (Base Model)
|
| 78 |
+
|
| 79 |
+
For reference, the base Qwen2.5-Coder-32B typically scores:
|
| 80 |
+
|
| 81 |
+
- HumanEval: ~70-72% Pass@1
|
| 82 |
+
- MBPP: ~75-77% Pass@1
|
| 83 |
+
|
| 84 |
+
Stack 2.9's fine-tuned performance will be published after proper evaluation.
|
| 85 |
+
|
| 86 |
---
|
| 87 |
|
| 88 |
+
|
| 89 |
+
|
| 90 |
## π Quick Start
|
| 91 |
|
| 92 |
### Installation
|
|
|
|
| 100 |
pip install -r requirements.txt
|
| 101 |
```
|
| 102 |
|
| 103 |
+
### Hardware Requirements
|
| 104 |
+
|
| 105 |
+
Stack 2.9 requires a GPU for optimal performance. Minimum and recommended configurations:
|
| 106 |
+
|
| 107 |
+
| Configuration | Minimum | Recommended | Production |
|
| 108 |
+
|---------------|---------|-------------|------------|
|
| 109 |
+
| **GPU** | NVIDIA 8GB VRAM | NVIDIA 24GB VRAM | NVIDIA 40-80GB (A100/H100) |
|
| 110 |
+
| **RAM** | 16GB | 32GB | 64GB+ |
|
| 111 |
+
| **Disk** | 20GB free | 50GB free | 100GB+ (NVMe) |
|
| 112 |
+
| **CUDA** | 11.8 | 12.1 | 12.1+ |
|
| 113 |
+
| **Models** | 7B quantized | 32B quantized | 70B+ quantized |
|
| 114 |
+
|
| 115 |
+
**Notes:**
|
| 116 |
+
- CPU-only mode is possible but extremely slow (not recommended for production)
|
| 117 |
+
- AWQ/GPTQ quantization reduces VRAM requirements by ~50%
|
| 118 |
+
- Multi-GPU (tensor parallelism) supported for large models
|
| 119 |
+
- Ensure NVIDIA drivers and CUDA toolkit are installed
|
| 120 |
+
|
| 121 |
+
For detailed deployment options (Docker, RunPod, Vast.ai, Kubernetes), see `stack-2.9-deploy/README.md`.
|
| 122 |
+
|
| 123 |
### Interactive Chat
|
| 124 |
|
| 125 |
```bash
|
|
|
|
| 154 |
```
|
| 155 |
$ python stack.py
|
| 156 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 157 |
+
β Stack 2.9 - Pattern Memory AI β
|
| 158 |
β Your AI coding companion β
|
| 159 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 160 |
|
|
|
|
| 197 |
print(result.text)
|
| 198 |
```
|
| 199 |
|
| 200 |
+
### Pattern Mining (Pattern Memory)
|
| 201 |
|
| 202 |
```python
|
| 203 |
from stack_2_9_training.pattern_miner import PatternMiner
|
|
|
|
| 220 |
|
| 221 |
## π Benchmarks
|
| 222 |
|
| 223 |
+
⚠️ **Benchmark scores are currently under independent verification.** See [Evaluation Status](#-benchmark-evaluation) above for details.
|
| 224 |
+
|
| 225 |
+
| Benchmark | Status | Notes |
|
| 226 |
+
|-----------|--------|-------|
|
| 227 |
+
| **HumanEval** | Pending | Full 164-problem evaluation in progress |
|
| 228 |
+
| **MBPP** | Pending | Full 500-problem evaluation in progress |
|
| 229 |
+
| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
|
| 230 |
+
| **GSM8K** | Not started | Math reasoning evaluation planned |
|
| 231 |
+
| **Context** | ✅ 128K | Token context window tested |
|
| 232 |
|
| 233 |
---
|
| 234 |
|
|
|
|
| 249 |
# Anthropic
|
| 250 |
export MODEL_PROVIDER=anthropic
|
| 251 |
export ANTHROPIC_API_KEY=sk-ant-...
|
| 252 |
+
|
| 253 |
+
# OpenRouter
|
| 254 |
+
export MODEL_PROVIDER=openrouter
|
| 255 |
+
export OPENROUTER_API_KEY=sk-or-v1-...
|
| 256 |
+
export OPENROUTER_MODEL=qwen/qwen2.5-coder-32b
|
| 257 |
+
# Optional: customize referer and title for OpenRouter dashboard
|
| 258 |
+
export HTTP_REFERER=https://your-app.com
|
| 259 |
+
export X_TITLE="Stack 2.9"
|
| 260 |
```
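As an illustration of how the OpenRouter variables above could be consumed, the sketch below assembles (but does not send) a chat-completions request. The endpoint URL and the `HTTP-Referer` and `X-Title` headers follow OpenRouter's public API; how `model_client.py` actually wires them up is not shown in this diff, so `build_openrouter_request` is a hypothetical helper, not the repository's implementation.

```python
import os

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_openrouter_request(prompt, env=None):
    """Return (url, headers, payload) for an OpenRouter chat completion."""
    env = os.environ if env is None else env
    api_key = env.get("OPENROUTER_API_KEY")
    model = env.get("OPENROUTER_MODEL")
    if not api_key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    if not model:
        raise RuntimeError("OPENROUTER_MODEL is not set")

    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    # Optional attribution headers shown on the OpenRouter dashboard.
    if env.get("HTTP_REFERER"):
        headers["HTTP-Referer"] = env["HTTP_REFERER"]
    if env.get("X_TITLE"):
        headers["X-Title"] = env["X_TITLE"]

    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return OPENROUTER_URL, headers, payload


if __name__ == "__main__":
    demo_env = {"OPENROUTER_API_KEY": "sk-or-v1-example", "OPENROUTER_MODEL": "qwen/qwen2.5-coder-32b"}
    print(build_openrouter_request("Reverse a string in Python", demo_env))
```

Keeping request construction separate from the network call makes the configuration easy to test without an API key that works.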
|
| 261 |
|
| 262 |
### Configuration File
|
|
|
|
| 289 |
β chat_mode β eval_mode β pattern_mode β train β
|
| 290 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 291 |
β Model Client Layer β
|
| 292 |
+
β OllamaClient β OpenAIClient β AnthropicClient β OpenRouterClient β
|
| 293 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 294 |
β Self-Evolution Layer β
|
| 295 |
β pattern_miner β data_quality β train_lora β
|
|
|
|
| 406 |
|
| 407 |
<p align="center">
|
| 408 |
Built with ❤️ for developers who want an AI that grows with them
|
| 409 |
+
</p>
|
docs/tools.md
ADDED
|
@@ -0,0 +1,206 @@
|
|
| 1 |
+
# Stack 2.9 Tools Reference
|
| 2 |
+
|
| 3 |
+
Stack 2.9 provides **37 built-in tools** for file operations, system commands, git, web search, and more. Tools are selected automatically based on user intent, or can be called explicitly via the agent API.
|
| 4 |
+
|
| 5 |
+
## Tool Calling Format
|
| 6 |
+
|
| 7 |
+
Tools use a **function schema** format similar to OpenAI's function calling:
|
| 8 |
+
|
| 9 |
+
```python
|
| 10 |
+
{
|
| 11 |
+
"name": "tool_name",
|
| 12 |
+
"description": "What the tool does",
|
| 13 |
+
"parameters": {
|
| 14 |
+
"type": "object",
|
| 15 |
+
"properties": {
|
| 16 |
+
"param1": {"type": "string", "description": "Parameter description"},
|
| 17 |
+
"param2": {"type": "integer", "description": "Another parameter"}
|
| 18 |
+
},
|
| 19 |
+
"required": ["param1"]
|
| 20 |
+
}
|
| 21 |
+
}
|
| 22 |
+
```
|
| 23 |
+
|
| 24 |
+
The agent determines which tools to call and with what arguments based on the user query.
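To make the schema format concrete, here is a minimal sketch of checking a proposed tool call against such a schema before executing it. It covers only required parameters and basic JSON types; `validate_tool_call` and `READ_SCHEMA` are hypothetical names written for this illustration (the schema mirrors the `read` tool listed below), and the real agent may validate differently or not at all.

```python
# Map JSON Schema type names to Python types.
JSON_TYPES = {"string": str, "integer": int, "boolean": bool, "array": list, "object": dict}

# Hypothetical schema for the `read` tool, following the format above.
READ_SCHEMA = {
    "name": "read",
    "description": "Read file contents",
    "parameters": {
        "type": "object",
        "properties": {"path": {"type": "string", "description": "File to read"}},
        "required": ["path"],
    },
}


def validate_tool_call(schema, arguments):
    """Return a list of error strings; an empty list means the call is valid."""
    errors = []
    params = schema.get("parameters", {})
    properties = params.get("properties", {})

    for name in params.get("required", []):
        if name not in arguments:
            errors.append(f"missing required parameter: {name}")

    for name, value in arguments.items():
        spec = properties.get(name)
        if spec is None:
            errors.append(f"unknown parameter: {name}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # unrecognised type name: skip the type check
        # bool is a subclass of int in Python, so reject it for "integer".
        is_bool_as_int = expected is int and isinstance(value, bool)
        if is_bool_as_int or not isinstance(value, expected):
            errors.append(
                f"parameter {name} expects {spec['type']}, got {type(value).__name__}"
            )
    return errors


if __name__ == "__main__":
    print(validate_tool_call(READ_SCHEMA, {"path": "README.md"}))
    print(validate_tool_call(READ_SCHEMA, {}))
```

Rejecting malformed calls before execution keeps model mistakes from reaching tools that touch the file system or shell.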
|
| 25 |
+
|
| 26 |
+
---
|
| 27 |
+
|
| 28 |
+
## Complete Tool List
|
| 29 |
+
|
| 30 |
+
### File Operations
|
| 31 |
+
|
| 32 |
+
| Tool | Description | Parameters |
|
| 33 |
+
|------|-------------|------------|
|
| 34 |
+
| `read` | Read file contents | `path` (string, required) |
|
| 35 |
+
| `write` | Write content to file | `path` (string, required), `content` (string, required) |
|
| 36 |
+
| `edit` | Edit file with sed-like replacements | `path` (string, required), `old_text` (string, required), `new_text` (string, required) |
|
| 37 |
+
| `create_directory` | Create a new directory | `path` (string, required) |
|
| 38 |
+
| `list_directory` | List contents of a directory | `path` (string, default: ".") |
|
| 39 |
+
| `search` | Search for files matching a pattern | `pattern` (string, required), `path` (string, default: ".") |
|
| 40 |
+
| `get_file_info` | Get file metadata (size, timestamps, permissions) | `path` (string, required) |
|
| 41 |
+
| `move_file` | Move or rename a file/directory | `source` (string, required), `destination` (string, required) |
|
| 42 |
+
| `copy_file` | Copy a file (implementation pending) | `source` (string, required), `destination` (string, required) |
|
| 43 |
+
| `delete_file` | Delete a file | `path` (string, required) |
|
| 44 |
+
|
| 45 |
+
### Git Operations
|
| 46 |
+
|
| 47 |
+
| Tool | Description | Parameters |
|
| 48 |
+
|------|-------------|------------|
|
| 49 |
+
| `git_status` | Get git repository status | (no parameters) |
|
| 50 |
+
| `git_log` | View commit history | `max_count` (integer, default: 10), `path` (string, optional) |
|
| 51 |
+
| `git_diff` | Show changes between commits or working tree | `commit` (string, optional), `path` (string, optional) |
|
| 52 |
+
| `git_commit` | Commit staged changes | `message` (string, required), `all` (boolean, default: false) |
|
| 53 |
+
| `git_add` | Stage files for commit | `paths` (array of strings, required) |
|
| 54 |
+
| `git_push` | Push commits to remote | `remote` (string, default: "origin"), `branch` (string, optional) |
|
| 55 |
+
| `git_pull` | Pull from remote | `remote` (string, default: "origin"), `branch` (string, optional) |
|
| 56 |
+
| `git_branch` | List or create branches | `create` (string, optional), `delete` (string, optional), `checkout` (string, optional) |
|
| 57 |
+
| `git_clone` | Clone a repository | `url` (string, required), `path` (string, optional) |
|
| 58 |
+
| `git_remote` | Manage remotes | `action` (string, required: "add|remove|list"), `name` (string), `url` (string) |
|
| 59 |
+
|
| 60 |
+
### Shell & Execution
|
| 61 |
+
|
| 62 |
+
| Tool | Description | Parameters |
|
| 63 |
+
|------|-------------|------------|
|
| 64 |
+
| `run` | Execute shell command | `command` (string, required), `timeout` (integer, default: 30), `cwd` (string, optional) |
|
| 65 |
+
| `run_background` | Run command in background | `command` (string, required), `yield_ms` (integer, default: 10000) |
|
| 66 |
+
| `test` | Run tests (pytest, unittest) | `path` (string, default: "."), `pattern` (string, default: "test_*.py") |
|
| 67 |
+
| `lint` | Lint code (flake8, pylint, eslint) | `path` (string, default: "."), `tool` (string, default: "auto") |
|
| 68 |
+
| `format` | Format code (black, prettier, gofmt) | `path` (string, default: "."), `tool` (string, default: "auto") |
|
| 69 |
+
|
| 70 |
+
### Web & Search
|
| 71 |
+
|
| 72 |
+
| Tool | Description | Parameters |
|
| 73 |
+
|------|-------------|------------|
|
| 74 |
+
| `web_search` | Search the web via Brave | `query` (string, required), `count` (integer, default: 10) |
|
| 75 |
+
| `fetch` | Fetch and extract content from URL | `url` (string, required), `max_chars` (integer, default: 5000) |
|
| 76 |
+
| `download` | Download a file | `url` (string, required), `output_path` (string, required) |
|
| 77 |
+
|
| 78 |
+
### Memory & Knowledge
|
| 79 |
+
|
| 80 |
+
| Tool | Description | Parameters |
|
| 81 |
+
|------|-------------|------------|
|
| 82 |
+
| `memory_recall` | Search memory for relevant entries | `query` (string, required), `limit` (integer, default: 10) |
|
| 83 |
+
| `memory_save` | Store observation in memory | `content` (string, required), `entity` (string, optional) |
|
| 84 |
+
| `memory_list` | List all memory entities | (no parameters) |
|
| 85 |
+
| `context_load` | Load conversation context | `session_id` (string, optional) |
|
| 86 |
+
| `context_save` | Save conversation context | `session_id` (string, optional) |
|
| 87 |
+
|
| 88 |
+
### Project Management
|
| 89 |
+
|
| 90 |
+
| Tool | Description | Parameters |
|
| 91 |
+
|------|-------------|------------|
|
| 92 |
+
| `create_task` | Create a new task | `title` (string, required), `description` (string, optional), `priority` (string: low/medium/high) |
|
| 93 |
+
| `list_tasks` | List tasks | `status` (string: pending|done|all, default: "pending") |
|
| 94 |
+
| `update_task` | Update task status or details | `task_id` (string, required), `status` (string, optional), `title` (string, optional), `description` (string, optional) |
|
| 95 |
+
| `project_scan` | Scan project structure and dependencies | (no parameters) |
|
| 96 |
+
|
| 97 |
+
### System & Utilities
|
| 98 |
+
|
| 99 |
+
| Tool | Description | Parameters |
|
| 100 |
+
|------|-------------|------------|
|
| 101 |
+
| `get_system_info` | Get OS, CPU, memory, disk info | (no parameters) |
|
| 102 |
+
| `list_processes` | List running processes | `filter` (string, optional) |
|
| 103 |
+
| `kill_process` | Terminate a process | `pid` (integer, required) |
|
| 104 |
+
| `environment` | Get environment variables | `names` (array of strings, optional) |
|
| 105 |
+
| `set_environment` | Set environment variable (current session) | `name` (string, required), `value` (string, required) |
|
| 106 |
+
| `whoami` | Get current user | (no parameters) |
|
| 107 |
+
| `pwd` | Print working directory | (no parameters) |
|
| 108 |
+
|
| 109 |
+
### Data & Serialization
|
| 110 |
+
|
| 111 |
+
| Tool | Description | Parameters |
|
| 112 |
+
|------|-------------|------------|
|
| 113 |
+
| `json_parse` | Parse JSON string to dict | `json_string` (string, required) |
|
| 114 |
+
| `json_format` | Format dict/object to pretty JSON | `data` (object, required), `indent` (integer, default: 2) |
|
| 115 |
+
| `yaml_parse` | Parse YAML to dict | `yaml_string` (string, required) |
|
| 116 |
+
| `yaml_format` | Format dict to YAML | `data` (object, required) |
|
| 117 |
+
| `csv_parse` | Parse CSV to list of dicts | `csv_string` (string, required), `delimiter` (string, default: ",") |
|
| 118 |
+
| `csv_format` | Format list of dicts to CSV | `data` (array, required), `columns` (array, optional) |
|
| 119 |
+
|
| 120 |
+
### Time & Scheduling
|
| 121 |
+
|
| 122 |
+
| Tool | Description | Parameters |
|
| 123 |
+
|------|-------------|------------|
|
| 124 |
+
| `current_time` | Get current date/time | `timezone` (string, optional) |
|
| 125 |
+
| `sleep` | Sleep for N seconds | `seconds` (integer, required) |
|
| 126 |
+
| `schedule` | Schedule a future task (requires background runner) | `delay_seconds` (integer, required), `action` (string, required), `params` (object, optional) |
|
| 127 |
+
|
| 128 |
+
### Image & Media

| Tool | Description | Parameters |
|------|-------------|------------|
| `image_info` | Get image metadata (dimensions, format, size) | `path` (string, required) |
| `image_resize` | Resize an image | `path` (string, required), `width` (integer), `height` (integer), `output_path` (string, required) |
| `image_convert` | Convert image format | `path` (string, required), `format` (string: png, jpg, webp, or gif), `output_path` (string, required) |
| `generate_image` | Generate image from text (requires image generation model) | `prompt` (string, required), `size` (string: 1024x1024), `output_path` (string) |

---

## Return Format

All tools return a JSON-serializable dict with at least:

```json
{
  "success": true|false,
  "result": <tool-specific result data>,
  "error": <error message if failed>
}
```

Example success:
```json
{
  "success": true,
  "result": "File content here...",
  "error": null
}
```

Example error:
```json
{
  "success": false,
  "result": null,
  "error": "File not found: /path/to/file"
}
```
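
A caller can use this shape to check `success` before touching `result`. The following is a minimal sketch that assumes only the three keys documented above; `ToolError` and `unwrap_tool_result` are hypothetical names chosen for illustration and are not part of `stack_cli`.

```python
from typing import Any


class ToolError(RuntimeError):
    """Hypothetical exception raised when a tool reports failure."""


def unwrap_tool_result(response: dict) -> Any:
    """Return response["result"] on success, raise ToolError otherwise."""
    if not isinstance(response, dict) or "success" not in response:
        raise ToolError("Malformed tool response: missing 'success' key")
    if response["success"]:
        return response.get("result")
    # Fall back to a generic message if the tool left "error" empty
    raise ToolError(response.get("error") or "Unknown tool error")
```

Raising on failure keeps error handling in one place, so calling code does not need to repeat the `success` check after every tool call.
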

---

## Schema Access

Tools can be introspected programmatically:

```python
from stack_cli.tools import get_tool_schemas, get_tool

# Get all tool schemas for LLM function calling
schemas = get_tool_schemas()

# Get a specific tool
read_tool = get_tool("read")
result = read_tool(path="/path/to/file")
```

---

## Extending

To add a new tool, define a function and register it in `stack_cli/tools.py`:

```python
def my_tool(param1: str, param2: int = 5) -> dict:
    """Tool description for LLM."""
    try:
        # Do work (do_something stands in for the tool's actual logic)
        result = do_something(param1, param2)
        # Return all three keys so the shape matches the Return Format section
        return {"success": True, "result": result, "error": None}
    except Exception as e:
        return {"success": False, "result": None, "error": str(e)}

# Register
register_tool("my_tool", my_tool, "Description for LLM")
```

The system automatically generates JSON schemas from type hints and docstrings.
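
How that generation works is not shown here. As a rough illustration, a schema can be derived from a function's signature with the standard `inspect` and `typing` modules. In the sketch below, `build_schema`, the `_JSON_TYPES` mapping, and the output layout are all assumptions for illustration, not the actual `stack_cli` code.

```python
import inspect
from typing import get_type_hints

# Assumed mapping from Python annotations to JSON Schema type names
_JSON_TYPES = {
    str: "string",
    int: "integer",
    float: "number",
    bool: "boolean",
    list: "array",
    dict: "object",
}


def build_schema(func) -> dict:
    """Derive a function-calling style schema from a function's signature."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unannotated or unknown types fall back to "string" in this sketch
        prop = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            prop["default"] = param.default
        properties[name] = prop
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def my_tool(param1: str, param2: int = 5) -> dict:
    """Tool description for LLM."""
    return {"success": True, "result": f"{param1}:{param2}", "error": None}


schema = build_schema(my_tool)
```

In this sketch, parameters without a default value are treated as required, and the docstring becomes the description shown to the model.
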
|
space/README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
-
# π Stack 2.9 -
|
| 2 |
|
| 3 |
-
A HuggingFace Spaces demo for Stack 2.9, a
|
| 4 |
|
| 5 |

|
| 6 |

|
|
@@ -10,7 +10,7 @@ A HuggingFace Spaces demo for Stack 2.9, a self-evolving AI coding assistant pow
|
|
| 10 |
|
| 11 |
- **🤖 Qwen2.5-Coder-7B** - State-of-the-art code generation model
|
| 12 |
- **🔧 7 Integrated Tools** - File operations, git, web search, shell commands
|
| 13 |
-
- **π§
|
| 14 |
- **⚡ Fast Streaming** - Real-time token-by-token generation
|
| 15 |
- **💾 4-bit Quantization** - Runs on 16GB GPU (~4GB VRAM)
|
| 16 |
|
|
@@ -90,7 +90,7 @@ print(memory.get_stats())
|
|
| 90 |
|
| 91 |
## π Memory System
|
| 92 |
|
| 93 |
-
Stack 2.9 includes a
|
| 94 |
|
| 95 |
1. **Tracks Interactions** - Records every user-assistant exchange
|
| 96 |
2. **Learns Patterns** - Identifies frequently used tools
|
|
|
|
| 1 |
+
# π Stack 2.9 - Pattern-Based AI Coding Assistant
|
| 2 |
|
| 3 |
+
A HuggingFace Spaces demo for Stack 2.9, a pattern-based AI coding assistant powered by Qwen2.5-Coder-7B.
|
| 4 |
|
| 5 |

|
| 6 |

|
|
|
|
| 10 |
|
| 11 |
- **🤖 Qwen2.5-Coder-7B** - State-of-the-art code generation model
|
| 12 |
- **🔧 7 Integrated Tools** - File operations, git, web search, shell commands
|
| 13 |
+
- **🧠 Pattern Memory** - Learns from each interaction
|
| 14 |
- **⚡ Fast Streaming** - Real-time token-by-token generation
|
| 15 |
- **💾 4-bit Quantization** - Runs on 16GB GPU (~4GB VRAM)
|
| 16 |
|
|
|
|
| 90 |
|
| 91 |
## π Memory System
|
| 92 |
|
| 93 |
+
Stack 2.9 includes a pattern memory system that:
|
| 94 |
|
| 95 |
1. **Tracks Interactions** - Records every user-assistant exchange
|
| 96 |
2. **Learns Patterns** - Identifies frequently used tools
|
space/app.py
CHANGED
|
@@ -1,9 +1,9 @@
|
|
| 1 |
"""
|
| 2 |
-
Stack 2.9 -
|
| 3 |
HuggingFace Spaces Demo
|
| 4 |
|
| 5 |
A Gradio interface for Stack 2.9 powered by Qwen2.5-Coder-7B
|
| 6 |
-
with tool integration and
|
| 7 |
"""
|
| 8 |
|
| 9 |
import os
|
|
@@ -14,11 +14,11 @@ from typing import List, Dict, Optional
|
|
| 14 |
import gradio as gr
|
| 15 |
|
| 16 |
# ============================================================
|
| 17 |
-
#
|
| 18 |
# ============================================================
|
| 19 |
|
| 20 |
class SelfEvolutionMemory:
|
| 21 |
-
"""Simple in-memory
|
| 22 |
|
| 23 |
def __init__(self):
|
| 24 |
self.conversations = []
|
|
@@ -60,7 +60,7 @@ class SelfEvolutionMemory:
|
|
| 60 |
|
| 61 |
def get_context(self) -> str:
|
| 62 |
"""Get accumulated context for the model."""
|
| 63 |
-
context_parts = [f"##
|
| 64 |
|
| 65 |
if self.learned_patterns:
|
| 66 |
context_parts.append("\n### Tool Usage Patterns:")
|
|
@@ -236,7 +236,7 @@ class StackModel:
|
|
| 236 |
return "Model not loaded. Please wait for initialization."
|
| 237 |
|
| 238 |
# Build the prompt with system and tools
|
| 239 |
-
system_prompt = f"""You are Stack 2.9 - a
|
| 240 |
|
| 241 |
## Available Tools
|
| 242 |
{get_tool_descriptions()}
|
|
@@ -291,7 +291,7 @@ Now respond to the user:"""
|
|
| 291 |
yield "Model not loaded. Please wait for initialization."
|
| 292 |
return
|
| 293 |
|
| 294 |
-
system_prompt = f"""You are Stack 2.9 - a
|
| 295 |
|
| 296 |
## Available Tools
|
| 297 |
{get_tool_descriptions()}
|
|
@@ -447,7 +447,7 @@ def create_gradio_app():
|
|
| 447 |
"""Create the Gradio interface."""
|
| 448 |
|
| 449 |
with gr.Blocks(
|
| 450 |
-
title="Stack 2.9 -
|
| 451 |
theme=gr.themes.Soft(
|
| 452 |
primary_color="#6366f1",
|
| 453 |
secondary_color="#818cf8",
|
|
@@ -457,7 +457,7 @@ def create_gradio_app():
|
|
| 457 |
|
| 458 |
# Header
|
| 459 |
gr.Markdown("""
|
| 460 |
-
# π Stack 2.9 -
|
| 461 |
|
| 462 |
Powered by **Qwen2.5-Coder-7B** with 4-bit quantization
|
| 463 |
|
|
@@ -546,7 +546,7 @@ def create_gradio_app():
|
|
| 546 |
---
|
| 547 |
### About Stack 2.9
|
| 548 |
|
| 549 |
-
Stack 2.9 is a
|
| 550 |
- π Uses **Qwen2.5-Coder-7B** (4-bit, ~4GB VRAM)
|
| 551 |
- 🛠️ Integrates **7 tools** (file, git, web, search, shell)
|
| 552 |
- 🧠 Remembers interactions and learns patterns
|
|
@@ -572,7 +572,7 @@ if __name__ == "__main__":
|
|
| 572 |
args = parser.parse_args()
|
| 573 |
|
| 574 |
print("=" * 50)
|
| 575 |
-
print("π Stack 2.9 -
|
| 576 |
print("=" * 50)
|
| 577 |
print(f"Model: {args.model}")
|
| 578 |
print("Loading model...")
|
|
|
|
| 1 |
"""
|
| 2 |
+
Stack 2.9 - Pattern-Based AI Coding Assistant
|
| 3 |
HuggingFace Spaces Demo
|
| 4 |
|
| 5 |
A Gradio interface for Stack 2.9 powered by Qwen2.5-Coder-7B
|
| 6 |
+
with tool integration and pattern memory.
|
| 7 |
"""
|
| 8 |
|
| 9 |
import os
|
|
|
|
| 14 |
import gradio as gr
|
| 15 |
|
| 16 |
# ============================================================
|
| 17 |
+
# Pattern Memory System
|
| 18 |
# ============================================================
|
| 19 |
|
| 20 |
class SelfEvolutionMemory:
|
| 21 |
+
"""Simple in-memory pattern memory system for demo purposes."""
|
| 22 |
|
| 23 |
def __init__(self):
|
| 24 |
self.conversations = []
|
|
|
|
| 60 |
|
| 61 |
def get_context(self) -> str:
|
| 62 |
"""Get accumulated context for the model."""
|
| 63 |
+
context_parts = [f"## Pattern Memory ({self.interaction_count} interactions)"]
|
| 64 |
|
| 65 |
if self.learned_patterns:
|
| 66 |
context_parts.append("\n### Tool Usage Patterns:")
|
|
|
|
| 236 |
return "Model not loaded. Please wait for initialization."
|
| 237 |
|
| 238 |
# Build the prompt with system and tools
|
| 239 |
+
system_prompt = f"""You are Stack 2.9 - a pattern-based AI coding assistant.
|
| 240 |
|
| 241 |
## Available Tools
|
| 242 |
{get_tool_descriptions()}
|
|
|
|
| 291 |
yield "Model not loaded. Please wait for initialization."
|
| 292 |
return
|
| 293 |
|
| 294 |
+
system_prompt = f"""You are Stack 2.9 - a pattern-based AI coding assistant.
|
| 295 |
|
| 296 |
## Available Tools
|
| 297 |
{get_tool_descriptions()}
|
|
|
|
| 447 |
"""Create the Gradio interface."""
|
| 448 |
|
| 449 |
with gr.Blocks(
|
| 450 |
+
title="Stack 2.9 - Pattern-Based AI Coding Assistant",
|
| 451 |
theme=gr.themes.Soft(
|
| 452 |
primary_color="#6366f1",
|
| 453 |
secondary_color="#818cf8",
|
|
|
|
| 457 |
|
| 458 |
# Header
|
| 459 |
gr.Markdown("""
|
| 460 |
+
# π Stack 2.9 - Pattern-Based AI Coding Assistant
|
| 461 |
|
| 462 |
Powered by **Qwen2.5-Coder-7B** with 4-bit quantization
|
| 463 |
|
|
|
|
| 546 |
---
|
| 547 |
### About Stack 2.9
|
| 548 |
|
| 549 |
+
Stack 2.9 is a pattern-based AI coding assistant that:
|
| 550 |
- π Uses **Qwen2.5-Coder-7B** (4-bit, ~4GB VRAM)
|
| 551 |
- 🛠️ Integrates **7 tools** (file, git, web, search, shell)
|
| 552 |
- 🧠 Remembers interactions and learns patterns
|
|
|
|
| 572 |
args = parser.parse_args()
|
| 573 |
|
| 574 |
print("=" * 50)
|
| 575 |
+
print("π Stack 2.9 - Pattern-Based AI Coding Assistant")
|
| 576 |
print("=" * 50)
|
| 577 |
print(f"Model: {args.model}")
|
| 578 |
print("Loading model...")
|
stack-2.9-deploy/README.md
CHANGED
|
@@ -9,6 +9,25 @@ Turnkey deployment configurations for Stack 2.9 LLM inference server.
|
|
| 9 |
- For cloud: **runpodctl** or **vastai** CLI installed
|
| 10 |
- **chmod +x** may be required on shell scripts
|
| 11 |
|
| 12 |
## 🧪 Validate Setup
|
| 13 |
|
| 14 |
Before deploying, run the validation script to ensure everything is ready:
|
|
|
|
| 9 |
- For cloud: **runpodctl** or **vastai** CLI installed
|
| 10 |
- **chmod +x** may be required on shell scripts
|
| 11 |
|
| 12 |
+
## 🖥️ System Requirements

Stack 2.9 deployment requires appropriate hardware depending on model size:

| Configuration | Minimum | Recommended | Production |
|---------------|---------|-------------|------------|
| **GPU VRAM** | 8GB | 24GB | 40-80GB (A100/H100) |
| **RAM** | 16GB | 32GB | 64GB+ |
| **Disk** | 20GB free | 50GB free | 100GB+ (NVMe) |
| **CUDA** | 11.8 | 12.1 | 12.1+ |
| **Models** | 7B quantized | 32B quantized | 70B+ quantized |

**Notes:**
- CPU-only mode is possible but extremely slow (not recommended for production)
- AWQ/GPTQ quantization reduces VRAM requirements by ~50%
- Multi-GPU (tensor parallelism) supported via `TENSOR_PARALLEL_SIZE`
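
As a rough cross-check of these tiers, the memory needed for model weights alone can be estimated as parameter count times bits per weight. The sketch below is a back-of-the-envelope calculation under that assumption only; `weight_memory_gb` is a hypothetical helper, and real VRAM use is higher because of the KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare 16-bit weights with 4-bit quantized weights for the model sizes in the table
for size in (7, 32, 70):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B: 16-bit ~{fp16:.1f} GB, 4-bit ~{int4:.1f} GB")
```

Under these assumptions, 4-bit weights for 7B, 32B, and 70B models come to roughly 3.5 GB, 16 GB, and 35 GB, which leaves headroom within the 8GB, 24GB, and 40-80GB tiers respectively.
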
|
| 28 |
+
|
| 29 |
+
## 🧪 Validate Setup
|
| 30 |
+
|
| 31 |
## 🧪 Validate Setup
|
| 32 |
|
| 33 |
Before deploying, run the validation script to ensure everything is ready:
|
stack-2.9-docs/ARCHITECTURE.md
CHANGED
|
@@ -7,7 +7,7 @@ This document provides an in-depth look at Stack 2.9's technical architecture, s
|
|
| 7 |
- [System Overview](#system-overview)
|
| 8 |
- [System Components](#system-components)
|
| 9 |
- [Data Flow](#data-flow)
|
| 10 |
-
- [
|
| 11 |
- [Training Pipeline](#training-pipeline)
|
| 12 |
- [Tool System](#tool-system)
|
| 13 |
- [Memory System](#memory-system)
|
|
@@ -42,7 +42,7 @@ This document provides an in-depth look at Stack 2.9's technical architecture, s
|
|
| 42 |
β β β β β
|
| 43 |
β βΌ βΌ βΌ β
|
| 44 |
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
|
| 45 |
-
β β MODEL LAYER β β TOOL ENGINE β β
|
| 46 |
β β Qwen2.5-Coder β β 37 Tools β β Observe/Learn β β
|
| 47 |
β β 32B + LoRA β β Sandbox Exec β β Memory/Train β β
|
| 48 |
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
|
|
@@ -153,7 +153,7 @@ The orchestration layer coordinates the agent's activities:
|
|
| 153 |
- **Agent (agent.py)**: Main orchestration logic
|
| 154 |
- **Context Manager (context.py)**: Manages conversation context and truncation
|
| 155 |
- **Tool Coordinator**: Routes tool calls and manages execution
|
| 156 |
-
- **Memory Bridge**: Interfaces with the
|
| 157 |
|
| 158 |
### 4. Model Layer
|
| 159 |
|
|
@@ -258,7 +258,7 @@ MODEL_CONFIG = {
|
|
| 258 |
β β β β
|
| 259 |
β β β’ Format response (OpenAI-compatible) β β
|
| 260 |
β β β’ Stream chunks (if requested) β β
|
| 261 |
-
β β β’ Record to
|
| 262 |
β β β’ Update metrics β β
|
| 263 |
β β β β
|
| 264 |
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
|
@@ -314,13 +314,13 @@ MODEL_CONFIG = {
|
|
| 314 |
|
| 315 |
---
|
| 316 |
|
| 317 |
-
##
|
| 318 |
|
| 319 |
-
Stack 2.9's
|
| 320 |
|
| 321 |
```
|
| 322 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 323 |
-
β
|
| 324 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 325 |
β β
|
| 326 |
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
|
@@ -563,7 +563,7 @@ class PersistentMemory:
|
|
| 563 |
β β βββ Duration: 1-2 epochs β β
|
| 564 |
β β β β
|
| 565 |
β β Stage 3: LoRA Adapter Training β β
|
| 566 |
-
β β βββ
|
| 567 |
β β βββ Voice integration β β
|
| 568 |
β β βββ Duration: 1 epoch β β
|
| 569 |
β β β β
|
|
@@ -575,7 +575,7 @@ class PersistentMemory:
|
|
| 575 |
β β β β
|
| 576 |
β β β’ HumanEval, MBPP benchmarks β β
|
| 577 |
β β β’ Tool use accuracy β β
|
| 578 |
-
β β β’
|
| 579 |
β β β’ Quality regression testing β β
|
| 580 |
β β β β
|
| 581 |
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
|
@@ -992,7 +992,7 @@ METRICS = {
|
|
| 992 |
"tool_execution_time": Histogram,
|
| 993 |
"tool_errors": Counter,
|
| 994 |
|
| 995 |
-
#
|
| 996 |
"memories_created": Counter,
|
| 997 |
"patterns_extracted": Counter,
|
| 998 |
"improvements_applied": Counter,
|
|
|
|
| 7 |
- [System Overview](#system-overview)
|
| 8 |
- [System Components](#system-components)
|
| 9 |
- [Data Flow](#data-flow)
|
| 10 |
+
- [Pattern Memory System](#pattern-memory-system)
|
| 11 |
- [Training Pipeline](#training-pipeline)
|
| 12 |
- [Tool System](#tool-system)
|
| 13 |
- [Memory System](#memory-system)
|
|
|
|
| 42 |
β β β β β
|
| 43 |
β βΌ βΌ βΌ β
|
| 44 |
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
|
| 45 |
+
β β MODEL LAYER β β TOOL ENGINE β β PATTERN MEMORY β β
|
| 46 |
β β Qwen2.5-Coder β β 37 Tools β β Observe/Learn β β
|
| 47 |
β β 32B + LoRA β β Sandbox Exec β β Memory/Train β β
|
| 48 |
β ββββββββββββββββββββ ββββββββββββββββββββ ββββββββββββββββββββ β
|
|
|
|
| 153 |
- **Agent (agent.py)**: Main orchestration logic
|
| 154 |
- **Context Manager (context.py)**: Manages conversation context and truncation
|
| 155 |
- **Tool Coordinator**: Routes tool calls and manages execution
|
| 156 |
+
- **Memory Bridge**: Interfaces with the pattern memory system
|
| 157 |
|
| 158 |
### 4. Model Layer
|
| 159 |
|
|
|
|
| 258 |
β β β β
|
| 259 |
β β β’ Format response (OpenAI-compatible) β β
|
| 260 |
β β β’ Stream chunks (if requested) β β
|
| 261 |
+
β β β’ Record to pattern memory system β β
|
| 262 |
β β β’ Update metrics β β
|
| 263 |
β β β β
|
| 264 |
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
|
|
|
| 314 |
|
| 315 |
---
|
| 316 |
|
| 317 |
+
## Pattern Memory System
|
| 318 |
|
| 319 |
+
Stack 2.9's pattern memory system enables continuous improvement through experience:
|
| 320 |
|
| 321 |
```
|
| 322 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 323 |
+
β PATTERN MEMORY ARCHITECTURE β
|
| 324 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 325 |
β β
|
| 326 |
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
|
|
|
| 563 |
β β βββ Duration: 1-2 epochs β β
|
| 564 |
β β β β
|
| 565 |
β β Stage 3: LoRA Adapter Training β β
|
| 566 |
+
β β βββ Pattern Memory patterns β β
|
| 567 |
β β βββ Voice integration β β
|
| 568 |
β β βββ Duration: 1 epoch β β
|
| 569 |
β β β β
|
|
|
|
| 575 |
β β β β
|
| 576 |
β β β’ HumanEval, MBPP benchmarks β β
|
| 577 |
β β β’ Tool use accuracy β β
|
| 578 |
+
β β β’ Pattern Memory effectiveness β β
|
| 579 |
β β β’ Quality regression testing β β
|
| 580 |
β β β β
|
| 581 |
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
|
|
|
| 992 |
"tool_execution_time": Histogram,
|
| 993 |
"tool_errors": Counter,
|
| 994 |
|
| 995 |
+
# Pattern Memory metrics
|
| 996 |
"memories_created": Counter,
|
| 997 |
"patterns_extracted": Counter,
|
| 998 |
"improvements_applied": Counter,
|
stack-2.9-docs/BENCHMARKS.md
CHANGED
|
@@ -63,16 +63,23 @@ Measured on A100 80GB with vLLM + AWQ 4-bit:
|
|
| 63 |
|
| 64 |
## Model Performance Benchmarks
|
| 65 |
|
| 66 |
-
|
| 67 |
|
| 68 |
-
|
| 69 |
-
|-----------|-----------------------|-----------------------|-------------|----------------|
|
| 70 |
-
| HumanEval | 76.8% | 76.8% | 84.0% | 81.0% |
|
| 71 |
-
| MBPP | 82.3% | 82.3% | 88.0% | 85.0% |
|
| 72 |
-
| GSM8K | 89.2% | 89.2% | 92.0% | - |
|
| 73 |
-
| Tool Use | 94.1% | 94.1% | 91.0% | 88.0% |
|
| 74 |
|
| 75 |
-
|
| 76 |
|
| 77 |
### Voice-First Features
|
| 78 |
|
|
|
|
| 63 |
|
| 64 |
## Model Performance Benchmarks
|
| 65 |
|
| 66 |
+
⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../EVALUATION.md) for the audit report.

### Coding Benchmarks (Actual Baseline Expectations)

| Benchmark | Status | Notes |
|-----------|--------|-------|
| **HumanEval** | Pending | Full 164-problem evaluation in progress |
| **MBPP** | Pending | Full 500-problem evaluation in progress |
| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
| **GSM8K** | Not started | Math reasoning evaluation planned |
| **Context** | ✅ 128K | Token context window tested |

**Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
- HumanEval: ~70-72% Pass@1
- MBPP: ~75-77% Pass@1

Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
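
For reference, Pass@k figures like those above are commonly computed with the unbiased estimator introduced alongside HumanEval: generate `n` samples per problem, count the `c` that pass the tests, and compute `1 - C(n-c, k) / C(n, k)`. The sketch below illustrates that formula; the function names are hypothetical, and it is not confirmed to be what `run_proper_evaluation.py` implements.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given one (n, c) pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With `k = 1` this reduces to the fraction of samples that pass, averaged over problems, so a single-sample run over all 164 HumanEval problems gives Pass@1 directly.
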
|
| 83 |
|
| 84 |
### Voice-First Features
|
| 85 |
|
stack-2.9-docs/CONTRIBUTING.md
CHANGED
|
@@ -496,7 +496,7 @@ class TestProcessData:
|
|
| 496 |
| Integration Tests | `tests/integration/` | Test component interactions |
|
| 497 |
| API Tests | `tests/api/` | Test API endpoints |
|
| 498 |
| Tool Tests | `tests/tools/` | Test tool implementations |
|
| 499 |
-
|
|
| 500 |
|
| 501 |
### Running Tests
|
| 502 |
|
|
|
|
| 496 |
| Integration Tests | `tests/integration/` | Test component interactions |
|
| 497 |
| API Tests | `tests/api/` | Test API endpoints |
|
| 498 |
| Tool Tests | `tests/tools/` | Test tool implementations |
|
| 499 |
+
| Pattern Memory Tests | `tests/self_evolution/` | Test learning system |
|
| 500 |
|
| 501 |
### Running Tests
|
| 502 |
|
stack-2.9-docs/README.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# Stack 2.9 🤖
|
| 2 |
|
| 3 |
-
**Your
|
| 4 |
|
| 5 |
Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Coder-32B, fine-tuned with OpenClaw tool patterns. It provides a powerful, self-hostable alternative to commercial coding assistants with the added capability of voice integration.
|
| 6 |
|
|
@@ -35,12 +35,20 @@ Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Code
|
|
| 35 |
|
| 36 |
## π Benchmarks
|
| 37 |
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
|
| 41 |
-
|
|
| 42 |
-
| **
|
| 43 |
-
| **
|
| 44 |
|
| 45 |
## π Quick Start
|
| 46 |
|
|
@@ -121,7 +129,7 @@ curl -X POST http://localhost:3000/v1/chat/completions \
|
|
| 121 |
β β MODEL LAYER β β
|
| 122 |
β β βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ β β
|
| 123 |
β β β Qwen2.5-Coder-32B β β Fine-tuned on β β LoRA Adapter β β β
|
| 124 |
-
β β β (Base Model) β β OpenClaw Tools β β (
|
| 125 |
β β βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ β β
|
| 126 |
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
| 127 |
β β β
|
|
@@ -140,7 +148,7 @@ curl -X POST http://localhost:3000/v1/chat/completions \
|
|
| 140 |
β β β
|
| 141 |
β βΌ β
|
| 142 |
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
| 143 |
-
β β
|
| 144 |
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β β
|
| 145 |
β β β Observer ββββ Learner ββββ Memory ββββ Trainer β β β
|
| 146 |
β β β (Watches)β β(Analyzes)β β (Stores) β β(Improves)β β β
|
|
@@ -212,9 +220,9 @@ curl -X POST http://localhost:3000/v1/chat/completions \
|
|
| 212 |
| **Data Processing** | CSV, JSON, XML, database operations |
|
| 213 |
| **Voice** | speech-to-text, text-to-speech, voice cloning |
|
| 214 |
|
| 215 |
-
###
|
| 216 |
|
| 217 |
-
The
|
| 218 |
|
| 219 |
1. **Observe** - Watches problem-solving processes
|
| 220 |
2. **Learn** - Extracts patterns from successes and failures
|
|
@@ -240,7 +248,7 @@ The self-evolution system continuously improves Stack 2.9's performance:
|
|
| 240 |
| **Open Source** | ✅ Apache 2.0 | ❌ Closed | ❌ Closed | ✅ LGPL |
|
| 241 |
| **Tool Patterns** | ✅ OpenClaw | ✅ Yes | ❌ No | ❌ No |
|
| 242 |
| **Context Window** | 131K tokens | 200K tokens | 32K tokens | 100K tokens |
|
| 243 |
-
| **
|
| 244 |
| **Price** | Free | $20/month | $10/month | $12/month |
|
| 245 |
| **Self-Hosting** | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
|
| 246 |
| **Model Size** | 32B params | 200K+ params | 15B params | 100M params |
|
|
@@ -254,7 +262,7 @@ stack-2.9/
|
|
| 254 |
β βββ agent.py # Agent orchestration
|
| 255 |
β βββ context.py # Context management
|
| 256 |
β βββ tools.py # Tool implementations
|
| 257 |
-
βββ self_evolution/ #
|
| 258 |
β βββ observer.py # Behavior observation
|
| 259 |
β βββ learner.py # Pattern extraction
|
| 260 |
β βββ memory.py # Vector-based memory
|
|
@@ -272,11 +280,11 @@ stack-2.9/
|
|
| 272 |
βββ pyproject.toml # Project metadata
|
| 273 |
```
|
| 274 |
|
| 275 |
-
## π
|
| 276 |
|
| 277 |
```
|
| 278 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 279 |
-
β
|
| 280 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 281 |
β β
|
| 282 |
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
|
|
|
| 1 |
# Stack 2.9 🤖
|
| 2 |
|
| 3 |
+
**Your pattern-learning AI companion - gets smarter with every conversation.**
|
| 4 |
|
| 5 |
Stack 2.9 is an open-source voice-enabled coding assistant built on Qwen2.5-Coder-32B, fine-tuned with OpenClaw tool patterns. It provides a powerful, self-hostable alternative to commercial coding assistants with the added capability of voice integration.
|
| 6 |
|
|
|
|
| 35 |
|
| 36 |
## π Benchmarks
|
| 37 |
|
| 38 |
+
⚠️ **Evaluation Status**: The benchmark scores previously claimed (76.8% HumanEval, 82.3% MBPP, 94.1% Tool Use) were based on incomplete implementations and have been **removed pending proper verification**. See [EVALUATION.md](../../EVALUATION.md) for the audit report.

| Benchmark | Status | Notes |
|-----------|--------|-------|
| **HumanEval** | Pending | Full 164-problem evaluation in progress |
| **MBPP** | Pending | Full 500-problem evaluation in progress |
| **Tool Use** | Pending | Custom tool-calling benchmark to be created |
| **Context Window** | ✅ 131K tokens | Long context understanding tested |

**Expected Baseline** (Qwen2.5-Coder-32B, unquantized):
- HumanEval: ~70-72% Pass@1
- MBPP: ~75-77% Pass@1

Stack 2.9's fine-tuned performance will be published after proper evaluation completes.
|
| 52 |
|
| 53 |
## π Quick Start
|
| 54 |
|
|
|
|
| 129 |
β β MODEL LAYER β β
|
| 130 |
β β βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ β β
|
| 131 |
β β β Qwen2.5-Coder-32B β β Fine-tuned on β β LoRA Adapter β β β
|
| 132 |
+
β β β (Base Model) β β OpenClaw Tools β β (Pattern Memory) β β β
|
| 133 |
β β βββββββββββββββββββββ βββββββββββββββββββββ βββββββββββββββββββββ β β
|
| 134 |
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
| 135 |
β β β
|
|
|
|
| 148 |
β β β
|
| 149 |
β βΌ β
|
| 150 |
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
| 151 |
+
β β PATTERN MEMORY LAYER β β
|
| 152 |
β β ββββββββββββ ββββββββββββ ββββββββββββ ββββββββββββ β β
|
| 153 |
β β β Observer ββββ Learner ββββ Memory ββββ Trainer β β β
|
| 154 |
β β β (Watches)β β(Analyzes)β β (Stores) β β(Improves)β β β
|
|
|
|
| 220 |
| **Data Processing** | CSV, JSON, XML, database operations |
|
| 221 |
| **Voice** | speech-to-text, text-to-speech, voice cloning |
|
| 222 |
|
| 223 |
+
### Pattern Memory Capabilities
|
| 224 |
|
| 225 |
+
The pattern memory system continuously improves Stack 2.9's performance:
|
| 226 |
|
| 227 |
1. **Observe** - Watches problem-solving processes
|
| 228 |
2. **Learn** - Extracts patterns from successes and failures
|
|
|
|
| 248 |
| **Open Source** | ✅ Apache 2.0 | ❌ Closed | ❌ Closed | ✅ LGPL |
|
| 249 |
| **Tool Patterns** | ✅ OpenClaw | ✅ Yes | ❌ No | ❌ No |
|
| 250 |
| **Context Window** | 131K tokens | 200K tokens | 32K tokens | 100K tokens |
|
| 251 |
+
| **Pattern Memory** | ✅ Yes | ❌ No | ❌ No | ❌ No |
|
| 252 |
| **Price** | Free | $20/month | $10/month | $12/month |
|
| 253 |
| **Self-Hosting** | ✅ Yes | ❌ No | ❌ No | ✅ Yes |
|
| 254 |
| **Model Size** | 32B params | 200K+ params | 15B params | 100M params |
|
|
|
|
| 262 |
β βββ agent.py # Agent orchestration
|
| 263 |
β βββ context.py # Context management
|
| 264 |
β βββ tools.py # Tool implementations
|
| 265 |
+
βββ self_evolution/ # Pattern memory system
|
| 266 |
β βββ observer.py # Behavior observation
|
| 267 |
β βββ learner.py # Pattern extraction
|
| 268 |
β βββ memory.py # Vector-based memory
|
|
|
|
| 280 |
βββ pyproject.toml # Project metadata
|
| 281 |
```
|
| 282 |
|
| 283 |
+
## π Pattern Learning Process
|
| 284 |
|
| 285 |
```
|
| 286 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 287 |
+
β PATTERN LEARNING CYCLE β
|
| 288 |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
|
| 289 |
β β
|
| 290 |
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
|
stack-2.9-eval/human_eval.py
CHANGED
|
@@ -1,20 +1,17 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
-
HumanEval Benchmark Evaluation for Stack 2.9
|
| 4 |
=============================================
|
| 5 |
-
Evaluates code generation capabilities using the HumanEval benchmark.
|
| 6 |
|
| 7 |
-
|
| 8 |
-
- Pass@1: Fraction of problems solved with single generation (temperature=0.2)
|
| 9 |
-
- Pass@10: Fraction of problems solved with 10 generations (temperature=0.8)
|
| 10 |
-
- Pass@100: Fraction of problems solved with 100 generations (temperature=0.8)
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
- Pass@10/100: temperature=0.8, top_p=0.95 (creative)
|
| 15 |
|
| 16 |
-
|
| 17 |
-
|
|
|
|
|
|
|
| 18 |
"""
|
| 19 |
|
| 20 |
import argparse
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
+
HumanEval Benchmark Evaluation for Stack 2.9 [DEPRECATED]
|
| 4 |
=============================================
|
|
|
|
| 5 |
|
| 6 |
+
⚠️ WARNING: This evaluation script is DEPRECATED and produces INVALID results.
|
|
|
|
|
|
|
|
|
|
| 7 |
|
| 8 |
+
It only tests 20 out of 164 problems (12%) and returns hardcoded canonical
|
| 9 |
+
solutions instead of calling a real model. The results are therefore fraudulent.
|
|
|
|
| 10 |
|
| 11 |
+
USE THE PROPER EVALUATION INFRASTRUCTURE:
|
| 12 |
+
python stack-2.9-eval/run_proper_evaluation.py --benchmark humaneval --provider ollama --model qwen2.5-coder:32b
|
| 13 |
+
|
| 14 |
+
See EVALUATION.md for the full audit report.
|
| 15 |
"""
|
| 16 |
|
| 17 |
import argparse
|
stack-2.9-eval/mbpp_eval.py
CHANGED
|
@@ -1,19 +1,17 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
-
MBPP
|
| 4 |
-
===================================================
|
| 5 |
-
Evaluates code generation capabilities using the sanitized MBPP benchmark.
|
| 6 |
|
| 7 |
-
|
| 8 |
-
function calls to complex algorithms. This implementation uses the
|
| 9 |
-
sanitized version (MBPP-santized) with 500 test cases.
|
| 10 |
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
- Pass@10: Fraction solved with 10 generations
|
| 14 |
|
| 15 |
-
|
| 16 |
-
|
|
|
|
|
|
|
| 17 |
"""
|
| 18 |
|
| 19 |
import argparse
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
+
MBPP Benchmark Evaluation for Stack 2.9 [DEPRECATED]
|
| 4 |
+
===================================================
|
|
|
|
| 5 |
|
| 6 |
+
⚠️ WARNING: This evaluation script is DEPRECATED and produces INVALID results.
|
|
|
|
|
|
|
| 7 |
|
| 8 |
+
It only tests 20 out of 500 problems (4%) and returns hardcoded canonical
|
| 9 |
+
solutions instead of calling a real model. The scores are therefore fraudulent.
|
|
|
|
| 10 |
|
| 11 |
+
USE THE PROPER EVALUATION INFRASTRUCTURE:
|
| 12 |
+
python stack-2.9-eval/run_proper_evaluation.py --benchmark mbpp --provider ollama --model qwen2.5-coder:32b
|
| 13 |
+
|
| 14 |
+
See EVALUATION.md for the full audit report.
|
| 15 |
"""
|
| 16 |
|
| 17 |
import argparse
|
stack-2.9-eval/model_client.py
CHANGED
|
@@ -435,6 +435,139 @@ class AnthropicClient(BaseModelClient):
|
|
| 435 |
return self.model
|
| 436 |
|
| 437 |
|
| 438 |
def create_model_client(
|
| 439 |
provider: str = "ollama",
|
| 440 |
model: Optional[str] = None,
|
|
@@ -444,7 +577,7 @@ def create_model_client(
|
|
| 444 |
Factory function to create model client.
|
| 445 |
|
| 446 |
Args:
|
| 447 |
-
provider: One of "ollama", "openai", "anthropic"
|
| 448 |
model: Model name (defaults to provider's default)
|
| 449 |
**kwargs: Additional client configuration
|
| 450 |
|
|
@@ -460,8 +593,11 @@ def create_model_client(
|
|
| 460 |
elif provider == "anthropic":
|
| 461 |
default_model = model or os.environ.get("ANTHROPIC_MODEL", "claude-sonnet-4-20250514")
|
| 462 |
return AnthropicClient(model=default_model, **kwargs)
|
|
|
|
|
|
|
|
|
|
| 463 |
else:
|
| 464 |
-
raise ValueError(f"Unknown provider: {provider}. Use: ollama, openai, anthropic")
|
| 465 |
|
| 466 |
|
| 467 |
class ModelClientPool:
|
|
|
|
| 435 |
return self.model
|
| 436 |
|
| 437 |
|
| 438 |
+
class OpenRouterClient(BaseModelClient):
|
| 439 |
+
"""Client for OpenRouter API (unified interface for multiple models)."""
|
| 440 |
+
|
| 441 |
+
def __init__(
|
| 442 |
+
self,
|
| 443 |
+
model: str = "qwen/qwen2.5-coder-32b",
|
| 444 |
+
api_key: Optional[str] = None,
|
| 445 |
+
base_url: str = "https://openrouter.ai/api/v1",
|
| 446 |
+
timeout: int = 120,
|
| 447 |
+
http_referer: Optional[str] = None,
|
| 448 |
+
x_title: Optional[str] = None
|
| 449 |
+
):
|
| 450 |
+
self.model = model
|
| 451 |
+
self.api_key = api_key or os.environ.get("OPENROUTER_API_KEY", "")
|
| 452 |
+
self.base_url = base_url
|
| 453 |
+
self.timeout = timeout
|
| 454 |
+
self.http_referer = http_referer or os.environ.get("HTTP_REFERER", "")
|
| 455 |
+
self.x_title = x_title or os.environ.get("X_TITLE", "Stack 2.9")
|
| 456 |
+
|
| 457 |
+
if not self.api_key:
|
| 458 |
+
raise ValueError("OpenRouter API key required. Set OPENROUTER_API_KEY environment variable.")
|
| 459 |
+
|
| 460 |
+
def _get_client(self):
|
| 461 |
+
"""Get OpenAI-compatible client."""
|
| 462 |
+
try:
|
| 463 |
+
from openai import OpenAI
|
| 464 |
+
return OpenAI(api_key=self.api_key, base_url=self.base_url, timeout=self.timeout)
|
| 465 |
+
except ImportError:
|
| 466 |
+
raise ImportError("openai package required. Install with: pip install openai")
|
| 467 |
+
|
| 468 |
+
def generate(
|
| 469 |
+
self,
|
| 470 |
+
prompt: str,
|
| 471 |
+
temperature: float = 0.2,
|
| 472 |
+
max_tokens: int = 4096,
|
| 473 |
+
stop: Optional[List[str]] = None,
|
| 474 |
+
**kwargs
|
| 475 |
+
) -> GenerationResult:
|
| 476 |
+
"""Generate text using OpenRouter."""
|
| 477 |
+
client = self._get_client()
|
| 478 |
+
|
| 479 |
+
start_time = time.time()
|
| 480 |
+
|
| 481 |
+
try:
|
| 482 |
+
response = client.completions.create(
|
| 483 |
+
model=self.model,
|
| 484 |
+
prompt=prompt,
|
| 485 |
+
temperature=temperature,
|
| 486 |
+
max_tokens=max_tokens,
|
| 487 |
+
stop=stop,
|
| 488 |
+
**kwargs
|
| 489 |
+
)
|
| 490 |
+
|
| 491 |
+
duration = time.time() - start_time
|
| 492 |
+
|
| 493 |
+
result = GenerationResult(
|
| 494 |
+
text=response.choices[0].text,
|
| 495 |
+
model=self.model,
|
| 496 |
+
tokens=response.usage.completion_tokens,
|
| 497 |
+
duration=duration,
|
| 498 |
+
finish_reason=response.choices[0].finish_reason,
|
| 499 |
+
raw_response=response.model_dump()
|
| 500 |
+
)
|
| 501 |
+
|
| 502 |
+
return result
|
| 503 |
+
except Exception as e:
|
| 504 |
+
logger.error(f"OpenRouter request failed: {e}")
|
| 505 |
+
raise
|
| 506 |
+
|
| 507 |
+
def chat(
|
| 508 |
+
self,
|
| 509 |
+
messages: List[ChatMessage],
|
| 510 |
+
temperature: float = 0.2,
|
| 511 |
+
max_tokens: int = 4096,
|
| 512 |
+
tools: Optional[List[Dict]] = None,
|
| 513 |
+
**kwargs
|
| 514 |
+
) -> GenerationResult:
|
| 515 |
+
"""Generate chat response using OpenRouter."""
|
| 516 |
+
client = self._get_client()
|
| 517 |
+
|
| 518 |
+
# Convert messages to chat format
|
| 519 |
+
chat_messages = [{"role": m.role, "content": m.content} for m in messages]
|
| 520 |
+
|
| 521 |
+
request_params = {
|
| 522 |
+
"model": self.model,
|
| 523 |
+
"messages": chat_messages,
|
| 524 |
+
"temperature": temperature,
|
| 525 |
+
"max_tokens": max_tokens,
|
| 526 |
+
}
|
| 527 |
+
|
| 528 |
+
if tools:
|
| 529 |
+
request_params["tools"] = tools
|
| 530 |
+
|
| 531 |
+
request_params.update(kwargs)
|
| 532 |
+
|
| 533 |
+
# Add OpenRouter-specific headers
|
| 534 |
+
extra_headers = {}
|
| 535 |
+
if self.http_referer:
|
| 536 |
+
extra_headers["HTTP-Referer"] = self.http_referer
|
| 537 |
+
if self.x_title:
|
| 538 |
+
extra_headers["X-Title"] = self.x_title
|
| 539 |
+
|
| 540 |
+
start_time = time.time()
|
| 541 |
+
|
| 542 |
+
try:
|
| 543 |
+
response = client.chat.completions.create(
|
| 544 |
+
extra_headers=extra_headers if extra_headers else None,
|
| 545 |
+
**request_params
|
| 546 |
+
)
|
| 547 |
+
|
| 548 |
+
duration = time.time() - start_time
|
| 549 |
+
|
| 550 |
+
msg = response.choices[0].message
|
| 551 |
+
text = msg.content or ""
|
| 552 |
+
|
| 553 |
+
result = GenerationResult(
|
| 554 |
+
text=text,
|
| 555 |
+
model=self.model,
|
| 556 |
+
tokens=response.usage.completion_tokens,
|
| 557 |
+
duration=duration,
|
| 558 |
+
finish_reason=response.choices[0].finish_reason,
|
| 559 |
+
raw_response=response.model_dump()
|
| 560 |
+
)
|
| 561 |
+
|
| 562 |
+
return result
|
| 563 |
+
except Exception as e:
|
| 564 |
+
logger.error(f"OpenRouter chat request failed: {e}")
|
| 565 |
+
raise
|
| 566 |
+
|
| 567 |
+
def get_model_name(self) -> str:
|
| 568 |
+
return self.model
|
| 569 |
+
|
| 570 |
+
|
| 571 |
def create_model_client(
|
| 572 |
provider: str = "ollama",
|
| 573 |
model: Optional[str] = None,
|
|
|
|
| 577 |
Factory function to create model client.
|
| 578 |
|
| 579 |
Args:
|
| 580 |
+
provider: One of "ollama", "openai", "anthropic", "openrouter"
|
| 581 |
model: Model name (defaults to provider's default)
|
| 582 |
**kwargs: Additional client configuration
|
| 583 |
|
|
|
|
| 593 |
elif provider == "anthropic":
|
| 594 |
default_model = model or os.environ.get("ANTHROPIC_MODEL", "claude-sonnet-4-20250514")
|
| 595 |
return AnthropicClient(model=default_model, **kwargs)
|
| 596 |
+
elif provider == "openrouter":
|
| 597 |
+
default_model = model or os.environ.get("OPENROUTER_MODEL", "qwen/qwen2.5-coder-32b")
|
| 598 |
+
return OpenRouterClient(model=default_model, **kwargs)
|
| 599 |
else:
|
| 600 |
+
raise ValueError(f"Unknown provider: {provider}. Use: ollama, openai, anthropic, openrouter")
|
| 601 |
|
| 602 |
|
| 603 |
class ModelClientPool:
|
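The factory change above adds an `openrouter` branch that resolves the model name in a fixed order: explicit `model` argument, then the `OPENROUTER_MODEL` environment variable, then the hard-coded default. A minimal sketch of that resolution order follows. `StubClient`, `DEFAULTS`, and `create_stub_client` are hypothetical names used for illustration only; they stand in for the real client classes so the example runs without network access or API keys. Only the two providers whose defaults are visible in the diff are included.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class StubClient:
    """Hypothetical stand-in for AnthropicClient / OpenRouterClient."""
    provider: str
    model: str


# Provider -> (environment variable with a model override, fallback model).
# Both entries are copied from the diff; other providers are omitted because
# their defaults are not shown there.
DEFAULTS = {
    "anthropic": ("ANTHROPIC_MODEL", "claude-sonnet-4-20250514"),
    "openrouter": ("OPENROUTER_MODEL", "qwen/qwen2.5-coder-32b"),
}


def create_stub_client(
    provider: str,
    model: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> StubClient:
    """Resolve the model as: explicit argument, then env var, then default."""
    env = os.environ if env is None else env
    if provider not in DEFAULTS:
        known = ", ".join(sorted(DEFAULTS))
        raise ValueError(f"Unknown provider: {provider}. Use: {known}")
    env_var, fallback = DEFAULTS[provider]
    return StubClient(provider=provider, model=model or env.get(env_var, fallback))


if __name__ == "__main__":
    print(create_stub_client("openrouter", env={}))
```

Passing `env` explicitly is a design choice made for this sketch so the lookup can be exercised without mutating the process environment; the code in the diff reads `os.environ` directly.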
stack-2.9-eval/tool_use_eval.py
CHANGED
|
@@ -1,22 +1,18 @@
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
-
Tool Use Evaluation for Stack 2.9
|
| 4 |
-
===================================
|
| 5 |
-
Evaluates tool calling capabilities across 500+ test cases covering:
|
| 6 |
-
- File operations (read, write, edit, glob)
|
| 7 |
-
- Git operations (status, commit, push, branch)
|
| 8 |
-
- Search operations (grep, web search)
|
| 9 |
-
- Execution operations (bash, shell commands)
|
| 10 |
-
- System operations (task management, config)
|
| 11 |
|
| 12 |
-
|
| 13 |
-
- Tool Selection Accuracy: Correct tool chosen for task
|
| 14 |
-
- Parameter Accuracy: Correct parameters provided
|
| 15 |
-
- Execution Success Rate: Task completed successfully
|
| 16 |
-
- Overall Success Rate: Combined metric
|
| 17 |
|
| 18 |
-
|
| 19 |
-
|
| 20 |
"""
|
| 21 |
|
| 22 |
import argparse
|
|
|
|
| 1 |
#!/usr/bin/env python3
|
| 2 |
"""
|
| 3 |
+
Tool Use Evaluation for Stack 2.9 [DEPRECATED]
|
| 4 |
+
==============================================
|
| 5 |
|
| 6 |
+
⚠️ WARNING: This evaluation script is DEPRECATED and the methodology is INVALID.
|
| 7 |
|
| 8 |
+
This evaluator uses a naive keyword-matching simulation, not actual model inference.
|
| 9 |
+
There is no proper benchmark implementation for tool calling. The claimed 94.1%
|
| 10 |
+
score is unverifiable and misleading.
|
| 11 |
+
|
| 12 |
+
A proper tool use benchmark needs to be built with 500+ realistic test cases and
|
| 13 |
+
actual model calls. This script remains only as a placeholder.
|
| 14 |
+
|
| 15 |
+
See EVALUATION.md for the full audit report.
|
| 16 |
"""
|
| 17 |
|
| 18 |
import argparse
|
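The new docstring states that keyword matching is not a valid methodology and that a proper benchmark needs actual model calls. As a rough illustration of what scoring real model output could look like, the sketch below compares a parsed tool call against an expected tool name and argument set, and reports two of the metrics named in the removed docstring (tool selection accuracy and parameter accuracy). Everything here is an assumption for illustration: `ToolCase`, `score_tool_calls`, and the `{"name": ..., "arguments": ...}` JSON shape are hypothetical and are not taken from the repository.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ToolCase:
    """One test case; model_output is raw text from a real model call."""
    expected_tool: str
    expected_args: Dict[str, Any]
    model_output: str


def score_tool_calls(cases: List[ToolCase]) -> Dict[str, float]:
    """Score parsed tool calls instead of matching keywords in the prompt."""
    tool_hits = 0
    arg_hits = 0
    for case in cases:
        try:
            call = json.loads(case.model_output)
        except json.JSONDecodeError:
            continue  # unparseable output is a miss on both metrics
        if not isinstance(call, dict):
            continue
        if call.get("name") == case.expected_tool:
            tool_hits += 1
            # Arguments only count when the right tool was selected.
            if call.get("arguments") == case.expected_args:
                arg_hits += 1
    total = len(cases)
    return {
        "tool_selection_accuracy": tool_hits / total if total else 0.0,
        "parameter_accuracy": arg_hits / total if total else 0.0,
    }
```

Exact-match comparison of arguments is the simplest possible rule; a real benchmark would likely need per-tool normalization (path canonicalization, optional parameters) before comparing.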
stack_cli/cli.py
CHANGED
|
@@ -509,7 +509,7 @@ Examples:
|
|
| 509 |
parser.add_argument(
|
| 510 |
'--patterns',
|
| 511 |
choices=['list', 'stats', 'clear'],
|
| 512 |
-
help="Manage
|
| 513 |
)
|
| 514 |
|
| 515 |
# Training
|
|
|
|
| 509 |
parser.add_argument(
|
| 510 |
'--patterns',
|
| 511 |
choices=['list', 'stats', 'clear'],
|
| 512 |
+
help="Manage learned patterns"
|
| 513 |
)
|
| 514 |
|
| 515 |
# Training
|
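For reference, the `--patterns` option touched in this hunk can be exercised in isolation. The sketch below rebuilds only that one argument; `build_parser` and the `prog` name are hypothetical, and the real CLI defines many more options that are not shown in the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Minimal parser containing only the --patterns option from the diff."""
    parser = argparse.ArgumentParser(prog="stack")  # prog name is an assumption
    parser.add_argument(
        '--patterns',
        choices=['list', 'stats', 'clear'],
        help="Manage learned patterns",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(['--patterns', 'stats'])
    print(args.patterns)
```

Because `choices` is set, argparse rejects any other value and exits with a usage error before application code runs.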
website/benchmark.html
CHANGED
|
@@ -42,15 +42,15 @@
|
|
| 42 |
<p class="subtitle">Stack 2.9 vs Leading AI Models</p>
|
| 43 |
<div class="benchmark-summary">
|
| 44 |
<div class="summary-card">
|
| 45 |
-
<div class="summary-value">
|
| 46 |
<div class="summary-label">HumanEval</div>
|
| 47 |
</div>
|
| 48 |
<div class="summary-card">
|
| 49 |
-
<div class="summary-value">
|
| 50 |
<div class="summary-label">MBPP</div>
|
| 51 |
</div>
|
| 52 |
<div class="summary-card highlight">
|
| 53 |
-
<div class="summary-value">
|
| 54 |
<div class="summary-label">Tool Use</div>
|
| 55 |
</div>
|
| 56 |
<div class="summary-card">
|
|
@@ -114,10 +114,10 @@
|
|
| 114 |
<tbody>
|
| 115 |
<tr class="highlight-row">
|
| 116 |
<td><strong>Stack 2.9</strong></td>
|
| 117 |
-
<td>
|
| 118 |
-
<td>
|
| 119 |
-
<td>
|
| 120 |
-
<td class="best">
|
| 121 |
<td>32B</td>
|
| 122 |
</tr>
|
| 123 |
<tr>
|
|
@@ -303,7 +303,7 @@
|
|
| 303 |
<div class="footer-brand">
|
| 304 |
<span class="logo-icon">π€</span>
|
| 305 |
<span>Stack 2.9</span>
|
| 306 |
-
<p>Your
|
| 307 |
</div>
|
| 308 |
<div class="footer-links">
|
| 309 |
<a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
|
|
@@ -335,8 +335,8 @@
|
|
| 335 |
labels: ['HumanEval', 'MBPP', 'SWE-bench', 'Tool Use'],
|
| 336 |
datasets: [
|
| 337 |
{
|
| 338 |
-
label: 'Stack 2.9',
|
| 339 |
-
data: [
|
| 340 |
backgroundColor: '#6366f1',
|
| 341 |
borderRadius: 8,
|
| 342 |
},
|
|
@@ -404,8 +404,8 @@
|
|
| 404 |
labels: ['Base', '10 convos', '50 convos', '100 convos', '200 convos', '500 convos'],
|
| 405 |
datasets: [
|
| 406 |
{
|
| 407 |
-
label: 'Stack 2.9',
|
| 408 |
-
data: [
|
| 409 |
borderColor: '#6366f1',
|
| 410 |
backgroundColor: 'rgba(99, 102, 241, 0.1)',
|
| 411 |
fill: true,
|
|
|
|
| 42 |
<p class="subtitle">Stack 2.9 vs Leading AI Models</p>
|
| 43 |
<div class="benchmark-summary">
|
| 44 |
<div class="summary-card">
|
| 45 |
+
<div class="summary-value">TBD</div>
|
| 46 |
<div class="summary-label">HumanEval</div>
|
| 47 |
</div>
|
| 48 |
<div class="summary-card">
|
| 49 |
+
<div class="summary-value">TBD</div>
|
| 50 |
<div class="summary-label">MBPP</div>
|
| 51 |
</div>
|
| 52 |
<div class="summary-card highlight">
|
| 53 |
+
<div class="summary-value">TBD</div>
|
| 54 |
<div class="summary-label">Tool Use</div>
|
| 55 |
</div>
|
| 56 |
<div class="summary-card">
|
|
|
|
| 114 |
<tbody>
|
| 115 |
<tr class="highlight-row">
|
| 116 |
<td><strong>Stack 2.9</strong></td>
|
| 117 |
+
<td>TBD</td>
|
| 118 |
+
<td>TBD</td>
|
| 119 |
+
<td>TBD</td>
|
| 120 |
+
<td class="best">TBD</td>
|
| 121 |
<td>32B</td>
|
| 122 |
</tr>
|
| 123 |
<tr>
|
|
|
|
| 303 |
<div class="footer-brand">
|
| 304 |
<span class="logo-icon">π€</span>
|
| 305 |
<span>Stack 2.9</span>
|
| 306 |
+
<p>Your pattern-learning AI companion</p>
|
| 307 |
</div>
|
| 308 |
<div class="footer-links">
|
| 309 |
<a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
|
|
|
|
| 335 |
labels: ['HumanEval', 'MBPP', 'SWE-bench', 'Tool Use'],
|
| 336 |
datasets: [
|
| 337 |
{
|
| 338 |
+
label: 'Stack 2.9 (pending verification)',
|
| 339 |
+
data: [0, 0, 0, 0],
|
| 340 |
backgroundColor: '#6366f1',
|
| 341 |
borderRadius: 8,
|
| 342 |
},
|
|
|
|
| 404 |
labels: ['Base', '10 convos', '50 convos', '100 convos', '200 convos', '500 convos'],
|
| 405 |
datasets: [
|
| 406 |
{
|
| 407 |
+
label: 'Stack 2.9 (evaluation pending)',
|
| 408 |
+
data: [null, null, null, null, null, null],
|
| 409 |
borderColor: '#6366f1',
|
| 410 |
backgroundColor: 'rgba(99, 102, 241, 0.1)',
|
| 411 |
fill: true,
|
website/index.html
CHANGED
|
@@ -3,7 +3,7 @@
|
|
| 3 |
<head>
|
| 4 |
<meta charset="UTF-8">
|
| 5 |
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 6 |
-
<title>Stack 2.9 β Your
|
| 7 |
<link rel="stylesheet" href="styles.css">
|
| 8 |
<link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><text y='.9em' font-size='90'>π€</text></svg>">
|
| 9 |
<meta name="description" content="Stack 2.9 - Open-source AI that learns, adapts, and improves itself over time. Built on Qwen2.5-Coder-32B.">
|
|
@@ -83,7 +83,7 @@
|
|
| 83 |
<div class="features-grid">
|
| 84 |
<div class="feature-card">
|
| 85 |
<div class="feature-icon">🧠</div>
|
| 86 |
-
<h3>
|
| 87 |
<p>Learns from every conversation and task. Improves its own capabilities through experience. Gets smarter the more you use it.</p>
|
| 88 |
</div>
|
| 89 |
<div class="feature-card">
|
|
@@ -121,24 +121,24 @@
|
|
| 121 |
<p class="section-subtitle">Competitive results on standard coding benchmarks</p>
|
| 122 |
<div class="benchmark-grid">
|
| 123 |
<div class="benchmark-card">
|
| 124 |
-
<div class="benchmark-value">
|
| 125 |
<div class="benchmark-label">HumanEval</div>
|
| 126 |
<div class="benchmark-bar">
|
| 127 |
-
<div class="benchmark-fill" style="width:
|
| 128 |
</div>
|
| 129 |
</div>
|
| 130 |
<div class="benchmark-card">
|
| 131 |
-
<div class="benchmark-value">
|
| 132 |
<div class="benchmark-label">MBPP</div>
|
| 133 |
<div class="benchmark-bar">
|
| 134 |
-
<div class="benchmark-fill" style="width:
|
| 135 |
</div>
|
| 136 |
</div>
|
| 137 |
<div class="benchmark-card highlight">
|
| 138 |
-
<div class="benchmark-value">
|
| 139 |
<div class="benchmark-label">Tool Use</div>
|
| 140 |
<div class="benchmark-bar">
|
| 141 |
-
<div class="benchmark-fill" style="width:
|
| 142 |
</div>
|
| 143 |
</div>
|
| 144 |
<div class="benchmark-card">
|
|
@@ -184,7 +184,7 @@
|
|
| 184 |
|
| 185 |
<section class="how-it-works">
|
| 186 |
<div class="container">
|
| 187 |
-
<h2 class="section-title">How
|
| 188 |
<div class="steps">
|
| 189 |
<div class="step">
|
| 190 |
<div class="step-number">1</div>
|
|
@@ -212,7 +212,7 @@
|
|
| 212 |
<div class="step-arrow">β</div>
|
| 213 |
<div class="step">
|
| 214 |
<div class="step-number">5</div>
|
| 215 |
-
<h3>
|
| 216 |
<p>Gradually becomes smarter</p>
|
| 217 |
</div>
|
| 218 |
</div>
|
|
@@ -257,7 +257,7 @@
|
|
| 257 |
<div class="footer-brand">
|
| 258 |
<span class="logo-icon">π€</span>
|
| 259 |
<span>Stack 2.9</span>
|
| 260 |
-
<p>Your
|
| 261 |
</div>
|
| 262 |
<div class="footer-links">
|
| 263 |
<a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
|
|
|
|
| 3 |
<head>
|
| 4 |
<meta charset="UTF-8">
|
| 5 |
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 6 |
+
<title>Stack 2.9 β Your Pattern-Learning AI Companion</title>
|
| 7 |
<link rel="stylesheet" href="styles.css">
|
| 8 |
<link rel="icon" href="data:image/svg+xml,<svg xmlns='http://www.w3.org/2000/svg' viewBox='0 0 100 100'><text y='.9em' font-size='90'>π€</text></svg>">
|
| 9 |
<meta name="description" content="Stack 2.9 - Open-source AI that learns, adapts, and improves itself over time. Built on Qwen2.5-Coder-32B.">
|
|
|
|
| 83 |
<div class="features-grid">
|
| 84 |
<div class="feature-card">
|
| 85 |
<div class="feature-icon">🧠</div>
|
| 86 |
+
<h3>Pattern Learning</h3>
|
| 87 |
<p>Learns from every conversation and task. Improves its own capabilities through experience. Gets smarter the more you use it.</p>
|
| 88 |
</div>
|
| 89 |
<div class="feature-card">
|
|
|
|
| 121 |
<p class="section-subtitle">Competitive results on standard coding benchmarks</p>
|
| 122 |
<div class="benchmark-grid">
|
| 123 |
<div class="benchmark-card">
|
| 124 |
+
<div class="benchmark-value">TBD</div>
|
| 125 |
<div class="benchmark-label">HumanEval</div>
|
| 126 |
<div class="benchmark-bar">
|
| 127 |
+
<div class="benchmark-fill" style="width: 0%"></div>
|
| 128 |
</div>
|
| 129 |
</div>
|
| 130 |
<div class="benchmark-card">
|
| 131 |
+
<div class="benchmark-value">TBD</div>
|
| 132 |
<div class="benchmark-label">MBPP</div>
|
| 133 |
<div class="benchmark-bar">
|
| 134 |
+
<div class="benchmark-fill" style="width: 0%"></div>
|
| 135 |
</div>
|
| 136 |
</div>
|
| 137 |
<div class="benchmark-card highlight">
|
| 138 |
+
<div class="benchmark-value">TBD</div>
|
| 139 |
<div class="benchmark-label">Tool Use</div>
|
| 140 |
<div class="benchmark-bar">
|
| 141 |
+
<div class="benchmark-fill" style="width: 0%"></div>
|
| 142 |
</div>
|
| 143 |
</div>
|
| 144 |
<div class="benchmark-card">
|
|
|
|
| 184 |
|
| 185 |
<section class="how-it-works">
|
| 186 |
<div class="container">
|
| 187 |
+
<h2 class="section-title">How Pattern Learning Works</h2>
|
| 188 |
<div class="steps">
|
| 189 |
<div class="step">
|
| 190 |
<div class="step-number">1</div>
|
|
|
|
| 212 |
<div class="step-arrow">β</div>
|
| 213 |
<div class="step">
|
| 214 |
<div class="step-number">5</div>
|
| 215 |
+
<h3>Improve</h3>
|
| 216 |
<p>Gradually becomes smarter</p>
|
| 217 |
</div>
|
| 218 |
</div>
|
|
|
|
| 257 |
<div class="footer-brand">
|
| 258 |
<span class="logo-icon">π€</span>
|
| 259 |
<span>Stack 2.9</span>
|
| 260 |
+
<p>Your pattern-learning AI companion</p>
|
| 261 |
</div>
|
| 262 |
<div class="footer-links">
|
| 263 |
<a href="https://github.com/my-ai-stack/stack-2.9" target="_blank">GitHub</a>
|