# Stack 2.9 Roadmap

> Last updated: April 2026

## Current Status

### What's Working ✅

- **Basic code generation** - The model can generate Python, JavaScript, and code in other languages from natural-language prompts
- **CLI interface** - Working command-line interface (`stack.py`, `src/cli/`)
- **Multi-provider support** - Ollama, OpenAI, Anthropic, OpenRouter, Together AI integrations
- **46 built-in tools** - File operations, git, shell, web search, memory, task planning
- **Pattern Memory infrastructure** - Observer, Learner, Memory components implemented
- **Training pipeline** - LoRA fine-tuning scripts, data preparation, model merging
- **Deployment options** - Docker, RunPod, Vast.ai, Kubernetes, HuggingFace Spaces
- **128K context window** - Extended from base model's 32K

### What's Broken or Missing ⚠️

- **Tool calling not trained** - Model doesn't reliably use tools; needs fine-tuning on tool patterns
- **Benchmark scores unverifiable** - Previous claims removed after audit found only 20/164 HumanEval problems tested
- **Self-evolution not functional** - Observer/Learner components exist but not connected to training pipeline
- **Voice integration incomplete** - Coqui XTTS integration present but not tested
- **Evaluation infrastructure in progress** - A proper evaluation framework has been built but not yet run on the full benchmarks

### What Needs Testing 🔧

- Full HumanEval (164 problems) evaluation
- Full MBPP (500 problems) evaluation
- Tool-calling accuracy with real tasks
- Pattern Memory retrieval and effectiveness
- Voice input/output pipeline
- Multi-provider compatibility
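
Running the full benchmarks boils down to executing each generated solution against its unit tests. As a rough sketch of what one per-problem check could look like (this is a hypothetical helper, not the project's actual evaluation harness; `run_candidate`, `check`, and the sample strings are illustrative):

```python
# Hypothetical sketch: score one candidate solution against its unit tests.
# HumanEval-style benchmarks ship a test that defines check(candidate).
# This is NOT the project's harness; names are illustrative.

def run_candidate(candidate_src: str, test_src: str, entry_point: str) -> bool:
    """Return True if the candidate function passes its tests, else False."""
    env: dict = {}
    try:
        exec(candidate_src, env)        # define the candidate function
        exec(test_src, env)             # define check(...) from the benchmark
        env["check"](env[entry_point])  # raises AssertionError on failure
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
tests = "def check(f):\n    assert f(2, 3) == 5\n    assert f(-1, 1) == 0\n"
print(run_candidate(candidate, tests, "add"))  # True for this passing example
```

A real harness would additionally sandbox execution and enforce per-problem timeouts, since benchmark completions can contain arbitrary (and occasionally non-terminating) code.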

### What Needs Documentation 📚

- Tool definitions and schemas
- API reference (internal/ARCHITECTURE.md exists but needs updating)
- Pattern Memory usage guide
- Deployment troubleshooting
- Evaluation methodology

---

## Timeline with Milestones

### Short-Term (1-2 Weeks)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **S1.1** | Run full HumanEval (164 problems) with proper inference | Not started |
| **S1.2** | Run full MBPP (500 problems) with proper inference | Not started |
| **S1.3** | Document all 46 tool definitions in `docs/TOOLS.md` | In progress |
| **S1.4** | Fix evaluation scripts to use real model inference | Needed |
| **S1.5** | Create minimal reproducible test for tool calling | Not started |

**Owner:** Community contribution welcome
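
For S1.5, a minimal reproducible test could parse the model's raw output and assert that it contains a well-formed tool call. A sketch under stated assumptions (the JSON shape with `"tool"` and `"args"` keys is an assumed format, not necessarily Stack 2.9's actual wire format, and `extract_tool_call` is a hypothetical helper):

```python
import json

# Hypothetical sketch of a minimal tool-calling check. The tool-call shape
# (a JSON object with "tool" and "args" keys) is an assumption for
# illustration, not the project's confirmed format.

def extract_tool_call(model_output: str):
    """Return the parsed tool call if the output is valid JSON with the
    expected keys, otherwise None."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "tool" in call and "args" in call:
        return call
    return None

# A reproducible test would prompt the model with a fixed task such as
# "List files in the current directory" and assert on the parsed output:
sample = '{"tool": "list_files", "args": {"path": "."}}'
call = extract_tool_call(sample)
assert call is not None and call["tool"] == "list_files"
```

Checks like this make "the model doesn't reliably use tools" measurable: run the same prompt N times and report how often a valid call comes back.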

### Medium-Term (1-3 Months)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **M2.1** | Fine-tune model on tool-calling patterns (RTMP data) | Not started |
| **M2.2** | Implement and test self-evolution loop (Observer → Learner → Memory → Trainer) | Not started |
| **M2.3** | Run full benchmark evaluation and publish verified scores | Not started |
| **M2.4** | Add MCP server support for external tool integration | Partial |
| **M2.5** | Voice integration end-to-end testing | Not started |
| **M2.6** | Implement pattern extraction from production usage | Not started |

**Owner:** Requires training compute budget or community contribution

### Long-Term (6+ Months)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **L3.1** | RLHF training for improved tool selection | Future |
| **L3.2** | Team sync infrastructure (PostgreSQL + FastAPI) | Designed, not implemented |
| **L3.3** | Federated learning for privacy-preserving updates | Future |
| **L3.4** | Multi-modal support (images → code) | Future |
| **L3.5** | Real-time voice-to-voice conversation | Future |

**Owner:** Long-term vision, needs significant resources

---

## How to Contribute

### By Priority

1. **Run evaluations** - Help us verify benchmark scores by running `python stack_2_9_eval/run_proper_evaluation.py`
2. **Test tool calling** - Try the model with various tools and report what works/doesn't
3. **Documentation** - Improve docs, especially tool definitions and API reference
4. **Bug reports** - Open issues with reproduction steps
5. **Code contributions** - See CONTRIBUTING.md for guidelines

### Contribution Areas

| Area | Skill Needed | Priority |
|------|--------------|----------|
| Evaluation | Python, ML benchmarking | High |
| Tool calling tests | Python, CLI usage | High |
| Documentation | Technical writing | Medium |
| Training scripts | PyTorch, PEFT | Medium |
| Deployment | Docker, K8s, Cloud | Low |
| Pattern Memory | Vector databases, ML | Low |

### Quick Wins for Contributors

- Run `python stack.py -c "List files in current directory"` and report if tools work
- Review `stack/eval/results/` and verify evaluation logs
- Check `docs/TOOLS.md` accuracy against actual tool implementations
- Test with different providers (`--provider ollama|openai|anthropic`)

---

## Technical Notes

### Known Limitations

1. **Tool calling is not trained** - The base model has tool capabilities but Stack 2.9 hasn't been fine-tuned to use them reliably
2. **Pattern Memory is read-only** - The system stores patterns but doesn't automatically retrain on them yet
3. **Evaluation uses stub data** - Some eval scripts return pre-canned answers instead of running the model
4. **Voice integration untested** - Code exists but hasn't been validated end-to-end

### Next Training Run Requirements

To fix tool calling, the next training run needs:

- Dataset: `data/rtmp-tools/combined_tools.jsonl` (already generated)
- Compute: ~1 hour on A100 for LoRA fine-tuning
- Configuration: Target tool_call logits, use `tool_use_examples.jsonl`
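
As a rough sketch of the kind of configuration such a run might use (all hyperparameter values below are illustrative assumptions, not the project's actual settings; with Hugging Face PEFT, the `"lora"` section would map onto a `LoraConfig`):

```python
# Illustrative LoRA fine-tuning configuration for the tool-calling run.
# Every value here is an assumption for sketching purposes; the real run
# should take its hyperparameters from the project's training scripts.

lora_run = {
    "dataset": "data/rtmp-tools/combined_tools.jsonl",  # already generated
    "extra_data": "tool_use_examples.jsonl",
    "lora": {
        "r": 16,                  # adapter rank
        "alpha": 32,              # scaling factor
        "dropout": 0.05,
        "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    },
    "training": {
        "epochs": 1,              # sized for ~1 hour on a single A100
        "batch_size": 4,
        "learning_rate": 2e-4,
    },
}
```

Keeping the run description in a single structure like this makes it easy to log alongside the resulting adapter, so a verified benchmark score can always be traced back to the exact configuration that produced it.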

---

## Contact

- **Issues:** https://github.com/my-ai-stack/stack-2.9/issues
- **Discussions:** https://github.com/my-ai-stack/stack-2.9/discussions
- **Discord:** (link in README)

---

*This roadmap is a living document, updated based on community feedback and project progress.*