# Stack 2.9 Roadmap

*Last updated: April 2026*
## Current Status
### What's Working ✅
- Basic code generation - The model can generate Python, JavaScript, and other code based on prompts
- CLI interface - Working command-line interface (`stack.py`, `src/cli/`)
- Multi-provider support - Ollama, OpenAI, Anthropic, OpenRouter, Together AI integrations
- 46 built-in tools - File operations, git, shell, web search, memory, task planning
- Pattern Memory infrastructure - Observer, Learner, Memory components implemented
- Training pipeline - LoRA fine-tuning scripts, data preparation, model merging
- Deployment options - Docker, RunPod, Vast.ai, Kubernetes, HuggingFace Spaces
- 128K context window - Extended from base model's 32K
### What's Broken or Missing ⚠️
- Tool calling not trained - Model doesn't reliably use tools; needs fine-tuning on tool patterns
- Benchmark scores unverifiable - Previous claims removed after audit found only 20/164 HumanEval problems tested
- Self-evolution not functional - Observer/Learner components exist but not connected to training pipeline
- Voice integration incomplete - Coqui XTTS integration present but not tested
- Evaluation infrastructure in progress - A new evaluation framework that runs real model inference has been built, but it has not yet been run on the full benchmarks
### What Needs Testing 🧪
- Full HumanEval (164 problems) evaluation
- Full MBPP (500 problems) evaluation
- Tool-calling accuracy with real tasks
- Pattern Memory retrieval and effectiveness
- Voice input/output pipeline
- Multi-provider compatibility
### What Needs Documentation 📝
- Tool definitions and schemas (example sketch after this list)
- API reference (`internal/ARCHITECTURE.md` exists but needs updating)
- Pattern Memory usage guide
- Deployment troubleshooting
- Evaluation methodology
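
As a starting point for the tool documentation, here is a minimal sketch of what one entry in `docs/TOOLS.md` could look like. The shape (name/description/JSON-schema parameters) follows common tool-calling conventions and is an assumption, not Stack 2.9's actual schema; verify field names against the real tool implementations before documenting all 46 tools this way.

```python
# Hypothetical docs/TOOLS.md entry -- field names follow common JSON-schema
# tool-calling conventions and are assumptions, not Stack 2.9's actual schema.
READ_FILE_TOOL = {
    "name": "read_file",  # illustrative tool name
    "description": "Read a text file and return its contents as a string.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path to the file to read."},
        },
        "required": ["path"],
    },
}
```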
## Timeline with Milestones
### Short-Term (1-2 Weeks)
| Milestone | Description | Status |
|---|---|---|
| S1.1 | Run full HumanEval (164 problems) with proper inference (harness sketch below) | Not started |
| S1.2 | Run full MBPP (500 problems) with proper inference | Not started |
| S1.3 | Document all 46 tool definitions in `docs/TOOLS.md` | In progress |
| S1.4 | Fix evaluation scripts to use real model inference | Needed |
| S1.5 | Create minimal reproducible test for tool calling | Not started |
**Owner:** Community contributions welcome
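
For S1.1, a minimal harness built on OpenAI's `human-eval` package (`pip install human-eval`) could look like the sketch below. `generate_completion` is a hypothetical stand-in for however Stack 2.9 is actually invoked; the rest follows the package's documented flow.

```python
# HumanEval harness sketch (S1.1). `generate_completion` is a placeholder --
# wire it to Stack 2.9's real inference path before running.
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    # Replace with a real call into Stack 2.9 (CLI, API client, etc.).
    raise NotImplementedError

problems = read_problems()  # loads all 164 problems
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Score with the package's CLI, which executes the generated code
# (the package requires you to explicitly enable execution; see its README):
#   evaluate_functional_correctness samples.jsonl
```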
### Medium-Term (1-3 Months)
| Milestone | Description | Status |
|---|---|---|
| M2.1 | Fine-tune model on tool-calling patterns (RTMP data) | Not started |
| M2.2 | Implement and test self-evolution loop (Observer → Learner → Memory → Trainer; loop sketch below) | Not started |
| M2.3 | Run full benchmark evaluation and publish verified scores | Not started |
| M2.4 | Add MCP server support for external tool integration | Partial |
| M2.5 | Voice integration end-to-end testing | Not started |
| M2.6 | Implement pattern extraction from production usage | Not started |
**Owner:** Requires training compute budget or community contribution
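
To make the M2.2 goal concrete, here is a rough sketch of how the loop could be wired once the pieces are connected. Every method name below is illustrative; the actual Observer/Learner/Memory interfaces in the codebase may differ.

```python
# Illustrative self-evolution loop for M2.2 (Observer -> Learner -> Memory -> Trainer).
# All interfaces here are assumptions, not the actual Pattern Memory APIs.
def evolution_step(observer, learner, memory, trainer, min_examples: int = 100) -> None:
    events = observer.collect()                # gather raw usage events
    patterns = learner.extract(events)         # distill reusable patterns
    memory.store(patterns)                     # persist for later retrieval
    examples = memory.to_training_examples()   # convert patterns to SFT rows
    if len(examples) >= min_examples:          # arbitrary batch threshold
        trainer.fine_tune(examples)            # close the loop with a LoRA run
```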
### Long-Term (6+ Months)
| Milestone | Description | Status |
|---|---|---|
| L3.1 | RLHF training for improved tool selection | Future |
| L3.2 | Team sync infrastructure (PostgreSQL + FastAPI) | Designed, not implemented |
| L3.3 | Federated learning for privacy-preserving updates | Future |
| L3.4 | Multi-modal support (images → code) | Future |
| L3.5 | Real-time voice-to-voice conversation | Future |
**Owner:** Long-term vision, needs significant resources
## How to Contribute
### By Priority
- Run evaluations - Help us verify benchmark scores by running `python stack_2_9_eval/run_proper_evaluation.py`
- Test tool calling - Try the model with various tools and report what works/doesn't
- Documentation - Improve docs, especially tool definitions and API reference
- Bug reports - Open issues with reproduction steps
- Code contributions - See CONTRIBUTING.md for guidelines
### Contribution Areas
| Area | Skill Needed | Priority |
|---|---|---|
| Evaluation | Python, ML benchmarking | High |
| Tool calling tests | Python, CLI usage | High |
| Documentation | Technical writing | Medium |
| Training scripts | PyTorch, PEFT | Medium |
| Deployment | Docker, K8s, Cloud | Low |
| Pattern Memory | Vector databases, ML | Low |
### Quick Wins for Contributors
- Run `python stack.py -c "List files in current directory"` and report if tools work (a scripted version appears below)
- Review `stack/eval/results/` and verify evaluation logs
- Check `docs/TOOLS.md` accuracy against actual tool implementations
- Test with different providers (`--provider ollama|openai|anthropic`)
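
The first quick win can be scripted so results are easy to share in an issue. This sketch assumes `stack.py` accepts `-c` exactly as shown above and prints tool output to stdout; adjust to the real CLI behavior.

```python
# Smoke test for the first quick win. Assumes the -c flag shown above and
# that a successful tool run prints the directory listing to stdout.
import subprocess
import sys

result = subprocess.run(
    [sys.executable, "stack.py", "-c", "List files in current directory"],
    capture_output=True, text=True, timeout=300,
)
print(result.stdout)

# Crude heuristic: if a file-listing tool really ran from the repo root,
# our own entry point should appear in the output.
if "stack.py" in result.stdout:
    print("PASS: output looks like a real directory listing")
else:
    print("FAIL/inconclusive: inspect stdout above and stderr:", result.stderr)
```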
## Technical Notes
### Known Limitations
- Tool calling is not trained - The base model has tool capabilities but Stack 2.9 hasn't been fine-tuned to use them reliably
- Pattern Memory is read-only - The system stores patterns but doesn't automatically retrain on them yet
- Evaluation uses stub data - Some eval scripts return pre-canned answers instead of running the model
- Voice integration untested - Code exists but hasn't been validated end-to-end
### Next Training Run Requirements
To fix tool calling, the next training run needs:
- Dataset: `data/rtmp-tools/combined_tools.jsonl` (already generated)
- Compute: ~1 hour on an A100 for LoRA fine-tuning
- Configuration: target `tool_call` logits, use `tool_use_examples.jsonl` (configuration sketch below)
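
For reference, a minimal sketch of that run with Hugging Face `transformers`, `peft`, and `datasets` follows. The base model ID, hyperparameters, and `target_modules` are illustrative placeholders, not the project's actual training configuration; align them with the existing training scripts.

```python
# LoRA fine-tuning sketch for the tool-calling run. Hyperparameters and the
# base model ID are assumptions -- check the project's training scripts.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "base-model-id"  # placeholder for Stack 2.9's actual base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # common defaults; verify per architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only a small fraction trainable

dataset = load_dataset(
    "json", data_files="data/rtmp-tools/combined_tools.jsonl", split="train"
)
# From here, tokenize and train with transformers.Trainer or trl's SFTTrainer;
# ~1 hour on an A100 is plausible for a single LoRA pass over this dataset.
```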
## Contact
- Issues: https://github.com/my-ai-stack/stack-2.9/issues
- Discussions: https://github.com/my-ai-stack/stack-2.9/discussions
- Discord: (link in README)
*This roadmap is a living document; it will be updated based on community feedback and project progress.*