# Stack 2.9 Roadmap

> Last updated: April 2026

## Current Status

### What's Working ✅

- **Basic code generation** - The model can generate Python, JavaScript, and other code from prompts
- **CLI interface** - Working command-line interface (`stack.py`, `src/cli/`)
- **Multi-provider support** - Ollama, OpenAI, Anthropic, OpenRouter, and Together AI integrations
- **46 built-in tools** - File operations, git, shell, web search, memory, task planning
- **Pattern Memory infrastructure** - Observer, Learner, and Memory components implemented
- **Training pipeline** - LoRA fine-tuning scripts, data preparation, model merging
- **Deployment options** - Docker, RunPod, Vast.ai, Kubernetes, HuggingFace Spaces
- **128K context window** - Extended from the base model's 32K

### What's Broken or Missing ⚠️

- **Tool calling not trained** - The model doesn't use tools reliably; it needs fine-tuning on tool-calling patterns
- **Benchmark scores unverifiable** - Previous claims were removed after an audit found that only 20 of 164 HumanEval problems had been tested
- **Self-evolution not functional** - The Observer and Learner components exist but are not connected to the training pipeline
- **Voice integration incomplete** - Coqui XTTS integration is present but untested
- **Evaluation infrastructure in progress** - A new evaluation framework has been built but has not yet been run on the full benchmarks

### What Needs Testing 🔧

- Full HumanEval (164 problems) evaluation
- Full MBPP (500 problems) evaluation
- Tool-calling accuracy on real tasks
- Pattern Memory retrieval and effectiveness
- Voice input/output pipeline
- Multi-provider compatibility

### What Needs Documentation 📚

- Tool definitions and schemas
- API reference (`internal/ARCHITECTURE.md` exists but needs updating)
- Pattern Memory usage guide
- Deployment troubleshooting
- Evaluation methodology

---

## Timeline with Milestones

### Short-Term (1-2 Weeks)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **S1.1** | Run full HumanEval (164 problems) with proper inference | Not started |
| **S1.2** | Run full MBPP (500 problems) with proper inference | Not started |
| **S1.3** | Document all 46 tool definitions in `docs/TOOLS.md` | In progress |
| **S1.4** | Fix evaluation scripts to use real model inference (see the sketch after this timeline) | Needed |
| **S1.5** | Create a minimal reproducible test for tool calling | Not started |

**Owner:** Community contributions welcome

### Medium-Term (1-3 Months)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **M2.1** | Fine-tune the model on tool-calling patterns (RTMP data) | Not started |
| **M2.2** | Implement and test the self-evolution loop (Observer → Learner → Memory → Trainer; see the sketch after this timeline) | Not started |
| **M2.3** | Run the full benchmark evaluation and publish verified scores | Not started |
| **M2.4** | Add MCP server support for external tool integration | Partial |
| **M2.5** | Voice integration end-to-end testing | Not started |
| **M2.6** | Implement pattern extraction from production usage | Not started |

**Owner:** Requires a training compute budget or community contributions

### Long-Term (6+ Months)

| Milestone | Description | Status |
|-----------|-------------|--------|
| **L3.1** | RLHF training for improved tool selection | Future |
| **L3.2** | Team sync infrastructure (PostgreSQL + FastAPI) | Designed, not implemented |
| **L3.3** | Federated learning for privacy-preserving updates | Future |
| **L3.4** | Multi-modal support (images → code) | Future |
| **L3.5** | Real-time voice-to-voice conversation | Future |

**Owner:** Long-term vision; needs significant resources
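### Sketch: Full-Benchmark Evaluation (S1.1 / S1.4)

A minimal sketch of what "proper inference" over all 164 HumanEval problems could look like, using OpenAI's `human-eval` harness. The `generate_completion` helper is a hypothetical placeholder, not part of Stack 2.9 today; it must be wired to the actual model (local checkpoint or a provider backend) rather than any stubbed answers.

```python
# Minimal full-run sketch using OpenAI's human-eval harness.
# HYPOTHETICAL: generate_completion() stands in for Stack 2.9's real
# inference call; nothing here may return canned answers.
from human_eval.data import read_problems, write_jsonl


def generate_completion(prompt: str) -> str:
    # Replace with an actual call into the model -- this placeholder is
    # exactly the part S1.4 needs to fix.
    raise NotImplementedError("wire up real model inference here")


problems = read_problems()  # all 164 problems, not a 20-problem subset
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)

# Score pass@1 with the harness's CLI:
#   evaluate_functional_correctness samples.jsonl
```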
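### Sketch: Self-Evolution Loop Wiring (M2.2)

M2.2 is about connecting components that already exist. The sketch below shows one way the Observer → Learner → Memory → Trainer wiring could look; every interface in it (method names, the `Pattern` fields) is an assumption for illustration, since the real components' APIs aren't pinned down in this document.

```python
# Hypothetical wiring for the self-evolution loop. The Observer, Learner,
# and Memory components exist in the codebase, but all interfaces below
# are assumptions for illustration.
from dataclasses import dataclass, field


@dataclass
class Pattern:
    prompt: str
    response: str
    score: float  # e.g. a task-success signal attached by the Learner


@dataclass
class Memory:
    patterns: list = field(default_factory=list)

    def store(self, pattern: Pattern) -> None:
        self.patterns.append(pattern)

    def high_value(self, threshold: float = 0.8) -> list:
        return [p for p in self.patterns if p.score >= threshold]


def evolution_step(observer, learner, memory: Memory, trainer) -> None:
    # Observer: watches live sessions and yields candidate interactions.
    for interaction in observer.collect():
        # Learner: distills an interaction into a scored, reusable pattern.
        pattern = learner.extract(interaction)
        if pattern is not None:
            memory.store(pattern)
    # Trainer: periodically fine-tunes (e.g. a LoRA pass) on the patterns
    # worth keeping -- this is the link that doesn't exist yet.
    batch = memory.high_value()
    if batch:
        trainer.finetune(batch)
```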
---

## How to Contribute

### By Priority

1. **Run evaluations** - Help us verify benchmark scores by running `python stack_2_9_eval/run_proper_evaluation.py`
2. **Test tool calling** - Try the model with various tools and report what works and what doesn't
3. **Documentation** - Improve the docs, especially tool definitions and the API reference
4. **Bug reports** - Open issues with reproduction steps
5. **Code contributions** - See CONTRIBUTING.md for guidelines

### Contribution Areas

| Area | Skill Needed | Priority |
|------|--------------|----------|
| Evaluation | Python, ML benchmarking | High |
| Tool calling tests | Python, CLI usage | High |
| Documentation | Technical writing | Medium |
| Training scripts | PyTorch, PEFT | Medium |
| Deployment | Docker, K8s, Cloud | Low |
| Pattern Memory | Vector databases, ML | Low |

### Quick Wins for Contributors

- Run `python stack.py -c "List files in current directory"` and report whether tools work
- Review `stack/eval/results/` and verify the evaluation logs
- Check `docs/TOOLS.md` for accuracy against the actual tool implementations
- Test with different providers (`--provider ollama|openai|anthropic`)

---

## Technical Notes

### Known Limitations

1. **Tool calling is not trained** - The base model has tool capabilities, but Stack 2.9 hasn't been fine-tuned to use them reliably
2. **Pattern Memory is read-only** - The system stores patterns but doesn't automatically retrain on them yet
3. **Evaluation uses stub data** - Some eval scripts return pre-canned answers instead of running the model
4. **Voice integration untested** - The code exists but hasn't been validated end-to-end

### Next Training Run Requirements

To fix tool calling, the next training run needs:

- **Dataset:** `data/rtmp-tools/combined_tools.jsonl` (already generated)
- **Compute:** ~1 hour on an A100 for LoRA fine-tuning
- **Configuration:** Target tool_call logits; use `tool_use_examples.jsonl`

A hedged configuration sketch appears in the appendix at the end of this document.

---

## Contact

- **Issues:** https://github.com/my-ai-stack/stack-2.9/issues
- **Discussions:** https://github.com/my-ai-stack/stack-2.9/discussions
- **Discord:** (link in README)

---

*This roadmap is a living document, updated based on community feedback and project progress.*
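---

## Appendix: Sketch of the Next Training Run

For reference, a minimal sketch of what the LoRA run described under "Next Training Run Requirements" could look like with `transformers` + `peft` + `datasets`. Only the dataset path comes from this roadmap; the base model name, the assumed `"text"` field, the target modules, and all hyperparameters are placeholder assumptions, and this shows a plain supervised LoRA pass rather than whatever "target tool_call logits" turns out to mean in practice (e.g. loss masking on non-tool tokens).

```python
# Hedged sketch of the tool-calling LoRA run; see assumptions above.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "org/stack-2.9-base"  # placeholder, not a real checkpoint name

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Standard causal-LM LoRA adapter; r/alpha/target_modules are illustrative
# defaults, not Stack 2.9's actual configuration.
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
               target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

# The roadmap says this file is already generated; assume one JSON object
# per line with a "text" field holding a rendered tool-call transcript.
data = load_dataset("json", data_files="data/rtmp-tools/combined_tools.jsonl")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = data["train"].map(tokenize, batched=True,
                          remove_columns=data["train"].column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out/tool-lora", num_train_epochs=1,
                           per_device_train_batch_size=4, learning_rate=2e-4,
                           bf16=True),  # roadmap budget: ~1 hour on an A100
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```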