---
title: SpindleFlow RL
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.40.0"
app_file: streamlit_app.py
pinned: false
---
# SpindleFlow RL – Delegation Policy RL Environment
An RL environment that trains an orchestrator to **learn** delegation strategy,
built on top of the SpindleFlow multi-agent execution system.
## Architecture
```
SpindleFlow (TypeScript) → execution backend
SpindleFlow RL (Python)  → RL training layer
```
The RL agent learns *which specialists to call, in what mode, and when to stop*,
not how to write YAML. SpindleFlow executes the decisions; the RL policy makes them.
## Key Design Decisions
| Component | Design | Why |
|---|---|---|
| Reward | Tiered cascade (0/1/2/3) with episode-level tier lock | Valid delta, no tier drift, $8/1000-episode run |
| Roster | Capability embeddings (all-MiniLM-L6-v2, 384-dim) | Zero-shot generalization to new specialists |
| Delegation | DAG with cycle detection + action masking | No A→B→A loops |
| Policy | LSTM PPO (RecurrentPPO, SB3) | POMDP-safe for scratchpad context |
| Graph encoding | Padded adjacency MLP (not GNN) | Hackathon-feasible; GNN for production |
| Consistency | Dirichlet prior (alpha=1.0) | Non-zero reward from Episode 1 |
| Stopping | STOP as explicit learned action (Head 1) | Adaptive, not hardcoded |
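The cycle-detection + action-masking row above can be sketched as follows. This is an illustrative toy, not the actual SpindleFlow RL code: a delegation edge to a specialist is masked out whenever that specialist can already reach the caller in the graph, since adding the edge would close a loop.

```python
# Hypothetical sketch of DAG action masking: mask any CALL edge that would
# close a cycle (e.g. A -> B -> A). Illustrative names, not the real API.
def reaches(adj, src, dst):
    """Iterative DFS: is dst reachable from src in the delegation graph?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj.get(node, ()))
    return False

def edge_mask(adj, src, n_specialists):
    """1 = src may delegate to this specialist without creating a cycle."""
    return [0 if reaches(adj, target, src) else 1
            for target in range(n_specialists)]

adj = {0: [1], 1: [2]}       # existing graph: A -> B -> C
mask = edge_mask(adj, 0, 3)  # A may call C, but A -> A and B -> A-style loops are masked
```

Note that `reaches(adj, t, src)` with `t == src` returns `True`, so self-delegation is masked for free.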
## Quick Start
```bash
# 1. Install dependencies
pip install -r requirements.txt
pip install sb3-contrib

# 2. Set environment variables
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

# 3. Run smoke tests
pytest tests/ -v

# 4. Pre-compute demo assets
python demo/precompute_demo.py

# 5. Start training (Phase 1)
python training/train.py --phase 1 --timesteps 50000

# 6. Watch training curves
tensorboard --logdir tensorboard_logs/

# 7. Run demo
python demo/run_demo.py
```
## Reward Function
```python
total_reward = (
    quality_delta          # specialist_score - baseline_score (same tier)
    - efficiency_penalty   # 0.05 * max(0, n_specialists - expected)
    - failure_penalty      # 0.3 per timeout, 0.2 per error (reduced if fallback)
    + recovery_bonus       # 0.1 if fallback recovered successfully
    - conflict_penalty     # 0.1 per unresolved conflict
    + conflict_bonus       # 0.05 per resolved conflict
    + consistency_bonus    # 0.1 * Dirichlet-prior path consistency
    - latency_penalty      # latency_weight * overage_fraction (tunable)
    + explanation_bonus    # 0.05 if delegation is auditable
)
```
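A worked example of the cascade above, with hypothetical per-episode values (the weights mirror the comments; `latency_weight = 0.02` is an illustrative choice, and the actual implementation lives in `reward/`):

```python
# Hypothetical episode: 3 specialists called where 2 were expected, one error
# handled without fallback recovery failure, one resolved conflict.
quality_delta      = 0.40                                     # specialist - baseline, same tier
n_specialists, expected = 3, 2
efficiency_penalty = 0.05 * max(0, n_specialists - expected)  # 0.05
failure_penalty    = 0.3 * 0 + 0.2 * 1                        # no timeouts, one error -> 0.2
recovery_bonus     = 0.1                                      # fallback recovered
conflict_penalty   = 0.1 * 0                                  # no unresolved conflicts
conflict_bonus     = 0.05 * 1                                 # one resolved conflict
consistency_bonus  = 0.1 * 0.8                                # path consistency score in [0, 1]
latency_penalty    = 0.02 * 0.5                               # latency_weight * overage_fraction
explanation_bonus  = 0.05                                     # delegation trace is auditable

total_reward = (quality_delta - efficiency_penalty - failure_penalty
                + recovery_bonus - conflict_penalty + conflict_bonus
                + consistency_bonus - latency_penalty + explanation_bonus)
# 0.40 - 0.05 - 0.20 + 0.10 - 0.00 + 0.05 + 0.08 - 0.01 + 0.05 = 0.42
```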
## Project Structure
```
spindleflow-rl/
├── env/               → Gymnasium environment + state/action/graph
├── reward/            → Tiered reward, failure/conflict/latency signals
├── agents/            → Task decomposer, fallback chains, conflict resolver
├── policy/            → LSTM policy, state encoder, action heads
├── training/          → PPO training loop, curriculum, task bank
├── transfer/          → Cross-company fine-tuning strategy
├── audit/             → Delegation trace + explanation generation
├── security/          → Scratchpad sandbox isolation
├── demo/              → Before/after demo assets + precompute script
├── colab/             → Google Colab training notebook
├── huggingface_blog/  → HuggingFace mini-blog
├── tests/             → Pytest test suite (20 tests, all passing)
└── configs/           → Specialist catalog + training hyperparameters
```
## OpenEnv Compliance
`SpindleFlow-v0` is registered with OpenEnv (hackathon requirement):
```python
import env.openenv_wrapper # triggers registration
from env.openenv_wrapper import verify_openenv_compliance
verify_openenv_compliance() # True
```
## Observation Space
Flat `(5490,)` float32 vector (for `max_specialists=6`):
| Component | Dim |
|---|---|
| Task embedding | 384 |
| Roster embeddings (6×384) | 2304 |
| Called embeddings (6×384) | 2304 |
| Scratchpad embedding | 384 |
| Delegation graph adjacency | 100 |
| Called specialist mask | 6 |
| Scalar features | 8 |
| **Total** | **5490** |
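The component dimensions above can be checked by concatenation. A minimal sketch, assuming zero-filled placeholders (all variable names here are illustrative, not the actual encoder API):

```python
import numpy as np

EMB = 384   # all-MiniLM-L6-v2 embedding size
N   = 6     # max_specialists
ADJ = 100   # padded delegation-graph adjacency, flattened

# Zero-filled placeholders standing in for the real encoded state.
task_emb    = np.zeros(EMB, dtype=np.float32)
roster_embs = np.zeros(N * EMB, dtype=np.float32)  # 6 x 384 = 2304
called_embs = np.zeros(N * EMB, dtype=np.float32)  # 6 x 384 = 2304
scratchpad  = np.zeros(EMB, dtype=np.float32)
adjacency   = np.zeros(ADJ, dtype=np.float32)
called_mask = np.zeros(N, dtype=np.float32)
scalars     = np.zeros(8, dtype=np.float32)

obs = np.concatenate([task_emb, roster_embs, called_embs,
                      scratchpad, adjacency, called_mask, scalars])
assert obs.shape == (5490,)  # 384 + 2304 + 2304 + 384 + 100 + 6 + 8
```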
## Action Space
Flat `(12,)` continuous Box (for `max_specialists=6`):
| Slot | Meaning |
|---|---|
| `[0]` | Meta-action (CALL_SPECIALIST / STOP / β¦) |
| `[1:7]` | Specialist selection logits (multi-hot) |
| `[7]` | Delegation mode (SEQUENTIAL / PARALLEL / β¦) |
| `[8:12]` | Mode parameters (rounds, threshold, budget) |
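The slot layout above can be sketched as a decoding step. The thresholds and enum mappings below are illustrative assumptions, not the environment's actual decode logic:

```python
import numpy as np

# Hypothetical decode of the flat (12,) action vector for max_specialists=6.
rng = np.random.default_rng(0)
action = rng.uniform(-1.0, 1.0, size=12).astype(np.float32)

meta        = action[0]         # e.g. > 0 -> CALL_SPECIALIST, else STOP (assumed mapping)
sel_logits  = action[1:7]       # one logit per specialist slot
selected    = sel_logits > 0.0  # multi-hot specialist selection
mode        = action[7]         # e.g. > 0 -> PARALLEL, else SEQUENTIAL (assumed mapping)
mode_params = action[8:12]      # rounds, threshold, budget, ...

assert sel_logits.shape == (6,) and mode_params.shape == (4,)
```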
## Training
```bash
# Demo mode (no OpenAI calls, fast)
python training/train.py --phase 1 --timesteps 50000 --demo-mode

# Full run with T2 reward
python training/train.py --phase 1 --timesteps 100000

# Resume from checkpoint
python training/train.py --checkpoint checkpoints/spindleflow_rl_50000_steps.zip
```
## Colab
See [colab/README_COLAB.md](colab/README_COLAB.md) for Google Colab quick start (T4 GPU, free tier).
## HuggingFace
See [huggingface_blog/blog_post.md](huggingface_blog/blog_post.md) for the submission blog post.