| title: SpindleFlow RL | |
| emoji: π€ | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: streamlit | |
| sdk_version: "1.40.0" | |
| app_file: streamlit_app.py | |
| pinned: false | |
| # SpindleFlow RL: Delegation Policy RL Environment | |
| An RL environment that trains an orchestrator to **learn** delegation strategy, | |
| built on top of the SpindleFlow multi-agent execution system. | |
| ## Architecture | |
| ``` | |
| SpindleFlow (TypeScript): execution backend | |
| SpindleFlow RL (Python): RL training layer | |
| ``` | |
| The RL agent learns *which specialists to call, in what mode, and when to stop*, | |
| not how to write YAML. SpindleFlow executes the decisions; the RL policy makes them. | |
| ## Key Design Decisions | |
| | Component | Design | Why | | |
| |---|---|---| | |
| Reward | Tiered cascade (0/1/2/3) with episode-level tier lock | Quality delta compares scores from the same tier, no tier drift within an episode, $8 per 1000-episode run | | |
| | Roster | Capability embeddings (all-MiniLM-L6-v2, 384-dim) | Zero-shot generalization to new specialists | | |
| Delegation | DAG with cycle detection + action masking | No A→B→A loops | | |
| | Policy | LSTM PPO (RecurrentPPO, SB3) | POMDP-safe for scratchpad context | | |
| | Graph encoding | Padded adjacency MLP (not GNN) | Hackathon-feasible; GNN for production | | |
| | Consistency | Dirichlet prior (alpha=1.0) | Non-zero reward from Episode 1 | | |
| | Stopping | STOP as explicit learned action (Head 1) | Adaptive, not hardcoded | | |
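The delegation row above pairs cycle detection with action masking. A minimal sketch of how such a mask could be computed follows; `reachable`, `delegation_mask`, and the dict-of-sets graph representation are hypothetical names chosen for illustration and are not taken from the SpindleFlow RL code.

```python
from typing import Dict, List, Set


def reachable(graph: Dict[str, Set[str]], start: str) -> Set[str]:
    """Return every node reachable from `start` by following delegation edges."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def delegation_mask(graph: Dict[str, Set[str]], caller: str,
                    roster: List[str]) -> List[bool]:
    """True where `caller` may delegate to that specialist without a cycle."""
    mask = []
    for candidate in roster:
        # Adding caller -> candidate closes a cycle exactly when the caller is
        # already reachable from the candidate (or the two are the same node).
        creates_cycle = candidate == caller or caller in reachable(graph, candidate)
        mask.append(not creates_cycle)
    return mask


# A already delegated to B, so B may not delegate back to A (no A→B→A loop).
graph = {"A": {"B"}, "B": set()}
roster = ["A", "B", "C"]
mask_for_b = delegation_mask(graph, "B", roster)
```

A policy would typically apply such a mask to the specialist-selection logits before sampling, so that invalid delegations are never chosen in the first place.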
| ## Quick Start | |
| ```bash | |
| # 1. Install dependencies | |
| pip install -r requirements.txt | |
| pip install sb3-contrib | |
| # 2. Set environment variables | |
| cp .env.example .env | |
| # Edit .env with your OPENAI_API_KEY | |
| # 3. Run smoke tests | |
| pytest tests/ -v | |
| # 4. Pre-compute demo assets | |
| python demo/precompute_demo.py | |
| # 5. Start training (Phase 1) | |
| python training/train.py --phase 1 --timesteps 50000 | |
| # 6. Watch training curves | |
| tensorboard --logdir tensorboard_logs/ | |
| # 7. Run demo | |
| python demo/run_demo.py | |
| ``` | |
| ## Reward Function | |
| ```python | |
| total_reward = ( | |
|     quality_delta # specialist_score - baseline_score (same tier) | |
|     - efficiency_penalty # 0.05 * max(0, n_specialists - expected) | |
|     - failure_penalty # 0.3 per timeout, 0.2 per error (reduced if fallback) | |
|     + recovery_bonus # 0.1 if fallback recovered successfully | |
|     - conflict_penalty # 0.1 per unresolved conflict | |
|     + conflict_bonus # 0.05 per resolved conflict | |
|     + consistency_bonus # 0.1 * Dirichlet-prior path consistency | |
|     - latency_penalty # latency_weight * overage_fraction (tunable) | |
|     + explanation_bonus # 0.05 if delegation is auditable | |
| ) | |
| ``` | |
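The terms above can be turned into a small runnable sketch. Everything here is illustrative: `EpisodeStats`, `smoothed_path_consistency`, and the field names are hypothetical, the fallback-based reduction of the failure penalty is left out because its size is not stated, and the Dirichlet helper is only one plausible reading of "Dirichlet-prior path consistency" (the posterior mean under a symmetric prior with `alpha=1.0`).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class EpisodeStats:
    """Hypothetical container for the per-episode signals named above."""
    specialist_score: float
    baseline_score: float          # scored at the same tier as specialist_score
    n_specialists: int
    expected_specialists: int
    timeouts: int = 0
    errors: int = 0
    fallback_recovered: bool = False
    unresolved_conflicts: int = 0
    resolved_conflicts: int = 0
    path_consistency: float = 0.0  # value in [0, 1]
    latency_overage_fraction: float = 0.0
    auditable: bool = False


def smoothed_path_consistency(counts: Dict[str, int], path: str,
                              n_paths: int, alpha: float = 1.0) -> float:
    """Posterior mean of a symmetric Dirichlet(alpha) prior over n_paths paths.

    ASSUMPTION: an illustrative definition, not the project's own formula.
    With alpha > 0 the value is positive before any path has been observed.
    """
    total = sum(counts.values())
    return (counts.get(path, 0) + alpha) / (total + alpha * n_paths)


def total_reward(s: EpisodeStats, latency_weight: float) -> float:
    """Combine the reward terms using the coefficients listed in the README."""
    quality_delta = s.specialist_score - s.baseline_score
    efficiency_penalty = 0.05 * max(0, s.n_specialists - s.expected_specialists)
    # The reduction applied when a fallback exists is unspecified, so omitted.
    failure_penalty = 0.3 * s.timeouts + 0.2 * s.errors
    recovery_bonus = 0.1 if s.fallback_recovered else 0.0
    conflict_penalty = 0.1 * s.unresolved_conflicts
    conflict_bonus = 0.05 * s.resolved_conflicts
    consistency_bonus = 0.1 * s.path_consistency
    latency_penalty = latency_weight * s.latency_overage_fraction
    explanation_bonus = 0.05 if s.auditable else 0.0
    return (quality_delta - efficiency_penalty - failure_penalty + recovery_bonus
            - conflict_penalty + conflict_bonus + consistency_bonus
            - latency_penalty + explanation_bonus)


# First episode: no path history yet, four hypothetical candidate paths.
first_episode = smoothed_path_consistency({}, "planner>coder", n_paths=4)
stats = EpisodeStats(specialist_score=0.8, baseline_score=0.5,
                     n_specialists=3, expected_specialists=2,
                     path_consistency=first_episode, auditable=True)
reward = total_reward(stats, latency_weight=0.1)  # 0.1 is an arbitrary example value
```

Under this reading, the smoothed estimate is 1/4 on the first episode rather than zero, which is the kind of behaviour the "non-zero reward from Episode 1" design note describes.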
| ## Project Structure | |
| ``` | |
| spindleflow-rl/ | |
| ├── env/ # Gymnasium environment + state/action/graph | |
| ├── reward/ # Tiered reward, failure/conflict/latency signals | |
| ├── agents/ # Task decomposer, fallback chains, conflict resolver | |
| ├── policy/ # LSTM policy, state encoder, action heads | |
| ├── training/ # PPO training loop, curriculum, task bank | |
| ├── transfer/ # Cross-company fine-tuning strategy | |
| ├── audit/ # Delegation trace + explanation generation | |
| ├── security/ # Scratchpad sandbox isolation | |
| ├── demo/ # Before/after demo assets + precompute script | |
| ├── colab/ # Google Colab training notebook | |
| ├── huggingface_blog/ # HuggingFace mini-blog | |
| ├── tests/ # Pytest test suite (20 tests, all passing) | |
| └── configs/ # Specialist catalog + training hyperparameters | |
| ``` | |
| ## OpenEnv Compliance | |
| `SpindleFlow-v0` is registered with OpenEnv (hackathon requirement): | |
| ```python | |
| import env.openenv_wrapper # triggers registration | |
| from env.openenv_wrapper import verify_openenv_compliance | |
| verify_openenv_compliance() # True | |
| ``` | |
| ## Observation Space | |
| Flat `(5490,)` float32 vector (for `max_specialists=6`): | |
| | Component | Dim | | |
| |---|---| | |
| | Task embedding | 384 | | |
| Roster embeddings (6×384) | 2304 | |
| Called embeddings (6×384) | 2304 | |
| | Scratchpad embedding | 384 | | |
| | Delegation graph adjacency | 100 | | |
| | Called specialist mask | 6 | | |
| | Scalar features | 8 | | |
| | **Total** | **5490** | | |
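The total in the table can be checked by summing the component sizes. The sketch below does that and also derives a slice for each component; it assumes the components are laid out in the order the table lists them and that the 100 adjacency entries come from a 10×10 matrix. Neither assumption is confirmed by the README, and `observation_layout` is a hypothetical helper.

```python
EMBED_DIM = 384   # all-MiniLM-L6-v2 embedding size, as stated in the README
MAX_NODES = 10    # ASSUMPTION: a 10 x 10 adjacency matrix gives the 100 entries
N_SCALARS = 8


def observation_layout(max_specialists: int = 6) -> dict:
    """Map each component name to its (start, stop) slice in the flat vector.

    ASSUMPTION: components are concatenated in the order shown in the table.
    """
    sizes = [
        ("task_embedding", EMBED_DIM),
        ("roster_embeddings", max_specialists * EMBED_DIM),
        ("called_embeddings", max_specialists * EMBED_DIM),
        ("scratchpad_embedding", EMBED_DIM),
        ("graph_adjacency", MAX_NODES * MAX_NODES),
        ("called_mask", max_specialists),
        ("scalar_features", N_SCALARS),
    ]
    layout, offset = {}, 0
    for name, size in sizes:
        layout[name] = (offset, offset + size)
        offset += size
    return layout


layout = observation_layout(max_specialists=6)
total_dim = layout["scalar_features"][1]  # end of the last component
```

With `max_specialists=6` the last slice ends at 5490, matching the stated shape. A different roster size changes the two embedding blocks and the mask, so the total changes with it.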
| ## Action Space | |
| Flat `(12,)` continuous Box (for `max_specialists=6`): | |
| | Slot | Meaning | | |
| |---|---| | |
| | `[0]` | Meta-action (CALL_SPECIALIST / STOP / β¦) | | |
| | `[1:7]` | Specialist selection logits (multi-hot) | | |
| | `[7]` | Delegation mode (SEQUENTIAL / PARALLEL / β¦) | | |
| | `[8:12]` | Mode parameters (rounds, threshold, budget) | | |
| ## Training | |
| ```bash | |
| # Demo mode (no OpenAI calls, fast) | |
| python training/train.py --phase 1 --timesteps 50000 --demo-mode | |
| # Full run with T2 reward | |
| python training/train.py --phase 1 --timesteps 100000 | |
| # Resume from checkpoint | |
| python training/train.py --checkpoint checkpoints/spindleflow_rl_50000_steps.zip | |
| ``` | |
| ## Colab | |
| See [colab/README_COLAB.md](colab/README_COLAB.md) for Google Colab quick start (T4 GPU, free tier). | |
| ## HuggingFace | |
| See [huggingface_blog/blog_post.md](huggingface_blog/blog_post.md) for the submission blog post. | |