--- title: SpindleFlow RL emoji: 🤖 colorFrom: blue colorTo: purple sdk: gradio sdk_version: "5.50.0" app_file: app.py pinned: false --- # SpindleFlow RL — Delegation Policy RL Environment An RL environment that trains an orchestrator to **learn** delegation strategy, built on top of the SpindleFlow multi-agent execution system. ## Architecture ``` SpindleFlow (TypeScript) ← execution backend SpindleFlow RL (Python) ← RL training layer ``` The RL agent learns *which specialists to call, in what mode, and when to stop* — not how to write YAML. SpindleFlow executes the decisions; the RL policy makes them. ## Key Design Decisions | Component | Design | Why | |---|---|---| | Reward | Tiered cascade (0/1/2/3) with episode-level tier lock | Valid delta, no tier drift, $8/1000-episode run | | Roster | Capability embeddings (all-MiniLM-L6-v2, 384-dim) | Zero-shot generalization to new specialists | | Delegation | DAG with cycle detection + action masking | No A→B→A loops | | Policy | LSTM PPO (RecurrentPPO, SB3) | POMDP-safe for scratchpad context | | Graph encoding | Padded adjacency MLP (not GNN) | Hackathon-feasible; GNN for production | | Consistency | Dirichlet prior (alpha=1.0) | Non-zero reward from Episode 1 | | Stopping | STOP as explicit learned action (Head 1) | Adaptive, not hardcoded | ## Quick Start ```bash # 1. Install dependencies pip install -r requirements.txt pip install sb3-contrib # 2. Set environment variables cp .env.example .env # Edit .env with your OPENAI_API_KEY # 3. Run smoke tests pytest tests/ -v # 4. Pre-compute demo assets python demo/precompute_demo.py # 5. Start training (Phase 1) python training/train.py --phase 1 --timesteps 50000 # 6. Watch training curves tensorboard --logdir tensorboard_logs/ # 7. Run demo python demo/run_demo.py ``` ## Reward Function ```python total_reward = ( quality_delta # specialist_score - baseline_score (same tier) - efficiency_penalty # 0.05 * max(0, n_specialists - expected) - failure_penalty # 0.3 per timeout, 0.2 per error (reduced if fallback) + recovery_bonus # 0.1 if fallback recovered successfully - conflict_penalty # 0.1 per unresolved conflict + conflict_bonus # 0.05 per resolved conflict + consistency_bonus # 0.1 * Dirichlet-prior path consistency - latency_penalty # latency_weight * overage_fraction (tunable) + explanation_bonus # 0.05 if delegation is auditable ) ``` ## Project Structure ``` spindleflow-rl/ ├── env/ ← Gymnasium environment + state/action/graph ├── reward/ ← Tiered reward, failure/conflict/latency signals ├── agents/ ← Task decomposer, fallback chains, conflict resolver ├── policy/ ← LSTM policy, state encoder, action heads ├── training/ ← PPO training loop, curriculum, task bank ├── transfer/ ← Cross-company fine-tuning strategy ├── audit/ ← Delegation trace + explanation generation ├── security/ ← Scratchpad sandbox isolation ├── demo/ ← Before/after demo assets + precompute script ├── colab/ ← Google Colab training notebook ├── huggingface_blog/ ← HuggingFace mini-blog ├── tests/ ← Pytest test suite (20 tests, all passing) └── configs/ ← Specialist catalog + training hyperparameters ``` ## OpenEnv Compliance `SpindleFlow-v0` is registered with OpenEnv (hackathon requirement): ```python import env.openenv_wrapper # triggers registration from env.openenv_wrapper import verify_openenv_compliance verify_openenv_compliance() # True ``` ## Observation Space Flat `(5490,)` float32 vector (for `max_specialists=6`): | Component | Dim | |---|---| | Task embedding | 384 | | Roster embeddings (6×384) | 2304 | | Called embeddings (6×384) | 2304 | | Scratchpad embedding | 384 | | Delegation graph adjacency | 100 | | Called specialist mask | 6 | | Scalar features | 8 | | **Total** | **5490** | ## Action Space Flat `(12,)` continuous Box (for `max_specialists=6`): | Slot | Meaning | |---|---| | `[0]` | Meta-action (CALL_SPECIALIST / STOP / …) | | `[1:7]` | Specialist selection logits (multi-hot) | | `[7]` | Delegation mode (SEQUENTIAL / PARALLEL / …) | | `[8:12]` | Mode parameters (rounds, threshold, budget) | ## Training ```bash # Demo mode (no OpenAI calls, fast) python training/train.py --phase 1 --timesteps 50000 --demo-mode # Full run with T2 reward python training/train.py --phase 1 --timesteps 100000 # Resume from checkpoint python training/train.py --checkpoint checkpoints/spindleflow_rl_50000_steps.zip ``` ## Colab See [colab/README_COLAB.md](colab/README_COLAB.md) for Google Colab quick start (T4 GPU, free tier). ## HuggingFace See [huggingface_blog/blog_post.md](huggingface_blog/blog_post.md) for the submission blog post.