SpindleFlow-RL / README.md
garvitsachdeva's picture
fix: downgrade sdk_version 1.44.0β†’1.40.0 β€” HF health check compatibility
fc19138
---
title: SpindleFlow RL
emoji: πŸ€–
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.40.0"
app_file: streamlit_app.py
pinned: false
---
# SpindleFlow RL β€” Delegation Policy RL Environment
An RL environment that trains an orchestrator to **learn** delegation strategy,
built on top of the SpindleFlow multi-agent execution system.
## Architecture
```
SpindleFlow (TypeScript) ← execution backend
SpindleFlow RL (Python) ← RL training layer
```
The RL agent learns *which specialists to call, in what mode, and when to stop* β€”
not how to write YAML. SpindleFlow executes the decisions; the RL policy makes them.
## Key Design Decisions
| Component | Design | Why |
|---|---|---|
| Reward | Tiered cascade (0/1/2/3) with episode-level tier lock | Valid delta, no tier drift, $8/1000-episode run |
| Roster | Capability embeddings (all-MiniLM-L6-v2, 384-dim) | Zero-shot generalization to new specialists |
| Delegation | DAG with cycle detection + action masking | No A→B→A loops |
| Policy | LSTM PPO (RecurrentPPO, SB3) | POMDP-safe for scratchpad context |
| Graph encoding | Padded adjacency MLP (not GNN) | Hackathon-feasible; GNN for production |
| Consistency | Dirichlet prior (alpha=1.0) | Non-zero reward from Episode 1 |
| Stopping | STOP as explicit learned action (Head 1) | Adaptive, not hardcoded |
## Quick Start
```bash
# 1. Install dependencies
pip install -r requirements.txt
pip install sb3-contrib
# 2. Set environment variables
cp .env.example .env
# Edit .env with your OPENAI_API_KEY
# 3. Run smoke tests
pytest tests/ -v
# 4. Pre-compute demo assets
python demo/precompute_demo.py
# 5. Start training (Phase 1)
python training/train.py --phase 1 --timesteps 50000
# 6. Watch training curves
tensorboard --logdir tensorboard_logs/
# 7. Run demo
python demo/run_demo.py
```
## Reward Function
```python
total_reward = (
quality_delta # specialist_score - baseline_score (same tier)
- efficiency_penalty # 0.05 * max(0, n_specialists - expected)
- failure_penalty # 0.3 per timeout, 0.2 per error (reduced if fallback)
+ recovery_bonus # 0.1 if fallback recovered successfully
- conflict_penalty # 0.1 per unresolved conflict
+ conflict_bonus # 0.05 per resolved conflict
+ consistency_bonus # 0.1 * Dirichlet-prior path consistency
- latency_penalty # latency_weight * overage_fraction (tunable)
+ explanation_bonus # 0.05 if delegation is auditable
)
```
## Project Structure
```
spindleflow-rl/
β”œβ”€β”€ env/ ← Gymnasium environment + state/action/graph
β”œβ”€β”€ reward/ ← Tiered reward, failure/conflict/latency signals
β”œβ”€β”€ agents/ ← Task decomposer, fallback chains, conflict resolver
β”œβ”€β”€ policy/ ← LSTM policy, state encoder, action heads
β”œβ”€β”€ training/ ← PPO training loop, curriculum, task bank
β”œβ”€β”€ transfer/ ← Cross-company fine-tuning strategy
β”œβ”€β”€ audit/ ← Delegation trace + explanation generation
β”œβ”€β”€ security/ ← Scratchpad sandbox isolation
β”œβ”€β”€ demo/ ← Before/after demo assets + precompute script
β”œβ”€β”€ colab/ ← Google Colab training notebook
β”œβ”€β”€ huggingface_blog/ ← HuggingFace mini-blog
β”œβ”€β”€ tests/ ← Pytest test suite (20 tests, all passing)
└── configs/ ← Specialist catalog + training hyperparameters
```
## OpenEnv Compliance
`SpindleFlow-v0` is registered with OpenEnv (hackathon requirement):
```python
import env.openenv_wrapper # triggers registration
from env.openenv_wrapper import verify_openenv_compliance
verify_openenv_compliance() # True
```
## Observation Space
Flat `(5490,)` float32 vector (for `max_specialists=6`):
| Component | Dim |
|---|---|
| Task embedding | 384 |
| Roster embeddings (6Γ—384) | 2304 |
| Called embeddings (6Γ—384) | 2304 |
| Scratchpad embedding | 384 |
| Delegation graph adjacency | 100 |
| Called specialist mask | 6 |
| Scalar features | 8 |
| **Total** | **5490** |
## Action Space
Flat `(12,)` continuous Box (for `max_specialists=6`):
| Slot | Meaning |
|---|---|
| `[0]` | Meta-action (CALL_SPECIALIST / STOP / …) |
| `[1:7]` | Specialist selection logits (multi-hot) |
| `[7]` | Delegation mode (SEQUENTIAL / PARALLEL / …) |
| `[8:12]` | Mode parameters (rounds, threshold, budget) |
## Training
```bash
# Demo mode (no OpenAI calls, fast)
python training/train.py --phase 1 --timesteps 50000 --demo-mode
# Full run with T2 reward
python training/train.py --phase 1 --timesteps 100000
# Resume from checkpoint
python training/train.py --checkpoint checkpoints/spindleflow_rl_50000_steps.zip
```
## Colab
See [colab/README_COLAB.md](colab/README_COLAB.md) for Google Colab quick start (T4 GPU, free tier).
## HuggingFace
See [huggingface_blog/blog_post.md](huggingface_blog/blog_post.md) for the submission blog post.