---
title: SpindleFlow RL
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.40.0
app_file: streamlit_app.py
pinned: false
---

# SpindleFlow RL — Delegation Policy RL Environment

An RL environment that trains an orchestrator to learn delegation strategy, built on top of the SpindleFlow multi-agent execution system.

## Architecture

```
SpindleFlow (TypeScript) ← execution backend
SpindleFlow RL (Python)  ← RL training layer
```

The RL agent learns which specialists to call, in what mode, and when to stop — not how to write YAML. SpindleFlow executes the decisions; the RL policy makes them.

## Key Design Decisions

| Component | Design | Why |
| --- | --- | --- |
| Reward | Tiered cascade (0/1/2/3) with episode-level tier lock | Valid delta, no tier drift, $8/1000-episode run |
| Roster | Capability embeddings (all-MiniLM-L6-v2, 384-dim) | Zero-shot generalization to new specialists |
| Delegation | DAG with cycle detection + action masking | No A→B→A loops |
| Policy | LSTM PPO (RecurrentPPO, SB3) | POMDP-safe for scratchpad context |
| Graph encoding | Padded adjacency MLP (not GNN) | Hackathon-feasible; GNN for production |
| Consistency | Dirichlet prior (alpha=1.0) | Non-zero reward from Episode 1 |
| Stopping | STOP as explicit learned action (Head 1) | Adaptive, not hardcoded |
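The cycle-detection and action-masking row above can be sketched as a reachability check: a delegation edge A→B is only legal if B cannot already reach A in the current graph. This is an illustrative sketch with hypothetical names (`reaches`, `legal_delegations`), not the project's actual API.

```python
# Hypothetical sketch of DAG cycle detection + action masking.
# An edge frm→s is masked out whenever s can already reach frm,
# because adding it would close a cycle (e.g. the A→B→A loop).
def reaches(adj: dict, src: str, dst: str) -> bool:
    """Iterative DFS: can dst be reached from src in the delegation graph?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(adj.get(node, ()))
    return False

def legal_delegations(adj: dict, frm: str, specialists: list) -> list:
    """Return the specialists frm may delegate to without creating a cycle."""
    return [s for s in specialists if s != frm and not reaches(adj, s, frm)]

# A→B already exists, so B→A would create a loop and is masked out:
adj = {"A": {"B"}}
print(legal_delegations(adj, "B", ["A", "C"]))  # ['C']
```

In the environment this mask would be applied to the specialist-selection logits before sampling, so the policy can never emit an illegal edge.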

## Quick Start

```bash
# 1. Install dependencies
pip install -r requirements.txt
pip install sb3-contrib

# 2. Set environment variables
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

# 3. Run smoke tests
pytest tests/ -v

# 4. Pre-compute demo assets
python demo/precompute_demo.py

# 5. Start training (Phase 1)
python training/train.py --phase 1 --timesteps 50000

# 6. Watch training curves
tensorboard --logdir tensorboard_logs/

# 7. Run demo
python demo/run_demo.py
```

## Reward Function

```python
total_reward = (
    quality_delta          # specialist_score - baseline_score (same tier)
  - efficiency_penalty     # 0.05 * max(0, n_specialists - expected)
  - failure_penalty        # 0.3 per timeout, 0.2 per error (reduced if fallback)
  + recovery_bonus         # 0.1 if fallback recovered successfully
  - conflict_penalty       # 0.1 per unresolved conflict
  + conflict_bonus         # 0.05 per resolved conflict
  + consistency_bonus      # 0.1 * Dirichlet-prior path consistency
  - latency_penalty        # latency_weight * overage_fraction (tunable)
  + explanation_bonus      # 0.05 if delegation is auditable
)
```
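The terms above can be transcribed into a runnable helper. The weights are the ones quoted in the comments; everything else (the `RewardInputs` container, its field names, the default `latency_weight`) is illustrative, and the "reduced if fallback" adjustment to the failure penalty is omitted for brevity.

```python
# Hypothetical transcription of the tiered reward terms; field names
# and the latency_weight default are assumptions, not the real config.
from dataclasses import dataclass

@dataclass
class RewardInputs:
    quality_delta: float        # specialist_score - baseline_score (same tier)
    n_specialists: int
    expected_specialists: int
    n_timeouts: int
    n_errors: int
    fallback_recovered: bool
    unresolved_conflicts: int
    resolved_conflicts: int
    path_consistency: float     # Dirichlet-prior consistency in [0, 1]
    overage_fraction: float     # fraction by which the latency budget was exceeded
    auditable: bool

def total_reward(r: RewardInputs, latency_weight: float = 0.1) -> float:
    reward = r.quality_delta
    reward -= 0.05 * max(0, r.n_specialists - r.expected_specialists)
    reward -= 0.3 * r.n_timeouts + 0.2 * r.n_errors
    reward += 0.1 if r.fallback_recovered else 0.0
    reward -= 0.1 * r.unresolved_conflicts
    reward += 0.05 * r.resolved_conflicts
    reward += 0.1 * r.path_consistency
    reward -= latency_weight * r.overage_fraction
    reward += 0.05 if r.auditable else 0.0
    return reward
```

Keeping each term additive and individually weighted makes the reward easy to ablate: zeroing one coefficient disables exactly one signal.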

## Project Structure

```
spindleflow-rl/
├── env/                   ← Gymnasium environment + state/action/graph
├── reward/                ← Tiered reward, failure/conflict/latency signals
├── agents/                ← Task decomposer, fallback chains, conflict resolver
├── policy/                ← LSTM policy, state encoder, action heads
├── training/              ← PPO training loop, curriculum, task bank
├── transfer/              ← Cross-company fine-tuning strategy
├── audit/                 ← Delegation trace + explanation generation
├── security/              ← Scratchpad sandbox isolation
├── demo/                  ← Before/after demo assets + precompute script
├── colab/                 ← Google Colab training notebook
├── huggingface_blog/      ← HuggingFace mini-blog
├── tests/                 ← Pytest test suite (20 tests, all passing)
└── configs/               ← Specialist catalog + training hyperparameters
```

## OpenEnv Compliance

SpindleFlow-v0 is registered with OpenEnv (hackathon requirement):

```python
import env.openenv_wrapper  # triggers registration
from env.openenv_wrapper import verify_openenv_compliance
verify_openenv_compliance()  # True
```

## Observation Space

Flat `(5490,)` float32 vector (for `max_specialists=6`):

| Component | Dim |
| --- | --- |
| Task embedding | 384 |
| Roster embeddings (6×384) | 2304 |
| Called embeddings (6×384) | 2304 |
| Scratchpad embedding | 384 |
| Delegation graph adjacency | 100 |
| Called specialist mask | 6 |
| Scalar features | 8 |
| **Total** | **5490** |
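A sketch of how the flat vector could be assembled from the components in the table. Function and argument names are illustrative, as is the assumption that the 100-dim adjacency comes from a padded 10×10 matrix; the shapes themselves are from the table above.

```python
# Hypothetical assembly of the flat observation; the real encoder lives
# in env/. Component shapes match the table (max_specialists=6).
import numpy as np

def build_observation(task_emb, roster_embs, called_embs, scratch_emb,
                      adjacency, called_mask, scalars):
    parts = [
        task_emb,                 # (384,)  task embedding
        roster_embs.ravel(),      # (6, 384) -> (2304,) roster embeddings
        called_embs.ravel(),      # (6, 384) -> (2304,) called embeddings
        scratch_emb,              # (384,)  scratchpad embedding
        adjacency.ravel(),        # assumed padded (10, 10) -> (100,)
        called_mask,              # (6,)    called specialist mask
        scalars,                  # (8,)    scalar features
    ]
    return np.concatenate(parts).astype(np.float32)

obs = build_observation(
    np.zeros(384), np.zeros((6, 384)), np.zeros((6, 384)),
    np.zeros(384), np.zeros((10, 10)), np.zeros(6), np.zeros(8),
)
print(obs.shape)  # (5490,)
```

Flattening everything into one vector is what lets a plain MLP/LSTM policy consume the state without a graph-specific encoder, per the "padded adjacency MLP" design decision.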

## Action Space

Flat `(12,)` continuous Box (for `max_specialists=6`):

| Slot | Meaning |
| --- | --- |
| `[0]` | Meta-action (CALL_SPECIALIST / STOP / …) |
| `[1:7]` | Specialist selection logits (multi-hot) |
| `[7]` | Delegation mode (SEQUENTIAL / PARALLEL / …) |
| `[8:12]` | Mode parameters (rounds, threshold, budget) |
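The slot layout can be decoded as below. The sign conventions (`> 0` thresholds) and result names are assumptions for illustration; only the slot boundaries come from the table.

```python
# Hypothetical decoding of the 12-dim action vector per the slot table.
# Thresholding at 0.0 is an assumed convention, not the project's actual one.
import numpy as np

def decode_action(action: np.ndarray, max_specialists: int = 6) -> dict:
    meta = action[0]                          # [0]   meta-action scalar
    logits = action[1:1 + max_specialists]    # [1:7] multi-hot selection logits
    mode = action[1 + max_specialists]        # [7]   delegation mode scalar
    params = action[2 + max_specialists:]     # [8:12] mode parameters
    return {
        "stop": bool(meta < 0.0),             # assumed: negative meta means STOP
        "selected": np.flatnonzero(logits > 0.0).tolist(),
        "parallel": bool(mode > 0.0),         # assumed: positive means PARALLEL
        "params": params.tolist(),
    }

a = np.array([0.9, 0.7, -0.2, 0.4, -0.8, -0.1, -0.5, 0.3, 0.5, 0.0, 0.2, -0.3])
print(decode_action(a))
# selects specialists 0 and 2, parallel mode, does not stop
```

Because the selection slots are independent logits rather than a softmax, the policy can delegate to several specialists in one step (the multi-hot behavior the table describes).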

## Training

```bash
# Demo mode (no OpenAI calls, fast)
python training/train.py --phase 1 --timesteps 50000 --demo-mode

# Full run with T2 reward
python training/train.py --phase 1 --timesteps 100000

# Resume from checkpoint
python training/train.py --checkpoint checkpoints/spindleflow_rl_50000_steps.zip
```

## Colab

See `colab/README_COLAB.md` for the Google Colab quick start (T4 GPU, free tier).

## HuggingFace

See `huggingface_blog/blog_post.md` for the submission blog post.