---
title: SpindleFlow RL
emoji: 🤖
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: "1.40.0"
app_file: streamlit_app.py
pinned: false
---

# SpindleFlow RL: Delegation Policy RL Environment

An RL environment that trains an orchestrator to **learn** a delegation strategy,
built on top of the SpindleFlow multi-agent execution system.

## Architecture

```
SpindleFlow (TypeScript) ← execution backend
SpindleFlow RL (Python)  ← RL training layer
```

The RL agent learns *which specialists to call, in what mode, and when to stop*,
not how to write YAML. SpindleFlow executes the decisions; the RL policy makes them.

## Key Design Decisions

| Component | Design | Why |
|---|---|---|
| Reward | Tiered cascade (0/1/2/3) with episode-level tier lock | Valid delta, no tier drift, $8/1000-episode run |
| Roster | Capability embeddings (all-MiniLM-L6-v2, 384-dim) | Zero-shot generalization to new specialists |
| Delegation | DAG with cycle detection + action masking | No A→B→A loops |
| Policy | LSTM PPO (RecurrentPPO, SB3) | POMDP-safe for scratchpad context |
| Graph encoding | Padded adjacency MLP (not GNN) | Hackathon-feasible; GNN for production |
| Consistency | Dirichlet prior (alpha=1.0) | Non-zero reward from Episode 1 |
| Stopping | STOP as explicit learned action (Head 1) | Adaptive, not hardcoded |
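
The "DAG with cycle detection + action masking" row can be sketched as follows. This is an illustrative sketch, not the repo's actual API: `would_create_cycle` and `action_mask` are hypothetical names, and the adjacency matrix convention (`adj[i, j] = 1` meaning `i` delegates to `j`) is an assumption.

```python
import numpy as np

def would_create_cycle(adj: np.ndarray, src: int, dst: int) -> bool:
    """True if adding edge src->dst would close a cycle, i.e. dst already reaches src."""
    stack, seen = [dst], set()
    while stack:
        node = stack.pop()
        if node == src:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(np.flatnonzero(adj[node]).tolist())
    return False

def action_mask(adj: np.ndarray, current: int) -> np.ndarray:
    """Boolean mask over specialists: False where delegating would create a loop
    (or self-delegate), so the policy can never emit an A->B->A chain."""
    n = adj.shape[0]
    return np.array(
        [j != current and not would_create_cycle(adj, current, j) for j in range(n)],
        dtype=bool,
    )
```

With an existing edge `0 -> 1`, the mask for specialist 1 forbids delegating back to 0.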

## Quick Start

```bash
# 1. Install dependencies
pip install -r requirements.txt
pip install sb3-contrib

# 2. Set environment variables
cp .env.example .env
# Edit .env with your OPENAI_API_KEY

# 3. Run smoke tests
pytest tests/ -v

# 4. Pre-compute demo assets
python demo/precompute_demo.py

# 5. Start training (Phase 1)
python training/train.py --phase 1 --timesteps 50000

# 6. Watch training curves
tensorboard --logdir tensorboard_logs/

# 7. Run demo
python demo/run_demo.py
```

## Reward Function

```python
total_reward = (
    quality_delta          # specialist_score - baseline_score (same tier)
  - efficiency_penalty     # 0.05 * max(0, n_specialists - expected)
  - failure_penalty        # 0.3 per timeout, 0.2 per error (reduced if fallback)
  + recovery_bonus         # 0.1 if fallback recovered successfully
  - conflict_penalty       # 0.1 per unresolved conflict
  + conflict_bonus         # 0.05 per resolved conflict
  + consistency_bonus      # 0.1 * Dirichlet-prior path consistency
  - latency_penalty        # latency_weight * overage_fraction (tunable)
  + explanation_bonus      # 0.05 if delegation is auditable
)
```
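
The `consistency_bonus` term (Dirichlet prior, alpha=1.0, per the design table) can be illustrated with a minimal sketch. `path_consistency` is a hypothetical helper, not the repo's implementation; it treats the support as the previously observed delegation paths plus the current one, so the bonus is non-zero even on Episode 1.

```python
from collections import Counter

def path_consistency(path: tuple, history: list, alpha: float = 1.0) -> float:
    """Posterior mean probability of the chosen delegation path under a
    Dirichlet(alpha) prior: (count + alpha) / (total + alpha * k)."""
    counts = Counter(history)
    k = len(counts) + (0 if path in counts else 1)  # support size incl. new path
    total = sum(counts.values())
    return (counts[path] + alpha) / (total + alpha * k)
```

The reward term would then be `0.1 * path_consistency(path, history)`, as in the cascade above.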

## Project Structure

```
spindleflow-rl/
├── env/                   ← Gymnasium environment + state/action/graph
├── reward/                ← Tiered reward, failure/conflict/latency signals
├── agents/                ← Task decomposer, fallback chains, conflict resolver
├── policy/                ← LSTM policy, state encoder, action heads
├── training/              ← PPO training loop, curriculum, task bank
├── transfer/              ← Cross-company fine-tuning strategy
├── audit/                 ← Delegation trace + explanation generation
├── security/              ← Scratchpad sandbox isolation
├── demo/                  ← Before/after demo assets + precompute script
├── colab/                 ← Google Colab training notebook
├── huggingface_blog/      ← HuggingFace mini-blog
├── tests/                 ← Pytest test suite (20 tests, all passing)
└── configs/               ← Specialist catalog + training hyperparameters
```

## OpenEnv Compliance

`SpindleFlow-v0` is registered with OpenEnv (hackathon requirement):

```python
import env.openenv_wrapper  # triggers registration
from env.openenv_wrapper import verify_openenv_compliance
verify_openenv_compliance()  # True
```

## Observation Space

Flat `(5490,)` float32 vector (for `max_specialists=6`):

| Component | Dim |
|---|---|
| Task embedding | 384 |
| Roster embeddings (6×384) | 2304 |
| Called embeddings (6×384) | 2304 |
| Scratchpad embedding | 384 |
| Delegation graph adjacency | 100 |
| Called specialist mask | 6 |
| Scalar features | 8 |
| **Total** | **5490** |
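
The dimensions in the table can be checked with a small sketch. This is illustrative only (the constant names and `flatten_obs` are not the repo's encoder); it simply concatenates the components in the table's order.

```python
import numpy as np

EMB = 384       # all-MiniLM-L6-v2 embedding size
MAX_SPEC = 6    # max_specialists
ADJ = 100       # padded adjacency matrix, flattened
SCALARS = 8

# task + roster + called + scratchpad + adjacency + mask + scalars
OBS_DIM = EMB + 2 * MAX_SPEC * EMB + EMB + ADJ + MAX_SPEC + SCALARS  # 5490

def flatten_obs(task, roster, called, scratchpad, adj, mask, scalars):
    """Concatenate all components, in table order, into the flat observation."""
    obs = np.concatenate([
        task, roster.ravel(), called.ravel(), scratchpad,
        adj.ravel(), mask, scalars,
    ]).astype(np.float32)
    assert obs.shape == (OBS_DIM,)
    return obs
```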

## Action Space

Flat `(12,)` continuous Box (for `max_specialists=6`):

| Slot | Meaning |
|---|---|
| `[0]` | Meta-action (CALL_SPECIALIST / STOP / …) |
| `[1:7]` | Specialist selection logits (multi-hot) |
| `[7]` | Delegation mode (SEQUENTIAL / PARALLEL / …) |
| `[8:12]` | Mode parameters (rounds, threshold, budget) |
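
A hedged sketch of decoding the flat vector into its slots. The meta-action and mode lists below are truncated to the entries named in the table (the "…" entries are unknown), and `decode_action` is a hypothetical helper, not the repo's code.

```python
import numpy as np

META_ACTIONS = ["CALL_SPECIALIST", "STOP"]   # truncated; table shows "…"
MODES = ["SEQUENTIAL", "PARALLEL"]           # truncated; table shows "…"

def decode_action(a: np.ndarray, threshold: float = 0.0) -> dict:
    """Split the flat (12,) action into slots, per the table above."""
    assert a.shape == (12,)
    return {
        "meta": META_ACTIONS[int(a[0]) % len(META_ACTIONS)],
        "specialists": np.flatnonzero(a[1:7] > threshold).tolist(),  # multi-hot logits
        "mode": MODES[int(a[7]) % len(MODES)],
        "params": a[8:12],   # rounds, threshold, budget, ...
    }
```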

## Training

```bash
# Demo mode (no OpenAI calls, fast)
python training/train.py --phase 1 --timesteps 50000 --demo-mode

# Full run with T2 reward
python training/train.py --phase 1 --timesteps 100000

# Resume from checkpoint
python training/train.py --checkpoint checkpoints/spindleflow_rl_50000_steps.zip
```

## Colab

See [colab/README_COLAB.md](colab/README_COLAB.md) for Google Colab quick start (T4 GPU, free tier).

## HuggingFace

See [huggingface_blog/blog_post.md](huggingface_blog/blog_post.md) for the submission blog post.