---
sdk: docker
app_port: 8000
---
# WorkflowTwin
An OpenEnv-compatible environment for training and evaluating agents under memory and resource constraints.
This environment simulates multi-step ticket resolution pipelines with:
- queueing, prioritization, and dependencies
- stochastic arrivals and agent failures
- strict memory budgets on agent state
We introduce a **quantized memory policy** based on:
- random orthogonal projection
- scalar (coordinate-wise) quantization
- random projection residual sketching
to study how compression affects agent performance under resource constraints.
## Motivation
Real-world agents must operate under limited memory and compute.
Without compression:
- state grows unbounded
- agents violate system constraints
With quantized memory:
- state is compressed
- agents remain feasible under tight budgets
This environment enables controlled evaluation of this tradeoff.
## Key Results
We evaluate two modes:
- **baseline**: no compression (truncation under pressure)
- **quant**: rotated quantized memory compression
Sweeping the memory budget across both modes reveals a clear crossover point where compression shifts from unnecessary to essential.
### Memory Budget vs Feasibility
![Memory Budget vs Compliance Rate](experiments/figures/memory_budget_vs_compliance.svg)
### Key Findings
- **Feasibility threshold shift:**
  The baseline needs a memory budget of ~6000 to reach full compliance, while quantized memory reaches it at ~3000.
- **2× efficiency gain:**
  Compression halves the memory budget required for feasible operation.
- **No-regret behavior:**
  With no memory pressure, both modes perform identically.
- **Constraint robustness:**
  Under tight budgets, the baseline fails (0% compliance) while quantized memory remains fully feasible (100%).
**Conclusion:** Compression extends the feasible operating regime without degrading task performance.
## Structure
- `env/`: core environment logic, models, scoring, reward
  - includes `quantizer.py` with rotated vector quantization primitives
- `server/`: FastAPI app exposing `reset`, `step`, `state`
- `tasks/`: JSON task definitions by difficulty
- `baseline/`: non-LLM heuristic policy
- `baselines/`: research evaluation baselines for `workflow_twin`
- `inference.py`: local rollout entrypoint
- `openenv.yaml`: environment spec
## Quickstart
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn server.app:app --reload
```
Server endpoints:
- `POST /reset`
- `POST /step` with body `{ "action_type": "triage|respond|resolve|escalate", "note": "..." }`
- `GET /state`
- `GET /config` (resolved runtime config loaded from env vars)
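A minimal client sketch for the endpoints above, using only the standard library. The base URL (`http://localhost:8000`), the empty JSON body sent to `/reset`, and the assumption that every response is JSON are illustrative guesses, not confirmed by the server code; `build_step_request`, `call`, and `demo` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local uvicorn default port
ACTION_TYPES = {"triage", "respond", "resolve", "escalate"}


def build_step_request(action_type, note, base_url=BASE_URL):
    """Build a POST /step request with the documented body fields."""
    if action_type not in ACTION_TYPES:
        raise ValueError(f"unknown action_type: {action_type!r}")
    body = json.dumps({"action_type": action_type, "note": note}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request):
    """Send a request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def demo(base_url=BASE_URL):
    """Reset the environment, take one step, then read the state."""
    reset = urllib.request.Request(
        f"{base_url}/reset",
        data=b"{}",  # assumption: /reset accepts an empty JSON object
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    print(call(reset))
    print(call(build_step_request("triage", "first look", base_url)))
    print(call(urllib.request.Request(f"{base_url}/state", method="GET")))
```

With the server running, calling `demo()` walks through one reset, one step, and one state read.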
Run baseline inference:
```bash
python inference.py
```
Inference environment variables:
- `API_BASE_URL`: OpenAI-compatible endpoint base URL
- `HF_TOKEN`: API token (used as `api_key`)
- `MODEL_NAME`: chat model name (default: `gpt-4o-mini`)
If `API_BASE_URL` or `HF_TOKEN` is missing, inference automatically falls back to the heuristic policy.
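The fallback rule can be sketched as follows. This is an illustration of the documented behavior, not the code in `inference.py`; `resolve_inference_config` and the returned dictionary keys are hypothetical names.

```python
import os


def resolve_inference_config(env=None):
    """Pick the policy from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_base_url = env.get("API_BASE_URL")
    hf_token = env.get("HF_TOKEN")
    # Both values must be present (and non-empty) to use the LLM policy.
    configured = bool(api_base_url) and bool(hf_token)
    return {
        "api_base_url": api_base_url,
        "api_key": hf_token,  # HF_TOKEN is passed as the client's api_key
        "model_name": env.get("MODEL_NAME", "gpt-4o-mini"),
        "policy": "llm" if configured else "heuristic",
    }


print(resolve_inference_config({"API_BASE_URL": "https://example.invalid/v1"}))
```

Passing a plain dictionary instead of `os.environ` keeps the rule easy to check without touching the real environment.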
`inference.py` result fields:
- `score`: final reported score (`env_score` when available, otherwise `partial_score`)
- `env_score`: environment-provided score from `env.state()`
- `partial_score`: fallback score from normalized accumulated reward
- `openai_client_configured`: `true` when both `API_BASE_URL` and `HF_TOKEN` are present
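A small sketch of how the reported `score` could be selected from the two fields above. The normalization used for `partial_score` here (total reward divided by a maximum attainable total, clipped to [0, 1]) is an assumption, since the README does not specify it; the function names are hypothetical.

```python
def partial_score(rewards, max_total_reward):
    """Assumed normalization: accumulated reward over its maximum, clipped to [0, 1]."""
    if max_total_reward <= 0:
        return 0.0
    return min(1.0, max(0.0, sum(rewards) / max_total_reward))


def final_score(env_score, partial):
    """Prefer the environment score; fall back only when it is unavailable."""
    # Compare against None so that a legitimate env_score of 0.0 is kept.
    return env_score if env_score is not None else partial


fallback = partial_score([0.5, 0.25], max_total_reward=1.0)
print(final_score(None, fallback), final_score(0.0, fallback))
```

Checking `is not None` rather than truthiness matters here: an environment score of zero is still a valid score.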
## Method: Quantized Memory Policy
We implement a rotated vector quantization pipeline:
1. **Random Orthogonal Projection**
   - decorrelates embedding dimensions
2. **Scalar Quantization**
   - coordinate-wise discretization
3. **Residual Random Projection Sketch**
   - preserves inner-product structure
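The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the contents of `env/quantizer.py`: the uniform min/max quantizer, the Gaussian sketch matrix, and all function names are hypothetical choices.

```python
import numpy as np


def random_orthogonal(dim, rng):
    """Draw a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the diagonal of R is positive.
    return q * np.sign(np.diag(r))


def quantize(x, rotation, bits):
    """Rotate x, then quantize each coordinate to 2**bits uniform levels."""
    z = rotation @ x
    lo, hi = float(z.min()), float(z.max())
    scale = (hi - lo) / (2**bits - 1) if hi > lo else 1.0
    codes = np.round((z - lo) / scale).astype(np.uint8)  # assumes bits <= 8
    return codes, lo, scale


def dequantize(codes, lo, scale, rotation):
    """Map codes back to floats and undo the rotation."""
    return rotation.T @ (codes.astype(np.float64) * scale + lo)


def residual_sketch(x, x_hat, sketch_matrix):
    """Project the quantization residual into a low-dimensional sketch."""
    return sketch_matrix @ (x - x_hat)


rng = np.random.default_rng(0)
dim, sketch_dim, bits = 64, 16, 3
rotation = random_orthogonal(dim, rng)
# Gaussian sketch scaled so that E[S.T @ S] is the identity.
sketch_matrix = rng.standard_normal((sketch_dim, dim)) / np.sqrt(sketch_dim)

x, y = rng.standard_normal(dim), rng.standard_normal(dim)
codes, lo, scale = quantize(x, rotation, bits)
x_hat = dequantize(codes, lo, scale, rotation)
sketch = residual_sketch(x, x_hat, sketch_matrix)

# Inner-product estimate: quantized part plus a sketched residual correction.
estimate = x_hat @ y + sketch @ (sketch_matrix @ y)
print("exact:", x @ y, "quantized only:", x_hat @ y, "with sketch:", estimate)
```

Because the rotation is orthogonal, the reconstruction error in the original space equals the quantization error in the rotated space, which is at most half a quantization step per coordinate.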
Reward shaping includes:
- distortion penalty (MSE)
- inner-product preservation penalty
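One way the two penalties could enter the reward is sketched below. The penalty weights, the use of probe vectors for the inner-product term, and the function names are assumptions for illustration; the actual reward code may combine them differently.

```python
import numpy as np


def distortion_penalty(x, x_hat):
    """Mean squared error between the original and reconstructed vectors."""
    return float(np.mean((x - x_hat) ** 2))


def inner_product_penalty(x, x_hat, probes):
    """Mean squared gap between <x, p> and <x_hat, p> over probe vectors p."""
    return float(np.mean((probes @ x - probes @ x_hat) ** 2))


def shaped_reward(task_reward, x, x_hat, probes, w_mse=0.1, w_ip=0.1):
    """Subtract weighted penalties from the task reward (weights are hypothetical)."""
    return (
        task_reward
        - w_mse * distortion_penalty(x, x_hat)
        - w_ip * inner_product_penalty(x, x_hat, probes)
    )


x = np.array([1.0, 0.0])
x_hat = np.array([0.0, 0.0])
probes = np.eye(2)  # each row is one probe vector
print(shaped_reward(1.0, x, x_hat, probes))
```

A perfect reconstruction leaves the task reward unchanged, so the shaping only penalizes lossy memory.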
## Research-Grade WorkflowTwin (L1-L5)
The `workflow_twin/` package extends the simulator from a single-ticket MVP to a multi-ticket workflow research environment.
### Included
- `workflow_twin/core/entities.py`: multi-ticket state, agents, time, SLA/resource fields
- `workflow_twin/core/dynamics.py`: queue logic, SLA penalties, dependencies, stochastic arrivals/failures
- `workflow_twin/core/config.py`: level configs (L1-L5)
- `workflow_twin/environment.py`: main level-aware environment (`WorkflowTwinEnv`)
- `workflow_twin/memory.py`: `MemoryBoundedEnv` wrapper using rotated quantized memory compression
- `workflow_twin/levels/`: level hooks for L1 simple → L5 memory pressure
- `baselines/heuristics.py`: simple queue baseline policy
- `tasks/level1..level5/`: task scaffolding per level
### Quick Example
```bash
python - <<'PY'
from workflow_twin.environment import WorkflowTwinEnv
from baselines.heuristics import greedy_queue_policy
env = WorkflowTwinEnv(level=3, seed=42)
obs = env.reset()
for _ in range(10):
    action = greedy_queue_policy(obs)
    obs, reward, done, info = env.step(action)
    print(info["step_count"], reward, info["queue"])
    if done:
        break
PY
```
### Memory-Bounded Wrapper Example (L5)
```bash
python - <<'PY'
from workflow_twin.environment import WorkflowTwinEnv
from workflow_twin.memory import MemoryBoundedEnv
base_env = WorkflowTwinEnv(level=5, seed=42)
env = MemoryBoundedEnv(base_env, memory_budget=3500, bits=3)
obs = env.reset()
obs, reward, done, info = env.step({"action_type": "triage", "note": "memory-check"})
print(info["memory"])
PY
```
## Docker
```bash
docker build -t workflowtwin .
docker run -p 8000:8000 workflowtwin
```
## Controlled A/B Quantized Memory Evaluation
Run the controlled experiment suite:
```bash
python -m experiments.ab_quantized_memory_eval
```
This executes three tests with shared metrics:
- control_no_memory_pressure (Level 1, large memory budget)
- critical_memory_constrained_long_horizon (Level 5, tight memory budget)
- memory_budget_sweep (budgets: 2000, 3000, 4000, 6000)
Modes compared:
- baseline: no compression, truncation under pressure
- quant: rotated quantized memory compression under pressure
Reported metrics:
- avg_reward
- success_rate (resolved/total)
- avg_sla_violations
- avg_memory_used vs avg_memory_budget
- memory_compliance_rate
- steps_per_sec
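The metrics above could be aggregated from per-episode logs roughly as follows. The `EpisodeLog` record and its fields are hypothetical, and `memory_compliance_rate` is assumed here to be the fraction of steps whose memory use stays within budget; the experiment runner may define it differently (for example, per episode).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeLog:
    """Hypothetical per-episode record; field names are illustrative."""
    total_reward: float
    resolved: int
    total: int
    sla_violations: int
    memory_used: List[int]  # one entry per step
    memory_budget: int
    wall_seconds: float


def summarize(episodes):
    """Aggregate episode logs into the reported metrics."""
    n = len(episodes)
    steps = sum(len(e.memory_used) for e in episodes)
    within = sum(m <= e.memory_budget for e in episodes for m in e.memory_used)
    return {
        "avg_reward": sum(e.total_reward for e in episodes) / n,
        "success_rate": sum(e.resolved for e in episodes) / sum(e.total for e in episodes),
        "avg_sla_violations": sum(e.sla_violations for e in episodes) / n,
        "avg_memory_used": sum(sum(e.memory_used) for e in episodes) / steps,
        "avg_memory_budget": sum(e.memory_budget for e in episodes) / n,
        "memory_compliance_rate": within / steps,  # assumed per-step definition
        "steps_per_sec": steps / sum(e.wall_seconds for e in episodes),
    }


logs = [
    EpisodeLog(4.0, 3, 4, 1, [1000, 2000, 3500], 3000, 0.5),
    EpisodeLog(2.0, 1, 4, 3, [2500, 2900, 3000], 3000, 1.0),
]
summary = summarize(logs)
print(summary)
```

The two toy episodes are made-up inputs for the sketch, not experiment results.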
Figure (generated by the experiment runner):
![Memory Budget vs Compliance Rate](experiments/figures/memory_budget_vs_compliance.svg)