---
sdk: docker
app_port: 8000
---

# WorkflowTwin

An OpenEnv-compatible environment for training and evaluating agents under memory and resource constraints.

This environment simulates multi-step ticket resolution pipelines with:
- queueing, prioritization, and dependencies
- stochastic arrivals and agent failures
- strict memory budgets on agent state

We introduce a **quantized memory policy** based on:
- random orthogonal projection
- scalar vector quantization
- random projection residual sketching

to study how compression affects agent performance under resource constraints.

## Motivation

Real-world agents must operate under limited memory and compute.

Without compression:
- state grows unbounded
- agents violate system constraints

With quantized memory:
- state is compressed
- agents remain feasible under tight budgets

This environment enables controlled evaluation of this tradeoff.

## Key Results

We evaluate two modes:
- **baseline**: no compression (truncation under pressure)
- **quant**: rotated quantized memory compression

Sweeping the memory budget reveals a crossover point. Above it, compression is unnecessary. Below it, compression is essential for staying within budget.

### Memory Budget vs Feasibility

![Memory Budget vs Compliance Rate](experiments/figures/memory_budget_vs_compliance.svg)

### Key Findings

- **Feasibility threshold shift:**  
  The baseline needs a memory budget of ~6000 to reach full compliance, while quantized memory reaches full compliance at ~3000.

- **2× efficiency gain:**  
  Compression halves the memory budget required for feasible operation (~6000 → ~3000).

- **No-regret behavior:**  
  Under no memory pressure, both methods perform identically.

- **Constraint robustness:**  
  Under tight budgets, baseline fails (0% compliance) while quantized memory remains fully feasible (100%).

**Conclusion:** Compression extends the feasible operating regime without degrading task performance.

## Structure

- `env/`: core environment logic, models, scoring, reward
  - includes `quantizer.py` with rotated vector quantization primitives
- `server/`: FastAPI app exposing `reset`, `step`, `state`, and `config`
- `tasks/`: JSON task definitions by difficulty
- `baseline/`: non-LLM heuristic policy
- `baselines/`: research evaluation baselines for `workflow_twin`
- `inference.py`: local rollout entrypoint
- `openenv.yaml`: environment spec

## Quickstart

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
uvicorn server.app:app --reload
```

Server endpoints:

- `POST /reset`
- `POST /step` with body `{ "action_type": "...", "note": "..." }`, where `action_type` is one of `triage`, `respond`, `resolve`, or `escalate`
- `GET /state`
- `GET /config` (resolved runtime config loaded from env vars)
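Below is a minimal standard-library client for these endpoints. It is a sketch, not part of the repo. It assumes the server is running locally on port 8000 (the uvicorn default used above), that `/reset` accepts an empty body, and that every endpoint returns JSON. The helper names `step_payload` and `call` are hypothetical.

```python
import json
import urllib.request
from typing import Any, Optional

# Assumption: local dev server started with the Quickstart command (uvicorn default port).
BASE_URL = "http://localhost:8000"
# Action types listed for POST /step in this README.
VALID_ACTIONS = ("triage", "respond", "resolve", "escalate")


def step_payload(action_type: str, note: str = "") -> dict:
    """Build a POST /step body, rejecting action types the README does not list."""
    if action_type not in VALID_ACTIONS:
        raise ValueError(f"unknown action_type {action_type!r}; expected one of {VALID_ACTIONS}")
    return {"action_type": action_type, "note": note}


def call(method: str, path: str, body: Optional[dict] = None, timeout: float = 5.0) -> Any:
    """Send an optional JSON body and decode the JSON response (hypothetical helper)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    try:
        print(call("POST", "/reset"))
        print(call("POST", "/step", step_payload("triage", "initial look")))
        print(call("GET", "/state"))
        print(call("GET", "/config"))
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print(f"request to {BASE_URL} failed: {exc}")
```

Validating `action_type` on the client side catches typos before they reach the server. The server presumably enforces its own schema as well.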

Run baseline inference:

```bash
python inference.py
```

Inference environment variables:

- `API_BASE_URL`: OpenAI-compatible endpoint base URL
- `HF_TOKEN`: API token (used as `api_key`)
- `MODEL_NAME`: chat model name (default: `gpt-4o-mini`)

If `API_BASE_URL` or `HF_TOKEN` is missing, inference automatically falls back to the heuristic policy.
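The fallback rule can be sketched as follows. This is not the code in `inference.py`. The function name `resolve_inference_settings` is hypothetical, and treating empty strings as missing is an assumption. The commented client construction assumes the official `openai` Python package (v1 or later).

```python
import os
from typing import Mapping, Optional

DEFAULT_MODEL = "gpt-4o-mini"  # default MODEL_NAME documented above


def resolve_inference_settings(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Choose between the LLM policy and the heuristic fallback from env vars (sketch)."""
    env = os.environ if environ is None else environ
    base_url = env.get("API_BASE_URL") or None  # assumption: empty string counts as missing
    token = env.get("HF_TOKEN") or None
    model = env.get("MODEL_NAME") or DEFAULT_MODEL
    use_llm = base_url is not None and token is not None
    return {
        "policy": "llm" if use_llm else "heuristic",
        "model": model,
        "base_url": base_url,
        "openai_client_configured": use_llm,  # never echo the token itself
    }


# With the `openai` package installed, the LLM branch would typically build a client like:
#   from openai import OpenAI
#   client = OpenAI(base_url=os.environ["API_BASE_URL"], api_key=os.environ["HF_TOKEN"])

if __name__ == "__main__":
    print(resolve_inference_settings())
```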

`inference.py` result fields:

- `score`: final reported score (`env_score` when available, otherwise `partial_score`)
- `env_score`: environment-provided score from `env.state()`
- `partial_score`: fallback score from normalized accumulated reward
- `openai_client_configured`: `true` when both `API_BASE_URL` and `HF_TOKEN` are present
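The `score` selection rule is a simple fallback, sketched below. In this sketch, "available" is assumed to mean "not `None`". The normalization behind `partial_score` is also a hypothetical choice: accumulated reward is divided by a maximum achievable total (`max_total_reward`) and clipped to [0, 1]. `inference.py` may normalize differently.

```python
from typing import Optional, Sequence


def partial_score(rewards: Sequence[float], max_total_reward: float) -> float:
    """Hypothetical normalization: accumulated reward / max achievable, clipped to [0, 1]."""
    if max_total_reward <= 0:
        raise ValueError("max_total_reward must be positive")
    return min(1.0, max(0.0, sum(rewards) / max_total_reward))


def final_score(env_score: Optional[float], fallback: float) -> float:
    """Prefer the environment-provided score; fall back to the partial score."""
    # Compare against None explicitly so a legitimate env_score of 0.0 is kept.
    return env_score if env_score is not None else fallback


if __name__ == "__main__":
    p = partial_score([0.2, 0.5, -0.1], max_total_reward=2.0)
    print(p, final_score(None, p), final_score(0.9, p))
```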

## Method: Quantized Memory Policy

We implement a rotated vector quantization pipeline:

1. **Random Orthogonal Projection**
   - decorrelates embedding dimensions

2. **Scalar Quantization**
   - coordinate-wise discretization

3. **Residual Random Projection Sketch**
   - preserves inner-product structure
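A NumPy sketch of the three steps follows. It illustrates the technique and is not the contents of `env/quantizer.py`. It assumes three things: a Haar-random rotation built from a QR decomposition, a uniform min/max scalar quantizer with `bits` bits per coordinate, and a Gaussian (Johnson-Lindenstrauss style) sketch with `k` rows applied to the rotated quantization residual. All function names are hypothetical.

```python
import numpy as np


def random_orthogonal(d: int, rng: np.random.Generator) -> np.ndarray:
    """Step 1: random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flipping column signs by sign(diag(R)) makes Q uniformly (Haar) distributed.
    return q * np.sign(np.diag(r))


def quantize(z: np.ndarray, bits: int):
    """Step 2: uniform scalar quantization of each coordinate to 2**bits levels."""
    levels = 2**bits - 1
    lo, hi = float(z.min()), float(z.max())
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.clip(np.round((z - lo) / scale), 0, levels).astype(np.int64)
    return codes, lo, scale


def dequantize(codes: np.ndarray, lo: float, scale: float) -> np.ndarray:
    """Map integer codes back to their grid values."""
    return lo + codes * scale


def compress(x: np.ndarray, rotation: np.ndarray, sketch: np.ndarray, bits: int):
    """Rotate, quantize, then keep a low-dimensional sketch of the quantization residual."""
    z = rotation @ x
    codes, lo, scale = quantize(z, bits)
    residual = z - dequantize(codes, lo, scale)
    return codes, lo, scale, sketch @ residual  # Step 3: residual sketch


def decompress(packed, rotation: np.ndarray) -> np.ndarray:
    codes, lo, scale, _ = packed
    return rotation.T @ dequantize(codes, lo, scale)  # Q^T undoes the rotation


def inner_product(packed, y: np.ndarray, rotation: np.ndarray, sketch: np.ndarray) -> float:
    """Estimate <x, y>: exact term for the quantized part plus a sketched residual term."""
    codes, lo, scale, s_res = packed
    ry = rotation @ y  # rotations preserve inner products: <x, y> = <Qx, Qy>
    return float(dequantize(codes, lo, scale) @ ry + s_res @ (sketch @ ry))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, k, bits = 64, 32, 3
    Q = random_orthogonal(d, rng)
    S = rng.standard_normal((k, d)) / np.sqrt(k)  # E[S.T @ S] = I, so sketched products are unbiased
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    packed = compress(x, Q, S, bits)
    print("reconstruction MSE:", np.mean((x - decompress(packed, Q)) ** 2))
    print("true <x,y>:", x @ y, "estimated:", inner_product(packed, y, Q, S))
```

Under this sketch, a `d`-dimensional vector costs `d * bits` bits for the codes, plus two floats (`lo`, `scale`) and `k` floats for the residual sketch. Storing it uncompressed as float32 costs `32 * d` bits. The rotation and sketch matrices are shared across all stored vectors, so they are not counted per entry.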

Reward shaping includes:
- distortion penalty (MSE)
- inner-product preservation penalty
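The two penalty terms could be combined with the task reward roughly as follows. Several details here are assumptions for illustration: the weights `lambda_mse` and `lambda_ip`, the probe vectors `queries`, and the use of mean absolute inner-product error. The actual reward in `env/` may define or weight these terms differently.

```python
import numpy as np


def distortion_penalty(x: np.ndarray, x_hat: np.ndarray) -> float:
    """Mean squared error between the original and reconstructed memory vector."""
    return float(np.mean((x - x_hat) ** 2))


def inner_product_penalty(x: np.ndarray, x_hat: np.ndarray, queries: np.ndarray) -> float:
    """Mean absolute error of <x, q> over probe vectors q (one probe per row of `queries`)."""
    return float(np.mean(np.abs(queries @ x - queries @ x_hat)))


def shaped_reward(task_reward: float, x: np.ndarray, x_hat: np.ndarray, queries: np.ndarray,
                  lambda_mse: float = 0.1, lambda_ip: float = 0.1) -> float:
    """Task reward minus weighted compression penalties (weights are hypothetical)."""
    return (task_reward
            - lambda_mse * distortion_penalty(x, x_hat)
            - lambda_ip * inner_product_penalty(x, x_hat, queries))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(32)
    x_hat = x + 0.05 * rng.standard_normal(32)  # stand-in for a decompressed vector
    queries = rng.standard_normal((4, 32))
    print(shaped_reward(1.0, x, x_hat, queries))
```

Both penalties are zero when reconstruction is exact, so the shaped reward only departs from the task reward when compression actually loses information.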

## Research-Grade WorkflowTwin (L1-L5)

The `workflow_twin/` package extends the simulator from a single-ticket MVP to a multi-ticket workflow research environment with five levels (L1-L5).

### Included

- `workflow_twin/core/entities.py`: multi-ticket state, agents, time, SLA/resource fields
- `workflow_twin/core/dynamics.py`: queue logic, SLA penalties, dependencies, stochastic arrivals/failures
- `workflow_twin/core/config.py`: level configs (L1-L5)
- `workflow_twin/environment.py`: main level-aware environment (`WorkflowTwinEnv`)
- `workflow_twin/memory.py`: `MemoryBoundedEnv` wrapper using rotated quantized memory compression
- `workflow_twin/levels/`: level hooks for L1 simple → L5 memory pressure
- `baselines/heuristics.py`: simple queue baseline policy
- `tasks/level1..level5/`: task scaffolding per level
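The sketch below makes the queue, dependency, and SLA terms concrete with a dependency-aware greedy selection rule and a linear SLA penalty. It is hypothetical throughout. The `Ticket` fields, the tie-breaking order, and the penalty form are assumptions, and `workflow_twin/core/dynamics.py` and `baselines/heuristics.py` may implement these mechanics differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Ticket:
    # Hypothetical ticket fields for illustration only.
    ticket_id: str
    priority: int      # higher = more urgent
    sla_deadline: int  # time step by which the ticket should be resolved
    depends_on: List[str] = field(default_factory=list)
    resolved: bool = False


def ready(ticket: Ticket, tickets: Dict[str, Ticket]) -> bool:
    """A ticket is workable once it is open and all of its dependencies are resolved."""
    return not ticket.resolved and all(tickets[dep].resolved for dep in ticket.depends_on)


def pick_next(tickets: Dict[str, Ticket]) -> Optional[Ticket]:
    """Greedy choice: highest priority first; the earliest SLA deadline breaks ties."""
    candidates = [t for t in tickets.values() if ready(t, tickets)]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (-t.priority, t.sla_deadline))


def sla_penalty(tickets: Dict[str, Ticket], now: int, per_step: float = 1.0) -> float:
    """Linear penalty: `per_step` for every step an open ticket is past its deadline."""
    return sum(per_step * (now - t.sla_deadline)
               for t in tickets.values() if not t.resolved and now > t.sla_deadline)


if __name__ == "__main__":
    queue = {
        "A": Ticket("A", priority=1, sla_deadline=5),
        "B": Ticket("B", priority=3, sla_deadline=8, depends_on=["A"]),
        "C": Ticket("C", priority=2, sla_deadline=4),
    }
    order = []
    while (nxt := pick_next(queue)) is not None:
        nxt.resolved = True
        order.append(nxt.ticket_id)
    print(order)  # ['C', 'A', 'B']: B has top priority but waits on A
```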

### Quick Example

```bash
python - <<'PY'
from workflow_twin.environment import WorkflowTwinEnv
from baselines.heuristics import greedy_queue_policy

env = WorkflowTwinEnv(level=3, seed=42)
obs = env.reset()

for _ in range(10):
    action = greedy_queue_policy(obs)
    obs, reward, done, info = env.step(action)
    print(info["step_count"], reward, info["queue"])
    if done:
        break
PY
```

### Memory-Bounded Wrapper Example (L5)

```bash
python - <<'PY'
from workflow_twin.environment import WorkflowTwinEnv
from workflow_twin.memory import MemoryBoundedEnv

base_env = WorkflowTwinEnv(level=5, seed=42)
env = MemoryBoundedEnv(base_env, memory_budget=3500, bits=3)
obs = env.reset()
obs, reward, done, info = env.step({"action_type": "triage", "note": "memory-check"})
print(info["memory"])
PY
```

## Docker

```bash
docker build -t workflowtwin .
docker run -p 8000:8000 workflowtwin
```

## Controlled A/B Quantized Memory Evaluation

Run the controlled experiment suite:

```bash
python -m experiments.ab_quantized_memory_eval
```

This executes three tests with shared metrics:

- control_no_memory_pressure (Level 1, large memory budget)
- critical_memory_constrained_long_horizon (Level 5, tight memory budget)
- memory_budget_sweep (budgets: 2000, 3000, 4000, 6000)

Modes compared:

- baseline: no compression, truncation under pressure
- quant: rotated quantized memory compression under pressure
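The difference between the two modes under pressure can be illustrated with a toy budget-enforcement rule. Everything in it is an assumption. Memory is modelled as a list of vectors, oldest first. Cost is counted in bits: 32 per uncompressed float, or `bits` per coordinate plus a 64-bit header when quantized. Truncation evicts the oldest entries. `enforce_budget` is a hypothetical name, not the `MemoryBoundedEnv` API.

```python
from typing import List, Sequence, Tuple

FLOAT_BITS = 32   # assumed cost of one uncompressed coordinate
HEADER_BITS = 64  # assumed per-entry overhead of a quantized entry (e.g. offset + scale)


def entry_cost(dim: int, bits: int) -> int:
    """Toy cost of one memory entry; bits=0 means the entry is stored uncompressed."""
    return dim * FLOAT_BITS if bits == 0 else dim * bits + HEADER_BITS


def total_cost(entries: Sequence[Sequence[float]], bits: int) -> int:
    return sum(entry_cost(len(e), bits) for e in entries)


def enforce_budget(
    memory: Sequence[Sequence[float]], budget: int, mode: str, bits: int = 3
) -> Tuple[List[Sequence[float]], int]:
    """Return (kept_entries, stored_bits) under the toy cost model above."""
    if mode not in ("baseline", "quant"):
        raise ValueError(f"unknown mode {mode!r}")
    kept = list(memory)
    stored_bits = 0  # start uncompressed
    if mode == "quant" and total_cost(kept, 0) > budget:
        stored_bits = bits  # compress only when the uncompressed memory does not fit
    while kept and total_cost(kept, stored_bits) > budget:
        kept.pop(0)  # evict the oldest entry until the memory fits
    return kept, stored_bits


if __name__ == "__main__":
    memory = [[float(i)] * 8 for i in range(4)]  # four 8-dimensional entries, oldest first
    for mode in ("baseline", "quant"):
        kept, stored_bits = enforce_budget(memory, budget=600, mode=mode)
        print(mode, len(kept), "entries kept at", stored_bits or FLOAT_BITS, "bits/coord")
```

With a 600-bit budget in this toy model, the baseline keeps only the newest two entries, while the quantized mode keeps all four at 3 bits per coordinate. When the budget is large enough, both modes return the memory unchanged.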

Reported metrics:

- avg_reward
- success_rate (resolved/total)
- avg_sla_violations
- avg_memory_used vs avg_memory_budget
- memory_compliance_rate
- steps_per_sec
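Per-episode logs could be aggregated into these metrics as sketched below. The `Episode` record fields are assumptions. So are the per-episode definition of compliance (peak memory never exceeds the budget), the pooled resolved/total ratio, and the helper `min_compliant_budget` for locating the crossover in a budget sweep. `experiments/ab_quantized_memory_eval.py` may define these differently.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class Episode:
    # Hypothetical per-episode log record.
    total_reward: float
    resolved: int
    total_tickets: int
    sla_violations: int
    peak_memory: float
    memory_budget: float
    steps: int
    wall_seconds: float


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate episode logs into the reported metrics (definitions are assumptions)."""
    compliant = sum(1 for e in episodes if e.peak_memory <= e.memory_budget)
    return {
        "avg_reward": mean(e.total_reward for e in episodes),
        "success_rate": sum(e.resolved for e in episodes) / sum(e.total_tickets for e in episodes),
        "avg_sla_violations": mean(e.sla_violations for e in episodes),
        "avg_memory_used": mean(e.peak_memory for e in episodes),
        "avg_memory_budget": mean(e.memory_budget for e in episodes),
        "memory_compliance_rate": compliant / len(episodes),
        "steps_per_sec": sum(e.steps for e in episodes) / sum(e.wall_seconds for e in episodes),
    }


def min_compliant_budget(sweep: Dict[float, float], threshold: float = 1.0) -> Optional[float]:
    """Smallest swept budget whose compliance rate reaches `threshold`, or None."""
    passing = [budget for budget, rate in sweep.items() if rate >= threshold]
    return min(passing) if passing else None
```

Applied per mode to the `memory_budget_sweep` results, `min_compliant_budget` gives the feasibility threshold for each mode. The gap between the two thresholds is the crossover discussed under Key Findings.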

Figure (generated by the experiment runner):

![Memory Budget vs Compliance Rate](experiments/figures/memory_budget_vs_compliance.svg)