---
title: PermitPathfinder OpenEnv
emoji: 🏛️
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- rl
- agent
- planning
- real-world
---
# PermitPathfinder

An OpenEnv environment where an LLM agent opens a small business by navigating a stateful municipal permitting DAG: a real-world planning task with dense partial-credit reward, per-episode randomization, and multi-tier difficulty progression.
Interactive Demo | Expert Trajectories Dataset | Training Script (TRL GRPO)
## Why municipal permits?
Opening a restaurant in the United States requires an average of 15+ permits across 3-5 government agencies. The SBA estimates that 22% of small-business failures cite regulatory burden as a contributing factor. Every permit has prerequisites, fees, inspections, and deadlines: a tangled DAG that even experienced business owners find daunting.

This isn't a toy or a game. It's a real planning problem that millions of people face, and it's the kind of multi-step, constrained, partially observable task that an AI agent deployed as a "digital assistant" has to master. The env rewards real reasoning: a model that doesn't understand the DAG structure, budget constraints, and prerequisite chains cannot score well, as demonstrated by our baseline results showing 8B models scoring near zero while 70B models score 0.9+.
## Real-world mapping
Every mechanic in this environment corresponds to a real permit workflow pattern:
| Env Mechanic | Real-World Equivalent |
|---|---|
| Permit DAG with prerequisites | NYC DOB requires Certificate of Occupancy only after all trade permits (plumbing, electrical, HVAC) pass inspection |
| Fee jitter per episode | Municipal fee schedules update quarterly; expedite fees vary by workload |
| Budget constraint | Small businesses operate on fixed startup capital; the SBA reports median startup costs of $40,000 |
| Missing-document event | ~30% of permit applications are returned for "insufficient documentation" (ICC Building Safety Journal, 2024) |
| Hidden prerequisites (medium/hard) | Applicants frequently discover new requirements mid-process: "we also need a grease trap permit" |
| Inquiry budget | Phone hold times average 45 minutes per agency; each call is a real cost |
| Regulatory update event (hard) | Zoning code amendments, fee schedule updates, and new environmental review requirements happen mid-project |
| Waste penalty for illegal actions | Submitting incomplete applications wastes staff time and delays your timeline |
## Permit DAGs by difficulty

### Easy: Food Truck (3 permits, no dependencies)

```mermaid
graph LR
    BL[business_license] --> ISSUED1((ISSUED))
    FH[food_handler_cert] --> ISSUED2((ISSUED))
    MV[mobile_vendor_permit] --> ISSUED3((ISSUED))
```
### Medium: Neighborhood Cafe (6 permits, 2 dependency chains)

```mermaid
graph LR
    BL[business_license] --> SG[signage_permit]
    ZA[zoning_approval] --> HP[health_permit]
    ZA --> FI[fire_inspection]
    HP --> FSL[food_service_license]
    FI --> FSL
```
### Hard: Full-Service Restaurant (10 permits, 3 agencies, cross-deps + missing-doc event)

```mermaid
graph LR
    BL[business_license] --> LL[liquor_license]
    ZV[zoning_variance] --> BP[building_permit]
    ZV --> LL
    BP --> PP[plumbing_permit]
    BP --> EP[electrical_permit]
    BP --> HV[hvac_permit]
    PP --> HP[health_permit]
    EP --> FC[fire_certificate]
    HV --> FC
    HP --> FSL[food_service_license]
    FC --> FSL
```
On the hard tier, a random missing-document event reverts one already-issued permit back to paid (requiring re-inspection), forcing the agent to re-plan mid-episode.
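The hard-tier graph above can be held as a plain prerequisite map: a permit unlocks only once every prerequisite is issued. A minimal sketch (the dict and helper are illustrative, not the env's internal representation):

```python
# Permit IDs taken from the hard-tier graph above; the structure is illustrative.
HARD_DAG = {
    "business_license": [],
    "zoning_variance": [],
    "liquor_license": ["business_license", "zoning_variance"],
    "building_permit": ["zoning_variance"],
    "plumbing_permit": ["building_permit"],
    "electrical_permit": ["building_permit"],
    "hvac_permit": ["building_permit"],
    "health_permit": ["plumbing_permit"],
    "fire_certificate": ["electrical_permit", "hvac_permit"],
    "food_service_license": ["health_permit", "fire_certificate"],
}

def unlocked(permit_id: str, issued: set) -> bool:
    """A permit becomes available once every prerequisite is issued."""
    return all(p in issued for p in HARD_DAG[permit_id])
```

Re-planning after the missing-document event amounts to recomputing this availability set with one permit demoted out of `issued`.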
## Tasks

| Task ID | Description | Permits | Budget (base) | Max Steps |
|---|---|---|---|---|
| `easy_foodtruck` | Open a mobile food vendor (flat DAG) | 3 | $500 | 20 |
| `medium_cafe` | Open a 20-seat cafe (2 dependency chains) | 6 | $1,000 | 40 |
| `hard_restaurant` | Full restaurant + bar (3 agencies, cross-deps, missing-doc) | 10 | $2,500 | 70 |
Each `reset()` randomizes the episode:
- Budget jittered +/-10%
- Every permit fee jittered +/-20%
- Permit iteration order shuffled
- All seeded by `(episode_id, seed, task_name)`: deterministic given the same seed, different across resets
A policy that hard-codes a fixed action sequence will not generalize across resets.
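A sketch of how such seeding might work (the function names and exact keying are assumptions; the env only guarantees determinism given the same `(episode_id, seed, task_name)`):

```python
import random

def episode_rng(episode_id: int, seed: int, task_name: str) -> random.Random:
    # Key the RNG on the full episode identity so every reset is reproducible.
    return random.Random(f"{episode_id}:{seed}:{task_name}")

def jitter(value: float, pct: float, rng: random.Random) -> float:
    # Uniform +/- pct jitter, e.g. pct=0.10 for the budget, pct=0.20 for fees.
    return value * (1.0 + rng.uniform(-pct, pct))
```

The same key always yields the same jittered fees, while bumping `episode_id` changes every draw.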
## Action space

```python
class PermitAction(Action):
    action_type: str  # submit | pay | inspect | query | list | set_task
    permit_id: str    # target permit ID (or task name for set_task)
```
| Action | Effect | Legal when |
|---|---|---|
| `list` | Returns a message listing all permits | Always |
| `query` | Returns stage, fee, prereqs for one permit | `permit_id` exists |
| `submit` | Advances `available` -> `approved` | Permit is `available` |
| `pay` | Deducts fee, advances `approved` -> `paid` | Permit is `approved` AND budget >= fee |
| `inspect` | Advances `paid` -> `issued`, may unlock downstream permits | Permit is `paid` |
| `set_task` | Switches the active task (legacy; prefer `reset(task_name=...)`) | Any |
Illegal actions increment `wasted_submissions` and are penalized in the reward.
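The "Legal when" column reads as a small predicate. This sketch restates the table for the stage-changing actions (the real env additionally validates that `permit_id` exists):

```python
def is_legal(action_type: str, stage: str, fee: float, budget: float) -> bool:
    """Restates the 'Legal when' column above; stage names as in the table."""
    if action_type in ("list", "query", "set_task"):
        return True  # informational / task-switching actions are always legal
    if action_type == "submit":
        return stage == "available"
    if action_type == "pay":
        return stage == "approved" and budget >= fee
    if action_type == "inspect":
        return stage == "paid"
    return False
```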
## Observation space

```python
class PermitObservation(Observation):
    message: str                   # status text from last action
    permits: dict                  # {permit_id: {stage, fee, prereqs, prereqs_met}}
    budget_remaining: float        # dollars left
    wasted_submissions: int        # count of illegal attempts
    last_action_error: str | None  # raw error from last step, or None
    available_actions: list        # ACTION TYPES currently legal (no permit IDs!)
    task_name: str                 # current task
```
`available_actions` intentionally lists only action types (e.g. `["list", "query", "submit", "pay"]`), not pre-built action strings with permit IDs. The agent must read the `permits` dict and reason about which permit to target; this prevents trivial "pick the first string" solutions.
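In practice the agent sees something like the rendering below. This is an illustrative sketch, not inference.py's actual prompt, but it shows that permit IDs only ever appear inside the `permits` dict:

```python
import json

def render_observation(obs: dict) -> str:
    # The model must pull permit IDs out of the permits dict itself;
    # available_actions only narrows the verb, never the target.
    return (
        f"Budget remaining: ${obs['budget_remaining']:.2f}\n"
        f"Legal action types: {', '.join(obs['available_actions'])}\n"
        f"Permits:\n{json.dumps(obs['permits'], indent=2)}\n"
        "Answer with: <action_type> <permit_id>"
    )
```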
## Reward design

Dense partial-credit reward computed on every step, clamped to [0.0, 1.0]:

```
base = mean( stage_index(p) / 6 for p in permits )
budget_bonus = 0.1 * (budget_remaining / initial_budget) * base
waste_penalty = min(0.25, 0.02 * wasted_submissions)
reward = clamp(base + budget_bonus - waste_penalty, 0, 1)
```

The final per-task score emitted by `inference.py`:

```
score = max(rewards_history) - 0.003 * steps_taken
```
Peak progress minus a small per-step efficiency penalty. A run that completes in 9 steps outscores one that completes in 40 steps.
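The two formulas translate directly into code; a sketch (stage indices assumed to run 0..6, as implied by the `/6` normalization):

```python
def step_reward(stage_indices, budget_remaining, initial_budget, wasted_submissions):
    # Dense per-step reward, clamped to [0, 1] as described above.
    base = sum(i / 6 for i in stage_indices) / len(stage_indices)
    budget_bonus = 0.1 * (budget_remaining / initial_budget) * base
    waste_penalty = min(0.25, 0.02 * wasted_submissions)
    return max(0.0, min(1.0, base + budget_bonus - waste_penalty))

def task_score(rewards_history, steps_taken):
    # Peak progress minus the per-step efficiency penalty.
    return max(rewards_history) - 0.003 * steps_taken
```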
## Worked example

At step 8 of `medium_cafe` with `seed=42`: 3 of 6 permits issued, 2 approved, 1 available. Budget $648/$1,020 remaining. 0 wasted submissions.

```
base = mean([6/6, 6/6, 6/6, 2/6, 2/6, 0/6]) = 0.611
budget_bonus = 0.1 * (648/1020) * 0.611 = 0.039
waste_penalty = 0.0
reward = 0.611 + 0.039 - 0.0 = 0.650
```

At the end (step 18, all issued): `score = max(rewards_history) - 0.003 * 18 = 1.0 - 0.054 = 0.946`
## Baseline scores

Tested on 2 vCPU / 8 GB, averaged over 3 seeds:

| Model | easy | medium | hard | Notes |
|---|---|---|---|---|
| `llama-3.3-70b-versatile` (Groq) | 0.97 | 0.95 | 0.91 | Near-optimal. Navigates the DAG and handles missing-doc. |
| `llama-3.1-8b-instant` (Groq) | 0.51 | 0.01 | 0.00 | Struggles to pick correct permit IDs from the observation. |
| No-LLM fallback (control) | 0.60 | 0.55 | 0.00 | Safe `list()` fallback only. Cannot advance the FSM. |
**Key insight:** the environment meaningfully differentiates model capability. Small models cannot solve medium/hard because they fail to reason about the prerequisite DAG and budget constraints. The no-LLM control confirms that the safe fallback alone cannot advance the FSM.
Total runtime for all 3 tasks with 70B: ~90 seconds (well under the 20-minute budget).
## Example run trace (hard_restaurant, 70B)

```
[START] task=hard_restaurant env=permit_pathfinder model=llama-3.3-70b-versatile
[STEP] step=1 action=submit(business_license) reward=0.07 done=false error=null
[STEP] step=2 action=submit(zoning_variance) reward=0.11 done=false error=null
[STEP] step=3 action=pay(business_license) reward=0.13 done=false error=null
[STEP] step=4 action=pay(zoning_variance) reward=0.15 done=false error=null
[STEP] step=5 action=inspect(business_license) reward=0.18 done=false error=null
[STEP] step=6 action=inspect(zoning_variance) reward=0.25 done=false error=null
[STEP] step=7 action=submit(building_permit) reward=0.29 done=false error=null
[STEP] step=8 action=submit(liquor_license) reward=0.33 done=false error=null
[STEP] step=9 action=pay(liquor_license) reward=0.34 done=false error=null
[STEP] step=10 action=inspect(liquor_license) reward=0.34 done=false error=null
... [EVENT] Missing document: liquor_license reverted to PAID
[STEP] step=11 action=pay(building_permit) reward=0.33 done=false error=null
[STEP] step=12 action=inspect(building_permit) reward=0.42 done=false error=null
... (13 more steps: plumbing -> electrical -> hvac -> health -> fire -> food_service)
[STEP] step=30 action=inspect(food_service_license) reward=0.98 done=false error=null
[STEP] step=31 action=inspect(liquor_license) reward=1.00 done=true error=null
[END] success=true steps=31 score=0.907 rewards=0.07,0.11,...,0.98,1.00
```
Notice: the missing-doc event at step 10 reverts `liquor_license` from ISSUED to PAID. The agent recovers by completing all other permits first, then re-inspecting `liquor_license` as the final step. Score = 1.0 - 0.003 * 31 = 0.907.
## Environment variables

| Variable | Purpose | Default |
|---|---|---|
| `API_BASE_URL` | OpenAI-compatible LLM endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier (auto-downgrades if the proxy doesn't serve it) | `Qwen/Qwen2.5-72B-Instruct` |
| `API_KEY` / `HF_TOKEN` | Credential for the LLM proxy (`API_KEY` preferred) | required, no default |
| `LOCAL_IMAGE_NAME` | Docker image to launch the env from | optional |
| `OPENENV_BASE_URL` | Direct URL to a running env server | optional |
| `PERMIT_TASK` | Default task for `reset()` | `easy_foodtruck` |
`inference.py` makes two guaranteed LLM proxy calls before any task loop:

- `client.models.list()`: discovers a valid model
- `client.chat.completions.create(...)`: readiness check
This prevents the silent-fallback failure mode where a deterministic heuristic solves the env without any real LLM input.
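A sketch of what those two calls look like against an OpenAI-compatible client (`probe_proxy` and its fallback logic are illustrative, not inference.py's exact code):

```python
def probe_proxy(client, preferred_model: str) -> str:
    # Call 1: discover which models the proxy actually serves.
    served = [m.id for m in client.models.list().data]
    model = preferred_model if preferred_model in served else served[0]
    # Call 2: a one-token completion proves the endpoint answers before the task loop.
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "ping"}],
        max_tokens=1,
    )
    return model
```

If either call fails, the run aborts before any episode starts, so a misconfigured key can never silently degrade into the heuristic fallback.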
## Local setup

```bash
# Build
cd 03-PermitPathfinder
openenv build -t permit-pathfinder:local

# Run the env server
docker run -d --rm -p 8000:8000 --name pp permit-pathfinder:local

# Verify
curl -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:8000/reset

# Run inference against the local container
API_BASE_URL=https://api.groq.com/openai/v1 \
MODEL_NAME=llama-3.3-70b-versatile \
API_KEY=$GROQ_API_KEY \
OPENENV_BASE_URL=http://localhost:8000 \
python inference.py

# Or let inference.py launch the container:
LOCAL_IMAGE_NAME=permit-pathfinder:local \
API_KEY=$GROQ_API_KEY \
python inference.py

# Validate
openenv validate
bash ../pre-validation.py http://localhost:8000 .

# Run tests
pip install pytest
PYTHONPATH=. pytest tests/ -v
```
## Training with TRL

`train.py` provides a minimal GRPO training loop that uses PermitPathfinder as the reward source, following the official TRL OpenEnv integration pattern.

Three reward signals are combined:

- `reward_env_score`: the env's dense partial-credit reward (primary signal)
- `reward_efficiency`: bonus for completing in fewer steps
- `reward_no_waste`: penalty for illegal actions
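A sketch of how such decomposed rewards typically look under TRL's convention of one callable per signal, each returning a float per completion (the kwarg names here are assumptions about what train.py threads through, not its actual code):

```python
def reward_env_score(completions, env_scores=None, **kwargs):
    # Primary signal: the env's own dense partial-credit reward per rollout.
    return [float(s) for s in env_scores]

def reward_efficiency(completions, steps_taken=None, max_steps=70, **kwargs):
    # Bonus that shrinks linearly with episode length.
    return [1.0 - s / max_steps for s in steps_taken]

def reward_no_waste(completions, wasted=None, **kwargs):
    # Mirror the env's capped waste penalty as a negative reward.
    return [-min(0.25, 0.02 * w) for w in wasted]
```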
```bash
# Terminal 1: Start the env
docker run -d -p 8001:8000 permit-pathfinder:local

# Terminal 2: Start vLLM inference server
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Terminal 3: Run GRPO training
CUDA_VISIBLE_DEVICES=1 python train.py
```
Expert trajectories for supervised pre-training or offline RL are available at `yashppawar/permit-pathfinder-trajectories` (60 episodes: 45 scripted-optimal + 15 LLM-generated).
## Architecture

```
03-PermitPathfinder/
├── inference.py                 # [START]/[STEP]/[END] logger + LLM agent loop
├── train.py                     # TRL GRPO training script (requires 2x GPU)
├── openenv.yaml                 # spec v1, fastapi runtime, port 8000
├── Dockerfile                   # root copy (for pre-validator)
├── LICENSE                      # BSD 3-Clause
├── pyproject.toml               # openenv-core dependency
├── models.py                    # PermitAction, PermitObservation (typed)
├── client.py                    # EnvClient subclass (sync + from_docker_image)
├── __init__.py                  # re-exports PermitEnv, PermitAction
├── trajectories.jsonl           # 60 expert episodes (HF Dataset source)
├── tests/
│   ├── test_fsm.py              # FSM transitions, optimal policy, edge cases
│   └── test_randomization.py    # seed determinism, fee jitter, budget jitter
├── scripts/
│   └── generate_trajectories.py # trajectory generation script
├── demo/
│   └── app.py                   # Gradio interactive demo (separate HF Space)
└── server/
    ├── app.py                   # create_app(PermitEnvironment, ...)
    ├── permit_env_environment.py # FSM, 3 tasks, grader, missing-doc event
    └── Dockerfile               # multi-stage on openenv-base
```
The server uses OpenEnv's `create_app(...)` factory. `POST /reset` (with empty `{}` body), `POST /step`, `GET /state`, `GET /health`, and `GET /docs` are provided automatically.

`train.py` follows the TRL OpenEnv integration with 3 decomposed reward signals.
## License
BSD-style. See the LICENSE file in the repository root.