---
title: PermitPathfinder OpenEnv
emoji: 🏛️
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - rl
  - agent
  - planning
  - real-world
---

# PermitPathfinder

An OpenEnv environment where an LLM agent opens a small business by navigating a stateful municipal permitting DAG β€” a real-world planning task with dense partial-credit reward, per-episode randomization, and multi-tier difficulty progression.


Interactive Demo | Expert Trajectories Dataset | Training Script (TRL GRPO)


## Why municipal permits?

Opening a restaurant in the United States requires an average of 15+ permits across 3-5 government agencies. The SBA estimates that 22% of small-business failures cite regulatory burden as a contributing factor. Every permit has prerequisites, fees, inspections, and deadlines β€” a tangled DAG that even experienced business owners find daunting.

This isn't a toy or a game. It's a real planning problem that millions of people face, and it's the kind of multi-step, constrained, partially observable task that an AI agent deployed as a "digital assistant" has to master. The env rewards real reasoning β€” a model that doesn't understand the DAG structure, budget constraints, and prerequisite chains cannot score well, as demonstrated by our baseline results showing 8B models scoring near zero while 70B models score 0.9+.


## Real-world mapping

Every mechanic in this environment corresponds to a real permit workflow pattern:

| Env Mechanic | Real-World Equivalent |
| --- | --- |
| Permit DAG with prerequisites | NYC DOB requires a Certificate of Occupancy only after all trade permits (plumbing, electrical, HVAC) pass inspection |
| Fee jitter per episode | Municipal fee schedules update quarterly; expedite fees vary by workload |
| Budget constraint | Small businesses operate on fixed startup capital; the SBA reports median startup costs of $40,000 |
| Missing-document event | ~30% of permit applications are returned for "insufficient documentation" (ICC Building Safety Journal, 2024) |
| Hidden prerequisites (medium/hard) | Applicants frequently discover new requirements mid-process: "we also need a grease trap permit" |
| Inquiry budget | Phone hold times average 45 minutes per agency; each call is a real cost |
| Regulatory update event (hard) | Zoning code amendments, fee schedule updates, and new environmental review requirements happen mid-project |
| Waste penalty for illegal actions | Submitting incomplete applications wastes staff time and delays your timeline |

## Permit DAGs by difficulty

### Easy: Food Truck (3 permits, no dependencies)

```mermaid
graph LR
    BL[business_license] --> ISSUED1((ISSUED))
    FH[food_handler_cert] --> ISSUED2((ISSUED))
    MV[mobile_vendor_permit] --> ISSUED3((ISSUED))
```

### Medium: Neighborhood Cafe (6 permits, 2 dependency chains)

```mermaid
graph LR
    BL[business_license] --> SG[signage_permit]
    ZA[zoning_approval] --> HP[health_permit]
    ZA --> FI[fire_inspection]
    HP --> FSL[food_service_license]
    FI --> FSL
```

### Hard: Full-Service Restaurant (10 permits, 3 agencies, cross-deps + missing-doc event)

```mermaid
graph LR
    BL[business_license] --> LL[liquor_license]
    ZV[zoning_variance] --> BP[building_permit]
    ZV --> LL
    BP --> PP[plumbing_permit]
    BP --> EP[electrical_permit]
    BP --> HV[hvac_permit]
    PP --> HP[health_permit]
    EP --> FC[fire_certificate]
    HV --> FC
    HP --> FSL[food_service_license]
    FC --> FSL
```

On the hard tier, a random missing-document event reverts one already-issued permit back to paid (requiring re-inspection), forcing the agent to re-plan mid-episode.


## Tasks

| Task ID | Description | Permits | Budget (base) | Max Steps |
| --- | --- | --- | --- | --- |
| `easy_foodtruck` | Open a mobile food vendor (flat DAG) | 3 | $500 | 20 |
| `medium_cafe` | Open a 20-seat cafe (2 dependency chains) | 6 | $1,000 | 40 |
| `hard_restaurant` | Full restaurant + bar (3 agencies, cross-deps, missing-doc) | 10 | $2,500 | 70 |

Each reset() randomizes the episode:

- Budget jittered +/-10%
- Every permit fee jittered +/-20%
- Permit iteration order shuffled
- All seeded by (episode_id, seed, task_name): deterministic given the same seed, different across resets

A policy that hard-codes a fixed action sequence will not generalize across resets.
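Deterministic per-episode jitter can be derived along these lines (a sketch under assumed naming; the env's actual seeding code may differ):

```python
import hashlib
import random

def episode_rng(episode_id: int, seed: int, task_name: str) -> random.Random:
    # Hash the triple into a stable 64-bit seed so the same
    # (episode_id, seed, task_name) always reproduces the same episode,
    # independent of Python's per-process hash randomization.
    key = f"{episode_id}:{seed}:{task_name}".encode()
    return random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))

def jitter(rng: random.Random, value: float, pct: float) -> float:
    # Uniform +/-pct jitter, e.g. pct=0.2 for the +/-20% fee jitter.
    return value * (1.0 + rng.uniform(-pct, pct))
```

Bumping `episode_id` on every reset is what makes consecutive episodes differ while the whole run stays reproducible from one seed.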


## Action space

```python
class PermitAction(Action):
    action_type: str   # submit | pay | inspect | query | list | set_task
    permit_id: str     # target permit ID (or task name for set_task)
```

| Action | Effect | Legal when |
| --- | --- | --- |
| `list` | Returns a message listing all permits | Always |
| `query` | Returns stage, fee, prereqs for one permit | `permit_id` exists |
| `submit` | Advances available -> approved | Permit is available |
| `pay` | Deducts fee, advances approved -> paid | Permit is approved AND budget >= fee |
| `inspect` | Advances paid -> issued, may unlock downstream permits | Permit is paid |
| `set_task` | Switches the active task (legacy; prefer `reset(task_name=...)`) | Any |

Illegal actions increment wasted_submissions and are penalized in the reward.
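The legality table can be mirrored as a tiny predicate (a hypothetical helper, not the env's source):

```python
def is_legal(action_type: str, stage: str, budget: float, fee: float) -> bool:
    """Mirror of the legality table: each advancing action is legal only
    in the stage it advances from; pay also requires sufficient budget."""
    if action_type == "submit":
        return stage == "available"
    if action_type == "pay":
        return stage == "approved" and budget >= fee
    if action_type == "inspect":
        return stage == "paid"
    # list / query / set_task never mutate permit state
    return action_type in ("list", "query", "set_task")
```

An agent that checks this locally before acting avoids accumulating `wasted_submissions` on speculative moves.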


## Observation space

```python
class PermitObservation(Observation):
    message: str                    # status text from last action
    permits: dict                   # {permit_id: {stage, fee, prereqs, prereqs_met}}
    budget_remaining: float         # dollars left
    wasted_submissions: int         # count of illegal attempts
    last_action_error: str | None   # raw error from last step, or None
    available_actions: list         # ACTION TYPES currently legal (no permit IDs!)
    task_name: str                  # current task
```

available_actions intentionally lists only action types (e.g. ["list", "query", "submit", "pay"]), not pre-built action strings with permit IDs. The agent must read the permits dict and reason about which permit to target β€” this prevents trivial "pick the first string" solutions.
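For instance, an agent must derive legal submit targets from the permits dict itself; a hypothetical helper for that reasoning step:

```python
def submit_targets(permits: dict) -> list[str]:
    """Permits that can legally be submitted right now: stage is
    'available' and all prerequisites are already satisfied."""
    return sorted(
        pid for pid, info in permits.items()
        if info["stage"] == "available" and info["prereqs_met"]
    )
```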


## Reward design

Dense partial-credit reward computed on every step, clamped to [0.0, 1.0]:

```
base          = mean( stage_index(p) / 6  for p in permits )
budget_bonus  = 0.1 * (budget_remaining / initial_budget) * base
waste_penalty = min(0.25, 0.02 * wasted_submissions)

reward = clamp(base + budget_bonus - waste_penalty, 0, 1)
```

The final per-task score emitted by inference.py:

```
score = max(rewards_history) - 0.003 * steps_taken
```

Peak progress minus a small per-step efficiency penalty. A run that completes in 9 steps outscores one that completes in 40 steps.
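A direct Python transcription of the per-step reward (assuming the stage index runs 0..6 with issued = 6, as in the worked example below):

```python
def step_reward(stage_indices: list[int], budget_remaining: float,
                initial_budget: float, wasted_submissions: int) -> float:
    """Dense partial-credit reward, mirroring the formula above."""
    base = sum(s / 6 for s in stage_indices) / len(stage_indices)
    budget_bonus = 0.1 * (budget_remaining / initial_budget) * base
    waste_penalty = min(0.25, 0.02 * wasted_submissions)
    return max(0.0, min(1.0, base + budget_bonus - waste_penalty))
```

Note that the budget bonus is multiplied by `base`, so hoarding budget without making progress earns nothing.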

### Worked example

At step 8 of medium_cafe with seed=42: 3 of 6 permits issued, 2 approved, 1 available. Budget $648/$1,020 remaining. 0 wasted submissions.

```
base          = mean([6/6, 6/6, 6/6, 3/6, 3/6, 1/6])   = 0.694
budget_bonus  = 0.1 * (648/1020) * 0.694               = 0.044
waste_penalty = 0.0
reward        = 0.694 + 0.044 - 0.0                    = 0.738
```

At the end (step 18, all permits issued): score = max(rewards_history) - 0.003 * 18 = 1.0 - 0.054 = 0.946


## Baseline scores

Tested on 2 vCPU / 8 GB, averaged over 3 seeds:

| Model | easy | medium | hard | Notes |
| --- | --- | --- | --- | --- |
| llama-3.3-70b-versatile (Groq) | 0.97 | 0.95 | 0.91 | Near-optimal. Navigates the DAG and handles the missing-doc event. |
| llama-3.1-8b-instant (Groq) | 0.51 | 0.01 | 0.00 | Struggles to pick correct permit IDs from the observation. |
| No-LLM fallback (control) | 0.60 | 0.55 | 0.00 | Safe `list()` fallback only. Cannot advance the FSM. |

Key insight: The environment meaningfully differentiates model capability. Small models cannot solve medium/hard because they fail to reason about the prerequisite DAG and budget constraints. The no-LLM control proves the env is not trivially solvable by heuristics.

Total runtime for all 3 tasks with 70B: ~90 seconds (well under the 20-minute budget).


## Example run trace (hard_restaurant, 70B)

```
[START] task=hard_restaurant env=permit_pathfinder model=llama-3.3-70b-versatile
[STEP] step=1  action=submit(business_license)   reward=0.07 done=false error=null
[STEP] step=2  action=submit(zoning_variance)    reward=0.11 done=false error=null
[STEP] step=3  action=pay(business_license)      reward=0.13 done=false error=null
[STEP] step=4  action=pay(zoning_variance)       reward=0.15 done=false error=null
[STEP] step=5  action=inspect(business_license)  reward=0.18 done=false error=null
[STEP] step=6  action=inspect(zoning_variance)   reward=0.25 done=false error=null
[STEP] step=7  action=submit(building_permit)    reward=0.29 done=false error=null
[STEP] step=8  action=submit(liquor_license)     reward=0.33 done=false error=null
[STEP] step=9  action=pay(liquor_license)        reward=0.34 done=false error=null
[STEP] step=10 action=inspect(liquor_license)    reward=0.34 done=false error=null
   ... [EVENT] Missing document: liquor_license reverted to PAID
[STEP] step=11 action=pay(building_permit)       reward=0.33 done=false error=null
[STEP] step=12 action=inspect(building_permit)   reward=0.42 done=false error=null
   ... (13 more steps: plumbing -> electrical -> hvac -> health -> fire -> food_service)
[STEP] step=30 action=inspect(food_service_license) reward=0.98 done=false error=null
[STEP] step=31 action=inspect(liquor_license)    reward=1.00 done=true  error=null
[END] success=true steps=31 score=0.907 rewards=0.07,0.11,...,0.98,1.00
```

Notice: the missing-doc event after step 10 reverts liquor_license from ISSUED to PAID. The agent recovers by completing all other permits first, then re-inspecting liquor_license as the final step. Score = 1.0 - 0.003 * 31 = 0.907.


## Environment variables

| Variable | Purpose | Default |
| --- | --- | --- |
| `API_BASE_URL` | OpenAI-compatible LLM endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier (auto-downgrades if the proxy doesn't serve it) | `Qwen/Qwen2.5-72B-Instruct` |
| `API_KEY` / `HF_TOKEN` | Credential for the LLM proxy (`API_KEY` preferred) | required, no default |
| `LOCAL_IMAGE_NAME` | Docker image to launch the env from | optional |
| `OPENENV_BASE_URL` | Direct URL to a running env server | optional |
| `PERMIT_TASK` | Default task for `reset()` | `easy_foodtruck` |

inference.py makes two guaranteed LLM proxy calls before any task loop:

1. `client.models.list()`, which discovers a valid model
2. `client.chat.completions.create(...)`, which serves as a readiness check

This prevents the silent-fallback failure mode where a deterministic heuristic solves the env without any real LLM input.


## Local setup

```bash
# Build
cd 03-PermitPathfinder
openenv build -t permit-pathfinder:local

# Run the env server
docker run -d --rm -p 8000:8000 --name pp permit-pathfinder:local

# Verify
curl -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:8000/reset

# Run inference against the local container
API_BASE_URL=https://api.groq.com/openai/v1 \
MODEL_NAME=llama-3.3-70b-versatile \
API_KEY=$GROQ_API_KEY \
OPENENV_BASE_URL=http://localhost:8000 \
python inference.py

# Or let inference.py launch the container:
LOCAL_IMAGE_NAME=permit-pathfinder:local \
API_KEY=$GROQ_API_KEY \
python inference.py

# Validate
openenv validate
bash ../pre-validation.py http://localhost:8000 .

# Run tests
pip install pytest
PYTHONPATH=. pytest tests/ -v
```

## Training with TRL

train.py provides a minimal GRPO training loop that uses PermitPathfinder as the reward source, following the official TRL OpenEnv integration pattern.

Three reward signals are combined:

- `reward_env_score`: the env's dense partial-credit reward (primary signal)
- `reward_efficiency`: bonus for completing in fewer steps
- `reward_no_waste`: penalty for illegal actions

```bash
# Terminal 1: Start the env
docker run -d -p 8001:8000 permit-pathfinder:local

# Terminal 2: Start vLLM inference server
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Terminal 3: Run GRPO training
CUDA_VISIBLE_DEVICES=1 python train.py
```
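The three signals could be decomposed roughly as follows (illustrative sketches; the exact names, weights, and signatures live in train.py):

```python
def reward_env_score(episode_rewards: list[float]) -> float:
    # Primary signal: best dense partial-credit reward seen in the episode.
    return max(episode_rewards) if episode_rewards else 0.0

def reward_efficiency(steps_taken: int, max_steps: int) -> float:
    # Bonus for finishing in fewer steps (1.0 at step 0, 0.0 at the cap).
    return max(0.0, 1.0 - steps_taken / max_steps)

def reward_no_waste(wasted_submissions: int) -> float:
    # Penalty for illegal actions, capped like the env's waste penalty.
    return -min(0.25, 0.02 * wasted_submissions)
```

Keeping the signals separate lets GRPO attribute advantage to each behavior (progress, speed, legality) rather than one blended scalar.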

Expert trajectories for supervised pre-training or offline RL are available at yashppawar/permit-pathfinder-trajectories (60 episodes, 45 scripted-optimal + 15 LLM-generated).


## Architecture

```
03-PermitPathfinder/
├── inference.py                    # [START]/[STEP]/[END] logger + LLM agent loop
├── train.py                        # TRL GRPO training script (requires 2x GPU)
├── openenv.yaml                    # spec v1, fastapi runtime, port 8000
├── Dockerfile                      # root copy (for pre-validator)
├── LICENSE                         # BSD 3-Clause
├── pyproject.toml                  # openenv-core dependency
├── models.py                       # PermitAction, PermitObservation (typed)
├── client.py                       # EnvClient subclass (sync + from_docker_image)
├── __init__.py                     # re-exports PermitEnv, PermitAction
├── trajectories.jsonl              # 60 expert episodes (HF Dataset source)
├── tests/
│   ├── test_fsm.py                 # FSM transitions, optimal policy, edge cases
│   └── test_randomization.py       # seed determinism, fee jitter, budget jitter
├── scripts/
│   └── generate_trajectories.py    # trajectory generation script
├── demo/
│   └── app.py                      # Gradio interactive demo (separate HF Space)
└── server/
    ├── app.py                      # create_app(PermitEnvironment, ...)
    ├── permit_env_environment.py   # FSM, 3 tasks, grader, missing-doc event
    └── Dockerfile                  # multi-stage on openenv-base
```

The server uses OpenEnv's create_app(...) factory. POST /reset (with empty {} body), POST /step, GET /state, GET /health, and GET /docs are provided automatically.



## License

BSD 3-Clause. See the LICENSE file in the repository root.