---
title: PermitPathfinder OpenEnv
emoji: 🏛️
colorFrom: yellow
colorTo: purple
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - rl
  - agent
  - planning
  - real-world
---

# PermitPathfinder

An OpenEnv environment where an LLM agent opens a small business by navigating a stateful municipal permitting DAG β€” a real-world planning task with dense partial-credit reward, per-episode randomization, and multi-tier difficulty progression.


Interactive Demo | Expert Trajectories Dataset | Training Script (TRL GRPO)


## Why municipal permits?

Opening a restaurant in the United States requires an average of 15+ permits across 3-5 government agencies. The SBA estimates that 22% of small-business failures cite regulatory burden as a contributing factor. Every permit has prerequisites, fees, inspections, and deadlines β€” a tangled DAG that even experienced business owners find daunting.

This isn't a toy or a game. It's a real planning problem that millions of people face, and it's the kind of multi-step, constrained, partially observable task that an AI agent deployed as a "digital assistant" has to master. The env rewards real reasoning β€” a model that doesn't understand the DAG structure, budget constraints, and prerequisite chains cannot score well, as demonstrated by our baseline results showing 8B models scoring near zero while 70B models score 0.9+.


## Real-world mapping

Every mechanic in this environment corresponds to a real permit workflow pattern:

| Env Mechanic | Real-World Equivalent |
| --- | --- |
| Permit DAG with prerequisites | NYC DOB requires a Certificate of Occupancy only after all trade permits (plumbing, electrical, HVAC) pass inspection |
| Fee jitter per episode | Municipal fee schedules update quarterly; expedite fees vary by workload |
| Budget constraint | Small businesses operate on fixed startup capital; the SBA reports median startup costs of $40,000 |
| Missing-document event | ~30% of permit applications are returned for "insufficient documentation" (ICC Building Safety Journal, 2024) |
| Hidden prerequisites (medium/hard) | Applicants frequently discover new requirements mid-process: "we also need a grease trap permit" |
| Inquiry budget | Phone hold times average 45 minutes per agency; each call is a real cost |
| Regulatory update event (hard) | Zoning code amendments, fee schedule updates, and new environmental review requirements happen mid-project |
| Waste penalty for illegal actions | Submitting incomplete applications wastes staff time and delays your timeline |

## Permit DAGs by difficulty

### Easy: Food Truck (3 permits, no dependencies)

```mermaid
graph LR
    BL[business_license] --> ISSUED1((ISSUED))
    FH[food_handler_cert] --> ISSUED2((ISSUED))
    MV[mobile_vendor_permit] --> ISSUED3((ISSUED))
```

### Medium: Neighborhood Cafe (6 permits, 2 dependency chains)

```mermaid
graph LR
    BL[business_license] --> SG[signage_permit]
    ZA[zoning_approval] --> HP[health_permit]
    ZA --> FI[fire_inspection]
    HP --> FSL[food_service_license]
    FI --> FSL
```

### Hard: Full-Service Restaurant (10 permits, 3 agencies, cross-deps + missing-doc event)

```mermaid
graph LR
    BL[business_license] --> LL[liquor_license]
    ZV[zoning_variance] --> BP[building_permit]
    ZV --> LL
    BP --> PP[plumbing_permit]
    BP --> EP[electrical_permit]
    BP --> HV[hvac_permit]
    PP --> HP[health_permit]
    EP --> FC[fire_certificate]
    HV --> FC
    HP --> FSL[food_service_license]
    FC --> FSL
```

On the hard tier, a random missing-document event reverts one already-issued permit back to paid (requiring re-inspection), forcing the agent to re-plan mid-episode.


## Tasks

| Task ID | Description | Permits | Budget (base) | Max Steps |
| --- | --- | --- | --- | --- |
| `easy_foodtruck` | Open a mobile food vendor (flat DAG) | 3 | $500 | 20 |
| `medium_cafe` | Open a 20-seat cafe (2 dependency chains) | 6 | $1,000 | 40 |
| `hard_restaurant` | Full restaurant + bar (3 agencies, cross-deps, missing-doc) | 10 | $2,500 | 70 |

Each reset() randomizes the episode:

- Budget jittered +/-10%
- Every permit fee jittered +/-20%
- Permit iteration order shuffled
- All seeded by (episode_id, seed, task_name): deterministic given the same seed, different across resets

A policy that hard-codes a fixed action sequence will not generalize across resets.
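Deterministic per-episode jitter can be derived along these lines (a sketch under assumed naming; the env's actual seeding code may differ):

```python
import hashlib
import random

def episode_rng(episode_id: int, seed: int, task_name: str) -> random.Random:
    # Hash the triple into a stable 64-bit seed so the same
    # (episode_id, seed, task_name) always reproduces the same episode,
    # independent of Python's per-process hash randomization.
    key = f"{episode_id}:{seed}:{task_name}".encode()
    return random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))

def jitter(rng: random.Random, value: float, pct: float) -> float:
    # Uniform +/-pct jitter, e.g. pct=0.2 for the +/-20% fee jitter.
    return value * (1.0 + rng.uniform(-pct, pct))
```

Bumping `episode_id` on every reset is what makes consecutive episodes differ while the whole run stays reproducible from one seed.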


## Action space

```python
class PermitAction(Action):
    action_type: str   # submit | pay | inspect | query | list | set_task
    permit_id: str     # target permit ID (or task name for set_task)
```

| Action | Effect | Legal when |
| --- | --- | --- |
| `list` | Returns a message listing all permits | Always |
| `query` | Returns stage, fee, prereqs for one permit | `permit_id` exists |
| `submit` | Advances available -> approved | Permit is available |
| `pay` | Deducts fee, advances approved -> paid | Permit is approved AND budget >= fee |
| `inspect` | Advances paid -> issued, may unlock downstream permits | Permit is paid |
| `set_task` | Switches the active task (legacy; prefer `reset(task_name=...)`) | Any |

Illegal actions increment wasted_submissions and are penalized in the reward.
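The legality table can be mirrored as a tiny predicate (a hypothetical helper, not the env's source):

```python
def is_legal(action_type: str, stage: str, budget: float, fee: float) -> bool:
    """Mirror of the legality table: each advancing action is legal only
    in the stage it advances from; pay also requires sufficient budget."""
    if action_type == "submit":
        return stage == "available"
    if action_type == "pay":
        return stage == "approved" and budget >= fee
    if action_type == "inspect":
        return stage == "paid"
    # list / query / set_task never mutate permit state
    return action_type in ("list", "query", "set_task")
```

An agent that checks this locally before acting avoids accumulating `wasted_submissions` on speculative moves.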


## Observation space

```python
class PermitObservation(Observation):
    message: str                    # status text from last action
    permits: dict                   # {permit_id: {stage, fee, prereqs, prereqs_met}}
    budget_remaining: float         # dollars left
    wasted_submissions: int         # count of illegal attempts
    last_action_error: str | None   # raw error from last step, or None
    available_actions: list         # ACTION TYPES currently legal (no permit IDs!)
    task_name: str                  # current task
```

available_actions intentionally lists only action types (e.g. ["list", "query", "submit", "pay"]), not pre-built action strings with permit IDs. The agent must read the permits dict and reason about which permit to target β€” this prevents trivial "pick the first string" solutions.
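For instance, an agent must derive legal submit targets from the permits dict itself; a hypothetical helper for that reasoning step:

```python
def submit_targets(permits: dict) -> list[str]:
    """Permits that can legally be submitted right now: stage is
    'available' and all prerequisites are already satisfied."""
    return sorted(
        pid for pid, info in permits.items()
        if info["stage"] == "available" and info["prereqs_met"]
    )
```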


## Reward design

Dense partial-credit reward computed on every step, clamped to [0.0, 1.0]:

```
base          = mean( stage_index(p) / 6  for p in permits )
budget_bonus  = 0.1 * (budget_remaining / initial_budget) * base
waste_penalty = min(0.25, 0.02 * wasted_submissions)

reward = clamp(base + budget_bonus - waste_penalty, 0, 1)
```

The final per-task score emitted by inference.py:

```
score = max(rewards_history) - 0.003 * steps_taken
```

Peak progress minus a small per-step efficiency penalty. A run that completes in 9 steps outscores one that completes in 40 steps.
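A direct Python transcription of the per-step reward (assuming the stage index runs 0..6 with issued = 6, as in the worked example below):

```python
def step_reward(stage_indices: list[int], budget_remaining: float,
                initial_budget: float, wasted_submissions: int) -> float:
    """Dense partial-credit reward, mirroring the formula above."""
    base = sum(s / 6 for s in stage_indices) / len(stage_indices)
    budget_bonus = 0.1 * (budget_remaining / initial_budget) * base
    waste_penalty = min(0.25, 0.02 * wasted_submissions)
    return max(0.0, min(1.0, base + budget_bonus - waste_penalty))
```

Note that the budget bonus is multiplied by `base`, so hoarding budget without making progress earns nothing.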

### Worked example

At step 8 of medium_cafe with seed=42: 3 of 6 permits issued, 2 approved, 1 available. Budget $648/$1,020 remaining. 0 wasted submissions.

```
base          = mean([6/6, 6/6, 6/6, 3/6, 3/6, 1/6])   = 0.694
budget_bonus  = 0.1 * (648/1020) * 0.694               = 0.044
waste_penalty = 0.0
reward        = 0.694 + 0.044 - 0.0                    = 0.738
```

At the end (step 18, all permits issued): score = max(rewards_history) - 0.003 * 18 = 1.0 - 0.054 = 0.946


## Baseline scores

Tested on 2 vCPU / 8 GB, averaged over 3 seeds:

| Model | easy | medium | hard | Notes |
| --- | --- | --- | --- | --- |
| llama-3.3-70b-versatile (Groq) | 0.97 | 0.95 | 0.91 | Near-optimal. Navigates the DAG and handles the missing-doc event. |
| llama-3.1-8b-instant (Groq) | 0.51 | 0.01 | 0.00 | Struggles to pick correct permit IDs from the observation. |
| No-LLM fallback (control) | 0.60 | 0.55 | 0.00 | Safe `list()` fallback only. Cannot advance the FSM. |

Key insight: The environment meaningfully differentiates model capability. Small models cannot solve medium/hard because they fail to reason about the prerequisite DAG and budget constraints. The no-LLM control proves the env is not trivially solvable by heuristics.

Total runtime for all 3 tasks with 70B: ~90 seconds (well under the 20-minute budget).


## Example run trace (hard_restaurant, 70B)

```
[START] task=hard_restaurant env=permit_pathfinder model=llama-3.3-70b-versatile
[STEP] step=1  action=submit(business_license)   reward=0.07 done=false error=null
[STEP] step=2  action=submit(zoning_variance)    reward=0.11 done=false error=null
[STEP] step=3  action=pay(business_license)      reward=0.13 done=false error=null
[STEP] step=4  action=pay(zoning_variance)       reward=0.15 done=false error=null
[STEP] step=5  action=inspect(business_license)  reward=0.18 done=false error=null
[STEP] step=6  action=inspect(zoning_variance)   reward=0.25 done=false error=null
[STEP] step=7  action=submit(building_permit)    reward=0.29 done=false error=null
[STEP] step=8  action=submit(liquor_license)     reward=0.33 done=false error=null
[STEP] step=9  action=pay(liquor_license)        reward=0.34 done=false error=null
[STEP] step=10 action=inspect(liquor_license)    reward=0.34 done=false error=null
   ... [EVENT] Missing document: liquor_license reverted to PAID
[STEP] step=11 action=pay(building_permit)       reward=0.33 done=false error=null
[STEP] step=12 action=inspect(building_permit)   reward=0.42 done=false error=null
   ... (13 more steps: plumbing -> electrical -> hvac -> health -> fire -> food_service)
[STEP] step=30 action=inspect(food_service_license) reward=0.98 done=false error=null
[STEP] step=31 action=inspect(liquor_license)    reward=1.00 done=true  error=null
[END] success=true steps=31 score=0.907 rewards=0.07,0.11,...,0.98,1.00
```

Notice: the missing-doc event after step 10 reverts liquor_license from ISSUED to PAID. The agent recovers by completing all other permits first, then re-inspecting liquor_license as the final step. Score = 1.0 - 0.003 * 31 = 0.907.


## Environment variables

| Variable | Purpose | Default |
| --- | --- | --- |
| `API_BASE_URL` | OpenAI-compatible LLM endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier (auto-downgrades if the proxy doesn't serve it) | `Qwen/Qwen2.5-72B-Instruct` |
| `API_KEY` / `HF_TOKEN` | Credential for the LLM proxy (`API_KEY` preferred) | required, no default |
| `LOCAL_IMAGE_NAME` | Docker image to launch the env from | optional |
| `OPENENV_BASE_URL` | Direct URL to a running env server | optional |
| `PERMIT_TASK` | Default task for `reset()` | `easy_foodtruck` |

inference.py makes two guaranteed LLM proxy calls before any task loop:

1. `client.models.list()`, which discovers a valid model
2. `client.chat.completions.create(...)`, which serves as a readiness check

This prevents the silent-fallback failure mode where a deterministic heuristic solves the env without any real LLM input.


## Local setup

```bash
# Build
cd 03-PermitPathfinder
openenv build -t permit-pathfinder:local

# Run the env server
docker run -d --rm -p 8000:8000 --name pp permit-pathfinder:local

# Verify
curl -X POST -H 'Content-Type: application/json' -d '{}' http://localhost:8000/reset

# Run inference against the local container
API_BASE_URL=https://api.groq.com/openai/v1 \
MODEL_NAME=llama-3.3-70b-versatile \
API_KEY=$GROQ_API_KEY \
OPENENV_BASE_URL=http://localhost:8000 \
python inference.py

# Or let inference.py launch the container:
LOCAL_IMAGE_NAME=permit-pathfinder:local \
API_KEY=$GROQ_API_KEY \
python inference.py

# Validate
openenv validate
bash ../pre-validation.py http://localhost:8000 .

# Run tests
pip install pytest
PYTHONPATH=. pytest tests/ -v
```

## Training with TRL

train.py provides a minimal GRPO training loop that uses PermitPathfinder as the reward source, following the official TRL OpenEnv integration pattern.

Three reward signals are combined:

- `reward_env_score`: the env's dense partial-credit reward (primary signal)
- `reward_efficiency`: bonus for completing in fewer steps
- `reward_no_waste`: penalty for illegal actions

```bash
# Terminal 1: Start the env
docker run -d -p 8001:8000 permit-pathfinder:local

# Terminal 2: Start vLLM inference server
CUDA_VISIBLE_DEVICES=0 trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Terminal 3: Run GRPO training
CUDA_VISIBLE_DEVICES=1 python train.py
```
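The three signals could be decomposed roughly as follows (illustrative sketches; the exact names, weights, and signatures live in train.py):

```python
def reward_env_score(episode_rewards: list[float]) -> float:
    # Primary signal: best dense partial-credit reward seen in the episode.
    return max(episode_rewards) if episode_rewards else 0.0

def reward_efficiency(steps_taken: int, max_steps: int) -> float:
    # Bonus for finishing in fewer steps (1.0 at step 0, 0.0 at the cap).
    return max(0.0, 1.0 - steps_taken / max_steps)

def reward_no_waste(wasted_submissions: int) -> float:
    # Penalty for illegal actions, capped like the env's waste penalty.
    return -min(0.25, 0.02 * wasted_submissions)
```

Keeping the signals separate lets GRPO attribute advantage to each behavior (progress, speed, legality) rather than one blended scalar.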

Expert trajectories for supervised pre-training or offline RL are available at yashppawar/permit-pathfinder-trajectories (60 episodes, 45 scripted-optimal + 15 LLM-generated).


## Architecture

```
03-PermitPathfinder/
├── inference.py                    # [START]/[STEP]/[END] logger + LLM agent loop
├── train.py                        # TRL GRPO training script (requires 2x GPU)
├── openenv.yaml                    # spec v1, fastapi runtime, port 8000
├── Dockerfile                      # root copy (for pre-validator)
├── LICENSE                         # BSD 3-Clause
├── pyproject.toml                  # openenv-core dependency
├── models.py                       # PermitAction, PermitObservation (typed)
├── client.py                       # EnvClient subclass (sync + from_docker_image)
├── __init__.py                     # re-exports PermitEnv, PermitAction
├── trajectories.jsonl              # 60 expert episodes (HF Dataset source)
├── tests/
│   ├── test_fsm.py                 # FSM transitions, optimal policy, edge cases
│   └── test_randomization.py       # seed determinism, fee jitter, budget jitter
├── scripts/
│   └── generate_trajectories.py    # trajectory generation script
├── demo/
│   └── app.py                      # Gradio interactive demo (separate HF Space)
└── server/
    ├── app.py                      # create_app(PermitEnvironment, ...)
    ├── permit_env_environment.py   # FSM, 3 tasks, grader, missing-doc event
    └── Dockerfile                  # multi-stage on openenv-base
```

The server uses OpenEnv's create_app(...) factory. POST /reset (with empty {} body), POST /step, GET /state, GET /health, and GET /docs are provided automatically.



## License

BSD 3-Clause. See the LICENSE file in the repository root.