Godreign-Y committed on
Commit
5ace282
·
1 Parent(s): 8942e3d

Add training loop, colab notebook, and final README

Docs/implementation_report.md ADDED
@@ -0,0 +1,597 @@
1
+ # Policy-to-Logic RL Environment — Complete Implementation Report
2
+
3
+ > **Purpose**: Exhaustive description of everything implemented, with exact logic, edge cases, and formulas. Intended for AI-assisted gap analysis against the original plan.
4
+
5
+ ---
6
+
7
+ ## 1. Project Architecture
8
+
9
+ ```
10
+ OpenenvHack/
11
+ ├── main.py # Entry point: uvicorn on port 7860
12
+ ├── Dockerfile # Docker SDK deployment for HF Spaces
13
+ ├── inference.py # LLM agent loop (Qwen2.5-72B via OpenAI API)
14
+ ├── pyproject.toml # UV project: pydantic, fastapi, uvicorn, openai, huggingface-hub
15
+ ├── test_hf_spaces.py # Remote endpoint tests against HF Spaces
16
+ ├── test_all.py # Local test runner (starts server, runs tests, stops)
17
+ ├── test_local.py / test_endpoints.py # Additional test scripts
18
+ ├── policy_to_logic_env/
19
+ │ ├── __init__.py # Package exports: models + client
20
+ │ ├── models.py # Pydantic models: Action, Observation, State, StepResult
21
+ │ ├── client.py # HTTP client wrapper for the environment
22
+ │ ├── openenv.yaml # OpenEnv specification file
23
+ │ └── server/
24
+ │ ├── app.py # FastAPI app with 6 endpoints
25
+ │ ├── environment.py # Core environment: reset(), step(), state()
26
+ │ ├── policies.py # 3 task definitions with clarification maps
27
+ │ ├── ground_truth.py # Programmatic ground truth + clarification oracle
28
+ │ ├── scenario_generator.py # 4-strategy scenario generation (seeded)
29
+ │ ├── graders.py # Rule grading against scenarios
30
+ │ ├── dsl_engine.py # JSON DSL parser, validator, executor
31
+ │ ├── rewards.py # 4-component reward system
32
+ │ └── requirements.txt # Server deps: openenv-core, pydantic, fastapi, uvicorn, requests
33
+ ```
34
+
35
+ **Deployment**: Docker on HF Spaces at `https://godreign-policy2logic.hf.space`, port 7860.
36
+
37
+ ---
38
+
39
+ ## 2. HTTP API (app.py)
40
+
41
+ Single FastAPI app with CORS `allow_origins=["*"]`. One global `PolicyToLogicEnvironment()` instance (single-session).
42
+
43
+ | Endpoint | Method | Request Body | Response | Purpose |
44
+ |---|---|---|---|---|
45
+ | `/` | GET | — | `{name, version, status, endpoints, docs, redoc}` | Root probe / API info |
46
+ | `/health` | GET | — | `{status: "ok", environment: "policy_to_logic"}` | Health check |
47
+ | `/tasks` | GET | — | `{tasks: {name: {difficulty, max_steps, scenario_count, valid_decisions, variables}}}` | List all 3 tasks |
48
+ | `/reset` | POST | `{task_name: str \| null}` | `StepResult` (observation + reward=0 + done=false) | Start new episode |
49
+ | `/step` | POST | `{action_type: str, content: str}` | `StepResult` (observation + reward + done) | Take an action |
50
+ | `/state` | GET | — | `PolicyToLogicState` (full episode metadata) | Get current state |
51
+
52
+ If `task_name` in the `/reset` body is null or invalid, the environment defaults to `"data_access"`.
53
+
54
+ ---
55
+
56
+ ## 3. Data Models (models.py)
57
+
58
+ ### Action
59
+ ```
60
+ action_type: Literal["ask_clarification", "propose_rules", "refine_rules"]
61
+ content: str # JSON string payload
62
+ ```
63
+
64
+ ### Observation (returned in every StepResult)
65
+ ```
66
+ policy_text: str # The natural language policy (always present)
67
+ task_name: str
68
+ step_number: int # 0 on reset, 1+ on steps
69
+ max_steps: int
70
+ clarification_response: str | None # Oracle answer if ask_clarification
71
+ test_results: dict | None # {passed, failed, total, score, sample_failures}
72
+ current_accuracy: float # 0.0-1.0
73
+ available_actions: list[str] # What the agent can do next
74
+ feedback: str | None # Human-readable feedback
75
+ dsl_format: str # DSL syntax instructions (always present)
76
+ ```
77
+
78
+ ### State
79
+ ```
80
+ episode_id: str
81
+ step_count: int
82
+ task_name: str
83
+ current_rules: list | None
84
+ accuracy_history: list[float]
85
+ questions_asked: int
86
+ questions_log: list[str]
87
+ done: bool
88
+ total_reward: float
89
+ ```
90
+
91
+ ### StepResult
92
+ ```
93
+ observation: Observation
94
+ reward: float # 0.0-1.0 per step
95
+ done: bool
96
+ info: dict # Contains reward_breakdown, episode_score, errors, etc.
97
+ ```
98
+
99
+ ---
100
+
101
+ ## 4. Episode Lifecycle (environment.py)
102
+
103
+ ### reset(task_name)
104
+ 1. Load task config from registry (defaults to `"data_access"`)
105
+ 2. Generate scenarios via `generate_scenarios(task_name)` with `seed=42`
106
+ 3. Initialize state: `step_count=0`, `accuracy=0`, `done=false`
107
+ 4. Return observation with policy text, DSL format, available decisions/variables
108
+
109
+ ### step(action)
110
+ 1. Guard: if `state is None` or `done == True` → error result
111
+ 2. Increment `step_count`
112
+ 3. Dispatch by `action_type`:
113
+ - `"ask_clarification"` → `_handle_clarification()`
114
+ - `"propose_rules"` → `_handle_propose()`
115
+ - `"refine_rules"` → `_handle_refine()`
116
+
117
+ ### Termination Conditions
118
+ Episode ends (`done=True`) when **either**:
119
+ - `accuracy >= 0.9` (success)
120
+ - `step_count >= max_steps` (budget exhausted)
121
+
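The two termination conditions can be written as a single predicate. This is a minimal sketch; the function name `is_episode_done` and the constant `SUCCESS_THRESHOLD` are hypothetical and not taken from `environment.py`.

```python
# Target accuracy from the report; the constant name is an assumption.
SUCCESS_THRESHOLD = 0.9


def is_episode_done(accuracy: float, step_count: int, max_steps: int) -> bool:
    """Return True when either termination condition holds."""
    reached_target = accuracy >= SUCCESS_THRESHOLD   # success
    out_of_budget = step_count >= max_steps          # budget exhausted
    return reached_target or out_of_budget
```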
122
+ ### Clarification Handling
123
+ 1. Parse content as JSON to extract `question`, or use raw content as the question
124
+ 2. Call `answer_clarification(task_name, question)` → deterministic oracle answer
125
+ 3. Usefulness check: `is_useful = "I can provide information" not in answer`
126
+ 4. Compute reward (accuracy stays unchanged, clarification component applies)
127
+ 5. `refine_rules` is only available after at least one `propose_rules`
128
+
129
+ ### Rule Proposal/Refinement Handling
130
+ 1. Parse JSON content via `parse_rules()` → validates DSL structure
131
+ 2. If invalid: penalty reward, feedback with parse errors
132
+ 3. If valid: grade rules against stored scenarios → accuracy
133
+ 4. Compute reward using accuracy delta
134
+ 5. Feedback includes: accuracy, improvement direction, passed/total, sample failure
135
+ 6. If `accuracy >= 0.9`: feedback says "Target accuracy reached! Episode complete."
136
+ 7. On episode end: compute `episode_score` and include in info
137
+
138
+ ---
139
+
140
+ ## 5. The Three Tasks (policies.py)
141
+
142
+ ### Task 1: `data_access` (Easy)
143
+
144
+ | Property | Value |
145
+ |---|---|
146
+ | Difficulty | easy |
147
+ | Max Steps | 5 |
148
+ | Scenario Count | 30 |
149
+ | Variables | `time` (0-23), `data_type` (sensitive, public, internal) |
150
+ | Valid Decisions | ALLOW, DENY |
151
+ | Hidden Params | `work_start=9`, `work_end=18` |
152
+
153
+ **Policy Text** (what the agent sees):
154
+ > Employees must not access sensitive data after working hours. Working hours are from 9 AM to 6 PM (9:00 to 18:00). Public data can be accessed at any time. Internal data follows the same rules as sensitive data.
155
+
156
+ ---
157
+
158
+ ### Task 2: `resource_access` (Medium)
159
+
160
+ | Property | Value |
161
+ |---|---|
162
+ | Difficulty | medium |
163
+ | Max Steps | 7 |
164
+ | Scenario Count | 50 |
165
+ | Variables | `role` (junior, senior, contractor), `time` (0-23), `document_type` (public, internal, confidential) |
166
+ | Valid Decisions | ALLOW, DENY |
167
+ | Hidden Params | `business_start=8`, `business_end=17` |
168
+
169
+ **Policy Text**:
170
+ > Junior employees cannot access confidential documents outside business hours. Senior employees have unrestricted access to all document types. Contractors can only access public documents, regardless of time. During business hours, junior employees may access public and internal documents.
171
+
172
+ **Intentional Ambiguity**: The policy says juniors "cannot access confidential documents outside business hours" — implying they CAN during business hours. But the ground truth DENIES confidential for juniors at ALL times. This is a deliberate trap the agent must discover through testing.
173
+
174
+ ---
175
+
176
+ ### Task 3: `transaction_approval` (Hard)
177
+
178
+ | Property | Value |
179
+ |---|---|
180
+ | Difficulty | hard |
181
+ | Max Steps | 7 |
182
+ | Scenario Count | 80 |
183
+ | Variables | `amount` (100..50000, 12 values), `transfer_type` (domestic, international), `time` (0-23), `initiator_role` (employee, manager, system) |
184
+ | Valid Decisions | APPROVE, REQUIRE_APPROVAL, COMPLIANCE_REVIEW, HOLD |
185
+ | Hidden Params | `standard_limit=5000`, `high_value_threshold=10000`, `business_start=9`, `business_end=17` |
186
+
187
+ **Policy Text**:
188
+ > Transactions exceeding the standard limit require manager approval. International transfers always need compliance review regardless of amount. High-value domestic transactions during non-business hours are automatically held for review. Routine domestic transactions within limits are auto-approved. Manager-initiated transactions are exempt from the standard limit.
189
+
190
+ ---
191
+
192
+ ## 6. Ground Truth Logic (ground_truth.py)
193
+
194
+ ### Task 1: `_ground_truth_data_access`
195
+
196
+ ```python
197
+ if data_type == "public": → ALLOW
198
+ if 9 <= time < 18: → ALLOW # sensitive or internal
199
+ else: → DENY
200
+ ```
201
+
202
+ **Complete Decision Table**:
203
+
204
+ | data_type | time | Decision | Why |
205
+ |---|---|---|---|
206
+ | public | any (0-23) | ALLOW | Public is always accessible |
207
+ | sensitive | 0-8 | DENY | Before working hours |
208
+ | sensitive | 9-17 | ALLOW | During working hours |
209
+ | sensitive | 18-23 | DENY | After working hours (18 is OUTSIDE) |
210
+ | internal | 0-8 | DENY | Same rules as sensitive |
211
+ | internal | 9-17 | ALLOW | Same rules as sensitive |
212
+ | internal | 18-23 | DENY | Same rules as sensitive |
213
+
214
+ > [!IMPORTANT]
215
+ > **Critical boundary**: `time=18` → DENY. The interval is half-open: `[9, 18)`. Hour 18 is the first after-hours hour. Hour 17 is the last working hour.
216
+
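For reference, the Task 1 pseudocode can be expressed as a runnable function. This is a sketch based only on the decision table above; the signature and the constant names `WORK_START` and `WORK_END` are assumptions, not the actual code in `ground_truth.py`.

```python
# Hidden params from the task table; constant names are assumptions.
WORK_START, WORK_END = 9, 18


def ground_truth_data_access(time: int, data_type: str) -> str:
    """Decide ALLOW or DENY for one data_access scenario."""
    if data_type == "public":
        return "ALLOW"  # public data is accessible at any hour
    # Sensitive and internal data share the half-open interval [9, 18).
    return "ALLOW" if WORK_START <= time < WORK_END else "DENY"
```

Because the interval is half-open, `time=17` is allowed and `time=18` is denied.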
217
+ ---
218
+
219
+ ### Task 2: `_ground_truth_resource_access`
220
+
221
+ ```python
222
+ if role == "senior": → ALLOW
223
+ if role == "contractor":
224
+ if doc_type == "public": → ALLOW
225
+ else: → DENY
226
+ # Junior employee:
227
+ is_business_hours = (8 <= time < 17)
228
+ if doc_type == "public": → ALLOW
229
+ if is_business_hours and doc_type == "internal": → ALLOW
230
+ else: → DENY
231
+ ```
232
+
233
+ **Complete Decision Table for Junior Employees**:
234
+
235
+ | document_type | time | Decision | Why |
236
+ |---|---|---|---|
237
+ | public | any (0-23) | ALLOW | Public always allowed for all roles |
238
+ | internal | 0-7 | DENY | Before business hours |
239
+ | internal | 8-16 | ALLOW | During business hours |
240
+ | internal | 17-23 | DENY | After business hours (17 is OUTSIDE) |
241
+ | confidential | any (0-23) | **DENY** | **Always denied for juniors** |
242
+
243
+ **Senior**: ALLOW for everything, always.
244
+ **Contractor**: ALLOW only for `public`, DENY for `internal` and `confidential`, at all times.
245
+
246
+ > [!IMPORTANT]
247
+ > **Critical boundary**: `time=17` → outside business hours. Interval: `[8, 17)`. Hour 16 is the last business hour.
248
+ >
249
+ > **Critical trap**: `confidential` is ALWAYS denied for juniors, even during business hours. The policy text misleadingly implies otherwise.
250
+
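The Task 2 logic, including the confidential trap, can be sketched as a runnable function. The signature is an assumption; only the branching follows the pseudocode and tables above.

```python
def ground_truth_resource_access(role: str, time: int, document_type: str) -> str:
    """Decide ALLOW or DENY for one resource_access scenario."""
    if role == "senior":
        return "ALLOW"  # seniors are unrestricted
    if role == "contractor":
        return "ALLOW" if document_type == "public" else "DENY"
    # Remaining case: junior employee.
    if document_type == "public":
        return "ALLOW"
    is_business_hours = 8 <= time < 17  # half-open interval [8, 17)
    if is_business_hours and document_type == "internal":
        return "ALLOW"
    # Confidential falls through to DENY at every hour (the trap).
    return "DENY"
```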
251
+ ---
252
+
253
+ ### Task 3: `_ground_truth_transaction_approval`
254
+
255
+ Rules evaluated in strict priority order (first match wins):
256
+
257
+ ```python
258
+ # Rule 1: International → COMPLIANCE_REVIEW (always, regardless of everything)
259
+ if transfer_type == "international": → COMPLIANCE_REVIEW
260
+
261
+ # Rule 2: High-value domestic outside business hours → HOLD
262
+ if amount >= 10000 and not (9 <= time < 17): → HOLD
263
+
264
+ # Rule 3: Above standard limit, not manager → REQUIRE_APPROVAL
265
+ if amount > 5000 and initiator_role != "manager": → REQUIRE_APPROVAL
266
+
267
+ # Rule 4: Everything else → APPROVE
268
+ else: → APPROVE
269
+ ```
270
+
271
+ **Critical Edge Cases**:
272
+
273
+ | amount | transfer_type | time | initiator_role | Decision | Why |
274
+ |---|---|---|---|---|---|
275
+ | 5000 | domestic | 12 | employee | **APPROVE** | At limit, not above (> 5000 fails) |
276
+ | 5001 | domestic | 12 | employee | REQUIRE_APPROVAL | Above limit, not manager |
277
+ | 5001 | domestic | 12 | manager | **APPROVE** | Manager exempt from limit |
278
+ | 10000 | domestic | 20 | employee | **HOLD** | High-value + non-business hours |
279
+ | 10000 | domestic | 12 | employee | REQUIRE_APPROVAL | High-value but business hours (Rule 2 skipped, Rule 3 matches) |
280
+ | 10000 | domestic | 17 | employee | **HOLD** | 17 is non-business hours |
281
+ | 10000 | domestic | 20 | **manager** | **HOLD** | Managers NOT exempt from HOLD rule |
282
+ | 100 | international | 12 | employee | COMPLIANCE_REVIEW | International always |
283
+ | 50000 | international | 3 | manager | COMPLIANCE_REVIEW | International trumps everything |
284
+ | 9999 | domestic | 20 | employee | REQUIRE_APPROVAL | NOT high-value (< 10000), but above limit |
285
+ | 100 | domestic | 3 | employee | APPROVE | Within limit |
286
+ | 100 | domestic | 3 | system | APPROVE | System role treated like employee |
287
+
288
+ > [!IMPORTANT]
289
+ > **Standard limit comparison**: `amount > 5000` (strict greater than). $5,000 exactly = APPROVE.
290
+ >
291
+ > **High-value comparison**: `amount >= 10000` (greater than or equal). $10,000 exactly = high-value.
292
+ >
293
+ > **Manager exemption scope**: Only exempts from Rule 3 (standard limit). Managers are still subject to Rule 1 (international) and Rule 2 (high-value HOLD).
294
+ >
295
+ > **Business hours**: `[9, 17)`. Hour 17 is non-business.
296
+
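The priority order of the four rules can be checked with a small runnable version. This is a sketch derived from the pseudocode and edge-case table above; the signature is an assumption and may differ from `ground_truth.py`.

```python
def ground_truth_transaction_approval(
    amount: int, transfer_type: str, time: int, initiator_role: str
) -> str:
    """First matching rule wins, in the order listed in the report."""
    # Rule 1: international transfers always go to compliance.
    if transfer_type == "international":
        return "COMPLIANCE_REVIEW"
    # Rule 2: high-value (>= 10000) outside business hours [9, 17).
    if amount >= 10000 and not (9 <= time < 17):
        return "HOLD"
    # Rule 3: strictly above the standard limit, managers exempt.
    if amount > 5000 and initiator_role != "manager":
        return "REQUIRE_APPROVAL"
    # Rule 4: everything else.
    return "APPROVE"
```

Placing Rule 2 before Rule 3 is what keeps managers subject to the HOLD rule.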
297
+ ---
298
+
299
+ ## 7. Clarification Oracle (ground_truth.py)
300
+
301
+ ### Matching Algorithm
302
+
303
+ ```
304
+ Input: question string (free text from agent)
305
+ Output: best matching answer from task's clarification_map
306
+
307
+ Algorithm:
308
+ 1. Lowercase the question
309
+ 2. For each keyword in clarification_map:
310
+ a. Split keyword into parts by spaces
311
+ b. Check if ALL parts appear as substrings in the question
312
+ c. Score = (number_of_parts, total_keyword_length)
313
+ d. Highest score wins
314
+ 3. If no match: return generic fallback (contains "I can provide information")
315
+ ```
316
+
317
+ **Key property**: "junior confidential" matches when BOTH "junior" AND "confidential" appear anywhere in the question (order-independent). This 2-part keyword beats any 1-part keyword like "junior" alone.
318
+
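The matching algorithm above can be sketched in a few lines. The function name `match_clarification`, its signature, and the exact fallback wording are hypothetical; only the `"I can provide information"` substring and the scoring tuple come from the report.

```python
# Hypothetical wording; the report only states that the fallback
# contains the substring "I can provide information".
FALLBACK = "I can provide information about specific terms in the policy."


def match_clarification(question: str, clarification_map: dict[str, str]) -> str:
    """Return the answer of the best-scoring keyword, or the fallback."""
    q = question.lower()
    best_score = None
    best_answer = FALLBACK
    for keyword, answer in clarification_map.items():
        parts = keyword.lower().split()
        # Every part must appear as a substring, in any order.
        if parts and all(part in q for part in parts):
            score = (len(parts), len(keyword))
            if best_score is None or score > best_score:
                best_score, best_answer = score, answer
    return best_answer
```

Tuple comparison ranks the number of parts first, so a two-part keyword outranks any one-part keyword regardless of length.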
319
+ ### Usefulness Detection
320
+
321
+ In `environment.py`, line 203:
322
+ ```python
323
+ is_useful = "I can provide information" not in answer
324
+ ```
325
+ Any answer that matches a keyword entry is "useful". Only the generic fallback is "not useful".
326
+
327
+ ### 3-Tier Progressive Revelation Design
328
+
329
+ Each task's `clarification_map` has three levels:
330
+
331
+ | Tier | Keyword Type | Answer Quality | Training Purpose |
332
+ |---|---|---|---|
333
+ | Level 1 | Single short words | Partial truths, technically correct but incomplete/misleading | Agent builds initial (wrong) rules |
334
+ | Level 2 | Common phrases | More detail, boundary still ambiguous | Agent narrows down the problem |
335
+ | Level 3 | Compound/multi-word | Precise, ground-truth-aligned, corrects Level 1 | Agent fixes rules after failures |
336
+
337
+ **Example — resource_access contradiction**:
338
+ - Agent asks "What can junior employees access?" → matches `"junior"` (Level 1) → *"...but not confidential documents outside business hours"* (implies CAN during hours)
339
+ - Agent proposes rules allowing junior+confidential during hours → **fails**
340
+ - Agent asks "Can junior employees access confidential documents?" → matches `"junior confidential"` (Level 3, 2 parts > 1 part) → *"CANNOT access confidential at ANY time"*
341
+ - Agent refines rules → **passes**
342
+
343
+ ### Clarification Map Entry Counts
344
+
345
+ | Task | Level 1 | Level 2 | Level 3 | Total |
346
+ |---|---|---|---|---|
347
+ | data_access | 5 | 3 | 6 | 14 |
348
+ | resource_access | 7 | 3 | 8 | 18 |
349
+ | transaction_approval | 9 | 7 | 10 | 26 |
350
+
351
+ ---
352
+
353
+ ## 8. DSL Engine (dsl_engine.py)
354
+
355
+ ### DSL Format
356
+
357
+ ```json
358
+ {
359
+ "rules": [
360
+ {
361
+ "if": [
362
+ {"field": "<name>", "op": "<operator>", "value": <value>}
363
+ ],
364
+ "then": "<DECISION>"
365
+ }
366
+ ],
367
+ "default": "<DEFAULT_DECISION>"
368
+ }
369
+ ```
370
+
371
+ ### Supported Operators
372
+ `>`, `<`, `>=`, `<=`, `==`, `!=`
373
+
374
+ ### Validation (`validate_rules`)
375
+ Checks:
376
+ - Root is a dict
377
+ - Has `"rules"` key (must be list)
378
+ - Has `"default"` key (must be string)
379
+ - Each rule has `"if"` (list) and `"then"` (string)
380
+ - Each condition has `"field"` (string), `"op"` (valid operator), `"value"`
381
+
382
+ Returns `(is_valid: bool, errors: list[str])`.
383
+
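The listed checks could be implemented roughly as follows. This is a sketch, not the actual `dsl_engine.py` code: the error message wording and the name `VALID_OPS` are assumptions.

```python
VALID_OPS = (">", "<", ">=", "<=", "==", "!=")


def validate_rules(data):
    """Return (is_valid, errors) for a parsed DSL document."""
    if not isinstance(data, dict):
        return False, ["root must be a JSON object"]
    errors = []
    rules = data.get("rules")
    if not isinstance(rules, list):
        errors.append('"rules" must be a list')
        rules = []
    if not isinstance(data.get("default"), str):
        errors.append('"default" must be a string')
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            errors.append(f"rule {i}: must be an object")
            continue
        if not isinstance(rule.get("then"), str):
            errors.append(f'rule {i}: "then" must be a string')
        conditions = rule.get("if")
        if not isinstance(conditions, list):
            errors.append(f'rule {i}: "if" must be a list')
            continue
        for j, cond in enumerate(conditions):
            where = f"rule {i}, condition {j}"
            if not isinstance(cond, dict):
                errors.append(f"{where}: must be an object")
                continue
            if not isinstance(cond.get("field"), str):
                errors.append(f'{where}: "field" must be a string')
            op = cond.get("op")
            if not isinstance(op, str) or op not in VALID_OPS:
                errors.append(f'{where}: unsupported "op"')
            if "value" not in cond:
                errors.append(f'{where}: missing "value"')
    return not errors, errors
```

Collecting all errors instead of stopping at the first one gives the agent more complete feedback per step.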
384
+ ### Execution (`execute_rules`)
385
+ 1. Iterate rules top-to-bottom
386
+ 2. For each rule, evaluate ALL conditions (AND logic)
387
+ 3. First rule where all conditions match → return its `"then"` decision
388
+ 4. If no rules match → return `"default"`
389
+
390
+ ### Type Coercion
391
+ If a scenario has `time=9` (int) and a rule has `"value": "9"` (str), the engine coerces the string to int. Coercion works in both directions. If coercion fails, the condition evaluates to `False`.
392
+
393
+ ### Parsing (`parse_rules`)
394
+ 1. `json.loads()` the content string
395
+ 2. `validate_rules()` on the parsed dict
396
+ 3. Returns `(rules_data, [])` on success or `(None, errors)` on failure
397
+
398
+ ---
399
+
400
+ ## 9. Scenario Generator (scenario_generator.py)
401
+
402
+ ### Strategy Allocation
403
+
404
+ | Strategy | Share | Purpose |
405
+ |---|---|---|
406
+ | Boundary | ~20% | Edge values near hidden param thresholds |
407
+ | Pairwise | ~30% | Systematic variable combinations |
408
+ | Adversarial | ~20% | Hand-crafted traps for common mistakes |
409
+ | Random | remainder | Uniform sampling from variable space |
410
+
411
+ All seeded with `seed=42` for reproducibility. Scenarios are deduplicated by field values.
412
+
413
+ ### Boundary Strategy
414
+ Extracts numeric hidden params, generates scenarios at `param ± 1` and at variable min/max.
415
+
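A minimal sketch of the boundary value selection for one numeric variable follows. The function name and the idea that the caller passes only the hidden params relevant to that variable are assumptions; the report only states that values at `param ± 1` and at the variable's min/max are generated.

```python
def boundary_values(hidden_params: dict, lo: int, hi: int) -> list[int]:
    """Candidate values for one variable: min, max, and param +/- 1."""
    values = {lo, hi}
    for param in hidden_params.values():
        # Skip non-numeric params (bool is excluded on purpose).
        if isinstance(param, int) and not isinstance(param, bool):
            for candidate in (param - 1, param + 1):
                if lo <= candidate <= hi:  # stay inside the variable's range
                    values.add(candidate)
    return sorted(values)
```

For `data_access`, calling `boundary_values({"work_start": 9, "work_end": 18}, 0, 23)` yields hours on both sides of each threshold plus the two ends of the day.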
416
+ ### Pairwise Strategy
417
+ For each pair of variables, samples up to 4 representative values (min, max, middle, random), generates cross-product combinations.
418
+
419
+ ### Adversarial Strategy
420
+ **Hand-crafted per task** — these are the exact scenarios:
421
+
422
+ #### data_access adversarial:
423
+ | time | data_type | Expected | Tests |
424
+ |---|---|---|---|
425
+ | 9 | sensitive | ALLOW | Start boundary |
426
+ | 18 | sensitive | DENY | End boundary (exclusive) |
427
+ | 8 | sensitive | DENY | Just before start |
428
+ | 17 | sensitive | ALLOW | Just before end |
429
+ | 0 | public | ALLOW | Public at midnight |
430
+ | 23 | internal | DENY | Internal late night |
431
+ | 12 | internal | ALLOW | Internal during hours |
432
+
433
+ #### resource_access adversarial:
434
+ | role | time | document_type | Expected | Tests |
435
+ |---|---|---|---|---|
436
+ | junior | 8 | confidential | DENY | Confidential at business start |
437
+ | junior | 7 | internal | DENY | Internal before hours |
438
+ | junior | 17 | internal | DENY | Internal at boundary (17=outside) |
439
+ | junior | 16 | internal | ALLOW | Internal just before boundary |
440
+ | contractor | 12 | internal | DENY | Contractor restricted |
441
+ | senior | 2 | confidential | ALLOW | Senior unrestricted |
442
+ | junior | 12 | public | ALLOW | Junior public during hours |
443
+ | contractor | 12 | public | ALLOW | Contractor public |
444
+
445
+ #### transaction_approval adversarial:
446
+ | amount | transfer | time | role | Expected | Tests |
447
+ |---|---|---|---|---|---|
448
+ | 5000 | domestic | 12 | employee | APPROVE | At limit (not above) |
449
+ | 5001 | domestic | 12 | employee | REQ_APPROVAL | Just above limit |
450
+ | 5001 | domestic | 12 | manager | APPROVE | Manager exempt |
451
+ | 10000 | domestic | 20 | employee | HOLD | High-value non-business |
452
+ | 10000 | domestic | 12 | employee | REQ_APPROVAL | High-value business hours |
453
+ | 100 | international | 12 | employee | COMPLIANCE | International small |
454
+ | 50000 | international | 3 | manager | COMPLIANCE | International manager |
455
+ | 9999 | domestic | 20 | employee | REQ_APPROVAL | Below high-value threshold |
456
+ | 10000 | domestic | 9 | employee | REQ_APPROVAL | High-value at business start |
457
+ | 10000 | domestic | 17 | employee | HOLD | 17=non-business |
458
+
459
+ ---
460
+
461
+ ## 10. Grading (graders.py)
462
+
463
+ ### `grade_task(task_name, rules_data, scenarios)`
464
+ 1. Validate rules → if invalid, return `score=0.0`
465
+ 2. For each scenario: execute agent's rules, compare to `expected_decision`
466
+ 3. Comparison: `actual.upper() == expected.upper()` (case-insensitive)
467
+ 4. `score = passed / total`
468
+ 5. Returns up to 5 `sample_failures` with scenario details, expected, got
469
+
470
+ ### `quick_grade(task_name, rules_data, scenarios)`
471
+ Same logic, returns only the float score. Used during step processing.
472
+
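The grading loop can be sketched independently of the DSL engine by passing the rule executor in as a callable. The function name `grade`, the `decide` parameter, and the flat scenario shape with an `expected_decision` key are assumptions; the result keys follow the `test_results` fields listed in the Observation model.

```python
from typing import Callable


def grade(decide: Callable[[dict], str], scenarios: list[dict],
          max_failures: int = 5) -> dict:
    """Score a decision function against labelled scenarios."""
    passed = 0
    sample_failures = []
    for scenario in scenarios:
        inputs = {k: v for k, v in scenario.items() if k != "expected_decision"}
        expected = scenario["expected_decision"]
        got = decide(inputs)
        if got.upper() == expected.upper():  # case-insensitive comparison
            passed += 1
        elif len(sample_failures) < max_failures:
            sample_failures.append(
                {"scenario": inputs, "expected": expected, "got": got}
            )
    total = len(scenarios)
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "score": passed / total if total else 0.0,
        "sample_failures": sample_failures,
    }
```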
473
+ ---
474
+
475
+ ## 11. Reward System (rewards.py)
476
+
477
+ ### Per-Step Reward: `compute_reward()`
478
+
479
+ 4 components, clamped to `[0.0, 1.0]`:
480
+
481
+ | Component | Weight | Formula |
482
+ |---|---|---|
483
+ | **Accuracy** | 0.50 | `current_accuracy × 0.50` |
484
+ | **Improvement** | 0.20 | `min(delta × 2.0, 1.0) × 0.20` if delta > 0; `max(delta × 1.5, -0.5) × 0.20` if delta < 0; `0` if unchanged |
485
+ | **Efficiency** | 0.15 | `max(-0.02 × step_number [+ 0.05 × steps_saved if acc≥0.9], -0.15) × 0.15` |
486
+ | **Clarification** | 0.15 | See below |
487
+
488
+ **Clarification component details**:
489
+ - `ask_clarification` + useful + questions ≤ 3: `+0.3 × 0.15 = +0.045`
490
+ - `ask_clarification` + useful + questions > 3: `+0.1 × 0.15 = +0.015` (diminishing)
491
+ - `ask_clarification` + not useful: `-0.05 × 0.15 = -0.0075`
492
+ - `propose_rules/refine_rules` + invalid DSL: `-0.1 × 0.15 = -0.015`
493
+ - `propose_rules/refine_rules` + valid DSL: `0`
494
+
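The four components can be combined as in the sketch below. The parameter names are hypothetical, `steps_saved` is assumed to mean `max_steps - step_number`, and the `-0.15` floor is applied to the raw efficiency value before weighting, which is a literal reading of the table rather than a confirmed detail of `rewards.py`.

```python
def compute_reward(action_type: str, current_accuracy: float,
                   previous_accuracy: float, step_number: int, max_steps: int,
                   is_useful: bool = False, questions_asked: int = 0,
                   dsl_valid: bool = True) -> float:
    """Weighted sum of the four components, clamped to [0.0, 1.0]."""
    delta = current_accuracy - previous_accuracy
    if delta > 0:
        improvement = min(delta * 2.0, 1.0)
    elif delta < 0:
        improvement = max(delta * 1.5, -0.5)
    else:
        improvement = 0.0

    efficiency = -0.02 * step_number
    if current_accuracy >= 0.9:
        efficiency += 0.05 * (max_steps - step_number)  # assumed steps_saved
    efficiency = max(efficiency, -0.15)

    if action_type == "ask_clarification":
        if not is_useful:
            clarification = -0.05
        elif questions_asked <= 3:
            clarification = 0.3
        else:
            clarification = 0.1  # diminishing return after three questions
    else:
        clarification = 0.0 if dsl_valid else -0.1

    total = (current_accuracy * 0.50 + improvement * 0.20
             + efficiency * 0.15 + clarification * 0.15)
    return min(max(total, 0.0), 1.0)
```

Under these assumptions, a useful first question at step 1 with zero accuracy earns `0.045 - 0.003 = 0.042`, while an unhelpful one is clamped to `0.0`.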
495
+ ### Episode Score: `compute_episode_score()`
496
+
497
+ Used for final grading, `[0.0, 1.0]`:
498
+
499
+ ```
500
+ score = final_accuracy × 0.80
501
+ + max(0, 1 - steps/max_steps) × 0.10
502
+ + question_bonus × 0.10
503
+
504
+ question_bonus = 1.0 if questions ≤ 2
505
+ = 0.5 if questions ≤ 4
506
+ = 0.0 if questions > 4
507
+ ```
508
+
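The episode score formula translates directly into code. This sketch uses hypothetical parameter names; the weights and question thresholds are the ones given above.

```python
def compute_episode_score(final_accuracy: float, steps: int,
                          max_steps: int, questions: int) -> float:
    """Final episode score in [0.0, 1.0]."""
    if questions <= 2:
        question_bonus = 1.0
    elif questions <= 4:
        question_bonus = 0.5
    else:
        question_bonus = 0.0
    step_term = max(0.0, 1 - steps / max_steps)  # unused share of the budget
    score = (final_accuracy * 0.80
             + step_term * 0.10
             + question_bonus * 0.10)
    return min(max(score, 0.0), 1.0)
```

For example, perfect accuracy reached in 2 of 5 steps with one question gives `0.80 + 0.06 + 0.10 = 0.96`.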
509
+ ---
510
+
511
+ ## 12. Inference Agent (inference.py)
512
+
513
+ ### Configuration
514
+ - Model: `Qwen/Qwen2.5-72B-Instruct` (via `HF_TOKEN`)
515
+ - API: `https://router.huggingface.co/v1` (OpenAI-compatible)
516
+ - Temperature: 0.3, Max tokens: 1024
517
+ - Env URL: `http://localhost:7860` (configurable via `ENV_BASE_URL`)
518
+
519
+ ### Agent Loop
520
+ ```
521
+ for each task in [data_access, resource_access, transaction_approval]:
522
+ result = env.reset(task)
523
+ for step in 1..max_steps:
524
+ if result.done: break
525
+ action_type, content = get_agent_action(llm, observation, step, history)
526
+ result = env.step(action)
527
+ history.append(summary)
528
+ ```
529
+
530
+ ### Prompt Design
531
+ - **System prompt**: Describes available actions, DSL format, strategy guidelines
532
+ - **User prompt**: Built per-step with policy text, feedback, clarification answers, test results, sample failures, DSL format, action history (last 3)
533
+ - LLM response parsed as JSON: `{"action_type": "...", "content": "..."}`
534
+ - Handles markdown code blocks (`\`\`\`json ... \`\`\``)
535
+ - Fallback: if unparseable, tries extracting `"rules"`, otherwise submits empty rules
536
+
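One way to parse the LLM reply, including a reply wrapped in a markdown code block, is sketched below. The function name, the regular expression, and the empty-rules payload (in particular its `"default"` value) are hypothetical, and the intermediate fallback that tries to extract `"rules"` is omitted for brevity.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence marker
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

# Hypothetical empty submission; the real default decision is not stated.
EMPTY_RULES = json.dumps({"rules": [], "default": "DENY"})


def parse_agent_reply(text: str) -> tuple[str, str]:
    """Extract (action_type, content) from a raw LLM reply."""
    match = FENCE_RE.search(text)
    candidate = match.group(1) if match else text
    try:
        data = json.loads(candidate.strip())
        content = data["content"]
        if not isinstance(content, str):
            content = json.dumps(content)  # content must be a JSON string
        return data["action_type"], content
    except (json.JSONDecodeError, KeyError, TypeError):
        # Unparseable reply: submit empty rules instead of crashing.
        return "propose_rules", EMPTY_RULES
```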
537
+ ### Output Format
538
+ ```
539
+ [START] task=<name> env=policy_to_logic model=<model>
540
+ [STEP] step=<n> action=<summary> reward=<float> done=<bool> error=<msg|null>
541
+ [END] success=<bool> steps=<n> score=<float> rewards=<r1,r2,...>
542
+ ```
543
+
544
+ ---
545
+
546
+ ## 13. Client Library (client.py)
547
+
548
+ HTTP client using `requests.Session()`:
549
+ - `reset(task_name)` → POST `/reset` → `PolicyToLogicStepResult`
550
+ - `step(action)` → POST `/step` → `PolicyToLogicStepResult`
551
+ - `state()` → GET `/state` → `PolicyToLogicState`
552
+ - `health()` → GET `/health` → dict
553
+ - `list_tasks()` → GET `/tasks` → dict
554
+ - Context manager support (`with PolicyToLogicEnv() as env:`)
555
+
556
+ ---
557
+
558
+ ## 14. Deployment
559
+
560
+ ### Dockerfile
561
+ ```dockerfile
562
+ FROM python:3.11-slim
563
+ WORKDIR /app
564
+ COPY policy_to_logic_env/server/requirements.txt → pip install
565
+ COPY policy_to_logic_env/, main.py, inference.py
566
+ EXPOSE 7860
567
+ HEALTHCHECK: curl -f http://localhost:7860/health
568
+ CMD: python -m uvicorn policy_to_logic_env.server.app:app --host 0.0.0.0 --port 7860
569
+ ```
570
+
571
+ ### HF Spaces Config (README.md)
572
+ ```yaml
573
+ sdk: docker
574
+ app_port: 7860
575
+ ```
576
+
577
+ Live at: `https://godreign-policy2logic.hf.space`
578
+
579
+ ---
580
+
581
+ ## 15. Known Design Decisions & Limitations
582
+
583
+ 1. **Single-session**: One global environment instance. Concurrent clients will interfere. Suitable for sequential benchmarking, not parallel RL training.
584
+
585
+ 2. **Deterministic scenarios**: `seed=42` always produces the same scenarios, so the agent is graded on the same set every episode. This removes scenario-sampling variance between runs but could lead to memorization.
586
+
587
+ 3. **Stateful server**: The environment holds state in memory. Server restart loses episode state. No persistence layer.
588
+
589
+ 4. **Clarification is keyword-based**: The oracle is not an LLM — it's a deterministic keyword matcher. Agent questions that don't contain any keyword get the generic fallback (penalized as "not useful").
590
+
591
+ 5. **Progressive revelation by design**: Level 1 clarification answers are intentionally misleading partial truths. This is NOT a bug — it's the core RL training signal. Agents that trust Level 1 answers will fail and must learn to ask better (Level 3) questions.
592
+
593
+ 6. **No `refine_rules` before `propose_rules`**: If the agent tries to refine before proposing, the environment returns a feedback message. This is not treated as an error; the step simply earns 0 reward plus feedback.
594
+
595
+ 7. **Case-insensitive grading**: `actual.upper() == expected.upper()`. Agent can output "allow" or "Allow" or "ALLOW".
596
+
597
+ 8. **DSL type coercion**: Integer-string mismatches are auto-coerced. `"9"` and `9` compare equally.
IMPLEMENTATION_HANDOFF.md ADDED
@@ -0,0 +1,1045 @@
1
+ # Policy-to-Logic RL Environment — Final Implementation Handoff
2
+ ## 7-Hour Sprint Document (AI-Executable Instructions)
3
+
4
+ > **Context**: This document is written for an AI-powered IDE to execute. Every instruction is explicit, ordered, and complete. Do not skip steps. Do not reorder steps. Validate each step before moving to the next.
5
+
6
+ ---
7
+
8
+ ## CRITICAL: Validation Checklist (Must Pass All 5 Before Submission)
9
+
10
+ The hackathon validator checks these automatically. If any fails, the submission is rejected before a human sees it.
11
+
12
+ | # | Requirement | How to Verify |
13
+ |---|---|---|
14
+ | 1 | Public HF Space at submitted URL | Open in logged-out browser — must load, no 404 |
15
+ | 2 | Valid OpenEnv structure (base class + reset/step/state + openenv.yaml) | Already implemented — verify yaml is parseable |
16
+ | 3 | Training evidence as committed `.png` / `.jpg` files in repo | NOT Wandb links — actual image files in repo |
17
+ | 4 | Runnable training script (Colab notebook preferred) | Must re-execute end-to-end |
18
+ | 5 | README links all deliverables with plots embedded inline | Validator must reach every link from README |
19
+
20
+ **These 5 items drive the entire 7-hour plan. Every task below serves one or more of them.**
21
+
22
+ ---
23
+
24
+ ## Hour-by-Hour Execution Plan
25
+
26
+ ---
27
+
28
+ ## HOUR 1-2: Build the Reward-Guided Trajectory Training Loop
29
+
30
+ ### What This Is
31
+
32
+ This is NOT fine-tuning. This is a **reward-guided few-shot accumulation loop** — a legitimate optimization strategy where:
33
+ - The agent runs episodes against the environment
34
+ - Trajectories (full interaction sequences) are stored with their rewards
35
+ - High-reward trajectories become few-shot examples for subsequent episodes
36
+ - Agent performance measurably improves across episodes using the reward signal
37
+
38
+ **This is your "training loop" for the submission.** It is honest, demonstrable, and buildable.
39
+
40
+ ---
41
+
42
+ ### File to Create: `training/trajectory_optimizer.py`
43
+
44
+ Create this file in the repo root under a new `training/` directory.
45
+
46
+ ```python
47
+ """
48
+ Reward-Guided Trajectory Optimization Loop
49
+ ==========================================
50
+ Optimizes agent behavior across episodes by accumulating high-reward
51
+ trajectories as few-shot examples. Uses environment reward signal to
52
+ drive improvement — no weight updates required.
53
+
54
+ This implements a policy improvement loop where:
55
+ - reward_signal → trajectory_selection → context_construction → improved_policy
56
+ """
57
+
58
+ import json
59
+ import os
60
+ import time
61
+ import requests
62
+ from dataclasses import dataclass, field
63
+ from typing import Optional
64
+ from openai import OpenAI
65
+
66
+ # ── Configuration ────────────────────────────────────────────────────────────
67
+
68
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
69
+ HF_TOKEN = os.getenv("HF_TOKEN", "")
70
+ MODEL = "Qwen/Qwen2.5-72B-Instruct"
71
+ TEMPERATURE = 0.3
72
+ MAX_TOKENS = 1024
73
+
74
+ # Training hyperparameters
75
+ NUM_EPISODES_PER_TASK = 8 # Episodes to run per task
76
+ TOP_K_TRAJECTORIES = 3 # Max few-shot examples to keep
77
+ MIN_REWARD_THRESHOLD = 0.3 # Minimum reward to store trajectory
78
+ TASKS = ["data_access", "resource_access", "transaction_approval"]
79
+
80
+ # ── Data Structures ──────────────────────────────────────────────────────────
81
+
82
+ @dataclass
83
+ class Step:
84
+ step_number: int
85
+ action_type: str
86
+ action_content: str
87
+ reward: float
88
+ accuracy: float
89
+ feedback: str
90
+ clarification_response: Optional[str] = None
91
+
92
+ @dataclass
93
+ class Trajectory:
94
+ task_name: str
95
+ episode_id: int
96
+ steps: list[Step] = field(default_factory=list)
97
+ total_reward: float = 0.0
98
+ final_accuracy: float = 0.0
99
+ success: bool = False
100
+
101
+ def to_few_shot_string(self) -> str:
102
+ """Convert trajectory to a few-shot example string for prompting."""
103
+ lines = [
104
+ f"=== Example Episode (reward={self.total_reward:.2f}, accuracy={self.final_accuracy:.2f}) ===",
105
+ ]
106
+ for s in self.steps:
107
+ lines.append(f"Step {s.step_number}: action={s.action_type}")
108
+ lines.append(f" Content: {s.action_content[:200]}")
109
+ lines.append(f" Result: accuracy={s.accuracy:.2f}, reward={s.reward:.2f}")
110
+ if s.feedback:
111
+ lines.append(f" Feedback: {s.feedback[:150]}")
112
+ return "\n".join(lines)
113
+
114
+ # ── Environment Client ────────────────────────────────────────────────────────
115
+
116
+ class EnvClient:
117
+ def __init__(self, base_url: str):
118
+ self.base_url = base_url.rstrip("/")
119
+ self.session = requests.Session()
120
+
121
+ def reset(self, task_name: str) -> dict:
122
+ r = self.session.post(f"{self.base_url}/reset", json={"task_name": task_name})
123
+ r.raise_for_status()
124
+ return r.json()
125
+
126
+ def step(self, action_type: str, content: str) -> dict:
127
+ r = self.session.post(f"{self.base_url}/step", json={
128
+ "action_type": action_type,
129
+ "content": content
130
+ })
131
+ r.raise_for_status()
132
+ return r.json()
133
+
134
+ def health(self) -> bool:
135
+ try:
136
+ r = self.session.get(f"{self.base_url}/health", timeout=5)
137
+ return r.status_code == 200
138
+ except Exception:
139
+ return False
140
+
141
+ # ── LLM Agent ────────────────────────────────────────────────────────────────
142
+
143
+ class Agent:
144
+ def __init__(self, hf_token: str):
145
+ self.client = OpenAI(
146
+ base_url="https://router.huggingface.co/v1",
147
+ api_key=hf_token
148
+ )
149
+
150
+ def get_action(
151
+ self,
152
+ observation: dict,
153
+ step_number: int,
154
+ episode_history: list[str],
155
+ few_shot_examples: list[Trajectory]
156
+ ) -> tuple[str, str]:
157
+ """
158
+ Returns (action_type, content_json_string).
159
+ action_type: one of ask_clarification | propose_rules | refine_rules
160
+ content: JSON string appropriate for that action
161
+ """
162
+ system_prompt = self._build_system_prompt(few_shot_examples)
163
+ user_prompt = self._build_user_prompt(observation, step_number, episode_history)
164
+
165
+ try:
166
+ response = self.client.chat.completions.create(
167
+ model=MODEL,
168
+ messages=[
169
+ {"role": "system", "content": system_prompt},
170
+ {"role": "user", "content": user_prompt}
171
+ ],
172
+ temperature=TEMPERATURE,
173
+ max_tokens=MAX_TOKENS
174
+ )
175
+ raw = response.choices[0].message.content.strip()
176
+ return self._parse_response(raw, observation)
177
+ except Exception as e:
178
+ print(f" [LLM ERROR] {e}")
179
+ return "propose_rules", json.dumps({"rules": [], "default": "DENY"})
180
+
181
+ def _build_system_prompt(self, few_shot_examples: list[Trajectory]) -> str:
182
+ base = """You are a policy-to-logic agent. Your job is to convert natural language policies into executable rules.
183
+
184
+ AVAILABLE ACTIONS:
185
+ 1. ask_clarification: {"type": "clarification", "question": "your question"}
186
+ 2. propose_rules: {"rules": [...], "default": "DECISION"}
187
+ 3. refine_rules: {"rules": [...], "default": "DECISION"}
188
+
189
+ DSL FORMAT for rules:
190
+ {
191
+ "rules": [
192
+ {
193
+ "if": [
194
+ {"field": "FIELD_NAME", "op": "OPERATOR", "value": VALUE}
195
+ ],
196
+ "then": "DECISION"
197
+ }
198
+ ],
199
+ "default": "FALLBACK_DECISION"
200
+ }
201
+
202
+ Operators: >, <, >=, <=, ==, !=
203
+ Rules execute top-to-bottom. First match wins. Default applies if no rule matches.
204
+
205
+ STRATEGY:
206
+ - Step 1: Ask 1-2 targeted clarification questions about ambiguous terms
207
+ - Step 2: Propose initial rules based on policy + clarifications
208
+ - Step 3+: Refine rules based on failure feedback
209
+
210
+ OUTPUT FORMAT: Respond ONLY with valid JSON. No markdown. No explanation.
211
+ {"action_type": "propose_rules", "content": "{...escaped json string...}"}
212
+ """
213
+ if few_shot_examples:
214
+ base += "\n\nLEARNED FROM PREVIOUS EPISODES (high-reward strategies):\n"
215
+ for traj in few_shot_examples[-TOP_K_TRAJECTORIES:]:
216
+ base += "\n" + traj.to_few_shot_string() + "\n"
217
+ return base
218
+
219
+ def _build_user_prompt(self, obs: dict, step: int, history: list[str]) -> str:
220
+ lines = [
221
+ f"TASK: {obs.get('task_name', 'unknown')}",
222
+ f"STEP: {step} of {obs.get('max_steps', 7)}",
223
+ f"\nPOLICY:\n{obs.get('policy_text', '')}",
224
+ ]
225
+ if obs.get("clarification_response"):
226
+ lines.append(f"\nLAST CLARIFICATION ANSWER:\n{obs['clarification_response']}")
227
+ if obs.get("test_results"):
228
+ tr = obs["test_results"]
229
+ lines.append(f"\nTEST RESULTS: {tr.get('passed', 0)}/{tr.get('total', 0)} passed (accuracy={obs.get('current_accuracy', 0):.2f})")
230
+ if tr.get("sample_failures"):
231
+ lines.append("SAMPLE FAILURES:")
232
+ for f in tr["sample_failures"][:3]:
233
+ lines.append(f" - {f}")
234
+ if obs.get("feedback"):
235
+ lines.append(f"\nFEEDBACK: {obs['feedback']}")
236
+ if history:
237
+ lines.append(f"\nACTION HISTORY (last 3):\n" + "\n".join(history[-3:]))
238
+ lines.append(f"\nAVAILABLE ACTIONS: {obs.get('available_actions', [])}")
239
+ lines.append("\nRespond with JSON only: {\"action_type\": \"...\", \"content\": \"...\"}")
240
+ return "\n".join(lines)
241
+
242
+ def _parse_response(self, raw: str, obs: dict) -> tuple[str, str]:
243
+ # Strip markdown code fences if present
244
+ if "```" in raw:
245
+ raw = raw.split("```")[1]
246
+ if raw.startswith("json"):
247
+ raw = raw[4:]
248
+ raw = raw.strip()
249
+
250
+ try:
251
+ parsed = json.loads(raw)
252
+ action_type = parsed.get("action_type", "propose_rules")
253
+ content = parsed.get("content", "{}")
254
+
255
+ # Validate action_type
256
+ valid_actions = obs.get("available_actions", ["propose_rules", "ask_clarification"])
257
+ if action_type not in valid_actions:
258
+ action_type = "propose_rules" if "propose_rules" in valid_actions else valid_actions[0]
259
+
260
+ # Ensure content is a string
261
+ if isinstance(content, dict):
262
+ content = json.dumps(content)
263
+ return action_type, content
264
+ except Exception:
265
+ return "propose_rules", json.dumps({"rules": [], "default": "DENY"})
266
+
267
+ # ── Trajectory Bank ───────────────────────────────────────────────────────────
268
+
269
+ class TrajectoryBank:
270
+ """Stores and retrieves high-reward trajectories per task."""
271
+
272
+ def __init__(self):
273
+ self.bank: dict[str, list[Trajectory]] = {task: [] for task in TASKS}
274
+
275
+ def store(self, trajectory: Trajectory):
276
+ if trajectory.total_reward >= MIN_REWARD_THRESHOLD:
277
+ self.bank[trajectory.task_name].append(trajectory)
278
+ # Keep only top-K by reward
279
+ self.bank[trajectory.task_name].sort(key=lambda t: t.total_reward, reverse=True)
280
+ self.bank[trajectory.task_name] = self.bank[trajectory.task_name][:TOP_K_TRAJECTORIES]
281
+
282
+ def get_examples(self, task_name: str) -> list[Trajectory]:
283
+ return self.bank.get(task_name, [])
284
+
285
+ def summary(self) -> dict:
286
+ return {
287
+ task: {
288
+ "stored": len(trajs),
289
+ "best_reward": max((t.total_reward for t in trajs), default=0),
290
+ "best_accuracy": max((t.final_accuracy for t in trajs), default=0)
291
+ }
292
+ for task, trajs in self.bank.items()
293
+ }
294
+
295
+ # ── Training Loop ─────────────────────────────────────────────────────────────
296
+
297
+ class TrainingLoop:
298
+ def __init__(self, env_url: str, hf_token: str):
299
+ self.env = EnvClient(env_url)
300
+ self.agent = Agent(hf_token)
301
+ self.bank = TrajectoryBank()
302
+ self.metrics = [] # List of {episode, task, reward, accuracy, success}
303
+
304
+ def run_episode(self, task_name: str, episode_id: int) -> Trajectory:
305
+ """Run a single episode and return the trajectory."""
306
+ few_shots = self.bank.get_examples(task_name)
307
+ trajectory = Trajectory(task_name=task_name, episode_id=episode_id)
308
+
309
+ # Reset environment
310
+ result = self.env.reset(task_name)
311
+ obs = result.get("observation", {})
312
+ done = result.get("done", False)
313
+ history = []
314
+
315
+ print(f" [Episode {episode_id}] task={task_name} few_shots={len(few_shots)}")
316
+
317
+ step_num = 0
318
+ while not done and step_num < obs.get("max_steps", 7):
319
+ step_num += 1
320
+
321
+ # Get action from agent
322
+ action_type, content = self.agent.get_action(
323
+ observation=obs,
324
+ step_number=step_num,
325
+ episode_history=history,
326
+ few_shot_examples=few_shots
327
+ )
328
+
329
+ # Execute action
330
+ result = self.env.step(action_type, content)
331
+ reward = result.get("reward", 0.0)
332
+ done = result.get("done", False)
333
+ obs = result.get("observation", {})
334
+ info = result.get("info", {})
335
+
336
+ # Record step
337
+ step = Step(
338
+ step_number=step_num,
339
+ action_type=action_type,
340
+ action_content=content[:300],
341
+ reward=reward,
342
+ accuracy=obs.get("current_accuracy", 0.0),
343
+ feedback=obs.get("feedback", "") or "",
344
+ clarification_response=obs.get("clarification_response")
345
+ )
346
+ trajectory.steps.append(step)
347
+ trajectory.total_reward += reward
348
+
349
+ # Update history
350
+ history.append(f"Step {step_num}: {action_type} → reward={reward:.2f} acc={step.accuracy:.2f}")
351
+
352
+ print(f" step={step_num} action={action_type} reward={reward:.3f} acc={step.accuracy:.2f}")
353
+
354
+ if done:
355
+ episode_score = info.get("episode_score", obs.get("current_accuracy", 0.0))
356
+ trajectory.final_accuracy = episode_score
357
+ trajectory.success = obs.get("current_accuracy", 0.0) >= 0.9
358
+ break
359
+
360
+ if trajectory.steps and not done:  # step budget exhausted without a terminal signal
361
+ trajectory.final_accuracy = trajectory.steps[-1].accuracy
362
+
363
+ return trajectory
364
+
365
+ def run(self):
366
+ """Run full training loop across all tasks."""
367
+ print("=" * 60)
368
+ print("REWARD-GUIDED TRAJECTORY OPTIMIZATION")
369
+ print(f"Tasks: {TASKS}")
370
+ print(f"Episodes per task: {NUM_EPISODES_PER_TASK}")
371
+ print(f"Top-K trajectories: {TOP_K_TRAJECTORIES}")
372
+ print("=" * 60)
373
+
374
+ # Health check
375
+ if not self.env.health():
376
+ raise RuntimeError(f"Environment not reachable at {ENV_BASE_URL}")
377
+ print(f"Environment: OK ({ENV_BASE_URL})\n")
378
+
379
+ global_episode = 0
380
+
381
+ for task in TASKS:
382
+ print(f"\n{'─'*40}")
383
+ print(f"TASK: {task}")
384
+ print(f"{'─'*40}")
385
+
386
+ task_rewards = []
387
+ task_accuracies = []
388
+
389
+ for ep in range(1, NUM_EPISODES_PER_TASK + 1):
390
+ global_episode += 1
391
+ trajectory = self.run_episode(task, ep)
392
+
393
+ # Record metrics before storing, so few_shots_used counts what this episode actually saw
394
+ self.metrics.append({
395
+ "global_episode": global_episode,
396
+ "task": task,
397
+ "episode_in_task": ep,
398
+ "total_reward": trajectory.total_reward,
399
+ "final_accuracy": trajectory.final_accuracy,
400
+ "success": trajectory.success,
401
+ "num_steps": len(trajectory.steps),
402
+ "few_shots_used": len(self.bank.get_examples(task))
403
+ })
404
+
405
+ # Store in bank
406
+ self.bank.store(trajectory)
407
+
408
+ task_rewards.append(trajectory.total_reward)
409
+ task_accuracies.append(trajectory.final_accuracy)
410
+
411
+ print(f" → Episode {ep} complete: reward={trajectory.total_reward:.3f} accuracy={trajectory.final_accuracy:.2f} success={trajectory.success}")
412
+ time.sleep(0.5) # Rate limiting
413
+
414
+ print(f"\n Task summary:")
415
+ print(f" First episode reward: {task_rewards[0]:.3f}")
416
+ print(f" Last episode reward: {task_rewards[-1]:.3f}")
417
+ print(f" Improvement: {task_rewards[-1] - task_rewards[0]:+.3f}")
418
+
419
+ print("\n" + "=" * 60)
420
+ print("TRAINING COMPLETE")
421
+ print(f"Bank summary: {self.bank.summary()}")
422
+ print("=" * 60)
423
+
424
+ return self.metrics
425
+
426
+ # ── Plot Generation ───────────────────────────────────────────────────────────
427
+
428
+ def save_plots(metrics: list[dict]):
429
+ """
430
+ Save reward curve and accuracy curve as PNG files.
431
+ These are REQUIRED for hackathon submission — must be committed to repo.
432
+ """
433
+ try:
434
+ import matplotlib
435
+ matplotlib.use("Agg") # Non-interactive backend
436
+ import matplotlib.pyplot as plt
437
+ import numpy as np
438
+ except ImportError:
439
+ print("matplotlib not installed. Run: pip install matplotlib")
440
+ return
441
+
442
+ os.makedirs("training/plots", exist_ok=True)
443
+
444
+ episodes = [m["global_episode"] for m in metrics]
445
+ rewards = [m["total_reward"] for m in metrics]
446
+ accuracies = [m["final_accuracy"] for m in metrics]
447
+ tasks = [m["task"] for m in metrics]
448
+
449
+ colors = {
450
+ "data_access": "#2196F3",
451
+ "resource_access": "#FF9800",
452
+ "transaction_approval": "#4CAF50"
453
+ }
454
+
455
+ # ── Plot 1: Reward Curve ──────────────────────────────────────────────────
456
+ fig, ax = plt.subplots(figsize=(10, 5))
457
+
458
+ for task in TASKS:
459
+ task_eps = [m["global_episode"] for m in metrics if m["task"] == task]
460
+ task_rews = [m["total_reward"] for m in metrics if m["task"] == task]
461
+ ax.plot(task_eps, task_rews, marker="o", label=task,
462
+ color=colors.get(task, "gray"), linewidth=2, markersize=5)
463
+
464
+ # Trend line
465
+ z = np.polyfit(episodes, rewards, 1)
466
+ p = np.poly1d(z)
467
+ ax.plot(episodes, p(episodes), "--", color="red", alpha=0.5, linewidth=1.5, label="overall trend")
468
+
469
+ ax.set_xlabel("Episode")
470
+ ax.set_ylabel("Total Reward")
471
+ ax.set_title("Reward Curve — Reward-Guided Trajectory Optimization")
472
+ ax.legend()
473
+ ax.grid(True, alpha=0.3)
474
+ ax.set_ylim(bottom=0)
475
+
476
+ plt.tight_layout()
477
+ plt.savefig("training/plots/reward_curve.png", dpi=150, bbox_inches="tight")
478
+ plt.close()
479
+ print("Saved: training/plots/reward_curve.png")
480
+
481
+ # ── Plot 2: Accuracy Curve ────────────────────────────────────────────────
482
+ fig, ax = plt.subplots(figsize=(10, 5))
483
+
484
+ for task in TASKS:
485
+ task_eps = [m["global_episode"] for m in metrics if m["task"] == task]
486
+ task_accs = [m["final_accuracy"] for m in metrics if m["task"] == task]
487
+ ax.plot(task_eps, task_accs, marker="s", label=task,
488
+ color=colors.get(task, "gray"), linewidth=2, markersize=5)
489
+
490
+ ax.axhline(y=0.9, color="red", linestyle="--", alpha=0.7, label="success threshold (0.9)")
491
+
492
+ ax.set_xlabel("Episode")
493
+ ax.set_ylabel("Final Accuracy")
494
+ ax.set_title("Accuracy Curve — Policy-to-Logic Agent")
495
+ ax.legend()
496
+ ax.grid(True, alpha=0.3)
497
+ ax.set_ylim(0, 1.05)
498
+
499
+ plt.tight_layout()
500
+ plt.savefig("training/plots/accuracy_curve.png", dpi=150, bbox_inches="tight")
501
+ plt.close()
502
+ print("Saved: training/plots/accuracy_curve.png")
503
+
504
+ # ── Plot 3: Per-Task Improvement Bar Chart ────────────────────────────────
505
+ fig, ax = plt.subplots(figsize=(8, 5))
506
+
507
+ task_names = []
508
+ improvements = []
509
+
510
+ for task in TASKS:
511
+ task_accs = [m["final_accuracy"] for m in metrics if m["task"] == task]
512
+ if len(task_accs) >= 2:
513
+ first = task_accs[0]
514
+ last = task_accs[-1]
515
+ task_names.append(task.replace("_", "\n"))
516
+ improvements.append(last - first)
517
+
518
+ bars = ax.bar(task_names, improvements,
519
+ color=["#2196F3", "#FF9800", "#4CAF50"][:len(task_names)],
520
+ edgecolor="white", linewidth=1.5)
521
+
522
+ ax.axhline(y=0, color="black", linewidth=0.8)
523
+ ax.set_ylabel("Accuracy Improvement (last - first episode)")
524
+ ax.set_title("Per-Task Improvement from Trajectory Accumulation")
525
+ ax.grid(True, axis="y", alpha=0.3)
526
+
527
+ for bar, val in zip(bars, improvements):
528
+ ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
529
+ f"{val:+.2f}", ha="center", va="bottom", fontweight="bold")
530
+
531
+ plt.tight_layout()
532
+ plt.savefig("training/plots/improvement_chart.png", dpi=150, bbox_inches="tight")
533
+ plt.close()
534
+ print("Saved: training/plots/improvement_chart.png")
535
+
536
+ # ── Save raw metrics as JSON ──────────────────────────────────────────────
537
+ with open("training/plots/metrics.json", "w") as f:
538
+ json.dump(metrics, f, indent=2)
539
+ print("Saved: training/plots/metrics.json")
540
+
541
+ # ── Entry Point ───────────────────────────────────────────────────────────────
542
+
543
+ if __name__ == "__main__":
544
+ hf_token = os.getenv("HF_TOKEN", "")
545
+ if not hf_token:
546
+ raise ValueError("HF_TOKEN environment variable not set")
547
+
548
+ loop = TrainingLoop(ENV_BASE_URL, hf_token)
549
+ metrics = loop.run()
550
+ save_plots(metrics)
551
+
552
+ print("\nNext step: commit training/plots/*.png to repo for submission.")
553
+ ```
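
The system prompt in the script states the DSL semantics in prose: rules run top to bottom, the first match wins, and the default applies when nothing matches. A minimal sketch of an evaluator with those semantics follows. It is not the project's `dsl_engine.py`, which is not shown here; the function name `evaluate` is hypothetical, and treating the conditions inside one rule's `if` list as a logical AND is an assumption based on the time-window example used elsewhere in this document.

```python
import operator

# The six comparison operators listed in the system prompt.
OPS = {
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def evaluate(ruleset, scenario):
    """Return the decision of the first rule whose conditions all hold.

    Assumption: conditions within one rule are ANDed together.
    Falls back to ruleset["default"] when no rule matches.
    """
    for rule in ruleset["rules"]:
        if all(OPS[c["op"]](scenario[c["field"]], c["value"]) for c in rule["if"]):
            return rule["then"]
    return ruleset["default"]


ruleset = {
    "rules": [
        {
            "if": [
                {"field": "time", "op": ">=", "value": 9},
                {"field": "time", "op": "<", "value": 18},
            ],
            "then": "ALLOW",
        }
    ],
    "default": "DENY",
}

print(evaluate(ruleset, {"time": 10}))  # ALLOW
print(evaluate(ruleset, {"time": 20}))  # DENY
```

This sketch raises `KeyError` when a scenario lacks a referenced field or a rule uses an unknown operator; a real engine would likely validate the ruleset first.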
554
+
555
+ ---
556
+
557
+ ## HOUR 2-3: Build the Colab Training Notebook
558
+
559
+ ### File to Create: `training/colab_training.ipynb`
560
+
561
+ This must be a runnable Colab notebook. Create it with the following cells in order.
562
+
563
+ **Cell 1 — Install dependencies:**
564
+ ```python
565
+ # Cell 1: Install dependencies
566
+ !pip install openai requests matplotlib numpy
567
+ ```
568
+
569
+ **Cell 2 — Configuration:**
570
+ ```python
571
+ # Cell 2: Configuration
572
+ import os
573
+
574
+ # SET THESE BEFORE RUNNING
575
+ HF_TOKEN = "" # Your Hugging Face token with inference access
576
+ ENV_URL = "https://godreign-policy2logic.hf.space" # Your deployed environment URL
577
+
578
+ os.environ["HF_TOKEN"] = HF_TOKEN
579
+ os.environ["ENV_BASE_URL"] = ENV_URL
580
+
581
+ print(f"Environment URL: {ENV_URL}")
582
+ print(f"HF Token set: {'Yes' if HF_TOKEN else 'NO - MUST SET THIS'}")
583
+ ```
584
+
585
+ **Cell 3 — Health check:**
586
+ ```python
587
+ # Cell 3: Verify environment is reachable
588
+ import requests
589
+
590
+ r = requests.get(f"{ENV_URL}/health")
591
+ print(f"Status: {r.status_code}")
592
+ print(f"Response: {r.json()}")
593
+
594
+ r2 = requests.get(f"{ENV_URL}/tasks")
595
+ tasks = r2.json()
596
+ print(f"\nAvailable tasks: {list(tasks['tasks'].keys())}")
597
+ ```
598
+
599
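+ **Cell 4 — Paste the contents of `trajectory_optimizer.py` here as a cell, omitting the final `if __name__ == "__main__":` block.** A notebook cell runs with `__name__ == "__main__"`, so leaving that block in would start a full training run in this cell and Cell 5 would then repeat it.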
600
+
601
+ Add this comment at the top of the cell:
602
+ ```python
603
+ # Cell 4: Training loop implementation
604
+ # (paste training/trajectory_optimizer.py here, without the final `if __name__ == "__main__":` block)
605
+ ```
606
+
607
+ **Cell 5 — Run training:**
608
+ ```python
609
+ # Cell 5: Run training loop
610
+ loop = TrainingLoop(ENV_URL, HF_TOKEN)
611
+ metrics = loop.run()
612
+ print(f"\nTotal episodes run: {len(metrics)}")
613
+ ```
614
+
615
+ **Cell 6 — Generate and display plots:**
616
+ ```python
617
+ # Cell 6: Generate plots and display inline
618
+ save_plots(metrics)
619
+
620
+ from IPython.display import Image, display
621
+ display(Image("training/plots/reward_curve.png"))
622
+ display(Image("training/plots/accuracy_curve.png"))
623
+ display(Image("training/plots/improvement_chart.png"))
624
+ ```
625
+
626
+ **Cell 7 — Download plots (CRITICAL — these must be committed to repo):**
627
+ ```python
628
+ # Cell 7: Download plots to commit to repo
629
+ # After running this, download the files and commit them to your GitHub repo
630
+ from google.colab import files
631
+
632
+ files.download("training/plots/reward_curve.png")
633
+ files.download("training/plots/accuracy_curve.png")
634
+ files.download("training/plots/improvement_chart.png")
635
+ files.download("training/plots/metrics.json")
636
+
637
+ print("Downloaded. Now commit these files to: training/plots/ in your repo.")
638
+ ```
639
+
640
+ ---
641
+
642
+ ## HOUR 3-4: Run Training and Capture Results
643
+
644
+ ### Steps (execute in order):
645
+
646
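+ 1. Start the environment server locally (for local runs of the script) OR use the deployed HF Space URL. Colab runs on a remote VM and cannot reach `localhost` on your machine, so the notebook needs the HF Space URL.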
647
+ 2. Open the Colab notebook.
648
+ 3. Set `HF_TOKEN` in Cell 2.
649
+ 4. Set `ENV_URL` to your HF Space URL: `https://godreign-policy2logic.hf.space`
650
+ 5. Run all cells top to bottom.
651
+ 6. Wait for training to complete (~20-30 minutes for 8 episodes × 3 tasks).
652
+ 7. Cell 7 will download the plot PNG files.
653
+ 8. **Immediately commit the PNG files to the repo** under `training/plots/`.
654
+
655
+ ### Git commands after downloading plots:
656
+ ```bash
657
+ git add training/plots/reward_curve.png
658
+ git add training/plots/accuracy_curve.png
659
+ git add training/plots/improvement_chart.png
660
+ git add training/plots/metrics.json
661
+ git commit -m "Add training evidence: reward and accuracy curves"
662
+ git push
663
+ ```
664
+
665
+ ### If training takes too long (fallback):
666
+ Set `NUM_EPISODES_PER_TASK = 4` in the configuration. 4 episodes × 3 tasks = 12 total, which is usually enough to show whether a trend exists.
667
+
668
+ ---
669
+
670
+ ## HOUR 4-5: Write the README (CRITICAL — Validator Reads This)
671
+
672
+ ### File to Replace: `README.md`
673
+
674
+ The README must link every deliverable. The validator traverses links from README. If a link is broken or missing, that deliverable is marked absent.
675
+
676
+ ```markdown
677
+ # Policy-to-Logic RL Environment
678
+
679
+ > A verifiable reinforcement learning environment for policy-to-logic reasoning,
680
+ > where an agent learns to iteratively convert natural language policies into
681
+ > executable rules through interaction and reward-guided optimization.
682
+
683
+ ---
684
+
685
+ ## 🔗 Deliverables
686
+
687
+ | Deliverable | Link |
688
+ |---|---|
689
+ | **HF Space (Live Environment)** | [godreign-policy2logic.hf.space](https://godreign-policy2logic.hf.space) |
690
+ | **Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/YOUR_USERNAME/YOUR_REPO/blob/main/training/colab_training.ipynb) |
691
+ | **Writeup / Slides** | [Link to your blog/slides/video here] |
692
+
693
+ > Replace `YOUR_USERNAME/YOUR_REPO` with your actual GitHub path.
694
+
695
+ ---
696
+
697
+ ## 📊 Training Results
698
+
699
+ The agent is trained using a **reward-guided trajectory optimization loop**.
700
+ High-reward interaction sequences are accumulated as few-shot examples,
701
+ improving agent behavior across episodes without weight updates.
702
+
703
+ ### Reward Curve
704
+ ![Reward Curve](training/plots/reward_curve.png)
705
+
706
+ ### Accuracy Curve
707
+ ![Accuracy Curve](training/plots/accuracy_curve.png)
708
+
709
+ ### Per-Task Improvement
710
+ ![Improvement Chart](training/plots/improvement_chart.png)
711
+
712
+ ---
713
+
714
+ ## 🧠 What This Is
715
+
716
+ This project builds a **verifiable RL environment** where:
717
+ - Policies are stated in natural language
718
+ - An agent converts them to executable JSON rules (DSL)
719
+ - The environment evaluates rules against generated scenarios
720
+ - Reward signals drive measurable improvement across episodes
721
+
722
+ **This is not a finished product. It is a training and evaluation framework.**
723
+
724
+ ---
725
+
726
+ ## 🏗️ Architecture
727
+
728
+ ```
729
+ Policy → Agent → (Ask / Propose / Refine)
730
+ → Environment → (Scenarios + Evaluation)
731
+ → Reward → Trajectory Bank → Improved Agent
732
+ ```
733
+
734
+ ### Three Tasks (increasing difficulty)
735
+
736
+ | Task | Difficulty | Variables | Decisions |
737
+ |---|---|---|---|
738
+ | data_access | Easy | time, data_type | ALLOW, DENY |
739
+ | resource_access | Medium | role, time, document_type | ALLOW, DENY |
740
+ | transaction_approval | Hard | amount, transfer_type, time, role | APPROVE, REQUIRE_APPROVAL, COMPLIANCE_REVIEW, HOLD |
741
+
742
+ ---
743
+
744
+ ## 🎮 Environment API
745
+
746
+ Live at: `https://godreign-policy2logic.hf.space`
747
+
748
+ | Endpoint | Method | Purpose |
749
+ |---|---|---|
750
+ | `/health` | GET | Health check |
751
+ | `/tasks` | GET | List available tasks |
752
+ | `/reset` | POST | Start new episode |
753
+ | `/step` | POST | Take action |
754
+ | `/state` | GET | Get episode state |
755
+
756
+ ### Quick Start
757
+
758
+ ```python
759
+ import requests
760
+
761
+ base = "https://godreign-policy2logic.hf.space"
762
+
763
+ # Start episode
764
+ result = requests.post(f"{base}/reset", json={"task_name": "data_access"}).json()
765
+ print(result["observation"]["policy_text"])
766
+
767
+ # Take action
768
+ action = requests.post(f"{base}/step", json={
769
+ "action_type": "propose_rules",
770
+ "content": '{"rules": [{"if": [{"field": "time", "op": ">=", "value": 9}, {"field": "time", "op": "<", "value": 18}], "then": "ALLOW"}], "default": "DENY"}'
771
+ }).json()
772
+ print(f"Reward: {action['reward']}, Accuracy: {action['observation']['current_accuracy']}")
773
+ ```
774
+
775
+ ---
776
+
777
+ ## 🔁 Training Loop
778
+
779
+ The training approach uses **reward-guided trajectory accumulation**:
780
+
781
+ 1. The agent runs the first episode zero-shot
782
+ 2. High-reward trajectories are stored in the trajectory bank
783
+ 3. Each later episode uses the top-K trajectories as few-shot context
784
+ 4. Agent performance can improve as the bank accumulates better examples
785
+
786
+ **This is a legitimate policy improvement loop driven by environment reward signal.**
787
+
788
+ ### Run Training Locally
789
+
790
+ ```bash
791
+ # Install dependencies
792
+ pip install openai requests matplotlib numpy
793
+
794
+ # Set environment variables
795
+ export HF_TOKEN=your_token_here
796
+ export ENV_BASE_URL=https://godreign-policy2logic.hf.space
797
+
798
+ # Run
799
+ python training/trajectory_optimizer.py
800
+ ```
801
+
802
+ ---
803
+
804
+ ## 📁 Repository Structure
805
+
806
+ ```
807
+ ├── policy_to_logic_env/
808
+ │ ├── server/
809
+ │ │ ├── app.py # FastAPI endpoints
810
+ │ │ ├── environment.py # Core RL environment (reset/step/state)
811
+ │ │ ├── policies.py # 3 task definitions
812
+ │ │ ├── ground_truth.py # Ground truth + clarification oracle
813
+ │ │ ├── scenario_generator.py # 4-strategy scenario generation
814
+ │ │ ├── dsl_engine.py # JSON DSL parser and executor
815
+ │ │ ├── rewards.py # Multi-component reward system
816
+ │ │ └── graders.py # Rule evaluation
817
+ │ ├── models.py # Pydantic data models
818
+ │ ├── client.py # HTTP client library
819
+ │ └── openenv.yaml # OpenEnv specification
820
+ ├── training/
821
+ │ ├── trajectory_optimizer.py # Training loop
822
+ │ ├── colab_training.ipynb # Colab notebook
823
+ │ └── plots/
824
+ │ ├── reward_curve.png # Training evidence (committed)
825
+ │ ├── accuracy_curve.png # Training evidence (committed)
826
+ │ └── improvement_chart.png
827
+ ├── main.py # Server entry point
828
+ ├── Dockerfile # HF Spaces deployment
829
+ └── README.md # This file
830
+ ```
831
+
832
+ ---
833
+
834
+ ## ⚙️ OpenEnv Compliance
835
+
836
+ This environment implements the OpenEnv specification:
837
+ - Gym-style `reset()` / `step()` / `state()` interface
838
+ - Valid `openenv.yaml` at `policy_to_logic_env/openenv.yaml`
839
+ - Pydantic models for all inputs/outputs
840
+ - HTTP API for remote agent interaction
841
+
842
+ ---
843
+
844
+ ## ⚠️ Known Limitations
845
+
846
+ 1. Single-session server (sequential episodes only, not parallel)
847
+ 2. Deterministic scenario seed — same scenarios every episode
848
+ 3. Training loop uses trajectory accumulation, not weight updates
849
+ 4. Clarification oracle is keyword-based, not semantic
850
+
851
+ ---
852
+
853
+ ## 🧾 Reward System
854
+
855
+ | Component | Weight | Signal |
856
+ |---|---|---|
857
+ | Accuracy | 50% | Rules correct vs ground truth |
858
+ | Improvement | 20% | Accuracy delta per step |
859
+ | Efficiency | 15% | Steps used vs budget |
860
+ | Clarification | 15% | Question usefulness |
861
+ ```
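
The weights in the README's reward table sum to 1.0, which is consistent with the 0.0 to 1.0 reward range given for `openenv.yaml`. One way such components could be combined is sketched below. This is an illustration only: the project's `rewards.py` is not shown in this document, the function name `combine_reward` is hypothetical, and clamping each component to the range 0 to 1 is an assumption.

```python
# Weights copied from the README's reward table.
WEIGHTS = {
    "accuracy": 0.50,
    "improvement": 0.20,
    "efficiency": 0.15,
    "clarification": 0.15,
}


def combine_reward(components):
    """Weighted sum of reward components.

    Assumption: each component is clamped to [0, 1] before weighting,
    and a missing component counts as 0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return total


example = {"accuracy": 0.8, "improvement": 0.5, "efficiency": 1.0}
print(f"{combine_reward(example):.2f}")  # 0.65
```

With this weighting, accuracy alone can contribute at most 0.5, so an agent cannot reach a high reward without also improving between steps, staying within its step budget, and asking useful questions.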
862
+
863
+ ---
864
+
865
+ ## HOUR 5: Verify openenv.yaml is Valid
866
+
867
+ ### Check existing file: `policy_to_logic_env/openenv.yaml`
868
+
869
+ Open the file and verify it contains at minimum:
870
+
871
+ ```yaml
872
+ name: policy-to-logic-env
873
+ version: "1.0.0"
874
+ description: "RL environment for converting natural language policies into executable rules"
875
+
876
+ environment:
877
+ type: Environment
878
+ reset_endpoint: /reset
879
+ step_endpoint: /step
880
+ state_endpoint: /state
881
+
882
+ tasks:
883
+ - name: data_access
884
+ difficulty: easy
885
+ max_steps: 5
886
+ - name: resource_access
887
+ difficulty: medium
888
+ max_steps: 7
889
+ - name: transaction_approval
890
+ difficulty: hard
891
+ max_steps: 7
892
+
893
+ observation_space:
894
+ policy_text: string
895
+ current_accuracy: float
896
+ available_actions: list
897
+ feedback: string
898
+
899
+ action_space:
900
+ - ask_clarification
901
+ - propose_rules
902
+ - refine_rules
903
+
904
+ reward:
905
+ min: 0.0
906
+ max: 1.0
907
+ ```
908
+
909
+ If this file exists and is different, do NOT replace it — just verify it is parseable YAML and contains `reset`, `step`, `state` references.
910
+
911
+ If it is missing these fields, add them.
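
The "contains `reset`, `step`, `state` references" criterion above can be checked with a few lines of standard-library Python. The sketch below is a coarse substring test, not a YAML parser: it does not prove the file is valid YAML (the `yaml.safe_load` command in Check 2 covers that), and the helper name `missing_references` is hypothetical.

```python
from pathlib import Path

REQUIRED = ("reset", "step", "state")


def missing_references(yaml_text):
    """Return the required names that never appear in the file text."""
    lowered = yaml_text.lower()
    return [name for name in REQUIRED if name not in lowered]


spec_path = Path("policy_to_logic_env/openenv.yaml")
if spec_path.is_file():
    missing = missing_references(spec_path.read_text(encoding="utf-8"))
    print("missing references:", missing or "none")
else:
    print(f"{spec_path} not found; run this from the repo root")
```

An empty result means all three names appear somewhere in the file; it does not confirm they appear under the keys the validator expects.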
912
+
913
+ ---
914
+
915
+ ## HOUR 6: Final Verification Pass
916
+
917
+ Run through every checklist item explicitly. Do not assume. Verify each one.
918
+
919
+ ### Check 1: HF Space is Public and Reachable
920
+
921
+ ```bash
922
+ # Open this URL in a logged-out browser (incognito window)
923
+ # https://godreign-policy2logic.hf.space
924
+ # Must load without login prompt
925
+ # Must return 200 on /health endpoint
926
+
927
+ curl https://godreign-policy2logic.hf.space/health
928
+ # Expected: {"status": "ok", "environment": "policy_to_logic"}
929
+ ```
930
+
931
+ ### Check 2: OpenEnv Structure
932
+
933
+ ```bash
934
+ # Verify openenv.yaml exists and is valid YAML
935
+ python -c "import yaml; yaml.safe_load(open('policy_to_logic_env/openenv.yaml'))"
936
+ # No error = valid
937
+
938
+ # Verify reset/step/state endpoints respond
939
+ curl -X POST https://godreign-policy2logic.hf.space/reset -H "Content-Type: application/json" -d '{"task_name": "data_access"}'
940
+ curl -X GET https://godreign-policy2logic.hf.space/state
941
+ ```
942
+
943
+ ### Check 3: Plot PNG Files Are Committed
944
+
945
+ ```bash
946
+ git ls-files training/plots/
947
+ # Output must include (metrics.json is listed too if committed; order may differ):
948
+ # training/plots/reward_curve.png
949
+ # training/plots/accuracy_curve.png
950
+ # training/plots/improvement_chart.png
951
+
952
+ # Verify they are actual image files (not empty)
953
+ ls -lh training/plots/*.png
954
+ # Each must be > 10KB
955
+ ```
956
+
957
+ ### Check 4: Training Script is Runnable
958
+
959
+ ```bash
960
+ # Verify the Python script runs (dry run — just check imports and config)
961
+ python -c "
962
+ import sys
963
+ sys.path.insert(0, '.')
964
+ # Check imports
965
+ import json, os, time, requests
966
+ from openai import OpenAI
967
+ print('All imports OK')
968
+ print('Training script present:', os.path.isfile('training/trajectory_optimizer.py'))
969
+ "
970
+
971
+ # Verify Colab notebook exists
972
+ ls -la training/colab_training.ipynb
973
+ ```
974
+
975
+ ### Check 5: README Links
976
+
977
+ Open `README.md` and manually verify:
978
+ - HF Space link is correct and matches actual URL
979
+ - Colab badge link uses correct GitHub username and repo name
980
+ - All three plot images (for example `![Reward Curve](training/plots/reward_curve.png)`) render when viewed on GitHub
981
+ - Writeup/slides link is filled in (not placeholder)
982
+
983
+ ---
984
+
985
+ ## HOUR 7: Buffer — Fix Whatever Failed Check
986
+
987
+ Use this hour to fix anything that failed the verification pass.
988
+
989
+ **Most likely failures and fixes:**
990
+
991
+ | Failure | Fix |
992
+ |---|---|
993
+ | HF Space returns 404 | Rebuild Docker and redeploy to HF Spaces |
994
+ | PNG files not in repo | Download from Colab, `git add`, `git commit`, `git push` |
995
+ | openenv.yaml missing fields | Add missing fields, push |
996
+ | Colab link broken | Fix GitHub path in README |
997
+ | Plots not rendering in README | Verify relative path matches actual file location |
998
+
999
+ ---
1000
+
1001
+ ## Fallback: If Training Produces Flat Curves
1002
+
1003
+ If the reward curve shows no improvement (all episodes get similar rewards), do NOT fabricate results. Instead:
1004
+
1005
+ 1. Run more episodes by raising `NUM_EPISODES_PER_TASK` to 12
1005
+ 2. Lower `MIN_REWARD_THRESHOLD` to 0.1 so that more trajectories qualify as examples
1007
+ 3. If still flat, the submission narrative becomes: *"We demonstrate that the environment produces consistent reward signals and the agent achieves non-trivial baseline performance. Future work includes fine-tuning with TRL/GRPO."*
1008
+
1009
+ A flat but honest curve is better than a fabricated improving curve.
1010
+
1011
+ ---
1012
+
1013
+ ## What to Say to Judges (Prepared Answers)
1014
+
1015
+ **Q: Is this RL training?**
1016
+ > "We implement a reward-guided trajectory optimization loop. The environment's reward signal selects high-value interaction trajectories which are accumulated as few-shot context, improving agent policy across episodes. This is a form of in-context policy improvement driven by environment feedback."
1017
+
1018
+ **Q: Why not fine-tune the model?**
1019
+ > "Given hackathon constraints, we demonstrate the environment's training capability through trajectory accumulation. The environment is fully compatible with TRL/GRPO fine-tuning — the reward signal, episode structure, and action space are all defined. Fine-tuning is the natural next step."
1020
+
1021
+ **Q: What does the agent actually learn?**
1022
+ > "The agent learns when to ask clarifying questions versus when to propose rules, and how to refine rules based on failure feedback. The trajectory bank accumulates successful strategies that improve decision-making in subsequent episodes."
1023
+
1024
+ **Q: Why is the simulation not realistic?**
1025
+ > "The environment is a verification harness, not a simulation. It functions like unit testing for policy logic — correctness is the goal, not realism. This gives us objective, programmatic reward signals suitable for RL."
1026
+
1027
+ ---
1028
+
1029
+ ## File Checklist (Everything That Must Exist at Submission)
1030
+
1031
+ ```
1032
+ ✅ policy_to_logic_env/openenv.yaml — already exists, verify valid
1033
+ ✅ policy_to_logic_env/server/environment.py — already exists
1034
+ ✅ policy_to_logic_env/server/app.py — already exists
1035
+ ✅ training/trajectory_optimizer.py — CREATE THIS (Hour 1-2)
1036
+ ✅ training/colab_training.ipynb — CREATE THIS (Hour 2-3)
1037
+ ✅ training/plots/reward_curve.png — GENERATE AND COMMIT (Hour 3-4)
1038
+ ✅ training/plots/accuracy_curve.png — GENERATE AND COMMIT (Hour 3-4)
1039
+ ✅ training/plots/improvement_chart.png — GENERATE AND COMMIT (Hour 3-4)
1040
+ ✅ README.md — REWRITE (Hour 4-5)
1041
+ ```
1042
+
1043
+ ---
1044
+
1045
+ *End of handoff document. Every step above is required. Execute in order.*
README.md CHANGED
@@ -9,4 +9,185 @@ pinned: false
9
  short_description: Meta pytorch hugging face hackathon
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
12
+ # Policy-to-Logic RL Environment
13
+
14
+ > A verifiable reinforcement learning environment for policy-to-logic reasoning,
15
+ > where an agent learns to iteratively convert natural language policies into
16
+ > executable rules through interaction and reward-guided optimization.
17
+
18
+ ---
19
+
20
+ ## 🔗 Deliverables
21
+
22
+ | Deliverable | Link |
23
+ |---|---|
24
+ | **HF Space (Live Environment)** | [godreign-policy2logic.hf.space](https://godreign-policy2logic.hf.space) |
25
+ | **Training Notebook (Colab)** | [Open in Colab](https://colab.research.google.com/github/GodreignElgin/policy2logic/blob/main/training/colab_training.ipynb) |
26
+ | **Writeup / Slides** | *TBD — add your link here* |
27
+
28
+ ---
29
+
30
+ ## 📊 Training Results
31
+
32
+ The agent is trained using a **reward-guided trajectory optimization loop**.
33
+ High-reward interaction sequences are accumulated as few-shot examples,
34
+ improving agent behavior across episodes without weight updates.
35
+
36
+ ### Reward Curve
37
+ ![Reward Curve](training/plots/reward_curve.png)
38
+
39
+ ### Accuracy Curve
40
+ ![Accuracy Curve](training/plots/accuracy_curve.png)
41
+
42
+ ### Per-Task Improvement
43
+ ![Improvement Chart](training/plots/improvement_chart.png)
44
+
45
+ ---
46
+
47
+ ## 🧠 What This Is
48
+
49
+ This project builds a **verifiable RL environment** where:
50
+ - Policies are stated in natural language
51
+ - An agent converts them to executable JSON rules (DSL)
52
+ - The environment evaluates rules against generated scenarios
53
+ - Reward signals drive measurable improvement across episodes
54
+
55
+ **This is not a finished product. It is a training and evaluation framework.**
56
+
57
+ ---
58
+
59
+ ## 🏗️ Architecture
60
+
61
+ ```
62
+ Policy → Agent → (Ask / Propose / Refine)
63
+ → Environment → (Scenarios + Evaluation)
64
+ → Reward → Trajectory Bank → Improved Agent
65
+ ```
66
+
67
+ ### Three Tasks (increasing difficulty)
68
+
69
+ | Task | Difficulty | Variables | Decisions |
70
+ |---|---|---|---|
71
+ | data_access | Easy | time, data_type | ALLOW, DENY |
72
+ | resource_access | Medium | role, time, document_type | ALLOW, DENY |
73
+ | transaction_approval | Hard | amount, transfer_type, time, initiator_role | APPROVE, REQUIRE_APPROVAL, COMPLIANCE_REVIEW, HOLD |
74
+
75
+ ---
76
+
77
+ ## 🎮 Environment API
78
+
79
+ Live at: `https://godreign-policy2logic.hf.space`
80
+
81
+ | Endpoint | Method | Purpose |
82
+ |---|---|---|
83
+ | `/health` | GET | Health check |
84
+ | `/tasks` | GET | List available tasks |
85
+ | `/reset` | POST | Start new episode |
86
+ | `/step` | POST | Take action |
87
+ | `/state` | GET | Get episode state |
88
+
89
+ ### Quick Start
90
+
91
+ ```python
92
+ import requests
93
+
94
+ base = "https://godreign-policy2logic.hf.space"
95
+
96
+ # Start episode
97
+ result = requests.post(f"{base}/reset", json={"task_name": "data_access"}).json()
98
+ print(result["observation"]["policy_text"])
99
+
100
+ # Take action
101
+ action = requests.post(f"{base}/step", json={
102
+ "action_type": "propose_rules",
103
+ "content": '{"rules": [{"if": [{"field": "time", "op": ">=", "value": 9}, {"field": "time", "op": "<", "value": 18}], "then": "ALLOW"}], "default": "DENY"}'
104
+ }).json()
105
+ print(f"Reward: {action['reward']}, Accuracy: {action['observation']['current_accuracy']}")
106
+ ```
107
+
108
+ ---
109
+
110
+ ## 🔁 Training Loop
111
+
112
+ The training approach uses **reward-guided trajectory accumulation**:
113
+
114
+ 1. Agent runs episode zero-shot
115
+ 2. High-reward trajectories stored in trajectory bank
116
+ 3. Next episode uses top-K trajectories as few-shot context
117
+ 4. Agent performance improves as bank accumulates better examples
118
+
119
+ **The model's weights are never updated: improvement comes from the environment's reward signal selecting which trajectories become few-shot examples.**
120
+
121
+ ### Run Training Locally
122
+
123
+ ```bash
124
+ # Install dependencies
125
+ pip install openai requests matplotlib numpy
126
+
127
+ # Set environment variables
128
+ export HF_TOKEN=your_token_here
129
+ export ENV_BASE_URL=https://godreign-policy2logic.hf.space
130
+
131
+ # Run
132
+ python training/trajectory_optimizer.py
133
+ ```
134
+
135
+ ---
136
+
137
+ ## 📁 Repository Structure
138
+
139
+ ```
140
+ ├── policy_to_logic_env/
141
+ │ ├── server/
142
+ │ │ ├── app.py # FastAPI endpoints
143
+ │ │ ├── environment.py # Core RL environment (reset/step/state)
144
+ │ │ ├── policies.py # 3 task definitions
145
+ │ │ ├── ground_truth.py # Ground truth + clarification oracle
146
+ │ │ ├── scenario_generator.py # 4-strategy scenario generation
147
+ │ │ ├── dsl_engine.py # JSON DSL parser and executor
148
+ │ │ ├── rewards.py # Multi-component reward system
149
+ │ │ └── graders.py # Rule evaluation
150
+ │ ├── models.py # Pydantic data models
151
+ │ ├── client.py # HTTP client library
152
+ │ └── openenv.yaml # OpenEnv specification
153
+ ├── training/
154
+ │ ├── trajectory_optimizer.py # Training loop
155
+ │ ├── colab_training.ipynb # Colab notebook
156
+ │ └── plots/
157
+ │ ├── reward_curve.png # Training evidence (committed)
158
+ │ ├── accuracy_curve.png # Training evidence (committed)
159
+ │ └── improvement_chart.png
160
+ ├── main.py # Server entry point
161
+ ├── Dockerfile # HF Spaces deployment
162
+ └── README.md # This file
163
+ ```
164
+
165
+ ---
166
+
167
+ ## ⚙️ OpenEnv Compliance
168
+
169
+ This environment implements the OpenEnv specification:
170
+ - Gym-style `reset()` / `step()` / `state()` interface
171
+ - Valid `openenv.yaml` at `policy_to_logic_env/openenv.yaml`
172
+ - Pydantic models for all inputs/outputs
173
+ - HTTP API for remote agent interaction
174
+
175
+ ---
176
+
177
+ ## ⚠️ Known Limitations
178
+
179
+ 1. Single-session server (sequential episodes only, not parallel)
180
+ 2. Deterministic scenario seed — same scenarios every episode
181
+ 3. Training loop uses trajectory accumulation, not weight updates
182
+ 4. Clarification oracle is keyword-based, not semantic
183
+
184
+ ---
185
+
186
+ ## 🧾 Reward System
187
+
188
+ | Component | Weight | Signal |
189
+ |---|---|---|
190
+ | Accuracy | 50% | Rules correct vs ground truth |
191
+ | Improvement | 20% | Accuracy delta per step |
192
+ | Efficiency | 15% | Steps used vs budget |
193
+ | Clarification | 15% | Question usefulness |
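
The weights in the table above can be read as a weighted sum. The sketch below is only an illustration under that assumption: it takes component scores that are already normalized to `[0, 1]` and does not reproduce how `rewards.py` computes each component. The names `WEIGHTS` and `combine_reward` are hypothetical.

```python
# Hypothetical weights copied from the README table (not read from rewards.py).
WEIGHTS = {
    "accuracy": 0.50,
    "improvement": 0.20,
    "efficiency": 0.15,
    "clarification": 0.15,
}


def combine_reward(components: dict) -> float:
    """Weighted sum of component scores, each clamped to [0, 1].

    Assumption: every component is already normalized; missing components
    count as 0.0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        score = min(1.0, max(0.0, float(components.get(name, 0.0))))
        total += weight * score
    return total


if __name__ == "__main__":
    print(combine_reward({"accuracy": 0.8, "improvement": 0.5,
                          "efficiency": 1.0, "clarification": 0.0}))
```

Because the weights sum to 1, a step with perfect component scores yields the maximum reward of 1.0 under this reading.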
implementation_report.md ADDED
@@ -0,0 +1,597 @@
1
+ # Policy-to-Logic RL Environment — Complete Implementation Report
2
+
3
+ > **Purpose**: Exhaustive description of everything implemented, with exact logic, edge cases, and formulas. Intended for AI-assisted gap analysis against the original plan.
4
+
5
+ ---
6
+
7
+ ## 1. Project Architecture
8
+
9
+ ```
10
+ OpenenvHack/
11
+ ├── main.py # Entry point: uvicorn on port 7860
12
+ ├── Dockerfile # Docker SDK deployment for HF Spaces
13
+ ├── inference.py # LLM agent loop (Qwen2.5-72B via OpenAI API)
14
+ ├── pyproject.toml # UV project: pydantic, fastapi, uvicorn, openai, huggingface-hub
15
+ ├── test_hf_spaces.py # Remote endpoint tests against HF Spaces
16
+ ├── test_all.py # Local test runner (starts server, runs tests, stops)
17
+ ├── test_local.py / test_endpoints.py # Additional test scripts
18
+ ├── policy_to_logic_env/
19
+ │ ├── __init__.py # Package exports: models + client
20
+ │ ├── models.py # Pydantic models: Action, Observation, State, StepResult
21
+ │ ├── client.py # HTTP client wrapper for the environment
22
+ │ ├── openenv.yaml # OpenEnv specification file
23
+ │ └── server/
24
+ │ ├── app.py # FastAPI app with 6 endpoints
25
+ │ ├── environment.py # Core environment: reset(), step(), state()
26
+ │ ├── policies.py # 3 task definitions with clarification maps
27
+ │ ├── ground_truth.py # Programmatic ground truth + clarification oracle
28
+ │ ├── scenario_generator.py # 4-strategy scenario generation (seeded)
29
+ │ ├── graders.py # Rule grading against scenarios
30
+ │ ├── dsl_engine.py # JSON DSL parser, validator, executor
31
+ │ ├── rewards.py # 4-component reward system
32
+ │ └── requirements.txt # Server deps: openenv-core, pydantic, fastapi, uvicorn, requests
33
+ ```
34
+
35
+ **Deployment**: Docker on HF Spaces at `https://godreign-policy2logic.hf.space`, port 7860.
36
+
37
+ ---
38
+
39
+ ## 2. HTTP API (app.py)
40
+
41
+ Single FastAPI app with CORS `allow_origins=["*"]`. One global `PolicyToLogicEnvironment()` instance (single-session).
42
+
43
+ | Endpoint | Method | Request Body | Response | Purpose |
44
+ |---|---|---|---|---|
45
+ | `/` | GET | — | `{name, version, status, endpoints, docs, redoc}` | Root probe / API info |
46
+ | `/health` | GET | — | `{status: "ok", environment: "policy_to_logic"}` | Health check |
47
+ | `/tasks` | GET | — | `{tasks: {name: {difficulty, max_steps, scenario_count, valid_decisions, variables}}}` | List all 3 tasks |
48
+ | `/reset` | POST | `{task_name: str \| null}` | `StepResult` (observation + reward=0 + done=false) | Start new episode |
49
+ | `/step` | POST | `{action_type: str, content: str}` | `StepResult` (observation + reward + done) | Take an action |
50
+ | `/state` | GET | — | `PolicyToLogicState` (full episode metadata) | Get current state |
51
+
52
+ If `task_name` is null or invalid in `/reset`, defaults to `"data_access"`.
53
+
54
+ ---
55
+
56
+ ## 3. Data Models (models.py)
57
+
58
+ ### Action
59
+ ```
60
+ action_type: Literal["ask_clarification", "propose_rules", "refine_rules"]
61
+ content: str # JSON string payload
62
+ ```
63
+
64
+ ### Observation (returned in every StepResult)
65
+ ```
66
+ policy_text: str # The natural language policy (always present)
67
+ task_name: str
68
+ step_number: int # 0 on reset, 1+ on steps
69
+ max_steps: int
70
+ clarification_response: str | None # Oracle answer if ask_clarification
71
+ test_results: dict | None # {passed, failed, total, score, sample_failures}
72
+ current_accuracy: float # 0.0-1.0
73
+ available_actions: list[str] # What the agent can do next
74
+ feedback: str | None # Human-readable feedback
75
+ dsl_format: str # DSL syntax instructions (always present)
76
+ ```
77
+
78
+ ### State
79
+ ```
80
+ episode_id: str
81
+ step_count: int
82
+ task_name: str
83
+ current_rules: list | None
84
+ accuracy_history: list[float]
85
+ questions_asked: int
86
+ questions_log: list[str]
87
+ done: bool
88
+ total_reward: float
89
+ ```
90
+
91
+ ### StepResult
92
+ ```
93
+ observation: Observation
94
+ reward: float # 0.0-1.0 per step
95
+ done: bool
96
+ info: dict # Contains reward_breakdown, episode_score, errors, etc.
97
+ ```
98
+
99
+ ---
100
+
101
+ ## 4. Episode Lifecycle (environment.py)
102
+
103
+ ### reset(task_name)
104
+ 1. Load task config from registry (defaults to `"data_access"`)
105
+ 2. Generate scenarios via `generate_scenarios(task_name)` with `seed=42`
106
+ 3. Initialize state: `step_count=0`, `accuracy=0`, `done=false`
107
+ 4. Return observation with policy text, DSL format, available decisions/variables
108
+
109
+ ### step(action)
110
+ 1. Guard: if `state is None` or `done == True` → error result
111
+ 2. Increment `step_count`
112
+ 3. Dispatch by `action_type`:
113
+ - `"ask_clarification"` → `_handle_clarification()`
114
+ - `"propose_rules"` → `_handle_propose()`
115
+ - `"refine_rules"` → `_handle_refine()`
116
+
117
+ ### Termination Conditions
118
+ Episode ends (`done=True`) when **either**:
119
+ - `accuracy >= 0.9` (success)
120
+ - `step_count >= max_steps` (budget exhausted)
121
+
122
+ ### Clarification Handling
123
+ 1. Parse content as JSON to extract `question`, or use raw content as the question
124
+ 2. Call `answer_clarification(task_name, question)` → deterministic oracle answer
125
+ 3. Usefulness check: `is_useful = "I can provide information" not in answer`
126
+ 4. Compute reward (accuracy stays unchanged, clarification component applies)
127
+ 5. `refine_rules` is only available after at least one `propose_rules`
128
+
129
+ ### Rule Proposal/Refinement Handling
130
+ 1. Parse JSON content via `parse_rules()` → validates DSL structure
131
+ 2. If invalid: penalty reward, feedback with parse errors
132
+ 3. If valid: grade rules against stored scenarios → accuracy
133
+ 4. Compute reward using accuracy delta
134
+ 5. Feedback includes: accuracy, improvement direction, passed/total, sample failure
135
+ 6. If `accuracy >= 0.9`: feedback says "Target accuracy reached! Episode complete."
136
+ 7. On episode end: compute `episode_score` and include in info
137
+
138
+ ---
139
+
140
+ ## 5. The Three Tasks (policies.py)
141
+
142
+ ### Task 1: `data_access` (Easy)
143
+
144
+ | Property | Value |
145
+ |---|---|
146
+ | Difficulty | easy |
147
+ | Max Steps | 5 |
148
+ | Scenario Count | 30 |
149
+ | Variables | `time` (0-23), `data_type` (sensitive, public, internal) |
150
+ | Valid Decisions | ALLOW, DENY |
151
+ | Hidden Params | `work_start=9`, `work_end=18` |
152
+
153
+ **Policy Text** (what the agent sees):
154
+ > Employees must not access sensitive data after working hours. Working hours are from 9 AM to 6 PM (9:00 to 18:00). Public data can be accessed at any time. Internal data follows the same rules as sensitive data.
155
+
156
+ ---
157
+
158
+ ### Task 2: `resource_access` (Medium)
159
+
160
+ | Property | Value |
161
+ |---|---|
162
+ | Difficulty | medium |
163
+ | Max Steps | 7 |
164
+ | Scenario Count | 50 |
165
+ | Variables | `role` (junior, senior, contractor), `time` (0-23), `document_type` (public, internal, confidential) |
166
+ | Valid Decisions | ALLOW, DENY |
167
+ | Hidden Params | `business_start=8`, `business_end=17` |
168
+
169
+ **Policy Text**:
170
+ > Junior employees cannot access confidential documents outside business hours. Senior employees have unrestricted access to all document types. Contractors can only access public documents, regardless of time. During business hours, junior employees may access public and internal documents.
171
+
172
+ **Intentional Ambiguity**: The policy says juniors "cannot access confidential documents outside business hours" — implying they CAN during business hours. But the ground truth DENIES confidential for juniors at ALL times. This is a deliberate trap the agent must discover through testing.
173
+
174
+ ---
175
+
176
+ ### Task 3: `transaction_approval` (Hard)
177
+
178
+ | Property | Value |
179
+ |---|---|
180
+ | Difficulty | hard |
181
+ | Max Steps | 7 |
182
+ | Scenario Count | 80 |
183
+ | Variables | `amount` (100..50000, 12 values), `transfer_type` (domestic, international), `time` (0-23), `initiator_role` (employee, manager, system) |
184
+ | Valid Decisions | APPROVE, REQUIRE_APPROVAL, COMPLIANCE_REVIEW, HOLD |
185
+ | Hidden Params | `standard_limit=5000`, `high_value_threshold=10000`, `business_start=9`, `business_end=17` |
186
+
187
+ **Policy Text**:
188
+ > Transactions exceeding the standard limit require manager approval. International transfers always need compliance review regardless of amount. High-value domestic transactions during non-business hours are automatically held for review. Routine domestic transactions within limits are auto-approved. Manager-initiated transactions are exempt from the standard limit.
189
+
190
+ ---
191
+
192
+ ## 6. Ground Truth Logic (ground_truth.py)
193
+
194
+ ### Task 1: `_ground_truth_data_access`
195
+
196
+ ```python
197
+ if data_type == "public": → ALLOW
198
+ if 9 <= time < 18: → ALLOW # sensitive or internal
199
+ else: → DENY
200
+ ```
201
+
202
+ **Complete Decision Table**:
203
+
204
+ | data_type | time | Decision | Why |
205
+ |---|---|---|---|
206
+ | public | any (0-23) | ALLOW | Public is always accessible |
207
+ | sensitive | 0-8 | DENY | Before working hours |
208
+ | sensitive | 9-17 | ALLOW | During working hours |
209
+ | sensitive | 18-23 | DENY | After working hours (18 is OUTSIDE) |
210
+ | internal | 0-8 | DENY | Same rules as sensitive |
211
+ | internal | 9-17 | ALLOW | Same rules as sensitive |
212
+ | internal | 18-23 | DENY | Same rules as sensitive |
213
+
214
+ > [!IMPORTANT]
215
+ > **Critical boundary**: `time=18` → DENY. The interval is half-open: `[9, 18)`. Hour 18 is the first after-hours hour. Hour 17 is the last working hour.
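
As a runnable restatement of the Task 1 logic above, here is a minimal sketch. The function name `ground_truth_data_access` and its signature are assumptions for illustration; only the decision logic is taken from the table.

```python
def ground_truth_data_access(time: int, data_type: str) -> str:
    """Illustrative restatement of the Task 1 decision table."""
    if data_type == "public":
        return "ALLOW"          # public data is always accessible
    if 9 <= time < 18:          # half-open working-hours interval [9, 18)
        return "ALLOW"          # sensitive or internal, during working hours
    return "DENY"


if __name__ == "__main__":
    for hour in (8, 9, 17, 18):
        print(hour, ground_truth_data_access(hour, "sensitive"))
```

Running it over hours 8, 9, 17 and 18 exercises both ends of the half-open interval.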
216
+
217
+ ---
218
+
219
+ ### Task 2: `_ground_truth_resource_access`
220
+
221
+ ```python
222
+ if role == "senior": → ALLOW
223
+ if role == "contractor":
224
+ if doc_type == "public": → ALLOW
225
+ else: → DENY
226
+ # Junior employee:
227
+ is_business_hours = (8 <= time < 17)
228
+ if doc_type == "public": → ALLOW
229
+ if is_business_hours and doc_type == "internal": → ALLOW
230
+ else: → DENY
231
+ ```
232
+
233
+ **Complete Decision Table for Junior Employees**:
234
+
235
+ | document_type | time | Decision | Why |
236
+ |---|---|---|---|
237
+ | public | any (0-23) | ALLOW | Public always allowed for all roles |
238
+ | internal | 0-7 | DENY | Before business hours |
239
+ | internal | 8-16 | ALLOW | During business hours |
240
+ | internal | 17-23 | DENY | After business hours (17 is OUTSIDE) |
241
+ | confidential | any (0-23) | **DENY** | **Always denied for juniors** |
242
+
243
+ **Senior**: ALLOW for everything, always.
244
+ **Contractor**: ALLOW only for `public`, DENY for `internal` and `confidential`, at all times.
245
+
246
+ > [!IMPORTANT]
247
+ > **Critical boundary**: `time=17` → outside business hours. Interval: `[8, 17)`. Hour 16 is the last business hour.
248
+ >
249
+ > **Critical trap**: `confidential` is ALWAYS denied for juniors, even during business hours. The policy text misleadingly implies otherwise.
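
The same logic can be written as plain Python. This is a sketch under the assumption that the function takes the three scenario variables directly; the name and signature are hypothetical, the branches follow the tables above.

```python
def ground_truth_resource_access(role: str, time: int, document_type: str) -> str:
    """Illustrative restatement of the Task 2 decision logic."""
    if role == "senior":
        return "ALLOW"                      # unrestricted
    if role == "contractor":
        return "ALLOW" if document_type == "public" else "DENY"
    # Remaining role: junior employee
    if document_type == "public":
        return "ALLOW"
    if document_type == "internal" and 8 <= time < 17:   # business hours [8, 17)
        return "ALLOW"
    return "DENY"                           # includes confidential at any hour


if __name__ == "__main__":
    print(ground_truth_resource_access("junior", 12, "confidential"))
    print(ground_truth_resource_access("junior", 16, "internal"))
```

The first call shows the trap: a junior employee is denied confidential documents even at noon.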
250
+
251
+ ---
252
+
253
+ ### Task 3: `_ground_truth_transaction_approval`
254
+
255
+ Rules evaluated in strict priority order (first match wins):
256
+
257
+ ```python
258
+ # Rule 1: International → COMPLIANCE_REVIEW (always, regardless of everything)
259
+ if transfer_type == "international": → COMPLIANCE_REVIEW
260
+
261
+ # Rule 2: High-value domestic outside business hours → HOLD
262
+ if amount >= 10000 and not (9 <= time < 17): → HOLD
263
+
264
+ # Rule 3: Above standard limit, not manager → REQUIRE_APPROVAL
265
+ if amount > 5000 and initiator_role != "manager": → REQUIRE_APPROVAL
266
+
267
+ # Rule 4: Everything else → APPROVE
268
+ else: → APPROVE
269
+ ```
270
+
271
+ **Critical Edge Cases**:
272
+
273
+ | amount | transfer_type | time | initiator_role | Decision | Why |
274
+ |---|---|---|---|---|---|
275
+ | 5000 | domestic | 12 | employee | **APPROVE** | At limit, not above (> 5000 fails) |
276
+ | 5001 | domestic | 12 | employee | REQUIRE_APPROVAL | Above limit, not manager |
277
+ | 5001 | domestic | 12 | manager | **APPROVE** | Manager exempt from limit |
278
+ | 10000 | domestic | 20 | employee | **HOLD** | High-value + non-business hours |
279
+ | 10000 | domestic | 12 | employee | REQUIRE_APPROVAL | High-value but business hours (Rule 2 skipped, Rule 3 matches) |
280
+ | 10000 | domestic | 17 | employee | **HOLD** | 17 is non-business hours |
281
+ | 10000 | domestic | 20 | **manager** | **HOLD** | Managers NOT exempt from HOLD rule |
282
+ | 100 | international | 12 | employee | COMPLIANCE_REVIEW | International always |
283
+ | 50000 | international | 3 | manager | COMPLIANCE_REVIEW | International trumps everything |
284
+ | 9999 | domestic | 20 | employee | REQUIRE_APPROVAL | NOT high-value (< 10000), but above limit |
285
+ | 100 | domestic | 3 | employee | APPROVE | Within limit |
286
+ | 100 | domestic | 3 | system | APPROVE | System = employee |
287
+
288
+ > [!IMPORTANT]
289
+ > **Standard limit comparison**: `amount > 5000` (strict greater than). $5,000 exactly = APPROVE.
290
+ >
291
+ > **High-value comparison**: `amount >= 10000` (greater than or equal). $10,000 exactly = high-value.
292
+ >
293
+ > **Manager exemption scope**: Only exempts from Rule 3 (standard limit). Managers are still subject to Rule 1 (international) and Rule 2 (high-value HOLD).
294
+ >
295
+ > **Business hours**: `[9, 17)`. Hour 17 is non-business.
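
The priority ordering is easiest to check by running it against the edge-case table. The sketch below assumes a function that takes the four scenario variables directly (name and signature are hypothetical); the rule order and comparisons are copied from the section above.

```python
def ground_truth_transaction_approval(amount, transfer_type, time, initiator_role):
    """Illustrative restatement of the Task 3 rules; first match wins."""
    if transfer_type == "international":
        return "COMPLIANCE_REVIEW"                  # Rule 1
    if amount >= 10000 and not (9 <= time < 17):
        return "HOLD"                               # Rule 2: managers not exempt
    if amount > 5000 and initiator_role != "manager":
        return "REQUIRE_APPROVAL"                   # Rule 3
    return "APPROVE"                                # Rule 4


# Rows copied from the "Critical Edge Cases" table.
EDGE_CASES = [
    (5000, "domestic", 12, "employee", "APPROVE"),
    (5001, "domestic", 12, "employee", "REQUIRE_APPROVAL"),
    (5001, "domestic", 12, "manager", "APPROVE"),
    (10000, "domestic", 20, "employee", "HOLD"),
    (10000, "domestic", 12, "employee", "REQUIRE_APPROVAL"),
    (10000, "domestic", 17, "employee", "HOLD"),
    (10000, "domestic", 20, "manager", "HOLD"),
    (100, "international", 12, "employee", "COMPLIANCE_REVIEW"),
    (50000, "international", 3, "manager", "COMPLIANCE_REVIEW"),
    (9999, "domestic", 20, "employee", "REQUIRE_APPROVAL"),
    (100, "domestic", 3, "employee", "APPROVE"),
    (100, "domestic", 3, "system", "APPROVE"),
]

if __name__ == "__main__":
    for *inputs, expected in EDGE_CASES:
        got = ground_truth_transaction_approval(*inputs)
        print(inputs, got, "ok" if got == expected else "MISMATCH")
```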
296
+
297
+ ---
298
+
299
+ ## 7. Clarification Oracle (ground_truth.py)
300
+
301
+ ### Matching Algorithm
302
+
303
+ ```
304
+ Input: question string (free text from agent)
305
+ Output: best matching answer from task's clarification_map
306
+
307
+ Algorithm:
308
+ 1. Lowercase the question
309
+ 2. For each keyword in clarification_map:
310
+ a. Split keyword into parts by spaces
311
+ b. Check if ALL parts appear as substrings in the question
312
+ c. Score = (number_of_parts, total_keyword_length)
313
+ d. Highest score wins
314
+ 3. If no match: return generic fallback (contains "I can provide information")
315
+ ```
316
+
317
+ **Key property**: "junior confidential" matches when BOTH "junior" AND "confidential" appear anywhere in the question (order-independent). This 2-part keyword beats any 1-part keyword like "junior" alone.
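
A minimal sketch of this matching algorithm follows. The function name, the `FALLBACK` wording, and the two demo entries are hypothetical; the report only states that the fallback contains "I can provide information" and that the score is `(number_of_parts, total_keyword_length)`.

```python
# Hypothetical fallback text; the report only fixes the quoted substring.
FALLBACK = "I can provide information about specific parts of the policy."


def answer_clarification_sketch(question: str, clarification_map: dict) -> str:
    """Return the answer whose keyword matches best, or the generic fallback."""
    q = question.lower()
    best_key, best_score = None, None
    for keyword in clarification_map:
        parts = keyword.lower().split()
        # Every part must appear as a substring; order does not matter.
        if parts and all(part in q for part in parts):
            score = (len(parts), len(keyword))
            if best_score is None or score > best_score:
                best_key, best_score = keyword, score
    return clarification_map[best_key] if best_key is not None else FALLBACK


if __name__ == "__main__":
    demo_map = {  # hypothetical entries, paraphrasing the example in this report
        "junior": "Juniors cannot access confidential documents outside business hours.",
        "junior confidential": "Juniors CANNOT access confidential documents at ANY time.",
    }
    answer = answer_clarification_sketch(
        "Can junior employees access confidential documents?", demo_map)
    print(answer)
    print("useful:", "I can provide information" not in answer)
```

Comparing tuples gives the tie-breaking for free: the part count is compared first, and keyword length only matters between keywords with the same number of parts.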
318
+
319
+ ### Usefulness Detection
320
+
321
+ In `environment.py`, line 203:
322
+ ```python
323
+ is_useful = "I can provide information" not in answer
324
+ ```
325
+ Any answer that matches a keyword entry is "useful". Only the generic fallback is "not useful".
326
+
327
+ ### 3-Tier Progressive Revelation Design
328
+
329
+ Each task's `clarification_map` has three levels:
330
+
331
+ | Tier | Keyword Type | Answer Quality | Training Purpose |
332
+ |---|---|---|---|
333
+ | Level 1 | Single short words | Partial truths, technically correct but incomplete/misleading | Agent builds initial (wrong) rules |
334
+ | Level 2 | Common phrases | More detail, boundary still ambiguous | Agent narrows down the problem |
335
+ | Level 3 | Compound/multi-word | Precise, ground-truth-aligned, corrects Level 1 | Agent fixes rules after failures |
336
+
337
+ **Example — resource_access contradiction**:
338
+ - Agent asks "What can junior employees access?" → matches `"junior"` (Level 1) → *"...but not confidential documents outside business hours"* (implies CAN during hours)
339
+ - Agent proposes rules allowing junior+confidential during hours → **fails**
340
+ - Agent asks "Can junior employees access confidential documents?" → matches `"junior confidential"` (Level 3, 2 parts > 1 part) → *"CANNOT access confidential at ANY time"*
341
+ - Agent refines rules → **passes**
342
+
343
+ ### Clarification Map Entry Counts
344
+
345
+ | Task | Level 1 | Level 2 | Level 3 | Total |
346
+ |---|---|---|---|---|
347
+ | data_access | 5 | 3 | 6 | 14 |
348
+ | resource_access | 7 | 3 | 8 | 18 |
349
+ | transaction_approval | 9 | 7 | 10 | 26 |
350
+
351
+ ---
352
+
353
+ ## 8. DSL Engine (dsl_engine.py)
354
+
355
+ ### DSL Format
356
+
357
+ ```json
358
+ {
359
+ "rules": [
360
+ {
361
+ "if": [
362
+ {"field": "<name>", "op": "<operator>", "value": <value>}
363
+ ],
364
+ "then": "<DECISION>"
365
+ }
366
+ ],
367
+ "default": "<DEFAULT_DECISION>"
368
+ }
369
+ ```
370
+
371
+ ### Supported Operators
372
+ `>`, `<`, `>=`, `<=`, `==`, `!=`
373
+
374
+ ### Validation (`validate_rules`)
375
+ Checks:
376
+ - Root is a dict
377
+ - Has `"rules"` key (must be list)
378
+ - Has `"default"` key (must be string)
379
+ - Each rule has `"if"` (list) and `"then"` (string)
380
+ - Each condition has `"field"` (string), `"op"` (valid operator), `"value"`
381
+
382
+ Returns `(is_valid: bool, errors: list[str])`.
383
+
384
+ ### Execution (`execute_rules`)
385
+ 1. Iterate rules top-to-bottom
386
+ 2. For each rule, evaluate ALL conditions (AND logic)
387
+ 3. First rule where all conditions match → return its `"then"` decision
388
+ 4. If no rules match → return `"default"`
389
+
390
+ ### Type Coercion
391
+ If scenario has `time=9` (int) and rule has `"value": "9"` (str), coerces the string to int. Works both directions. If coercion fails, condition evaluates to `False`.
392
+
393
+ ### Parsing (`parse_rules`)
394
+ 1. `json.loads()` the content string
395
+ 2. `validate_rules()` on the parsed dict
396
+ 3. Returns `(rules_data, [])` on success or `(None, errors)` on failure
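
The execution and coercion behavior described above can be sketched as follows. This is not the code in `dsl_engine.py`: the helper names are hypothetical, and treating a missing scenario field as a failed condition is an assumption the report does not confirm.

```python
import operator

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq, "!=": operator.ne}


def _coerce(actual, expected):
    """Align a numeric/string pair; may raise ValueError if conversion fails."""
    if isinstance(actual, (int, float)) and isinstance(expected, str):
        return actual, type(actual)(expected)
    if isinstance(actual, str) and isinstance(expected, (int, float)):
        return type(expected)(actual), expected
    return actual, expected


def condition_holds(cond: dict, scenario: dict) -> bool:
    if cond["field"] not in scenario:
        return False  # assumption: a missing field never matches
    try:
        actual, expected = _coerce(scenario[cond["field"]], cond["value"])
        return bool(OPS[cond["op"]](actual, expected))
    except (ValueError, TypeError, KeyError):
        return False  # failed coercion or unknown operator


def execute_rules_sketch(rules_data: dict, scenario: dict) -> str:
    """First rule whose conditions all hold wins; otherwise the default."""
    for rule in rules_data["rules"]:
        if all(condition_holds(c, scenario) for c in rule["if"]):
            return rule["then"]
    return rules_data["default"]


if __name__ == "__main__":
    rules = {
        "rules": [
            {"if": [{"field": "data_type", "op": "==", "value": "public"}],
             "then": "ALLOW"},
            {"if": [{"field": "time", "op": ">=", "value": "9"},  # str on purpose
                    {"field": "time", "op": "<", "value": 18}],
             "then": "ALLOW"},
        ],
        "default": "DENY",
    }
    print(execute_rules_sketch(rules, {"time": 9, "data_type": "sensitive"}))
    print(execute_rules_sketch(rules, {"time": 18, "data_type": "sensitive"}))
```

The string `"9"` in the second rule is deliberate: it shows the coercion path, where the scenario's integer `time` is compared against a string rule value.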
397
+
398
+ ---
399
+
400
+ ## 9. Scenario Generator (scenario_generator.py)
401
+
402
+ ### Strategy Allocation
403
+
404
+ | Strategy | Share | Purpose |
405
+ |---|---|---|
406
+ | Boundary | ~20% | Edge values near hidden param thresholds |
407
+ | Pairwise | ~30% | Systematic variable combinations |
408
+ | Adversarial | ~20% | Hand-crafted traps for common mistakes |
409
+ | Random | remainder | Uniform sampling from variable space |
410
+
411
+ All seeded with `seed=42` for reproducibility. Scenarios are deduplicated by field values.
412
+
413
+ ### Boundary Strategy
414
+ Extracts numeric hidden params, generates scenarios at `param ± 1` and at variable min/max.
415
+
416
+ ### Pairwise Strategy
417
+ For each pair of variables, samples up to 4 representative values (min, max, middle, random), generates cross-product combinations.
418
+
419
+ ### Adversarial Strategy
420
+ **Hand-crafted per task** — these are the exact scenarios:
421
+
422
+ #### data_access adversarial:
423
+ | time | data_type | Expected | Tests |
424
+ |---|---|---|---|
425
+ | 9 | sensitive | ALLOW | Start boundary |
426
+ | 18 | sensitive | DENY | End boundary (exclusive) |
427
+ | 8 | sensitive | DENY | Just before start |
428
+ | 17 | sensitive | ALLOW | Just before end |
429
+ | 0 | public | ALLOW | Public at midnight |
430
+ | 23 | internal | DENY | Internal late night |
431
+ | 12 | internal | ALLOW | Internal during hours |
432
+
433
+ #### resource_access adversarial:
434
+ | role | time | document_type | Expected | Tests |
435
+ |---|---|---|---|---|
436
+ | junior | 8 | confidential | DENY | Confidential at business start |
437
+ | junior | 7 | internal | DENY | Internal before hours |
438
+ | junior | 17 | internal | DENY | Internal at boundary (17=outside) |
439
+ | junior | 16 | internal | ALLOW | Internal just before boundary |
440
+ | contractor | 12 | internal | DENY | Contractor restricted |
441
+ | senior | 2 | confidential | ALLOW | Senior unrestricted |
442
+ | junior | 12 | public | ALLOW | Junior public during hours |
443
+ | contractor | 12 | public | ALLOW | Contractor public |
444
+
445
+ #### transaction_approval adversarial:
446
+ | amount | transfer | time | role | Expected | Tests |
447
+ |---|---|---|---|---|---|
448
+ | 5000 | domestic | 12 | employee | APPROVE | At limit (not above) |
449
+ | 5001 | domestic | 12 | employee | REQ_APPROVAL | Just above limit |
450
+ | 5001 | domestic | 12 | manager | APPROVE | Manager exempt |
451
+ | 10000 | domestic | 20 | employee | HOLD | High-value non-business |
452
+ | 10000 | domestic | 12 | employee | REQ_APPROVAL | High-value business hours |
453
+ | 100 | international | 12 | employee | COMPLIANCE | International small |
454
+ | 50000 | international | 3 | manager | COMPLIANCE | International manager |
455
+ | 9999 | domestic | 20 | employee | REQ_APPROVAL | Below high-value threshold |
456
+ | 10000 | domestic | 9 | employee | REQ_APPROVAL | High-value at business start |
457
+ | 10000 | domestic | 17 | employee | HOLD | 17=non-business |
458
+
459
+ ---
460
+
461
+ ## 10. Grading (graders.py)
462
+
463
+ ### `grade_task(task_name, rules_data, scenarios)`
464
+ 1. Validate rules → if invalid, return `score=0.0`
465
+ 2. For each scenario: execute agent's rules, compare to `expected_decision`
466
+ 3. Comparison: `actual.upper() == expected.upper()` (case-insensitive)
467
+ 4. `score = passed / total`
468
+ 5. Returns up to 5 `sample_failures` with scenario details, expected, got
469
+
470
+ ### `quick_grade(task_name, rules_data, scenarios)`
471
+ Same logic as `grade_task`, but returns only the float score. Used during step processing.
472
+
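The grading steps above can be sketched as follows. This is an illustrative approximation, not the contents of `graders.py`: the scenario layout (`fields` plus `expected_decision`), the function names, and the omission of validation and type coercion are all assumptions. Rule execution follows the DSL convention of top-to-bottom evaluation with the first match winning.

```python
import operator
from typing import Any

# The six comparison operators supported by the DSL
OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}


def execute_rules(rules_data: dict, fields: dict[str, Any]) -> str:
    """Return the decision of the first rule whose conditions all hold."""
    for rule in rules_data["rules"]:
        if all(OPS[c["op"]](fields[c["field"]], c["value"]) for c in rule["if"]):
            return rule["then"]
    return rules_data["default"]


def grade(rules_data: dict, scenarios: list[dict], max_failures: int = 5) -> dict:
    """Score = passed / total, with up to `max_failures` sample failures."""
    passed, failures = 0, []
    for s in scenarios:
        actual = execute_rules(rules_data, s["fields"])
        expected = s["expected_decision"]
        if actual.upper() == expected.upper():  # case-insensitive comparison
            passed += 1
        elif len(failures) < max_failures:
            failures.append({"scenario": s["fields"], "expected": expected, "got": actual})
    total = len(scenarios)
    return {
        "score": passed / total if total else 0.0,
        "passed": passed,
        "total": total,
        "sample_failures": failures,
    }
```

A rule set that outputs lowercase decisions such as `"allow"` still passes scenarios expecting `"ALLOW"`, because both sides are upper-cased before comparison.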
473
+ ---
474
+
475
+ ## 11. Reward System (rewards.py)
476
+
477
+ ### Per-Step Reward: `compute_reward()`
478
+
479
+ Four weighted components; the summed reward is clamped to `[0.0, 1.0]`:
480
+
481
+ | Component | Weight | Formula |
482
+ |---|---|---|
483
+ | **Accuracy** | 0.50 | `current_accuracy × 0.50` |
484
+ | **Improvement** | 0.20 | `min(delta × 2.0, 1.0) × 0.20` if delta > 0; `max(delta × 1.5, -0.5) × 0.20` if delta < 0; `0` if unchanged |
485
+ | **Efficiency** | 0.15 | `max(raw, -0.15) × 0.15`, where `raw = -0.02 × step_number`, plus `0.05 × steps_saved` when accuracy ≥ 0.9 |
486
+ | **Clarification** | 0.15 | See below |
487
+
488
+ **Clarification component details**:
489
+ - `ask_clarification` + useful + questions ≤ 3: `+0.3 × 0.15 = +0.045`
490
+ - `ask_clarification` + useful + questions > 3: `+0.1 × 0.15 = +0.015` (diminishing)
491
+ - `ask_clarification` + not useful: `-0.05 × 0.15 = -0.0075`
492
+ - `propose_rules/refine_rules` + invalid DSL: `-0.1 × 0.15 = -0.015`
493
+ - `propose_rules/refine_rules` + valid DSL: `0`
494
+
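The improvement and clarification components can be written out as a short sketch using the numbers from the table and list above. The function names and parameter names are hypothetical and need not match `rewards.py`. Returning 0 for an unrecognised action type is an assumption.

```python
def improvement_component(delta: float) -> float:
    """Weighted improvement term; `delta` is the accuracy change since the last step."""
    if delta > 0:
        return min(delta * 2.0, 1.0) * 0.20
    if delta < 0:
        return max(delta * 1.5, -0.5) * 0.20
    return 0.0


def clarification_component(
    action_type: str,
    useful: bool = False,
    questions_asked: int = 0,
    dsl_valid: bool = True,
) -> float:
    """Weighted clarification term, following the five cases listed above."""
    if action_type == "ask_clarification":
        if not useful:
            raw = -0.05
        elif questions_asked <= 3:
            raw = 0.3
        else:
            raw = 0.1  # diminishing return after the third question
    elif action_type in ("propose_rules", "refine_rules"):
        raw = 0.0 if dsl_valid else -0.1
    else:
        raw = 0.0  # assumption: unknown action types contribute nothing
    return raw * 0.15
```

For example, an accuracy gain of 0.25 contributes `0.5 × 0.20 = 0.10`, while a drop of 0.40 is capped at `-0.5 × 0.20 = -0.10`.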
495
+ ### Episode Score: `compute_episode_score()`
496
+
497
+ Used for final grading, `[0.0, 1.0]`:
498
+
499
+ ```
500
+ score = final_accuracy × 0.80
501
+ + max(0, 1 - steps/max_steps) × 0.10
502
+ + question_bonus × 0.10
503
+
504
+ question_bonus = 1.0 if questions ≤ 2
505
+ = 0.5 if questions ≤ 4
506
+ = 0.0 if questions > 4
507
+ ```
508
+
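The episode-score formula above translates directly into code. The sketch below uses hypothetical function names and assumes `max_steps` is positive. No explicit clamp is applied, because the three terms already sum to at most 1.0 when `final_accuracy` lies in `[0, 1]`.

```python
def question_bonus(questions: int) -> float:
    """Bonus tier based on how many clarification questions were asked."""
    if questions <= 2:
        return 1.0
    if questions <= 4:
        return 0.5
    return 0.0


def episode_score(final_accuracy: float, steps: int, max_steps: int, questions: int) -> float:
    """80% accuracy, 10% step efficiency, 10% question economy."""
    step_term = max(0.0, 1.0 - steps / max_steps)
    return final_accuracy * 0.80 + step_term * 0.10 + question_bonus(questions) * 0.10
```

An agent that reaches perfect accuracy but uses every step and asks five questions therefore scores 0.80, since both efficiency terms are zero.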
509
+ ---
510
+
511
+ ## 12. Inference Agent (inference.py)
512
+
513
+ ### Configuration
514
+ - Model: `Qwen/Qwen2.5-72B-Instruct` (via `HF_TOKEN`)
515
+ - API: `https://router.huggingface.co/v1` (OpenAI-compatible)
516
+ - Temperature: 0.3, Max tokens: 1024
517
+ - Env URL: `http://localhost:7860` (configurable via `ENV_BASE_URL`)
518
+
519
+ ### Agent Loop
520
+ ```
521
+ for each task in [data_access, resource_access, transaction_approval]:
522
+ result = env.reset(task)
523
+ for step in 1..max_steps:
524
+ if result.done: break
525
+ action_type, content = get_agent_action(llm, observation, step, history)
526
+ result = env.step(action)
527
+ history.append(summary)
528
+ ```
529
+
530
+ ### Prompt Design
531
+ - **System prompt**: Describes available actions, DSL format, strategy guidelines
532
+ - **User prompt**: Built per-step with policy text, feedback, clarification answers, test results, sample failures, DSL format, action history (last 3)
533
+ - LLM response parsed as JSON: `{"action_type": "...", "content": "..."}`
534
+ - Handles markdown code blocks (`\`\`\`json ... \`\`\``)
535
+ - Fallback: if unparseable, tries extracting `"rules"`, otherwise submits empty rules
536
+
537
+ ### Output Format
538
+ ```
539
+ [START] task=<name> env=policy_to_logic model=<model>
540
+ [STEP] step=<n> action=<summary> reward=<float> done=<bool> error=<msg|null>
541
+ [END] success=<bool> steps=<n> score=<float> rewards=<r1,r2,...>
542
+ ```
543
+
544
+ ---
545
+
546
+ ## 13. Client Library (client.py)
547
+
548
+ HTTP client using `requests.Session()`:
549
+ - `reset(task_name)` → POST `/reset` → `PolicyToLogicStepResult`
550
+ - `step(action)` → POST `/step` → `PolicyToLogicStepResult`
551
+ - `state()` → GET `/state` → `PolicyToLogicState`
552
+ - `health()` → GET `/health` → dict
553
+ - `list_tasks()` → GET `/tasks` → dict
554
+ - Context manager support (`with PolicyToLogicEnv() as env:`)
555
+
556
+ ---
557
+
558
+ ## 14. Deployment
559
+
560
+ ### Dockerfile
561
+ ```dockerfile
562
+ FROM python:3.11-slim
563
+ WORKDIR /app
564
+ COPY policy_to_logic_env/server/requirements.txt → pip install
565
+ COPY policy_to_logic_env/, main.py, inference.py
566
+ EXPOSE 7860
567
+ HEALTHCHECK: curl -f http://localhost:7860/health
568
+ CMD: python -m uvicorn policy_to_logic_env.server.app:app --host 0.0.0.0 --port 7860
569
+ ```
570
+
571
+ ### HF Spaces Config (README.md)
572
+ ```yaml
573
+ sdk: docker
574
+ app_port: 7860
575
+ ```
576
+
577
+ Live at: `https://godreign-policy2logic.hf.space`
578
+
579
+ ---
580
+
581
+ ## 15. Known Design Decisions & Limitations
582
+
583
+ 1. **Single-session**: The server holds one global environment instance, so concurrent clients will interfere with each other. Suitable for sequential benchmarking, not parallel RL training.
584
+
585
+ 2. **Deterministic scenarios**: `seed=42` always produces the same scenarios, so the agent is graded on the same set every episode. This removes scenario-sampling variance between episodes, but an agent could memorize the fixed set rather than generalize.
586
+
587
+ 3. **Stateful server**: The environment holds state in memory. Server restart loses episode state. No persistence layer.
588
+
589
+ 4. **Clarification is keyword-based**: The oracle is not an LLM — it's a deterministic keyword matcher. Agent questions that don't contain any keyword get the generic fallback (penalized as "not useful").
590
+
591
+ 5. **Progressive revelation by design**: Level 1 clarification answers are intentionally misleading partial truths. This is NOT a bug — it's the core RL training signal. Agents that trust Level 1 answers will fail and must learn to ask better (Level 3) questions.
592
+
593
+ 6. **No `refine_rules` before `propose_rules`**: If the agent tries to refine before proposing, the environment returns a feedback message. This is not treated as an error; the step earns 0 reward plus the feedback.
594
+
595
+ 7. **Case-insensitive grading**: `actual.upper() == expected.upper()`. Agent can output "allow" or "Allow" or "ALLOW".
596
+
597
+ 8. **DSL type coercion**: Integer-string mismatches are auto-coerced. `"9"` and `9` compare equally.
policy_to_logic_env/openenv.yaml CHANGED
@@ -15,6 +15,9 @@ tags:
15
  environment:
16
  entry_point: "policy_to_logic_env.server.app:app"
17
  python_version: ">=3.10"
 
 
 
18
 
19
  tasks:
20
  - name: data_access
 
15
  environment:
16
  entry_point: "policy_to_logic_env.server.app:app"
17
  python_version: ">=3.10"
18
+ reset_endpoint: /reset
19
+ step_endpoint: /step
20
+ state_endpoint: /state
21
 
22
  tasks:
23
  - name: data_access
pyproject.toml CHANGED
@@ -11,6 +11,8 @@ dependencies = [
11
  "openai>=1.0.0",
12
  "huggingface>=0.0.1",
13
  "huggingface-hub>=1.12.0",
 
 
14
  ]
15
 
16
  [project.optional-dependencies]
 
11
  "openai>=1.0.0",
12
  "huggingface>=0.0.1",
13
  "huggingface-hub>=1.12.0",
14
+ "matplotlib>=3.7.0",
15
+ "numpy>=1.24.0",
16
  ]
17
 
18
  [project.optional-dependencies]
training/colab_training.ipynb ADDED
@@ -0,0 +1,378 @@
1
+ {
2
+ "nbformat": 4,
3
+ "nbformat_minor": 0,
4
+ "metadata": {
5
+ "colab": {
6
+ "provenance": [],
7
+ "name": "Policy-to-Logic Training"
8
+ },
9
+ "kernelspec": {
10
+ "name": "python3",
11
+ "display_name": "Python 3"
12
+ },
13
+ "language_info": {
14
+ "name": "python"
15
+ }
16
+ },
17
+ "cells": [
18
+ {
19
+ "cell_type": "markdown",
20
+ "metadata": {},
21
+ "source": [
22
+ "# Policy-to-Logic RL Environment — Training Notebook\n",
23
+ "\n",
24
+ "This notebook runs the **reward-guided trajectory optimization loop** against the deployed environment.\n",
25
+ "\n",
26
+ "**What it does:**\n",
27
+ "1. Connects to the live HF Spaces environment\n",
28
+ "2. Runs 8 episodes per task (3 tasks = 24 total episodes)\n",
29
+ "3. Accumulates high-reward trajectories as few-shot examples\n",
30
+ "4. Generates training evidence plots (reward curve, accuracy curve, improvement chart)"
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "code",
35
+ "execution_count": null,
36
+ "metadata": {},
37
+ "outputs": [],
38
+ "source": [
39
+ "# Cell 1: Install dependencies\n",
40
+ "!pip install openai requests matplotlib numpy"
41
+ ]
42
+ },
43
+ {
44
+ "cell_type": "code",
45
+ "execution_count": null,
46
+ "metadata": {},
47
+ "outputs": [],
48
+ "source": [
49
+ "# Cell 2: Configuration\n",
50
+ "import os\n",
51
+ "\n",
52
+ "# SET THESE BEFORE RUNNING\n",
53
+ "HF_TOKEN = \"\" # Your Hugging Face token with inference access\n",
54
+ "ENV_URL = \"https://godreign-policy2logic.hf.space\" # Your deployed environment URL\n",
55
+ "\n",
56
+ "os.environ[\"HF_TOKEN\"] = HF_TOKEN\n",
57
+ "os.environ[\"ENV_BASE_URL\"] = ENV_URL\n",
58
+ "\n",
59
+ "print(f\"Environment URL: {ENV_URL}\")\n",
60
+ "print(f\"HF Token set: {'Yes' if HF_TOKEN else 'NO - MUST SET THIS'}\")"
61
+ ]
62
+ },
63
+ {
64
+ "cell_type": "code",
65
+ "execution_count": null,
66
+ "metadata": {},
67
+ "outputs": [],
68
+ "source": [
69
+ "# Cell 3: Verify environment is reachable\n",
70
+ "import requests\n",
71
+ "\n",
72
+ "r = requests.get(f\"{ENV_URL}/health\", timeout=10)\n",
73
+ "print(f\"Status: {r.status_code}\")\n",
74
+ "print(f\"Response: {r.json()}\")\n",
75
+ "\n",
76
+ "r2 = requests.get(f\"{ENV_URL}/tasks\", timeout=10)\n",
77
+ "tasks = r2.json()\n",
78
+ "print(f\"\\nAvailable tasks: {list(tasks['tasks'].keys())}\")"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": null,
84
+ "metadata": {},
85
+ "outputs": [],
86
+ "source": [
87
+ "# Cell 4: Training loop implementation\n",
88
+ "# Full contents of training/trajectory_optimizer.py\n",
89
+ "\n",
90
+ "import json\n",
91
+ "import os\n",
92
+ "import time\n",
93
+ "import requests\n",
94
+ "from dataclasses import dataclass, field\n",
95
+ "from typing import Optional\n",
96
+ "from openai import OpenAI\n",
97
+ "\n",
98
+ "# ── Configuration ────────────────────────────────────────────────────────────\n",
99
+ "\n",
100
+ "ENV_BASE_URL = os.getenv(\"ENV_BASE_URL\", \"http://localhost:7860\")\n",
101
+ "HF_TOKEN = os.getenv(\"HF_TOKEN\", \"\")\n",
102
+ "MODEL = \"Qwen/Qwen2.5-72B-Instruct\"\n",
103
+ "TEMPERATURE = 0.3\n",
104
+ "MAX_TOKENS = 1024\n",
105
+ "\n",
106
+ "NUM_EPISODES_PER_TASK = 8\n",
107
+ "TOP_K_TRAJECTORIES = 3\n",
108
+ "MIN_REWARD_THRESHOLD = 0.3\n",
109
+ "TASKS = [\"data_access\", \"resource_access\", \"transaction_approval\"]\n",
110
+ "\n",
111
+ "@dataclass\n",
112
+ "class Step:\n",
113
+ " step_number: int\n",
114
+ " action_type: str\n",
115
+ " action_content: str\n",
116
+ " reward: float\n",
117
+ " accuracy: float\n",
118
+ " feedback: str\n",
119
+ " clarification_response: Optional[str] = None\n",
120
+ "\n",
121
+ "@dataclass\n",
122
+ "class Trajectory:\n",
123
+ " task_name: str\n",
124
+ " episode_id: int\n",
125
+ " steps: list[Step] = field(default_factory=list)\n",
126
+ " total_reward: float = 0.0\n",
127
+ " final_accuracy: float = 0.0\n",
128
+ " success: bool = False\n",
129
+ "\n",
130
+ " def to_few_shot_string(self) -> str:\n",
131
+ " lines = [f\"=== Example Episode (reward={self.total_reward:.2f}, accuracy={self.final_accuracy:.2f}) ===\"]\n",
132
+ " for s in self.steps:\n",
133
+ " lines.append(f\"Step {s.step_number}: action={s.action_type}\")\n",
134
+ " lines.append(f\" Content: {s.action_content[:200]}\")\n",
135
+ " lines.append(f\" Result: accuracy={s.accuracy:.2f}, reward={s.reward:.2f}\")\n",
136
+ " if s.feedback:\n",
137
+ " lines.append(f\" Feedback: {s.feedback[:150]}\")\n",
138
+ " return \"\\n\".join(lines)\n",
139
+ "\n",
140
+ "class EnvClient:\n",
141
+ " def __init__(self, base_url: str):\n",
142
+ " self.base_url = base_url.rstrip(\"/\")\n",
143
+ " self.session = requests.Session()\n",
144
+ "\n",
145
+ " def reset(self, task_name: str) -> dict:\n",
146
+ " r = self.session.post(f\"{self.base_url}/reset\", json={\"task_name\": task_name})\n",
147
+ " r.raise_for_status()\n",
148
+ " return r.json()\n",
149
+ "\n",
150
+ " def step(self, action_type: str, content: str) -> dict:\n",
151
+ " r = self.session.post(f\"{self.base_url}/step\", json={\"action_type\": action_type, \"content\": content})\n",
152
+ " r.raise_for_status()\n",
153
+ " return r.json()\n",
154
+ "\n",
155
+ " def health(self) -> bool:\n",
156
+ " try:\n",
157
+ " r = self.session.get(f\"{self.base_url}/health\", timeout=5)\n",
158
+ " return r.status_code == 200\n",
159
+ " except Exception:\n",
160
+ " return False\n",
161
+ "\n",
162
+ "class Agent:\n",
163
+ " def __init__(self, hf_token: str):\n",
164
+ " self.client = OpenAI(base_url=\"https://router.huggingface.co/v1\", api_key=hf_token)\n",
165
+ "\n",
166
+ " def get_action(self, observation, step_number, episode_history, few_shot_examples):\n",
167
+ " system_prompt = self._build_system_prompt(few_shot_examples)\n",
168
+ " user_prompt = self._build_user_prompt(observation, step_number, episode_history)\n",
169
+ " try:\n",
170
+ " response = self.client.chat.completions.create(\n",
171
+ " model=MODEL,\n",
172
+ " messages=[{\"role\": \"system\", \"content\": system_prompt}, {\"role\": \"user\", \"content\": user_prompt}],\n",
173
+ " temperature=TEMPERATURE, max_tokens=MAX_TOKENS\n",
174
+ " )\n",
175
+ " raw = response.choices[0].message.content.strip()\n",
176
+ " return self._parse_response(raw, observation)\n",
177
+ " except Exception as e:\n",
178
+ " print(f\" [LLM ERROR] {e}\")\n",
179
+ " return \"propose_rules\", json.dumps({\"rules\": [], \"default\": \"DENY\"})\n",
180
+ "\n",
181
+ " def _build_system_prompt(self, few_shot_examples):\n",
182
+ " base = \"\"\"You are a policy-to-logic agent. Convert natural language policies into executable rules.\n",
183
+ "\n",
184
+ "AVAILABLE ACTIONS:\n",
185
+ "1. ask_clarification: {\"type\": \"clarification\", \"question\": \"your question\"}\n",
186
+ "2. propose_rules: {\"rules\": [...], \"default\": \"DECISION\"}\n",
187
+ "3. refine_rules: {\"rules\": [...], \"default\": \"DECISION\"}\n",
188
+ "\n",
189
+ "DSL FORMAT: {\"rules\": [{\"if\": [{\"field\": \"NAME\", \"op\": \"OP\", \"value\": VAL}], \"then\": \"DECISION\"}], \"default\": \"FALLBACK\"}\n",
190
+ "Operators: >, <, >=, <=, ==, !=. Rules execute top-to-bottom, first match wins.\n",
191
+ "\n",
192
+ "STRATEGY: Ask 1-2 clarifications first, then propose rules, then refine based on failures.\n",
193
+ "OUTPUT: Respond ONLY with valid JSON: {\"action_type\": \"...\", \"content\": \"...\"}\"\"\"\n",
194
+ " if few_shot_examples:\n",
195
+ " base += \"\\n\\nLEARNED FROM PREVIOUS EPISODES:\\n\"\n",
196
+ " for traj in few_shot_examples[-TOP_K_TRAJECTORIES:]:\n",
197
+ " base += \"\\n\" + traj.to_few_shot_string() + \"\\n\"\n",
198
+ " return base\n",
199
+ "\n",
200
+ " def _build_user_prompt(self, obs, step, history):\n",
201
+ " lines = [f\"TASK: {obs.get('task_name', 'unknown')}\", f\"STEP: {step} of {obs.get('max_steps', 7)}\", f\"\\nPOLICY:\\n{obs.get('policy_text', '')}\"]\n",
202
+ " if obs.get(\"clarification_response\"): lines.append(f\"\\nLAST CLARIFICATION:\\n{obs['clarification_response']}\")\n",
203
+ " if obs.get(\"test_results\"):\n",
204
+ " tr = obs[\"test_results\"]\n",
205
+ " lines.append(f\"\\nTEST RESULTS: {tr.get('passed',0)}/{tr.get('total',0)} (acc={obs.get('current_accuracy',0):.2f})\")\n",
206
+ " if tr.get(\"sample_failures\"): lines.extend([f\" - {f}\" for f in tr[\"sample_failures\"][:3]])\n",
207
+ " if obs.get(\"feedback\"): lines.append(f\"\\nFEEDBACK: {obs['feedback']}\")\n",
208
+ " if history: lines.append(f\"\\nHISTORY:\\n\" + \"\\n\".join(history[-3:]))\n",
209
+ " lines.append(f\"\\nAVAILABLE: {obs.get('available_actions', [])}\")\n",
210
+ " lines.append(\"\\nRespond with JSON only.\")\n",
211
+ " return \"\\n\".join(lines)\n",
212
+ "\n",
213
+ " def _parse_response(self, raw, obs):\n",
214
+ " if \"```\" in raw:\n",
215
+ " raw = raw.split(\"```\")[1]\n",
216
+ " if raw.startswith(\"json\"): raw = raw[4:]\n",
217
+ " raw = raw.strip()\n",
218
+ " try:\n",
219
+ " parsed = json.loads(raw)\n",
220
+ " action_type = parsed.get(\"action_type\", \"propose_rules\")\n",
221
+ " content = parsed.get(\"content\", \"{}\")\n",
222
+ " valid = obs.get(\"available_actions\", [\"propose_rules\"])\n",
223
+ " if action_type not in valid: action_type = valid[0]\n",
224
+ " if isinstance(content, dict): content = json.dumps(content)\n",
225
+ " return action_type, content\n",
226
+ " except Exception: return \"propose_rules\", json.dumps({\"rules\": [], \"default\": \"DENY\"})\n",
227
+ "\n",
228
+ "class TrajectoryBank:\n",
229
+ " def __init__(self): self.bank = {task: [] for task in TASKS}\n",
230
+ " def store(self, t):\n",
231
+ " if t.total_reward >= MIN_REWARD_THRESHOLD:\n",
232
+ " self.bank[t.task_name].append(t)\n",
233
+ " self.bank[t.task_name].sort(key=lambda x: x.total_reward, reverse=True)\n",
234
+ " self.bank[t.task_name] = self.bank[t.task_name][:TOP_K_TRAJECTORIES]\n",
235
+ " def get_examples(self, task): return self.bank.get(task, [])\n",
236
+ " def summary(self): return {t: {\"stored\": len(v), \"best_reward\": max((x.total_reward for x in v), default=0)} for t,v in self.bank.items()}\n",
237
+ "\n",
238
+ "class TrainingLoop:\n",
239
+ " def __init__(self, env_url, hf_token):\n",
240
+ " self.env = EnvClient(env_url)\n",
241
+ " self.agent = Agent(hf_token)\n",
242
+ " self.bank = TrajectoryBank()\n",
243
+ " self.metrics = []\n",
244
+ "\n",
245
+ " def run_episode(self, task_name, episode_id):\n",
246
+ " few_shots = self.bank.get_examples(task_name)\n",
247
+ " traj = Trajectory(task_name=task_name, episode_id=episode_id)\n",
248
+ " result = self.env.reset(task_name)\n",
249
+ " obs, done, history = result.get(\"observation\", {}), result.get(\"done\", False), []\n",
250
+ " print(f\" [Episode {episode_id}] task={task_name} few_shots={len(few_shots)}\")\n",
251
+ " step_num = 0\n",
252
+ " while not done and step_num < obs.get(\"max_steps\", 7):\n",
253
+ " step_num += 1\n",
254
+ " action_type, content = self.agent.get_action(obs, step_num, history, few_shots)\n",
255
+ " result = self.env.step(action_type, content)\n",
256
+ " reward, done = result.get(\"reward\", 0.0), result.get(\"done\", False)\n",
257
+ " obs, info = result.get(\"observation\", {}), result.get(\"info\", {})\n",
258
+ " step = Step(step_num, action_type, content[:300], reward, obs.get(\"current_accuracy\", 0.0), obs.get(\"feedback\", \"\") or \"\", obs.get(\"clarification_response\"))\n",
259
+ " traj.steps.append(step); traj.total_reward += reward\n",
260
+ " history.append(f\"Step {step_num}: {action_type} -> reward={reward:.2f} acc={step.accuracy:.2f}\")\n",
261
+ " print(f\" step={step_num} action={action_type} reward={reward:.3f} acc={step.accuracy:.2f}\")\n",
262
+ " if done:\n",
263
+ " traj.final_accuracy = info.get(\"episode_score\", obs.get(\"current_accuracy\", 0.0))\n",
264
+ " traj.success = obs.get(\"current_accuracy\", 0.0) >= 0.9\n",
265
+ " break\n",
266
+ " if not traj.steps: traj.final_accuracy = 0.0\n",
267
+ " return traj\n",
268
+ "\n",
269
+ " def run(self):\n",
270
+ " print(\"=\" * 60)\n",
271
+ " print(\"REWARD-GUIDED TRAJECTORY OPTIMIZATION\")\n",
272
+ " print(f\"Tasks: {TASKS}, Episodes/task: {NUM_EPISODES_PER_TASK}\")\n",
273
+ " print(\"=\" * 60)\n",
274
+ " if not self.env.health(): raise RuntimeError(f\"Env not reachable at {self.env.base_url}\")\n",
275
+ " print(f\"Environment: OK\\n\")\n",
276
+ " global_ep = 0\n",
277
+ " for task in TASKS:\n",
278
+ " print(f\"\\n--- TASK: {task} ---\")\n",
279
+ " task_rewards = []\n",
280
+ " for ep in range(1, NUM_EPISODES_PER_TASK + 1):\n",
281
+ " global_ep += 1\n",
282
+ " traj = self.run_episode(task, ep)\n",
283
+ " self.bank.store(traj)\n",
284
+ " self.metrics.append({\"global_episode\": global_ep, \"task\": task, \"episode_in_task\": ep, \"total_reward\": traj.total_reward, \"final_accuracy\": traj.final_accuracy, \"success\": traj.success, \"num_steps\": len(traj.steps)})\n",
285
+ " task_rewards.append(traj.total_reward)\n",
286
+ " print(f\" -> Ep {ep}: reward={traj.total_reward:.3f} acc={traj.final_accuracy:.2f} success={traj.success}\")\n",
287
+ " time.sleep(0.5)\n",
288
+ " print(f\" Improvement: {task_rewards[-1] - task_rewards[0]:+.3f}\")\n",
289
+ " print(\"\\n\" + \"=\" * 60 + \"\\nTRAINING COMPLETE\\n\" + \"=\" * 60)\n",
290
+ " return self.metrics\n",
291
+ "\n",
292
+ "def save_plots(metrics):\n",
293
+ " import matplotlib; matplotlib.use(\"Agg\")\n",
294
+ " import matplotlib.pyplot as plt; import numpy as np\n",
295
+ " os.makedirs(\"training/plots\", exist_ok=True)\n",
296
+ " episodes = [m[\"global_episode\"] for m in metrics]\n",
297
+ " rewards = [m[\"total_reward\"] for m in metrics]\n",
298
+ " colors = {\"data_access\": \"#2196F3\", \"resource_access\": \"#FF9800\", \"transaction_approval\": \"#4CAF50\"}\n",
299
+ " # Plot 1: Reward\n",
300
+ " fig, ax = plt.subplots(figsize=(10, 5))\n",
301
+ " for task in TASKS:\n",
302
+ " te = [m[\"global_episode\"] for m in metrics if m[\"task\"]==task]\n",
303
+ " tr = [m[\"total_reward\"] for m in metrics if m[\"task\"]==task]\n",
304
+ " ax.plot(te, tr, marker=\"o\", label=task, color=colors.get(task), linewidth=2, markersize=5)\n",
305
+ " z = np.polyfit(episodes, rewards, 1); p = np.poly1d(z)\n",
306
+ " ax.plot(episodes, p(episodes), \"--\", color=\"red\", alpha=0.5, label=\"trend\")\n",
307
+ " ax.set_xlabel(\"Episode\"); ax.set_ylabel(\"Total Reward\"); ax.set_title(\"Reward Curve\"); ax.legend(); ax.grid(True, alpha=0.3); ax.set_ylim(bottom=0)\n",
308
+ " plt.tight_layout(); plt.savefig(\"training/plots/reward_curve.png\", dpi=150); plt.close()\n",
309
+ " # Plot 2: Accuracy\n",
310
+ " fig, ax = plt.subplots(figsize=(10, 5))\n",
311
+ " for task in TASKS:\n",
312
+ " te = [m[\"global_episode\"] for m in metrics if m[\"task\"]==task]\n",
313
+ " ta = [m[\"final_accuracy\"] for m in metrics if m[\"task\"]==task]\n",
314
+ " ax.plot(te, ta, marker=\"s\", label=task, color=colors.get(task), linewidth=2, markersize=5)\n",
315
+ " ax.axhline(y=0.9, color=\"red\", linestyle=\"--\", alpha=0.7, label=\"threshold\")\n",
316
+ " ax.set_xlabel(\"Episode\"); ax.set_ylabel(\"Accuracy\"); ax.set_title(\"Accuracy Curve\"); ax.legend(); ax.grid(True, alpha=0.3); ax.set_ylim(0, 1.05)\n",
317
+ " plt.tight_layout(); plt.savefig(\"training/plots/accuracy_curve.png\", dpi=150); plt.close()\n",
318
+ " # Plot 3: Improvement\n",
319
+ " fig, ax = plt.subplots(figsize=(8, 5))\n",
320
+ " tnames, imps = [], []\n",
321
+ " for task in TASKS:\n",
322
+ " accs = [m[\"final_accuracy\"] for m in metrics if m[\"task\"]==task]\n",
323
+ " if len(accs) >= 2: tnames.append(task.replace(\"_\",\"\\n\")); imps.append(accs[-1]-accs[0])\n",
324
+ " bars = ax.bar(tnames, imps, color=[\"#2196F3\",\"#FF9800\",\"#4CAF50\"][:len(tnames)])\n",
325
+ " ax.axhline(y=0, color=\"black\"); ax.set_ylabel(\"Improvement\"); ax.set_title(\"Per-Task Improvement\"); ax.grid(True, axis=\"y\", alpha=0.3)\n",
326
+ " for bar, val in zip(bars, imps): ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+0.01, f\"{val:+.2f}\", ha=\"center\", fontweight=\"bold\")\n",
327
+ " plt.tight_layout(); plt.savefig(\"training/plots/improvement_chart.png\", dpi=150); plt.close()\n",
328
+ " with open(\"training/plots/metrics.json\", \"w\") as f: json.dump(metrics, f, indent=2)\n",
329
+ " print(\"All plots saved to training/plots/\")"
330
+ ]
331
+ },
332
+ {
333
+ "cell_type": "code",
334
+ "execution_count": null,
335
+ "metadata": {},
336
+ "outputs": [],
337
+ "source": [
338
+ "# Cell 5: Run training loop\n",
339
+ "loop = TrainingLoop(ENV_URL, HF_TOKEN)\n",
340
+ "metrics = loop.run()\n",
341
+ "print(f\"\\nTotal episodes run: {len(metrics)}\")"
342
+ ]
343
+ },
344
+ {
345
+ "cell_type": "code",
346
+ "execution_count": null,
347
+ "metadata": {},
348
+ "outputs": [],
349
+ "source": [
350
+ "# Cell 6: Generate plots and display inline\n",
351
+ "save_plots(metrics)\n",
352
+ "\n",
353
+ "from IPython.display import Image, display\n",
354
+ "display(Image(\"training/plots/reward_curve.png\"))\n",
355
+ "display(Image(\"training/plots/accuracy_curve.png\"))\n",
356
+ "display(Image(\"training/plots/improvement_chart.png\"))"
357
+ ]
358
+ },
359
+ {
360
+ "cell_type": "code",
361
+ "execution_count": null,
362
+ "metadata": {},
363
+ "outputs": [],
364
+ "source": [
365
+ "# Cell 7: Download plots to commit to repo\n",
366
+ "# After running this, download the files and commit them to your GitHub repo\n",
367
+ "from google.colab import files\n",
368
+ "\n",
369
+ "files.download(\"training/plots/reward_curve.png\")\n",
370
+ "files.download(\"training/plots/accuracy_curve.png\")\n",
371
+ "files.download(\"training/plots/improvement_chart.png\")\n",
372
+ "files.download(\"training/plots/metrics.json\")\n",
373
+ "\n",
374
+ "print(\"Downloaded. Now commit these files to: training/plots/ in your repo.\")"
375
+ ]
376
+ }
377
+ ]
378
+ }
training/trajectory_optimizer.py ADDED
@@ -0,0 +1,506 @@
1
+ """
2
+ Reward-Guided Trajectory Optimization Loop
3
+ ==========================================
4
+ Optimizes agent behavior across episodes by accumulating high-reward
5
+ trajectories as few-shot examples. Uses environment reward signal to
6
+ drive improvement — no weight updates required.
7
+
8
+ This implements a policy improvement loop where:
9
+ - reward_signal → trajectory_selection → context_construction → improved_policy
10
+ """
11
+
12
+ import json
13
+ import os
14
+ import time
15
+ import requests
16
+ from dataclasses import dataclass, field
17
+ from typing import Optional
18
+ from openai import OpenAI
19
+
20
+ # ── Configuration ────────────────────────────────────────────────────────────
21
+
22
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
23
+ HF_TOKEN = os.getenv("HF_TOKEN", "")
24
+ MODEL = "Qwen/Qwen2.5-72B-Instruct"
25
+ TEMPERATURE = 0.3
26
+ MAX_TOKENS = 1024
27
+
28
+ # Training hyperparameters
29
+ NUM_EPISODES_PER_TASK = 8 # Episodes to run per task
30
+ TOP_K_TRAJECTORIES = 3 # Max few-shot examples to keep
31
+ MIN_REWARD_THRESHOLD = 0.3 # Minimum reward to store trajectory
32
+ TASKS = ["data_access", "resource_access", "transaction_approval"]
33
+
34
+ # ── Data Structures ──────────────────────────────────────────────────────────
35
+
36
+ @dataclass
37
+ class Step:
38
+ step_number: int
39
+ action_type: str
40
+ action_content: str
41
+ reward: float
42
+ accuracy: float
43
+ feedback: str
44
+ clarification_response: Optional[str] = None
45
+
46
+ @dataclass
47
+ class Trajectory:
48
+ task_name: str
49
+ episode_id: int
50
+ steps: list[Step] = field(default_factory=list)
51
+ total_reward: float = 0.0
52
+ final_accuracy: float = 0.0
53
+ success: bool = False
54
+
55
+ def to_few_shot_string(self) -> str:
56
+ """Convert trajectory to a few-shot example string for prompting."""
57
+ lines = [
58
+ f"=== Example Episode (reward={self.total_reward:.2f}, accuracy={self.final_accuracy:.2f}) ===",
59
+ ]
60
+ for s in self.steps:
61
+ lines.append(f"Step {s.step_number}: action={s.action_type}")
62
+ lines.append(f" Content: {s.action_content[:200]}")
63
+ lines.append(f" Result: accuracy={s.accuracy:.2f}, reward={s.reward:.2f}")
64
+ if s.feedback:
65
+ lines.append(f" Feedback: {s.feedback[:150]}")
66
+ return "\n".join(lines)
67
+
68
+ # ── Environment Client ────────────────────────────────────────────────────────
69
+
70
+ class EnvClient:
71
+ def __init__(self, base_url: str):
72
+ self.base_url = base_url.rstrip("/")
73
+ self.session = requests.Session()
74
+
75
+ def reset(self, task_name: str) -> dict:
76
+ r = self.session.post(f"{self.base_url}/reset", json={"task_name": task_name})
77
+ r.raise_for_status()
78
+ return r.json()
79
+
80
+ def step(self, action_type: str, content: str) -> dict:
81
+ r = self.session.post(f"{self.base_url}/step", json={
82
+ "action_type": action_type,
83
+ "content": content
84
+ })
85
+ r.raise_for_status()
86
+ return r.json()
87
+
88
+ def health(self) -> bool:
89
+ try:
90
+ r = self.session.get(f"{self.base_url}/health", timeout=5)
91
+ return r.status_code == 200
92
+ except Exception:
93
+ return False
94
+
95
+ # ── LLM Agent ────────────────────────────────────────────────────────────────
+
+ class Agent:
+     def __init__(self, hf_token: str):
+         self.client = OpenAI(
+             base_url="https://router.huggingface.co/v1",
+             api_key=hf_token
+         )
+
+     def get_action(
+         self,
+         observation: dict,
+         step_number: int,
+         episode_history: list[str],
+         few_shot_examples: list[Trajectory]
+     ) -> tuple[str, str]:
+         """
+         Returns (action_type, content_json_string).
+         action_type: one of ask_clarification | propose_rules | refine_rules
+         content: JSON string appropriate for that action
+         """
+         system_prompt = self._build_system_prompt(few_shot_examples)
+         user_prompt = self._build_user_prompt(observation, step_number, episode_history)
+
+         try:
+             response = self.client.chat.completions.create(
+                 model=MODEL,
+                 messages=[
+                     {"role": "system", "content": system_prompt},
+                     {"role": "user", "content": user_prompt}
+                 ],
+                 temperature=TEMPERATURE,
+                 max_tokens=MAX_TOKENS
+             )
+             raw = response.choices[0].message.content.strip()
+             return self._parse_response(raw, observation)
+         except Exception as e:
+             print(f" [LLM ERROR] {e}")
+             return "propose_rules", json.dumps({"rules": [], "default": "DENY"})
+
+     def _build_system_prompt(self, few_shot_examples: list[Trajectory]) -> str:
+         base = """You are a policy-to-logic agent. Your job is to convert natural language policies into executable rules.
+
+ AVAILABLE ACTIONS:
+ 1. ask_clarification: {"type": "clarification", "question": "your question"}
+ 2. propose_rules: {"rules": [...], "default": "DECISION"}
+ 3. refine_rules: {"rules": [...], "default": "DECISION"}
+
+ DSL FORMAT for rules:
+ {
+   "rules": [
+     {
+       "if": [
+         {"field": "FIELD_NAME", "op": "OPERATOR", "value": VALUE}
+       ],
+       "then": "DECISION"
+     }
+   ],
+   "default": "FALLBACK_DECISION"
+ }
+
+ Operators: >, <, >=, <=, ==, !=
+ Rules execute top-to-bottom. First match wins. Default applies if no rule matches.
+
+ STRATEGY:
+ - Step 1: Ask 1-2 targeted clarification questions about ambiguous terms
+ - Step 2: Propose initial rules based on policy + clarifications
+ - Step 3+: Refine rules based on failure feedback
+
+ OUTPUT FORMAT: Respond ONLY with valid JSON. No markdown. No explanation.
+ {"action_type": "propose_rules", "content": "{...escaped json string...}"}
+ """
+         if few_shot_examples:
+             base += "\n\nLEARNED FROM PREVIOUS EPISODES (high-reward strategies):\n"
+             for traj in few_shot_examples[-TOP_K_TRAJECTORIES:]:
+                 base += "\n" + traj.to_few_shot_string() + "\n"
+         return base
+
+     def _build_user_prompt(self, obs: dict, step: int, history: list[str]) -> str:
+         lines = [
+             f"TASK: {obs.get('task_name', 'unknown')}",
+             f"STEP: {step} of {obs.get('max_steps', 7)}",
+             f"\nPOLICY:\n{obs.get('policy_text', '')}",
+         ]
+         if obs.get("clarification_response"):
+             lines.append(f"\nLAST CLARIFICATION ANSWER:\n{obs['clarification_response']}")
+         if obs.get("test_results"):
+             tr = obs["test_results"]
+             lines.append(f"\nTEST RESULTS: {tr.get('passed', 0)}/{tr.get('total', 0)} passed (accuracy={obs.get('current_accuracy', 0):.2f})")
+             if tr.get("sample_failures"):
+                 lines.append("SAMPLE FAILURES:")
+                 for f in tr["sample_failures"][:3]:
+                     lines.append(f" - {f}")
+         if obs.get("feedback"):
+             lines.append(f"\nFEEDBACK: {obs['feedback']}")
+         if history:
+             lines.append("\nACTION HISTORY (last 3):\n" + "\n".join(history[-3:]))
+         lines.append(f"\nAVAILABLE ACTIONS: {obs.get('available_actions', [])}")
+         lines.append("\nRespond with JSON only: {\"action_type\": \"...\", \"content\": \"...\"}")
+         return "\n".join(lines)
+
+     def _parse_response(self, raw: str, obs: dict) -> tuple[str, str]:
+         # Strip markdown code fences if present
+         if "```" in raw:
+             raw = raw.split("```")[1].strip()
+             # Drop an optional language tag such as "json" or "JSON" after the opening fence
+             if raw.lower().startswith("json"):
+                 raw = raw[4:]
+         raw = raw.strip()
+
+         try:
+             parsed = json.loads(raw)
+             action_type = parsed.get("action_type", "propose_rules")
+             content = parsed.get("content", "{}")
+
+             # Validate action_type
+             valid_actions = obs.get("available_actions", ["propose_rules", "ask_clarification"])
+             if action_type not in valid_actions:
+                 action_type = "propose_rules" if "propose_rules" in valid_actions else valid_actions[0]
+
+             # Ensure content is a string (the model may return a nested object or a list)
+             if not isinstance(content, str):
+                 content = json.dumps(content)
+             return action_type, content
+         except Exception:
+             # Malformed JSON, a non-object payload, or an empty action list: fall back to deny-all
+             return "propose_rules", json.dumps({"rules": [], "default": "DENY"})
+
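The system prompt in `_build_system_prompt` states the DSL semantics only in words: rules run top to bottom, the first match wins, and the default applies otherwise. A minimal sketch of that evaluation order follows. `evaluate_rules` and `OPS` are hypothetical names that do not appear in this commit, the sample field `amount` and the decisions `REVIEW` and `ALLOW` are made up for illustration, and the sketch assumes that all conditions in one `"if"` list are combined with AND, which the prompt does not state. The environment's real evaluator may differ.

```python
import operator

# Hypothetical operator table covering the six operators listed in the prompt.
OPS = {
    ">": operator.gt, "<": operator.lt, ">=": operator.ge,
    "<=": operator.le, "==": operator.eq, "!=": operator.ne,
}

def evaluate_rules(ruleset: dict, record: dict) -> str:
    """Return the decision of the first matching rule, else the default."""
    for rule in ruleset.get("rules", []):
        conditions = rule.get("if", [])
        # Assumption: every condition of a rule must hold (logical AND).
        # A condition on a field missing from the record never matches.
        if all(
            c["field"] in record and OPS[c["op"]](record[c["field"]], c["value"])
            for c in conditions
        ):
            return rule["then"]
    return ruleset["default"]

ruleset = {
    "rules": [
        {"if": [{"field": "amount", "op": ">", "value": 1000}], "then": "REVIEW"},
        {"if": [{"field": "amount", "op": ">", "value": 100}], "then": "ALLOW"},
    ],
    "default": "DENY",
}

print(evaluate_rules(ruleset, {"amount": 5000}))  # first rule matches
print(evaluate_rules(ruleset, {"amount": 500}))   # only the second rule matches
print(evaluate_rules(ruleset, {"amount": 50}))    # no rule matches, default applies
```

Because the first match wins, rule order matters: swapping the two rules above would make the `> 1000` rule unreachable. Note also that the fallback payload `{"rules": [], "default": "DENY"}` used by the agent evaluates to `DENY` for every record under this reading.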
+ # ── Trajectory Bank ───────────────────────────────────────────────────────────
+
+ class TrajectoryBank:
+     """Stores and retrieves high-reward trajectories per task."""
+
+     def __init__(self):
+         self.bank: dict[str, list[Trajectory]] = {task: [] for task in TASKS}
+
+     def store(self, trajectory: Trajectory):
+         if trajectory.total_reward >= MIN_REWARD_THRESHOLD:
+             # setdefault guards against task names that are not listed in TASKS
+             bucket = self.bank.setdefault(trajectory.task_name, [])
+             bucket.append(trajectory)
+             # Keep only top-K by reward
+             bucket.sort(key=lambda t: t.total_reward, reverse=True)
+             self.bank[trajectory.task_name] = bucket[:TOP_K_TRAJECTORIES]
+
+     def get_examples(self, task_name: str) -> list[Trajectory]:
+         return self.bank.get(task_name, [])
+
+     def summary(self) -> dict:
+         return {
+             task: {
+                 "stored": len(trajs),
+                 "best_reward": max((t.total_reward for t in trajs), default=0),
+                 "best_accuracy": max((t.final_accuracy for t in trajs), default=0)
+             }
+             for task, trajs in self.bank.items()
+         }
+
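To make the retention policy of `TrajectoryBank.store` concrete, here is a small standalone sketch of the same idea: drop episodes below a reward threshold, then keep the K best per task. `Traj` is a hypothetical stand-in for `Trajectory` with only two fields, `store` is a free function written for this illustration, and the values `0.5` and `3` for the two constants are assumptions, since their real definitions are outside this part of the diff.

```python
from dataclasses import dataclass

# Assumed values for illustration only; the real constants are defined elsewhere.
MIN_REWARD_THRESHOLD = 0.5
TOP_K_TRAJECTORIES = 3

@dataclass
class Traj:
    """Hypothetical stand-in for Trajectory with only the fields used here."""
    task_name: str
    total_reward: float

def store(bank: dict[str, list[Traj]], traj: Traj) -> None:
    # Discard low-reward episodes entirely.
    if traj.total_reward < MIN_REWARD_THRESHOLD:
        return
    # Re-sort the task's bucket by reward and truncate to the K best.
    kept = sorted(bank.get(traj.task_name, []) + [traj],
                  key=lambda t: t.total_reward, reverse=True)
    bank[traj.task_name] = kept[:TOP_K_TRAJECTORIES]

bank: dict[str, list[Traj]] = {}
for reward in [0.2, 0.9, 0.6, 1.4, 0.7]:
    store(bank, Traj("data_access", reward))

# The 0.2 episode is below the threshold; 0.6 is evicted once four candidates exist.
print([t.total_reward for t in bank["data_access"]])
```

Under these assumed constants the bank ends up holding the rewards 1.4, 0.9 and 0.7, best first, so a later episode only enters the few-shot prompt if it beats the weakest stored trajectory.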
+ # ── Training Loop ─────────────────────────────────────────────────────────────
+
+ class TrainingLoop:
+     def __init__(self, env_url: str, hf_token: str):
+         self.env = EnvClient(env_url)
+         self.agent = Agent(hf_token)
+         self.bank = TrajectoryBank()
+         self.metrics = []  # List of {episode, task, reward, accuracy, success}
+
+     def run_episode(self, task_name: str, episode_id: int) -> Trajectory:
+         """Run a single episode and return the trajectory."""
+         few_shots = self.bank.get_examples(task_name)
+         trajectory = Trajectory(task_name=task_name, episode_id=episode_id)
+
+         # Reset environment
+         result = self.env.reset(task_name)
+         obs = result.get("observation", {})
+         done = result.get("done", False)
+         history = []
+
+         print(f" [Episode {episode_id}] task={task_name} few_shots={len(few_shots)}")
+
+         step_num = 0
+         while not done and step_num < obs.get("max_steps", 7):
+             step_num += 1
+
+             # Get action from agent
+             action_type, content = self.agent.get_action(
+                 observation=obs,
+                 step_number=step_num,
+                 episode_history=history,
+                 few_shot_examples=few_shots
+             )
+
+             # Execute action
+             result = self.env.step(action_type, content)
+             reward = result.get("reward", 0.0)
+             done = result.get("done", False)
+             obs = result.get("observation", {})
+             info = result.get("info", {})
+
+             # Record step
+             step = Step(
+                 step_number=step_num,
+                 action_type=action_type,
+                 action_content=content[:300],
+                 reward=reward,
+                 accuracy=obs.get("current_accuracy", 0.0),
+                 feedback=obs.get("feedback", "") or "",
+                 clarification_response=obs.get("clarification_response")
+             )
+             trajectory.steps.append(step)
+             trajectory.total_reward += reward
+
+             # Update history
+             history.append(f"Step {step_num}: {action_type} → reward={reward:.2f} acc={step.accuracy:.2f}")
+
+             print(f" step={step_num} action={action_type} reward={reward:.3f} acc={step.accuracy:.2f}")
+
+             if done:
+                 episode_score = info.get("episode_score", obs.get("current_accuracy", 0.0))
+                 trajectory.final_accuracy = episode_score
+                 trajectory.success = obs.get("current_accuracy", 0.0) >= 0.9
+                 break
+
+         if not trajectory.steps:
+             trajectory.final_accuracy = 0.0
+         elif not done:
+             # Step budget ran out without a terminal signal: use the last observed accuracy
+             trajectory.final_accuracy = trajectory.steps[-1].accuracy
+             trajectory.success = trajectory.final_accuracy >= 0.9
+
+         return trajectory
+
+     def run(self):
+         """Run full training loop across all tasks."""
+         print("=" * 60)
+         print("REWARD-GUIDED TRAJECTORY OPTIMIZATION")
+         print(f"Tasks: {TASKS}")
+         print(f"Episodes per task: {NUM_EPISODES_PER_TASK}")
+         print(f"Top-K trajectories: {TOP_K_TRAJECTORIES}")
+         print("=" * 60)
+
+         # Health check (report the URL this loop was actually constructed with)
+         if not self.env.health():
+             raise RuntimeError(f"Environment not reachable at {self.env.base_url}")
+         print(f"Environment: OK ({self.env.base_url})\n")
+
+         global_episode = 0
+
+         for task in TASKS:
+             print(f"\n{'─'*40}")
+             print(f"TASK: {task}")
+             print(f"{'─'*40}")
+
+             task_rewards = []
+             task_accuracies = []
+
+             for ep in range(1, NUM_EPISODES_PER_TASK + 1):
+                 global_episode += 1
+                 # Count the few-shot examples available before this episode runs;
+                 # deriving the count after store() is wrong once the bank is capped at top-K
+                 few_shots_used = len(self.bank.get_examples(task))
+                 trajectory = self.run_episode(task, ep)
+
+                 # Store in bank
+                 self.bank.store(trajectory)
+
+                 # Record metrics
+                 self.metrics.append({
+                     "global_episode": global_episode,
+                     "task": task,
+                     "episode_in_task": ep,
+                     "total_reward": trajectory.total_reward,
+                     "final_accuracy": trajectory.final_accuracy,
+                     "success": trajectory.success,
+                     "num_steps": len(trajectory.steps),
+                     "few_shots_used": few_shots_used
+                 })
+
+                 task_rewards.append(trajectory.total_reward)
+                 task_accuracies.append(trajectory.final_accuracy)
+
+                 print(f" → Episode {ep} complete: reward={trajectory.total_reward:.3f} accuracy={trajectory.final_accuracy:.2f} success={trajectory.success}")
+                 time.sleep(0.5)  # Rate limiting
+
+             print("\n Task summary:")
+             print(f" First episode reward: {task_rewards[0]:.3f}")
+             print(f" Last episode reward: {task_rewards[-1]:.3f}")
+             print(f" Improvement: {task_rewards[-1] - task_rewards[0]:+.3f}")
+
+         print("\n" + "=" * 60)
+         print("TRAINING COMPLETE")
+         print(f"Bank summary: {self.bank.summary()}")
+         print("=" * 60)
+
+         return self.metrics
+
+ # ── Plot Generation ───────────────────────────────────────────────────────────
+
+ def save_plots(metrics: list[dict]):
+     """
+     Save reward curve and accuracy curve as PNG files.
+     These are REQUIRED for hackathon submission — must be committed to repo.
+     """
+     try:
+         import matplotlib
+         matplotlib.use("Agg")  # Non-interactive backend
+         import matplotlib.pyplot as plt
+         import numpy as np
+     except ImportError:
+         print("matplotlib not installed. Run: pip install matplotlib")
+         return
+
+     os.makedirs("training/plots", exist_ok=True)
+
+     episodes = [m["global_episode"] for m in metrics]
+     rewards = [m["total_reward"] for m in metrics]
+     accuracies = [m["final_accuracy"] for m in metrics]
+     tasks = [m["task"] for m in metrics]
+
+     colors = {
+         "data_access": "#2196F3",
+         "resource_access": "#FF9800",
+         "transaction_approval": "#4CAF50"
+     }
+
+     # ── Plot 1: Reward Curve ──────────────────────────────────────────────────
+     fig, ax = plt.subplots(figsize=(10, 5))
+
+     for task in TASKS:
+         task_eps = [m["global_episode"] for m in metrics if m["task"] == task]
+         task_rews = [m["total_reward"] for m in metrics if m["task"] == task]
+         ax.plot(task_eps, task_rews, marker="o", label=task,
+                 color=colors.get(task, "gray"), linewidth=2, markersize=5)
+
+     # Trend line (a linear fit needs at least two points)
+     if len(episodes) >= 2:
+         z = np.polyfit(episodes, rewards, 1)
+         p = np.poly1d(z)
+         ax.plot(episodes, p(episodes), "--", color="red", alpha=0.5, linewidth=1.5, label="overall trend")
+
+     ax.set_xlabel("Episode")
+     ax.set_ylabel("Total Reward")
+     ax.set_title("Reward Curve — Reward-Guided Trajectory Optimization")
+     ax.legend()
+     ax.grid(True, alpha=0.3)
+     ax.set_ylim(bottom=0)
+
+     plt.tight_layout()
+     plt.savefig("training/plots/reward_curve.png", dpi=150, bbox_inches="tight")
+     plt.close()
+     print("Saved: training/plots/reward_curve.png")
+
+     # ── Plot 2: Accuracy Curve ────────────────────────────────────────────────
+     fig, ax = plt.subplots(figsize=(10, 5))
+
+     for task in TASKS:
+         task_eps = [m["global_episode"] for m in metrics if m["task"] == task]
+         task_accs = [m["final_accuracy"] for m in metrics if m["task"] == task]
+         ax.plot(task_eps, task_accs, marker="s", label=task,
+                 color=colors.get(task, "gray"), linewidth=2, markersize=5)
+
+     ax.axhline(y=0.9, color="red", linestyle="--", alpha=0.7, label="success threshold (0.9)")
+
+     ax.set_xlabel("Episode")
+     ax.set_ylabel("Final Accuracy")
+     ax.set_title("Accuracy Curve — Policy-to-Logic Agent")
+     ax.legend()
+     ax.grid(True, alpha=0.3)
+     ax.set_ylim(0, 1.05)
+
+     plt.tight_layout()
+     plt.savefig("training/plots/accuracy_curve.png", dpi=150, bbox_inches="tight")
+     plt.close()
+     print("Saved: training/plots/accuracy_curve.png")
+
+     # ── Plot 3: Per-Task Improvement Bar Chart ────────────────────────────────
+     fig, ax = plt.subplots(figsize=(8, 5))
+
+     task_names = []
+     improvements = []
+
+     for task in TASKS:
+         task_accs = [m["final_accuracy"] for m in metrics if m["task"] == task]
+         if len(task_accs) >= 2:
+             first = task_accs[0]
+             last = task_accs[-1]
+             task_names.append(task.replace("_", "\n"))
+             improvements.append(last - first)
+
+     bars = ax.bar(task_names, improvements,
+                   color=["#2196F3", "#FF9800", "#4CAF50"][:len(task_names)],
+                   edgecolor="white", linewidth=1.5)
+
+     ax.axhline(y=0, color="black", linewidth=0.8)
+     ax.set_ylabel("Accuracy Improvement (last - first episode)")
+     ax.set_title("Per-Task Improvement from Trajectory Accumulation")
+     ax.grid(True, axis="y", alpha=0.3)
+
+     for bar, val in zip(bars, improvements):
+         # Put the label above positive bars and below negative ones
+         offset = 0.01 if val >= 0 else -0.01
+         ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + offset,
+                 f"{val:+.2f}", ha="center", va="bottom" if val >= 0 else "top",
+                 fontweight="bold")
+
+     plt.tight_layout()
+     plt.savefig("training/plots/improvement_chart.png", dpi=150, bbox_inches="tight")
+     plt.close()
+     print("Saved: training/plots/improvement_chart.png")
+
+     # ── Save raw metrics as JSON ──────────────────────────────────────────────
+     with open("training/plots/metrics.json", "w") as f:
+         json.dump(metrics, f, indent=2)
+     print("Saved: training/plots/metrics.json")
+
+ # ── Entry Point ───────────────────────────────────────────────────────────────
+
+ if __name__ == "__main__":
+     hf_token = os.getenv("HF_TOKEN", "")
+     if not hf_token:
+         raise ValueError("HF_TOKEN environment variable not set")
+
+     loop = TrainingLoop(ENV_BASE_URL, hf_token)
+     metrics = loop.run()
+     save_plots(metrics)
+
+     print("\nNext step: commit training/plots/*.png to repo for submission.")
uv.lock CHANGED