databoysu committed
Commit · 0bc4dba
Parent(s): 1a7ff25
improve tooling and whitespace error handling error
inference.py CHANGED (+14 -19)
|
@@ -96,13 +96,6 @@ Action policy:
|
|
| 96 |
- After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
|
| 97 |
- Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
|
| 98 |
|
| 99 |
-
Strict state transition rules (no looping on same action):
|
| 100 |
-
- If pass_count_summary=unknown (no test output yet), MUST run RUN_TESTS next (never VIEW_CODE or REPLACE_LINES).
|
| 101 |
-
- After VIEW_CODE, prefer RUN_TESTS next to get test evidence (never do VIEW_CODE twice in a row).
|
| 102 |
-
- CRITICAL RULE: After you use REPLACE_LINES, your VERY NEXT action MUST be RUN_TESTS to verify the edit. Do NOT use REPLACE_LINES twice in a row. Do NOT guess whether you made a syntax error; always run tests to get proof.
|
| 103 |
-
- If REPLACE_LINES+RUN_TESTS did not fix the issue, do VIEW_CODE next to re-orient before another edit.
|
| 104 |
-
- Do not attempt the same edit twice; if it fails, change strategy (VIEW_CODE or UNDO_EDIT or RESET_TO_ORIGINAL).
|
| 105 |
-
|
| 106 |
Worked examples (generic, no benchmark task leakage):
|
| 107 |
|
| 108 |
Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
|
|
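The "strict state transition rules" removed in this hunk were prompt text only. As a rough illustration of what they asked for, the same constraints can be written as a small lookup. This is a sketch, not code from `inference.py`: the helper name `allowed_next_actions`, the `ALL_ACTIONS` set (built from the action names that appear in this diff, which may not be the full list), and the placeholder pass-count string are all assumptions. The submit gate is deliberately not modelled.

```python
from typing import Optional, Set

# Assumption: these are the action names mentioned in this diff; the real
# environment may define more.
ALL_ACTIONS: Set[str] = {
    "RUN_TESTS", "VIEW_CODE", "REPLACE_LINES",
    "UNDO_EDIT", "RESET_TO_ORIGINAL", "SUBMIT",
}


def allowed_next_actions(last_action: Optional[str], pass_count_summary: str) -> Set[str]:
    """Hypothetical helper: actions the removed prompt rules would permit next."""
    # No test output yet: the rules required RUN_TESTS before anything else.
    if pass_count_summary == "unknown":
        return {"RUN_TESTS"}
    # After an edit, the very next action had to be RUN_TESTS.
    if last_action == "REPLACE_LINES":
        return {"RUN_TESTS"}
    # Never VIEW_CODE twice in a row.
    if last_action == "VIEW_CODE":
        return ALL_ACTIONS - {"VIEW_CODE"}
    # Do not rerun tests immediately when evidence is already available.
    if last_action == "RUN_TESTS":
        return ALL_ACTIONS - {"RUN_TESTS"}
    return set(ALL_ACTIONS)
```

A caller could compare the model's chosen `action_type` against this set and reject or re-prompt on a violation, instead of relying on the model to follow the rule text.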
@@ -125,17 +118,6 @@ Observation: output explicitly shows Tests Passed: 3/3 and includes the success
|
|
| 125 |
Valid action JSON:
|
| 126 |
{"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"SUBMIT","start_line":null,"end_line":null,"new_code_block":null}
|
| 127 |
|
| 128 |
-
Example 4: NO TEST OUTPUT YET + UNKNOWN PASS COUNT -> MUST choose RUN_TESTS (not VIEW_CODE)
|
| 129 |
-
Input evidence snippet:
|
| 130 |
-
- pass_count_summary=unknown
|
| 131 |
-
- all_tests_pass_signal=false
|
| 132 |
-
- last_execution_output is empty ""
|
| 133 |
-
Valid thought:
|
| 134 |
-
Observation: no test execution has occurred; pass_count_summary is unknown and no test output is available. Diagnosis: cannot diagnose code bugs without seeing test failures; must run tests first. Plan: choose RUN_TESTS to collect test evidence and determine what needs fixing.
|
| 135 |
-
Valid action JSON (WRONG - violates state rule):
|
| 136 |
-
BAD: {"thought":"...","action_type":"VIEW_CODE","start_line":null,"end_line":null,"new_code_block":null}
|
| 137 |
-
CORRECT: {"thought":"...","action_type":"RUN_TESTS","start_line":null,"end_line":null,"new_code_block":null}
|
| 138 |
-
|
| 139 |
Submit gate (hard rule):
|
| 140 |
- If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
|
| 141 |
- If all-tests-passed signal is present, do SUBMIT immediately on this turn.
|
|
@@ -368,6 +350,18 @@ async def run(difficulty: Optional[str] = None, show_thought: bool = False) -> N
|
|
| 368 |
obs_last_output = str(getattr(result.observation, "last_execution_output", "") or "")
|
| 369 |
pass_count_text, all_tests_pass_signal = _extract_pass_signal_fields(obs_last_output)
|
| 370 |
last_action = action_trajectory[-1] if action_trajectory else "none"
|
| 371 |
if show_thought:
|
| 372 |
output_preview = "\\n".join(obs_last_output.splitlines()[:6])
|
| 373 |
print("[OBS_DEBUG]", file=sys.stderr, flush=True)
|
|
@@ -386,8 +380,9 @@ async def run(difficulty: Optional[str] = None, show_thought: bool = False) -> N
|
|
| 386 |
"If all_tests_pass_signal=true, you must choose SUBMIT now and must not choose RUN_TESTS again. "
|
| 387 |
"Do not wait for additional test output when all_tests_pass_signal=true. "
|
| 388 |
"If last_action was RUN_TESTS and all_tests_pass_signal=false, choose REPLACE_LINES or VIEW_CODE next, not RUN_TESTS again.\n\n"
|
| 389 |
f"decision_guard: last_action={last_action}, pass_count_summary={pass_count_text}, all_tests_pass_signal={str(all_tests_pass_signal).lower()}\n\n"
|
| 390 |
-
f"action_trajectory={(' -> '.join(action_trajectory) if action_trajectory else 'none')}\n\n"
|
| 391 |
f"{obs_text}"
|
| 392 |
),
|
| 393 |
}
|
|
|
|
| 96 |
- After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
|
| 97 |
- Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
|
| 98 |
|
| 99 |
Worked examples (generic, no benchmark task leakage):
|
| 100 |
|
| 101 |
Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
|
|
|
|
| 118 |
Valid action JSON:
|
| 119 |
{"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"SUBMIT","start_line":null,"end_line":null,"new_code_block":null}
|
| 120 |
|
| 121 |
Submit gate (hard rule):
|
| 122 |
- If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
|
| 123 |
- If all-tests-passed signal is present, do SUBMIT immediately on this turn.
|
|
|
|
| 350 |
obs_last_output = str(getattr(result.observation, "last_execution_output", "") or "")
|
| 351 |
pass_count_text, all_tests_pass_signal = _extract_pass_signal_fields(obs_last_output)
|
| 352 |
last_action = action_trajectory[-1] if action_trajectory else "none"
|
| 353 |
+
dynamic_override = ""
|
| 354 |
+
if action_trajectory and action_trajectory[-1] == "REPLACE_LINES":
|
| 355 |
+
dynamic_override = (
|
| 356 |
+
"\n[SYSTEM OVERRIDE]: Your last action was REPLACE_LINES. "
|
| 357 |
+
"You are STRICTLY FORBIDDEN from editing the code again. "
|
| 358 |
+
"Your action_type MUST be RUN_TESTS to verify the changes.\n"
|
| 359 |
+
)
|
| 360 |
+
elif action_trajectory and action_trajectory[-1] == "VIEW_CODE":
|
| 361 |
+
dynamic_override = (
|
| 362 |
+
"\n[SYSTEM OVERRIDE]: Your last action was VIEW_CODE. "
|
| 363 |
+
"You MUST choose RUN_TESTS next to get test evidence.\n"
|
| 364 |
+
)
|
| 365 |
if show_thought:
|
| 366 |
output_preview = "\\n".join(obs_last_output.splitlines()[:6])
|
| 367 |
print("[OBS_DEBUG]", file=sys.stderr, flush=True)
|
|
|
|
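The `dynamic_override` branch added in the hunk above can be read as a pure function of the action trajectory. The sketch below restates it that way so it can be tested in isolation. The function name `build_dynamic_override` is hypothetical and does not appear in the commit; the message strings are copied from the added lines.

```python
from typing import Sequence


def build_dynamic_override(action_trajectory: Sequence[str]) -> str:
    """Hypothetical refactor of the inline if/elif added in this commit."""
    # Same fallback the surrounding code uses for an empty trajectory.
    last_action = action_trajectory[-1] if action_trajectory else "none"
    if last_action == "REPLACE_LINES":
        return (
            "\n[SYSTEM OVERRIDE]: Your last action was REPLACE_LINES. "
            "You are STRICTLY FORBIDDEN from editing the code again. "
            "Your action_type MUST be RUN_TESTS to verify the changes.\n"
        )
    if last_action == "VIEW_CODE":
        return (
            "\n[SYSTEM OVERRIDE]: Your last action was VIEW_CODE. "
            "You MUST choose RUN_TESTS next to get test evidence.\n"
        )
    # Any other last action (or no action yet) adds nothing to the prompt.
    return ""
```

Only the last action matters here, so a trajectory such as `["VIEW_CODE", "REPLACE_LINES"]` yields the REPLACE_LINES message and `["RUN_TESTS"]` yields an empty string.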
| 380 |
"If all_tests_pass_signal=true, you must choose SUBMIT now and must not choose RUN_TESTS again. "
|
| 381 |
"Do not wait for additional test output when all_tests_pass_signal=true. "
|
| 382 |
"If last_action was RUN_TESTS and all_tests_pass_signal=false, choose REPLACE_LINES or VIEW_CODE next, not RUN_TESTS again.\n\n"
|
| 383 |
+
f"action_trajectory={(' -> '.join(action_trajectory) if action_trajectory else 'none')}\n"
|
| 384 |
+
f"{dynamic_override}\n"
|
| 385 |
f"decision_guard: last_action={last_action}, pass_count_summary={pass_count_text}, all_tests_pass_signal={str(all_tests_pass_signal).lower()}\n\n"
|
|
|
|
| 386 |
f"{obs_text}"
|
| 387 |
),
|
| 388 |
}
|
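To see how the reordered f-strings lay out in the final prompt, the tail of the message can be rendered on its own. This is a sketch under assumptions: `render_prompt_tail` is a hypothetical name, and the argument values in the usage lines are made up for illustration. Only the four f-string lines mirror the new side of the diff.

```python
from typing import Sequence


def render_prompt_tail(
    action_trajectory: Sequence[str],
    dynamic_override: str,
    pass_count_text: str,
    all_tests_pass_signal: bool,
    obs_text: str,
) -> str:
    """Hypothetical helper reproducing the order of the f-strings after this commit."""
    last_action = action_trajectory[-1] if action_trajectory else "none"
    return (
        f"action_trajectory={(' -> '.join(action_trajectory) if action_trajectory else 'none')}\n"
        f"{dynamic_override}\n"
        f"decision_guard: last_action={last_action}, pass_count_summary={pass_count_text}, "
        f"all_tests_pass_signal={str(all_tests_pass_signal).lower()}\n\n"
        f"{obs_text}"
    )


# Illustrative values only; not taken from a real run.
print(render_prompt_tail(["RUN_TESTS", "REPLACE_LINES"], "\n[SYSTEM OVERRIDE]: ...\n", "unknown", False, "<obs>"))
```

Compared with the old layout, the trajectory now comes before `decision_guard`, and the override (when non-empty) sits between them, so the most recent constraint is stated just before the guard line and the observation text.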