databoysu committed on
Commit · 3313317
1 Parent(s): d3f1b00
improve tooling and whitespace error handling
- inference.py +32 -5
inference.py
CHANGED
|
@@ -96,6 +96,13 @@ Action policy:
|
|
| 96 |
- After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
|
| 97 |
- Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
|
| 98 |
|
| 99 |
Worked examples (generic, no benchmark task leakage):
|
| 100 |
|
| 101 |
Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
|
|
@@ -128,6 +135,29 @@ Observation: syntax_error=true and traceback provides a concrete syntax failure
|
|
| 128 |
Valid action JSON:
|
| 129 |
{"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":8,"end_line":9,"new_code_block":" # syntax-fixed code"}
|
| 130 |
|
| 131 |
Submit gate (hard rule):
|
| 132 |
- If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
|
| 133 |
- If all-tests-passed signal is present, do SUBMIT immediately on this turn.
|
|
@@ -436,13 +466,10 @@ async def run(difficulty: Optional[str] = None, show_thought: bool = False) -> N
|
|
| 436 |
consecutive_same_action_count = 1
|
| 437 |
last_action_type = current_action_type
|
| 438 |
|
| 439 |
-
if (
|
| 440 |
-
current_action_type == "RUN_TESTS"
|
| 441 |
-
and consecutive_same_action_count >= 3
|
| 442 |
-
):
|
| 443 |
kill_switch_triggered = True
|
| 444 |
history.append(
|
| 445 |
-
"KILL_SWITCH:
|
| 446 |
"Terminating episode early to prevent looping."
|
| 447 |
)
|
| 448 |
steps_taken = step
|
|
|
|
| 96 |
- After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
|
| 97 |
- Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
|
| 98 |
|
| 99 |
+
Strict state transition rules (no looping on same action):
|
| 100 |
+
- If pass_count_summary=unknown (no test output yet), MUST run RUN_TESTS next (never VIEW_CODE or REPLACE_LINES).
|
| 101 |
+
- After VIEW_CODE, prefer RUN_TESTS next to get test evidence (never do VIEW_CODE twice in a row).
|
| 102 |
+
- After REPLACE_LINES, always do RUN_TESTS next to validate the fix (never do REPLACE_LINES twice in a row).
|
| 103 |
+
- If REPLACE_LINES+RUN_TESTS did not fix the issue, do VIEW_CODE next to re-orient before another edit.
|
| 104 |
+
- Do not attempt the same edit twice; if it fails, change strategy (VIEW_CODE or UNDO_EDIT or RESET_TO_ORIGINAL).
|
| 105 |
+
|
| 106 |
Worked examples (generic, no benchmark task leakage):
|
| 107 |
|
| 108 |
Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
|
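The strict state transition rules added in this hunk live only in the prompt text, so the model is merely asked to follow them. As a minimal sketch of how the same rules could also be checked in code, the helper below maps the last action and the current pass count to the set of permitted next actions. The name `allowed_next_actions`, the `"2/5"` style of a known pass count, and the precedence among rules (the unknown-pass-count rule wins) are assumptions for illustration and are not part of inference.py.

```python
from typing import Optional, Set

# Action names taken from the prompt text in the diff. SUBMIT is governed
# separately by the submit gate and is left out of this sketch.
ALL_ACTIONS: Set[str] = {
    "RUN_TESTS",
    "VIEW_CODE",
    "REPLACE_LINES",
    "UNDO_EDIT",
    "RESET_TO_ORIGINAL",
}


def allowed_next_actions(last_action: Optional[str], pass_count_summary: str) -> Set[str]:
    """Hypothetical helper: return the actions the transition rules permit next."""
    # Rule: with no test evidence yet, the only valid move is RUN_TESTS.
    if pass_count_summary == "unknown":
        return {"RUN_TESTS"}
    # Rule: after an edit, always validate it with RUN_TESTS.
    if last_action == "REPLACE_LINES":
        return {"RUN_TESTS"}
    # Rule: never VIEW_CODE twice in a row, and do not repeat RUN_TESTS
    # immediately when test evidence is already present.
    if last_action in {"VIEW_CODE", "RUN_TESTS"}:
        return ALL_ACTIONS - {last_action}
    # No previous action constrains the choice.
    return set(ALL_ACTIONS)
```

A caller could reject or re-prompt any model action that falls outside the returned set before it is executed, rather than relying on the kill switch alone.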
|
|
|
| 135 |
Valid action JSON:
|
| 136 |
{"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":8,"end_line":9,"new_code_block":" # syntax-fixed code"}
|
| 137 |
|
| 138 |
+
Example 4: NO TEST OUTPUT YET + UNKNOWN PASS COUNT -> MUST choose RUN_TESTS (not VIEW_CODE)
|
| 139 |
+
Input evidence snippet:
|
| 140 |
+
- pass_count_summary=unknown
|
| 141 |
+
- all_tests_pass_signal=false
|
| 142 |
+
- last_execution_output is empty ""
|
| 143 |
+
Valid thought:
|
| 144 |
+
Observation: no test execution has occurred; pass_count_summary is unknown and no test output is available. Diagnosis: cannot diagnose code bugs without seeing test failures; must run tests first. Plan: choose RUN_TESTS to collect test evidence and determine what needs fixing.
|
| 145 |
+
Valid action JSON (WRONG - violates state rule):
|
| 146 |
+
BAD: {"thought":"...","action_type":"VIEW_CODE","start_line":null,"end_line":null,"new_code_block":null}
|
| 147 |
+
CORRECT: {"thought":"...","action_type":"RUN_TESTS","start_line":null,"end_line":null,"new_code_block":null}
|
| 148 |
+
|
| 149 |
+
Example 5: EDIT FAILED - SAME ACTION TWICE IS WRONG
|
| 150 |
+
Previous step: REPLACE_LINES on line 4 (indentation fix attempt 1)
|
| 151 |
+
Current observation:
|
| 152 |
+
- syntax_error=true (indentation still wrong)
|
| 153 |
+
- same error message about dedentation
|
| 154 |
+
- pass_count_summary=unknown
|
| 155 |
+
Valid thought:
|
| 156 |
+
Observation: same indentation error persists despite the previous REPLACE_LINES attempt. The fix did not work. Diagnosis: attempting the identical REPLACE_LINES again will create an infinite loop; must change strategy. Plan: call VIEW_CODE next to re-examine the context and understand why the fix didn't apply properly, then revise the approach.
|
| 157 |
+
Valid action JSON (WRONG - causes looping):
|
| 158 |
+
BAD (same edit): {"thought":"...","action_type":"REPLACE_LINES","start_line":4,"end_line":4,"new_code_block":" fixed code"}
|
| 159 |
+
CORRECT (change strategy): {"thought":"...","action_type":"VIEW_CODE","start_line":null,"end_line":null,"new_code_block":null}
|
| 160 |
+
|
| 161 |
Submit gate (hard rule):
|
| 162 |
- If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
|
| 163 |
- If all-tests-passed signal is present, do SUBMIT immediately on this turn.
|
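The worked examples in this hunk all use one action JSON shape with five keys, and the submit gate is stated as a hard rule. A minimal sketch of how that shape and gate could be checked follows. `parse_action` and `may_submit` are hypothetical helpers, not functions from inference.py, and the specific validation choices (which fields REPLACE_LINES requires, which signals feed the gate) are assumptions drawn from the examples.

```python
import json
from typing import Any, Dict

# Keys that appear in every action JSON example in the prompt.
REQUIRED_KEYS = ("thought", "action_type", "start_line", "end_line", "new_code_block")


def parse_action(raw: str) -> Dict[str, Any]:
    """Hypothetical helper: parse an action JSON string and check its shape."""
    action = json.loads(raw)
    missing = [key for key in REQUIRED_KEYS if key not in action]
    if missing:
        raise ValueError(f"action is missing keys: {missing}")
    if action["action_type"] == "REPLACE_LINES":
        # An edit needs a concrete line range and replacement text.
        if action["start_line"] is None or action["end_line"] is None:
            raise ValueError("REPLACE_LINES requires start_line and end_line")
        if action["new_code_block"] is None:
            raise ValueError("REPLACE_LINES requires new_code_block")
    return action


def may_submit(all_tests_pass_signal: bool, syntax_error: bool) -> bool:
    """Submit gate sketch: allow SUBMIT only when tests pass and no error remains."""
    return all_tests_pass_signal and not syntax_error
```

For instance, the CORRECT action from Example 4 parses cleanly, while a REPLACE_LINES action whose line range is null is rejected before it reaches the environment.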
|
|
|
| 466 |
consecutive_same_action_count = 1
|
| 467 |
last_action_type = current_action_type
|
| 468 |
|
| 469 |
+
if consecutive_same_action_count >= 3:
|
|
|
|
|
|
|
|
|
|
| 470 |
kill_switch_triggered = True
|
| 471 |
history.append(
|
| 472 |
+
f"KILL_SWITCH: {current_action_type} selected 3 times consecutively. "
|
| 473 |
"Terminating episode early to prevent looping."
|
| 474 |
)
|
| 475 |
steps_taken = step
|
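The last hunk widens the kill switch from RUN_TESTS only to any action type chosen three times in a row. The counting logic around it is only partly visible in the diff, so the following is a self-contained sketch under that assumption. `find_kill_switch_step` is a hypothetical function, and the increment branch is inferred from the visible reset branch rather than copied from inference.py.

```python
from typing import List, Optional, Tuple


def find_kill_switch_step(actions: List[str], limit: int = 3) -> Optional[Tuple[int, str]]:
    """Return (step, action_type) at which the kill switch would fire, else None."""
    last_action_type: Optional[str] = None
    consecutive_same_action_count = 0
    for step, current_action_type in enumerate(actions, start=1):
        if current_action_type == last_action_type:
            # Assumed branch: the same action as the previous step extends the streak.
            consecutive_same_action_count += 1
        else:
            # Visible in the diff: a different action resets the streak.
            consecutive_same_action_count = 1
            last_action_type = current_action_type
        # New behavior: any action type triggers the switch, not only RUN_TESTS.
        if consecutive_same_action_count >= limit:
            return step, current_action_type
    return None
```

With this change, a run such as RUN_TESTS, VIEW_CODE, VIEW_CODE, VIEW_CODE stops at step 4, whereas the previous condition would have let it continue because the repeated action was not RUN_TESTS.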