databoysu committed on
Commit
3313317
·
1 Parent(s): d3f1b00

improve tooling and whitespace error handling

Files changed (1)
  1. inference.py +32 -5
inference.py CHANGED
@@ -96,6 +96,13 @@ Action policy:
96
  - After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
97
  - Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
98
 
99
  Worked examples (generic, no benchmark task leakage):
100
 
101
  Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
@@ -128,6 +135,29 @@ Observation: syntax_error=true and traceback provides a concrete syntax failure
128
  Valid action JSON:
129
  {"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":8,"end_line":9,"new_code_block":" # syntax-fixed code"}
130
 
131
  Submit gate (hard rule):
132
  - If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
133
  - If all-tests-passed signal is present, do SUBMIT immediately on this turn.
@@ -436,13 +466,10 @@ async def run(difficulty: Optional[str] = None, show_thought: bool = False) -> N
436
  consecutive_same_action_count = 1
437
  last_action_type = current_action_type
438
 
439
- if (
440
- current_action_type == "RUN_TESTS"
441
- and consecutive_same_action_count >= 3
442
- ):
443
  kill_switch_triggered = True
444
  history.append(
445
- "KILL_SWITCH: RUN_TESTS selected 3 times consecutively. "
446
  "Terminating episode early to prevent looping."
447
  )
448
  steps_taken = step
 
96
  - After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
97
  - Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
98
 
99
+ Strict state transition rules (no looping on same action):
100
+ - If pass_count_summary=unknown (no test output yet), MUST run RUN_TESTS next (never VIEW_CODE or REPLACE_LINES).
101
+ - After VIEW_CODE, prefer RUN_TESTS next to get test evidence (never do VIEW_CODE twice in a row).
102
+ - After REPLACE_LINES, always do RUN_TESTS next to validate the fix (never do REPLACE_LINES twice in a row).
103
+ - If REPLACE_LINES+RUN_TESTS did not fix the issue, do VIEW_CODE next to re-orient before another edit.
104
+ - Do not attempt the same edit twice; if it fails, change strategy (VIEW_CODE or UNDO_EDIT or RESET_TO_ORIGINAL).
105
+
106
  Worked examples (generic, no benchmark task leakage):
107
 
108
  Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
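
The transition rules added in this hunk exist only as prompt text; nothing in the diff enforces them in code. A minimal sketch of how they could be checked is shown below. The helper `allowed_next_actions` and the example value `"2/5"` for `pass_count_summary` are hypothetical and do not appear in `inference.py`. The sketch reads "prefer" as "never repeat" and leaves SUBMIT to the submit gate. Under this strict reading, a VIEW_CODE chosen immediately after a failed REPLACE_LINES (as in Example 5) would be rejected, so the rules and that example may need reconciling.

```python
from typing import Optional, Set

# Action names are taken from the prompt text; the helper itself is hypothetical.
ALL_ACTIONS: Set[str] = {
    "RUN_TESTS",
    "VIEW_CODE",
    "REPLACE_LINES",
    "UNDO_EDIT",
    "RESET_TO_ORIGINAL",
}


def allowed_next_actions(last_action: Optional[str], pass_count_summary: str) -> Set[str]:
    """Return the actions a strict reading of the transition rules permits next."""
    # No test evidence yet, or an edit that still needs validating: run the tests.
    if pass_count_summary == "unknown" or last_action == "REPLACE_LINES":
        return {"RUN_TESTS"}
    # Never pick VIEW_CODE or RUN_TESTS twice in a row.
    if last_action in ("VIEW_CODE", "RUN_TESTS"):
        return ALL_ACTIONS - {last_action}
    return set(ALL_ACTIONS)
```

For instance, `allowed_next_actions("RUN_TESTS", "2/5")` leaves every action except RUN_TESTS available, which matches Example 1 (failing tests after RUN_TESTS lead to REPLACE_LINES).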
 
135
  Valid action JSON:
136
  {"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":8,"end_line":9,"new_code_block":" # syntax-fixed code"}
137
 
138
+ Example 4: NO TEST OUTPUT YET + UNKNOWN PASS COUNT -> MUST choose RUN_TESTS (not VIEW_CODE)
139
+ Input evidence snippet:
140
+ - pass_count_summary=unknown
141
+ - all_tests_pass_signal=false
142
+ - last_execution_output is empty ""
143
+ Valid thought:
144
+ Observation: no test execution has occurred; pass_count_summary is unknown and no test output is available. Diagnosis: cannot diagnose code bugs without seeing test failures; must run tests first. Plan: choose RUN_TESTS to collect test evidence and determine what needs fixing.
145
+ Action JSON (BAD violates the state rule, CORRECT follows it):
146
+ BAD: {"thought":"...","action_type":"VIEW_CODE","start_line":null,"end_line":null,"new_code_block":null}
147
+ CORRECT: {"thought":"...","action_type":"RUN_TESTS","start_line":null,"end_line":null,"new_code_block":null}
148
+
149
+ Example 5: EDIT FAILED - SAME ACTION TWICE IS WRONG
150
+ Previous step: REPLACE_LINES on line 4 (indentation fix attempt 1)
151
+ Current observation:
152
+ - syntax_error=true (indentation still wrong)
153
+ - same error message about dedentation
154
+ - pass_count_summary=unknown
155
+ Valid thought:
156
+ Observation: same indentation error persists despite the previous REPLACE_LINES attempt. The fix did not work. Diagnosis: attempting the identical REPLACE_LINES again will create an infinite loop; must change strategy. Plan: call VIEW_CODE next to re-examine the context and understand why the fix didn't apply properly, then revise the approach.
157
+ Action JSON (BAD repeats the same edit and causes looping, CORRECT changes strategy):
158
+ BAD (same edit): {"thought":"...","action_type":"REPLACE_LINES","start_line":4,"end_line":4,"new_code_block":" fixed code"}
159
+ CORRECT (change strategy): {"thought":"...","action_type":"VIEW_CODE","start_line":null,"end_line":null,"new_code_block":null}
160
+
161
  Submit gate (hard rule):
162
  - If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
163
  - If all-tests-passed signal is present, do SUBMIT immediately on this turn.
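
The worked examples all use the same five-key action object, and the submit gate depends on a handful of observation fields. A rough sketch of validating both on the caller's side might look like the following. The functions `parse_action` and `submit_allowed` are hypothetical, and treating `traceback_text` as a plain string is an assumption; only the key names and signal names come from the prompt text.

```python
import json
from typing import Any, Dict

# Keys that appear in every action JSON example in the prompt.
REQUIRED_KEYS = ("thought", "action_type", "start_line", "end_line", "new_code_block")


def parse_action(raw: str) -> Dict[str, Any]:
    """Parse one action JSON object and check that all five keys are present."""
    action = json.loads(raw)
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in action]
    if missing:
        raise ValueError(f"action is missing keys: {missing}")
    return action


def submit_allowed(all_tests_pass_signal: bool, syntax_error: bool, traceback_text: str) -> bool:
    """Hypothetical submit gate: allow SUBMIT only on a clean all-tests-passed signal."""
    return all_tests_pass_signal and not syntax_error and not traceback_text.strip()
```

A caller could reject a SUBMIT action whenever `submit_allowed(...)` is false, rather than relying on the model to follow the hard rule unaided.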
 
466
  consecutive_same_action_count = 1
467
  last_action_type = current_action_type
468
 
469
+ if consecutive_same_action_count >= 3:
470
  kill_switch_triggered = True
471
  history.append(
472
+ f"KILL_SWITCH: {current_action_type} selected 3 times consecutively. "
473
  "Terminating episode early to prevent looping."
474
  )
475
  steps_taken = step
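
With this change the kill switch no longer checks for RUN_TESTS specifically: any action type chosen three times in a row ends the episode. The counter logic can be sketched in isolation as below. The function `find_kill_switch_step` is hypothetical, and both the increment branch and the 1-based step numbering are assumptions, since the diff shows only the reset branch and the threshold check.

```python
from typing import Iterable, Optional, Tuple

KILL_SWITCH_LIMIT = 3  # threshold used in the diff


def find_kill_switch_step(action_types: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (step, action_type) at which the kill switch would fire, or None."""
    last_action_type: Optional[str] = None
    consecutive_same_action_count = 0
    for step, current_action_type in enumerate(action_types, start=1):
        if current_action_type == last_action_type:
            # Assumed branch: same action as the previous step.
            consecutive_same_action_count += 1
        else:
            # Reset branch, as shown in the diff context lines.
            consecutive_same_action_count = 1
            last_action_type = current_action_type
        if consecutive_same_action_count >= KILL_SWITCH_LIMIT:
            return step, current_action_type
    return None
```

Under the old condition, a run such as VIEW_CODE, VIEW_CODE, VIEW_CODE would not have tripped the switch. With the generalized check it stops at the third repetition.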