databoysu committed on
Commit d3f1b00 · 1 Parent(s): 26f55d2

improving thought and reasoning

Files changed (1)
  1. inference.py +49 -5
inference.py CHANGED
@@ -67,10 +67,10 @@ Output contract (strict):
67
  - If action_type is not REPLACE_LINES, set start_line=null, end_line=null, new_code_block=null.
68
  - If action_type is REPLACE_LINES, set start_line and end_line to exact integer keys from code_dict and provide new_code_block as replacement code only.
69
 
70
- Mandatory thought format:
71
- Observation: summarize concrete evidence from localized_context and/or last_execution_output.
72
- Diagnosis: identify the most likely root cause and exact line numbers to edit when applicable.
73
- Plan: choose the next action_type and justify it briefly.
74
 
75
  How to read last_execution_output correctly:
76
  - Prefer traceback and assertion text over assumptions.
@@ -96,6 +96,38 @@ Action policy:
96
  - After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
97
  - Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
98
 
99
  Submit gate (hard rule):
100
  - If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
101
  - If all-tests-passed signal is present, do SUBMIT immediately on this turn.
@@ -104,7 +136,7 @@ Self-check before finalizing response:
104
  - Is this valid JSON?
105
  - Are all values schema-valid primitive types?
106
  - Are nulls set correctly for non-REPLACE_LINES actions?
107
- - Does the thought have exactly 3 sentences in the required Observation/Diagnosis/Plan structure?
108
  """
109
 
110
 
@@ -181,12 +213,15 @@ def _build_observation_text(observation: Any) -> str:
181
  f"{line_num} | {text}"
182
  for line_num, text in sorted_items[:30]
183
  )
184
  return (
185
  f"step_count={observation.step_count}\n"
186
  f"steps_remaining={observation.steps_remaining}\n"
187
  f"syntax_error={observation.syntax_error}\n"
188
  f"pass_count_summary={pass_count_text}\n"
189
  f"all_tests_pass_signal={str(all_tests_pass_signal).lower()}\n"
190
  f"localized_context=\n{observation.localized_context}\n\n"
191
  f"last_execution_output=\n{last_execution_output}\n\n"
192
  f"code_preview=\n{code_preview}"
@@ -325,6 +360,15 @@ async def run(difficulty: Optional[str] = None, show_thought: bool = False) -> N
325
  obs_last_output = str(getattr(result.observation, "last_execution_output", "") or "")
326
  pass_count_text, all_tests_pass_signal = _extract_pass_signal_fields(obs_last_output)
327
  last_action = action_trajectory[-1] if action_trajectory else "none"
328
  history_messages.append(
329
  {
330
  "role": "user",
 
67
  - If action_type is not REPLACE_LINES, set start_line=null, end_line=null, new_code_block=null.
68
  - If action_type is REPLACE_LINES, set start_line and end_line to exact integer keys from code_dict and provide new_code_block as replacement code only.
69
 
70
+ Mandatory thought structure (scratchpad, no sentence cap):
71
+ - Thought must contain these three labeled sections in order: Observation:, Diagnosis:, Plan:.
72
+ - Each section can be multiple sentences and include detailed reasoning.
73
+ - Do not compress reasoning to a fixed sentence count.
74
 
75
  How to read last_execution_output correctly:
76
  - Prefer traceback and assertion text over assumptions.
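The revised thought rule in this hunk drops the sentence cap but still requires three labeled sections in a fixed order. A minimal sketch of how a caller might check that rule before accepting a response follows. `has_required_sections` is a hypothetical helper for illustration and is not part of inference.py.

```python
SECTION_LABELS = ("Observation:", "Diagnosis:", "Plan:")


def has_required_sections(thought: str) -> bool:
    """Return True if all three labels appear in order, each with non-empty text.

    Hypothetical check; the commit itself only states the rule in the prompt.
    """
    positions = []
    search_from = 0
    for label in SECTION_LABELS:
        idx = thought.find(label, search_from)
        if idx == -1:
            return False  # label missing, or it appears out of order
        positions.append(idx)
        search_from = idx + len(label)

    # Each section runs from the end of its label to the start of the next label.
    ends = positions[1:] + [len(thought)]
    for label, start, end in zip(SECTION_LABELS, positions, ends):
        if not thought[start + len(label):end].strip():
            return False  # section label present but no content
    return True
```

The check places no limit on sentence count, which matches the "no sentence cap" wording of the new rule.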
 
96
  - After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
97
  - Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
98
 
99
+ Worked examples (generic, no benchmark task leakage):
100
+
101
+ Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
102
+ Input evidence snippet:
103
+ - pass_count_summary=Tests Passed: 1/3
104
+ - all_tests_pass_signal=false
105
+ - last_execution_output contains traceback near line 12.
106
+ Valid thought:
107
+ Observation: pass_count_summary shows 1/3 and traceback is present, so test output is available and indicates a real failure near line 12. Diagnosis: logic at line 12 likely violates expected behavior; this is not a missing-output case and rerunning tests immediately would waste a step. Plan: use REPLACE_LINES on the implicated lines, then run RUN_TESTS once to verify.
108
+ Valid action JSON:
109
+ {"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":12,"end_line":13,"new_code_block":" # corrected code"}
110
+
111
+ Example 2: all tests passed after RUN_TESTS -> choose SUBMIT immediately
112
+ Input evidence snippet:
113
+ - pass_count_summary=Tests Passed: 3/3
114
+ - all_tests_pass_signal=true
115
+ - last_execution_output includes "SUCCESS: ALL TESTS PASSED".
116
+ Valid thought:
117
+ Observation: output explicitly shows Tests Passed: 3/3 and includes the success marker. Diagnosis: there is no remaining failing evidence and additional RUN_TESTS is unnecessary. Plan: choose SUBMIT now to end the episode.
118
+ Valid action JSON:
119
+ {"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"SUBMIT","start_line":null,"end_line":null,"new_code_block":null}
120
+
121
+ Example 3: tests ran but syntax error present -> choose REPLACE_LINES (or UNDO_EDIT)
122
+ Input evidence snippet:
123
+ - syntax_error=true
124
+ - pass_count_summary=Tests Passed: 0/3
125
+ - traceback includes SyntaxError with a file line reference.
126
+ Valid thought:
127
+ Observation: syntax_error=true and traceback provides a concrete syntax failure location. Diagnosis: code is syntactically invalid, so functional debugging is blocked until syntax is repaired. Plan: apply REPLACE_LINES at the indicated lines (or UNDO_EDIT if the latest edit caused this), then RUN_TESTS.
128
+ Valid action JSON:
129
+ {"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":8,"end_line":9,"new_code_block":" # syntax-fixed code"}
130
+
131
  Submit gate (hard rule):
132
  - If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
133
  - If all-tests-passed signal is present, do SUBMIT immediately on this turn.
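The output contract and the three worked examples lend themselves to a mechanical check on each action object. The sketch below assumes that code_dict is keyed by integer line numbers and that the response is a single JSON object. `validate_action` is a hypothetical helper, not code from this commit, and it covers only the null and line-key rules quoted in the prompt.

```python
import json
from typing import Collection, List

NULLABLE_FIELDS = ("start_line", "end_line", "new_code_block")


def validate_action(raw: str, code_dict_keys: Collection[int]) -> List[str]:
    """Return a list of contract violations; an empty list means the action passes."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(action, dict):
        return ["top-level JSON value must be an object"]

    errors: List[str] = []
    if action.get("action_type") == "REPLACE_LINES":
        for key in ("start_line", "end_line"):
            value = action.get(key)
            # bool is a subclass of int in Python, so reject it explicitly.
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or value not in code_dict_keys:
                errors.append(f"{key} must be an integer key from code_dict")
        if not isinstance(action.get("new_code_block"), str):
            errors.append("new_code_block must be a string of replacement code")
    else:
        # Every other action type must leave the edit fields null.
        for key in NULLABLE_FIELDS:
            if action.get(key) is not None:
                errors.append(f"{key} must be null for non-REPLACE_LINES actions")
    return errors
```

Returning a list of messages rather than raising makes it easy to feed the violations back to the model on the next turn, if the caller chooses to do so.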
 
136
  - Is this valid JSON?
137
  - Are all values schema-valid primitive types?
138
  - Are nulls set correctly for non-REPLACE_LINES actions?
139
+ - Does thought include Observation:, Diagnosis:, and Plan: sections with concrete evidence from this turn?
140
  """
141
 
142
 
 
213
  f"{line_num} | {text}"
214
  for line_num, text in sorted_items[:30]
215
  )
216
+ output_head_lines = "\n".join(last_execution_output.splitlines()[:8])
217
  return (
218
  f"step_count={observation.step_count}\n"
219
  f"steps_remaining={observation.steps_remaining}\n"
220
  f"syntax_error={observation.syntax_error}\n"
221
  f"pass_count_summary={pass_count_text}\n"
222
  f"all_tests_pass_signal={str(all_tests_pass_signal).lower()}\n"
223
+ f"last_execution_output_chars={len(last_execution_output)}\n"
224
+ f"last_execution_output_head=\n{output_head_lines}\n\n"
225
  f"localized_context=\n{observation.localized_context}\n\n"
226
  f"last_execution_output=\n{last_execution_output}\n\n"
227
  f"code_preview=\n{code_preview}"
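The two fields added in this hunk give the model a character count and the first eight lines of last_execution_output ahead of the full text. A small standalone sketch of that summary follows. `summarize_output` and its `head_lines` parameter are hypothetical names; only the field names and the default of 8 come from the diff.

```python
def summarize_output(last_execution_output: str, head_lines: int = 8) -> str:
    """Build the size and head fields for an execution output string."""
    # splitlines() returns [] for an empty string, so the head is then empty too.
    head = "\n".join(last_execution_output.splitlines()[:head_lines])
    return (
        f"last_execution_output_chars={len(last_execution_output)}\n"
        f"last_execution_output_head=\n{head}\n\n"
    )


if __name__ == "__main__":
    sample = "\n".join(f"line {i}" for i in range(20))
    print(summarize_output(sample))
```

A nonzero character count next to an empty-looking head is one way for the model to tell "output exists" apart from "no output", which is the case the action policy warns about.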
 
360
  obs_last_output = str(getattr(result.observation, "last_execution_output", "") or "")
361
  pass_count_text, all_tests_pass_signal = _extract_pass_signal_fields(obs_last_output)
362
  last_action = action_trajectory[-1] if action_trajectory else "none"
363
+ if show_thought:
364
+ output_preview = "\\n".join(obs_last_output.splitlines()[:6])
365
+ print("[OBS_DEBUG]", file=sys.stderr, flush=True)
366
+ print(
367
+ f"chars={len(obs_last_output)} pass_count={pass_count_text} all_pass={str(all_tests_pass_signal).lower()} last_action={last_action}",
368
+ file=sys.stderr,
369
+ flush=True,
370
+ )
371
+ print(output_preview if output_preview else "<empty last_execution_output>", file=sys.stderr, flush=True)
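The debug preview added here joins lines with "\\n", while _build_observation_text joins with "\n". In Python source the first is a literal backslash followed by the letter n, so the preview stays on a single line on stderr. The commit does not say whether that is intended. The sketch below only shows the difference between the two separators, using a hypothetical `preview` helper.

```python
def preview(text: str, limit: int = 6, one_line: bool = True) -> str:
    """Join the first `limit` lines with an escaped or a real newline."""
    # "\\n" is two characters (backslash, n); "\n" is a single newline character.
    separator = "\\n" if one_line else "\n"
    return separator.join(text.splitlines()[:limit])


if __name__ == "__main__":
    sample = "Tests Passed: 1/3\nTraceback (most recent call last):"
    print(preview(sample))                  # one physical line
    print(preview(sample, one_line=False))  # two physical lines
```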
372
  history_messages.append(
373
  {
374
  "role": "user",