databoysu committed
Commit d3f1b00
Parent(s): 26f55d2
improving thought and reasoning
inference.py +49 -5
inference.py CHANGED
|
@@ -67,10 +67,10 @@ Output contract (strict):
|
|
| 67 |
- If action_type is not REPLACE_LINES, set start_line=null, end_line=null, new_code_block=null.
|
| 68 |
- If action_type is REPLACE_LINES, set start_line and end_line to exact integer keys from code_dict and provide new_code_block as replacement code only.
|
| 69 |
|
| 70 |
-
Mandatory thought
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
|
| 75 |
How to read last_execution_output correctly:
|
| 76 |
- Prefer traceback and assertion text over assumptions.
|
|
@@ -96,6 +96,38 @@ Action policy:
|
|
| 96 |
- After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
|
| 97 |
- Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
|
| 98 |
|
| 99 |
Submit gate (hard rule):
|
| 100 |
- If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
|
| 101 |
- If all-tests-passed signal is present, do SUBMIT immediately on this turn.
|
|
@@ -104,7 +136,7 @@ Self-check before finalizing response:
|
|
| 104 |
- Is this valid JSON?
|
| 105 |
- Are all values schema-valid primitive types?
|
| 106 |
- Are nulls set correctly for non-REPLACE_LINES actions?
|
| 107 |
-
- Does
|
| 108 |
"""
|
| 109 |
|
| 110 |
|
|
@@ -181,12 +213,15 @@ def _build_observation_text(observation: Any) -> str:
|
|
| 181 |
f"{line_num} | {text}"
|
| 182 |
for line_num, text in sorted_items[:30]
|
| 183 |
)
|
| 184 |
return (
|
| 185 |
f"step_count={observation.step_count}\n"
|
| 186 |
f"steps_remaining={observation.steps_remaining}\n"
|
| 187 |
f"syntax_error={observation.syntax_error}\n"
|
| 188 |
f"pass_count_summary={pass_count_text}\n"
|
| 189 |
f"all_tests_pass_signal={str(all_tests_pass_signal).lower()}\n"
|
| 190 |
f"localized_context=\n{observation.localized_context}\n\n"
|
| 191 |
f"last_execution_output=\n{last_execution_output}\n\n"
|
| 192 |
f"code_preview=\n{code_preview}"
|
|
@@ -325,6 +360,15 @@ async def run(difficulty: Optional[str] = None, show_thought: bool = False) -> N
|
|
| 325 |
obs_last_output = str(getattr(result.observation, "last_execution_output", "") or "")
|
| 326 |
pass_count_text, all_tests_pass_signal = _extract_pass_signal_fields(obs_last_output)
|
| 327 |
last_action = action_trajectory[-1] if action_trajectory else "none"
|
| 328 |
history_messages.append(
|
| 329 |
{
|
| 330 |
"role": "user",
|
|
|
|
| 67 |
- If action_type is not REPLACE_LINES, set start_line=null, end_line=null, new_code_block=null.
|
| 68 |
- If action_type is REPLACE_LINES, set start_line and end_line to exact integer keys from code_dict and provide new_code_block as replacement code only.
|
| 69 |
|
| 70 |
+
Mandatory thought structure (scratchpad, no sentence cap):
|
| 71 |
+
- Thought must contain these three labeled sections in order: Observation:, Diagnosis:, Plan:.
|
| 72 |
+
- Each section can be multiple sentences and include detailed reasoning.
|
| 73 |
+
- Do not compress reasoning to a fixed sentence count.
|
| 74 |
|
| 75 |
How to read last_execution_output correctly:
|
| 76 |
- Prefer traceback and assertion text over assumptions.
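The three-section thought contract added in this hunk can be checked mechanically before a response is accepted. The following is a minimal sketch, not code from inference.py: the function name `thought_has_required_sections` and the rule that each section must contain some non-whitespace text are assumptions made for illustration.

```python
# Hypothetical helper (not part of inference.py): checks the
# "Observation: / Diagnosis: / Plan:" structure described in the prompt.

# Labels required by the prompt, in the required order.
REQUIRED_LABELS = ("Observation:", "Diagnosis:", "Plan:")


def thought_has_required_sections(thought: str) -> bool:
    """Return True if all three labels appear in order, each followed by text."""
    position = 0
    spans = []
    for label in REQUIRED_LABELS:
        index = thought.find(label, position)
        if index == -1:
            return False  # label missing, or it appears out of order
        spans.append((index, index + len(label)))
        position = index + len(label)

    # Assumption: a section counts only if it has non-whitespace content
    # before the next label (or before the end of the thought).
    for i, (_, end) in enumerate(spans):
        next_start = spans[i + 1][0] if i + 1 < len(spans) else len(thought)
        if not thought[end:next_start].strip():
            return False
    return True
```

There is deliberately no sentence or length cap in this check, which matches the "no sentence cap" wording of the new prompt text.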
|
|
|
|
| 96 |
- After RUN_TESTS, do not choose RUN_TESTS again immediately unless test evidence is genuinely missing.
|
| 97 |
- Treat "no output" as invalid reasoning when pass_count_summary or traceback text is present.
|
| 98 |
|
| 99 |
+
Worked examples (generic, no benchmark task leakage):
|
| 100 |
+
|
| 101 |
+
Example 1: failing tests after RUN_TESTS -> choose REPLACE_LINES
|
| 102 |
+
Input evidence snippet:
|
| 103 |
+
- pass_count_summary=Tests Passed: 1/3
|
| 104 |
+
- all_tests_pass_signal=false
|
| 105 |
+
- last_execution_output contains traceback near line 12.
|
| 106 |
+
Valid thought:
|
| 107 |
+
Observation: pass_count_summary shows 1/3 and traceback is present, so test output is available and indicates a real failure near line 12. Diagnosis: logic at line 12 likely violates expected behavior; this is not a missing-output case and rerunning tests immediately would waste a step. Plan: use REPLACE_LINES on the implicated lines, then run RUN_TESTS once to verify.
|
| 108 |
+
Valid action JSON:
|
| 109 |
+
{"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":12,"end_line":13,"new_code_block":" # corrected code"}
|
| 110 |
+
|
| 111 |
+
Example 2: all tests passed after RUN_TESTS -> choose SUBMIT immediately
|
| 112 |
+
Input evidence snippet:
|
| 113 |
+
- pass_count_summary=Tests Passed: 3/3
|
| 114 |
+
- all_tests_pass_signal=true
|
| 115 |
+
- last_execution_output includes "SUCCESS: ALL TESTS PASSED".
|
| 116 |
+
Valid thought:
|
| 117 |
+
Observation: output explicitly shows Tests Passed: 3/3 and includes the success marker. Diagnosis: there is no remaining failing evidence and additional RUN_TESTS is unnecessary. Plan: choose SUBMIT now to end the episode.
|
| 118 |
+
Valid action JSON:
|
| 119 |
+
{"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"SUBMIT","start_line":null,"end_line":null,"new_code_block":null}
|
| 120 |
+
|
| 121 |
+
Example 3: tests ran but syntax error present -> choose REPLACE_LINES (or UNDO_EDIT)
|
| 122 |
+
Input evidence snippet:
|
| 123 |
+
- syntax_error=true
|
| 124 |
+
- pass_count_summary=Tests Passed: 0/3
|
| 125 |
+
- traceback includes SyntaxError with a file line reference.
|
| 126 |
+
Valid thought:
|
| 127 |
+
Observation: syntax_error=true and traceback provides a concrete syntax failure location. Diagnosis: code is syntactically invalid, so functional debugging is blocked until syntax is repaired. Plan: apply REPLACE_LINES at the indicated lines (or UNDO_EDIT if the latest edit caused this), then RUN_TESTS.
|
| 128 |
+
Valid action JSON:
|
| 129 |
+
{"thought":"Observation: ... Diagnosis: ... Plan: ...","action_type":"REPLACE_LINES","start_line":8,"end_line":9,"new_code_block":" # syntax-fixed code"}
|
| 130 |
+
|
| 131 |
Submit gate (hard rule):
|
| 132 |
- If any failure, error, traceback, xfailed/unfinished signal, or uncertainty remains, do not SUBMIT.
|
| 133 |
- If all-tests-passed signal is present, do SUBMIT immediately on this turn.
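The submit gate above is a hard rule, so it could also be enforced in code rather than left to the model alone. A minimal sketch under stated assumptions: the function name `submit_allowed` and the list of failure markers are hypothetical, and the substring scan is intentionally conservative (it may refuse SUBMIT on harmless text that happens to contain "error").

```python
# Hypothetical guard (not part of inference.py) mirroring the submit gate.

# Assumed failure markers; "failed" also covers "xfailed".
FAILURE_MARKERS = ("traceback", "error", "failed")


def submit_allowed(
    all_tests_pass_signal: bool,
    syntax_error: bool,
    last_execution_output: str,
) -> bool:
    """Return True only when no failure evidence remains and tests passed."""
    if syntax_error:
        return False
    lowered = last_execution_output.lower()
    if any(marker in lowered for marker in FAILURE_MARKERS):
        return False
    # With no failure evidence left, the pass signal decides.
    return all_tests_pass_signal
```

A caller could downgrade a SUBMIT action to RUN_TESTS whenever this returns False, though that policy is a design choice outside what the diff shows.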
|
|
|
|
| 136 |
- Is this valid JSON?
|
| 137 |
- Are all values schema-valid primitive types?
|
| 138 |
- Are nulls set correctly for non-REPLACE_LINES actions?
|
| 139 |
+
- Does thought include Observation:, Diagnosis:, and Plan: sections with concrete evidence from this turn?
|
| 140 |
"""
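The self-check questions in the prompt (valid JSON, primitive types, nulls for non-REPLACE_LINES actions) translate directly into a small validator. This is a sketch only: `check_action` is a hypothetical name, the field names are taken from the example JSON in the prompt, and it does not verify that start_line and end_line are actual keys of code_dict.

```python
import json

# Fields that must be null unless action_type is REPLACE_LINES.
NULLABLE_FIELDS = ("start_line", "end_line", "new_code_block")


def check_action(raw: str) -> list:
    """Return a list of problems found in a raw action string (empty if none)."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(action, dict):
        return ["top-level value must be an object"]

    problems = []
    if not isinstance(action.get("thought"), str):
        problems.append("thought must be a string")

    if action.get("action_type") == "REPLACE_LINES":
        for key in ("start_line", "end_line"):
            value = action.get(key)
            # bool is a subclass of int in Python, so exclude it explicitly.
            if not isinstance(value, int) or isinstance(value, bool):
                problems.append(f"{key} must be an integer")
        if not isinstance(action.get("new_code_block"), str):
            problems.append("new_code_block must be a string")
    else:
        for key in NULLABLE_FIELDS:
            if action.get(key) is not None:
                problems.append(f"{key} must be null")
    return problems
```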
|
| 141 |
|
| 142 |
|
|
|
|
| 213 |
f"{line_num} | {text}"
|
| 214 |
for line_num, text in sorted_items[:30]
|
| 215 |
)
|
| 216 |
+
output_head_lines = "\n".join(last_execution_output.splitlines()[:8])
|
| 217 |
return (
|
| 218 |
f"step_count={observation.step_count}\n"
|
| 219 |
f"steps_remaining={observation.steps_remaining}\n"
|
| 220 |
f"syntax_error={observation.syntax_error}\n"
|
| 221 |
f"pass_count_summary={pass_count_text}\n"
|
| 222 |
f"all_tests_pass_signal={str(all_tests_pass_signal).lower()}\n"
|
| 223 |
+
f"last_execution_output_chars={len(last_execution_output)}\n"
|
| 224 |
+
f"last_execution_output_head=\n{output_head_lines}\n\n"
|
| 225 |
f"localized_context=\n{observation.localized_context}\n\n"
|
| 226 |
f"last_execution_output=\n{last_execution_output}\n\n"
|
| 227 |
f"code_preview=\n{code_preview}"
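The two fields added in this hunk give the model a character count and the first eight lines of the test output ahead of the full text. The same computation in isolation looks roughly like this; `summarize_output` and its `head_lines` parameter are illustrative names, not functions from inference.py.

```python
def summarize_output(last_execution_output: str, head_lines: int = 8) -> str:
    """Build the size and head-of-output fields used in the observation text."""
    # splitlines() returns [] for an empty string, so the head is "" in that case.
    head = "\n".join(last_execution_output.splitlines()[:head_lines])
    return (
        f"last_execution_output_chars={len(last_execution_output)}\n"
        f"last_execution_output_head=\n{head}\n\n"
    )
```

A non-zero character count next to an empty-looking head makes it harder for the model to claim there was "no output" when output was in fact present, which is the failure mode the action policy warns about.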
|
|
|
|
| 360 |
obs_last_output = str(getattr(result.observation, "last_execution_output", "") or "")
|
| 361 |
pass_count_text, all_tests_pass_signal = _extract_pass_signal_fields(obs_last_output)
|
| 362 |
last_action = action_trajectory[-1] if action_trajectory else "none"
|
| 363 |
+
if show_thought:
|
| 364 |
+
output_preview = "\n".join(obs_last_output.splitlines()[:6])
|
| 365 |
+
print("[OBS_DEBUG]", file=sys.stderr, flush=True)
|
| 366 |
+
print(
|
| 367 |
+
f"chars={len(obs_last_output)} pass_count={pass_count_text} all_pass={str(all_tests_pass_signal).lower()} last_action={last_action}",
|
| 368 |
+
file=sys.stderr,
|
| 369 |
+
flush=True,
|
| 370 |
+
)
|
| 371 |
+
print(output_preview if output_preview else "<empty last_execution_output>", file=sys.stderr, flush=True)
|
| 372 |
history_messages.append(
|
| 373 |
{
|
| 374 |
"role": "user",
|
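The diff calls `_extract_pass_signal_fields` but does not show its body. For readers following along, here is one way such a helper might work, assuming the output contains a line of the form "Tests Passed: N/M" as in the worked examples. The name `extract_pass_signal_sketch`, the regular expression, and the "unknown" fallback are all assumptions and may differ from the real implementation.

```python
import re
from typing import Tuple

# Assumed summary format, taken from the worked examples in the prompt.
PASS_COUNT_RE = re.compile(r"Tests Passed:\s*(\d+)\s*/\s*(\d+)")


def extract_pass_signal_sketch(output: str) -> Tuple[str, bool]:
    """Return (pass-count summary text, all-tests-pass flag) from test output."""
    match = PASS_COUNT_RE.search(output)
    if match is None:
        # No summary line found: report that explicitly and never signal a pass.
        return "unknown", False
    passed, total = int(match.group(1)), int(match.group(2))
    summary = f"Tests Passed: {passed}/{total}"
    # Require at least one test so that "0/0" is not treated as success.
    return summary, total > 0 and passed == total
```

This sketch relies on the counts alone; a stricter version could also require the "SUCCESS: ALL TESTS PASSED" marker mentioned in Example 2.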