Spaces:
Sleeping
Sleeping
Remove blocking wait_for_aristotle — 9x proof throughput improvement
Browse filesThe agent was sitting idle for 2+ hours per Aristotle wait, doing zero
proof attempts. Now it uses non-blocking check_aristotle_status +
get_aristotle_result, continuing to actively prove while Aristotle runs.
Result: B4 did 36 proof attempts in 12.5 min vs B2's 4 in 5 min
before the old blocking wait kicked in.
- Remove wait_for_aristotle from tool definitions
- Remove _poll_aristotle() blocking loop
- Add graceful fallback if LLM still calls wait_for_aristotle
- Update system prompt: "CRITICAL: Never call wait_for_aristotle"
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
agent.py
CHANGED
|
@@ -12,8 +12,6 @@ from config import (
|
|
| 12 |
OUTPUT_COST_PER_TOKEN,
|
| 13 |
MAX_COST_PER_QUERY,
|
| 14 |
MAX_AGENT_ITERATIONS,
|
| 15 |
-
ARISTOTLE_POLL_INTERVAL,
|
| 16 |
-
ARISTOTLE_MAX_POLLS,
|
| 17 |
)
|
| 18 |
from tools import (
|
| 19 |
search_theorems,
|
|
@@ -60,12 +58,18 @@ Write Lean 4 code yourself and verify with **check_lean_code** (Axle — takes s
|
|
| 60 |
|
| 61 |
### Phase 3: Aristotle + active proving
|
| 62 |
If Axle verification fails after several attempts:
|
| 63 |
-
1. **Submit to Aristotle** —
|
| 64 |
-
2. **
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
### Phase 4: Final answer
|
| 71 |
Call **final_answer** with:
|
|
@@ -75,8 +79,9 @@ Call **final_answer** with:
|
|
| 75 |
- Whether verification succeeded
|
| 76 |
|
| 77 |
## Key principles
|
| 78 |
-
- **NEVER sit idle.** Always be actively trying to prove the result.
|
| 79 |
- Write ALL Lean code yourself — you are the prover. Aristotle is your backup.
|
|
|
|
| 80 |
- The code MUST contain `theorem` or `lemma` declarations.
|
| 81 |
- When Aristotle returns code with sorry, DECOMPOSE and resubmit. Don't give up.
|
| 82 |
- For false statements, PROVE THE NEGATION.
|
|
@@ -259,9 +264,15 @@ async def run_agent_job(job: JobState) -> None:
|
|
| 259 |
job.save()
|
| 260 |
return
|
| 261 |
|
| 262 |
-
#
|
| 263 |
if fn_name == "wait_for_aristotle":
|
| 264 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 265 |
job.messages.append({
|
| 266 |
"role": "tool",
|
| 267 |
"tool_call_id": tool_call.id,
|
|
@@ -390,49 +401,8 @@ async def _handle_tool_call(fn_name: str, fn_args: dict, job: JobState) -> str:
|
|
| 390 |
return json.dumps({"error": f"Unknown tool: {fn_name}"})
|
| 391 |
|
| 392 |
|
| 393 |
-
|
| 394 |
-
|
| 395 |
-
short_id = project_id[:8]
|
| 396 |
-
max_wait_min = (ARISTOTLE_MAX_POLLS * ARISTOTLE_POLL_INTERVAL) // 60
|
| 397 |
-
job.add_status(f"**Waiting for Aristotle** [{short_id}] (timeout: {max_wait_min} min)...")
|
| 398 |
-
job.save()
|
| 399 |
-
|
| 400 |
-
for poll_idx in range(ARISTOTLE_MAX_POLLS):
|
| 401 |
-
await asyncio.sleep(ARISTOTLE_POLL_INTERVAL)
|
| 402 |
-
info = await check_aristotle_status(project_id)
|
| 403 |
-
|
| 404 |
-
if "error" in info:
|
| 405 |
-
job.add_status(f"Aristotle [{short_id}] error: {info['error']}")
|
| 406 |
-
job.save()
|
| 407 |
-
return json.dumps(info)
|
| 408 |
-
|
| 409 |
-
status = info.get("status", "UNKNOWN")
|
| 410 |
-
pct = info.get("percent_complete")
|
| 411 |
-
elapsed = (poll_idx + 1) * ARISTOTLE_POLL_INTERVAL
|
| 412 |
-
elapsed_min = elapsed // 60
|
| 413 |
-
elapsed_sec = elapsed % 60
|
| 414 |
-
pct_str = f" ({pct}%)" if pct is not None else ""
|
| 415 |
-
time_str = f"{elapsed_min}m{elapsed_sec:02d}s" if elapsed_min else f"{elapsed}s"
|
| 416 |
-
job.add_status(f"Aristotle [{short_id}]: {status}{pct_str} — {time_str}")
|
| 417 |
-
|
| 418 |
-
# Update aristotle_jobs
|
| 419 |
-
for aj in job.aristotle_jobs:
|
| 420 |
-
if aj.get("project_id") == project_id:
|
| 421 |
-
aj["status"] = status
|
| 422 |
-
aj["percent_complete"] = pct
|
| 423 |
-
|
| 424 |
-
job.save()
|
| 425 |
-
|
| 426 |
-
if status in TERMINAL_STATUSES:
|
| 427 |
-
if status in ("COMPLETE", "COMPLETE_WITH_ERRORS"):
|
| 428 |
-
job.add_status(f"Aristotle [{short_id}]: downloading result...")
|
| 429 |
-
job.save()
|
| 430 |
-
return await get_aristotle_result(project_id)
|
| 431 |
-
return f"Aristotle project finished with status: {status}"
|
| 432 |
-
|
| 433 |
-
job.add_status(f"**Aristotle [{short_id}] timed out** after {max_wait_min} minutes")
|
| 434 |
-
job.save()
|
| 435 |
-
return f"Aristotle [{short_id}] timed out"
|
| 436 |
|
| 437 |
|
| 438 |
async def _maybe_auto_finalize(
|
|
|
|
| 12 |
OUTPUT_COST_PER_TOKEN,
|
| 13 |
MAX_COST_PER_QUERY,
|
| 14 |
MAX_AGENT_ITERATIONS,
|
|
|
|
|
|
|
| 15 |
)
|
| 16 |
from tools import (
|
| 17 |
search_theorems,
|
|
|
|
| 58 |
|
| 59 |
### Phase 3: Aristotle + active proving
|
| 60 |
If Axle verification fails after several attempts:
|
| 61 |
+
1. **Submit to Aristotle** — preferably Lean code with `sorry` placeholders.
|
| 62 |
+
2. **DO NOT WAIT for Aristotle.** Immediately continue trying to prove it yourself:
|
| 63 |
+
- Try different proof strategies with check_lean_code.
|
| 64 |
+
- Search for more Mathlib declarations.
|
| 65 |
+
- Try decomposing into smaller lemmas.
|
| 66 |
+
3. **Periodically check Aristotle** with check_aristotle_status (every 5-10 iterations).
|
| 67 |
+
4. **When Aristotle is COMPLETE**: call get_aristotle_result, verify the code with check_lean_code.
|
| 68 |
+
5. **If Aristotle returns with sorry**: identify sorry'd sub-lemmas, submit EACH to Aristotle as a new job. Try to prove them yourself too.
|
| 69 |
+
6. **If Aristotle returns sorry-free**: verify with check_lean_code.
|
| 70 |
+
7. Keep iterating until all sorries are filled or budget is exhausted.
|
| 71 |
+
|
| 72 |
+
**CRITICAL: Never call wait_for_aristotle. Always keep actively proving.**
|
| 73 |
|
| 74 |
### Phase 4: Final answer
|
| 75 |
Call **final_answer** with:
|
|
|
|
| 79 |
- Whether verification succeeded
|
| 80 |
|
| 81 |
## Key principles
|
| 82 |
+
- **NEVER sit idle.** Always be actively trying to prove the result. NEVER call wait_for_aristotle.
|
| 83 |
- Write ALL Lean code yourself — you are the prover. Aristotle is your backup.
|
| 84 |
+
- After submitting to Aristotle, IMMEDIATELY try proving it yourself. Check Aristotle every 5-10 iterations.
|
| 85 |
- The code MUST contain `theorem` or `lemma` declarations.
|
| 86 |
- When Aristotle returns code with sorry, DECOMPOSE and resubmit. Don't give up.
|
| 87 |
- For false statements, PROVE THE NEGATION.
|
|
|
|
| 264 |
job.save()
|
| 265 |
return
|
| 266 |
|
| 267 |
+
# wait_for_aristotle removed — redirect to non-blocking check
|
| 268 |
if fn_name == "wait_for_aristotle":
|
| 269 |
+
project_id = fn_args.get("project_id", "")
|
| 270 |
+
info = await check_aristotle_status(project_id)
|
| 271 |
+
status = info.get("status", "UNKNOWN")
|
| 272 |
+
if status in ("COMPLETE", "COMPLETE_WITH_ERRORS"):
|
| 273 |
+
result = await get_aristotle_result(project_id)
|
| 274 |
+
else:
|
| 275 |
+
result = json.dumps(info) + "\n\nAristotle is still running. Do NOT wait — keep working on the proof yourself. Check back later with check_aristotle_status."
|
| 276 |
job.messages.append({
|
| 277 |
"role": "tool",
|
| 278 |
"tool_call_id": tool_call.id,
|
|
|
|
| 401 |
return json.dumps({"error": f"Unknown tool: {fn_name}"})
|
| 402 |
|
| 403 |
|
| 404 |
+
# _poll_aristotle removed — blocking wait was the #1 bottleneck.
|
| 405 |
+
# Agent now uses check_aristotle_status (non-blocking) + get_aristotle_result.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 406 |
|
| 407 |
|
| 408 |
async def _maybe_auto_finalize(
|
log.md
CHANGED
|
@@ -47,3 +47,40 @@ The self-review needs to be stricter. Currently it can't distinguish between a t
|
|
| 47 |
1. Two-stage review: first check the theorem statement against the question (ignoring comments), then check the proof
|
| 48 |
2. Require the reviewer to extract the Lean theorem statement and compare it to the NL question explicitly
|
| 49 |
3. Use a stronger model for review (e.g., Claude) if budget allows
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
1. Two-stage review: first check the theorem statement against the question (ignoring comments), then check the proof
|
| 48 |
2. Require the reviewer to extract the Lean theorem statement and compare it to the NL question explicitly
|
| 49 |
3. Use a stronger model for review (e.g., Claude) if budget allows
|
| 50 |
+
|
| 51 |
+
## Iteration 2 — 2026-03-22 19:30 PDT
|
| 52 |
+
|
| 53 |
+
### Diagnosis
|
| 54 |
+
The #1 bottleneck from iteration 1: **agent sits completely idle while waiting for Aristotle** (2+ hours per wait). The `wait_for_aristotle` tool blocks the entire iteration loop. Despite the system prompt saying "NEVER sit idle", the architecture forced idleness.
|
| 55 |
+
|
| 56 |
+
Comparison: B2 reached iter 13 in 5 minutes, then sat idle for 3.5 hours. A2 similarly burned hours waiting.
|
| 57 |
+
|
| 58 |
+
### Fix: Remove blocking Aristotle wait (HIGH IMPACT)
|
| 59 |
+
- **Removed `wait_for_aristotle`** from tool definitions entirely
|
| 60 |
+
- Agent must now use `check_aristotle_status` (non-blocking) + `get_aristotle_result`
|
| 61 |
+
- If agent still calls `wait_for_aristotle`, graceful fallback: do a single status check and tell agent to keep working
|
| 62 |
+
- Removed `_poll_aristotle()` function (was 40 lines of blocking loop)
|
| 63 |
+
- Updated system prompt: "CRITICAL: Never call wait_for_aristotle. Always keep actively proving."
|
| 64 |
+
|
| 65 |
+
### Test Results
|
| 66 |
+
|
| 67 |
+
| Problem | Time | Iterations | check_lean_code | Aristotle jobs | Behavior |
|
| 68 |
+
|---------|------|------------|-----------------|----------------|----------|
|
| 69 |
+
| **B4** (matrix ineq, new) | 12.5m | 67 | 36 | 3 submitted, 3 results | **Non-blocking!** |
|
| 70 |
+
| **B2** (centroid, from iter 1) | 3.5h | 58 | 18 | 3 | Completed (with sorry) |
|
| 71 |
+
| **A2** (sin bounds, from iter 1) | 3.5h+ | 35 | 27 | 4 | Still running |
|
| 72 |
+
|
| 73 |
+
**Key metric: B4 did 36 proof attempts in 12.5 min vs B2's 4 in 5 min before Aristotle blocked it.** That's ~9x higher throughput of proof attempts per unit time.
|
| 74 |
+
|
| 75 |
+
B4 behavior: submitted to Aristotle → immediately kept trying proofs → checked Aristotle status every few iterations → downloaded result when available → submitted new job with decomposed sub-lemmas → repeat. Zero idle time.
|
| 76 |
+
|
| 77 |
+
### Iteration 1 job updates
|
| 78 |
+
- **B2**: Finally completed after 3.5h, but proof has sorry (hard real analysis). Formalization attempt is correct (right theorem statement, integrability established).
|
| 79 |
+
- **A2**: Still running at 3.5h+ with 4 Aristotle jobs. Making progress (iter 35, $0.82).
|
| 80 |
+
|
| 81 |
+
### Code Changes
|
| 82 |
+
- `agent.py`: Removed `_poll_aristotle()`, removed `wait_for_aristotle` handler (replaced with non-blocking fallback), updated system prompt to prohibit waiting, removed ARISTOTLE_POLL_INTERVAL/MAX_POLLS imports
|
| 83 |
+
- `tools.py`: Removed `wait_for_aristotle` from TOOL_DEFINITIONS
|
| 84 |
+
|
| 85 |
+
### Biggest Improvement Opportunity
|
| 86 |
+
Self-review is still too lenient (from iteration 1 analysis). The non-blocking Aristotle fix was higher priority and is now done. Next: make self-review check the theorem *statement* against the question, not just the theorem *name*.
|
tools.py
CHANGED
|
@@ -311,8 +311,8 @@ TOOL_DEFINITIONS = [
|
|
| 311 |
"description": (
|
| 312 |
"Submit a proof request to Aristotle (automated theorem prover). "
|
| 313 |
"Aristotle proves graduate/research-level math in Lean 4. "
|
| 314 |
-
"Returns a project_id immediately — use check_aristotle_status
|
| 315 |
-
"
|
| 316 |
"PREFERRED FORMAT: Lean 4 code with sorry placeholders. This preserves "
|
| 317 |
"the exact theorem signature and lets Aristotle focus on filling proofs:\n"
|
| 318 |
" 'Fill in the sorries:\\n```lean\\nimport Mathlib\\n\\ntheorem my_thm ... := by\\n sorry\\n```'\n\n"
|
|
@@ -358,27 +358,8 @@ TOOL_DEFINITIONS = [
|
|
| 358 |
},
|
| 359 |
},
|
| 360 |
},
|
| 361 |
-
|
| 362 |
-
|
| 363 |
-
"function": {
|
| 364 |
-
"name": "wait_for_aristotle",
|
| 365 |
-
"description": (
|
| 366 |
-
"Wait for an Aristotle project to complete (polls every 30s, up to 20 min). "
|
| 367 |
-
"Returns the Lean code produced. Only call this when you're ready to "
|
| 368 |
-
"collect results — do other work first."
|
| 369 |
-
),
|
| 370 |
-
"parameters": {
|
| 371 |
-
"type": "object",
|
| 372 |
-
"properties": {
|
| 373 |
-
"project_id": {
|
| 374 |
-
"type": "string",
|
| 375 |
-
"description": "The project_id from submit_to_aristotle.",
|
| 376 |
-
}
|
| 377 |
-
},
|
| 378 |
-
"required": ["project_id"],
|
| 379 |
-
},
|
| 380 |
-
},
|
| 381 |
-
},
|
| 382 |
{
|
| 383 |
"type": "function",
|
| 384 |
"function": {
|
|
|
|
| 311 |
"description": (
|
| 312 |
"Submit a proof request to Aristotle (automated theorem prover). "
|
| 313 |
"Aristotle proves graduate/research-level math in Lean 4. "
|
| 314 |
+
"Returns a project_id immediately — use check_aristotle_status to poll "
|
| 315 |
+
"and get_aristotle_result when COMPLETE.\n\n"
|
| 316 |
"PREFERRED FORMAT: Lean 4 code with sorry placeholders. This preserves "
|
| 317 |
"the exact theorem signature and lets Aristotle focus on filling proofs:\n"
|
| 318 |
" 'Fill in the sorries:\\n```lean\\nimport Mathlib\\n\\ntheorem my_thm ... := by\\n sorry\\n```'\n\n"
|
|
|
|
| 358 |
},
|
| 359 |
},
|
| 360 |
},
|
| 361 |
+
# wait_for_aristotle REMOVED — it blocked the agent for 2+ hours.
|
| 362 |
+
# Agent must use check_aristotle_status (non-blocking) + get_aristotle_result instead.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 363 |
{
|
| 364 |
"type": "function",
|
| 365 |
"function": {
|