Vilin97 Claude Opus 4.6 (1M context) commited on
Commit
c262c9c
·
1 Parent(s): 0a40264

Remove Qwen 3.5, fix sorry/theorem validation, markdown emails

Browse files

Major changes:
- Remove Qwen 3.5 tool — orchestrator writes Lean directly
- Auto-finalize now requires: okay=true AND sorry-free AND has theorem/lemma
- Sorry code triggers "proof incomplete, continuing..." not false VERIFIED
- Code without theorem/lemma declarations rejected from auto-finalize
- Email body renders markdown properly (headers, bold, code, lists)
- LaTeX rendered as serif italic in emails
- System prompt updated for iterative Aristotle decomposition
- ISSUES.md documents all 7 issues from Putnam email review

Retest A6 (correct transcription): verified, sorry-free, has_theorem, 1505-char explanation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (3) hide show
  1. agent.py +23 -30
  2. email_sender.py +98 -1
  3. tools.py +0 -82
agent.py CHANGED
@@ -19,7 +19,6 @@ from tools import (
19
  search_lean_library,
20
  search_loogle,
21
  check_lean_code,
22
- generate_lean_proof,
23
  submit_to_aristotle,
24
  check_aristotle_status,
25
  get_aristotle_result,
@@ -52,40 +51,36 @@ Do NOT just refuse — always try to prove the negation. This is the most valuab
52
  3. Search Mathlib by type pattern with **search_loogle** (e.g. "_ + _ = _ + _", "Nat → Nat → Prop").
53
 
54
  ### Phase 2: Fast attempt (try this first)
55
- Try to prove the result using one or more of:
56
- - Write Lean 4 code yourself and verify with **check_lean_code** (Axletakes seconds).
57
- - Use **generate_lean_proof** to have Qwen 3.5 write the proof, then verify with check_lean_code.
58
- - **CRITICAL: If check_lean_code returns okay=true, IMMEDIATELY call final_answer. Do not wait for Aristotle or do any more work.**
59
- - If errors: try to fix and re-check. Alternate between writing code yourself and using generate_lean_proof.
60
  - If the statement seems false: try proving the NEGATION instead.
 
61
 
62
  ### Phase 3: Aristotle + active proving (if fast attempt fails)
63
  If Axle verification fails after several attempts:
64
- 1. **Submit to Aristotle** — decompose into main result + sub-lemmas, submit ALL.
65
- 2. **DO NOT WAIT for Aristotle.** Instead, keep actively trying to close the proof yourself:
66
- - Use **generate_lean_proof** (Qwen 3.5) with different prompts and hints.
67
- - Search for more Mathlib declarations with search_lean_library and search_loogle.
68
- - Write Lean code yourself with different proof strategies.
69
- - Verify each attempt with check_lean_code.
70
- - **If ANY attempt verifies (okay=true), IMMEDIATELY call final_answer. Do not wait for Aristotle.**
71
- 3. **Periodically check Aristotle** with check_aristotle_status but only between your own proof attempts, never as the primary activity.
72
- 4. If Aristotle completes, download with get_aristotle_result and verify with check_lean_code.
73
- 5. The goal is to RACE: you and Qwen try to prove it while Aristotle also works on it. Whoever finishes first wins.
74
 
75
  ### Phase 4: Final answer
76
  Call **final_answer** with:
77
  - Clear natural language explanation (with LaTeX math)
78
  - Complete verified Lean 4 code (single file, starts with `import Mathlib`)
 
79
  - Whether verification succeeded
80
 
81
  ## Key principles
82
  - **NEVER sit idle.** Always be actively trying to prove the result.
83
- - If check_lean_code says okay=true, IMMEDIATELY call final_answer. This is the highest priority.
84
- - Use BOTH your own code AND generate_lean_proof (Qwen 3.5) try different approaches with different hints.
85
- - Submit to Aristotle early, but keep proving don't call wait_for_aristotle until you've made at least 10 proof attempts.
86
- - When you DO call wait_for_aristotle and it returns, ALWAYS try to verify the result with check_lean_code.
87
- - **NEVER call wait_for_aristotle as your FIRST action after submitting.** Always try at least 5-10 more proof attempts first.
88
- - If after many attempts nothing verifies: call final_answer with the BEST (fewest errors) Lean code you have, verified=false. Include the error messages in your answer. An unverified answer with good Lean code is better than no answer.
89
  - For false statements, PROVE THE NEGATION — don't just refuse.
90
  """
91
 
@@ -294,8 +289,14 @@ async def run_agent(question: str):
294
  "sorry" in code
295
  or any("sorry" in e for e in tool_errors)
296
  )
 
 
 
 
297
  if has_sorry:
298
  add_status("Lean code compiles but contains `sorry` — proof incomplete, continuing...")
 
 
299
  else:
300
  add_status("Lean code verified (sorry-free)! Auto-finalizing...")
301
  log_detail("## Auto-finalize: Axle okay=true, no sorry detected")
@@ -416,14 +417,6 @@ async def _handle_tool_call(fn_name: str, fn_args: dict,
416
  add_status(f'Searching Loogle: "{query}"...')
417
  return await search_loogle(query)
418
 
419
- if fn_name == "generate_lean_proof":
420
- statement = fn_args.get("statement", "")
421
- short_stmt = statement[:80].replace("\n", " ")
422
- add_status(f'Generating proof with Qwen 3.5: "{short_stmt}"...')
423
- result = await generate_lean_proof(statement, fn_args.get("context", ""))
424
- add_status(f"Qwen 3.5 generated {len(result)} chars of Lean code")
425
- return result
426
-
427
  if fn_name == "check_lean_code":
428
  add_status("Verifying Lean code with Axle...")
429
  result = await check_lean_code(fn_args.get("code", ""))
 
19
  search_lean_library,
20
  search_loogle,
21
  check_lean_code,
 
22
  submit_to_aristotle,
23
  check_aristotle_status,
24
  get_aristotle_result,
 
51
  3. Search Mathlib by type pattern with **search_loogle** (e.g. "_ + _ = _ + _", "Nat → Nat → Prop").
52
 
53
  ### Phase 2: Fast attempt (try this first)
54
+ Write Lean 4 code yourself and verify with **check_lean_code** (Axle — takes seconds).
55
+ - The code MUST contain at least one `theorem` or `lemma` declaration not just `example` or `#check`.
56
+ - If errors: analyze the error, fix, and re-check.
 
 
57
  - If the statement seems false: try proving the NEGATION instead.
58
+ - Note: the system will automatically finalize when Axle returns okay=true AND the code is sorry-free. You don't need to call final_answer yourself after a successful check.
59
 
60
  ### Phase 3: Aristotle + active proving (if fast attempt fails)
61
  If Axle verification fails after several attempts:
62
+ 1. **Submit to Aristotle** — submit the main result as a natural language prompt.
63
+ 2. **Keep actively trying** — search more declarations, try different proof strategies, verify with check_lean_code. Don't just wait.
64
+ 3. **Periodically check Aristotle** with check_aristotle_status.
65
+ 4. **If Aristotle returns with sorry**: this is NOT a failure — it's progress. Take Aristotle's output, identify which sub-lemmas still have sorry, and:
66
+ - Submit EACH sorry'd sub-lemma to Aristotle as a NEW job
67
+ - Try to prove the sorry'd sub-lemmas yourself
68
+ - Search Mathlib for the sorry'd statements
69
+ - Keep iterating until all sorries are filled
70
+ 5. If Aristotle completes sorry-free, verify with check_lean_code.
 
71
 
72
  ### Phase 4: Final answer
73
  Call **final_answer** with:
74
  - Clear natural language explanation (with LaTeX math)
75
  - Complete verified Lean 4 code (single file, starts with `import Mathlib`)
76
+ - The code MUST declare at least one `theorem` or `lemma`
77
  - Whether verification succeeded
78
 
79
  ## Key principles
80
  - **NEVER sit idle.** Always be actively trying to prove the result.
81
+ - Write ALL Lean code yourself you are the prover. Aristotle is your backup.
82
+ - The code MUST contain `theorem` or `lemma` declarationsnever just `example`, `#check`, or `sorry`.
83
+ - When Aristotle returns code with sorry, DECOMPOSE the sorry'd lemmas and resubmit. Don't give up.
 
 
 
84
  - For false statements, PROVE THE NEGATION — don't just refuse.
85
  """
86
 
 
289
  "sorry" in code
290
  or any("sorry" in e for e in tool_errors)
291
  )
292
+ has_theorem = any(
293
+ line.strip().startswith(("theorem ", "lemma "))
294
+ for line in code.split("\n")
295
+ )
296
  if has_sorry:
297
  add_status("Lean code compiles but contains `sorry` — proof incomplete, continuing...")
298
+ elif not has_theorem:
299
+ add_status("Lean code compiles but has no theorem/lemma declarations — continuing...")
300
  else:
301
  add_status("Lean code verified (sorry-free)! Auto-finalizing...")
302
  log_detail("## Auto-finalize: Axle okay=true, no sorry detected")
 
417
  add_status(f'Searching Loogle: "{query}"...')
418
  return await search_loogle(query)
419
 
 
 
 
 
 
 
 
 
420
  if fn_name == "check_lean_code":
421
  add_status("Verifying Lean code with Axle...")
422
  result = await check_lean_code(fn_args.get("code", ""))
email_sender.py CHANGED
@@ -89,7 +89,7 @@ def send_result_email(
89
 
90
  <h3>Answer</h3>
91
  <div style="background: #f8f9fa; padding: 16px; border-radius: 8px; margin: 12px 0;">
92
- {_escape(answer).replace(chr(10), '<br>')}
93
  </div>
94
 
95
  <h3>Lean 4 Code ({badge})</h3>
@@ -163,3 +163,100 @@ def _escape(text: str) -> str:
163
  .replace(">", "&gt;")
164
  .replace('"', "&quot;")
165
  )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
89
 
90
  <h3>Answer</h3>
91
  <div style="background: #f8f9fa; padding: 16px; border-radius: 8px; margin: 12px 0;">
92
+ {_md_to_html(answer)}
93
  </div>
94
 
95
  <h3>Lean 4 Code ({badge})</h3>
 
163
  .replace(">", "&gt;")
164
  .replace('"', "&quot;")
165
  )
166
+
167
+
168
+ def _md_to_html(text: str) -> str:
169
+ """Convert markdown-like text to simple HTML for email rendering.
170
+
171
+ Handles: headers, bold, italic, inline code, code blocks, LaTeX,
172
+ lists, and paragraphs. No external dependencies.
173
+ """
174
+ import re
175
+
176
+ lines = text.split("\n")
177
+ html_lines = []
178
+ in_code_block = False
179
+ in_list = False
180
+
181
+ for line in lines:
182
+ # Code blocks
183
+ if line.strip().startswith("```"):
184
+ if in_code_block:
185
+ html_lines.append("</pre>")
186
+ in_code_block = False
187
+ else:
188
+ html_lines.append('<pre style="background:#1e1e1e;color:#d4d4d4;padding:12px;border-radius:6px;font-size:13px;overflow-x:auto;">')
189
+ in_code_block = True
190
+ continue
191
+ if in_code_block:
192
+ html_lines.append(_escape(line))
193
+ continue
194
+
195
+ # Close list if needed
196
+ if in_list and not line.strip().startswith(("- ", "* ", "1.", "2.", "3.", "4.", "5.")):
197
+ html_lines.append("</ul>")
198
+ in_list = False
199
+
200
+ stripped = line.strip()
201
+
202
+ # Empty line → paragraph break
203
+ if not stripped:
204
+ html_lines.append("<br>")
205
+ continue
206
+
207
+ # Headers
208
+ if stripped.startswith("### "):
209
+ html_lines.append(f"<h4>{_inline_format(_escape(stripped[4:]))}</h4>")
210
+ continue
211
+ if stripped.startswith("## "):
212
+ html_lines.append(f"<h3>{_inline_format(_escape(stripped[3:]))}</h3>")
213
+ continue
214
+ if stripped.startswith("# "):
215
+ html_lines.append(f"<h2>{_inline_format(_escape(stripped[2:]))}</h2>")
216
+ continue
217
+
218
+ # List items
219
+ if stripped.startswith(("- ", "* ")):
220
+ if not in_list:
221
+ html_lines.append("<ul>")
222
+ in_list = True
223
+ html_lines.append(f"<li>{_inline_format(_escape(stripped[2:]))}</li>")
224
+ continue
225
+
226
+ # Regular paragraph
227
+ html_lines.append(f"<p>{_inline_format(_escape(stripped))}</p>")
228
+
229
+ if in_code_block:
230
+ html_lines.append("</pre>")
231
+ if in_list:
232
+ html_lines.append("</ul>")
233
+
234
+ return "\n".join(html_lines)
235
+
236
+
237
+ def _inline_format(text: str) -> str:
238
+ """Apply inline markdown formatting: bold, italic, code, LaTeX."""
239
+ import re
240
+ # Display LaTeX: $$...$$ → styled span
241
+ text = re.sub(
242
+ r'\$\$(.+?)\$\$',
243
+ r'<div style="text-align:center;margin:8px 0;font-family:serif;font-style:italic;">\1</div>',
244
+ text, flags=re.DOTALL,
245
+ )
246
+ # Inline LaTeX: $...$ → styled span (but not $$)
247
+ text = re.sub(
248
+ r'(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)',
249
+ r'<span style="font-family:serif;font-style:italic;">\1</span>',
250
+ text,
251
+ )
252
+ # Inline code: `...`
253
+ text = re.sub(
254
+ r'`([^`]+)`',
255
+ r'<code style="background:#f0f0f0;padding:1px 4px;border-radius:3px;font-size:13px;">\1</code>',
256
+ text,
257
+ )
258
+ # Bold: **...**
259
+ text = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', text)
260
+ # Italic: *...*
261
+ text = re.sub(r'(?<!\*)\*(?!\*)(.+?)(?<!\*)\*(?!\*)', r'<em>\1</em>', text)
262
+ return text
tools.py CHANGED
@@ -12,8 +12,6 @@ from config import (
12
  AXLE_BASE_URL,
13
  LEAN_ENVIRONMENT,
14
  ARISTOTLE_API_KEY,
15
- TOKEN_FACTORY_API_KEY,
16
- TOKEN_FACTORY_BASE_URL,
17
  )
18
 
19
 
@@ -138,60 +136,6 @@ async def check_lean_code(code: str) -> str:
138
  return json.dumps({"error": f"Axle error: {e}"})
139
 
140
 
141
- # ---------------------------------------------------------------------------
142
- # Qwen 3.5 (Lean proof generation)
143
- # ---------------------------------------------------------------------------
144
-
145
- QWEN_MODEL = "Qwen/Qwen3.5-397B-A17B"
146
- QWEN_LEAN_SYSTEM = """\
147
- You are a Lean 4 proof assistant. You write correct Lean 4 code with Mathlib.
148
- Given a mathematical statement, produce a COMPLETE Lean 4 file that proves it.
149
- Rules:
150
- - Start with `import Mathlib`
151
- - Use Lean 4 syntax (NOT Lean 3)
152
- - The code must compile with Lean 4.28.0 and current Mathlib
153
- - Use tactics like simp, ring, omega, norm_num, linarith, exact?, apply? where appropriate
154
- - Output ONLY the Lean code, no explanations
155
- """
156
-
157
-
158
- async def generate_lean_proof(statement: str, context: str = "") -> str:
159
- """Use Qwen 3.5 to generate Lean 4 proof code."""
160
- from openai import AsyncOpenAI
161
-
162
- client = AsyncOpenAI(
163
- base_url=TOKEN_FACTORY_BASE_URL,
164
- api_key=TOKEN_FACTORY_API_KEY,
165
- )
166
- try:
167
- prompt = f"Write a Lean 4 proof for:\n{statement}"
168
- if context:
169
- prompt += f"\n\nContext (relevant Mathlib declarations):\n{context}"
170
-
171
- response = await client.chat.completions.create(
172
- model=QWEN_MODEL,
173
- messages=[
174
- {"role": "system", "content": QWEN_LEAN_SYSTEM},
175
- {"role": "user", "content": prompt},
176
- ],
177
- temperature=0.6,
178
- max_tokens=8192,
179
- )
180
- content = response.choices[0].message.content or ""
181
- # Extract lean code from markdown code block if present
182
- if "```lean" in content:
183
- start = content.find("```lean")
184
- end = content.find("```", start + 6)
185
- if end > start:
186
- code = content[start:end]
187
- # Remove the ```lean prefix
188
- code = code.split("\n", 1)[1] if "\n" in code else code[7:]
189
- return code.strip()
190
- # If no code block, return as-is (might be raw lean code)
191
- return content.strip()
192
- except Exception as e:
193
- return f"Qwen error: {e}"
194
-
195
 
196
  # ---------------------------------------------------------------------------
197
  # Aristotle (Lean formalization & proving)
@@ -340,32 +284,6 @@ TOOL_DEFINITIONS = [
340
  },
341
  },
342
  },
343
- {
344
- "type": "function",
345
- "function": {
346
- "name": "generate_lean_proof",
347
- "description": (
348
- "Use Qwen 3.5 (a 397B-parameter LLM specialized in code) to generate "
349
- "Lean 4 proof code. Give it the mathematical statement and optionally "
350
- "relevant Mathlib context. Returns Lean code that you should then verify "
351
- "with check_lean_code. Use this as an alternative to writing proofs yourself."
352
- ),
353
- "parameters": {
354
- "type": "object",
355
- "properties": {
356
- "statement": {
357
- "type": "string",
358
- "description": "The mathematical statement to prove, in natural language or Lean-like notation.",
359
- },
360
- "context": {
361
- "type": "string",
362
- "description": "Optional: relevant Mathlib declarations, search results, or hints.",
363
- },
364
- },
365
- "required": ["statement"],
366
- },
367
- },
368
- },
369
  {
370
  "type": "function",
371
  "function": {
 
12
  AXLE_BASE_URL,
13
  LEAN_ENVIRONMENT,
14
  ARISTOTLE_API_KEY,
 
 
15
  )
16
 
17
 
 
136
  return json.dumps({"error": f"Axle error: {e}"})
137
 
138
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
139
 
140
  # ---------------------------------------------------------------------------
141
  # Aristotle (Lean formalization & proving)
 
284
  },
285
  },
286
  },
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
287
  {
288
  "type": "function",
289
  "function": {