Spaces: Vilin97/VeriDeepResearch

Vilin97 and Claude Opus 4.6 (1M context) committed 25 days ago

Commit: c262c9c

Parent: 0a40264

Remove Qwen 3.5, fix sorry/theorem validation, markdown emails

Major changes:
- Remove Qwen 3.5 tool — orchestrator writes Lean directly
- Auto-finalize now requires: okay=true AND sorry-free AND has theorem/lemma
- Sorry code triggers "proof incomplete, continuing..." not false VERIFIED
- Code without theorem/lemma declarations rejected from auto-finalize
- Email body renders markdown properly (headers, bold, code, lists)
- LaTeX rendered as serif italic in emails
- System prompt updated for iterative Aristotle decomposition
- ISSUES.md documents all 7 issues from Putnam email review

Retest A6 (correct transcription): verified, sorry-free, has_theorem, 1505-char explanation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
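
The auto-finalize rule listed under "Major changes" can be read as a single predicate. The following is a minimal sketch that mirrors the checks visible in the agent.py hunk of this commit; the function name `should_auto_finalize` and its signature are hypothetical and do not appear in the repository.

```python
def should_auto_finalize(okay: bool, code: str, tool_errors: list[str]) -> bool:
    """Hypothetical helper: True only when all three conditions hold."""
    # Same substring test as the diff: any "sorry" in the code or in tool messages.
    has_sorry = "sorry" in code or any("sorry" in e for e in tool_errors)
    # Same line-prefix test as the diff: at least one theorem/lemma declaration.
    has_theorem = any(
        line.strip().startswith(("theorem ", "lemma "))
        for line in code.split("\n")
    )
    return okay and not has_sorry and has_theorem
```

Under this reading, code that compiles but only contains `example` or `#check` is not finalized, and neither is code whose checker output mentions `sorry`.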

Files changed (3)

agent.py +23 -30
email_sender.py +98 -1
tools.py +0 -82

agent.py CHANGED

@@ -19,7 +19,6 @@ from tools import (
     search_lean_library,
     search_loogle,
     check_lean_code,
-    generate_lean_proof,
     submit_to_aristotle,
     check_aristotle_status,
     get_aristotle_result,
@@ -52,40 +51,36 @@ Do NOT just refuse — always try to prove the negation. This is the most valuab
 3. Search Mathlib by type pattern with **search_loogle** (e.g. "_ + _ = _ + _", "Nat → Nat → Prop").
 ### Phase 2: Fast attempt (try this first)
-Try to prove the result using one or more of:
-- Write Lean 4 code yourself and verify with **check_lean_code** (Axle — takes seconds).
-- Use **generate_lean_proof** to have Qwen 3.5 write the proof, then verify with check_lean_code.
-- **CRITICAL: If check_lean_code returns okay=true, IMMEDIATELY call final_answer. Do not wait for Aristotle or do any more work.**
-- If errors: try to fix and re-check. Alternate between writing code yourself and using generate_lean_proof.
 - If the statement seems false: try proving the NEGATION instead.
 ### Phase 3: Aristotle + active proving (if fast attempt fails)
 If Axle verification fails after several attempts:
-1. **Submit to Aristotle** — decompose into main result + sub-lemmas, submit ALL.
-2. **DO NOT WAIT for Aristotle.** Instead, keep actively trying to close the proof yourself:
-   - Use **generate_lean_proof** (Qwen 3.5) with different prompts and hints.
-   - Search for more Mathlib declarations with search_lean_library and search_loogle.
-   - Write Lean code yourself with different proof strategies.
-   - Verify each attempt with check_lean_code.
-   - **If ANY attempt verifies (okay=true), IMMEDIATELY call final_answer. Do not wait for Aristotle.**
-3. **Periodically check Aristotle** with check_aristotle_status — but only between your own proof attempts, never as the primary activity.
-4. If Aristotle completes, download with get_aristotle_result and verify with check_lean_code.
-5. The goal is to RACE: you and Qwen try to prove it while Aristotle also works on it. Whoever finishes first wins.
 ### Phase 4: Final answer
 Call **final_answer** with:
 - Clear natural language explanation (with LaTeX math)
 - Complete verified Lean 4 code (single file, starts with `import Mathlib`)
 - Whether verification succeeded
 ## Key principles
 - **NEVER sit idle.** Always be actively trying to prove the result.
-- If check_lean_code says okay=true, IMMEDIATELY call final_answer. This is the highest priority.
-- Use BOTH your own code AND generate_lean_proof (Qwen 3.5) — try different approaches with different hints.
-- Submit to Aristotle early, but keep proving — don't call wait_for_aristotle until you've made at least 10 proof attempts.
-- When you DO call wait_for_aristotle and it returns, ALWAYS try to verify the result with check_lean_code.
-- **NEVER call wait_for_aristotle as your FIRST action after submitting.** Always try at least 5-10 more proof attempts first.
-- If after many attempts nothing verifies: call final_answer with the BEST (fewest errors) Lean code you have, verified=false. Include the error messages in your answer. An unverified answer with good Lean code is better than no answer.
 - For false statements, PROVE THE NEGATION — don't just refuse.
 """
@@ -294,8 +289,14 @@ async def run_agent(question: str):
                             "sorry" in code
                             or any("sorry" in e for e in tool_errors)
                         )
                         if has_sorry:
                             add_status("Lean code compiles but contains `sorry` — proof incomplete, continuing...")
                         else:
                             add_status("Lean code verified (sorry-free)! Auto-finalizing...")
                             log_detail("## Auto-finalize: Axle okay=true, no sorry detected")
@@ -416,14 +417,6 @@ async def _handle_tool_call(fn_name: str, fn_args: dict,
         add_status(f'Searching Loogle: "{query}"...')
         return await search_loogle(query)
-    if fn_name == "generate_lean_proof":
-        statement = fn_args.get("statement", "")
-        short_stmt = statement[:80].replace("\n", " ")
-        add_status(f'Generating proof with Qwen 3.5: "{short_stmt}"...')
-        result = await generate_lean_proof(statement, fn_args.get("context", ""))
-        add_status(f"Qwen 3.5 generated {len(result)} chars of Lean code")
-        return result
     if fn_name == "check_lean_code":
         add_status("Verifying Lean code with Axle...")
         result = await check_lean_code(fn_args.get("code", ""))

     search_lean_library,
     search_loogle,
     check_lean_code,
     submit_to_aristotle,
     check_aristotle_status,
     get_aristotle_result,
 3. Search Mathlib by type pattern with **search_loogle** (e.g. "_ + _ = _ + _", "Nat → Nat → Prop").
 ### Phase 2: Fast attempt (try this first)
+Write Lean 4 code yourself and verify with **check_lean_code** (Axle — takes seconds).
+- The code MUST contain at least one `theorem` or `lemma` declaration — not just `example` or `#check`.
+- If errors: analyze the error, fix, and re-check.
 - If the statement seems false: try proving the NEGATION instead.
+- Note: the system will automatically finalize when Axle returns okay=true AND the code is sorry-free. You don't need to call final_answer yourself after a successful check.
 ### Phase 3: Aristotle + active proving (if fast attempt fails)
 If Axle verification fails after several attempts:
+1. **Submit to Aristotle** — submit the main result as a natural language prompt.
+2. **Keep actively trying** — search more declarations, try different proof strategies, verify with check_lean_code. Don't just wait.
+3. **Periodically check Aristotle** with check_aristotle_status.
+4. **If Aristotle returns with sorry**: this is NOT a failure — it's progress. Take Aristotle's output, identify which sub-lemmas still have sorry, and:
+   - Submit EACH sorry'd sub-lemma to Aristotle as a NEW job
+   - Try to prove the sorry'd sub-lemmas yourself
+   - Search Mathlib for the sorry'd statements
+   - Keep iterating until all sorries are filled
+5. If Aristotle completes sorry-free, verify with check_lean_code.
 ### Phase 4: Final answer
 Call **final_answer** with:
 - Clear natural language explanation (with LaTeX math)
 - Complete verified Lean 4 code (single file, starts with `import Mathlib`)
+- The code MUST declare at least one `theorem` or `lemma`
 - Whether verification succeeded
 ## Key principles
 - **NEVER sit idle.** Always be actively trying to prove the result.
+- Write ALL Lean code yourself — you are the prover. Aristotle is your backup.
+- The code MUST contain `theorem` or `lemma` declarations — never just `example`, `#check`, or `sorry`.
+- When Aristotle returns code with sorry, DECOMPOSE the sorry'd lemmas and resubmit. Don't give up.
 - For false statements, PROVE THE NEGATION — don't just refuse.
 """
                             "sorry" in code
                             or any("sorry" in e for e in tool_errors)
                         )
+                        has_theorem = any(
+                            line.strip().startswith(("theorem ", "lemma "))
+                            for line in code.split("\n")
+                        )
                         if has_sorry:
                             add_status("Lean code compiles but contains `sorry` — proof incomplete, continuing...")
+                        elif not has_theorem:
+                            add_status("Lean code compiles but has no theorem/lemma declarations — continuing...")
                         else:
                             add_status("Lean code verified (sorry-free)! Auto-finalizing...")
                             log_detail("## Auto-finalize: Axle okay=true, no sorry detected")
         add_status(f'Searching Loogle: "{query}"...')
         return await search_loogle(query)
     if fn_name == "check_lean_code":
         add_status("Verifying Lean code with Axle...")
         result = await check_lean_code(fn_args.get("code", ""))
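
The `has_sorry` and `has_theorem` checks in this diff are plain substring and line-prefix tests. As written, they would flag a `sorry` that appears only in a comment or inside an identifier such as `sorry_free`, and they would miss declarations preceded by an attribute or modifier (for example `@[simp] theorem` or `private lemma`). A possible refinement is sketched below. It is not part of the commit, all names are hypothetical, and it deliberately ignores nested block comments and `--` inside string literals.

```python
import re

# Assumption: non-nested /- ... -/ block comments only (Lean allows nesting).
_BLOCK_COMMENT = re.compile(r"/-.*?-/", re.DOTALL)
_LINE_COMMENT = re.compile(r"--.*")
# Whole-word match, so identifiers like `sorry_free` are not flagged.
_SORRY = re.compile(r"\bsorry\b")
# Optional attributes and a few common modifiers before theorem/lemma.
_DECL = re.compile(
    r"^\s*(?:@\[[^\]]*\]\s*)*"
    r"(?:(?:private|protected|noncomputable)\s+)*"
    r"(?:theorem|lemma)\s",
    re.MULTILINE,
)


def strip_lean_comments(code: str) -> str:
    """Remove block comments first, then line comments."""
    code = _BLOCK_COMMENT.sub("", code)
    return _LINE_COMMENT.sub("", code)


def has_sorry(code: str) -> bool:
    return bool(_SORRY.search(strip_lean_comments(code)))


def has_theorem(code: str) -> bool:
    return bool(_DECL.search(strip_lean_comments(code)))
```

Whether the stricter check is worth it depends on how often false positives occur in practice; the substring test in the commit errs on the side of continuing to work rather than finalizing early.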

email_sender.py CHANGED

@@ -89,7 +89,7 @@ def send_result_email(
 <h3>Answer</h3>
 <div style="background: #f8f9fa; padding: 16px; border-radius: 8px; margin: 12px 0;">
-{_escape(answer).replace(chr(10), '<br>')}
 </div>
 <h3>Lean 4 Code ({badge})</h3>
@@ -163,3 +163,100 @@ def _escape(text: str) -> str:
         .replace(">", "&gt;")
         .replace('"', "&quot;")
     )

 <h3>Answer</h3>
 <div style="background: #f8f9fa; padding: 16px; border-radius: 8px; margin: 12px 0;">
+{_md_to_html(answer)}
 </div>
 <h3>Lean 4 Code ({badge})</h3>
         .replace(">", "&gt;")
         .replace('"', "&quot;")
     )
+def _md_to_html(text: str) -> str:
+    """Convert markdown-like text to simple HTML for email rendering.
+    Handles: headers, bold, italic, inline code, code blocks, LaTeX,
+    lists, and paragraphs. No external dependencies.
+    """
+    import re
+    lines = text.split("\n")
+    html_lines = []
+    in_code_block = False
+    in_list = False
+    for line in lines:
+        # Code blocks
+        if line.strip().startswith("```"):
+            if in_code_block:
+                html_lines.append("</pre>")
+                in_code_block = False
+            else:
+                html_lines.append('<pre style="background:#1e1e1e;color:#d4d4d4;padding:12px;border-radius:6px;font-size:13px;overflow-x:auto;">')
+                in_code_block = True
+            continue
+        if in_code_block:
+            html_lines.append(_escape(line))
+            continue
+        # Close list if needed
+        if in_list and not line.strip().startswith(("- ", "* ", "1.", "2.", "3.", "4.", "5.")):
+            html_lines.append("</ul>")
+            in_list = False
+        stripped = line.strip()
+        # Empty line → paragraph break
+        if not stripped:
+            html_lines.append("<br>")
+            continue
+        # Headers
+        if stripped.startswith("### "):
+            html_lines.append(f"<h4>{_inline_format(_escape(stripped[4:]))}</h4>")
+            continue
+        if stripped.startswith("## "):
+            html_lines.append(f"<h3>{_inline_format(_escape(stripped[3:]))}</h3>")
+            continue
+        if stripped.startswith("# "):
+            html_lines.append(f"<h2>{_inline_format(_escape(stripped[2:]))}</h2>")
+            continue
+        # List items
+        if stripped.startswith(("- ", "* ")):
+            if not in_list:
+                html_lines.append("<ul>")
+                in_list = True
+            html_lines.append(f"<li>{_inline_format(_escape(stripped[2:]))}</li>")
+            continue
+        # Regular paragraph
+        html_lines.append(f"<p>{_inline_format(_escape(stripped))}</p>")
+    if in_code_block:
+        html_lines.append("</pre>")
+    if in_list:
+        html_lines.append("</ul>")
+    return "\n".join(html_lines)
+def _inline_format(text: str) -> str:
+    """Apply inline markdown formatting: bold, italic, code, LaTeX."""
+    import re
+    # Display LaTeX: $$...$$ → styled span
+    text = re.sub(
+        r'\$\$(.+?)\$\$',
+        r'<div style="text-align:center;margin:8px 0;font-family:serif;font-style:italic;">\1</div>',
+        text, flags=re.DOTALL,
+    )
+    # Inline LaTeX: $...$ → styled span (but not $$)
+    text = re.sub(
+        r'(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)',
+        r'<span style="font-family:serif;font-style:italic;">\1</span>',
+        text,
+    )
+    # Inline code: `...`
+    text = re.sub(
+        r'`([^`]+)`',
+        r'<code style="background:#f0f0f0;padding:1px 4px;border-radius:3px;font-size:13px;">\1</code>',
+        text,
+    )
+    # Bold: **...**
+    text = re.sub(r'\*\*(.+?)\*\*', r'<strong>\1</strong>', text)
+    # Italic: *...*
+    text = re.sub(r'(?<!\*)\*(?!\*)(.+?)(?<!\*)\*(?!\*)', r'<em>\1</em>', text)
+    return text
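
In `_md_to_html` as shown, the prefixes `"1."` through `"5."` only keep an already open `<ul>` from being closed; a numbered line is never emitted as a list item and falls through to the `<p>` branch. One way ordered lists could be handled is sketched below. This is an illustration only: `render_lists` is a hypothetical name, it covers lists and paragraphs but none of the other markdown features, and it uses the standard library's `html.escape` as a stand-in for the module's own `_escape`.

```python
import re
from html import escape

_ORDERED = re.compile(r"^(\d+)\.\s+(.*)$")  # "1. item", "12. item", ...
_BULLET = re.compile(r"^[-*]\s+(.*)$")      # "- item" or "* item"


def render_lists(lines: list[str]) -> str:
    """Render bullet and numbered lines as <ul>/<ol>, other text as <p>."""
    out = []
    open_tag = None  # "ul", "ol", or None
    for raw in lines:
        stripped = raw.strip()
        bullet = _BULLET.match(stripped)
        ordered = _ORDERED.match(stripped)
        wanted = "ul" if bullet else "ol" if ordered else None
        # Close the current list when the line type changes.
        if open_tag and open_tag != wanted:
            out.append(f"</{open_tag}>")
            open_tag = None
        if wanted and not open_tag:
            out.append(f"<{wanted}>")
            open_tag = wanted
        if bullet:
            out.append(f"<li>{escape(bullet.group(1))}</li>")
        elif ordered:
            out.append(f"<li>{escape(ordered.group(2))}</li>")
        elif stripped:
            out.append(f"<p>{escape(stripped)}</p>")
    if open_tag:
        out.append(f"</{open_tag}>")
    return "\n".join(out)
```

For example, `render_lists(["1. first", "2. second", "done"])` produces an `<ol>` with two items followed by a `<p>` paragraph.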

tools.py CHANGED

@@ -12,8 +12,6 @@ from config import (
     AXLE_BASE_URL,
     LEAN_ENVIRONMENT,
     ARISTOTLE_API_KEY,
-    TOKEN_FACTORY_API_KEY,
-    TOKEN_FACTORY_BASE_URL,
 )
@@ -138,60 +136,6 @@ async def check_lean_code(code: str) -> str:
         return json.dumps({"error": f"Axle error: {e}"})
-# ---------------------------------------------------------------------------
-# Qwen 3.5 (Lean proof generation)
-# ---------------------------------------------------------------------------
-QWEN_MODEL = "Qwen/Qwen3.5-397B-A17B"
-QWEN_LEAN_SYSTEM = """\
-You are a Lean 4 proof assistant. You write correct Lean 4 code with Mathlib.
-Given a mathematical statement, produce a COMPLETE Lean 4 file that proves it.
-Rules:
-- Start with `import Mathlib`
-- Use Lean 4 syntax (NOT Lean 3)
-- The code must compile with Lean 4.28.0 and current Mathlib
-- Use tactics like simp, ring, omega, norm_num, linarith, exact?, apply? where appropriate
-- Output ONLY the Lean code, no explanations
-"""
-async def generate_lean_proof(statement: str, context: str = "") -> str:
-    """Use Qwen 3.5 to generate Lean 4 proof code."""
-    from openai import AsyncOpenAI
-    client = AsyncOpenAI(
-        base_url=TOKEN_FACTORY_BASE_URL,
-        api_key=TOKEN_FACTORY_API_KEY,
-    )
-    try:
-        prompt = f"Write a Lean 4 proof for:\n{statement}"
-        if context:
-            prompt += f"\n\nContext (relevant Mathlib declarations):\n{context}"
-        response = await client.chat.completions.create(
-            model=QWEN_MODEL,
-            messages=[
-                {"role": "system", "content": QWEN_LEAN_SYSTEM},
-                {"role": "user", "content": prompt},
-            ],
-            temperature=0.6,
-            max_tokens=8192,
-        )
-        content = response.choices[0].message.content or ""
-        # Extract lean code from markdown code block if present
-        if "```lean" in content:
-            start = content.find("```lean")
-            end = content.find("```", start + 6)
-            if end > start:
-                code = content[start:end]
-                # Remove the ```lean prefix
-                code = code.split("\n", 1)[1] if "\n" in code else code[7:]
-                return code.strip()
-        # If no code block, return as-is (might be raw lean code)
-        return content.strip()
-    except Exception as e:
-        return f"Qwen error: {e}"
 # ---------------------------------------------------------------------------
 # Aristotle (Lean formalization & proving)
@@ -340,32 +284,6 @@ TOOL_DEFINITIONS = [
             },
         },
     },
-    {
-        "type": "function",
-        "function": {
-            "name": "generate_lean_proof",
-            "description": (
-                "Use Qwen 3.5 (a 397B-parameter LLM specialized in code) to generate "
-                "Lean 4 proof code. Give it the mathematical statement and optionally "
-                "relevant Mathlib context. Returns Lean code that you should then verify "
-                "with check_lean_code. Use this as an alternative to writing proofs yourself."
-            ),
-            "parameters": {
-                "type": "object",
-                "properties": {
-                    "statement": {
-                        "type": "string",
-                        "description": "The mathematical statement to prove, in natural language or Lean-like notation.",
-                    },
-                    "context": {
-                        "type": "string",
-                        "description": "Optional: relevant Mathlib declarations, search results, or hints.",
-                    },
-                },
-                "required": ["statement"],
-            },
-        },
-    },
     {
         "type": "function",
         "function": {

     AXLE_BASE_URL,
     LEAN_ENVIRONMENT,
     ARISTOTLE_API_KEY,
 )
         return json.dumps({"error": f"Axle error: {e}"})
 # ---------------------------------------------------------------------------
 # Aristotle (Lean formalization & proving)
             },
         },
     },
     {
         "type": "function",
         "function": {