--- title: Q.E.D emoji: πŸ”¬ colorFrom: green colorTo: gray sdk: gradio app_file: front.py pinned: true license: mit short_description: LLM-guided formal verification, kernel-certified proofs tags: - track:wood - sponsor:modal - achievement:offbrand - achievement:best-demo - gradio - lean4 - formal-verification - agents - build-small-hackathon - llama.cpp --- # ⊒ Q.E.D An LLM-guided formal verification agent. You give it a theorem statement in Lean 4; it finds a proof that **Lean's kernel certifies as formally correct**. Unlike a chatbot saying "yes, that's true," Lean's kernel is a proof-checker that either accepts or rejects every logical step against its axioms. No hallucinations. No approximations. The result is machine-checked mathematics. ## Demo πŸ“Ή **[DEMO VIDEO: https://huggingface.co/spaces/build-small-hackathon/QED/blob/main/Demo.mkv]** 🐦 **[SOCIAL POST: https://www.reddit.com/r/LocalLLaMA/comments/1u6xo0t/hf_hackathon_submission/]** ## What it does The agent runs a propose β†’ verify β†’ learn loop, using a 27B LLM (well under the 32B limit) to propose Lean 4 proof tactics, Lean's kernel to verify each one, and the kernel's error messages fed back verbatim into the next prompt: ``` theorem β†’ propose 3 tactic candidates (27B LLM via Modal) β†’ verify each in live Lean 4 REPL (Modal container) β†’ kernel rejects? β†’ error text fed back to LLM β†’ kernel accepts? β†’ advance proof state β†’ repeat until complete or stuck ``` Key behaviours: - **Fallback layer** β€” deterministic tactics (`rfl`, `norm_num`, `simp`, `omega`, `contradiction`, `assumption`) are tried at each step before the LLM is called. - **Stuck-state detector** β€” if the same proof state recurs 3 times, or 3 consecutive steps all fail, the agent concludes "not provable as stated" and returns a clean verdict instead of looping indefinitely. This correctly identifies false theorems. - **No cache on demo runs** β€” every run executes the full live loop so the proposeβ†’verifyβ†’learn steps are always visible. **Model**: one model, Qwen3-27B quantized (Q4_K_M GGUF) served via llama.cpp. 27B < 32B. βœ“ ## Best Use of Modal Two Modal deployments power the app: | App | What it runs | |---|---| | `lean-proof-agent` | FastAPI app with a persistent Lean 4 REPL. `min_containers=1` keeps it warm so there is zero cold-start delay. Runs the full agent loop and orchestrates the LLM calls. | | `llama-server` | llama.cpp HTTP server running **Qwen3-27B (Q4_K_M)** on a Modal GPU. Receives the current proof state as context, returns tactic candidates. | Why Modal specifically: - Lean 4 requires a **persistent REPL process** with the full toolchain installed and pre-warmed β€” not something you can spin up per-request. Modal containers hold that state across calls. - `min_containers=1` means the Lean server is **always alive**, which is critical for a live demo with ~30-second proof runs. - The LLM needs a GPU. Modal's on-demand GPU allocation means no dedicated hardware to maintain. - The two apps are **independently scaled**: REPL is CPU-bound, LLM is GPU-bound. ## Off Brand β€” custom UI The frontend is not stock Gradio. Every proof run produces a **live-rendered SVG proof tree** built from the agent's search trace: goal-state nodes connected by tactic edges, failed branches in red, the accepted path in green, a terminal QED node. It runs in the browser with zero JS dependencies β€” generated server-side and injected as HTML. The overall aesthetic is a dark mathematical terminal: JetBrains Mono, GitHub-dark palette (`#0d1117` background, `#2ea043` green accent), styled to match the proof tree's colour scheme. ## Best Agent β€” agent loop design The proof search is a genuine multi-step agentic loop with external tool use and error-driven self-correction: - **Propose**: LLM generates 3 tactic candidates given the current proof state and the previous kernel error (if any) - **Verify**: Lean's formal kernel checks each candidate β€” this is ground truth, not a heuristic - **Learn**: the exact kernel error message is injected back into the next LLM prompt - **Decide**: best partial-progress result is selected; all-failed steps increment a failure counter - **Conclude**: stuck-state detector fires a deliberate "not provable" verdict rather than hitting the step limit The agent correctly proves true theorems and correctly identifies false ones β€” both happen live on every run. ## Try it Load any example theorem and click **⊒ Prove**. The step walkthrough shows every LLM candidate, which ones the kernel rejected, the error fed back, and the tactic that advanced the proof. Each completed proof includes a pre-filled link to [live.lean-lang.org](https://live.lean-lang.org/) so you can paste it into Lean's web kernel and see "Goals accomplished!" yourself.