p4r5kpftnp-cmd commited on
Commit
16fb4fc
·
1 Parent(s): f4afb6d

Repo cleanup: dead code out, test isolation fix, real README

Browse files

- Remove src/lmm_client.py and src/proof_agent.py — MVP-era relics with
zero production importers since the provider routing moved into
rag_chain. Their tests go with them (including the proof_agent
default-model regression test, obsolete with the module deleted).

- Fix sys.modules pollution in test_manual_bug_fixes.py: it installed
MagicMocks at lean_verifier/retriever/rag_chain at import time and
never restored them, poisoning every test module imported later
during discovery (alphabetical order put 6 files after it). Mocks are
now installed only around the langgraph_agent import and restored in
a finally block; the mock-built langgraph_agent is evicted too.

- Replace the 344-line MVP design doc README (Ollama-era, described the
since-removed reranker) with an accurate project README: architecture
diagram, RAG details with real numbers, model matrix, local-run
quickstart, honest small-sample benchmark table, project layout, CI
notes. HF Spaces frontmatter unchanged.

- Track scripts/setup_lean.py — the one-time Lean+Mathlib warmup
utility referenced by the README quickstart.

- UI: rebalance the controls bar (optional Claude key field no longer
gets equal width with the model picker; clearer label).

103 tests pass, ruff clean.

README.md CHANGED
@@ -8,337 +8,98 @@ app_file: app.py
8
  pinned: false
9
  ---
10
 
11
- ## Brief Non-Technical MVP: AI Lean 4 Proof Assistant
12
 
13
- ### Project goal
14
 
15
- Build an AI assistant that helps users write correct **Lean 4 mathematical proofs**.
16
 
17
- The user provides a theorem. The system searches Lean’s Mathlib library, finds useful related theorems, asks an AI model to write a proof, checks the proof using Lean, and automatically retries if the proof fails.
18
 
19
- The key idea is:
20
 
21
- ```text
22
- AI suggests the proof.
23
- Lean verifies the proof.
24
- Only Lean decides if the answer is correct.
 
 
 
25
  ```
26
 
27
- ---
28
-
29
- ## MVP flow
30
-
31
- ```text
32
- User enters theorem
33
-
34
- Lean tooling reads the theorem and current proof goal
35
-
36
- System retrieves useful Mathlib lemmas
37
-
38
- System reranks the lemmas and keeps the best ones
39
-
40
- LangChain organizes the prompt and retrieval flow
41
-
42
- Local LLM generates a Lean proof
43
-
44
- Lean checks the proof
45
-
46
- If correct → return final proof
47
- If wrong → send Lean error back to the AI and retry
48
- ```
49
-
50
- ---
51
-
52
- ## Simple explanation
53
-
54
- This project is like an **AI coding assistant for mathematical proofs**.
55
-
56
- But unlike a normal chatbot, it does not just guess. It uses Lean itself to check whether the proof is truly valid.
57
-
58
- The assistant works in three stages:
59
-
60
- ```text
61
- 1. Search
62
- Find useful existing Mathlib theorems.
63
-
64
- 2. Generate
65
- Ask the AI model to write a proof using those theorems.
66
-
67
- 3. Verify
68
- Run the proof through Lean and retry if Lean finds an error.
69
- ```
70
-
71
- ---
72
-
73
- ## Main MVP components
74
-
75
- ### 1. Lean tooling
76
-
77
- Use **LeanInteract** to connect Python with Lean 4.
78
-
79
- LeanInteract lets Python interact with Lean through the Lean REPL, which means your program can send Lean code, inspect proof goals, and receive Lean feedback programmatically. ([GitHub][1])
80
-
81
- In simple terms:
82
-
83
- ```text
84
- LeanInteract lets the AI system talk to Lean.
85
- ```
86
-
87
- Instead of only seeing an error like:
88
-
89
- ```text
90
- type mismatch
91
- ```
92
-
93
- the system can see more useful information like:
94
 
95
- ```text
96
- Current goal:
97
- n : Nat
98
- ⊢ n = n
99
- ```
100
 
101
- That makes the retry much smarter.
102
 
103
- ---
104
 
105
- ### 2. Retrieve Mathlib lemmas
 
 
 
 
106
 
107
- The system searches Mathlib for useful theorems.
108
 
109
- For example, if the theorem is about even numbers, the retriever may find lemmas related to:
110
-
111
- ```text
112
- Even
113
- Odd
114
- Nat.even_or_odd
115
- parity
116
- multiplication
117
- powers
118
  ```
119
 
120
- This is the “retrieve” part.
121
-
122
- Use:
123
 
124
- ```text
125
- FAISS for fast semantic search
126
- BM25 / keyword search for exact Lean names
127
  ```
128
 
129
- FAISS is useful for quickly finding similar text from a large collection of embedded documents, while keyword search helps when exact theorem names matter.
130
 
131
- ---
132
-
133
- ### 3. Rerank retrieved lemmas
134
-
135
- Retrieval alone is not enough.
136
-
137
- The system may retrieve 50 possible Mathlib lemmas, but many may be weak or irrelevant. So we add a **reranker**.
138
-
139
- The improved flow is:
140
-
141
- ```text
142
- Retrieve top 50 lemmas
143
-
144
- Rerank them based on the current theorem and proof goal
145
-
146
- Keep the best 8–12 lemmas
147
-
148
- Send only those to the AI model
149
  ```
150
 
151
- A Cross-Encoder reranker is useful here because it scores each query-document pair more carefully than the first-pass retriever. Sentence Transformers describes this as a common retrieve-and-rerank setup where fast retrieval finds candidates and the Cross-Encoder reranks them for better relevance. ([SentenceTransformers][2])
152
 
153
- In simple terms:
 
 
 
 
154
 
155
- ```text
156
- Retrieve finds possible useful lemmas.
157
- Rerank chooses the best ones.
158
- ```
159
 
160
- This improves the quality of the context given to the AI.
161
-
162
- ---
163
-
164
- ### 4. LangChain role
165
-
166
- Use **LangChain** for orchestration, not for proof correctness.
167
-
168
- LangChain can help organize:
169
-
170
- ```text
171
- retrieval
172
- reranking
173
- prompt templates
174
- document formatting
175
- LLM calls
176
- retry history
177
  ```
178
-
179
- LangChain has retrieval utilities such as contextual compression, where retrieved documents can be filtered or shortened before sending them to the model. ([LangChain][3])
180
-
181
- But LangChain does **not** make Lean proofs correct by itself.
182
-
183
- The correctness still comes from:
184
-
185
- ```text
186
- Lean compiler
187
- Lean proof state
188
- Mathlib retrieval quality
189
- LLM proof generation
190
- retry loop
 
 
191
  ```
192
 
193
- So the right positioning is:
194
-
195
- ```text
196
- LangChain manages the workflow.
197
- Lean verifies the truth.
198
- ```
199
-
200
- ---
201
-
202
- ### 5. Local LLM first
203
-
204
- For the MVP, use a local model first.
205
-
206
- Example:
207
-
208
- ```text
209
- Ollama + Qwen Coder
210
- ```
211
-
212
- Why local first?
213
-
214
- ```text
215
- No API cost
216
- Easy to test many times
217
- Good for debugging the pipeline
218
- Private and simple for development
219
- ```
220
-
221
- Later, once the workflow works, replace the local model with Claude.
222
-
223
- The architecture stays the same.
224
-
225
- ---
226
-
227
- ### 6. Lean verification and retry
228
-
229
- After the AI generates a proof, the system sends it to Lean.
230
-
231
- If Lean accepts the proof:
232
-
233
- ```text
234
- Success → return proof
235
- ```
236
-
237
- If Lean rejects the proof:
238
-
239
- ```text
240
- Failure → capture Lean error → add error to prompt → retry
241
- ```
242
-
243
- Example:
244
-
245
- ```text
246
- Attempt 1:
247
- AI generates proof.
248
- Lean says: unknown theorem name.
249
-
250
- Attempt 2:
251
- AI sees the error and tries a different lemma.
252
- Lean says: type mismatch.
253
-
254
- Attempt 3:
255
- AI fixes the proof.
256
- Lean accepts it.
257
- ```
258
-
259
- This is the most important part of the MVP.
260
-
261
- ---
262
-
263
- ## Final MVP architecture
264
-
265
- ```text
266
- User Theorem
267
-
268
-
269
- LeanInteract
270
- Get theorem goal and Lean feedback
271
-
272
-
273
- Hybrid Retrieval
274
- FAISS semantic search + keyword search
275
-
276
-
277
- Reranker
278
- Keep only the most useful Mathlib lemmas
279
-
280
-
281
- LangChain Prompt Flow
282
- Organize theorem, lemmas, history, and Lean errors
283
-
284
-
285
- Local LLM
286
- Generate Lean proof
287
-
288
-
289
- Lean Checker
290
- Verify proof
291
-
292
- ├── Success → Return final proof
293
-
294
- └── Failure → Retry with Lean error
295
- ```
296
-
297
- ---
298
-
299
- ## Libraries to use
300
-
301
- | Purpose | Library |
302
- | ---------------------------- | ------------------------------------------------- |
303
- | Interact with Lean | `LeanInteract` |
304
- | Semantic retrieval | `sentence-transformers` |
305
- | Vector search | `faiss-cpu` |
306
- | Keyword search | `rank-bm25` |
307
- | Reranking | `sentence-transformers` CrossEncoder or FlashRank |
308
- | Workflow orchestration | `LangChain` |
309
- | More advanced workflow graph | `LangGraph` |
310
- | Local model | `Ollama` |
311
- | Future stronger model | Claude API |
312
-
313
- LangGraph is useful later if you want the proof process to behave like a clear state machine: retrieve → generate → verify → retry → success/failure. LangGraph is designed for stateful agent workflows, which fits this kind of multi-step proof loop. ([Emergent Mind][4])
314
-
315
- ---
316
-
317
- ## Non-technical value proposition
318
-
319
- This MVP creates an AI assistant that helps people write mathematically verified Lean proofs.
320
-
321
- It does not only generate text. It checks every answer with Lean, learns from errors, and keeps trying until it produces a valid proof.
322
-
323
- The project demonstrates:
324
-
325
- ```text
326
- AI-assisted formal reasoning
327
- verified code generation
328
- search over mathematical knowledge
329
- automatic error correction
330
- LLM + compiler feedback loops
331
- ```
332
-
333
- ---
334
-
335
- ## One-line MVP summary
336
 
337
- ```text
338
- An AI-powered Lean 4 proof assistant that searches Mathlib, reranks useful lemmas, generates a proof with a local LLM, checks it using Lean tooling, and automatically retries using Lean errors until the proof is verified.
 
339
  ```
340
 
341
- [1]: https://github.com/augustepoiroux/LeanInteract?utm_source=chatgpt.com "LeanInteract: A Python Interface for Lean 4"
342
- [2]: https://sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html?utm_source=chatgpt.com "Retrieve & Re-Rank Pipeline"
343
- [3]: https://www.langchain.com/blog/improving-document-retrieval-with-contextual-compression?utm_source=chatgpt.com "Improving Document Retrieval with Contextual Compression"
344
- [4]: https://www.emergentmind.com/topics/langgraph?utm_source=chatgpt.com "LangGraph: Modular LLM Agent Orchestration"
 
8
  pinned: false
9
  ---
10
 
11
+ # Lean 4 Proof Assistant
12
 
13
+ An AI agent that completes formal mathematical proofs in **Lean 4**. Paste a theorem containing `sorry`, and the agent retrieves relevant Mathlib lemmas, drafts a proof with an LLM, and verifies it with the Lean compiler — retrying with error feedback until the proof checks or retries run out.
14
 
15
+ The core guarantee: **only Lean decides correctness.** Every accepted proof is formally verified by the Lean 4 REPL, so the system structurally cannot return a hallucinated proof.
16
 
17
+ **[Try it on Hugging Face Spaces →](https://huggingface.co/spaces/Ray5th/lean4-helper)**
18
 
19
+ ## How it works
20
 
21
+ ```mermaid
22
+ flowchart LR
23
+ UI[Gradio UI] --> V[Verify<br/>Lean 4 REPL]
24
+ V -->|errors / goals| R[Retrieve<br/>LeanDojo ByT5 + FAISS]
25
+ R -->|top-5 premises| G[Generate<br/>LLM]
26
+ G -->|candidate proof| V
27
+ V -->|no goals left| OK([Verified proof])
28
  ```
29
 
30
+ A LangGraph state machine drives a `verify → retrieve → generate` loop:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
31
 
32
+ 1. **Verify** — the file is checked by the Lean 4 REPL (via [lean-interact](https://github.com/augustepoiroux/LeanInteract)) with Mathlib available. Open goals and errors are extracted.
33
+ 2. **Retrieve** — the current proof state (goals only, canonical `h : T ⊢ goal` form) is embedded with **LeanDojo's pretrained ByT5 premise retriever** and searched against a FAISS index of **180,973 Mathlib premises**. The index is IVFPQ-compressed, so 1.06 GB of raw embeddings ship as a 17 MB file via Git LFS.
34
+ 3. **Generate** — retrieved premises are passed to the LLM as *optional hints* (RAFT-style distractor-aware framing), and the model must cite which lemmas it actually used (`-- used: Nat.add_comm`). The generated proof goes back to step 1.
 
 
35
 
36
+ Lean error messages feed the LLM prompt on retry but are kept out of the retrieval query — the encoder was trained on clean proof states, and error text degrades the embedding.
37
 
38
+ ## Supported models
39
 
40
+ | Provider | Models | Notes |
41
+ |---|---|---|
42
+ | **Groq** (default) | `openai/gpt-oss-120b`, `openai/gpt-oss-20b`, `qwen/qwen3-32b`, `meta-llama/llama-4-scout-17b-16e-instruct` | Needs `GROQ_API_KEY` (Space secret / env var) |
43
+ | **Anthropic** | Claude Opus 4.7 · Sonnet 4.6 · Haiku 4.5 | Bring-your-own key in the UI; never stored |
44
+ | **Claude CLI** | `claude-cli-opus` etc. | Local `claude -p` subprocess — bills a Claude Pro subscription instead of API credits |
45
 
46
+ ## Running locally
47
 
48
+ ```bash
49
+ git clone https://github.com/ray5th/lean4-helper && cd lean4-helper
50
+ pip install -r requirements.txt
51
+ git lfs pull # fetch the FAISS index
52
+ python scripts/setup_lean.py # one-time Lean + Mathlib warmup (slow first run)
53
+ export GROQ_API_KEY=... # or use a Claude key / CLI in the UI
54
+ python app.py # Gradio UI on :7860
 
 
55
  ```
56
 
57
+ CLI instead of UI:
 
 
58
 
59
+ ```bash
60
+ python scripts/run_agent.py problems/simple_add.lean --model openai/gpt-oss-120b
 
61
  ```
62
 
63
+ Benchmark on local problem files or MiniF2F:
64
 
65
+ ```bash
66
+ python scripts/benchmark.py --problems-dir problems --retries 3 --verbose
67
+ python scripts/benchmark.py --subset 20 --model claude-cli-opus # MiniF2F via HF datasets
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
68
  ```
69
 
70
+ ## Benchmarks (small-sample, honest numbers)
71
 
72
+ | Set | Result | Notes |
73
+ |---|---|---|
74
+ | 16 intro lecture problems (calc / linarith / ring) | 16/16 | 15 solved on first generation |
75
+ | 6 harder problems (named-lemma, induction, divisibility) | 6/6 | retrieved-lemma citations fired on 4/6 |
76
+ | MiniF2F sample (5 AMC/AIME problems) | 3/5 | competition-grade; n far too small to quote as a pass rate |
77
 
78
+ ## Project structure
 
 
 
79
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
80
  ```
81
+ app.py Gradio UI (two-pane editor, model picker, logs)
82
+ src/
83
+ langgraph_agent.py verify→retrieve→generate state machine
84
+ lean_verifier.py Lean 4 REPL wrapper (lean-interact)
85
+ retriever.py FAISS retrieval over Mathlib premises
86
+ byt5_embedder.py LeanDojo ByT5 encoder wrapper
87
+ rag_chain.py prompts + LLM provider routing (Groq/Anthropic/CLI)
88
+ mathlib_corpus.py Mathlib source → premise extraction
89
+ scripts/
90
+ build_leandojo_index.py build the IVFPQ FAISS index from LeanDojo embeddings
91
+ benchmark.py pass@k benchmark (local files or MiniF2F)
92
+ run_agent.py CLI entry point
93
+ setup_lean.py one-time Lean + Mathlib warmup
94
+ problems/ sample .lean problems
95
+ tests/ ~110 unit/fuzz tests (mocked LLM + Lean)
96
  ```
97
 
98
+ ## Testing & CI
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
99
 
100
+ ```bash
101
+ python -m unittest discover -s tests # full suite, no API keys or Lean needed
102
+ ruff check .
103
  ```
104
 
105
+ GitHub Actions runs lint + tests on every push and auto-deploys to the Space on merge to `main` (when `HF_TOKEN` is configured).
 
 
 
app.py CHANGED
@@ -621,10 +621,10 @@ with gr.Blocks(
621
  label="Max retries", scale=1,
622
  )
623
  anthropic_key_input = gr.Textbox(
624
- label="Anthropic API key (Claude models only)",
625
- placeholder="sk-ant-…",
626
  type="password",
627
- scale=2,
628
  )
629
 
630
  # ─── Two editor panes ───────────────────────────────────────────
 
621
  label="Max retries", scale=1,
622
  )
623
  anthropic_key_input = gr.Textbox(
624
+ label="Claude API key (optional)",
625
+ placeholder="sk-ant-… — only for Claude models",
626
  type="password",
627
+ scale=1,
628
  )
629
 
630
  # ─── Two editor panes ───────────────────────────────────────────
scripts/setup_lean.py ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ import os
3
+ import sys
4
+
5
+ # Add src to Python path
6
+ sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'src')))
7
+
8
+ from lean_verifier import LeanEnvironment
9
+
10
+
11
+ def main():
12
+ print("Setting up Lean Environment and pre-caching Mathlib...")
13
+ print("This might take a while if Mathlib has not been built yet.")
14
+
15
+ # Initialize the Lean environment with Mathlib
16
+ lean_env = LeanEnvironment(use_mathlib=True)
17
+
18
+ print("\nEnvironment initialized successfully!")
19
+
20
+ # Run a simple test
21
+ print("\nRunning a quick verification test...")
22
+ test_code = """
23
+ import Mathlib
24
+
25
+ example {x : ℝ} (hx : x ^ 2 - 3 * x + 2 = 0) : x = 1 ∨ x = 2 :=
26
+ sorry
27
+ """
28
+ result = lean_env.verify_proof(test_code)
29
+ print(f"Test Result: {result['status']}")
30
+ if result['status'] != 'success':
31
+ print(f"Errors: {result['errors']}")
32
+ print(f"Goals: {result['goals']}")
33
+
34
+ if __name__ == "__main__":
35
+ main()
src/lmm_client.py DELETED
@@ -1,56 +0,0 @@
1
- from typing import List, Optional
2
-
3
- from groq import Groq
4
-
5
-
6
- class LMMClient:
7
- """
8
- Client for interacting with LLMs via Groq API.
9
- """
10
-
11
- def __init__(self, model_name: str = "llama-3.3-70b-versatile"):
12
- self.model_name = model_name
13
- self._client = Groq()
14
-
15
- def chat(self, prompt: str, system_prompt: Optional[str] = None) -> str:
16
- messages = []
17
- if system_prompt:
18
- messages.append({"role": "system", "content": system_prompt})
19
- messages.append({"role": "user", "content": prompt})
20
-
21
- response = self._client.chat.completions.create(
22
- model=self.model_name,
23
- messages=messages,
24
- max_tokens=1024,
25
- )
26
- # Defensive: Groq can return empty `choices` on content filter / quota
27
- # issues — accessing [0] would IndexError and kill the agent.
28
- choices = getattr(response, "choices", None) or []
29
- if not choices:
30
- return ""
31
- message = getattr(choices[0], "message", None)
32
- content = getattr(message, "content", None) if message else None
33
- return content or ""
34
-
35
- def generate_proof_steps(self, lean_code: str, goals: List[str], errors: List[str]) -> str:
36
- system_prompt = (
37
- "You are an expert Lean 4 proof assistant. "
38
- "Your goal is to complete the proof by replacing 'sorry' with valid Lean 4 code. "
39
- "Use Mathlib theorems where appropriate. "
40
- "Respond ONLY with the corrected Lean code block."
41
- )
42
- prompt = f"""
43
- Current Lean Code:
44
- ```lean
45
- {lean_code}
46
- ```
47
-
48
- Current Proof Goals:
49
- {chr(10).join(goals)}
50
-
51
- Lean Errors:
52
- {chr(10).join(errors)}
53
-
54
- Please provide the corrected Lean code. Focus on solving the current goals and fixing the errors.
55
- """
56
- return self.chat(prompt, system_prompt)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
src/proof_agent.py DELETED
@@ -1,11 +0,0 @@
1
- from langgraph_agent import LangGraphAgent
2
-
3
-
4
- class ProofAgent:
5
- """Thin compatibility wrapper around LangGraphAgent."""
6
-
7
- def __init__(self, model_name: str = "llama-3.3-70b-versatile", max_retries: int = 5):
8
- self._agent = LangGraphAgent(model_name=model_name, max_retries=max_retries)
9
-
10
- def solve_file(self, file_path: str) -> bool:
11
- return self._agent.solve_file(file_path)
 
 
 
 
 
 
 
 
 
 
 
 
tests/test_manual_bug_fixes.py CHANGED
@@ -21,18 +21,30 @@ sys.path.insert(0, "src")
21
  sys.modules.pop("langgraph_agent", None)
22
 
23
  # `langgraph_agent` imports lean_verifier + retriever + rag_chain which each
24
- # pull in heavyweight ML libs. Mock those three modules directly so we can
25
- # import the helpers without installing langchain-groq / lean-interact.
26
- for _mod in ("lean_verifier", "retriever", "rag_chain"):
 
 
 
27
  sys.modules[_mod] = mock.MagicMock()
28
-
29
- import langgraph_agent
30
-
31
- importlib.reload(langgraph_agent) # ensure regex/keyword changes are picked up
32
- _count_theorem_blocks = langgraph_agent._count_theorem_blocks
33
- _extract_lean_code = langgraph_agent._extract_lean_code
34
- _read_file = langgraph_agent._read_file
35
- _write_file = langgraph_agent._write_file
 
 
 
 
 
 
 
 
 
36
 
37
 
38
  class TestCountTheoremBlocksKeywordFix(unittest.TestCase):
@@ -132,25 +144,8 @@ class TestUtf8FileIO(unittest.TestCase):
132
  os.unlink(path)
133
 
134
 
135
- class TestProofAgentDefaultModel(unittest.TestCase):
136
- """
137
- `ProofAgent` used to default to `qwen3-vl:4b` (an Ollama model name) after
138
- the Groq migration, which would fail at runtime. Now defaults to a real
139
- Groq model.
140
- """
141
-
142
- def test_default_model_is_not_a_stale_ollama_name(self):
143
- # Inspect the function default directly — avoids needing a working
144
- # LangGraphAgent mock chain.
145
- import inspect
146
-
147
- import proof_agent
148
-
149
- sig = inspect.signature(proof_agent.ProofAgent.__init__)
150
- default = sig.parameters["model_name"].default
151
- self.assertNotIn("qwen", default.lower())
152
- self.assertNotIn(":", default,
153
- "Default looks like an Ollama 'name:tag' format")
154
 
155
 
156
  if __name__ == "__main__":
 
21
  sys.modules.pop("langgraph_agent", None)
22
 
23
  # `langgraph_agent` imports lean_verifier + retriever + rag_chain which each
24
+ # pull in heavyweight ML libs. Mock those three modules just long enough to
25
+ # import the pure helpers, then RESTORE sys.modules leaving MagicMocks
26
+ # behind poisons every test module imported after this one during discovery.
27
+ _DEPS = ("lean_verifier", "retriever", "rag_chain")
28
+ _saved = {m: sys.modules.get(m) for m in _DEPS}
29
+ for _mod in _DEPS:
30
  sys.modules[_mod] = mock.MagicMock()
31
+ try:
32
+ import langgraph_agent
33
+
34
+ importlib.reload(langgraph_agent) # ensure regex/keyword changes are picked up
35
+ _count_theorem_blocks = langgraph_agent._count_theorem_blocks
36
+ _extract_lean_code = langgraph_agent._extract_lean_code
37
+ _read_file = langgraph_agent._read_file
38
+ _write_file = langgraph_agent._write_file
39
+ finally:
40
+ for _mod, _orig in _saved.items():
41
+ if _orig is not None:
42
+ sys.modules[_mod] = _orig
43
+ else:
44
+ sys.modules.pop(_mod, None)
45
+ # This langgraph_agent was built against mocks — evict it so later
46
+ # importers get a real one.
47
+ sys.modules.pop("langgraph_agent", None)
48
 
49
 
50
  class TestCountTheoremBlocksKeywordFix(unittest.TestCase):
 
144
  os.unlink(path)
145
 
146
 
147
+ # (A TestProofAgentDefaultModel regression test lived here until the unused
148
+ # `proof_agent` wrapper module it covered was deleted.)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
149
 
150
 
151
  if __name__ == "__main__":
tests/test_proof_agent.py DELETED
@@ -1,147 +0,0 @@
1
- import os
2
- import sys
3
- import unittest
4
- from unittest.mock import MagicMock, patch
5
-
6
- # Add src to Python path
7
- sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'src')))
8
-
9
- # Mock langgraph_agent.LangGraphAgent before importing proof_agent (so its
10
- # top-level `from langgraph_agent import LangGraphAgent` doesn't pull in the
11
- # heavy real module). Restore sys.modules afterward so this file doesn't
12
- # poison the langgraph_agent import for other test files in the same run.
13
- _orig_langgraph_agent = sys.modules.get('langgraph_agent')
14
- sys.modules['langgraph_agent'] = MagicMock()
15
-
16
- from proof_agent import ProofAgent
17
-
18
- if _orig_langgraph_agent is not None:
19
- sys.modules['langgraph_agent'] = _orig_langgraph_agent
20
- else:
21
- del sys.modules['langgraph_agent']
22
-
23
-
24
- class TestProofAgentConstructor(unittest.TestCase):
25
- def test_default_args_constructs_langgraph_agent_with_defaults(self):
26
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
27
- ProofAgent()
28
- MockLangGraphAgent.assert_called_once_with(
29
- model_name="llama-3.3-70b-versatile",
30
- max_retries=5,
31
- )
32
-
33
- def test_custom_model_name_forwarded(self):
34
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
35
- ProofAgent(model_name="custom-model:7b")
36
- MockLangGraphAgent.assert_called_once_with(
37
- model_name="custom-model:7b",
38
- max_retries=5,
39
- )
40
-
41
- def test_custom_max_retries_forwarded(self):
42
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
43
- ProofAgent(max_retries=10)
44
- MockLangGraphAgent.assert_called_once_with(
45
- model_name="llama-3.3-70b-versatile",
46
- max_retries=10,
47
- )
48
-
49
- def test_all_custom_args_forwarded(self):
50
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
51
- ProofAgent(model_name="other-model:13b", max_retries=3)
52
- MockLangGraphAgent.assert_called_once_with(
53
- model_name="other-model:13b",
54
- max_retries=3,
55
- )
56
-
57
- def test_underlying_agent_stored_on_instance(self):
58
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
59
- mock_instance = MagicMock()
60
- MockLangGraphAgent.return_value = mock_instance
61
- agent = ProofAgent()
62
- self.assertIs(agent._agent, mock_instance)
63
-
64
-
65
- class TestProofAgentSolveFile(unittest.TestCase):
66
- def test_solve_file_delegates_returning_true(self):
67
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
68
- mock_instance = MagicMock()
69
- mock_instance.solve_file.return_value = True
70
- MockLangGraphAgent.return_value = mock_instance
71
-
72
- agent = ProofAgent()
73
- result = agent.solve_file("path/to/file.lean")
74
-
75
- self.assertTrue(result)
76
- mock_instance.solve_file.assert_called_once_with("path/to/file.lean")
77
-
78
- def test_solve_file_delegates_returning_false(self):
79
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
80
- mock_instance = MagicMock()
81
- mock_instance.solve_file.return_value = False
82
- MockLangGraphAgent.return_value = mock_instance
83
-
84
- agent = ProofAgent()
85
- result = agent.solve_file("path/to/file.lean")
86
-
87
- self.assertFalse(result)
88
- mock_instance.solve_file.assert_called_once_with("path/to/file.lean")
89
-
90
- def test_solve_file_path_is_forwarded_exactly(self):
91
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
92
- mock_instance = MagicMock()
93
- mock_instance.solve_file.return_value = True
94
- MockLangGraphAgent.return_value = mock_instance
95
-
96
- agent = ProofAgent()
97
- agent.solve_file("/absolute/path/proof.lean")
98
- mock_instance.solve_file.assert_called_once_with("/absolute/path/proof.lean")
99
-
100
- def test_solve_file_propagates_exception(self):
101
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
102
- mock_instance = MagicMock()
103
- mock_instance.solve_file.side_effect = RuntimeError("boom")
104
- MockLangGraphAgent.return_value = mock_instance
105
-
106
- agent = ProofAgent()
107
- with self.assertRaises(RuntimeError):
108
- agent.solve_file("foo.lean")
109
-
110
-
111
- class TestProofAgentNoExtraState(unittest.TestCase):
112
- def test_no_unexpected_public_methods(self):
113
- with patch('proof_agent.LangGraphAgent'):
114
- agent = ProofAgent()
115
- public_attrs = {a for a in dir(agent) if not a.startswith('_')}
116
- # The wrapper should only expose solve_file
117
- self.assertEqual(public_attrs, {"solve_file"})
118
-
119
- def test_solve_file_does_not_mutate_wrapper_state(self):
120
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
121
- mock_instance = MagicMock()
122
- mock_instance.solve_file.return_value = True
123
- MockLangGraphAgent.return_value = mock_instance
124
-
125
- agent = ProofAgent()
126
- before = set(vars(agent).keys())
127
- agent.solve_file("a.lean")
128
- after = set(vars(agent).keys())
129
- self.assertEqual(before, after)
130
-
131
- def test_multiple_calls_use_same_underlying_agent(self):
132
- with patch('proof_agent.LangGraphAgent') as MockLangGraphAgent:
133
- mock_instance = MagicMock()
134
- mock_instance.solve_file.return_value = True
135
- MockLangGraphAgent.return_value = mock_instance
136
-
137
- agent = ProofAgent()
138
- agent.solve_file("a.lean")
139
- agent.solve_file("b.lean")
140
-
141
- # Only one underlying agent constructed
142
- self.assertEqual(MockLangGraphAgent.call_count, 1)
143
- self.assertEqual(mock_instance.solve_file.call_count, 2)
144
-
145
-
146
- if __name__ == '__main__':
147
- unittest.main()