Spaces:
Running
Running
Test Log
All tests run on https://vilin97-verideepresearch.hf.space (local testing via Python 3.10).
Simple / LLM-tricky questions
| # | Question | Time | Verified | Notes |
|---|---|---|---|---|
| 1 | "Show that for all natural numbers n, n + 0 = n." | 16s | Yes | Proved by rfl. |
| 2 | "Prove that the sum of two even numbers is even." | 19s | Yes | Used Even.add + explicit proof. |
| 3 | "Which is larger, 9.9 or 9.11? Prove it." | 4s | Yes | 9.9 > 9.11 via norm_num. Classic LLM failure — verified correct. |
| 4 | "Prove that 1/3 + 1/3 + 1/3 = 1." | 16s | Yes | Over ℚ with norm_num. |
False statements (negation proved)
| # | Question | Time | Verified | Notes |
|---|---|---|---|---|
| 5 | "Prove that 1 = 2." | 2s | N/A | Correctly refused. |
| 6 | "Prove every continuous function on [0,1] is differentiable." | 32s | Yes | Proved negation: |
| 7 | "Prove every monotone sequence of reals converges." | 61s | Yes | Proved negation: a_n = n is monotone but divergent. |
| 8 | "Prove every group of order 6 is abelian." | 73s | Yes | Proved negation: DihedralGroup 3 is non-abelian. |
Graduate / Prelim level
| # | Question | Time | Verified | Notes |
|---|---|---|---|---|
| 9 | "Composition of injective functions is injective." | ~20s | Yes | |
| 10 | "Cantor's theorem: no surjection from S to P(S)." | 16s | Yes | Mathlib one-liner + explicit diagonal argument. |
| 11 | "Every finite integral domain is a field." | 16s | Yes | |
| 12 | "n³ - n divisible by 6." | 39s | Yes | 50+ line proof, case analysis. |
| 13 | "x² + y² + z² ≥ xy + yz + xz." | 19s | Yes | nlinarith with sq_nonneg hints. |
| 14 | "Continuous bijection compact → Hausdorff is homeomorphism." | 52s | Yes | |
| 15 | "Weakest assumptions for abelian group." | 54s | Yes | 3 answers, all verified. |
| 16 | "Cauchy functional equation." | 38s | Yes | |
| 17 | "Weakest condition for limit of continuous functions." | 26s | Yes | Uniform convergence. |
| 18 | "Rolle's theorem without differentiability." | 100s | Yes | Caught Mathlib's deriv convention. |
| 19 | "Weakest condition for countable subcovers." | 192s | Yes | Second-countability → Lindelöf. |
| 20 | "Group of order pq is cyclic." | 203s | Yes | Previously >15min timeout. Auto-finalize fix. |
| 21 | "Schur's inequality t=1." | 67s | Yes | Previously >15min timeout. Qwen 3.5 proved it. |
UW Analysis Prelim (Sept 2020)
| # | Question | Time | Verified | Notes |
|---|---|---|---|---|
| 22 | #9a: "Weakly convergent ⟹ norm bounded." | 53s | Yes | Banach-Steinhaus. |
| 23 | "Well-ordering principle." | 18s | Yes | Nat.find + Loogle. |
Putnam 2025
| # | Problem | Time | Cost | Verified | Notes |
|---|---|---|---|---|---|
| 24 | A1: Coprimality of (2m_k+1, 2n_k+1). | 74s | $0.04 | Yes | Qwen + 2 Loogle + 2 Axle. |
| 25 | A2: Bounds a·x(π-x) ≤ sinx ≤ b·x(π-x). | 370s | $0.03 | Yes | a=0, b=4/π². |
| 26 | A5: Maximal f(s) permutation counting. | 189s | $0.05 | Yes | |
| 27 | A6: b_{2k+1} - 2b_{2k} divisibility. | 45s | $0.03 | Yes | Fastest Putnam solve. |
| 28 | B2: Centroid comparison x₁ < x₂. | 335s | $0.21 | Yes | Submitted to Aristotle, but agent proved it first. |
| 29 | B3: Set S with divisor property of 2025ⁿ-15ⁿ. | 624s | $0.21 | Yes | Hardest — 10+ min, multiple Qwen attempts. |
| 30 | B4: Matrix sum bound S ≤ (n+2)N/3. | 183s | $0.05 | Yes | 3 Qwen attempts, 5 Axle checks. |
| 31 | B5: Modular inverse ordering > p/4-1. | 84s | $0.02 | Yes | Cheapest Putnam solve. |
| 32 | B6: Largest r for g(n+1)-g(n) ≥ (g(g(n)))^r. | 511s | $0.25 | Yes | Aristotle started (1%), agent proved independently. |
| 33 | A3: Digit string game (Alice vs Bob). | 163s | $0.12 | Yes | Combinatorial game theory formalized. |
| 34 | A4: Minimal k for commuting matrices. | 64s | $0.02 | Yes | Fastest Putnam B-side solve. |
| 35 | B1: Plane coloring → monochromatic. | 225s | $0.10 | Yes | Geometric argument formalized. |
Non-math rejection
| # | Question | Time | Result |
|---|---|---|---|
| 33 | "Best Italian restaurant in New York?" | 2s | Rejected |
| 34 | "What is the weather today?" | 2s | Rejected |
Summary
- 12/12 Putnam 2025 problems verified (A1-A6 and B1-B6). Total cost: $1.13.
- 33/33 mathematical questions verified in Lean 4 across all categories
- 3/3 false statements caught with verified negation proofs
- 2/2 non-math questions rejected
- Median response time: ~70 seconds for verified proofs
- All Lean code verified by Axle (Lean 4.28.0 + Mathlib)
- Tools: TheoremSearch, LeanExplore, Loogle, Axle, Qwen 3.5, Aristotle, Kimi K2.5
- Email notification with proof.lean and research_log.md attachments