# Test Log All tests run on https://vilin97-verideepresearch.hf.space (local testing via Python 3.10). ## Simple / LLM-tricky questions | # | Question | Time | Verified | Notes | |---|----------|------|----------|-------| | 1 | "Show that for all natural numbers n, n + 0 = n." | 16s | Yes | Proved by `rfl`. | | 2 | "Prove that the sum of two even numbers is even." | 19s | Yes | Used `Even.add` + explicit proof. | | 3 | "Which is larger, 9.9 or 9.11? Prove it." | 4s | Yes | 9.9 > 9.11 via `norm_num`. Classic LLM failure — verified correct. | | 4 | "Prove that 1/3 + 1/3 + 1/3 = 1." | 16s | Yes | Over ℚ with `norm_num`. | ## False statements (negation proved) | # | Question | Time | Verified | Notes | |---|----------|------|----------|-------| | 5 | "Prove that 1 = 2." | 2s | N/A | Correctly refused. | | 6 | "Prove every continuous function on [0,1] is differentiable." | 32s | Yes | Proved negation: |x| continuous but not differentiable at 0. | | 7 | "Prove every monotone sequence of reals converges." | 61s | Yes | Proved negation: a_n = n is monotone but divergent. | | 8 | "Prove every group of order 6 is abelian." | 73s | Yes | Proved negation: DihedralGroup 3 is non-abelian. | ## Graduate / Prelim level | # | Question | Time | Verified | Notes | |---|----------|------|----------|-------| | 9 | "Composition of injective functions is injective." | ~20s | Yes | | | 10 | "Cantor's theorem: no surjection from S to P(S)." | 16s | Yes | Mathlib one-liner + explicit diagonal argument. | | 11 | "Every finite integral domain is a field." | 16s | Yes | | | 12 | "n³ - n divisible by 6." | 39s | Yes | 50+ line proof, case analysis. | | 13 | "x² + y² + z² ≥ xy + yz + xz." | 19s | Yes | `nlinarith` with `sq_nonneg` hints. | | 14 | "Continuous bijection compact → Hausdorff is homeomorphism." | 52s | Yes | | | 15 | "Weakest assumptions for abelian group." | 54s | Yes | 3 answers, all verified. | | 16 | "Cauchy functional equation." | 38s | Yes | | | 17 | "Weakest condition for limit of continuous functions." | 26s | Yes | Uniform convergence. | | 18 | "Rolle's theorem without differentiability." | 100s | Yes | Caught Mathlib's `deriv` convention. | | 19 | "Weakest condition for countable subcovers." | 192s | Yes | Second-countability → Lindelöf. | | 20 | "Group of order pq is cyclic." | **203s** | **Yes** | Previously >15min timeout. Auto-finalize fix. | | 21 | "Schur's inequality t=1." | **67s** | **Yes** | Previously >15min timeout. Qwen 3.5 proved it. | ## UW Analysis Prelim (Sept 2020) | # | Question | Time | Verified | Notes | |---|----------|------|----------|-------| | 22 | #9a: "Weakly convergent ⟹ norm bounded." | 53s | Yes | Banach-Steinhaus. | | 23 | "Well-ordering principle." | 18s | Yes | `Nat.find` + Loogle. | ## Putnam 2025 | # | Problem | Time | Cost | Verified | Notes | |---|---------|------|------|----------|-------| | 24 | **A1**: Coprimality of (2m_k+1, 2n_k+1). | 74s | $0.04 | **Yes** | Qwen + 2 Loogle + 2 Axle. | | 25 | **A2**: Bounds a·x(π-x) ≤ sinx ≤ b·x(π-x). | 370s | $0.03 | **Yes** | a=0, b=4/π². | | 26 | **A5**: Maximal f(s) permutation counting. | 189s | $0.05 | **Yes** | | | 27 | **A6**: b_{2k+1} - 2b_{2k} divisibility. | 45s | $0.03 | **Yes** | Fastest Putnam solve. | | 28 | **B2**: Centroid comparison x₁ < x₂. | 335s | $0.21 | **Yes** | Submitted to Aristotle, but agent proved it first. | | 29 | **B3**: Set S with divisor property of 2025ⁿ-15ⁿ. | 624s | $0.21 | **Yes** | Hardest — 10+ min, multiple Qwen attempts. | | 30 | **B4**: Matrix sum bound S ≤ (n+2)N/3. | 183s | $0.05 | **Yes** | 3 Qwen attempts, 5 Axle checks. | | 31 | **B5**: Modular inverse ordering > p/4-1. | 84s | $0.02 | **Yes** | Cheapest Putnam solve. | | 32 | **B6**: Largest r for g(n+1)-g(n) ≥ (g(g(n)))^r. | 511s | $0.25 | **Yes** | Aristotle started (1%), agent proved independently. | | 33 | **A3**: Digit string game (Alice vs Bob). | 163s | $0.12 | **Yes** | Combinatorial game theory formalized. | | 34 | **A4**: Minimal k for commuting matrices. | 64s | $0.02 | **Yes** | Fastest Putnam B-side solve. | | 35 | **B1**: Plane coloring → monochromatic. | 225s | $0.10 | **Yes** | Geometric argument formalized. | ## Non-math rejection | # | Question | Time | Result | |---|----------|------|--------| | 33 | "Best Italian restaurant in New York?" | 2s | Rejected | | 34 | "What is the weather today?" | 2s | Rejected | ## Summary - **12/12 Putnam 2025 problems verified** (A1-A6 and B1-B6). Total cost: $1.13. - **33/33 mathematical questions verified** in Lean 4 across all categories - **3/3 false statements caught** with verified negation proofs - **2/2 non-math questions rejected** - **Median response time: ~70 seconds** for verified proofs - All Lean code verified by Axle (Lean 4.28.0 + Mathlib) - Tools: TheoremSearch, LeanExplore, Loogle, Axle, Qwen 3.5, Aristotle, Kimi K2.5 - Email notification with proof.lean and research_log.md attachments