Spaces:
Sleeping
Sleeping
9/9 Putnam 2025 problems verified! Full test log update
Browse filesPutnam 2025 results (all verified):
A1 (74s/$0.04), A2 (370s/$0.03), A5 (189s/$0.05), A6 (45s/$0.03),
B2 (335s/$0.21), B3 (624s/$0.21), B4 (183s/$0.05), B5 (84s/$0.02),
B6 (511s/$0.25). Total: $0.89 for 9 problems.
30/30 math questions verified across all categories.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
LOG.md
CHANGED
|
@@ -1,80 +1,78 @@
|
|
| 1 |
# Test Log
|
| 2 |
|
| 3 |
-
All tests run on https://vilin97-verideepresearch.hf.space
|
| 4 |
|
| 5 |
## Simple / LLM-tricky questions
|
| 6 |
|
| 7 |
| # | Question | Time | Verified | Notes |
|
| 8 |
|---|----------|------|----------|-------|
|
| 9 |
-
| 1 | "Show that for all natural numbers n, n + 0 = n." | 16s | Yes | Proved by `rfl`
|
| 10 |
-
| 2 | "Prove that the sum of two even numbers is even." | 19s | Yes | Used `Even.add`
|
| 11 |
-
| 3 | "Which is larger, 9.9 or 9.11? Prove it." | 4s | Yes |
|
| 12 |
-
| 4 | "Prove that 1/3 + 1/3 + 1/3 = 1." | 16s | Yes |
|
| 13 |
|
| 14 |
## False statements (negation proved)
|
| 15 |
|
| 16 |
| # | Question | Time | Verified | Notes |
|
| 17 |
|---|----------|------|----------|-------|
|
| 18 |
-
| 5 | "Prove that 1 = 2." | 2s | N/A | Correctly refused
|
| 19 |
-
| 6 | "Prove every continuous function on [0,1] is differentiable." | 32s | Yes |
|
| 20 |
-
| 7 | "Prove every monotone sequence of reals converges." | 61s | Yes |
|
| 21 |
-
| 8 | "Prove every group of order 6 is abelian." | 73s | Yes |
|
| 22 |
|
| 23 |
## Graduate / Prelim level
|
| 24 |
|
| 25 |
| # | Question | Time | Verified | Notes |
|
| 26 |
|---|----------|------|----------|-------|
|
| 27 |
-
| 9 | "
|
| 28 |
-
| 10 | "
|
| 29 |
-
| 11 | "
|
| 30 |
-
| 12 | "
|
| 31 |
-
| 13 | "
|
| 32 |
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
|
| 36 |
-
|
|
| 37 |
-
|
|
| 38 |
-
|
|
| 39 |
-
|
|
| 40 |
-
| 17 | "Weakest condition for pointwise limit of continuous functions to be continuous?" | 26s | Yes | Correctly identified uniform convergence. Used `TendstoUniformly.continuous`. Counterexample: x^n. |
|
| 41 |
-
| 18 | "Rolle's theorem without differentiability hypothesis." | 100s | Yes | Caught Mathlib's convention (`deriv f x = 0` when not differentiable). 8 Axle attempts. |
|
| 42 |
-
| 19 | "Weakest condition for open covers to have countable subcovers?" | 192s | Yes | Second-countability ⟹ HereditarilyLindelöf ⟹ Lindelöf. 8 Axle attempts, heavy research. |
|
| 43 |
|
| 44 |
## UW Analysis Prelim (Sept 2020)
|
| 45 |
|
| 46 |
| # | Question | Time | Verified | Notes |
|
| 47 |
|---|----------|------|----------|-------|
|
| 48 |
-
|
|
| 49 |
-
|
|
| 50 |
-
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
|
|
| 55 |
-
|
|
| 56 |
-
|
|
| 57 |
-
|
|
| 58 |
-
|
|
| 59 |
-
|
|
| 60 |
-
|
|
|
|
|
|
|
|
|
|
|
| 61 |
|
| 62 |
## Non-math rejection
|
| 63 |
|
| 64 |
| # | Question | Time | Result |
|
| 65 |
|---|----------|------|--------|
|
| 66 |
-
|
|
| 67 |
-
|
|
| 68 |
|
| 69 |
## Summary
|
| 70 |
|
| 71 |
-
- **
|
| 72 |
-
- **
|
| 73 |
-
- **3/3 false statements
|
| 74 |
- **2/2 non-math questions rejected**
|
| 75 |
-
- **
|
| 76 |
-
- **Median response time: ~25 seconds** for verified proofs
|
| 77 |
- All Lean code verified by Axle (Lean 4.28.0 + Mathlib)
|
| 78 |
-
- Tools
|
| 79 |
-
-
|
| 80 |
-
- Hard problems correctly decomposed and escalated to Aristotle
|
|
|
|
| 1 |
# Test Log
|
| 2 |
|
| 3 |
+
All tests run on https://vilin97-verideepresearch.hf.space (local testing via Python 3.10).
|
| 4 |
|
| 5 |
## Simple / LLM-tricky questions
|
| 6 |
|
| 7 |
| # | Question | Time | Verified | Notes |
|
| 8 |
|---|----------|------|----------|-------|
|
| 9 |
+
| 1 | "Show that for all natural numbers n, n + 0 = n." | 16s | Yes | Proved by `rfl`. |
|
| 10 |
+
| 2 | "Prove that the sum of two even numbers is even." | 19s | Yes | Used `Even.add` + explicit proof. |
|
| 11 |
+
| 3 | "Which is larger, 9.9 or 9.11? Prove it." | 4s | Yes | 9.9 > 9.11 via `norm_num`. Classic LLM failure — verified correct. |
|
| 12 |
+
| 4 | "Prove that 1/3 + 1/3 + 1/3 = 1." | 16s | Yes | Over ℚ with `norm_num`. |
|
| 13 |
|
| 14 |
## False statements (negation proved)
|
| 15 |
|
| 16 |
| # | Question | Time | Verified | Notes |
|
| 17 |
|---|----------|------|----------|-------|
|
| 18 |
+
| 5 | "Prove that 1 = 2." | 2s | N/A | Correctly refused. |
|
| 19 |
+
| 6 | "Prove every continuous function on [0,1] is differentiable." | 32s | Yes | Proved negation: |x| continuous but not differentiable at 0. |
|
| 20 |
+
| 7 | "Prove every monotone sequence of reals converges." | 61s | Yes | Proved negation: a_n = n is monotone but divergent. |
|
| 21 |
+
| 8 | "Prove every group of order 6 is abelian." | 73s | Yes | Proved negation: DihedralGroup 3 is non-abelian. |
|
| 22 |
|
| 23 |
## Graduate / Prelim level
|
| 24 |
|
| 25 |
| # | Question | Time | Verified | Notes |
|
| 26 |
|---|----------|------|----------|-------|
|
| 27 |
+
| 9 | "Composition of injective functions is injective." | ~20s | Yes | |
|
| 28 |
+
| 10 | "Cantor's theorem: no surjection from S to P(S)." | 16s | Yes | Mathlib one-liner + explicit diagonal argument. |
|
| 29 |
+
| 11 | "Every finite integral domain is a field." | 16s | Yes | |
|
| 30 |
+
| 12 | "n³ - n divisible by 6." | 39s | Yes | 50+ line proof, case analysis. |
|
| 31 |
+
| 13 | "x² + y² + z² ≥ xy + yz + xz." | 19s | Yes | `nlinarith` with `sq_nonneg` hints. |
|
| 32 |
+
| 14 | "Continuous bijection compact → Hausdorff is homeomorphism." | 52s | Yes | |
|
| 33 |
+
| 15 | "Weakest assumptions for abelian group." | 54s | Yes | 3 answers, all verified. |
|
| 34 |
+
| 16 | "Cauchy functional equation." | 38s | Yes | |
|
| 35 |
+
| 17 | "Weakest condition for limit of continuous functions." | 26s | Yes | Uniform convergence. |
|
| 36 |
+
| 18 | "Rolle's theorem without differentiability." | 100s | Yes | Caught Mathlib's `deriv` convention. |
|
| 37 |
+
| 19 | "Weakest condition for countable subcovers." | 192s | Yes | Second-countability → Lindelöf. |
|
| 38 |
+
| 20 | "Group of order pq is cyclic." | **203s** | **Yes** | Previously >15min timeout. Auto-finalize fix. |
|
| 39 |
+
| 21 | "Schur's inequality t=1." | **67s** | **Yes** | Previously >15min timeout. Qwen 3.5 proved it. |
|
|
|
|
|
|
|
|
|
|
| 40 |
|
| 41 |
## UW Analysis Prelim (Sept 2020)
|
| 42 |
|
| 43 |
| # | Question | Time | Verified | Notes |
|
| 44 |
|---|----------|------|----------|-------|
|
| 45 |
+
| 22 | #9a: "Weakly convergent ⟹ norm bounded." | 53s | Yes | Banach-Steinhaus. |
|
| 46 |
+
| 23 | "Well-ordering principle." | 18s | Yes | `Nat.find` + Loogle. |
|
| 47 |
+
|
| 48 |
+
## Putnam 2025
|
| 49 |
+
|
| 50 |
+
| # | Problem | Time | Cost | Verified | Notes |
|
| 51 |
+
|---|---------|------|------|----------|-------|
|
| 52 |
+
| 24 | **A1**: Coprimality of (2m_k+1, 2n_k+1). | 74s | $0.04 | **Yes** | Qwen + 2 Loogle + 2 Axle. |
|
| 53 |
+
| 25 | **A2**: Bounds a·x(π-x) ≤ sinx ≤ b·x(π-x). | 370s | $0.03 | **Yes** | a=0, b=4/π². |
|
| 54 |
+
| 26 | **A5**: Maximal f(s) permutation counting. | 189s | $0.05 | **Yes** | |
|
| 55 |
+
| 27 | **A6**: b_{2k+1} - 2b_{2k} divisibility. | 45s | $0.03 | **Yes** | Fastest Putnam solve. |
|
| 56 |
+
| 28 | **B2**: Centroid comparison x₁ < x₂. | 335s | $0.21 | **Yes** | Submitted to Aristotle, but agent proved it first. |
|
| 57 |
+
| 29 | **B3**: Set S with divisor property of 2025ⁿ-15ⁿ. | 624s | $0.21 | **Yes** | Hardest — 10+ min, multiple Qwen attempts. |
|
| 58 |
+
| 30 | **B4**: Matrix sum bound S ≤ (n+2)N/3. | 183s | $0.05 | **Yes** | 3 Qwen attempts, 5 Axle checks. |
|
| 59 |
+
| 31 | **B5**: Modular inverse ordering > p/4-1. | 84s | $0.02 | **Yes** | Cheapest Putnam solve. |
|
| 60 |
+
| 32 | **B6**: Largest r for g(n+1)-g(n) ≥ (g(g(n)))^r. | 511s | $0.25 | **Yes** | Aristotle started (1%), agent proved independently. |
|
| 61 |
|
| 62 |
## Non-math rejection
|
| 63 |
|
| 64 |
| # | Question | Time | Result |
|
| 65 |
|---|----------|------|--------|
|
| 66 |
+
| 33 | "Best Italian restaurant in New York?" | 2s | Rejected |
|
| 67 |
+
| 34 | "What is the weather today?" | 2s | Rejected |
|
| 68 |
|
| 69 |
## Summary
|
| 70 |
|
| 71 |
+
- **9/9 Putnam 2025 problems verified** (A1, A2, A5, A6, B2, B3, B4, B5, B6). Total cost: $0.89.
|
| 72 |
+
- **30/30 mathematical questions verified** in Lean 4 across all categories
|
| 73 |
+
- **3/3 false statements caught** with verified negation proofs
|
| 74 |
- **2/2 non-math questions rejected**
|
| 75 |
+
- **Median response time: ~70 seconds** for verified proofs
|
|
|
|
| 76 |
- All Lean code verified by Axle (Lean 4.28.0 + Mathlib)
|
| 77 |
+
- Tools: TheoremSearch, LeanExplore, Loogle, Axle, Qwen 3.5, Aristotle, Kimi K2.5
|
| 78 |
+
- Email notification with proof.lean and research_log.md attachments
|
|
|