Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments
Authors: Bingyang Ye*, Shan Chen*, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman
Institutions: Harvard University | Mass General Brigham | Boston Children's Hospital | Brandeis University | Yale University | Johns Hopkins University
Please read this as a late-stage work in progress: we are colleagues sharing it in a lab meeting, to help motivate potential parallel research.
Introduction
You ask an AI to assess a paper's future impact. It gives you a confident answer, but are you sure it's not just pattern-matching on author prestige or topic popularity? And how would you even know if it's right?
Evaluating scientific ideas is hard. Peer review is indispensable but expensive, slow, and sometimes inconsistent. Static benchmarks can help, but they have validity limits: they measure what a model knows right now, not its ability to anticipate what will happen next. What often matters most is whether an idea stands the test of time.
This brings us to a core question:
Can we build a benchmark where the answer key is revealed by the future itself?
We introduce Proof of Time (PoT), a new benchmarking framework that links scientific idea judgments to downstream signals that become observable later, such as:
- Citation counts (impact)
- Peer-review awards (scientific value)
- Shifts in researchers' agendas (research evolution)
- SOTA benchmark trajectories (technological frontier)
By freezing a pre-cutoff snapshot of evidence in an offline sandbox and asking models to forecast post-cutoff outcomes, PoT enables:
- Verifiable evaluation when ground truth arrives.
- Scalable benchmarking without exhaustive expert annotation.
- Analysis of misalignment between model judgments and human processes.
Our code and data will be available at:
- GitHub
- Hugging Face
The Core Idea: Time-Partitioned Evaluation
PoT operationalizes a simple principle:
Freeze evidence at time $t_0$, ask the model to predict signals observed at $t_1 > t_0$, and score when the world reveals $t_1$.
We call this semi-verifiable because PoT uses real-world signals that are externally verifiable (like citation counts or official award lists), even though they are imperfect proxies for the underlying, harder-to-define construct of "idea quality."
The Offline Sandbox
A central design choice is the offline sandbox. To make tool use measurable and prevent information leakage, PoT runs all solvers in a network-isolated environment.
This means agents must rely on:
- Frozen Evidence: A snapshot of papers, leaderboards, and publication histories available before the cutoff.
- Local Tools Only: Python, file operations, and text editors. No internet access. No "just Google it."
This design makes agent improvements interpretable: any gain must come from better use of the same frozen evidence, not from hidden access to fresher information.
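PoT enforces isolation at the container/network level; as an illustrative in-process analogue (a sketch of the idea, not the paper's mechanism), one can stub out socket creation so that any network attempt fails loudly:

```python
import socket

def enforce_offline() -> None:
    """Block outbound networking in-process by stubbing socket creation.

    Any code path that tries to reach the network now raises immediately,
    making accidental leakage visible rather than silent.
    """
    def _blocked(*args, **kwargs):
        raise RuntimeError("Offline sandbox: network access is disabled")
    socket.socket = _blocked
    socket.create_connection = _blocked
```

In practice a network-isolated Docker environment, as PoT uses, is strictly stronger than any in-process guard, since the agent can spawn subprocesses.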
Task Families: What Does PoT Evaluate?
PoT instantiates four task families, each probing a distinct notion of future-facing research judgment.
| Family | What it evaluates | Verifiable Signal |
|---|---|---|
| Impact Prediction (Citations) | Forecasting paper influence | Post-cutoff citation counts |
| Peer-Review Awards | Alignment with human judgment | Official conference award lists |
| Research Evolution (Faculty) | Reasoning about research trajectories | Post-cutoff publications |
| Technological Frontier (SOTA) | Extrapolating benchmark progress | Post-cutoff leaderboard updates |
Impact Prediction (Citations)
For the Citations tasks, we give the model pre-cutoff evidence from papers published in top NLP venues (ACL, NAACL, EMNLP) during 2021-2024, including titles, abstracts, author information, and citation counts as of October 2025.
We ask the model to forecast the citation counts of a candidate set of four post-cutoff papers from ACL or NAACL 2025. PoT supports three output formats:
- MCQ: Select which candidate will have the most citations.
- Ranking: Order all candidates by predicted citations.
- Bucket: Map each paper to a coarse citation range (e.g., 0-10, 10-50, 50-200, 200+).
To reduce trivial shortcuts, each MCQ/Ranking instance uses a matched candidate set of four papers from the same venue-year combination, so comparisons are not driven by natural visibility effects.
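For the Bucket format, mapping raw counts to labels can be sketched as follows. The edges come from the example ranges above; the exact boundary convention (right-open intervals, so a count of exactly 10 lands in "10-50") is an assumption:

```python
import bisect

# Illustrative bucket edges from the example ranges (0-10, 10-50, 50-200, 200+).
# Assumed convention: right-open intervals, so count == 10 maps to "10-50".
EDGES = [10, 50, 200]
LABELS = ["0-10", "10-50", "50-200", "200+"]

def citation_bucket(count: int) -> str:
    """Map a raw post-cutoff citation count to its coarse bucket label."""
    return LABELS[bisect.bisect_right(EDGES, count)]
```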
Peer-Review Awards
The Awards task evaluates alignment with peer-review judgments. We construct instances from the same three NLP venues. The task is to predict the paper's award tier: Findings, Main, Outstanding, or Best.
Importantly, for Awards only, we include a diagnostic pre-cutoff evaluation split. This split asks the same award-tier prediction question for papers within the evidence window, primarily to probe reliance on memorized or training-exposed information rather than evidence-grounded inference.
Research Evolution (Faculty)
The Faculty task evaluates whether solvers can extrapolate from a researcher's publication history to infer their near-future focus. Evidence is restricted to each professor's publications from 2021-2024, and targets are defined using 2025 publications.
Sub-tasks include:
- Professor → Field: Predict the professor's primary research area from a fixed taxonomy.
- Professor → Article Identification: Given an anonymized 2025 paper (authors removed), select which professor wrote it (with a "None" option).
- Papers → Field: Classify sets of recent papers by field.
Technological Frontier (SOTA)
The SOTA task measures whether solvers can reason about frontier model performance and the pace of benchmark progress. We compile normalized snapshots of popular benchmarks and leaderboards as of October 2025 and ask solvers to predict performance in coarse buckets (e.g., 0-20%, 20-40%, etc.).
We include next-year SOTA forecasting as a forward-compatible PoT task. It is not scored in this snapshot, but it becomes automatically verifiable as new leaderboard results are published.
Data Collection Pipeline
PoT uses the same data collection and parsing pipelines at two time points.
Cutoff $t_0$: We collect a snapshot and use it to construct the evidence that solvers can access in the offline sandbox. For all tasks, we define $t_0$ as January 2025 for every model whose knowledge cutoff precedes that date. The only exception is GPT-5.2, whose knowledge cutoff is August 31, 2025; for that model, we set $t_0$ to September 1, 2025.
Horizon $t_1$: We later collect a second snapshot using the same procedures and use it to define labels and evaluate predictions. We set $t_1$ to November 2025.
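The per-model cutoff rule can be written down directly (a sketch with illustrative names; the model-key strings are assumptions):

```python
from datetime import date

DEFAULT_T0 = date(2025, 1, 1)   # cutoff for models with pre-2025 knowledge
T1 = date(2025, 11, 1)          # horizon at which labels are collected

# GPT-5.2's knowledge cutoff (2025-08-31) postdates the default t0,
# so its evidence window is shifted forward to avoid leakage.
CUTOFF_OVERRIDES = {"gpt-5.2": date(2025, 9, 1)}

def evidence_cutoff(model_name: str) -> date:
    """Return the t0 snapshot date used to freeze evidence for a model."""
    return CUTOFF_OVERRIDES.get(model_name.lower(), DEFAULT_T0)
```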
Data Sources
- Paper metadata: OpenReview. We parse titles, abstracts, authors, venues, years, keywords, and subject areas.
- Citations: Google Scholar, collected at $t_1$.
- Awards: Parsed from OpenReview (Findings, Main, Outstanding, Best).
- SOTA benchmarks: Compiled from mainstream benchmarks and public leaderboards.
- Professor publications: OpenReview metadata with canonicalized author names.
Task Suite Statistics
| Task | Description | N Samples | Temporal Status |
|---|---|---|---|
| Citation MCQ | Select the most-cited paper | 200 | Post-cutoff |
| Citation Rank | Rank papers by predicted citations | 200 | Post-cutoff |
| Citation Bucket | Citation bucket prediction | 200 | Post-cutoff |
| Award (Pre-cutoff) | Award tier classification | 197 | Pre-cutoff |
| Award (Post-cutoff) | Award tier classification | 122 | Post-cutoff |
| Prof. Field | Professor field prediction | 73 | Post-cutoff |
| Prof. Article | Professor article attribution | 76 | Post-cutoff |
| Field Focus | Field focus classification | 9 | Post-cutoff |
| SOTA Bucket | SOTA benchmark forecasting | 45 | Post-cutoff |
Experimental Setup
Models
We evaluated frontier models across three major providers:
| Provider | Models |
|---|---|
| Anthropic | Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5 |
| Google | Gemini 3 Pro Preview, Gemini 3 Flash Preview, Gemini 2.5 Pro, Gemini 2.5 Flash |
| OpenAI | GPT-5.2, GPT-5.1, GPT-5 Mini, GPT-5 Nano |
All models were evaluated using their default API sampling parameters (temperature, top_p, etc.) without task-specific tuning.
Solvers
We compare three solver configurations:
- Zero-shot (direct generation): The model answers from the prompt only, without tools or local data access. This tests parametric-only forecasting ability.
- Agentic: A single tool-using agent runs in a Docker sandbox with access to `bash`, `python`, and a text editor, iteratively gathering evidence from the offline snapshot.
- Agentic + structured prompt: Same agent loop as above, but with an additional structured preamble that emphasizes offline-only operation and prescribes a more explicit tool-use protocol.
Test-time Compute via Message Limits
To study inference-time scaling, we vary a message limit: the maximum number of environment interaction turns before the agent must finalize an answer. We evaluate budgets of 15, 30, and 50 messages, corresponding to low, medium, and high test-time compute.
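The budgeted loop can be sketched as follows (hypothetical interface; the actual harness's agent runtime differs). The agent either emits another tool call or commits to a final answer; exhausting the budget without finalizing counts as incomplete:

```python
def run_agent(solve_step, budget: int) -> dict:
    """Run an agent loop under a hard message limit.

    `solve_step` takes the interaction history and returns either
    ("tool", call) to keep gathering evidence or ("final", answer).
    """
    history = []
    for _ in range(budget):
        kind, payload = solve_step(history)
        if kind == "final":
            return {"status": "complete", "answer": payload,
                    "turns": len(history) + 1}
        history.append(payload)  # record the tool interaction and continue
    return {"status": "incomplete", "answer": None, "turns": budget}
```

Under this framing, raising the limit from 15 to 50 simply gives `solve_step` more chances to retrieve and verify before committing.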
Key Experimental Results
RQ1: Does Test-time Compute Help?
Yes. Increasing the message limit from 15 to 50 yields large improvements for most models.
Full scaling summary:
| Model | Acc@15 | Acc@30 | Acc@50 | Δ(15→50) |
|---|---|---|---|---|
| Claude Haiku 4.5 | 8.4% | 26.1% | 35.1% | +26.7pp |
| Claude Opus 4.5 | 16.7% | 28.2% | 41.2% | +24.5pp |
| Claude Sonnet 4.5 | 2.9% | 25.9% | 31.0% | +28.1pp |
| Gemini 3 Pro | 27.2% | 44.8% | 55.8% | +28.6pp |
| GPT-5.2 | 17.9% | 39.3% | 40.1% | +22.2pp |
| Gemini 2.5 Pro | 26.7% | 55.6% | 53.6% | +26.9pp |
| GPT-5.1 | 26.7% | 44.3% | 45.0% | +18.4pp |
| Gemini 3 Flash | 20.3% | 32.0% | 39.2% | +18.8pp |
| Gemini 2.5 Flash | 17.2% | 37.0% | 37.5% | +20.4pp |
| GPT-5 Nano | 21.7% | 35.1% | 33.9% | +12.1pp |
| GPT-5 Mini | 22.3% | 38.7% | 37.7% | +15.4pp |
Scaling behavior varies by model family: Claude models show the steepest gains, suggesting they convert additional interaction into more effective retrieval and verification. GPT models improve with budget but tend to plateau earlier.
RQ2: Do Agents Outperform Zero-Shot?
It depends on the task.
| Task Family | Agent Gain | Why? |
|---|---|---|
| Faculty | Large (+60pp) | Tool use is critical for digging through publication histories. This is where agentic loops shine. |
| Citations | Moderate (+10pp) | Some benefit from exploring abstracts, but correlation signals are mixed. |
| Awards | Small | Peer-review outcomes may be too noisy for tools to provide a decisive signal. |
| SOTA | Negligible | Task is already near-ceiling for many models in the zero-shot setting. |
Takeaway: Agents are not a universal upgrade. They help most when the task requires aggregating scattered evidence, like finding all of a professor's papers and inferring their research trajectory.
RQ3: Does the Structured Agent Prompt Help?
The structured agent prompt has family-dependent effects:
- Claude models tend to benefit from the added behavioral constraints.
- GPT models are often neutral or slightly worse, suggesting the prompt may be over-prescriptive.
- Gemini models show mixed outcomes.
These results indicate that prompt policies are not a universal "agent upgrade" and that model-specific prompt tuning can matter.
RQ4: Does Post-Cutoff Evaluation Change Conclusions?
Yes, materially. We compared pre-cutoff vs. post-cutoff performance on the Awards task (agentic + structured prompt at message limit 50).
| Model | Pre-cutoff | Post-cutoff | Δ |
|---|---|---|---|
| Claude Haiku 4.5 | 2.0% | 12.3% | +10.3pp |
| Claude Opus 4.5 | 1.5% | 23.1% | +21.6pp |
| Claude Sonnet 4.5 | 0.0% | 2.5% | +2.5pp |
| Gemini 2.5 Flash | 47.2% | 29.5% | -17.7pp |
| Gemini 2.5 Pro | 26.4% | 33.6% | +7.2pp |
| Gemini 3 Flash | 0.0% | 0.0% | 0.0pp |
| Gemini 3 Pro | 21.8% | 18.8% | -3.0pp |
| GPT-5 Mini | 8.6% | 32.0% | +23.3pp |
| GPT-5 Nano | 35.5% | 25.4% | -10.1pp |
| GPT-5.1 | 16.8% | 32.0% | +15.2pp |
| GPT-5.2 | 3.5% | 28.7% | +25.1pp |
This is a strong argument for post-cutoff evaluation: it removes the possibility that models are simply recalling information from their training data.
What Goes Wrong in Agentic Runs?
We analyzed 1,759 agent traces using an LLM-as-judge protocol (Gemini 3 Pro) with stratum-specific rubrics.
Outcome distribution:
- 35.8% Complete-correct
- 35.5% Complete-wrong
- 28.7% Incomplete (hit the message budget)
Why Runs Succeed
Complete-correct runs typically follow a concise "retrieveβverifyβcommit" pattern: the agent identifies the relevant artifact, extracts the instance-specific signal, and commits only after the instance-local evidence is in hand.
However, even among correct answers, the judge frequently notes process-level fragility:
| Flag | Fraction (of correct runs) |
|---|---|
| Redundancy noted | 78.6% |
| Lucky/guessing noted | 43.1% |
| Unsupported claims noted | 27.3% |
| Verification noted | 15.7% |
This suggests that outcome correctness alone can mask brittle reasoning paths, motivating trace-level analysis.
Why Runs Fail
For complete-wrong runs, the dominant failure modes are:
| Category | Share |
|---|---|
| Reasoning error | 37.7% |
| Retrieval/tooling failure | 36.3% |
| Stopping too early | 8.6% |
| Format/constraint violation | 7.5% |
| Task misunderstanding | 5.2% |
| Evidence misread | 4.7% |
Failures often arise when the agent retrieves broadly relevant materials but misattributes them to the target instance, or when it commits based on partial evidence without a final verification step.
Why Runs are Incomplete
Incomplete runs are primarily characterized by looping/thrashing and budget exhaustion:
| Category | Share |
|---|---|
| Looping / thrashing | 36.5% |
| Budget exhaustion | 28.9% |
| Budget exhaustion via looping | 19.9% |
| Other | 9.0% |
| Tooling / environment errors | 4.4% |
| Missing resource / data | 1.2% |
In many traces, the judge identifies a clear information bottleneck (most commonly retrieval/search formulation or parsing), but the agent fails to reach a decisive "last-mile" step before the message limit.
When Do Agents Pay Off? (Cost-Efficiency Analysis)
Tool-using agents can improve PoT performance, but the gains are rarely free. In our runs, agentic execution often increases token consumption by one to two orders of magnitude relative to a zero-shot baseline, and the resulting accuracy improvements are highly model- and task-dependent.
Key patterns:
- At low budgets, several model configurations yield negligible or even negative gains despite substantial overhead.
- Higher message limits can unlock improvements for some models but not others.
- Evidence-intensive families such as Faculty and Citations benefit most from increased interaction/compute budgets.
- Awards and SOTA exhibit smaller or less reliable gains under comparable overhead.
Pragmatic policy: Use zero-shot or low-budget agents for routine tasks, and reserve higher-budget agents for cases requiring extensive evidence gathering, critical verification, or where errors are costly.
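The policy above can be written down as a trivial router (illustrative names and thresholds, not part of PoT itself):

```python
# Task families where agentic evidence-gathering paid off in our runs.
EVIDENCE_INTENSIVE = {"faculty", "citations"}

def pick_solver(task_family: str, high_stakes: bool = False) -> dict:
    """Choose a solver configuration for a PoT-style task.

    Evidence-intensive or high-stakes tasks get a high-budget agent;
    everything else defaults to cheap zero-shot generation.
    """
    if task_family.lower() in EVIDENCE_INTENSIVE or high_stakes:
        return {"solver": "agentic", "message_limit": 50}
    return {"solver": "zero-shot", "message_limit": 0}
```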
Overall Model Performance
Task Difficulty
Award-tier prediction tasks are consistently among the most challenging, reflecting the noisiness and subjectivity of peer-review outcomes as supervision signals. More structured tasks such as SOTA bucket prediction and professor-article attribution are substantially easier.
Compute Cost
The total cost of developing our benchmark was $11,364.33.
| Provider | Model | Total Cost |
|---|---|---|
| OpenAI | GPT-5.2 | $212.10 |
| OpenAI | GPT-5.1 | $120.62 |
| OpenAI | GPT-5 Mini | $45.74 |
| OpenAI | GPT-5 Nano | $14.44 |
| Google | Gemini 2.5 Flash | $91.00 |
| Google | Gemini 2.5 Pro | $374.00 |
| Google | Gemini 3 Flash | $184.00 |
| Google | Gemini 3 Pro | $696.00 |
| Anthropic | Claude Haiku 4.5 | $395.71 |
| Anthropic | Claude Sonnet 4.5 | $899.54 |
| Anthropic | Claude Opus 4.5 | $1,781.51 |
Increasing the message limit from 15 to 50 often yields superlinear growth in tokens due to longer deliberation, tool use, and repeated context updates.
The Structured Agent Prompt
The structured system prompt for our agentic + structured prompt setting is inspired by the prompt design of Google's Antigravity agent. We adapted it to a fully sandboxed, local-only setting:
Offline Antigravity Inspired Agent (Local-Only)
You are Antigravity, a powerful agentic AI assistant for all project tasks in this repo (analysis, writing, data prep, debugging, Inspect AI benchmarks, dashboards, docs). Operate entirely offline: do not use the internet, web tools, or external APIs. Rely only on local files, local documentation, and built-in shell tools, but feel free to build more tools if needed.
Core Behavior:
- Collaborate on whatever task the user defines.
- Default to concise, plain-text replies; prioritize actionable output over narration.
- Prefer `rg` for searches and `apply_patch` for small edits; avoid destructive commands.
- Never revert user changes unless explicitly asked. Do not use networked package installs or web lookups.
Safety and Limits:
- No internet or web browsing. No external tool calls beyond local shell commands.
- Keep edits ASCII unless the file already uses non-ASCII.
- Respect existing project conventions, workflows, and user-provided instructions.
This adaptation ensures that agent behavior is reproducible, auditable, and free from post-cutoff or external information leakage.
Practical Takeaways
Agents help where evidence is deep. Tasks that require aggregating scattered information benefit most from agentic loops.
Semi-verifiable signals are a viable path. Using time-delayed signals like citations allows for objective, scalable evaluation of "fuzzy" concepts like idea quality.
Offline sandboxing is essential. It ensures that performance gains come from better reasoning, not just better retrieval.
Post-cutoff evaluation is crucial. It's the only way to be confident you're not just measuring memorization.
Track cost-efficiency, not just accuracy. Agentic runs can be 10-100x more expensive; reserve them for high-stakes tasks.
Trace-level analysis matters. Even correct answers can come from brittle reasoning paths.
Limitations
Proxy targets. Citations and awards are imperfect proxies for "idea quality." They can be influenced by visibility, community dynamics, and measurement noise.
Temporal and collection choices. Labels depend on the specific cutoff $t_0$ and horizon $t_1$, as well as the data sources and joining heuristics used.
Offline sandbox realism. Removing live web access improves experimental control but departs from how researchers and deployed assistants typically operate.
Agent loop and compute budget. Our single-agent interaction protocol may not generalize to other agent architectures, termination policies, or tool interfaces.
Resources
@misc{ye2026prooftimebenchmarkevaluating,
title={Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments},
author={Bingyang Ye and Shan Chen and Jingxuan Tu and Chen Liu and Zidi Xiong and Samuel Schmidgall and Danielle S. Bitterman},
year={2026},
eprint={2601.07606},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2601.07606},
}