Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Community Article Published January 13, 2026

Authors: Bingyang Ye*, Shan Chen*, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman

Institutions: Harvard University | Mass General Brigham | Boston Children's Hospital | Brandeis University | Yale University | Johns Hopkins University

Please read this as a late-stage work in progress: we are sharing it in the spirit of colleagues presenting at a lab meeting, in the hope of helping and motivating parallel research.

Introduction

You ask an AI to assess a paper's future impact. It gives you a confident answer, but are you sure it's not just pattern-matching on author prestige or topic popularity? And how would you even know if it's right?

Evaluating scientific ideas is hard. Peer review is indispensable but expensive, slow, and sometimes inconsistent. Static benchmarks can help, but they have validity limits: they measure what a model knows right now, not its ability to anticipate what will happen next. What often matters most is whether an idea stands the test of time.

This brings us to a core question:

Can we build a benchmark where the answer key is revealed by the future itself?

We introduce Proof of Time (PoT), a new benchmarking framework that links scientific idea judgments to downstream signals that become observable later, such as:

  • Citation counts (impact)
  • Peer-review awards (scientific value)
  • Shifts in researchers' agendas (research evolution)
  • SOTA benchmark trajectories (technological frontier)

By freezing a pre-cutoff snapshot of evidence in an offline sandbox and asking models to forecast post-cutoff outcomes, PoT enables:

  1. Verifiable evaluation when ground truth arrives.
  2. Scalable benchmarking without exhaustive expert annotation.
  3. Analysis of misalignment between model judgments and human processes.

Our code and data will be available at:

🔗 GitHub

🤗 Hugging Face


The Core Idea: Time-Partitioned Evaluation

PoT operationalizes a simple principle:

Freeze evidence at time $t_0$, ask the model to predict signals observed at $t_1 > t_0$, and score when the world reveals $t_1$.

We call this semi-verifiable because PoT uses real-world signals that are externally verifiable (like citation counts or official award lists), even though they are imperfect proxies for the underlying, harder-to-define construct of "idea quality."
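To make the time-partitioned protocol concrete, here is a minimal Python sketch of a semi-verifiable instance. The class and field names are hypothetical illustrations, not PoT's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PoTInstance:
    # Hypothetical schema for illustration only.
    evidence: dict               # frozen snapshot collected at t0 (pre-cutoff)
    question: str                # e.g., "Which candidate will be cited most?"
    label: Optional[str] = None  # observed at t1; unknown until then

def score(instance: PoTInstance, prediction: str) -> Optional[float]:
    """Score only once the post-cutoff label has been observed."""
    if instance.label is None:
        return None              # not yet verifiable; re-score after t1
    return float(prediction == instance.label)
```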

The Offline Sandbox

A central design choice is the offline sandbox. To make tool use measurable and prevent information leakage, PoT runs all solvers in a network-isolated environment.

This means agents must rely on:

  1. Frozen Evidence: A snapshot of papers, leaderboards, and publication histories available before the cutoff.
  2. Local Tools Only: Python, file operations, and text editors. No internet access. No "just Google it."

This design makes agent improvements interpretable: any gain must come from better use of the same frozen evidence, not from hidden access to fresher information.
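As a concrete illustration, network isolation of this kind can be enforced with the Docker CLI's `--network none` flag. The image name, mount path, and entrypoint below are placeholders, not the exact configuration used here:

```python
import subprocess

# Launch a network-isolated solver container; names and paths are placeholders.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",                      # no internet: offline by construction
        "-v", "/data/pot_snapshot:/evidence:ro",  # frozen pre-cutoff evidence, read-only
        "pot-solver:latest",                      # hypothetical agent image
        "python", "run_agent.py", "--task", "citations",
    ],
    check=True,
)
```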

The PoT Workflow

*[Figure: the PoT workflow.]*

Task Families: What Does PoT Evaluate?

PoT instantiates four task families, each probing a distinct notion of future-facing research judgment.

| Family | What it evaluates | Verifiable Signal |
|---|---|---|
| 🔵 Impact Prediction (Citations) | Forecasting paper influence | Post-cutoff citation counts |
| 🟠 Peer-Review Awards | Alignment with human judgment | Official conference award lists |
| 🟣 Research Evolution (Faculty) | Reasoning about research trajectories | Post-cutoff publications |
| 🟢 Technological Frontier (SOTA) | Extrapolating benchmark progress | Post-cutoff leaderboard updates |

🔵 Impact Prediction (Citations)

For the Citations tasks, we give the model pre-cutoff evidence from papers published in top NLP venues (ACL, NAACL, EMNLP) during 2021-2024, including titles, abstracts, author information, and citation counts as of October 2025.

We ask the model to forecast the citation counts of a candidate set of four post-cutoff papers from ACL or NAACL 2025. PoT supports three output formats:

  • MCQ: Select which candidate will have the most citations.
  • Ranking: Order all candidates by predicted citations.
  • Bucket: Map each paper to a coarse citation range (e.g., 0-10, 10-50, 50-200, 200+).

To reduce trivial shortcuts, each MCQ/Ranking instance uses a matched candidate set of four papers from the same venue-year combination, so comparisons are not driven by natural visibility effects.
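For intuition, here is a minimal sketch of how matched candidate sets and citation buckets could be constructed. The field names (`venue`, `year`, `citations_t1`) and the bucket boundary conventions are illustrative, not the released data schema:

```python
import random
from collections import defaultdict

def citation_bucket(citations: int) -> str:
    """Map a raw citation count to the coarse ranges used by the Bucket task."""
    if citations < 10:
        return "0-10"
    if citations < 50:
        return "10-50"
    if citations < 200:
        return "50-200"
    return "200+"

def matched_mcq_instances(papers, k=4, seed=0):
    """Build MCQ candidate sets matched on (venue, year), so comparisons are
    not driven by natural visibility differences across venues or years."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for p in papers:  # assumed fields: id, venue, year, citations_t1
        groups[(p["venue"], p["year"])].append(p)
    instances = []
    for group in groups.values():
        rng.shuffle(group)
        for i in range(0, len(group) - k + 1, k):
            candidates = group[i : i + k]
            answer = max(candidates, key=lambda c: c["citations_t1"])["id"]
            instances.append({"candidates": candidates, "answer": answer})
    return instances
```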

🟠 Peer-Review Awards

The Awards task evaluates alignment with peer-review judgments. We construct instances from the same three NLP venues. The task is to predict each paper's award tier: Findings, Main, Outstanding, or Best.

Importantly, for Awards only, we include a diagnostic pre-cutoff evaluation split. This split asks the same award-tier prediction question for papers within the evidence window, primarily to probe reliance on memorized or training-exposed information rather than evidence-grounded inference.

🟣 Research Evolution (Faculty)

The Faculty task evaluates whether solvers can extrapolate from a researcher's publication history to infer their near-future focus. Evidence is restricted to each professor's publications from 2021-2024, and targets are defined using 2025 publications.

Sub-tasks include:

  • Professor → Field: Predict the professor's primary research area from a fixed taxonomy.
  • Professor → Article Identification: Given an anonymized 2025 paper (authors removed), select which professor wrote it (with a "None" option).
  • Papers → Field: Classify sets of recent papers by field.

🟢 Technological Frontier (SOTA)

The SOTA task measures whether solvers can reason about frontier model performance and the pace of benchmark progress. We compile normalized snapshots of popular benchmarks and leaderboards as of October 2025 and ask solvers to predict performance in coarse buckets (0-20%, 20-40%, and so on).

We include next-year SOTA forecasting as a forward-compatible PoT task. It is not scored in this snapshot, but it becomes automatically verifiable as new leaderboard results are published.


Data Collection Pipeline

PoT uses the same data collection and parsing pipelines at two time points.

Cutoff $t_0$: We collect a snapshot and use it to construct the evidence that solvers can access in the offline sandbox. For all tasks, we define $t_0$ as January 2025 for all models whose knowledge cutoff precedes that date. The only exception is GPT-5.2, whose knowledge cutoff is August 31, 2025; for that model, we set $t_0$ to September 1, 2025.

Horizon $t_1$: We later collect a second snapshot using the same procedures and use it to define labels and evaluate predictions. We set $t_1$ to November 2025.
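Schematically, the two-snapshot protocol looks like the sketch below. Record fields are assumptions for illustration, not the released schema:

```python
# The same parser runs at both time points; only the t1 snapshot defines labels.
T0 = "2025-01"  # evidence cutoff (2025-09 for GPT-5.2)
T1 = "2025-11"  # label horizon

def build_task(snapshot_t0: dict, snapshot_t1: dict):
    """Split the two snapshots into solver-visible evidence and scorer-only labels."""
    evidence = snapshot_t0                        # everything the sandboxed solver may read
    labels = {
        paper_id: {
            "citations": rec["citations"],        # Google Scholar, collected at t1
            "award_tier": rec.get("award_tier"),  # parsed from OpenReview
        }
        for paper_id, rec in snapshot_t1.items()
    }
    return evidence, labels
```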

Data Sources

  • Paper metadata: OpenReview. We parse titles, abstracts, authors, venues, years, keywords, and subject areas.
  • Citations: Google Scholar, collected at $t_1$.
  • Awards: Parsed from OpenReview (Findings, Main, Outstanding, Best).
  • SOTA benchmarks: Compiled from mainstream benchmarks and public leaderboards.
  • Professor publications: OpenReview metadata with canonicalized author names.

Task Suite Statistics

| Task | Description | N Samples | Temporal Status |
|---|---|---|---|
| Citation MCQ | Select the most-cited paper | 200 | Post-cutoff |
| Citation Rank | Rank papers by predicted citations | 200 | Post-cutoff |
| Citation Bucket | Citation bucket prediction | 200 | Post-cutoff |
| Award (Pre-cutoff) | Award tier classification | 197 | Pre-cutoff |
| Award (Post-cutoff) | Award tier classification | 122 | Post-cutoff |
| Prof. Field | Professor field prediction | 73 | Post-cutoff |
| Prof. Article | Professor article attribution | 76 | Post-cutoff |
| Field Focus | Field focus classification | 9 | Post-cutoff |
| SOTA Bucket | SOTA benchmark forecasting | 45 | Post-cutoff |

Experimental Setup

Models

We evaluated frontier models across three major providers:

| Provider | Models |
|---|---|
| Anthropic | Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5 |
| Google | Gemini 3 Pro Preview, Gemini 3 Flash Preview, Gemini 2.5 Pro, Gemini 2.5 Flash |
| OpenAI | GPT-5.2, GPT-5.1, GPT-5 Mini, GPT-5 Nano |

All models were evaluated using their default API sampling parameters (temperature, top_p, etc.) without task-specific tuning.

Solvers

We compare three solver configurations:

  1. Zero-shot (direct generation): The model answers from the prompt only, without tools or local data access. This tests parametric-only forecasting ability.
  2. Agentic: A single tool-using agent runs in a Docker sandbox with access to bash, python, and a text editor, iteratively gathering evidence from the offline snapshot.
  3. Agentic + structured prompt: Same agent loop as (2), but with an additional structured preamble that emphasizes offline-only operation and prescribes a more explicit tool-use protocol.

Test-time Compute via Message Limits

To study inference-time scaling, we vary a message limit: the maximum number of environment interaction turns before the agent must finalize an answer. We evaluate budgets of 15, 30, and 50 messages, corresponding to low, medium, and high test-time compute.
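A minimal sketch of such a message-limited loop follows, with `model` and `tools` as stand-ins for the sandboxed runtime rather than a real API:

```python
def run_agent(model, tools, task_prompt: str, message_limit: int = 50):
    """Minimal message-limited agent loop (illustrative, not our exact runtime)."""
    messages = [{"role": "user", "content": task_prompt}]
    for turn in range(message_limit):
        reply = model(messages)                        # one environment interaction
        if reply.get("final_answer") is not None:
            return reply["final_answer"], turn + 1
        result = tools.execute(reply["tool_call"])     # bash / python / text editor
        messages.append({"role": "assistant", "content": reply["content"]})
        messages.append({"role": "tool", "content": result})
    return None, message_limit                         # "incomplete": budget exhausted
```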


🔑 Key Experimental Results

RQ1: Does Test-time Compute Help?

*[Figure: core results summary.]*

Yes. Increasing the message limit from 15 to 50 yields large improvements for most models.

Full scaling summary:

| Model | Acc@15 | Acc@30 | Acc@50 | Δ (15→50) |
|---|---|---|---|---|
| Claude Haiku 4.5 | 8.4% | 26.1% | 35.1% | +26.7pp |
| Claude Opus 4.5 | 16.7% | 28.2% | 41.2% | +24.5pp |
| Claude Sonnet 4.5 | 2.9% | 25.9% | 31.0% | +28.1pp |
| Gemini 3 Pro | 27.2% | 44.8% | 55.8% | +28.6pp |
| GPT-5.2 | 17.9% | 39.3% | 40.1% | +22.2pp |
| Gemini 2.5 Pro | 26.7% | 55.6% | 53.6% | +26.9pp |
| GPT-5.1 | 26.7% | 44.3% | 45.0% | +18.4pp |
| Gemini 3 Flash | 20.3% | 32.0% | 39.2% | +18.8pp |
| Gemini 2.5 Flash | 17.2% | 37.0% | 37.5% | +20.4pp |
| GPT-5 Nano | 21.7% | 35.1% | 33.9% | +12.1pp |
| GPT-5 Mini | 22.3% | 38.7% | 37.7% | +15.4pp |

Scaling behavior varies by model family: Claude models show the steepest gains, suggesting they convert additional interaction into more effective retrieval and verification. GPT models improve with budget but tend to plateau earlier.

*[Figures: scaling gains, by model family and by individual model.]*


RQ2: Do Agents Outperform Zero-Shot?

It depends on the task.

| Task Family | Agent Gain | Why? |
|---|---|---|
| Faculty | Large (+60pp) | Tool use is critical for digging through publication histories. This is where agentic loops shine. |
| Citations | Moderate (+10pp) | Some benefit from exploring abstracts, but correlation signals are mixed. |
| Awards | Small | Peer-review outcomes may be too noisy for tools to provide a decisive signal. |
| SOTA | Negligible | Task is already near-ceiling for many models in the zero-shot setting. |

Takeaway: Agents are not a universal upgrade. They help most when the task requires aggregating scattered evidence, like finding all of a professor's papers and inferring their research trajectory.

*[Figures: simple vs. agentic performance; agentic vs. zero-shot scatter at message limits 50 and 30; task-family efficiency.]*


RQ3: Does the Structured Agent Prompt Help?

The structured agent prompt has family-dependent effects:

  • Claude models tend to benefit from the added behavioral constraints.
  • GPT models are often neutral or slightly worse, suggesting the prompt may be over-prescriptive.
  • Gemini models show mixed outcomes.

These results indicate that prompt policies are not a universal "agent upgrade" and that model-specific prompt tuning can matter.


RQ4: Does Post-Cutoff Evaluation Change Conclusions?

Yes, materially. We compared pre-cutoff vs. post-cutoff performance on the Awards task (agentic + structured prompt at message limit 50).

| Model | Pre-cutoff | Post-cutoff | Δ |
|---|---|---|---|
| Claude Haiku 4.5 | 2.0% | 12.3% | +10.3pp |
| Claude Opus 4.5 | 1.5% | 23.1% | +21.6pp |
| Claude Sonnet 4.5 | 0.0% | 2.5% | +2.5pp |
| Gemini 2.5 Flash | 47.2% | 29.5% | -17.7pp |
| Gemini 2.5 Pro | 26.4% | 33.6% | +7.2pp |
| Gemini 3 Flash | 0.0% | 0.0% | +0.0pp |
| Gemini 3 Pro | 21.8% | 18.8% | -3.0pp |
| GPT-5 Mini | 8.6% | 32.0% | +23.3pp |
| GPT-5 Nano | 35.5% | 25.4% | -10.1pp |
| GPT-5.1 | 16.8% | 32.0% | +15.2pp |
| GPT-5.2 | 3.5% | 28.7% | +25.1pp |

This is a strong argument for post-cutoff evaluation: it removes the possibility that models are simply recalling information from their training data.


What Goes Wrong in Agentic Runs?

We analyzed 1,759 agent traces using an LLM-as-judge protocol (Gemini 3 Pro) with stratum-specific rubrics.

Outcome distribution:

  • 35.8% Complete-correct
  • 35.5% Complete-wrong
  • 28.7% Incomplete (hit the message budget)
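For concreteness, here is a hedged sketch of the judging step. The prompt wording, flag names, and JSON contract are illustrative, not the exact stratum-specific rubrics we used:

```python
import json

RUBRIC_FLAGS = ["redundancy", "lucky_guess", "unsupported_claims", "verification"]

def judge_trace(judge, trace: str, rubric: str) -> dict:
    """Ask the judge model to classify one agent trace (illustrative protocol)."""
    prompt = (
        f"Rubric:\n{rubric}\n\nAgent trace:\n{trace}\n\n"
        "Classify the outcome as complete-correct, complete-wrong, or incomplete, "
        f"and set a boolean for each flag in {RUBRIC_FLAGS}. Reply with JSON only."
    )
    # e.g. {"outcome": "complete-wrong", "redundancy": true, ...}
    return json.loads(judge(prompt))
```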

Why Runs Succeed

Complete-correct runs typically follow a concise "retrieve-verify-commit" pattern: the agent identifies the relevant artifact, extracts the instance-specific signal, and commits only after the instance-local evidence is in hand.

However, even among correct answers, the judge frequently notes process-level fragility:

| Flag | Fraction (of correct runs) |
|---|---|
| Redundancy noted | 78.6% |
| Lucky/guessing noted | 43.1% |
| Unsupported claims noted | 27.3% |
| Verification noted | 15.7% |

This suggests that outcome correctness alone can mask brittle reasoning paths, motivating trace-level analysis.

Why Runs Fail

For complete-wrong runs, the dominant failure modes are:

| Category | Share |
|---|---|
| Reasoning error | 37.7% |
| Retrieval/tooling failure | 36.3% |
| Stopping too early | 8.6% |
| Format/constraint violation | 7.5% |
| Task misunderstanding | 5.2% |
| Evidence misread | 4.7% |

Failures often arise when the agent retrieves broadly relevant materials but misattributes them to the target instance, or when it commits based on partial evidence without a final verification step.

Why Runs are Incomplete

Incomplete runs are primarily characterized by looping/thrashing and budget exhaustion:

| Category | Share |
|---|---|
| Looping / thrashing | 36.5% |
| Budget exhaustion | 28.9% |
| Budget exhaustion via looping | 19.9% |
| Other | 9.0% |
| Tooling / environment errors | 4.4% |
| Missing resource / data | 1.2% |

In many traces, the judge identifies a clear information bottleneck (most commonly retrieval/search formulation or parsing), but the agent fails to reach a decisive "last-mile" step before the message limit.

*[Figure: hit-limit analysis.]*


When Do Agents Pay Off? (Cost-Efficiency Analysis)

Tool-using agents can improve PoT performance, but the gains are rarely free. In our runs, agentic execution often increases token consumption by one to two orders of magnitude relative to a zero-shot baseline, and the resulting accuracy improvements are highly model- and task-dependent.

Key patterns:

  • At low budgets, several model configurations yield negligible or even negative gains despite substantial overhead.
  • Higher message limits can unlock improvements for some models but not others.
  • Evidence-intensive families such as Faculty and Citations benefit most from increased interaction/compute budgets.
  • Awards and SOTA exhibit smaller or less reliable gains under comparable overhead.

Pragmatic policy: Use zero-shot or low-budget agents for routine tasks, and reserve higher-budget agents for cases requiring extensive evidence gathering, critical verification, or where errors are costly.
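This policy can be written down as a toy routing rule. The task groupings follow the patterns above, but the rule itself is an assumption for illustration, not a measured optimum:

```python
def choose_solver(task_family: str, error_cost: str) -> str:
    """Toy routing policy: route evidence-intensive or high-stakes tasks to a
    high-budget agent, everything else to zero-shot. Thresholds are assumptions."""
    evidence_intensive = {"faculty", "citations"}
    if task_family.lower() in evidence_intensive or error_cost == "high":
        return "agentic, message limit 50"   # worth the 10-100x token overhead
    return "zero-shot"                       # overhead rarely pays off for awards / SOTA
```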

*[Figure: agent efficiency frontier.]*

Overall Model Performance

*[Figures: overall performance at message limits 50 and 30.]*


Task Difficulty

Award-tier prediction tasks are consistently among the most challenging, reflecting the noisiness and subjectivity of peer-review outcomes as supervision signals. More structured tasks such as SOTA bucket prediction and professor-article attribution are substantially easier.

*[Figures: task difficulty ranking; model-task heatmap.]*


Compute Cost

The total cost of developing our benchmark was $11,364.33.

| Provider | Model | Total Cost |
|---|---|---|
| OpenAI | GPT-5.2 | $212.10 |
| OpenAI | GPT-5.1 | $120.62 |
| OpenAI | GPT-5 Mini | $45.74 |
| OpenAI | GPT-5 Nano | $14.44 |
| Google | Gemini 2.5 Flash | $91.00 |
| Google | Gemini 2.5 Pro | $374.00 |
| Google | Gemini 3 Flash | $184.00 |
| Google | Gemini 3 Pro | $696.00 |
| Anthropic | Claude Haiku 4.5 | $395.71 |
| Anthropic | Claude Sonnet 4.5 | $899.54 |
| Anthropic | Claude Opus 4.5 | $1,781.51 |

Increasing the message limit from 15 to 50 often yields superlinear growth in tokens due to longer deliberation, tool use, and repeated context updates.


The Structured Agent Prompt

The structured system prompt for our agentic + structured prompt setting is inspired by the prompt design of Google's Antigravity agent. We adapted it to a fully sandboxed, local-only setting:

Offline Antigravity Inspired Agent (Local-Only)

You are Antigravity, a powerful agentic AI assistant for all project tasks in this repo (analysis, writing, data prep, debugging, Inspect AI benchmarks, dashboards, docs). Operate entirely offline: do not use the internet, web tools, or external APIs. Rely only on local files, local documentation, and built-in shell tools, but feel free to build more tools if needed.

Core Behavior:

  • Collaborate on whatever task the user defines.
  • Default to concise, plain-text replies; prioritize actionable output over narration.
  • Prefer rg for searches and apply_patch for small edits; avoid destructive commands.
  • Never revert user changes unless explicitly asked. Do not use networked package installs or web lookups.

Safety and Limits:

  • No internet or web browsing. No external tool calls beyond local shell commands.
  • Keep edits ASCII unless the file already uses non-ASCII.
  • Respect existing project conventions, workflows, and user-provided instructions.

This adaptation ensures that agent behavior is reproducible, auditable, and free from post-cutoff or external information leakage.


Practical Takeaways

  1. Agents help where evidence is deep. Tasks that require aggregating scattered information benefit most from agentic loops.

  2. Semi-verifiable signals are a viable path. Using time-delayed signals like citations allows for objective, scalable evaluation of "fuzzy" concepts like idea quality.

  3. Offline sandboxing is essential. It ensures that performance gains come from better reasoning, not just better retrieval.

  4. Post-cutoff evaluation is crucial. It's the only way to be confident you're not just measuring memorization.

  5. Track cost-efficiency, not just accuracy. Agentic runs can be 10-100x more expensive; reserve them for high-stakes tasks.

  6. Trace-level analysis matters. Even correct answers can come from brittle reasoning paths.


Limitations

  • Proxy targets. Citations and awards are imperfect proxies for "idea quality." They can be influenced by visibility, community dynamics, and measurement noise.

  • Temporal and collection choices. Labels depend on the specific cutoff $t_0$ and horizon $t_1$, as well as the data sources and joining heuristics used.

  • Offline sandbox realism. Removing live web access improves experimental control but departs from how researchers and deployed assistants typically operate.

  • Agent loop and compute budget. Our single-agent interaction protocol may not generalize to other agent architectures, termination policies, or tool interfaces.


Resources

@misc{ye2026prooftimebenchmarkevaluating,
      title={Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments}, 
      author={Bingyang Ye and Shan Chen and Jingxuan Tu and Chen Liu and Zidi Xiong and Samuel Schmidgall and Danielle S. Bitterman},
      year={2026},
      eprint={2601.07606},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.07606}, 
}
