Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Community Article Published January 13, 2026

Authors: Bingyang Ye*, Shan Chen*, Jingxuan Tu, Chen Liu, Zidi Xiong, Samuel Schmidgall, Danielle S. Bitterman

Institutions: Harvard University | Mass General Brigham | Boston Children's Hospital | Brandeis University | Yale University | Johns Hopkins University

Please read this as a late-stage work in progress: we are sharing it in the spirit of colleagues presenting at a lab meeting, in the hope of helping and motivating parallel research.

Introduction

You ask an AI to assess a paper's future impact. It gives you a confident answer, but are you sure it's not just pattern-matching on author prestige or topic popularity? And how would you even know if it's right?

Evaluating scientific ideas is hard. Peer review is indispensable but expensive, slow, and sometimes inconsistent. Static benchmarks can help, but they have validity limits: they measure what a model knows right now, not its ability to anticipate what will happen next. What often matters most is whether an idea stands the test of time.

This brings us to a core question:

Can we build a benchmark where the answer key is revealed by the future itself?

We introduce Proof of Time (PoT), a new benchmarking framework that links scientific idea judgments to downstream signals that become observable later, such as:

  • Citation counts (impact)
  • Peer-review awards (scientific value)
  • Shifts in researchers' agendas (research evolution)
  • SOTA benchmark trajectories (technological frontier)

By freezing a pre-cutoff snapshot of evidence in an offline sandbox and asking models to forecast post-cutoff outcomes, PoT enables:

  1. Verifiable evaluation when ground truth arrives.
  2. Scalable benchmarking without exhaustive expert annotation.
  3. Analysis of misalignment between model judgments and human processes.

Our code and data will be available at:

🔗 GitHub

🤗 Hugging Face


The Core Idea: Time-Partitioned Evaluation

PoT operationalizes a simple principle:

Freeze evidence at time $t_0$, ask the model to predict signals observed at $t_1 > t_0$, and score when the world reveals $t_1$.

We call this semi-verifiable because PoT uses real-world signals that are externally verifiable (like citation counts or official award lists), even though they are imperfect proxies for the underlying, harder-to-define construct of "idea quality."
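To make the time-partitioned protocol concrete, here is a minimal Python sketch of a semi-verifiable instance. The class and field names are hypothetical illustrations, not PoT's actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PoTInstance:
    # Hypothetical schema for illustration only.
    evidence: dict               # frozen snapshot collected at t0 (pre-cutoff)
    question: str                # e.g., "Which candidate will be cited most?"
    label: Optional[str] = None  # observed at t1; unknown until then

def score(instance: PoTInstance, prediction: str) -> Optional[float]:
    """Score only once the post-cutoff label has been observed."""
    if instance.label is None:
        return None              # not yet verifiable; re-score after t1
    return float(prediction == instance.label)
```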

The Offline Sandbox

A central design choice is the offline sandbox. To make tool use measurable and prevent information leakage, PoT runs all solvers in a network-isolated environment.

This means agents must rely on:

  1. Frozen Evidence: A snapshot of papers, leaderboards, and publication histories available before the cutoff.
  2. Local Tools Only: Python, file operations, and text editors. No internet access. No "just Google it."

This design makes agent improvements interpretable: any gain must come from better use of the same frozen evidence, not from hidden access to fresher information.
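As a concrete illustration, network isolation of this kind can be enforced with the Docker CLI's `--network none` flag. The image name, mount path, and entrypoint below are placeholders, not the exact configuration used here:

```python
import subprocess

# Launch a network-isolated solver container; names and paths are placeholders.
subprocess.run(
    [
        "docker", "run", "--rm",
        "--network", "none",                      # no internet: offline by construction
        "-v", "/data/pot_snapshot:/evidence:ro",  # frozen pre-cutoff evidence, read-only
        "pot-solver:latest",                      # hypothetical agent image
        "python", "run_agent.py", "--task", "citations",
    ],
    check=True,
)
```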

The PoT Workflow

*[Figure: the PoT workflow.]*

Task Families: What Does PoT Evaluate?

PoT instantiates four task families, each probing a distinct notion of future-facing research judgment.

| Family | What it evaluates | Verifiable Signal |
|---|---|---|
| 🔵 Impact Prediction (Citations) | Forecasting paper influence | Post-cutoff citation counts |
| 🟠 Peer-Review Awards | Alignment with human judgment | Official conference award lists |
| 🟣 Research Evolution (Faculty) | Reasoning about research trajectories | Post-cutoff publications |
| 🟢 Technological Frontier (SOTA) | Extrapolating benchmark progress | Post-cutoff leaderboard updates |

🔵 Impact Prediction (Citations)

For the Citations tasks, we give the model pre-cutoff evidence from papers published in top NLP venues (ACL, NAACL, EMNLP) during 2021-2024, including titles, abstracts, author information, and citation counts as of October 2025.

We ask the model to forecast the citation counts of a candidate set of four post-cutoff papers from ACL or NAACL 2025. PoT supports three output formats:

  • MCQ: Select which candidate will have the most citations.
  • Ranking: Order all candidates by predicted citations.
  • Bucket: Map each paper to a coarse citation range (e.g., 0-10, 10-50, 50-200, 200+).

To reduce trivial shortcuts, each MCQ/Ranking instance uses a matched candidate set of four papers from the same venue-year combination, so comparisons are not driven by natural visibility effects.
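For intuition, here is a minimal sketch of how matched candidate sets and citation buckets could be constructed. The field names (`venue`, `year`, `citations_t1`) and the bucket boundary conventions are illustrative, not the released data schema:

```python
import random
from collections import defaultdict

def citation_bucket(citations: int) -> str:
    """Map a raw citation count to the coarse ranges used by the Bucket task."""
    if citations < 10:
        return "0-10"
    if citations < 50:
        return "10-50"
    if citations < 200:
        return "50-200"
    return "200+"

def matched_mcq_instances(papers, k=4, seed=0):
    """Build MCQ candidate sets matched on (venue, year), so comparisons are
    not driven by natural visibility differences across venues or years."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for p in papers:  # assumed fields: id, venue, year, citations_t1
        groups[(p["venue"], p["year"])].append(p)
    instances = []
    for group in groups.values():
        rng.shuffle(group)
        for i in range(0, len(group) - k + 1, k):
            candidates = group[i : i + k]
            answer = max(candidates, key=lambda c: c["citations_t1"])["id"]
            instances.append({"candidates": candidates, "answer": answer})
    return instances
```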

🟠 Peer-Review Awards

The Awards task evaluates alignment with peer-review judgments. We construct instances from the same three NLP venues. The task is to predict each paper's award tier: Findings, Main, Outstanding, or Best.

Importantly, for Awards only, we include a diagnostic pre-cutoff evaluation split. This split asks the same award-tier prediction question for papers within the evidence window, primarily to probe reliance on memorized or training-exposed information rather than evidence-grounded inference.

🟣 Research Evolution (Faculty)

The Faculty task evaluates whether solvers can extrapolate from a researcher's publication history to infer their near-future focus. Evidence is restricted to each professor's publications from 2021-2024, and targets are defined using 2025 publications.

Sub-tasks include:

  • Professor → Field: Predict the professor's primary research area from a fixed taxonomy.
  • Professor → Article Identification: Given an anonymized 2025 paper (authors removed), select which professor wrote it (with a "None" option).
  • Papers → Field: Classify sets of recent papers by field.

🟢 Technological Frontier (SOTA)

The SOTA task measures whether solvers can reason about frontier model performance and the pace of benchmark progress. We compile normalized snapshots of popular benchmarks and leaderboards as of October 2025 and ask solvers to predict performance in coarse buckets (0-20%, 20-40%, and so on).

We include next-year SOTA forecasting as a forward-compatible PoT task. It is not scored in this snapshot, but it becomes automatically verifiable as new leaderboard results are published.


Data Collection Pipeline

PoT uses the same data collection and parsing pipelines at two time points.

Cutoff $t_0$: We collect a snapshot and use it to construct the evidence that solvers can access in the offline sandbox. For all tasks, we define $t_0$ as January 2025 for all models whose knowledge cutoff precedes that date. The only exception is GPT-5.2, whose knowledge cutoff is August 31, 2025; for that model, we set $t_0$ to September 1, 2025.

Horizon $t_1$: We later collect a second snapshot using the same procedures and use it to define labels and evaluate predictions. We set $t_1$ to November 2025.
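Schematically, the two-snapshot protocol looks like the sketch below. Record fields are assumptions for illustration, not the released schema:

```python
# The same parser runs at both time points; only the t1 snapshot defines labels.
T0 = "2025-01"  # evidence cutoff (2025-09 for GPT-5.2)
T1 = "2025-11"  # label horizon

def build_task(snapshot_t0: dict, snapshot_t1: dict):
    """Split the two snapshots into solver-visible evidence and scorer-only labels."""
    evidence = snapshot_t0                        # everything the sandboxed solver may read
    labels = {
        paper_id: {
            "citations": rec["citations"],        # Google Scholar, collected at t1
            "award_tier": rec.get("award_tier"),  # parsed from OpenReview
        }
        for paper_id, rec in snapshot_t1.items()
    }
    return evidence, labels
```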

Data Sources

  • Paper metadata: OpenReview. We parse titles, abstracts, authors, venues, years, keywords, and subject areas.
  • Citations: Google Scholar, collected at $t_1$.
  • Awards: Parsed from OpenReview (Findings, Main, Outstanding, Best).
  • SOTA benchmarks: Compiled from mainstream benchmarks and public leaderboards.
  • Professor publications: OpenReview metadata with canonicalized author names.

Task Suite Statistics

| Task | Description | N Samples | Temporal Status |
|---|---|---|---|
| Citation MCQ | Select the most-cited paper | 200 | Post-cutoff |
| Citation Rank | Rank papers by predicted citations | 200 | Post-cutoff |
| Citation Bucket | Citation bucket prediction | 200 | Post-cutoff |
| Award (Pre-cutoff) | Award tier classification | 197 | Pre-cutoff |
| Award (Post-cutoff) | Award tier classification | 122 | Post-cutoff |
| Prof. Field | Professor field prediction | 73 | Post-cutoff |
| Prof. Article | Professor article attribution | 76 | Post-cutoff |
| Field Focus | Field focus classification | 9 | Post-cutoff |
| SOTA Bucket | SOTA benchmark forecasting | 45 | Post-cutoff |

Experimental Setup

Models

We evaluated frontier models across three major providers:

| Provider | Models |
|---|---|
| Anthropic | Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5 |
| Google | Gemini 3 Pro Preview, Gemini 3 Flash Preview, Gemini 2.5 Pro, Gemini 2.5 Flash |
| OpenAI | GPT-5.2, GPT-5.1, GPT-5 Mini, GPT-5 Nano |

All models were evaluated using their default API sampling parameters (temperature, top_p, etc.) without task-specific tuning.

Solvers

We compare three solver configurations:

  1. Zero-shot (direct generation): The model answers from the prompt only, without tools or local data access. This tests parametric-only forecasting ability.
  2. Agentic: A single tool-using agent runs in a Docker sandbox with access to bash, python, and a text editor, iteratively gathering evidence from the offline snapshot.
  3. Agentic + structured prompt: Same agent loop as (2), but with an additional structured preamble that emphasizes offline-only operation and prescribes a more explicit tool-use protocol.

Test-time Compute via Message Limits

To study inference-time scaling, we vary a message limit: the maximum number of environment interaction turns before the agent must finalize an answer. We evaluate budgets of 15, 30, and 50 messages, corresponding to low, medium, and high test-time compute.
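A minimal sketch of such a message-limited loop follows, with `model` and `tools` as stand-ins for the sandboxed runtime rather than a real API:

```python
def run_agent(model, tools, task_prompt: str, message_limit: int = 50):
    """Minimal message-limited agent loop (illustrative, not our exact runtime)."""
    messages = [{"role": "user", "content": task_prompt}]
    for turn in range(message_limit):
        reply = model(messages)                        # one environment interaction
        if reply.get("final_answer") is not None:
            return reply["final_answer"], turn + 1
        result = tools.execute(reply["tool_call"])     # bash / python / text editor
        messages.append({"role": "assistant", "content": reply["content"]})
        messages.append({"role": "tool", "content": result})
    return None, message_limit                         # "incomplete": budget exhausted
```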


🔑 Key Experimental Results

RQ1: Does Test-time Compute Help?

*[Figure: core results summary.]*

Yes. Increasing the message limit from 15 to 50 yields large improvements for most models.

Full scaling summary:

| Model | Acc@15 | Acc@30 | Acc@50 | Δ (15→50) |
|---|---|---|---|---|
| Claude Haiku 4.5 | 8.4% | 26.1% | 35.1% | +26.7pp |
| Claude Opus 4.5 | 16.7% | 28.2% | 41.2% | +24.5pp |
| Claude Sonnet 4.5 | 2.9% | 25.9% | 31.0% | +28.1pp |
| Gemini 3 Pro | 27.2% | 44.8% | 55.8% | +28.6pp |
| GPT-5.2 | 17.9% | 39.3% | 40.1% | +22.2pp |
| Gemini 2.5 Pro | 26.7% | 55.6% | 53.6% | +26.9pp |
| GPT-5.1 | 26.7% | 44.3% | 45.0% | +18.4pp |
| Gemini 3 Flash | 20.3% | 32.0% | 39.2% | +18.8pp |
| Gemini 2.5 Flash | 17.2% | 37.0% | 37.5% | +20.4pp |
| GPT-5 Nano | 21.7% | 35.1% | 33.9% | +12.1pp |
| GPT-5 Mini | 22.3% | 38.7% | 37.7% | +15.4pp |

Scaling behavior varies by model family: Claude models show the steepest gains, suggesting they convert additional interaction into more effective retrieval and verification. GPT models improve with budget but tend to plateau earlier.

*[Figures: scaling gains, by model family and by individual model.]*


RQ2: Do Agents Outperform Zero-Shot?

It depends on the task.

| Task Family | Agent Gain | Why? |
|---|---|---|
| Faculty | Large (+60pp) | Tool use is critical for digging through publication histories. This is where agentic loops shine. |
| Citations | Moderate (+10pp) | Some benefit from exploring abstracts, but correlation signals are mixed. |
| Awards | Small | Peer-review outcomes may be too noisy for tools to provide a decisive signal. |
| SOTA | Negligible | Task is already near-ceiling for many models in the zero-shot setting. |

Takeaway: Agents are not a universal upgrade. They help most when the task requires aggregating scattered evidence, like finding all of a professor's papers and inferring their research trajectory.

*[Figures: simple vs. agentic performance; agentic vs. zero-shot scatter at message limits 50 and 30; task-family efficiency.]*


RQ3: Does the Structured Agent Prompt Help?

The structured agent prompt has family-dependent effects:

  • Claude models tend to benefit from the added behavioral constraints.
  • GPT models are often neutral or slightly worse, suggesting the prompt may be over-prescriptive.
  • Gemini models show mixed outcomes.

These results indicate that prompt policies are not a universal "agent upgrade" and that model-specific prompt tuning can matter.


RQ4: Does Post-Cutoff Evaluation Change Conclusions?

Yes, materially. We compared pre-cutoff vs. post-cutoff performance on the Awards task (agentic + structured prompt at message limit 50).

| Model | Pre-cutoff | Post-cutoff | Δ |
|---|---|---|---|
| Claude Haiku 4.5 | 2.0% | 12.3% | +10.3pp |
| Claude Opus 4.5 | 1.5% | 23.1% | +21.6pp |
| Claude Sonnet 4.5 | 0.0% | 2.5% | +2.5pp |
| Gemini 2.5 Flash | 47.2% | 29.5% | -17.7pp |
| Gemini 2.5 Pro | 26.4% | 33.6% | +7.2pp |
| Gemini 3 Flash | 0.0% | 0.0% | +0.0pp |
| Gemini 3 Pro | 21.8% | 18.8% | -3.0pp |
| GPT-5 Mini | 8.6% | 32.0% | +23.3pp |
| GPT-5 Nano | 35.5% | 25.4% | -10.1pp |
| GPT-5.1 | 16.8% | 32.0% | +15.2pp |
| GPT-5.2 | 3.5% | 28.7% | +25.1pp |

This is a strong argument for post-cutoff evaluation: it removes the possibility that models are simply recalling information from their training data.


What Goes Wrong in Agentic Runs?

We analyzed 1,759 agent traces using an LLM-as-judge protocol (Gemini 3 Pro) with stratum-specific rubrics.

Outcome distribution:

  • 35.8% Complete-correct
  • 35.5% Complete-wrong
  • 28.7% Incomplete (hit the message budget)
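For concreteness, here is a hedged sketch of the judging step. The prompt wording, flag names, and JSON contract are illustrative, not the exact stratum-specific rubrics we used:

```python
import json

RUBRIC_FLAGS = ["redundancy", "lucky_guess", "unsupported_claims", "verification"]

def judge_trace(judge, trace: str, rubric: str) -> dict:
    """Ask the judge model to classify one agent trace (illustrative protocol)."""
    prompt = (
        f"Rubric:\n{rubric}\n\nAgent trace:\n{trace}\n\n"
        "Classify the outcome as complete-correct, complete-wrong, or incomplete, "
        f"and set a boolean for each flag in {RUBRIC_FLAGS}. Reply with JSON only."
    )
    # e.g. {"outcome": "complete-wrong", "redundancy": true, ...}
    return json.loads(judge(prompt))
```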

Why Runs Succeed

Complete-correct runs typically follow a concise "retrieve-verify-commit" pattern: the agent identifies the relevant artifact, extracts the instance-specific signal, and commits only after the instance-local evidence is in hand.

However, even among correct answers, the judge frequently notes process-level fragility:

| Flag | Fraction (of correct runs) |
|---|---|
| Redundancy noted | 78.6% |
| Lucky/guessing noted | 43.1% |
| Unsupported claims noted | 27.3% |
| Verification noted | 15.7% |

This suggests that outcome correctness alone can mask brittle reasoning paths, motivating trace-level analysis.

Why Runs Fail

For complete-wrong runs, the dominant failure modes are:

| Category | Share |
|---|---|
| Reasoning error | 37.7% |
| Retrieval/tooling failure | 36.3% |
| Stopping too early | 8.6% |
| Format/constraint violation | 7.5% |
| Task misunderstanding | 5.2% |
| Evidence misread | 4.7% |

Failures often arise when the agent retrieves broadly relevant materials but misattributes them to the target instance, or when it commits based on partial evidence without a final verification step.

Why Runs are Incomplete

Incomplete runs are primarily characterized by looping/thrashing and budget exhaustion:

| Category | Share |
|---|---|
| Looping / thrashing | 36.5% |
| Budget exhaustion | 28.9% |
| Budget exhaustion via looping | 19.9% |
| Other | 9.0% |
| Tooling / environment errors | 4.4% |
| Missing resource / data | 1.2% |

In many traces, the judge identifies a clear information bottleneck (most commonly retrieval/search formulation or parsing), but the agent fails to reach a decisive "last-mile" step before the message limit.

*[Figure: hit-limit analysis.]*


When Do Agents Pay Off? (Cost-Efficiency Analysis)

Tool-using agents can improve PoT performance, but the gains are rarely free. In our runs, agentic execution often increases token consumption by one to two orders of magnitude relative to a zero-shot baseline, and the resulting accuracy improvements are highly model- and task-dependent.

Key patterns:

  • At low budgets, several model configurations yield negligible or even negative gains despite substantial overhead.
  • Higher message limits can unlock improvements for some models but not others.
  • Evidence-intensive families such as Faculty and Citations benefit most from increased interaction/compute budgets.
  • Awards and SOTA exhibit smaller or less reliable gains under comparable overhead.

Pragmatic policy: Use zero-shot or low-budget agents for routine tasks, and reserve higher-budget agents for cases requiring extensive evidence gathering, critical verification, or where errors are costly.
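This policy can be written down as a toy routing rule. The task groupings follow the patterns above, but the rule itself is an assumption for illustration, not a measured optimum:

```python
def choose_solver(task_family: str, error_cost: str) -> str:
    """Toy routing policy: route evidence-intensive or high-stakes tasks to a
    high-budget agent, everything else to zero-shot. Thresholds are assumptions."""
    evidence_intensive = {"faculty", "citations"}
    if task_family.lower() in evidence_intensive or error_cost == "high":
        return "agentic, message limit 50"   # worth the 10-100x token overhead
    return "zero-shot"                       # overhead rarely pays off for awards / SOTA
```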

*[Figure: agent efficiency frontier.]*

Overall Model Performance

*[Figures: overall performance at message limits 50 and 30.]*


Task Difficulty

Award-tier prediction tasks are consistently among the most challenging, reflecting the noisiness and subjectivity of peer-review outcomes as supervision signals. More structured tasks such as SOTA bucket prediction and professor-article attribution are substantially easier.

*[Figures: task difficulty ranking; model-task heatmap.]*


Compute Cost

The total cost of developing our benchmark was $11,364.33.

| Provider | Model | Total Cost |
|---|---|---|
| OpenAI | GPT-5.2 | $212.10 |
| OpenAI | GPT-5.1 | $120.62 |
| OpenAI | GPT-5 Mini | $45.74 |
| OpenAI | GPT-5 Nano | $14.44 |
| Google | Gemini 2.5 Flash | $91.00 |
| Google | Gemini 2.5 Pro | $374.00 |
| Google | Gemini 3 Flash | $184.00 |
| Google | Gemini 3 Pro | $696.00 |
| Anthropic | Claude Haiku 4.5 | $395.71 |
| Anthropic | Claude Sonnet 4.5 | $899.54 |
| Anthropic | Claude Opus 4.5 | $1,781.51 |

Increasing the message limit from 15 to 50 often yields superlinear growth in tokens due to longer deliberation, tool use, and repeated context updates.


The Structured Agent Prompt

The structured system prompt for our agentic + structured prompt setting is inspired by the prompt design of Google's Antigravity agent. We adapted it to a fully sandboxed, local-only setting:

Offline Antigravity Inspired Agent (Local-Only)

You are Antigravity, a powerful agentic AI assistant for all project tasks in this repo (analysis, writing, data prep, debugging, Inspect AI benchmarks, dashboards, docs). Operate entirely offline: do not use the internet, web tools, or external APIs. Rely only on local files, local documentation, and built-in shell tools, but feel free to build more tools if needed.

Core Behavior:

  • Collaborate on whatever task the user defines.
  • Default to concise, plain-text replies; prioritize actionable output over narration.
  • Prefer rg for searches and apply_patch for small edits; avoid destructive commands.
  • Never revert user changes unless explicitly asked. Do not use networked package installs or web lookups.

Safety and Limits:

  • No internet or web browsing. No external tool calls beyond local shell commands.
  • Keep edits ASCII unless the file already uses non-ASCII.
  • Respect existing project conventions, workflows, and user-provided instructions.

This adaptation ensures that agent behavior is reproducible, auditable, and free from post-cutoff or external information leakage.


Practical Takeaways

  1. Agents help where evidence is deep. Tasks that require aggregating scattered information benefit most from agentic loops.

  2. Semi-verifiable signals are a viable path. Using time-delayed signals like citations allows for objective, scalable evaluation of "fuzzy" concepts like idea quality.

  3. Offline sandboxing is essential. It ensures that performance gains come from better reasoning, not just better retrieval.

  4. Post-cutoff evaluation is crucial. It's the only way to be confident you're not just measuring memorization.

  5. Track cost-efficiency, not just accuracy. Agentic runs can be 10-100x more expensive; reserve them for high-stakes tasks.

  6. Trace-level analysis matters. Even correct answers can come from brittle reasoning paths.


Limitations

  • Proxy targets. Citations and awards are imperfect proxies for "idea quality." They can be influenced by visibility, community dynamics, and measurement noise.

  • Temporal and collection choices. Labels depend on the specific cutoff $t_0$ and horizon $t_1$, as well as the data sources and joining heuristics used.

  • Offline sandbox realism. Removing live web access improves experimental control but departs from how researchers and deployed assistants typically operate.

  • Agent loop and compute budget. Our single-agent interaction protocol may not generalize to other agent architectures, termination policies, or tool interfaces.


Resources

@misc{ye2026prooftimebenchmarkevaluating,
      title={Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments}, 
      author={Bingyang Ye and Shan Chen and Jingxuan Tu and Chen Liu and Zidi Xiong and Samuel Schmidgall and Danielle S. Bitterman},
      year={2026},
      eprint={2601.07606},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.07606}, 
}
