We built a benchmark that makes LLMs run a simulated startup for a full year: hiring decisions, shady clients, tight deadlines, and all. Only 3 out of 12 frontier models turned a profit. Most went bankrupt. Here's what we learned.
## The Problem: LLM Benchmarks Don't Test What Actually Matters
LLM agent benchmarks have gotten pretty good at testing whether models can solve a coding problem, answer a trivia question, or navigate a single tool call. But here's the thing: real-world tasks aren't neat little puzzles with clean start and end points.
Running a business, managing a project, or even just keeping your life organized requires something harder: **long-term coherence**. Can you remember what happened three months ago? Can you stick to a strategy when things get tough? Can you spot a pattern across dozens of interactions and actually change your behavior?
Existing benchmarks like Vending-Bench started exploring this territory by having LLMs manage a simulated vending machine business. Cool idea, but the feedback loops are nearly immediate. Set a bad price? Sales drop the next day.
Real business decisions don't work like that. Sometimes you take a calculated risk, eat a short-term loss, and hope the long-term payoff is worth it. Sometimes a client looks great on paper but keeps scope-creeping you into oblivion.
That's exactly what YC-Bench tests.
## What is YC-Bench?
YC-Bench drops an LLM agent into the role of CEO of a simulated startup and says: "Here's $200K. You have one year. Don't go bankrupt."
The agent interacts through a CLI tool interface, making decisions every turn (a hypothetical turn is sketched after this list):

- Browse the marketplace for contracts to accept
- Assign employees based on their (partially hidden) skill profiles
- Manage cash flow against monthly payroll that grows with every success
- Build client relationships for better future deals
- Dodge adversarial clients who lure you in with great-looking contracts, then 3x the scope after you sign
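
To make that loop concrete, here's a minimal sketch of what a single turn could look like. The environment wrapper and the command names (`market`, `inspect`, `accept`, `assign`, `dispatch`) are illustrative assumptions, not the benchmark's actual CLI:

```python
# A hypothetical YC-Bench turn. The env wrapper and command names are
# assumptions for illustration, not the benchmark's real interface.
def play_turn(env, agent):
    listings = env.run("market")                     # browse open contracts
    for task in listings:
        details = env.run(f"inspect {task.id}")      # read scope before committing
        if agent.looks_feasible(details):
            env.run(f"accept {task.id}")
            env.run(f"assign {task.id} --employee alice")
    env.run("dispatch")                              # advance the simulation one turn
```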
**The kicker:** ~35% of clients in the marketplace are adversarial. They offer competitively high rewards, so you can't just filter by price. You have to figure out who's scamming you by analyzing your own track record of successes and failures.
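
Since price is useless as a filter, the only reliable signal is your own history. Here's a minimal sketch of the bookkeeping an agent might do; the record fields and thresholds are our assumptions, not part of YC-Bench:

```python
from collections import defaultdict

# Illustrative adversary detection from the agent's own track record.
# (scope inflation ratio, success flag) is an assumed logging format.
history: dict[str, list[tuple[float, bool]]] = defaultdict(list)

def record_outcome(client: str, promised_scope: float,
                   final_scope: float, succeeded: bool) -> None:
    history[client].append((final_scope / promised_scope, succeeded))

def is_suspect(client: str, inflation_threshold: float = 1.5,
               min_samples: int = 2) -> bool:
    outcomes = history[client]
    if len(outcomes) < min_samples:
        return False  # too little evidence; price alone can't tell you
    avg_inflation = sum(r for r, _ in outcomes) / len(outcomes)
    fail_rate = sum(not ok for _, ok in outcomes) / len(outcomes)
    return avg_inflation >= inflation_threshold or fail_rate > 0.5
```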
## The Memory Challenge
Here's where it gets really interesting. The agent's conversation history is truncated to the **last 20 turns**. That means anything the model learned 30 turns ago, like "Client X inflated scope on every single task," is gone unless the agent proactively writes it down in a **persistent scratchpad**.
This is a direct test of whether models can:

- Recognize what information is worth saving
- Actually save it
- Refer back to it later
- Act on it consistently
Spoiler: most can't.
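
For intuition, the pattern the benchmark rewards looks something like a file-backed key-value store. This sketch is ours; YC-Bench's actual scratchpad tool may expose a different interface:

```python
import json
from pathlib import Path

SCRATCHPAD = Path("scratchpad.json")  # persists across context truncation

def save_note(key: str, value: str) -> None:
    notes = json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else {}
    notes[key] = value
    SCRATCHPAD.write_text(json.dumps(notes, indent=2))

def recall(key: str, default: str | None = None) -> str | None:
    if not SCRATCHPAD.exists():
        return default
    return json.loads(SCRATCHPAD.read_text()).get(key, default)

# The winning pattern, 30 turns later:
# save_note("client:X", "inflated scope on every task -- blacklist")
# recall("client:X")  # still there after the context window rolled over
```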
## The Results: A Bloodbath
We tested 12 frontier models across 7 providers. The results are... humbling.
| Tier | Models | Avg Final Funds | Bankruptcies |
|------|--------|-----------------|--------------|
| The Winners | Claude Opus 4.6, GLM-5, GPT-5.4 | > $1,000,000 | 0/3 each |
| Survived | Kimi-K2.5, Gemini 3 Flash | $200K–$400K | 0–1/3 |
| Went Broke | Sonnet 4.6, Qwen 3.5, Gemini 3.1 Pro, GPT-5.4 Mini/Nano, Grok 4.20 | < $200K (below starting capital) | 2/3 each |
Only 3 out of 12 models grew their starting capital. The gap between the top 3 and the rest is massive: 2 to 3x higher final funds.
And the divergence happens **fast**. By February–March (roughly 60 days in), the trajectories have already split. Top models focus on 1-2 clients early, triggering a **trust snowball**: each success reduces future workloads by up to 50%, enabling more completions, which builds more trust. Models that spread work across many clients never reach meaningful work reduction and enter a payroll-driven death spiral.
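
A back-of-the-envelope sketch of why this compounds: only the 50% workload-reduction cap comes from the benchmark; the per-success discount rate is an assumption we picked for illustration.

```python
# Back-of-the-envelope trust snowball. Only the 50% cap is from the
# benchmark; the 15%-per-success discount is an illustrative assumption.
base_workload = 100.0          # person-days for a typical task
discount_per_success = 0.15
cap = 0.50                     # workload never drops below 50% of base

workload = base_workload
for success in range(1, 7):
    workload = max(base_workload * (1 - cap),
                   workload * (1 - discount_per_success))
    print(f"after success {success}: {workload:.0f} person-days")

# after success 1: 85 ... after success 5: 50 (cap reached)
# Half the workload per task means roughly double the completions per month,
# which is what lets the winners outrun the growing payroll.
```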
## The Four Flavors of Failure
The error analysis reveals a fascinating spectrum of how models fail at long-term coherence. It's not one thing: it's a whole pipeline that can break at different stages.
### 1. Claude Opus 4.6: The Disciplined Strategist ($1.27M avg)
Opus rewrites its scratchpad 34 times per run, evolving through four distinct strategic phases: calibrating environment mechanics → establishing workflow rules → building a client blacklist → optimizing trust-gated task selection. It inspects every task before accepting (155 inspections per run). Not flawless, though: in one seed, it still accepted a task from a blacklisted client.
### 2. Gemini 3 Flash: The Autopilot ($394K avg)
Flash writes ~2 scratchpad entries per run and executes the exact same 4-command cycle every single turn: accept → assign all 8 employees → dispatch → resume. No adaptation. It accepts 12 adversarial tasks and every one fails. It survives purely on throughput: enough legitimate work gets done to absorb the losses. But it never **learns**.
### 3. Claude Sonnet 4.6: The Broken Genius ($103K avg, 2/3 bankrupt)
This is the most fascinating failure mode. Sonnet derives correct strategies but **fails to execute them**. At Turn 7, it writes a perfect feasibility formula to the scratchpad. At Turn 8, it ignores it and accepts four tasks without inspection. It writes a "one task at a time" rule and then averages 7.23 concurrent active tasks (max 16!). It stops updating its scratchpad early, leaving stale financial data showing $152K while the actual balance drops to -$3K.
### 4. Grok 4.20: The Paralyzed Analyst ($14K avg, 2/3 bankrupt)
Grok shows **aware inaction**: its scratchpad accurately identifies problems ("Runway down to 1 month," "Avoid Equinox"), but these observations never translate into changed behavior. It accepts a task from a client with a 0% historical success rate while having 6 days of runway left. It leaves one task accepted in March uncompleted for 81 days until bankruptcy. It issues only 0.92 commands per turn, the lowest of any model, spending most turns thinking rather than doing.
These four profiles map to a **coherence pipeline**: perceive → record → retrieve → act consistently. Current models fail at different stages:

- Flash fails at perception (never reflects)
- Grok fails at action (reflects accurately, does nothing)
- Sonnet fails at consistency (correct rules, immediately abandoned)
- Opus mostly succeeds but isn't immune to occasional lapses
## The Scratchpad is Everything
The single strongest predictor of success in YC-Bench? Scratchpad usage.
| Model | Scratchpad Writes / 100 Turns | Task Inspections per Acceptance | Outcome |
|-------|-------------------------------|----------------------------------|---------|
| GPT-5.4 | 10.6 | 1.43 | Top 3 |
| Claude Opus 4.6 | 5.6 | 1.10 | Top 3 |
| Claude Sonnet 4.6 | 4.6 | 0.64 | Bankrupt 2/3 |
| Gemini 3 Flash | 0.2 | 0.11 | Survived |
| Gemini 3.1 Pro | 0.0 | 0.00 | Bankrupt 2/3 |
The top 3 models use the scratchpad 5–50x more than the bottom 3. They also inspect tasks before accepting them at dramatically higher rates. This isn't a coincidence: it's the difference between flying blind and building institutional memory.
## The Cost Efficiency Plot Twist
Here's the fun part: the best-performing model isn't the most cost-efficient.
Kimi-K2.5 achieves 2.5x better revenue per API dollar than the next most cost-effective model (Gemini 3 Flash), despite ranking only 4th in absolute performance. Claude Opus 4.6, the top performer, costs ~$86 per run vs Kimi's ~$1.79.
For production deployments, raw leaderboard rankings don't tell the whole story. Sometimes the "good enough" model at 1/50th the cost is the right call.
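
Concretely, the metric is just net funds gained per dollar of inference spend. In this sketch, Opus's final funds come from its reported average; Kimi's is an assumed midpoint of its tier, not a reported number:

```python
# Revenue per API dollar: net funds gained per dollar of inference spend.
# Per-run costs (~$86, ~$1.79) are from our measurements; Kimi's final-funds
# figure below is an assumed tier midpoint, not the paper's exact number.
STARTING_CAPITAL = 200_000

def revenue_per_api_dollar(final_funds: float, api_cost: float) -> float:
    return (final_funds - STARTING_CAPITAL) / api_cost

print(revenue_per_api_dollar(1_270_000, 86.00))  # Claude Opus 4.6: ~12,400
print(revenue_per_api_dollar(300_000, 1.79))     # Kimi-K2.5 (assumed): ~56,000
```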
## Why This Matters for the Open Source Community
A few takeaways that matter for anyone building or fine-tuning models:
### 1. Long-horizon coherence is a distinct capability
Models that score similarly on standard benchmarks show wildly different performance on YC-Bench. This isn't just "reasoning" or "tool use" β it's the ability to maintain and act on strategy over hundreds of turns.
### 2. The reasoning-execution gap is real
Multiple models can derive correct strategies but fail to execute them. Sonnet's case is particularly striking: perfect analysis, zero follow-through. This suggests that deliberation and execution are not yet unified capabilities.
### 3. Adversarial robustness needs memory
Failures to detect adversarial clients accounted for **47% of bankruptcies**. This failure mode simply cannot be surfaced by benchmarks with immediate feedback. Models need to learn from sparse, delayed signals, and that requires functional long-term memory.
### 4. Open-source models punch above their weight on efficiency
GLM-5 and Kimi-K2.5 sit much closer to the cost-performance Pareto frontier than their proprietary counterparts, delivering competitive performance at a fraction of the cost.
## What's Next
YC-Bench is being released as an **open-source, configurable benchmark**. The current version keeps some simplifying assumptions: fixed employee rosters, no random exogenous events, numerical rather than natural-language signals. Future versions will relax these constraints for an even more realistic decision surface.
We think this benchmark points at something important: the frontier of LLM capability isn't just about getting smarter at individual tasks. It's about staying coherent over time, remembering what you learned, sticking to strategies that work, and adapting when they don't.
Right now, most models can't do that. But the ones that can? They 5x their starting capital.
Paper: *YC-Bench: A Long-Term Coherence Benchmark for LLM Agents* (arXiv Link)