We built a benchmark that makes LLMs run a simulated startup for a full year: hiring decisions, shady clients, tight deadlines, and all. Only 3 out of 12 frontier models turned a profit. Most went bankrupt. Here's what we learned.
## The Problem: LLM Benchmarks Don't Test What Actually Matters
LLM agent benchmarks have gotten pretty good at testing whether models can solve a coding problem, answer a trivia question, or navigate a single tool call. But here's the thing: real-world tasks aren't neat little puzzles with clean start and end points.
Running a business, managing a project, or even just keeping your life organized requires something harder: **long-term coherence**. Can you remember what happened three months ago? Can you stick to a strategy when things get tough? Can you spot a pattern across dozens of interactions and actually change your behavior?
Existing benchmarks like Vending-Bench started exploring this territory by having LLMs manage a simulated vending machine business. Cool idea, but the feedback loops are nearly immediate. Set a bad price? Sales drop the next day.
Real business decisions don't work like that. Sometimes you take a calculated risk, eat a short-term loss, and hope the long-term payoff is worth it. Sometimes a client looks great on paper but keeps scope-creeping you into oblivion.
That's exactly what YC-Bench tests.
## What is YC-Bench?
YC-Bench drops an LLM agent into the role of CEO of a simulated startup and says: "Here's $200K. You have one year. Don't go bankrupt."
The agent interacts through a CLI tool interface, making decisions every turn (a hypothetical turn is sketched after this list):

- Browse the marketplace for contracts to accept
- Assign employees based on their (partially hidden) skill profiles
- Manage cash flow against monthly payroll that grows with every success
- Build client relationships for better future deals
- Dodge adversarial clients who lure you in with great-looking contracts, then 3x the scope after you sign
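
To make that loop concrete, here's a minimal sketch of what a single turn could look like. The environment wrapper and the command names (`market`, `inspect`, `accept`, `assign`, `dispatch`) are illustrative assumptions, not the benchmark's actual CLI:

```python
# A hypothetical YC-Bench turn. The env wrapper and command names are
# assumptions for illustration, not the benchmark's real interface.
def play_turn(env, agent):
    listings = env.run("market")                     # browse open contracts
    for task in listings:
        details = env.run(f"inspect {task.id}")      # read scope before committing
        if agent.looks_feasible(details):
            env.run(f"accept {task.id}")
            env.run(f"assign {task.id} --employee alice")
    env.run("dispatch")                              # advance the simulation one turn
```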
**The kicker:** ~35% of clients in the marketplace are adversarial. They offer competitively high rewards, so you can't just filter by price. You have to figure out who's scamming you by analyzing your own track record of successes and failures.
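
Since price is useless as a filter, the only reliable signal is your own history. Here's a minimal sketch of the bookkeeping an agent might do; the record fields and thresholds are our assumptions, not part of YC-Bench:

```python
from collections import defaultdict

# Illustrative adversary detection from the agent's own track record.
# (scope inflation ratio, success flag) is an assumed logging format.
history: dict[str, list[tuple[float, bool]]] = defaultdict(list)

def record_outcome(client: str, promised_scope: float,
                   final_scope: float, succeeded: bool) -> None:
    history[client].append((final_scope / promised_scope, succeeded))

def is_suspect(client: str, inflation_threshold: float = 1.5,
               min_samples: int = 2) -> bool:
    outcomes = history[client]
    if len(outcomes) < min_samples:
        return False  # too little evidence; price alone can't tell you
    avg_inflation = sum(r for r, _ in outcomes) / len(outcomes)
    fail_rate = sum(not ok for _, ok in outcomes) / len(outcomes)
    return avg_inflation >= inflation_threshold or fail_rate > 0.5
```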
## The Memory Challenge
Here's where it gets really interesting. The agent's conversation history is truncated to the **last 20 turns**. That means anything the model learned 30 turns ago, like "Client X inflated scope on every single task," is gone unless the agent proactively writes it down in a **persistent scratchpad**.
This is a direct test of whether models can:

- Recognize what information is worth saving
- Actually save it
- Refer back to it later
- Act on it consistently
Spoiler: most can't.
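
For intuition, the pattern the benchmark rewards looks something like a file-backed key-value store. This sketch is ours; YC-Bench's actual scratchpad tool may expose a different interface:

```python
import json
from pathlib import Path

SCRATCHPAD = Path("scratchpad.json")  # persists across context truncation

def save_note(key: str, value: str) -> None:
    notes = json.loads(SCRATCHPAD.read_text()) if SCRATCHPAD.exists() else {}
    notes[key] = value
    SCRATCHPAD.write_text(json.dumps(notes, indent=2))

def recall(key: str, default: str | None = None) -> str | None:
    if not SCRATCHPAD.exists():
        return default
    return json.loads(SCRATCHPAD.read_text()).get(key, default)

# The winning pattern, 30 turns later:
# save_note("client:X", "inflated scope on every task -- blacklist")
# recall("client:X")  # still there after the context window rolled over
```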
## The Results: A Bloodbath
We tested 12 frontier models across 7 providers. The results are... humbling.
| Tier | Models | Avg Final Funds | Bankruptcies |
|------|--------|-----------------|--------------|
| The Winners | Claude Opus 4.6, GLM-5, GPT-5.4 | > $1,000,000 | 0/3 each |
| Survived | Kimi-K2.5, Gemini 3 Flash | $200K–$400K | 0–1/3 |
| Went Broke | Sonnet 4.6, Qwen 3.5, Gemini 3.1 Pro, GPT-5.4 Mini/Nano, Grok 4.20 | < $200K (below starting capital) | 2/3 each |
Only 3 out of 12 models grew their starting capital. The gap between the top 3 and the rest is massive: 2 to 3x higher final funds.
And the divergence happens **fast**. By February–March (roughly 60 days in), the trajectories have already split. Top models focus on 1-2 clients early, triggering a **trust snowball**: each success reduces future workloads by up to 50%, enabling more completions, which builds more trust. Models that spread work across many clients never reach meaningful work reduction and enter a payroll-driven death spiral.
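
A back-of-the-envelope sketch of why this compounds: only the 50% workload-reduction cap comes from the benchmark; the per-success discount rate is an assumption we picked for illustration.

```python
# Back-of-the-envelope trust snowball. Only the 50% cap is from the
# benchmark; the 15%-per-success discount is an illustrative assumption.
base_workload = 100.0          # person-days for a typical task
discount_per_success = 0.15
cap = 0.50                     # workload never drops below 50% of base

workload = base_workload
for success in range(1, 7):
    workload = max(base_workload * (1 - cap),
                   workload * (1 - discount_per_success))
    print(f"after success {success}: {workload:.0f} person-days")

# after success 1: 85 ... after success 5: 50 (cap reached)
# Half the workload per task means roughly double the completions per month,
# which is what lets the winners outrun the growing payroll.
```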
## The Four Flavors of Failure
The error analysis reveals a fascinating spectrum of how models fail at long-term coherence. It's not one thing: it's a whole pipeline that can break at different stages.
### 1. Claude Opus 4.6: The Disciplined Strategist ($1.27M avg)
Opus rewrites its scratchpad 34 times per run, evolving through four distinct strategic phases: calibrating environment mechanics → establishing workflow rules → building a client blacklist → optimizing trust-gated task selection. It inspects every task before accepting (155 inspections per run). Not flawless, though: in one seed, it still accepted a task from a blacklisted client.
### 2. Gemini 3 Flash: The Autopilot ($394K avg)
Flash writes ~2 scratchpad entries per run and executes the exact same 4-command cycle every single turn: accept → assign all 8 employees → dispatch → resume. No adaptation. It accepts 12 adversarial tasks and every one fails. It survives purely on throughput: enough legitimate work gets done to absorb the losses. But it never **learns**.
### 3. Claude Sonnet 4.6: The Broken Genius ($103K avg, 2/3 bankrupt)
This is the most fascinating failure mode. Sonnet derives correct strategies but **fails to execute them**. At Turn 7, it writes a perfect feasibility formula to the scratchpad. At Turn 8, it ignores it and accepts four tasks without inspection. It writes a "one task at a time" rule and then averages 7.23 concurrent active tasks (max 16!). It stops updating its scratchpad early, leaving stale financial data showing $152K while the actual balance drops to -$3K.
### 4. Grok 4.20: The Paralyzed Analyst ($14K avg, 2/3 bankrupt)
Grok shows **aware inaction**: its scratchpad accurately identifies problems ("Runway down to 1 month," "Avoid Equinox"), but these observations never translate into changed behavior. It accepts a task from a client with a 0% historical success rate while having 6 days of runway left. It leaves one task accepted in March uncompleted for 81 days until bankruptcy. It issues only 0.92 commands per turn, the lowest of any model, spending most turns thinking rather than doing.
These four profiles map to a **coherence pipeline**: perceive → record → retrieve → act consistently. Current models fail at different stages:

- Flash fails at perception (never reflects)
- Grok fails at action (reflects accurately, does nothing)
- Sonnet fails at consistency (correct rules, immediately abandoned)
- Opus mostly succeeds but isn't immune to occasional lapses
## The Scratchpad is Everything
The single strongest predictor of success in YC-Bench? Scratchpad usage.
| Model | Scratchpad Writes / 100 Turns | Task Inspections per Acceptance | Outcome |
|-------|-------------------------------|----------------------------------|---------|
| GPT-5.4 | 10.6 | 1.43 | Top 3 |
| Claude Opus 4.6 | 5.6 | 1.10 | Top 3 |
| Claude Sonnet 4.6 | 4.6 | 0.64 | Bankrupt 2/3 |
| Gemini 3 Flash | 0.2 | 0.11 | Survived |
| Gemini 3.1 Pro | 0.0 | 0.00 | Bankrupt 2/3 |
The top 3 models use the scratchpad 5–50x more than the bottom 3. They also inspect tasks before accepting them at dramatically higher rates. This isn't a coincidence: it's the difference between flying blind and building institutional memory.
## The Cost Efficiency Plot Twist
Here's the fun part: the best-performing model isn't the most cost-efficient.
Kimi-K2.5 achieves 2.5x better revenue per API dollar than the next most cost-effective model (Gemini 3 Flash), despite ranking only 4th in absolute performance. Claude Opus 4.6, the top performer, costs ~$86 per run vs Kimi's ~$1.79.
For production deployments, raw leaderboard rankings don't tell the whole story. Sometimes the "good enough" model at 1/50th the cost is the right call.
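
Concretely, the metric is just net funds gained per dollar of inference spend. In this sketch, Opus's final funds come from its reported average; Kimi's is an assumed midpoint of its tier, not a reported number:

```python
# Revenue per API dollar: net funds gained per dollar of inference spend.
# Per-run costs (~$86, ~$1.79) are from our measurements; Kimi's final-funds
# figure below is an assumed tier midpoint, not the paper's exact number.
STARTING_CAPITAL = 200_000

def revenue_per_api_dollar(final_funds: float, api_cost: float) -> float:
    return (final_funds - STARTING_CAPITAL) / api_cost

print(revenue_per_api_dollar(1_270_000, 86.00))  # Claude Opus 4.6: ~12,400
print(revenue_per_api_dollar(300_000, 1.79))     # Kimi-K2.5 (assumed): ~56,000
```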
## Why This Matters for the Open Source Community
A few takeaways that matter for anyone building or fine-tuning models:
### 1. Long-horizon coherence is a distinct capability
Models that score similarly on standard benchmarks show wildly different performance on YC-Bench. This isn't just "reasoning" or "tool use" β it's the ability to maintain and act on strategy over hundreds of turns.
### 2. The reasoning-execution gap is real
Multiple models can derive correct strategies but fail to execute them. Sonnet's case is particularly striking: perfect analysis, zero follow-through. This suggests that deliberation and execution are not yet unified capabilities.
### 3. Adversarial robustness needs memory
Failures to detect adversarial clients accounted for **47% of bankruptcies**. This failure mode simply cannot be surfaced by benchmarks with immediate feedback. Models need to learn from sparse, delayed signals, and that requires functional long-term memory.
### 4. Open-source models punch above their weight on efficiency
GLM-5 and Kimi-K2.5 sit much closer to the cost-performance Pareto frontier than their proprietary counterparts, delivering competitive performance at a fraction of the cost.
## What's Next
YC-Bench is being released as an **open-source, configurable benchmark**. The current version keeps some simplifying assumptions: fixed employee rosters, no random exogenous events, numerical rather than natural-language signals. Future versions will relax these constraints for an even more realistic decision surface.
We think this benchmark points at something important: the frontier of LLM capability isn't just about getting smarter at individual tasks. It's about staying coherent over time, remembering what you learned, sticking to strategies that work, and adapting when they don't.
Right now, most models can't do that. But the ones that can? They 5x their starting capital.
Paper: *YC-Bench: A Long-Term Coherence Benchmark for LLM Agents* (arXiv Link)