codeskills-bench: the code changes developers don't trust agents with

Community Article · Published April 19, 2026

23 small Python tasks modeled on the change-classes engineers refuse to hand to AI today: dependency migrations, merge conflicts, compliance-preserving refactors, and the subtle bugs that CI pretends to catch but doesn't.

Dataset: nvats/codeskills-bench@v0.1 · Harbor-native · 23/23 oracle-solvable · ~4 min end-to-end.

GitHub: codeskills-bench

Why this benchmark exists

Senior engineers will happily hand "write a function from a spec" to an agent; that problem is solved. The interesting question is what they won't delegate:

"We're migrating Pydantic v1 → v2. The @validator deprecation is fine. The fact that BaseSettings moved packages scares me."

"Our feature flag is 100% rolled out. Deleting it should be safe. But one of the else branches has an audit-log call SOX depends on. I can't trust an agent not to sweep it away."

"We've got a three-way conflict in the parser module — one branch added streaming, the other refactored the buffer API. An agent that just picks a side will break half the tests. You have to understand what both branches were trying to do."

These are the daily bread of any engineering org. They punish both under-editing (missing the real bug) and over-editing (cleaning up the wrong thing and breaking production), the exact spot where human review is non-negotiable.

Existing public benchmarks cover two ends of the spectrum well. HumanEval / MBPP / APPS test isolated function generation and have saturated. SWE-bench tests multi-file fixes from real GitHub issues — excellent at measuring repair of explicit regressions. codeskills-bench targets a different layer: small realistic Python repos (8–10 files, 300–500 LOC), engineered so that a naive "follow the instruction" attempt misses a trap, a naive "cat everything and fix what looks wrong" attempt edits the plausible wrong file, and the correct fix is small, judgment-heavy, and preserves invariants the failing test does not encode. It is not a replacement for SWE-bench — it targets a narrower class of judgment-heavy, small-scope edits.

Design rules

Tasks were hand-authored from real pain patterns, then iterated until a naive agent could produce a plausible wrong fix that passes some tests but fails the trap. Instructions were validated to contain no file name hints. Every task obeys six rules that emerged from that iteration process:

| # | Rule | Why |
|---|------|-----|
| 1 | The "obvious" edit site is wrong | Rewards tracing. If the first file read contains the fix, the benchmark measures speed, not skill. |
| 2 | A plausible wrong fix exists elsewhere | Rewards reading intent over pattern-matching — a nearby file looks like it wants the edit but shouldn't get it. |
| 3 | Every task ships a trap test | Catches over-edits. Rewriting a helper instead of making the minimal fix flips a separate invariant test red. |
| 4 | ≥8 files, ≥300 LOC of realistic code | `cat -n *.py` is not a solve strategy. Repos big enough that search-first matters. |
| 5 | Instructions never leak file names | Surface-level prompts only. No "edit app/config.py". No `# BUG:` markers. |
| 6 | Oracle patches are 1–3 lines | If the reference fix needs a rewrite, the bug is too big for the benchmark's budget. |
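
Rule 3 is easiest to see against the feature-flag scenario quoted earlier. The sketch below is illustrative only, not the actual task code; every name in it (checkout, CHECKOUT_V2, audit_log) is invented:

```python
# Illustrative sketch of a trap test; all names are hypothetical, not from the dataset.
audit_log: list[str] = []
CHECKOUT_V2 = True  # flag is 100% rolled out; the task is to delete it


def checkout(order_id: str) -> str:
    if CHECKOUT_V2:
        audit_log.append(f"checkout.completed:{order_id}")  # compliance side effect
        return f"charged:{order_id}"
    return f"charged_legacy:{order_id}"


def test_checkout_still_charges():
    # Visible test: passes whether the flag is removed carefully or carelessly.
    assert checkout("42") == "charged:42"


def test_audit_entry_still_written():
    # Trap test: an edit that collapses the branch and drops the side effect turns this red.
    audit_log.clear()
    checkout("42")
    assert audit_log == ["checkout.completed:42"]
```

An agent that "simplifies" checkout down to a bare return passes the first test and fails the second, which is the separation the benchmark is built to measure.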

The trap test is the highest-leverage piece. bench-mutable-default-argument is the canonical example: the naive fix (swap items=[] for items=None + sentinel) passes the visible "two groups are independent" test, but a second test asserts that nested structures (a permissions list) are also independent — which a shallow .copy() misses. A third test checks that an unrelated HookRegistry class (whose class-level shared state is intentional) still shares state — catching agents that try to "fix all shared state" globally. Real-fix passes; minimal-wrong-fix fails; over-eager-wrong-fix fails differently. That three-way signal is what separates a benchmark that measures behavior from one that measures lookup.
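
A minimal, self-contained sketch of that signal, with invented names (GroupNaive, GroupFixed, DEFAULT_SETTINGS) rather than the task's real code:

```python
import copy

DEFAULT_SETTINGS = {"members": [], "permissions": ["read"]}


class GroupNaive:
    """Naive fix: sentinel plus shallow copy."""
    def __init__(self, settings=None):
        self.settings = settings if settings is not None else dict(DEFAULT_SETTINGS)


class GroupFixed:
    """Minimal correct fix in this sketch: deep copy of the template."""
    def __init__(self, settings=None):
        self.settings = settings if settings is not None else copy.deepcopy(DEFAULT_SETTINGS)


a, b = GroupNaive(), GroupNaive()
a.settings["permissions"].append("admin")
print(b.settings["permissions"])        # ['read', 'admin']: nested state leaked, trap test red
print(DEFAULT_SETTINGS["permissions"])  # ['read', 'admin']: even the shared template got polluted

DEFAULT_SETTINGS["permissions"] = ["read"]  # undo the pollution before the second demo

c, d = GroupFixed(), GroupFixed()
c.settings["permissions"].append("admin")
print(d.settings["permissions"])        # ['read']: independent, trap test green
```

(The sketch omits the third test; in the real task, the intentionally shared HookRegistry state is what catches global "fix all shared state" edits.)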

The 23 tasks

| Category | Tasks | Pain class |
|----------|-------|------------|
| Core Bugfixes | config-default-precedence | Classic layered-config bug where the obvious fix site is wrong |
| Dependency Hell | pydantic-v1-to-v2-migration · flask-2-to-3-middleware-break · sqlalchemy-legacy-to-2.0 · pytest-asyncio-strict-mode | Breaking upgrades where a documented side-effect must survive |
| Merge Conflicts | merge-conflict-parser-features · rebase-dropped-security-fix · concurrent-dict-key-conflict | Resolving intent across two branches, not picking a side |
| Test Reliability | flaky-test-order-dependent · test-mocks-wrong-module-path · missing-teardown-fixture-leak · hidden-global-now-drift | Tests that pretend to catch bugs and don't |
| Hidden Runtime Bugs | circular-import-cold-load · mutable-default-argument · missing-lock-on-counter · n-plus-one-db-lookup | Prod breaks; local passes |
| Invisible Breakage | cache-stale-after-write · timezone-dst-boundary · pagination-last-page-off-by-one · regex-greedy-cross-line | Silent failures — monitoring green, users complaining |
| Refactor Landmines | feature-flag-removal-with-sideeffect · api-contract-backwards-compat · datetime-utcnow-deprecation | Looks safe until a side-effect disappears |

Oracle reference solutions pass 23/23 in under 4 minutes total wall time.

Walkthrough: merge-conflict-parser-features

The clearest example of what "benchmark-useful hard" means.

Three files ship with raw unresolved merge markers:

```
# app/buffer.py (excerpt)
<<<<<<< HEAD
    def fill(self, source: BinaryIO) -> bytes:
        """Read the whole source into the buffer."""
        self._data = source.read()
        return self._data
=======
    def fill_from_source(self, source: BinaryIO) -> None:
        """Attach source. Does not read bytes yet."""
        self._source = source

    def fill_chunk(self, chunk_size: int = DEFAULT_CHUNK_SIZE) -> bytes:
        ...
>>>>>>> refactor/unified-buffer
```

One branch added a streaming parse path. The other split Buffer.fill() into a two-step API for chunked reads. Naive moves:

  • Pick HEAD everywhere → streaming module fails, half the tests red.
  • Pick refactor everywhere → streaming still references Buffer.fill(), other half red.
  • Delete one block without adapting callers → syntax error or silent drift.

The correct move: adopt the refactor's two-step API, delete the old method, adapt the streaming path to loop over fill_from_source + fill_chunk. Both features end up present, both feature-specific test suites pass, and a trap test checks the old fill() method is actually gone from the class (agents can't "keep both to be safe").
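
Concretely, here is a self-contained sketch of what the adapted streaming path might look like. Buffer is a stand-in, and DEFAULT_CHUNK_SIZE, parse_stream, and the loop shape are assumptions rather than the task's real code:

```python
import io
from typing import BinaryIO

DEFAULT_CHUNK_SIZE = 4096


class Buffer:
    """Stand-in for the refactor branch's two-step API (illustrative only)."""

    def __init__(self) -> None:
        self._source: BinaryIO | None = None

    def fill_from_source(self, source: BinaryIO) -> None:
        # Attach the source; bytes are pulled lazily by fill_chunk().
        self._source = source

    def fill_chunk(self, chunk_size: int = DEFAULT_CHUNK_SIZE) -> bytes:
        assert self._source is not None, "fill_from_source() must be called first"
        return self._source.read(chunk_size)


def parse_stream(source: BinaryIO, chunk_size: int = DEFAULT_CHUNK_SIZE) -> list[bytes]:
    # The streaming branch originally called Buffer.fill() once; after the merge
    # it loops over the two-step API so both features coexist.
    buf = Buffer()
    buf.fill_from_source(source)
    chunks = []
    while chunk := buf.fill_chunk(chunk_size):
        chunks.append(chunk)
    return chunks


print([len(c) for c in parse_stream(io.BytesIO(b"a" * 10000))])  # [4096, 4096, 1808]
```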

Simple pattern-matching on the failing assertion won't get you there. You have to read both branches' intent and synthesize.

Try it

All 23 tasks live on the Harbor registry. Once you have Harbor installed:

```bash
# Verify the benchmark is solvable in your environment (oracle)
uv run harbor run --agent oracle -d nvats/codeskills-bench@v0.1 --yes

# Run with your own agent/config
uv run harbor run -c config.yaml -d nvats/codeskills-bench@v0.1 --yes
```

Point your agent config at nvats/codeskills-bench@v0.1 and you're set. A full config example and task-level details are on the registry page.

Learnings

Trap tests are the single highest-leverage addition. First-cut tasks only verified "does the failing test pass now", and models solved them by over-editing, wiping out invariants we had implicitly assumed but never encoded. Adding one invariant-preservation test per task materially improved the separation between correct minimal fixes and over-edits.

Repo size matters. Below ~5 files, search-first behavior collapses: the model reads everything in one call and the benchmark becomes a pattern-matching exercise. Eight files with meaningful inter-dependencies is the threshold where tracing actually pays off.

Content-addressed publishing is immutable. Our first iteration leaked an internal email in the dataset metadata. Re-publishing under the same dataset name doesn't erase old digests; they live at their hash forever. The clean path was to re-publish under a fresh dataset name, leave the old namespace private, and link to the new one. Worth designing into any benchmark-publish workflow.

Ship the benchmark, not the story. We built this to measure whether prompt "skill packs" help smaller models on judgment-heavy changes. Early runs on gemini-flash-3-preview (max_turns=100) showed pass rates swinging ~30 points between identical configs, which is exactly why we're holding the ablation for n≥3 per condition. We'd rather others run their own models and contribute findings than have our n=1 results frame the conversation.

What's next

  • v0.2 — a dozen more tasks covering patterns we didn't hit: broken pickle migrations, protobuf schema drift, from __future__ import annotations shadow-typing, CI/CD failures.
  • Follow-up post — the skill-pack ablation at n≥3 per condition.

If you run this against a model and see something worth sharing, open an issue on the dataset page.

Contribute a task

The task sources live on GitHub. If you have a real-world pain class that isn't covered — a broken migration, a race condition that only shows in prod, a refactor with a hidden side-effect — we'd love a PR. The bar is in CONTRIBUTING.md: realistic repo, a trap test, and an oracle patch under 20 lines.


Dataset: nvats/codeskills-bench · GitHub: namanvats/codeskills-bench · License: MIT · Harbor: harborframework.com
