Second Loop · Project hub

🔁

Second Loop

Honesty & self-correction in language models

Submitted by

Serghei Brinza

Static · project hub

★ Three experiments · one arc ★

Subject model

Qwen2.5-3B-Instruct (frozen)

Independent judge

Qwen2.5-7B-Instruct

License

MIT

The arc

Three demo Spaces, one through-line. (1) Can a confidently memorized error in a frozen LLM be durably corrected? (2) Can that correction survive a noisy notebook whose external entries are partly unreliable? (3) How rarely can external truth arrive before calibration collapses back to the raw model? Every linked Space is fully static: no model is loaded in the browser or on the server, and every number shown is copied verbatim from the output of the live experimental run.

Part 1

Scar-Survival

A memorized LLM error, corrected — how durable is the fix? Turn the mechanism on, reload the frozen model, then stress it with a counterfeit fact.

0/12 → 12/12 · holds 10/10 reloads · 6/12 survive

Open Space ↗ GitHub ↗
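
For readers who prefer rates to raw counts, the Part 1 headline figures can be converted with a few lines of Python. This is only a reading aid, not code from the project: the counts are copied from the card above, while the dictionary labels and the `as_percent` helper are hypothetical names, and the labels reflect one plausible reading of what each count measures.

```python
from fractions import Fraction

# Counts copied from the Part 1 card. The labels are an assumed
# interpretation of each figure, not wording from the experiment itself.
part1_counts = {
    "correct before the mechanism": (0, 12),
    "correct after the mechanism": (12, 12),
    "reloads where the fix held": (10, 10),
    "survived the counterfeit fact": (6, 12),
}


def as_percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage, using exact rational arithmetic."""
    return float(Fraction(hits, total) * 100)


part1_rates = {
    label: as_percent(hits, total)
    for label, (hits, total) in part1_counts.items()
}

for label, rate in part1_rates.items():
    print(f"{label}: {rate:.1f}%")
```

Under that reading, the correction is complete and stable across reloads, while half of the corrected items still hold after the counterfeit fact is introduced.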

Part 2

External Grounding

Lifting self-correction from 50% to 100% under a noisy notebook. Drag the guardian through six versions and watch which traps get fixed — and which regress.

50% → 100% · 66.7% plateau · +fixed / −broken

Open Space ↗ GitHub ↗

Part 3

Thin Channel

How rarely can external truth arrive before calibration collapses? Move the lever from “every day” to “never” and watch the curve hold, right up to the cliff.

finite schedule holds · zero contact collapses to raw 3B

Open Space ↗ GitHub ↗