A newer version of the Gradio SDK is available: 6.19.0
title: TurboSkillSlug
emoji: π
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 6.16.0
python_version: '3.12'
app_file: app.py
pinned: false
short_description: Turn a coding session into a skill, recap, and shell.
tags:
- hackathon
- build-small-hackathon
- track:wood
- sponsor:openai
- sponsor:modal
- achievement:welltuned
- achievement:offbrand
- achievement:fieldnotes
models:
- legendarydragontamer/slugvoice-qwen2.5-1.5b-lora
- legendarydragontamer/slugextract-qwen2.5-1.5b-lora
- Qwen/Qwen2.5-1.5B-Instruct
- openai/whisper-large-v3-turbo
datasets:
- legendarydragontamer/turboskillslug-groundedness-eval
TurboSkillSlug
Feed it a coding session. Get back a reusable SKILL.md, a grounded spoken recap, and a procedural shell that encodes the whole session as art.
The session goes in one of two ways: narrate it aloud, or drop a real Claude Code
or Codex CLI session log (.jsonl). A fine-tuned 1.5B model extracts what you
tried, what failed, and what finally worked. Total active pipeline: ~2.6B
parameters, measured to match a 7B on groundedness at a third the size.
Then the slug gives you four things:
- A SKILL.md another LLM can actually use: the non-obvious gotchas (symptom β cause β fix), the approaches that fail and why, the breakthrough. Built to give a frontier model real uplift, not a session summary.
- A spoken recap in a fine-tuned "slug voice," every line grounded in something that happened, never reciting invented numbers.
- A shell whose spiral, knots, jewels, and colors all derive from your session, born on screen as a scroll that unrolls along its own arm, with a byΕbu-style battle inked across it (dead ends are fallen warriors, the breakthrough is a dragon).
- A receipt like a thermal printout: approaches tried, dead ends, mood.
The slug witnesses every kind of session, not just debugging
Most coding sessions are not bug hunts. They are exploring an unfamiliar repo, writing docs, setting up tooling, building a feature. A witness that only has eyes for "what broke and what fixed it" leaves those sessions with a hollow shell.
So the slug detects the session's genre (debugging, exploration, authoring, feature, refactor, setup) and witnesses the right thing for each:
- debugging β the struggle: dead ends and the breakthrough
- exploration β the discoveries: the non-obvious facts learned about the codebase
- authoring β the decisions, and the false assumptions caught before they became wrong docs
- feature / refactor / setup β what was built or changed, and what would break if done naively
The shell's vocabulary adapts with it: for an exploration session the rim jewels are discoveries, the aperture is the clearest insight. Genre detection is pure pattern-matching: no model call, no added latency.
Why this matters concretely: on a real exploration trace, the slug surfaced that
a project's checkpoint mirror uses a custom git ref namespace
(refs/entire/...) that a standard git fetch --all will miss. That is exactly
the kind of private, non-derivable knowledge a SKILL.md exists to carry, and it
came from a session that had no "bug" at all.
Every shell is unique because every session is unique.
Try it in one click
Two tabs, two sample inputs:
- narrate aloud β a sample build session (audio)
- drop a session trace β a sample Claude Code
.jsonltrace
Or bring your own: upload a recording, or drag a real session log from
~/.claude/projects/.../*.jsonl or ~/.codex/sessions/.../*.jsonl. Judges can
feed their own agent logs and watch the slug read them.
Demo
Watch the demo: youtu.be/qSP9olWRv7o
Social
The launch post: x.com/anubhav27071997
Why this is hard the right way
The slug's entire promise is a witness that only says what it saw. That makes groundedness the core engineering problem: a small model that invents facts is worthless here. So we measured it, honestly, and published the data.
Groundedness: does the small model hallucinate more than the 7B it replaced?
On 25 held-out transcripts, comparing the shipped fine-tuned 1.5B against the Qwen-7B it replaced and its own un-tuned 1.5B base:
| system | semantic groundedness | lexical | parse | facts |
|---|---|---|---|---|
| prompted 7B | 0.716 | 0.576 | 24/25 | 272 |
| prompted 1.5B base | 0.565 | 0.390 | 21/25 | 140 |
| fine-tuned 1.5B LoRA | 0.762 | 0.378 | 21/25 | 195 |
The fine-tuned 1.5B matches and slightly exceeds the 7B (0.76 vs 0.72) at a third of the active size. It does this by paraphrasing rather than copying: lowest lexical overlap, highest semantic groundedness, the signature of a model that restates meaning instead of echoing words.
Reported with its costs, not spun: the LoRA produces valid JSON less often (21/25 vs 24/25), and the semantic metric passed 5/6 hand-labeled calibration cases (the one miss understates the LoRA, not the reverse). 25 transcripts is a small sample, so the honest claim is "matches or slightly exceeds," not "beats." Raw generations and per-fact scores are published so anyone can re-score: turboskillslug-groundedness-eval.
The SKILL.md is the real gift
A skill file is only worth shipping if it helps an LLM that is already capable without it. A summary does not. So the SKILL.md is built to carry the non-obvious, transferable knowledge a frontier model cannot derive on its own:
- Gotchas as symptom β cause β fix, not labels. "Processing leaf nodes first looks natural but breaks because a parent depends on its children being finalized; process deepest-first" β not "ordering unclear."
- What does NOT work, and why, so the model skips the dead ends you already paid for.
- Transferable principles distilled from the arc, not a diary of the session.
- A negative guardrail under each gotcha, phrased as a "do not X / verify Y before assuming" rule. This follows the 2026 RuleShaping finding that negative, state-dependent guardrails are the rule type that actually helps a model, and it is generated deterministically with no model call.
Terse gotchas are expanded by an optional one-shot pass, guarded so example phrasing can never leak into output.
How the shell reads your session
| What happened | How the shell shows it |
|---|---|
| Duration | Overall size and number of spiral turns |
| Each approach tried | Spiral arm density |
| Each dead end | A dark knot β and a fallen warrior in the battle layer |
| The breakthrough | The glowing aperture at the tip β and a dragon |
| Gotchas | Iridescent jewels along the rim β and archers |
| Your emotional arc | Color gradient from start to end |
A frustrated session ending in relief is red-to-green. A curious exploration ending in delight is warm gold. A long grind is cold blue-grey. The color story is the emotional story.
Procedural SVG: nacre texture filters, HSL color harmonies, bezier-smoothed curves. No diffusion, no image generation. Every pixel traces to a real session feature, which is the whole point: if a diffusion model drew the shell, "this knot is your dead end" would stop being true. The shell is born as a scroll unrolling along its spiral arm, led by a 3D paper curl, with the byΕbu battle inking on as the parchment is laid. There is a "watch it unroll again" replay.
Kept shells go into a shared terrarium (gallery): a living collection where
every shell is the fingerprint of a real session, each with a ?shell=<id>
permalink.
Architecture
The full primary pipeline runs on Modal at ~2.6B active parameters. The Qwen-7B is a labeled fallback only, not on the primary path.
| Component | Model | Params | Infrastructure |
|---|---|---|---|
| Transcription | openai/whisper-large-v3-turbo |
809M | Modal T4 |
| Feature extraction | slugextract-qwen2.5-1.5b-lora |
1.5B | Modal T4 (shared) |
| Slug voice | slugvoice-qwen2.5-1.5b-lora |
1.5B | Modal T4 (shared) |
| Spoken recap | Chatterbox TTS | ~300M | Modal A10G |
| Genre detection | Pattern matching (no model) | 0 | CPU |
| Shell + Receipt | Procedural SVG (no model) | 0 | CPU |
| Extraction fallback | Qwen/Qwen2.5-7B-Instruct |
7B | HF Inference (fallback only) |
Two custom LoRA adapters, one base model, one T4. Both adapters are fine-tunes of Qwen2.5-1.5B-Instruct, loaded onto a single base on a single T4 and switched per request, so the whole language pipeline runs on one GPU.
- SlugVoice (adapter): 500 hand-crafted (transcript snippet, slug observation) pairs. Loss 4.97 β 0.89.
- SlugExtract (adapter): 167 balanced (transcript, structured-JSON) pairs across 14 sentiment arcs. Loss 1.88 β 0.81. Replaces the Qwen-7B extractor; brings the pipeline to ~2.6B.
How Modal is used
- Fine-tuning. Both LoRAs trained on Modal (A10G). SlugExtract on a pure transformers + PEFT + bitsandbytes stack.
- Serving. Whisper on a T4; both LoRAs on a single shared T4 via PEFT multi-adapter, switched per request. Kept-warm containers (one always-on plus a buffer) for demo reliability.
- TTS. Chatterbox on an A10G speaks the recap.
- Evaluation. The groundedness eval (three models, 25 transcripts, two metrics) runs as a Modal job, persisting raw generations to a Volume.
- Gallery. The shared terrarium's save/list/fetch endpoints run on Modal, backed by the same Volume.
If the primary extraction misses, the app retries on the same small model and otherwise degrades to a clear message, so it never crashes mid-render.
Built with OpenAI Codex
Built using OpenAI Codex as the primary coding agent. Full commit history: github.com/AnubhavBharadwaaj/turbo-skill-slug
Codex handled scaffolding, the Gradio skeleton, the shell SVG geometry, tests, dependency wiring, and deployment fixes. Human judgment went into the slug's voice, the shell's visual design, the grounding constraints, and the two LoRA fine-tunes.
Technical choices and why
Procedural SVG over generated images. Every element traces to a real feature; a diffusion model would break that link.
Fine-tuned 1.5B over prompted 7B for both voice and extraction. A prompted large model copies examples; a fine-tuned small model learns the pattern. The eval shows the extraction fine-tune reaching 7B-level groundedness at a third the size.
Counts live on the receipt, never in the voice. The voice is forbidden from reciting tallies, so it can never state a number that contradicts the record. The slug describes moments; the receipt does arithmetic.
Graceful degradation everywhere. If the primary Modal extraction misses, the app retries once on the same small model, then shows a clear "try again" message rather than crashing. Unexpected sentiment β closest valid label. Animation fails β the full shell still shows (nothing is hard-hidden). Nothing crashes on edge cases.
Hackathon patches
| Patch | Status | Evidence |
|---|---|---|
| π Thousand Token Wood | β | A slug grows a shell from your session |
| π Best Use of Codex | β | Codex-attributed commits, documented usage, public repo |
| π¨ Off Brand | β | Procedural shell: scroll-unroll birth, byΕbu battle layer, thermal receipt |
| π― Well-Tuned | β | TWO published LoRAs + a published, re-scorable groundedness eval |
| ποΈ Best Use of Modal | β | Two fine-tunes, dual-adapter serving, TTS, eval, and gallery on Modal |
| π Field Notes | β | Blog article |
Honesty
Full caveats (parse-rate cost, the eval's 5/6 calibration, the animation payload, what is freshly built) are documented in HONEST_SUBMISSION.md. Every model and the eval data are published for anyone to verify.
Research foundation: from one skill to lifecycle-governed rules
Beyond the shipped app, this project carries an offline-validated research layer that answers a sharper question than "can a small model extract a skill": when does an extracted artifact actually help a capable model, and how should many sessions compound into durable knowledge?
This work is validated offline (it is not yet wired into the live Space) and is documented and tested in the repo. Stated plainly so the line between shipped and researched is clear:
When skills help (measured). A blind, calibrated eval (one model answers, an independent model judges) found that a generated skill gives a frontier model uplift only when it carries knowledge that could not be in training data: private behavior, post-cutoff facts, project conventions. General algorithmic skills gave 0.0 uplift; novel/private ones gave real uplift. The dividing line is provenance, not difficulty.
Compounding across sessions (built, offline-tested). A promotion engine, grounded in the 2026 "Experience Compression Spectrum" framing, consolidates gotchas that recur across multiple sessions of the same codebase into compact, guardrail-phrased rules, with provenance, confidence, and a validation gate that demotes rules that stop holding. Rule phrasing follows the "RuleShaping" finding that negative, state-dependent guardrails help where positive directives hurt.
Faithful trace distillation (built, offline-tested). A from-scratch implementation of the 2026 "Trace2Skill" method (validation-gated error analysis, hierarchical prevalent-pattern merging, niche items routed to references) for higher-fidelity extraction.
An honest result from running the extractor on a sample of real public agent
traces (SALT-NLP/SWE-chat and nebius/SWE-agent-trajectories): its gotchas are
specific and real (it named exact functions, ref namespaces, and build-tool
quirks), but cross-session promotion only fires within a single codebase, because
two different repos never share the same private trap. This was an exploratory
check on a handful of sessions, not a systematic benchmark, and it surfaced a
true property of the problem rather than a polished number, reported rather than
hidden. The loader scripts live in research/traces/ in the GitHub repo, so the
check is reproducible from scratch.
Full method, code, and tests live in the repo; the research is a foundation for where the slug goes next, not a claim about the current Space.
What comes next
- Session diff view. Upload two sessions, see how the shells differ.
- Tighter extraction reliability. Close the parse-rate gap (21/25 vs the 7B's 24/25) with more training pairs and constrained decoding.
Links
- Space: build-small-hackathon/TurboSkillSlug
- Code: github.com/AnubhavBharadwaaj/turbo-skill-slug
- SlugVoice LoRA: slugvoice-qwen2.5-1.5b-lora
- SlugExtract LoRA: slugextract-qwen2.5-1.5b-lora
- Groundedness eval: turboskillslug-groundedness-eval
- Demo: youtu.be/qSP9olWRv7o
- Social post: x.com/anubhav27071997
- Blog: turboskillslug-shell-from-session
The slug watches, gives its gifts, and goes back to sleep. I was here.