legendarydragontamer

deploy

51a9974 18 days ago

metadata

title: TurboSkillSlug
emoji: 🐌
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 6.16.0
python_version: '3.12'
app_file: app.py
pinned: false
short_description: Turn a coding session into a skill, recap, and shell.
tags:
  - hackathon
  - build-small-hackathon
  - track:wood
  - sponsor:openai
  - sponsor:modal
  - achievement:welltuned
  - achievement:offbrand
  - achievement:fieldnotes
models:
  - legendarydragontamer/slugvoice-qwen2.5-1.5b-lora
  - legendarydragontamer/slugextract-qwen2.5-1.5b-lora
  - Qwen/Qwen2.5-1.5B-Instruct
  - openai/whisper-large-v3-turbo
datasets:
  - legendarydragontamer/turboskillslug-groundedness-eval

TurboSkillSlug

Feed it a coding session. Get back a reusable SKILL.md, a grounded spoken recap, and a procedural shell that encodes the whole session as art.

The session goes in one of two ways: narrate it aloud, or drop a real Claude Code or Codex CLI session log (.jsonl). A fine-tuned 1.5B model extracts what you tried, what failed, and what finally worked. The whole active pipeline is ~2.6B parameters, and on 25 held-out transcripts its extractor matched or slightly exceeded the 7B it replaced on groundedness, at about a third of the active size.

Then the slug gives you four things:

A SKILL.md another LLM can actually use: the non-obvious gotchas (symptom → cause → fix), the approaches that fail and why, the breakthrough. Built to give a frontier model real uplift, not a session summary.
A spoken recap in a fine-tuned "slug voice," every line grounded in something that happened, never reciting invented numbers.
A shell whose spiral, knots, jewels, and colors all derive from your session, born on screen as a scroll that unrolls along its own arm, with a byōbu-style battle inked across it (dead ends are fallen warriors, the breakthrough is a dragon).
A receipt like a thermal printout: approaches tried, dead ends, mood.

The slug witnesses every kind of session, not just debugging

Most coding sessions are not bug hunts. They are exploring an unfamiliar repo, writing docs, setting up tooling, building a feature. A witness that only has eyes for "what broke and what fixed it" leaves those sessions with a hollow shell.

So the slug detects the session's genre (debugging, exploration, authoring, feature, refactor, setup) and witnesses the right thing for each:

debugging → the struggle: dead ends and the breakthrough
exploration → the discoveries: the non-obvious facts learned about the codebase
authoring → the decisions, and the false assumptions caught before they became wrong docs
feature / refactor / setup → what was built or changed, and what would break if done naively

The shell's vocabulary adapts with it: for an exploration session the rim jewels are discoveries, the aperture is the clearest insight. Genre detection is pure pattern-matching: no model call, no added latency.
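A minimal sketch of what model-free genre detection can look like, assuming a simple cue-counting scheme. The cue lists, the `detect_genre` name, and the tie-breaking rule are illustrative assumptions, not the slug's actual rules:

```python
import re

# Hypothetical cue patterns per genre; the real detector's rules may differ.
GENRE_CUES = {
    "debugging": r"\b(traceback|stack trace|error|exception|bug|crash(?:es|ed)?|fails?|failing)\b",
    "exploration": r"\b(where is|how does|what does|walk through|explore|understand)\b",
    "authoring": r"\b(readme|docs?|documentation|docstring|changelog)\b",
    "feature": r"\b(add|implement|build|new endpoint|new feature|support for)\b",
    "refactor": r"\b(refactor|rename|extract|clean ?up|simplify|restructure)\b",
    "setup": r"\b(install|configure|setup|set up|dockerfile|ci|lint(?:er)?|dependenc(?:y|ies))\b",
}
COMPILED = {genre: re.compile(p, re.IGNORECASE) for genre, p in GENRE_CUES.items()}


def detect_genre(text: str, default: str = "debugging") -> str:
    """Return the genre whose cues match most often; ties go to the first listed."""
    counts = {genre: len(rx.findall(text)) for genre, rx in COMPILED.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default


print(detect_genre("pytest fails with a traceback, the bug is in parse()"))  # debugging
print(detect_genre("rename the helper and simplify the module"))            # refactor
```

Because this is only regex counting on CPU, it adds no model call and effectively no latency, which is the property the paragraph above describes.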

Why this matters concretely: on a real exploration trace, the slug surfaced that a project's checkpoint mirror uses a custom git ref namespace (refs/entire/...) that a standard git fetch --all will miss. That is exactly the kind of private, non-derivable knowledge a SKILL.md exists to carry, and it came from a session that had no "bug" at all.

Every shell is unique because every session is unique.

Try it in one click

Two tabs, two sample inputs:

narrate aloud — a sample build session (audio)
drop a session trace — a sample Claude Code .jsonl trace

Or bring your own: upload a recording, or drag a real session log from ~/.claude/projects/.../*.jsonl or ~/.codex/sessions/.../*.jsonl. Judges can feed their own agent logs and watch the slug read them.
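For readers who want to inspect their own logs first, here is a small sketch of a tolerant .jsonl reader. It assumes only that each line is one JSON object; it does not assume any particular Claude Code or Codex field names, and `iter_jsonl_events` and `load_session` are hypothetical helpers, not part of the app:

```python
import json
from typing import Iterable, Iterator


def iter_jsonl_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed JSON object line; skip blanks and bad lines."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # e.g. a line that was cut off mid-write
        if isinstance(event, dict):
            yield event


def load_session(path: str) -> list[dict]:
    """Read a whole session log from disk into a list of event dicts."""
    with open(path, encoding="utf-8") as fh:
        return list(iter_jsonl_events(fh))


sample = ['{"type": "user", "text": "hi"}', "", '{"broken": ', "[1, 2]", '{"type": "assistant"}']
print(len(list(iter_jsonl_events(sample))))  # 2
```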

Demo

Watch the demo: youtu.be/qSP9olWRv7o

Social

The launch post: x.com/anubhav27071997

Why this is hard the right way

The slug's entire promise is a witness that only says what it saw. That makes groundedness the core engineering problem: a small model that invents facts is worthless here. So we measured it, honestly, and published the data.

Groundedness: does the small model hallucinate more than the 7B it replaced?

On 25 held-out transcripts, comparing the shipped fine-tuned 1.5B against the Qwen-7B it replaced and its own un-tuned 1.5B base:

system	semantic groundedness	lexical overlap	valid JSON parses	facts scored
prompted 7B	0.716	0.576	24/25	272
prompted 1.5B base	0.565	0.390	21/25	140
fine-tuned 1.5B LoRA	0.762	0.378	21/25	195

The fine-tuned 1.5B matches or slightly exceeds the 7B on semantic groundedness (0.762 vs 0.716) at about a third of the active size. It does this by paraphrasing rather than copying: it has the lowest lexical overlap (0.378) and the highest semantic groundedness, the signature of a model that restates meaning instead of echoing words.

Reported with its costs, not spun: the LoRA produces valid JSON less often (21/25 vs 24/25), and the semantic metric passed 5/6 hand-labeled calibration cases (the one miss understates the LoRA, not the reverse). 25 transcripts is a small sample, so the honest claim is "matches or slightly exceeds," not "beats." Raw generations and per-fact scores are published so anyone can re-score: turboskillslug-groundedness-eval.
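To make the lexical-versus-semantic split concrete, here is a toy token-overlap score. It is an illustration under assumptions, not the eval's actual metric: `lexical_groundedness`, the tokenizer, and the example sentences are all made up for this sketch. It shows why a faithful paraphrase can score low lexically, and it restates the parse rates from the table as percentages:

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, deduplicated."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def lexical_groundedness(fact: str, transcript: str) -> float:
    """Share of the fact's tokens that also appear in the transcript (0.0 to 1.0)."""
    fact_tokens = tokens(fact)
    if not fact_tokens:
        return 0.0
    return len(fact_tokens & tokens(transcript)) / len(fact_tokens)


transcript = "the build failed because the cache key ignored the lockfile"
copied = "the cache key ignored the lockfile"
paraphrased = "stale caching: dependency pins were not part of the key"

print(lexical_groundedness(copied, transcript))       # 1.0
print(lexical_groundedness(paraphrased, transcript))  # 0.2

# Parse rates from the table, as percentages.
print(f"{21 / 25:.0%} vs {24 / 25:.0%}")              # 84% vs 96%
```

The paraphrase is grounded in the transcript's meaning but shares almost no words with it, which is why a semantic metric is needed alongside the lexical one.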

The SKILL.md is the real gift

A skill file is only worth shipping if it helps an LLM that is already capable without it. A summary does not. So the SKILL.md is built to carry the non-obvious, transferable knowledge a frontier model cannot derive on its own:

Gotchas as symptom → cause → fix, not labels. "Processing leaf nodes first looks natural but breaks because a parent depends on its children being finalized; process deepest-first" — not "ordering unclear."
What does NOT work, and why, so the model skips the dead ends you already paid for.
Transferable principles distilled from the arc, not a diary of the session.
A negative guardrail under each gotcha, phrased as a "do not X / verify Y before assuming" rule. This follows the 2026 RuleShaping finding that negative, state-dependent guardrails are the rule type that actually helps a model, and it is generated deterministically with no model call.

Terse gotchas are expanded by an optional one-shot pass, guarded so example phrasing can never leak into output.
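The deterministic guardrail step can be pictured as plain string templating. This is a sketch under assumptions: the `Gotcha` fields, the `guardrail` function, and the exact template wording are hypothetical, chosen to match the "do not X / verify Y before assuming" shape described above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gotcha:
    naive_action: str  # the tempting move that fails
    cause: str         # why it fails
    fix: str           # what works instead


def guardrail(g: Gotcha) -> str:
    """Render a negative, state-dependent rule from a gotcha with a fixed template."""
    return (
        f"Do not {g.naive_action} without checking: verify whether {g.cause} "
        f"before assuming it is safe. If so, {g.fix}."
    )


leaf_order = Gotcha(
    naive_action="process leaf nodes first",
    cause="a parent depends on its children being finalized",
    fix="process deepest-first",
)
print(guardrail(leaf_order))
```

Since the output is a pure function of the gotcha, the same input always yields the same rule, and no model call is involved.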

How the shell reads your session

What happened	How the shell shows it
Duration	Overall size and number of spiral turns
Each approach tried	Spiral arm density
Each dead end	A dark knot — and a fallen warrior in the battle layer
The breakthrough	The glowing aperture at the tip — and a dragon
Gotchas	Iridescent jewels along the rim — and archers
Your emotional arc	Color gradient from start to end

A frustrated session ending in relief is red-to-green. A curious exploration ending in delight is warm gold. A long grind is cold blue-grey. The color story is the emotional story.
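A sketch of how an emotional arc could become an SVG gradient. The mood labels and HSL values in `MOOD_HSL` are assumptions picked to echo the examples above (red to green, warm gold, cold blue-grey); the shipped palette may differ:

```python
# Hypothetical mood -> (hue, saturation %, lightness %) table.
MOOD_HSL = {
    "frustrated": (0, 70, 50),   # red
    "relieved": (120, 55, 45),   # green
    "curious": (45, 80, 55),     # warm gold
    "delighted": (50, 90, 60),   # brighter gold
    "grinding": (210, 15, 50),   # cold blue-grey
}


def hsl(mood: str) -> str:
    """Format a mood's color as a CSS hsl() string."""
    h, s, l = MOOD_HSL[mood]
    return f"hsl({h}, {s}%, {l}%)"


def arc_gradient(start: str, end: str, grad_id: str = "arc") -> str:
    """Return an SVG <linearGradient> running from the start mood to the end mood."""
    return (
        f'<linearGradient id="{grad_id}" x1="0" y1="0" x2="1" y2="0">'
        f'<stop offset="0%" stop-color="{hsl(start)}"/>'
        f'<stop offset="100%" stop-color="{hsl(end)}"/>'
        "</linearGradient>"
    )


print(arc_gradient("frustrated", "relieved"))
```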

Procedural SVG: nacre texture filters, HSL color harmonies, bezier-smoothed curves. No diffusion, no image generation. Every pixel traces to a real session feature, which is the whole point: if a diffusion model drew the shell, "this knot is your dead end" would stop being true. The shell is born as a scroll unrolling along its spiral arm, led by a 3D paper curl, with the byōbu battle inking on as the parchment is laid. There is a "watch it unroll again" replay.
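The "every pixel traces to a feature" idea can be sketched with a logarithmic spiral whose turn count comes from duration and whose knots sit where dead ends occurred. Everything here is an assumption for illustration: the function names, the duration-to-turns rule, the colors, and the use of straight segments instead of the bezier smoothing and nacre filters the real shell uses:

```python
import math


def spiral_points(turns: float, samples_per_turn: int = 48,
                  r0: float = 4.0, growth: float = 0.18) -> list[tuple[float, float]]:
    """Sample a logarithmic spiral r = r0 * exp(growth * theta)."""
    n = max(2, int(turns * samples_per_turn))
    points = []
    for i in range(n + 1):
        theta = 2 * math.pi * turns * i / n
        r = r0 * math.exp(growth * theta)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


def shell_svg(duration_min: float, dead_ends: list[float]) -> str:
    """dead_ends holds fractions (0..1) of the session at which each dead end occurred."""
    turns = 1.5 + min(duration_min, 180.0) / 60.0  # hypothetical duration-to-turns rule
    points = spiral_points(turns)
    path = "M " + " L ".join(f"{x:.1f},{y:.1f}" for x, y in points)
    knots = "".join(
        '<circle cx="{:.1f}" cy="{:.1f}" r="3" fill="#222"/>'.format(
            *points[round(min(max(f, 0.0), 1.0) * (len(points) - 1))]
        )
        for f in dead_ends
    )
    ax, ay = points[-1]  # the aperture sits at the outer tip of the arm
    size = math.hypot(ax, ay) + 10
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'viewBox="{-size:.0f} {-size:.0f} {2 * size:.0f} {2 * size:.0f}">'
        f'<path d="{path}" fill="none" stroke="#8a6d3b" stroke-width="2"/>'
        f"{knots}"
        f'<circle cx="{ax:.1f}" cy="{ay:.1f}" r="6" fill="gold"/>'
        "</svg>"
    )


svg = shell_svg(duration_min=60, dead_ends=[0.25, 0.8])
print(svg.count("<circle"))  # 3: two knots plus the aperture
```

Each knot's position is computed from a session fraction, so "this knot is your dead end" stays literally true, which a generated image could not guarantee.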

Kept shells go into a shared terrarium (gallery): a living collection where every shell is the fingerprint of a real session, each with a ?shell=<id> permalink.

Architecture

The full primary pipeline runs on Modal at ~2.6B active parameters. The Qwen-7B is a labeled fallback only, not on the primary path.

Component	Model	Params	Infrastructure
Transcription	`openai/whisper-large-v3-turbo`	809M	Modal T4
Feature extraction	`slugextract-qwen2.5-1.5b-lora`	1.5B	Modal T4 (shared)
Slug voice	`slugvoice-qwen2.5-1.5b-lora`	1.5B	Modal T4 (shared)
Spoken recap	Chatterbox TTS	~300M	Modal A10G
Genre detection	Pattern matching (no model)	0	CPU
Shell + Receipt	Procedural SVG (no model)	0	CPU
Extraction fallback	`Qwen/Qwen2.5-7B-Instruct`	7B	HF Inference (fallback only)
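The ~2.6B figure can be checked from the table with a few lines of arithmetic. This sketch assumes the shared 1.5B base is counted once for both adapters, that the LoRA adapter weights themselves are small enough to ignore, and that "~300M" is taken at face value:

```python
# Parameter counts from the table; the fallback 7B is excluded from the active path.
ACTIVE_PARAMS = {
    "whisper-large-v3-turbo": 809e6,
    "Qwen2.5-1.5B base (shared by both LoRAs)": 1.5e9,
    "Chatterbox TTS (approximate)": 300e6,
}

total = sum(ACTIVE_PARAMS.values())
print(f"{total / 1e9:.1f}B")  # 2.6B
print(f"{total / 7e9:.2f}")   # 0.37, i.e. a bit over a third of a 7B model
```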

Two custom LoRA adapters, one base model, one T4. Both adapters are fine-tunes of Qwen2.5-1.5B-Instruct, loaded onto a single base on a single T4 and switched per request, so the whole language pipeline runs on one GPU.

SlugVoice (adapter): 500 hand-crafted (transcript snippet, slug observation) pairs. Loss 4.97 → 0.89.
SlugExtract (adapter): 167 balanced (transcript, structured-JSON) pairs across 14 sentiment arcs. Loss 1.88 → 0.81. Replaces the Qwen-7B extractor; brings the pipeline to ~2.6B.
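A sketch of dual-adapter serving with PEFT, assuming the published adapter repos load directly onto the base model. The task names, `load_shared_model`, and `generate` are hypothetical; the Space's actual serving code may be organized differently. Heavy imports are kept inside the loader so the module can be imported without a GPU:

```python
BASE_ID = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTERS = {
    "extract": "legendarydragontamer/slugextract-qwen2.5-1.5b-lora",
    "voice": "legendarydragontamer/slugvoice-qwen2.5-1.5b-lora",
}


def load_shared_model():
    """Load the base once and attach both adapters (downloads weights; needs a GPU)."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
    base = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float16).to("cuda")
    model = PeftModel.from_pretrained(base, ADAPTERS["extract"], adapter_name="extract")
    model.load_adapter(ADAPTERS["voice"], adapter_name="voice")
    return tokenizer, model


def generate(tokenizer, model, task: str, prompt: str, max_new_tokens: int = 256) -> str:
    """Switch to the adapter for this task, then generate a completion."""
    if task not in ADAPTERS:
        raise ValueError(f"unknown task: {task!r}")
    model.set_adapter(task)  # activate the LoRA for this request
    device = next(model.parameters()).device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

One design note: `set_adapter` changes shared state on the model, so a server built this way would likely need to handle requests for different adapters one at a time per container.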

How Modal is used

Fine-tuning. Both LoRAs trained on Modal (A10G). SlugExtract on a pure transformers + PEFT + bitsandbytes stack.
Serving. Whisper on a T4; both LoRAs on a single shared T4 via PEFT multi-adapter, switched per request. Kept-warm containers (one always-on plus a buffer) for demo reliability.
TTS. Chatterbox on an A10G speaks the recap.
Evaluation. The groundedness eval (three models, 25 transcripts, two metrics) runs as a Modal job, persisting raw generations to a Volume.
Gallery. The shared terrarium's save/list/fetch endpoints run on Modal, backed by the same Volume.

If the primary extraction misses, the app retries on the same small model and otherwise degrades to a clear message, so it never crashes mid-render.

Built with OpenAI Codex

Built using OpenAI Codex as the primary coding agent. Full commit history: github.com/AnubhavBharadwaaj/turbo-skill-slug

Codex handled scaffolding, the Gradio skeleton, the shell SVG geometry, tests, dependency wiring, and deployment fixes. Human judgment went into the slug's voice, the shell's visual design, the grounding constraints, and the two LoRA fine-tunes.

Technical choices and why

Procedural SVG over generated images. Every element traces to a real feature; a diffusion model would break that link.

Fine-tuned 1.5B over prompted 7B for both voice and extraction. A prompted large model tends to copy its examples; a fine-tuned small model learns the pattern. The eval shows the extraction fine-tune reaching 7B-level groundedness with a 1.5B model, which keeps the whole active pipeline at about a third of the 7B's size.

Counts live on the receipt, never in the voice. The voice is forbidden from reciting tallies, so it can never state a number that contradicts the record. The slug describes moments; the receipt does arithmetic.
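One way to enforce "no tallies in the voice" is a post-filter that drops any recap line containing a digit or a spelled-out count. This is a sketch under assumptions: `strip_tallies` and the word list are hypothetical, and the filter deliberately errs toward dropping lines (it will also drop a harmless "one" used as a pronoun):

```python
import re

NUMBER_WORDS = (
    "one|two|three|four|five|six|seven|eight|nine|ten|"
    "eleven|twelve|dozen|hundred|thousand"
)
TALLY = re.compile(rf"\d|\b(?:{NUMBER_WORDS})\b", re.IGNORECASE)


def strip_tallies(lines: list[str]) -> list[str]:
    """Keep only recap lines that state no digit and no spelled-out count."""
    return [line for line in lines if not TALLY.search(line)]


recap = [
    "You tried three approaches.",
    "The cache finally gave way.",
    "Attempt 2 hit a wall.",
]
print(strip_tallies(recap))  # ['The cache finally gave way.']
```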

Graceful degradation everywhere. If the primary Modal extraction misses, the app retries once on the same small model, then shows a clear "try again" message rather than crashing. Unexpected sentiment → closest valid label. Animation fails → the full shell still shows (nothing is hard-hidden). Nothing crashes on edge cases.
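The retry-then-message behavior and the closest-label fallback can be sketched as follows. The label list, the similarity cutoff, the message text, and both function names are assumptions for illustration:

```python
import difflib
from typing import Callable, Optional

# Hypothetical label set; the real extractor's sentiment labels may differ.
VALID_SENTIMENTS = ["frustrated", "curious", "relieved", "delighted", "grinding"]


def closest_sentiment(label: str, default: str = "curious") -> str:
    """Map an unexpected label onto the nearest valid one by string similarity."""
    match = difflib.get_close_matches(label.lower().strip(), VALID_SENTIMENTS, n=1, cutoff=0.6)
    return match[0] if match else default


def extract_with_retry(extract: Callable[[str], dict], transcript: str,
                       attempts: int = 2) -> tuple[Optional[dict], str]:
    """Call extract up to `attempts` times; return (features, user-facing message)."""
    for _ in range(attempts):
        try:
            features = extract(transcript)
        except Exception:
            continue  # treat any failure as a miss and try again
        if features:
            return features, ""
    return None, "The slug could not read that session. Please try again."


print(closest_sentiment("frustated"))  # frustrated
print(closest_sentiment("zzz"))        # curious
```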

Hackathon patches

Patch	Status	Evidence
🍄 Thousand Token Wood	✅	A slug grows a shell from your session
🏆 Best Use of Codex	✅	Codex-attributed commits, documented usage, public repo
🎨 Off Brand	✅	Procedural shell: scroll-unroll birth, byōbu battle layer, thermal receipt
🎯 Well-Tuned	✅	Two published LoRAs + a published, re-scorable groundedness eval
🏗️ Best Use of Modal	✅	Two fine-tunes, dual-adapter serving, TTS, eval, and gallery on Modal
📓 Field Notes	✅	Blog article

Honesty

Full caveats (parse-rate cost, the eval's 5/6 calibration, the animation payload, what is freshly built) are documented in HONEST_SUBMISSION.md. Every model and the eval data are published for anyone to verify.

Research foundation: from one skill to lifecycle-governed rules

Beyond the shipped app, this project carries an offline-validated research layer that answers a sharper question than "can a small model extract a skill": when does an extracted artifact actually help a capable model, and how should many sessions compound into durable knowledge?

This work is validated offline (it is not yet wired into the live Space) and is documented and tested in the repo. Stated plainly so the line between shipped and researched is clear:

When skills help (measured). A blind, calibrated eval (one model answers, an independent model judges) found that a generated skill gives a frontier model uplift only when it carries knowledge that could not be in training data: private behavior, post-cutoff facts, project conventions. General algorithmic skills gave 0.0 uplift; novel/private ones gave real uplift. The dividing line is provenance, not difficulty.
Compounding across sessions (built, offline-tested). A promotion engine, grounded in the 2026 "Experience Compression Spectrum" framing, consolidates gotchas that recur across multiple sessions of the same codebase into compact, guardrail-phrased rules, with provenance, confidence, and a validation gate that demotes rules that stop holding. Rule phrasing follows the "RuleShaping" finding that negative, state-dependent guardrails help where positive directives hurt.
Faithful trace distillation (built, offline-tested). A from-scratch implementation of the 2026 "Trace2Skill" method (validation-gated error analysis, hierarchical prevalent-pattern merging, niche items routed to references) for higher-fidelity extraction.
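The promotion idea in the list above can be sketched in a few lines: a gotcha becomes a rule only when it recurs across sessions of the same codebase, and a validation gate can demote it later. The `Rule` fields, the smoothed confidence formula, and the thresholds are illustrative assumptions, not the repo's implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    repo: str
    gotcha: str
    sessions: set[str] = field(default_factory=set)  # provenance
    passes: int = 0
    checks: int = 0

    @property
    def confidence(self) -> float:
        # Laplace-smoothed pass rate; an illustrative choice of formula.
        return (self.passes + 1) / (self.checks + 2)


def promote(observations: list[tuple[str, str, str]], min_sessions: int = 2) -> list[Rule]:
    """observations are (repo, session_id, normalized_gotcha) triples."""
    seen: dict[tuple[str, str], Rule] = {}
    for repo, session_id, gotcha in observations:
        rule = seen.setdefault((repo, gotcha), Rule(repo, gotcha))
        rule.sessions.add(session_id)
    return [r for r in seen.values() if len(r.sessions) >= min_sessions]


def validate(rule: Rule, held: bool, demote_below: float = 0.4) -> bool:
    """Record one validation outcome; return False when the rule should be demoted."""
    rule.checks += 1
    rule.passes += int(held)
    return rule.confidence >= demote_below


obs = [
    ("repo-a", "s1", "custom ref namespace is skipped by a plain fetch"),
    ("repo-a", "s2", "custom ref namespace is skipped by a plain fetch"),
    ("repo-b", "s3", "custom ref namespace is skipped by a plain fetch"),
]
print([(r.repo, sorted(r.sessions)) for r in promote(obs)])  # [('repo-a', ['s1', 's2'])]
```

Keying rules by (repo, gotcha) means a gotcha seen once in each of two different repos is never promoted, which mirrors the single-codebase behavior reported in this section.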

An honest result from running the extractor on a sample of real public agent traces (SALT-NLP/SWE-chat and nebius/SWE-agent-trajectories): its gotchas are specific and real (it named exact functions, ref namespaces, and build-tool quirks), but cross-session promotion only fires within a single codebase, because two different repos never share the same private trap. This was an exploratory check on a handful of sessions, not a systematic benchmark. It surfaced a true property of the problem rather than a polished number, and it is reported here rather than hidden. The loader scripts live in research/traces/ in the GitHub repo, so the check is reproducible from scratch.

Full method, code, and tests live in the repo; the research is a foundation for where the slug goes next, not a claim about the current Space.

What comes next

Session diff view. Upload two sessions, see how the shells differ.
Tighter extraction reliability. Close the parse-rate gap (21/25 vs the 7B's 24/25) with more training pairs and constrained decoding.