title: Lightloom · speak your world into being
emoji: 🌅
colorFrom: indigo
colorTo: yellow
sdk: gradio
app_file: app.py
license: apache-2.0
pinned: true
short_description: Speak — your story unrolls as a living storyboard world.
tags:
- track:wood
- sponsor:openbmb
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
- achievement:fieldnotes
- thousand-token-wood
- off-brand
- best-minicpm-build
- best-use-of-modal
- best-demo
- judges-wildcard
🌅 Lightloom — speak, and your world unrolls ahead of you
An entry for the Hugging Face Build Small hackathon — Thousand Token Wood track.
Speak a story and a continuous, painterly world unrolls live, ahead of you — a living storyboard where every phrase you speak becomes the next shot of one unbroken mural, painted in real time on one ZeroGPU slot by a handful of tiny local models. No cloud APIs. The world is the interface: a full‑bleed, framework‑free WebGL canvas, not a stock Gradio form.
▶️ Demo video: watch on YouTube · 📣 Social post: on LinkedIn · 📖 Field notes: the build write‑up · 👤 Team: Efradeca
(Built on Gradio · hosted as a Hugging Face Space · every model runs locally, ≤ 32B total.)
Why it's worth showing a friend (the 30 seconds)
You tap the mic and talk. As you speak, the Director turns each phrase into a shot, the Painter outpaints the next strip of one continuous mural that continues the previous edge, Depth‑Anything gives it relief, and the browser scrolls you through it — a world that keeps extending and breathing while you narrate. When you're done, an Art Director reads your finished world from its own pixels, names it, and films a calm keepsake — and you can step INTO it in real 3D ("Explore in 3D", drag to peer around, all on your own GPU).
It is delightful, the AI is load‑bearing (no models → no world), the concept is original (live voice → an endless painterly world + a VLM that narrates what it sees), and the UI pushes hard past stock Gradio (a hand‑written WebGL scroll, an "orchestra" HUD that lights up each tiny model, a cinematic keepsake, a navigable 3D viewer).
How it works — the orchestra, in real time
flowchart LR
V[🎙 Your voice] -->|NVIDIA Parakeet-CTC · transcribe| T[transcript]
T -->|split into phrases| D[Director · MiniCPM]
D -->|vivid scene + style per phrase| P[Painter · FLUX.2 klein · 4 steps]
P --> I[panorama strips]
I -->|Depth-Anything V2| Z[depth]
I & Z -->|stream| C[🌍 living world · continuous painterly scroll]
C -.->|at session end| A[Art Director · MiniCPM-V · names + films + 3D keepsake]
Phrases are cut from your voice as you talk (a browser‑side VAD), so the world keeps flowing
while you narrate. One spoken phrase = one @spaces.GPU call that paints a few strips continuing
the panorama (continuity is keyed per session on disk). Voice is optional — you can also type a
story, which feeds the same live pipeline, phrase by phrase.
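The phrase cutting above runs in the browser. As a rough illustration only, here is a Python sketch of an RMS energy gate that splits a sample stream into phrases. The function name, frame size, threshold and hangover length are all assumptions made for this sketch, not values taken from the app.

```python
import math

def split_phrases(samples, frame=160, threshold=0.02, hang=3):
    """Return (start, end) sample ranges whose frames exceed an RMS threshold.

    A phrase closes after `hang` consecutive quiet frames. Every default here
    is an illustrative guess, not a value taken from the app.
    """
    phrases, start, quiet = [], None, 0
    for i in range(0, len(samples) - frame + 1, frame):
        chunk = samples[i:i + frame]
        rms = math.sqrt(sum(x * x for x in chunk) / frame)
        if rms >= threshold:
            if start is None:
                start = i  # first loud frame opens a phrase
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hang:
                # close the phrase where the quiet run began
                phrases.append((start, i - (hang - 1) * frame))
                start, quiet = None, 0
    if start is not None:
        # audio ended mid-phrase: flush what we have
        phrases.append((start, len(samples)))
    return phrases
```

Under this sketch, each returned range would be handed off as one spoken phrase, so painting can start while the speaker is still talking.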
The live scroll shows the painted panorama with a subtle Depth‑Anything depth cue; at session end you can step INTO the finished world as a navigable depth‑displaced 3D mesh ("Explore in 3D", client‑GPU WebGL). Either way, each strip is a single still image given depth, not Gaussian‑splat reconstruction and not video diffusion.
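To make "a single still image given depth" concrete, here is a minimal sketch of a depth-displaced grid mesh: one vertex per pixel, pushed along z by its depth value, with two triangles per pixel quad. The real viewer is client-side WebGL; this pure-Python version, including the name depth_mesh and the relief parameter, is hypothetical and only illustrates the geometry.

```python
def depth_mesh(depth, relief=0.2):
    """Build a grid mesh from an H x W depth map (nested lists, values in [0, 1]).

    Returns (verts, faces): verts holds (x, y, z) tuples with x, y in [0, 1]
    and z = depth * relief; faces holds triangles as vertex-index triples.
    `relief` is an illustrative displacement scale, not an app setting.
    """
    h, w = len(depth), len(depth[0])
    verts = [
        (x / max(w - 1, 1), y / max(h - 1, 1), depth[y][x] * relief)
        for y in range(h)
        for x in range(w)
    ]
    faces = []
    for y in range(h - 1):
        for x in range(w - 1):
            tl = y * w + x                      # top-left corner of the quad
            tr, bl, br = tl + 1, tl + w, tl + w + 1
            faces.append((tl, bl, tr))          # two triangles per quad
            faces.append((tr, bl, br))
    return verts, faces
```

Texturing those vertices with the painted strip gives a relief you can orbit slightly, which is why the result is a displaced image rather than a reconstructed scene.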
The orchestra — parameter ledger (live at /health, ≤ 32B)
| Model | Role | Params | License | Runtime |
|---|---|---|---|---|
| nvidia/parakeet-ctc-1.1b | Voice → text (CTC; cannot hallucinate filler) | 1.10B | cc-by-4.0 | ✓ |
| openbmb/MiniCPM5-1B | Director (shot + style per scene) | 1.00B | apache-2.0 | ✓ |
| black-forest-labs/FLUX.2-klein-4B | Painter (4-step, CFG-free strip) | 4.00B | apache-2.0 | ✓ |
| depth-anything/Depth-Anything-V2-Small | Depth / relief | 0.025B | apache-2.0 | ✓ |
| openbmb/MiniCPM-V-4.6 | Art Director — names + describes the finished world from its pixels (post-process) | 1.30B | apache-2.0 | ✓* |
| openai/whisper-large-v3-turbo | ASR fallback (only if Parakeet fails to load) | 0.809B | mit | — |
| CohereLabs/tiny-aya-global-GGUF | Translator (Cohere, evaluated — not loaded) | 3.35B | cc-by-nc-4.0 | — |
| stabilityai/stable-audio-open-small | Ambient bed (not yet wired) | 0.341B | Stability Community | — |
| onnx-community/silero-vad | Voice activity (browser RMS does the live VAD) | 0.002B | mit | — |
TOTAL: 6.13B / 32B live runtime (Parakeet + MiniCPM Director + klein-4B + Depth-Anything — the
four models on the live slot), verifiable at /health. The MiniCPM‑V Art Director (1.30B, ✓*) loads
only at session end (post‑process), never on the live painter slot — so the live experience is a
6.13B orchestra and the whole thing stays far under 32B.
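The total can be rechecked by hand from the table. The snippet below just re-adds the ledger figures; it does not query /health, and the variable names are made up for this sketch.

```python
# Parameter counts in billions, copied from the ledger table above.
LIVE = {
    "nvidia/parakeet-ctc-1.1b": 1.10,
    "openbmb/MiniCPM5-1B": 1.00,
    "black-forest-labs/FLUX.2-klein-4B": 4.00,
    "depth-anything/Depth-Anything-V2-Small": 0.025,
}
# Loaded only at session end, never on the live painter slot.
POST_PROCESS = {"openbmb/MiniCPM-V-4.6": 1.30}
BUDGET_B = 32.0

live_total = sum(LIVE.values())
with_art_director = live_total + sum(POST_PROCESS.values())
print(f"live: {live_total:.3f}B, with Art Director: {with_art_director:.3f}B")
```

The live sum is 6.125B, which the ledger reports as 6.13B; adding the post-process Art Director still leaves the total well under the 32B budget.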
Sponsor integrations & badges (evidence-linked)
- OpenBMB — load-bearing twice. MiniCPM5‑1B is the live Director (it reads each phrase and picks the shot and the art style), and MiniCPM‑V‑4.6 is the post‑process Art Director — it looks at your finished painting and names it, captions it, lists what it sees, and points the keepsake/3D camera at the most striking region. The world's variety and its narration are MiniCPM's work.
- Black Forest Labs — FLUX.2 [klein] 4B. The distilled 4‑step, CFG‑free painter is what makes a live painterly scroll possible at all (~1.3 s per spoken phrase on the ZeroGPU slot).
- NVIDIA — Parakeet‑CTC‑1.1b. Alignment‑based CTC ASR: it emits blanks on silence and so structurally cannot hallucinate filler — the right tool for live, hands‑off narration.
- Cohere. We evaluated Tiny Aya and Cohere Transcribe; the live path transcribes the spoken language and does not translate; Aya is in the ledger as evaluated, not loaded.
- Modal — the cohesive painterly look. A style LoRA for FLUX.2‑klein was trained on Modal (training/modal_lora/, rank‑16, 1500 steps), published to the Hub, and fused into the distilled painter at warm‑up (scale 0.75) — 0B net runtime, since it folds into klein's existing weights (no new model loaded). The trigger "lghtlm style" is prepended to every painter prompt; it is gated by LIGHTLOOM_STYLE_LORA (default on) and a load hiccup falls back to the un‑styled painter. The adapter is public on the Hub (Efradeca/lightloom-style-lora) so the fine‑tune is verifiable.
- Off the Grid — zero cloud APIs at runtime; /health declares the flags and a compliance test greps for cloud SDKs.
- Off‑Brand — a fully custom front end over gradio.Server: no stock Gradio components; the world is a hand‑written WebGL painterly scroll with a live model "orchestra" HUD, a cinematic keepsake modal, and a navigable 3D viewer.
- Well‑Tuned — the painterly LoRA above is a fine‑tune trained on Modal and published on the Hub (Efradeca/lightloom-style-lora), loaded live by the app — small models, fine‑tuned, punching far above their weight on a 6.13B orchestra that paints, directs, depths, transcribes and (post‑hoc) narrates from pixels.
- Field Notes — the build write‑up, published as a Hugging Face blog post: I built a world you can talk into existence.
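The "0B net runtime" claim in the Modal bullet follows from how a LoRA is fused: the low-rank update is added into the existing weight matrix, so the matrix keeps its shape. Here is a minimal pure-Python sketch of that fold, together with the environment gate and trigger prefix. The function names, the values that switch the gate off, and the comma used to join trigger and prompt are all assumptions; the app's own code may differ.

```python
import os

TRIGGER = "lghtlm style"  # trigger phrase named in the Modal bullet above

def lora_enabled(env=os.environ):
    # Assumption: the gate defaults to on and only "0", "false" or "off" disable it.
    value = env.get("LIGHTLOOM_STYLE_LORA", "1")
    return value.strip().lower() not in {"0", "false", "off"}

def styled_prompt(prompt, enabled=True):
    # Assumption: the trigger is joined to the prompt with a comma.
    return f"{TRIGGER}, {prompt}" if enabled else prompt

def fuse(weight, lora_b, lora_a, scale=0.75):
    """Return weight + scale * (lora_b @ lora_a) as a new matrix of the same shape."""
    rank = len(lora_a)
    return [
        [
            w + scale * sum(lora_b[i][r] * lora_a[r][j] for r in range(rank))
            for j, w in enumerate(row)
        ]
        for i, row in enumerate(weight)
    ]
```

Because fuse returns a matrix with the same dimensions as weight, the painter carries no extra parameters after fusion; the adapter matrices can be discarded once they are folded in.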
Live vs pre-rendered (honesty notes)
Everything in the scroll is generated live in this Space (~25–30 s one‑time model warm‑up, covered by a pre‑rendered ambient scroll, then ~1.3 s per spoken phrase). The Showcase ("watch the showcase") is a panorama pre‑rendered by this same engine and bundled, clearly badged, so a visitor who has spent their ZeroGPU quota still sees the full experience instantly. Known limits: the one‑time warm‑up; ZeroGPU anonymous quota is ~2 min/day; very long sessions can drift in style.
Run locally
pip install -r requirements.txt
python app.py # serves the gradio.Server app at http://localhost:7860
Code: Apache‑2.0 (see LICENSE). Demo texts are original or public‑domain.
Judges: if a live run hits the ZeroGPU quota, the on‑screen Showcase plays a full pre‑rendered world instantly, and the demo video above shows the live experience end‑to‑end.