---
title: Lightloom · speak your world into being
emoji: 🌅
colorFrom: indigo
colorTo: yellow
sdk: gradio
app_file: app.py
license: apache-2.0
pinned: true
short_description: Speak — your story unrolls as a living storyboard world.
tags:
  - track:wood
  - sponsor:openbmb
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:fieldnotes
  - thousand-token-wood
  - off-brand
  - best-minicpm-build
  - best-use-of-modal
  - best-demo
  - judges-wildcard
---

# 🌅 Lightloom — speak, and your world unrolls ahead of you

*An entry for the Hugging Face **Build Small** hackathon — Thousand Token Wood track.*

**Speak a story and a continuous, painterly world unrolls live, ahead of you** — a **living storyboard** where every phrase you speak becomes the next shot of one unbroken mural, painted in real time on **one ZeroGPU slot** by a handful of tiny **local** models. No cloud APIs. The world *is* the interface: a full‑bleed, framework‑free WebGL canvas, not a stock Gradio form.

▶️ **Demo video:** [watch on YouTube](https://youtu.be/Dn3IYpoVS7k) · 📣 **Social post:** [on LinkedIn](https://www.linkedin.com/posts/efrain-deulofeu-9563a1223_buildsmall-huggingface-generativeai-ugcPost-7472424691769647104-rlZI/) · 📖 **Field notes:** [the build write‑up](https://huggingface.co/blog/build-small-hackathon/lightloom) · 👤 **Team:** [Efradeca](https://huggingface.co/Efradeca)

*(Built on Gradio · hosted as a Hugging Face Space · every model runs locally, ≤ 32B total.)*

---

## Why it's worth showing a friend (the 30 seconds)

You tap the mic and **talk**. As you speak, the **Director** turns each phrase into a shot, the **Painter** outpaints the next strip of one continuous mural that *continues* the previous edge, **Depth‑Anything** gives it relief, and the browser scrolls you through it — a world that keeps **extending and breathing** while you narrate.
When you're done, an **Art Director** *reads your finished world from its own pixels*, **names** it, and films a calm keepsake — and you can **step INTO it in real 3D** ("Explore in 3D", drag to peer around, all on your own GPU).

It is **delightful**, the **AI is load‑bearing** (no models → no world), the **concept is original** (live voice → an endless painterly world + a VLM that narrates what it sees), and the **UI pushes hard past stock Gradio** (a hand‑written WebGL scroll, an "orchestra" HUD that lights up each tiny model, a cinematic keepsake, a navigable 3D viewer).

## How it works — the orchestra, in real time

```mermaid
flowchart LR
  V[🎙 Your voice] -->|NVIDIA Parakeet-CTC · transcribe| T[transcript]
  T -->|split into phrases| D[Director · MiniCPM]
  D -->|vivid scene + style per phrase| P[Painter · FLUX.2 klein · 4 steps]
  P --> I[panorama strips]
  I -->|Depth-Anything V2| Z[depth]
  I & Z -->|stream| C[🌍 living world · continuous painterly scroll]
  C -.->|at session end| A[Art Director · MiniCPM-V · names + films + 3D keepsake]
```

Phrases are cut from your voice **as you talk** (a browser‑side VAD), so the world keeps flowing while you narrate. **One spoken phrase = one `@spaces.GPU` call** that paints a few strips continuing the panorama (continuity is keyed per session on disk). Voice is optional — you can also **type a story**, which feeds the **same** live pipeline, phrase by phrase.

The live scroll shows the painted panorama with a subtle Depth‑Anything depth cue; at session end you can step INTO the finished world as a **navigable depth‑displaced 3D mesh** ("Explore in 3D", client‑GPU WebGL). Either way each strip is a single still image given depth — **not** Gaussian‑splat reconstruction and **not** video diffusion.
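
As a rough illustration of how an RMS gate can cut phrases out of a live audio stream, here is a minimal Python sketch. The real detector runs in the browser and its code is not reproduced here; `frame_rms`, `cut_phrases`, the `threshold` of 0.02 and the `hangover` of 3 frames are all hypothetical names and values chosen for this example, not the app's actual settings.

```python
import math
from typing import Iterable, List, Optional, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def cut_phrases(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.02,  # hypothetical energy gate, not the app's value
    hangover: int = 3,        # hypothetical: silent frames that close a phrase
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, per voiced phrase."""
    phrases: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first voiced frame of the open phrase
    silent_run = 0               # consecutive silent frames inside that phrase
    count = 0                    # total frames consumed so far
    for i, frame in enumerate(frames):
        count = i + 1
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover:
                # Enough silence: close the phrase just after its last voiced frame.
                phrases.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:
        # The stream ended mid-phrase: flush it, trimming trailing silence.
        phrases.append((start, count - silent_run))
    return phrases
```

Under these assumptions, each returned span would correspond to one spoken phrase handed to the live pipeline. The hangover keeps a short pause inside a sentence from splitting it into two shots.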

## The orchestra — parameter ledger (live at `/health`, ≤ 32B)

| Model | Role | Params | License | Runtime |
|---|---|---|---|---|
| nvidia/parakeet-ctc-1.1b | **Voice → text** (CTC; cannot hallucinate filler) | 1.10B | cc-by-4.0 | ✓ |
| openbmb/MiniCPM5-1B | **Director** (shot + style per scene) | 1.00B | apache-2.0 | ✓ |
| black-forest-labs/FLUX.2-klein-4B | **Painter** (4-step, CFG-free strip) | 4.00B | apache-2.0 | ✓ |
| depth-anything/Depth-Anything-V2-Small | Depth / relief | 0.025B | apache-2.0 | ✓ |
| openbmb/MiniCPM-V-4.6 | **Art Director** — names + describes the finished world from its pixels (post-process) | 1.30B | apache-2.0 | ✓* |
| openai/whisper-large-v3-turbo | ASR fallback (only if Parakeet fails to load) | 0.809B | mit | — |
| CohereLabs/tiny-aya-global-GGUF | Translator (Cohere, evaluated — not loaded) | 3.35B | cc-by-nc-4.0 | — |
| stabilityai/stable-audio-open-small | Ambient bed (not yet wired) | 0.341B | Stability Community | — |
| onnx-community/silero-vad | Voice activity (browser RMS does the live VAD) | 0.002B | mit | — |

**TOTAL: 6.13B / 32B live runtime** (Parakeet + MiniCPM Director + klein-4B + Depth-Anything — the four models on the live slot), verifiable at `/health`. The MiniCPM‑V Art Director (1.30B, ✓*) loads only at **session end** (post‑process), never on the live painter slot — so the live experience is a **6.13B** orchestra and the whole thing stays far under 32B.

## Sponsor integrations & badges (evidence-linked)

- **OpenBMB — load-bearing twice.** **MiniCPM5‑1B** is the live **Director** (it reads each phrase and picks the shot *and* the art style), and **MiniCPM‑V‑4.6** is the post‑process **Art Director** — it *looks at your finished painting* and names it, captions it, lists what it sees, and points the keepsake/3D camera at the most striking region. The world's variety and its narration are MiniCPM's work.
- **Black Forest Labs — FLUX.2 [klein] 4B.** The distilled **4‑step, CFG‑free** painter is what makes a live painterly scroll possible at all (~1.3 s per spoken phrase on the ZeroGPU slot).
- **NVIDIA — Parakeet‑CTC‑1.1b.** Alignment‑based CTC ASR: it emits blanks on silence and so *structurally cannot hallucinate* filler — the right tool for live, hands‑off narration.
- **Cohere.** We **evaluated** Tiny Aya and Cohere Transcribe; the live path transcribes the spoken language and does **not** translate; Aya is in the ledger as evaluated, not loaded.
- **Modal — the cohesive painterly look.** A style **LoRA** for FLUX.2‑klein was trained on **Modal** (`training/modal_lora/`, rank‑16, 1500 steps), published to the Hub, and **fused into the distilled painter at warm‑up** (scale 0.75) — **0B net runtime**, since it folds into klein's existing weights (no new model loaded). The trigger `lghtlm style` is prepended to every painter prompt; it is gated by `LIGHTLOOM_STYLE_LORA` (default on) and a load hiccup falls back to the un‑styled painter. The adapter is **public on the Hub** (`Efradeca/lightloom-style-lora`) so the fine‑tune is verifiable.
- **Off the Grid** — **zero cloud APIs at runtime**; `/health` declares the flags and a compliance test greps for cloud SDKs.
- **Off‑Brand** — a fully custom front end over `gradio.Server`: no stock Gradio components; the world is a hand‑written WebGL painterly scroll with a live model "orchestra" HUD, a cinematic keepsake modal, and a navigable 3D viewer.
- **Well‑Tuned** — the painterly **LoRA** above is a fine‑tune trained on Modal and **published on the Hub** (`Efradeca/lightloom-style-lora`), loaded live by the app — small models, fine‑tuned, punching far above their weight on a **6.13B** orchestra that paints, directs, depths, transcribes and (post‑hoc) *narrates from pixels*.
- **Field Notes** — the build write‑up, published as a Hugging Face blog post: **[I built a world you can talk into existence](https://huggingface.co/blog/build-small-hackathon/lightloom)**.

## Live vs pre-rendered (honesty notes)

Everything in the scroll is generated **live** in this Space (~25–30 s one‑time model warm‑up, covered by a pre‑rendered ambient scroll, then ~1.3 s per spoken phrase). The **Showcase** ("watch the showcase") is a panorama **pre‑rendered by this same engine** and bundled, clearly badged, so a visitor who has spent their ZeroGPU quota still sees the full experience instantly.

Known limits: the one‑time warm‑up; ZeroGPU anonymous quota is ~2 min/day; very long sessions can drift in style.

## Run locally

```bash
pip install -r requirements.txt
python app.py   # serves the gradio.Server app at http://localhost:7860
```

Code: **Apache‑2.0** (see [`LICENSE`](LICENSE)). Demo texts are original or public‑domain.

---

> **Judges:** if a live run hits the ZeroGPU quota, the on‑screen **Showcase** plays a full
> pre‑rendered world instantly, and the **demo video** above shows the live experience end‑to‑end.