---
title: Lightloom · speak your world into being
emoji: 🌅
colorFrom: indigo
colorTo: yellow
sdk: gradio
app_file: app.py
license: apache-2.0
pinned: true
short_description: Speak — your story unrolls as a living storyboard world.
tags:
  - track:wood
  - sponsor:openbmb
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:fieldnotes
  - thousand-token-wood
  - off-brand
  - best-minicpm-build
  - best-use-of-modal
  - best-demo
  - judges-wildcard
---

# 🌅 Lightloom — speak, and your world unrolls ahead of you

*An entry for the Hugging Face **Build Small** hackathon — Thousand Token Wood track.*

**Speak a story and a continuous, painterly world unrolls live, ahead of you** — a **living
storyboard** where every phrase you speak becomes the next shot of one unbroken mural, painted in
real time on **one ZeroGPU slot** by a handful of tiny **local** models. No cloud APIs.

The world *is* the interface: a full‑bleed, framework‑free WebGL canvas, not a stock Gradio form.

▶️ **Demo video:** [watch on YouTube](https://youtu.be/Dn3IYpoVS7k) · 📣 **Social post:** [on LinkedIn](https://www.linkedin.com/posts/efrain-deulofeu-9563a1223_buildsmall-huggingface-generativeai-ugcPost-7472424691769647104-rlZI/) · 📖 **Field notes:** [the build write‑up](https://huggingface.co/blog/build-small-hackathon/lightloom) · 👤 **Team:** [Efradeca](https://huggingface.co/Efradeca)

*(Built on Gradio · hosted as a Hugging Face Space · every model runs locally, ≤ 32B total.)*

---

## Why it's worth showing a friend (the 30 seconds)

You tap the mic and **talk**. As you speak, the **Director** turns each phrase into a shot, the
**Painter** outpaints the next strip of one continuous mural that *continues* the previous edge,
**Depth‑Anything** gives it relief, and the browser scrolls you through it — a world that keeps
**extending and breathing** while you narrate. When you're done, an **Art Director** *reads your
finished world from its own pixels*, **names** it, and films a calm keepsake — and you can **step
INTO it in real 3D** ("Explore in 3D", drag to peer around, all on your own GPU).

It is **delightful**, the **AI is load‑bearing** (no models → no world), the **concept is original**
(live voice → an endless painterly world + a VLM that narrates what it sees), and the **UI pushes
hard past stock Gradio** (a hand‑written WebGL scroll, an "orchestra" HUD that lights up each tiny
model, a cinematic keepsake, a navigable 3D viewer).

## How it works — the orchestra, in real time

```mermaid
flowchart LR
    V[🎙 Your voice] -->|NVIDIA Parakeet-CTC · transcribe| T[transcript]
    T -->|split into phrases| D[Director · MiniCPM]
    D -->|vivid scene + style per phrase| P[Painter · FLUX.2 klein · 4 steps]
    P --> I[panorama strips]
    I -->|Depth-Anything V2| Z[depth]
    I & Z -->|stream| C[🌍 living world · continuous painterly scroll]
    C -.->|at session end| A[Art Director · MiniCPM-V · names + films + 3D keepsake]
```

Phrases are cut from your voice **as you talk** (a browser‑side VAD), so the world keeps flowing
while you narrate. **One spoken phrase = one `@spaces.GPU` call** that paints a few strips continuing
the panorama (continuity is keyed per session on disk). Voice is optional — you can also **type a
story**, which feeds the **same** live pipeline, phrase by phrase.

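
The phrase cutting described above can be illustrated with a small energy-based segmenter. This is a minimal sketch, not the app's code: the real VAD runs in the browser, and the function names, the `threshold` of 0.02 and the `hangover` of 3 frames are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in [-1, 1])."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_phrases(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.02,  # assumed RMS level that counts as speech
    hangover: int = 3,        # assumed number of quiet frames that ends a phrase
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices of each phrase, end exclusive."""
    phrases: List[Tuple[int, int]] = []
    start = None  # index of the first voiced frame of the open phrase
    quiet = 0     # consecutive quiet frames seen inside the open phrase
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:
                # Close the phrase at the first quiet frame of the pause.
                phrases.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Audio ended mid-phrase: trim any trailing quiet frames.
        phrases.append((start, len(frames) - quiet))
    return phrases


loud, silent = [0.5] * 4, [0.0] * 4
demo = [silent, loud, loud, silent, silent, silent, loud, silent]
print(split_phrases(demo))
```

In the pipeline this README describes, each returned span would be transcribed and handed to a single GPU call; a pause shorter than `hangover` frames stays inside the current phrase.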

The live scroll shows the painted panorama with a subtle Depth‑Anything depth cue; at session end you
can step INTO the finished world as a **navigable depth‑displaced 3D mesh** ("Explore in 3D",
client‑GPU WebGL). Either way, each strip is a single still image given depth: **not** Gaussian‑splat
reconstruction and **not** video diffusion.

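
To make "depth‑displaced mesh" concrete, here is a minimal sketch of the general technique: a flat grid of vertices whose z coordinate is pushed out by a depth map. It is an illustration under assumptions, not the viewer's code (which runs as WebGL in the browser); `depth_to_mesh` and the `relief` factor are hypothetical names.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]
Triangle = Tuple[int, int, int]


def depth_to_mesh(
    depth: List[List[float]],
    relief: float = 0.3,  # assumed displacement strength
) -> Tuple[List[Vertex], List[Triangle]]:
    """Turn a rows x cols depth map (values in [0, 1]) into a displaced grid mesh."""
    rows = len(depth)
    cols = len(depth[0]) if rows else 0
    if rows < 2 or cols < 2 or any(len(row) != cols for row in depth):
        raise ValueError("depth must be a rectangular grid of at least 2x2")

    vertices: List[Vertex] = []
    for r in range(rows):
        for c in range(cols):
            x = c / (cols - 1) * 2 - 1   # left edge -1, right edge +1
            y = 1 - r / (rows - 1) * 2   # top edge +1, bottom edge -1
            z = depth[r][c] * relief     # displacement from the depth map
            vertices.append((x, y, z))

    triangles: List[Triangle] = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c             # top-left vertex of this grid cell
            triangles.append((i, i + cols, i + 1))
            triangles.append((i + 1, i + cols, i + cols + 1))
    return vertices, triangles


verts, tris = depth_to_mesh([[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]])
print(len(verts), "vertices,", len(tris), "triangles")
```

The painted strip would then be applied as a texture using the same grid coordinates, which is why each strip stays a single still image rather than a reconstructed scene.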
## The orchestra — parameter ledger (live at `/health`, ≤ 32B)

| Model | Role | Params | License | Runtime |
|---|---|---|---|---|
| nvidia/parakeet-ctc-1.1b | **Voice → text** (CTC; cannot hallucinate filler) | 1.10B | cc-by-4.0 | ✓ |
| openbmb/MiniCPM5-1B | **Director** (shot + style per scene) | 1.00B | apache-2.0 | ✓ |
| black-forest-labs/FLUX.2-klein-4B | **Painter** (4-step, CFG-free strip) | 4.00B | apache-2.0 | ✓ |
| depth-anything/Depth-Anything-V2-Small | Depth / relief | 0.025B | apache-2.0 | ✓ |
| openbmb/MiniCPM-V-4.6 | **Art Director** — names + describes the finished world from its pixels (post-process) | 1.30B | apache-2.0 | ✓* |
| openai/whisper-large-v3-turbo | ASR fallback (only if Parakeet fails to load) | 0.809B | mit | — |
| CohereLabs/tiny-aya-global-GGUF | Translator (Cohere, evaluated — not loaded) | 3.35B | cc-by-nc-4.0 | — |
| stabilityai/stable-audio-open-small | Ambient bed (not yet wired) | 0.341B | Stability Community | — |
| onnx-community/silero-vad | Voice activity (browser RMS does the live VAD) | 0.002B | mit | — |

**TOTAL: 6.13B / 32B live runtime** (Parakeet + MiniCPM Director + klein-4B + Depth-Anything — the
four models on the live slot), verifiable at `/health`. The MiniCPM‑V Art Director (1.30B, ✓*) loads
only at **session end** (post‑process), never on the live painter slot — so the live experience is a
**6.13B** orchestra and the whole thing stays far under 32B.

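
The ledger arithmetic can be checked by hand. The sketch below assumes only the parameter counts from the table; the dictionary and function names are illustrative and are not the `/health` payload, whose schema is not shown here.

```python
from decimal import Decimal, ROUND_HALF_UP

# Parameter counts in billions, copied from the ledger's live-slot rows.
LIVE_MODELS = {
    "nvidia/parakeet-ctc-1.1b": Decimal("1.10"),
    "openbmb/MiniCPM5-1B": Decimal("1.00"),
    "black-forest-labs/FLUX.2-klein-4B": Decimal("4.00"),
    "depth-anything/Depth-Anything-V2-Small": Decimal("0.025"),
}
ART_DIRECTOR = Decimal("1.30")  # MiniCPM-V, loaded only at session end
CAP_B = Decimal("32")           # the hackathon's parameter cap


def total_params(ledger: dict) -> Decimal:
    """Exact sum of the parameter counts in a ledger."""
    return sum(ledger.values(), Decimal("0"))


live = total_params(LIVE_MODELS)
live_rounded = live.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
with_art_director = live + ART_DIRECTOR

print(f"live: {live_rounded}B, with Art Director: {with_art_director}B, cap: {CAP_B}B")
```

`Decimal` is used so the half-way case rounds up to the quoted figure; binary floats with the built-in `round` can land on the other side of it.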
## Sponsor integrations & badges (evidence-linked)

- **OpenBMB — load-bearing twice.** **MiniCPM5‑1B** is the live **Director** (it reads each phrase and
  picks the shot *and* the art style), and **MiniCPM‑V‑4.6** is the post‑process **Art Director** — it
  *looks at your finished painting* and names it, captions it, lists what it sees, and points the
  keepsake/3D camera at the most striking region. The world's variety and its narration are MiniCPM's work.
- **Black Forest Labs — FLUX.2 [klein] 4B.** The distilled **4‑step, CFG‑free** painter is what makes a
  live painterly scroll possible at all (~1.3 s per spoken phrase on the ZeroGPU slot).
- **NVIDIA — Parakeet‑CTC‑1.1b.** Alignment‑based CTC ASR: it emits blanks on silence and so
  *structurally cannot hallucinate* filler — the right tool for live, hands‑off narration.
- **Cohere.** We **evaluated** Tiny Aya and Cohere Transcribe; the live path transcribes the spoken
  language and does **not** translate; Aya is in the ledger as evaluated, not loaded.
- **Modal — the cohesive painterly look.** A style **LoRA** for FLUX.2‑klein was trained on **Modal**
  (`training/modal_lora/`, rank‑16, 1500 steps), published to the Hub, and **fused into the distilled
  painter at warm‑up** (scale 0.75) — **0B net runtime**, since it folds into klein's existing weights
  (no new model loaded). The trigger `lghtlm style` is prepended to every painter prompt; it is gated by
  `LIGHTLOOM_STYLE_LORA` (default on) and a load hiccup falls back to the un‑styled painter. The adapter
  is **public on the Hub** (`Efradeca/lightloom-style-lora`) so the fine‑tune is verifiable.
- **Off the Grid** — **zero cloud APIs at runtime**; `/health` declares the flags and a compliance test
  greps for cloud SDKs.
- **Off‑Brand** — a fully custom front end over `gradio.Server`: no stock Gradio components; the world
  is a hand‑written WebGL painterly scroll with a live model "orchestra" HUD, a cinematic keepsake
  modal, and a navigable 3D viewer.
- **Well‑Tuned** — the painterly **LoRA** above is a fine‑tune trained on Modal and **published on the
  Hub** (`Efradeca/lightloom-style-lora`), loaded live by the app — small models, fine‑tuned, punching
  far above their weight on a **6.13B** orchestra that paints, directs, adds depth, transcribes and
  (post‑hoc) *narrates from pixels*.
- **Field Notes** — the build write‑up, published as a Hugging Face blog post:
  **[I built a world you can talk into existence](https://huggingface.co/blog/build-small-hackathon/lightloom)**.

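
Why fusing the LoRA adds no parameters at runtime comes down to one line of arithmetic: the low-rank update is added into the existing weight matrix, so the matrix keeps its shape. The sketch below shows that arithmetic on toy matrices. It is not the app's loading code: the helper names are hypothetical, and it assumes the adapter's internal alpha/rank factor is 1 so that only the 0.75 scale from the Modal bullet applies.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def fuse_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, scale: float = 0.75) -> Matrix:
    """Return weight + scale * (lora_b @ lora_a); the result has weight's shape."""
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def param_count(m: Matrix) -> int:
    """Number of scalar entries in a matrix."""
    return sum(len(row) for row in m)


# Toy example: a 2x2 weight with a rank-1 adapter (B is 2x1, A is 1x2).
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[2.0, 4.0]]

fused = fuse_lora(W, B, A)
print(fused, param_count(fused) == param_count(W))
```

After fusing, the adapter matrices can be discarded, which is the sense in which the styled painter costs the same parameter budget as the un-styled one.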
## Live vs pre-rendered (honesty notes)

Everything in the scroll is generated **live** in this Space (~25–30 s one‑time model warm‑up, covered
by a pre‑rendered ambient scroll, then ~1.3 s per spoken phrase). The **Showcase** ("watch the
showcase") is a panorama **pre‑rendered by this same engine** and bundled, clearly badged, so a visitor
who has spent their ZeroGPU quota still sees the full experience instantly. Known limits: the one‑time
warm‑up, a ZeroGPU anonymous quota of ~2 min/day, and style drift in very long sessions.

## Run locally

```bash
pip install -r requirements.txt
python app.py  # serves the gradio.Server app at http://localhost:7860
```

Code: **Apache‑2.0** (see [`LICENSE`](LICENSE)). Demo texts are original or public‑domain.

---

> **Judges:** if a live run hits the ZeroGPU quota, the on‑screen **Showcase** plays a full
> pre‑rendered world instantly, and the **demo video** above shows the live experience end‑to‑end.