---
title: Lightloom · speak your world into being
emoji: 🌅
colorFrom: indigo
colorTo: yellow
sdk: gradio
app_file: app.py
license: apache-2.0
pinned: true
short_description: Speak, and your story unrolls as a living storyboard world.
tags:
- track:wood
- sponsor:openbmb
- sponsor:modal
- achievement:offgrid
- achievement:welltuned
- achievement:offbrand
- achievement:fieldnotes
- thousand-token-wood
- off-brand
- best-minicpm-build
- best-use-of-modal
- best-demo
- judges-wildcard
---
# 🌅 Lightloom — speak, and your world unrolls ahead of you
*An entry for the Hugging Face **Build Small** hackathon — Thousand Token Wood track.*
**Speak a story and a continuous, painterly world unrolls live, ahead of you** — a **living
storyboard** where every phrase you speak becomes the next shot of one unbroken mural, painted in
real time on **one ZeroGPU slot** by a handful of tiny **local** models. No cloud APIs.
The world *is* the interface: a full‑bleed, framework‑free WebGL canvas, not a stock Gradio form.
▶️ **Demo video:** [watch on YouTube](https://youtu.be/Dn3IYpoVS7k) · 📣 **Social post:** [on LinkedIn](https://www.linkedin.com/posts/efrain-deulofeu-9563a1223_buildsmall-huggingface-generativeai-ugcPost-7472424691769647104-rlZI/) · 📖 **Field notes:** [the build write‑up](https://huggingface.co/blog/build-small-hackathon/lightloom) · 👤 **Team:** [Efradeca](https://huggingface.co/Efradeca)
*(Built on Gradio · hosted as a Hugging Face Space · every model runs locally, ≤ 32B total.)*
---
## Why it's worth showing a friend (the 30 seconds)
You tap the mic and **talk**. As you speak, the **Director** turns each phrase into a shot, the
**Painter** outpaints the next strip of one continuous mural that *continues* the previous edge,
**Depth‑Anything** gives it relief, and the browser scrolls you through it — a world that keeps
**extending and breathing** while you narrate. When you're done, an **Art Director** *reads your
finished world from its own pixels*, **names** it, and films a calm keepsake — and you can **step
INTO it in real 3D** ("Explore in 3D", drag to peer around, all on your own GPU).
It is **delightful**, the **AI is load‑bearing** (no models → no world), the **concept is original**
(live voice → an endless painterly world + a VLM that narrates what it sees), and the **UI pushes
hard past stock Gradio** (a hand‑written WebGL scroll, an "orchestra" HUD that lights up each tiny
model, a cinematic keepsake, a navigable 3D viewer).
## How it works — the orchestra, in real time
```mermaid
flowchart LR
V[🎙 Your voice] -->|NVIDIA Parakeet-CTC · transcribe| T[transcript]
T -->|split into phrases| D[Director · MiniCPM]
D -->|vivid scene + style per phrase| P[Painter · FLUX.2 klein · 4 steps]
P --> I[panorama strips]
I -->|Depth-Anything V2| Z[depth]
I & Z -->|stream| C[🌍 living world · continuous painterly scroll]
C -.->|at session end| A[Art Director · MiniCPM-V · names + films + 3D keepsake]
```
Phrases are cut from your voice **as you talk** (a browser‑side VAD), so the world keeps flowing
while you narrate. **One spoken phrase = one `@spaces.GPU` call** that paints a few strips continuing
the panorama (continuity is keyed per session on disk). Voice is optional — you can also **type a
story**, which feeds the **same** live pipeline, phrase by phrase.
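
The per-phrase flow above can be sketched in a few lines. This is a minimal illustration, not the app's actual code: `split_phrases`, `paint_phrase`, the JSON state layout and the three-strips-per-phrase default are all hypothetical names and values chosen for the example, and the painter itself is replaced by a counter so the sketch runs without a GPU.

```python
import json
import re
import tempfile
from pathlib import Path


def split_phrases(story: str) -> list[str]:
    """Cut a typed story into phrases at sentence-ending punctuation (assumed rule)."""
    parts = re.split(r"(?<=[.!?;])\s+", story.strip())
    return [p for p in parts if p]


def paint_phrase(session_id: str, phrase: str, state_dir: Path,
                 strips_per_phrase: int = 3) -> list[int]:
    """Stand-in for one per-phrase GPU call.

    In a real Space this is where a function decorated with @spaces.GPU would
    run the painter. Here we only record where the panorama left off, so the
    next call continues from the previous edge.
    """
    state_file = state_dir / f"{session_id}.json"
    if state_file.exists():
        state = json.loads(state_file.read_text())
    else:
        state = {"next_strip": 0, "phrases": []}

    start = state["next_strip"]
    painted = list(range(start, start + strips_per_phrase))

    # Persist continuity on disk, keyed by session, before returning.
    state["next_strip"] = start + strips_per_phrase
    state["phrases"].append(phrase)
    state_file.write_text(json.dumps(state))
    return painted


if __name__ == "__main__":
    story = "A lighthouse at dusk. The tide comes in! Gulls wheel overhead."
    with tempfile.TemporaryDirectory() as tmp:
        for phrase in split_phrases(story):
            print(phrase, "->", paint_phrase("demo-session", phrase, Path(tmp)))
```

Because the state lives on disk rather than in process memory, each call can be stateless: a second session id starts its own panorama at strip 0, while a returning session picks up at its last strip.
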
The live scroll shows the painted panorama with a subtle Depth‑Anything depth cue; at session end you
can step INTO the finished world as a **navigable depth‑displaced 3D mesh** ("Explore in 3D",
client‑GPU WebGL). Either way each strip is a single still image given depth — **not** Gaussian‑splat
reconstruction and **not** video diffusion.
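
To make "depth-displaced mesh" concrete, here is a minimal sketch of the general technique in Python. It is illustrative only: the real viewer runs in the browser on WebGL, and the function name `depth_to_mesh`, the `z_scale` value and the assumption that depth is normalised to [0, 1] are all hypothetical.

```python
def depth_to_mesh(depth, z_scale=0.3):
    """Turn a 2D depth map (rows of floats, assumed in [0, 1]) into a grid mesh.

    Returns (vertices, triangles). Each vertex is (x, y, z) with x and y in
    [0, 1] and z displaced by the depth value. Each triangle is a triple of
    indices into the vertex list.
    """
    rows, cols = len(depth), len(depth[0])
    if rows < 2 or cols < 2:
        raise ValueError("need at least a 2x2 depth map")

    # One vertex per depth sample, pushed along z by its depth.
    vertices = []
    for r in range(rows):
        for c in range(cols):
            x = c / (cols - 1)
            y = r / (rows - 1)
            vertices.append((x, y, depth[r][c] * z_scale))

    # Two triangles per grid cell.
    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c
            triangles.append((i, i + 1, i + cols))
            triangles.append((i + 1, i + cols + 1, i + cols))
    return vertices, triangles


if __name__ == "__main__":
    verts, tris = depth_to_mesh([[0.0, 0.5, 1.0], [1.0, 0.5, 0.0]])
    print(len(verts), "vertices,", len(tris), "triangles")
```

The strip's image would then be applied as a texture using the same (x, y) values as UV coordinates, which is why a single still plus a depth map is enough to peer around.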
## The orchestra — parameter ledger (live at `/health`, ≤ 32B)
| Model | Role | Params | License | Runtime |
|---|---|---|---|---|
| nvidia/parakeet-ctc-1.1b | **Voice → text** (CTC; cannot hallucinate filler) | 1.10B | cc-by-4.0 | ✓ |
| openbmb/MiniCPM5-1B | **Director** (shot + style per scene) | 1.00B | apache-2.0 | ✓ |
| black-forest-labs/FLUX.2-klein-4B | **Painter** (4-step, CFG-free strip) | 4.00B | apache-2.0 | ✓ |
| depth-anything/Depth-Anything-V2-Small | Depth / relief | 0.025B | apache-2.0 | ✓ |
| openbmb/MiniCPM-V-4.6 | **Art Director** — names + describes the finished world from its pixels (post-process) | 1.30B | apache-2.0 | ✓* |
| openai/whisper-large-v3-turbo | ASR fallback (only if Parakeet fails to load) | 0.809B | mit | — |
| CohereLabs/tiny-aya-global-GGUF | Translator (Cohere, evaluated — not loaded) | 3.35B | cc-by-nc-4.0 | — |
| stabilityai/stable-audio-open-small | Ambient bed (not yet wired) | 0.341B | Stability Community | — |
| onnx-community/silero-vad | Voice activity (browser RMS does the live VAD) | 0.002B | mit | — |
**TOTAL: 6.13B / 32B live runtime** (Parakeet + MiniCPM Director + klein-4B + Depth-Anything — the
four models on the live slot), verifiable at `/health`. The MiniCPM‑V Art Director (1.30B, ✓*) loads
only at **session end** (post‑process), never on the live painter slot — so the live experience is a
**6.13B** orchestra and the whole thing stays far under 32B.
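
The total can be checked by hand from the ledger rows. The sketch below repeats the table's parameter counts in millions (integers, to avoid float rounding); the variable names and the idea of summing them this way are illustrative and are not taken from the app's `/health` implementation.

```python
# Parameter counts from the ledger above, in millions of parameters.
LIVE_MODELS_M = {
    "nvidia/parakeet-ctc-1.1b": 1100,
    "openbmb/MiniCPM5-1B": 1000,
    "black-forest-labs/FLUX.2-klein-4B": 4000,
    "depth-anything/Depth-Anything-V2-Small": 25,
}
POST_PROCESS_MODELS_M = {"openbmb/MiniCPM-V-4.6": 1300}
BUDGET_M = 32_000  # the 32B cap

live_m = sum(LIVE_MODELS_M.values())                         # 6125 M, shown as 6.13B
with_post_m = live_m + sum(POST_PROCESS_MODELS_M.values())   # 7425 M

print(f"live slot:          {live_m} M of {BUDGET_M} M")
print(f"plus Art Director:  {with_post_m} M of {BUDGET_M} M")
print("within budget:", with_post_m <= BUDGET_M)
```

Even counting the post-process Art Director alongside the four live models, the sum is 7,425 M parameters, well inside the 32B cap.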
## Sponsor integrations & badges (evidence-linked)
- **OpenBMB — load-bearing twice.** **MiniCPM5‑1B** is the live **Director** (it reads each phrase and
picks the shot *and* the art style), and **MiniCPM‑V‑4.6** is the post‑process **Art Director** — it
*looks at your finished painting* and names it, captions it, lists what it sees, and points the
keepsake/3D camera at the most striking region. The world's variety and its narration are MiniCPM's work.
- **Black Forest Labs — FLUX.2 [klein] 4B.** The distilled **4‑step, CFG‑free** painter is what makes a
live painterly scroll possible at all (~1.3 s per spoken phrase on the ZeroGPU slot).
- **NVIDIA — Parakeet‑CTC‑1.1b.** Alignment‑based CTC ASR: it emits blanks on silence and so
*structurally cannot hallucinate* filler — the right tool for live, hands‑off narration.
- **Cohere.** We **evaluated** Tiny Aya and Cohere Transcribe. The live path transcribes the spoken
language and does **not** translate, so Aya is in the ledger as evaluated, not loaded.
- **Modal — the cohesive painterly look.** A style **LoRA** for FLUX.2‑klein was trained on **Modal**
(`training/modal_lora/`, rank‑16, 1500 steps), published to the Hub, and **fused into the distilled
painter at warm‑up** (scale 0.75) — **0B net runtime**, since it folds into klein's existing weights
(no new model loaded). The trigger `lghtlm style` is prepended to every painter prompt; it is gated by
`LIGHTLOOM_STYLE_LORA` (default on) and a load hiccup falls back to the un‑styled painter. The adapter
is **public on the Hub** (`Efradeca/lightloom-style-lora`) so the fine‑tune is verifiable.
- **Off the Grid.** **Zero cloud APIs at runtime**; `/health` declares the flags and a compliance test
greps for cloud SDKs.
- **Off‑Brand** — a fully custom front end over `gradio.Server`: no stock Gradio components; the world
is a hand‑written WebGL painterly scroll with a live model "orchestra" HUD, a cinematic keepsake
modal, and a navigable 3D viewer.
- **Well‑Tuned** — the painterly **LoRA** above is a fine‑tune trained on Modal and **published on the
Hub** (`Efradeca/lightloom-style-lora`), loaded live by the app — small models, fine‑tuned, punching
far above their weight on a **6.13B** orchestra that paints, directs, adds depth, transcribes and (post‑hoc)
*narrates from pixels*.
- **Field Notes** — the build write‑up, published as a Hugging Face blog post:
**[I built a world you can talk into existence](https://huggingface.co/blog/build-small-hackathon/lightloom)**.
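
The style-LoRA gating described in the Modal bullet above can be sketched as follows. Only the trigger string, the 0.75 scale and the `LIGHTLOOM_STYLE_LORA` variable name come from this README; the function names, the set of values treated as "off", the `", "` separator between trigger and scene, and the `load_lora` callable are assumptions made for the example.

```python
import os

STYLE_TRIGGER = "lghtlm style"   # from the README
LORA_SCALE = 0.75                # from the README


def style_lora_enabled(env=None) -> bool:
    """Default on. The set of 'off' values is an assumption for this sketch."""
    env = os.environ if env is None else env
    value = env.get("LIGHTLOOM_STYLE_LORA", "1").strip().lower()
    return value not in {"0", "false", "off", "no"}


def warm_up(load_lora, env=None) -> bool:
    """Try to fuse the adapter once at warm-up; return True if the painter is styled.

    `load_lora` is a hypothetical callable that fuses the adapter at the given
    scale and raises on failure. Any failure falls back to the un-styled painter.
    """
    if not style_lora_enabled(env):
        return False
    try:
        load_lora(LORA_SCALE)
    except Exception:
        return False
    return True


def build_painter_prompt(scene: str, styled: bool) -> str:
    """Prepend the trigger only when the adapter was actually fused."""
    return f"{STYLE_TRIGGER}, {scene}" if styled else scene


if __name__ == "__main__":
    styled = warm_up(lambda scale: None, env={})
    print(build_painter_prompt("a harbor at dawn", styled))
```

Tying the trigger to the warm-up result, rather than to the environment variable alone, keeps the prompt honest when the adapter fails to load: the un-styled painter never sees a trigger it was not trained on.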
## Live vs pre-rendered (honesty notes)
Everything in the scroll is generated **live** in this Space (~25–30 s one‑time model warm‑up, covered
by a pre‑rendered ambient scroll, then ~1.3 s per spoken phrase). The **Showcase** ("watch the
showcase") is a panorama **pre‑rendered by this same engine** and bundled, clearly badged, so a visitor
who has spent their ZeroGPU quota still sees the full experience instantly. Known limits: the one‑time
warm‑up, the ~2 min/day anonymous ZeroGPU quota, and style drift in very long sessions.
## Run locally
```bash
pip install -r requirements.txt
python app.py # serves the gradio.Server app at http://localhost:7860
```
Code: **Apache‑2.0** (see [`LICENSE`](LICENSE)). Demo texts are original or public‑domain.
---
> **Judges:** if a live run hits the ZeroGPU quota, the on‑screen **Showcase** plays a full
> pre‑rendered world instantly, and the **demo video** above shows the live experience end‑to‑end.