---
title: Lightloom · speak your world into being
emoji: 🌅
colorFrom: indigo
colorTo: yellow
sdk: gradio
app_file: app.py
license: apache-2.0
pinned: true
short_description: Speak, and your story unrolls as a living storyboard world.
tags:
  - track:wood
  - sponsor:openbmb
  - sponsor:modal
  - achievement:offgrid
  - achievement:welltuned
  - achievement:offbrand
  - achievement:fieldnotes
  - thousand-token-wood
  - off-brand
  - best-minicpm-build
  - best-use-of-modal
  - best-demo
  - judges-wildcard
---

🌅 Lightloom — speak, and your world unrolls ahead of you

An entry for the Hugging Face Build Small hackathon — Thousand Token Wood track.

Speak a story and a continuous, painterly world unrolls live, ahead of you — a living storyboard where every phrase you speak becomes the next shot of one unbroken mural, painted in real time on one ZeroGPU slot by a handful of tiny local models. No cloud APIs. The world is the interface: a full‑bleed, framework‑free WebGL canvas, not a stock Gradio form.

▶️ Demo video: watch on YouTube · 📣 Social post: on LinkedIn · 📖 Field notes: the build write‑up · 👤 Team: Efradeca

(Built on Gradio · hosted as a Hugging Face Space · every model runs locally, ≤ 32B total.)


Why it's worth showing a friend (the 30 seconds)

You tap the mic and talk. As you speak, the Director turns each phrase into a shot, the Painter outpaints the next strip of one continuous mural that continues the previous edge, Depth‑Anything gives it relief, and the browser scrolls you through it — a world that keeps extending and breathing while you narrate. When you're done, an Art Director reads your finished world from its own pixels, names it, and films a calm keepsake — and you can step INTO it in real 3D ("Explore in 3D", drag to peer around, all on your own GPU).

It is delightful, the AI is load‑bearing (no models → no world), the concept is original (live voice → an endless painterly world + a VLM that narrates what it sees), and the UI pushes hard past stock Gradio (a hand‑written WebGL scroll, an "orchestra" HUD that lights up each tiny model, a cinematic keepsake, a navigable 3D viewer).

How it works — the orchestra, in real time

flowchart LR
  V[🎙 Your voice] -->|NVIDIA Parakeet-CTC · transcribe| T[transcript]
  T -->|split into phrases| D[Director · MiniCPM]
  D -->|vivid scene + style per phrase| P[Painter · FLUX.2 klein · 4 steps]
  P --> I[panorama strips]
  I -->|Depth-Anything V2| Z[depth]
  I & Z -->|stream| C[🌍 living world · continuous painterly scroll]
  C -.->|at session end| A[Art Director · MiniCPM-V · names + films + 3D keepsake]

Phrases are cut from your voice as you talk (a browser‑side VAD), so the world keeps flowing while you narrate. One spoken phrase = one @spaces.GPU call that paints a few strips continuing the panorama (continuity is keyed per session on disk). Voice is optional — you can also type a story, which feeds the same live pipeline, phrase by phrase.
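The phrase cutting described above can be sketched as an energy gate. The real VAD runs in the browser (the ledger notes that browser RMS does the live VAD), so this Python version only illustrates the idea; the `threshold` and `hang_frames` values and both function names are assumptions, not the app's settings.

```python
import math


def rms(frame):
    """Root-mean-square level of one non-empty frame of samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_phrases(frames, threshold=0.02, hang_frames=3):
    """Return (start, end) frame-index pairs, end exclusive, one per phrase.

    A phrase opens on the first frame whose RMS reaches `threshold` and
    closes once `hang_frames` consecutive quiet frames have followed it.
    Both defaults are illustrative guesses.
    """
    phrases = []
    start = None   # index of the first loud frame of the open phrase
    quiet = 0      # consecutive quiet frames seen inside the open phrase
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hang_frames:
                # Close the phrase at the last loud frame, not at the pause.
                phrases.append((start, i - quiet + 1))
                start = None
                quiet = 0
    if start is not None:
        # Audio ended mid-phrase: flush it, trimming any trailing quiet frames.
        phrases.append((start, len(frames) - quiet))
    return phrases
```

Each returned pair would correspond to one spoken phrase, that is, one unit of work for the Director and Painter.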

The live scroll shows the painted panorama with a subtle DepthAnything depth cue; at session end you can step INTO the finished world as a navigable depth‑displaced 3D mesh ("Explore in 3D", client‑GPU WebGL). Either way each strip is a single still image given depth — not Gaussian‑splat reconstruction and not video diffusion.
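As a rough illustration of what "depth-displaced mesh" means here: lay a regular grid over the image and push each vertex along z by its depth value. The sketch below assumes a depth map normalised to [0, 1] with larger values nearer the viewer; `depth_to_mesh` and the `relief` factor are hypothetical names, and the Space's actual viewer is client-side WebGL, not Python.

```python
def depth_to_mesh(depth, relief=0.3):
    """Build a displaced grid mesh from a depth map.

    depth: list of rows (at least 2x2) of floats in [0, 1].
    Returns (vertices, triangles): vertices are (x, y, z) tuples with x and y
    in [-1, 1]; triangles are index triples into the vertex list.
    """
    rows, cols = len(depth), len(depth[0])
    vertices = []
    for r in range(rows):
        for c in range(cols):
            x = c / (cols - 1) * 2 - 1   # left edge -1, right edge +1
            y = 1 - r / (rows - 1) * 2   # top row +1, bottom row -1
            z = depth[r][c] * relief     # displacement toward the viewer
            vertices.append((x, y, z))
    triangles = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c             # top-left vertex of this grid cell
            triangles.append((i, i + cols, i + 1))
            triangles.append((i + 1, i + cols, i + cols + 1))
    return vertices, triangles
```

Texturing this mesh with the painted strip gives parallax when the camera moves, which is why a single still image plus depth is enough and no splat reconstruction or video model is involved.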

The orchestra — parameter ledger (live at /health, ≤ 32B)

| Model | Role | Params | License | Runtime |
|---|---|---|---|---|
| nvidia/parakeet-ctc-1.1b | Voice → text (CTC; cannot hallucinate filler) | 1.10B | cc-by-4.0 | |
| openbmb/MiniCPM5-1B | Director (shot + style per scene) | 1.00B | apache-2.0 | |
| black-forest-labs/FLUX.2-klein-4B | Painter (4-step, CFG-free strip) | 4.00B | apache-2.0 | |
| depth-anything/Depth-Anything-V2-Small | Depth / relief | 0.025B | apache-2.0 | |
| openbmb/MiniCPM-V-4.6 | Art Director: names + describes the finished world from its pixels (post-process) | 1.30B | apache-2.0 | ✓* |
| openai/whisper-large-v3-turbo | ASR fallback (only if Parakeet fails to load) | 0.809B | mit | |
| CohereLabs/tiny-aya-global-GGUF | Translator (Cohere, evaluated, not loaded) | 3.35B | cc-by-nc-4.0 | |
| stabilityai/stable-audio-open-small | Ambient bed (not yet wired) | 0.341B | Stability Community | |
| onnx-community/silero-vad | Voice activity (browser RMS does the live VAD) | 0.002B | mit | |

TOTAL: 6.13B / 32B live runtime (Parakeet + MiniCPM Director + klein-4B + Depth-Anything — the four models on the live slot), verifiable at /health. The MiniCPM‑V Art Director (1.30B, ✓*) loads only at session end (post‑process), never on the live painter slot — so the live experience is a 6.13B orchestra and the whole thing stays far under 32B.
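The total can be re-derived from the ledger rows. The sketch below copies the parameter counts from the table, in millions so the sum stays exact; the variable names are illustrative and this is not the code behind /health.

```python
# Parameter counts in millions, copied from the ledger above.
LIVE_M = {
    "nvidia/parakeet-ctc-1.1b": 1100,
    "openbmb/MiniCPM5-1B": 1000,
    "black-forest-labs/FLUX.2-klein-4B": 4000,
    "depth-anything/Depth-Anything-V2-Small": 25,
}
POST_PROCESS_M = {"openbmb/MiniCPM-V-4.6": 1300}
BUDGET_M = 32_000  # the 32B cap

live_total_m = sum(LIVE_M.values())                          # live slot only
with_art_director_m = live_total_m + sum(POST_PROCESS_M.values())

print(f"live slot: {live_total_m / 1000:.3f}B")
print(f"plus Art Director: {with_art_director_m / 1000:.3f}B")
print(f"under budget: {with_art_director_m <= BUDGET_M}")
```

The four live models sum to 6.125B, which the ledger reports as 6.13B; adding the session-end Art Director gives 7.425B, still well below the cap.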

Sponsor integrations & badges (evidence-linked)

  • OpenBMB — load-bearing twice. MiniCPM5‑1B is the live Director (it reads each phrase and picks the shot and the art style), and MiniCPM‑V‑4.6 is the post‑process Art Director — it looks at your finished painting and names it, captions it, lists what it sees, and points the keepsake/3D camera at the most striking region. The world's variety and its narration are MiniCPM's work.
  • Black Forest Labs — FLUX.2 [klein] 4B. The distilled 4‑step, CFG‑free painter is what makes a live painterly scroll possible at all (~1.3 s per spoken phrase on the ZeroGPU slot).
  • NVIDIA — Parakeet‑CTC‑1.1b. Alignment‑based CTC ASR: it emits blanks on silence and so structurally cannot hallucinate filler — the right tool for live, hands‑off narration.
  • Cohere. We evaluated Tiny Aya and Cohere Transcribe; the live path transcribes the spoken language and does not translate; Aya is in the ledger as evaluated, not loaded.
  • Modal — the cohesive painterly look. A style LoRA for FLUX.2‑klein was trained on Modal (training/modal_lora/, rank‑16, 1500 steps), published to the Hub, and fused into the distilled painter at warm‑up (scale 0.75) — 0B net runtime, since it folds into klein's existing weights (no new model loaded). The trigger lghtlm style is prepended to every painter prompt; it is gated by LIGHTLOOM_STYLE_LORA (default on) and a load hiccup falls back to the un‑styled painter. The adapter is public on the Hub (Efradeca/lightloom-style-lora) so the fine‑tune is verifiable.
  • Off the Grid: zero cloud APIs at runtime; /health declares the flags and a compliance test greps for cloud SDKs.
  • Off‑Brand — a fully custom front end over gradio.Server: no stock Gradio components; the world is a hand‑written WebGL painterly scroll with a live model "orchestra" HUD, a cinematic keepsake modal, and a navigable 3D viewer.
  • Well‑Tuned — the painterly LoRA above is a fine‑tune trained on Modal and published on the Hub (Efradeca/lightloom-style-lora), loaded live by the app — small models, fine‑tuned, punching far above their weight on a 6.13B orchestra that paints, directs, depths, transcribes and (post‑hoc) narrates from pixels.
  • Field Notes — the build write‑up, published as a Hugging Face blog post: I built a world you can talk into existence.
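The NVIDIA bullet above leans on a property of CTC decoding that a few lines can show. This is a generic greedy CTC collapse over per-frame labels, written for illustration only; it is not Parakeet's or NeMo's decoder, and the `_` blank symbol and character-level labels are assumptions made for readability.

```python
BLANK = "_"  # stand-in for the CTC blank label


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse per-frame labels: merge consecutive repeats, then drop blanks."""
    out = []
    prev = blank
    for label in frame_labels:
        # Emit a label only when it is not blank and differs from the
        # previous frame; a blank between two equal labels separates them.
        if label != blank and label != prev:
            out.append(label)
        prev = label
    return "".join(out)


print(ctc_greedy_decode(list("__hh_e_ll_ll_oo__")))  # speech frames
print(repr(ctc_greedy_decode(list("________"))))     # silence: all blanks
```

Output text can only come from frames where the acoustic model picked a non-blank label, so a run of silent frames decodes to an empty string instead of invented filler words.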

Live vs pre-rendered (honesty notes)

Everything in the scroll is generated live in this Space (~25–30 s one‑time model warm‑up, covered by a pre‑rendered ambient scroll, then ~1.3 s per spoken phrase). The Showcase ("watch the showcase") is a panorama pre‑rendered by this same engine and bundled, clearly badged, so a visitor who has spent their ZeroGPU quota still sees the full experience instantly. Known limits: the one‑time warm‑up; ZeroGPU anonymous quota is ~2 min/day; very long sessions can drift in style.

Run locally

pip install -r requirements.txt
python app.py   # serves the gradio.Server app at http://localhost:7860

Code: Apache‑2.0 (see LICENSE). Demo texts are original or public‑domain.


Judges: if a live run hits the ZeroGPU quota, the on‑screen Showcase plays a full pre‑rendered world instantly, and the demo video above shows the live experience end‑to‑end.