	---
	title: Lightloom · speak your world into being
	emoji: 🌅
	colorFrom: indigo
	colorTo: yellow
	sdk: gradio
	app_file: app.py
	license: apache-2.0
	pinned: true
	short_description: Speak — your story unrolls as a living storyboard world.
	tags:
	- track:wood
	- sponsor:openbmb
	- sponsor:modal
	- achievement:offgrid
	- achievement:welltuned
	- achievement:offbrand
	- achievement:fieldnotes
	- thousand-token-wood
	- off-brand
	- best-minicpm-build
	- best-use-of-modal
	- best-demo
	- judges-wildcard
	---

	# 🌅 Lightloom — speak, and your world unrolls ahead of you

	*An entry for the Hugging Face Build Small hackathon (Thousand Token Wood track).*

	Speak a story and a continuous, painterly world unrolls live, ahead of you — a **living
	storyboard** where every phrase you speak becomes the next shot of one unbroken mural, painted in
	real time on one ZeroGPU slot by a handful of tiny local models. No cloud APIs.
	The world is the interface: a full‑bleed, framework‑free WebGL canvas, not a stock Gradio form.

	▶️ Demo video: [watch on YouTube](https://youtu.be/Dn3IYpoVS7k) · 📣 Social post: [on LinkedIn](https://www.linkedin.com/posts/efrain-deulofeu-9563a1223_buildsmall-huggingface-generativeai-ugcPost-7472424691769647104-rlZI/) · 📖 Field notes: [the build write‑up](https://huggingface.co/blog/build-small-hackathon/lightloom) · 👤 Team: [Efradeca](https://huggingface.co/Efradeca)

	(Built on Gradio · hosted as a Hugging Face Space · every model runs locally, ≤ 32B total.)

	---

	## Why it's worth showing a friend (the 30 seconds)

	You tap the mic and talk. As you speak, the Director turns each phrase into a shot, the
	Painter outpaints the next strip of one continuous mural that continues the previous edge,
	Depth‑Anything gives it relief, and the browser scrolls you through it — a world that keeps
	extending and breathing while you narrate. When you're done, an Art Director *reads your
	finished world from its own pixels*, names it, and films a calm keepsake, and you can **step
	INTO it in real 3D** ("Explore in 3D", drag to peer around, all on your own GPU).

	It is delightful, the AI is load‑bearing (no models → no world), the concept is original
	(live voice → an endless painterly world + a VLM that narrates what it sees), and the **UI pushes
	hard past stock Gradio** (a hand‑written WebGL scroll, an "orchestra" HUD that lights up each tiny
	model, a cinematic keepsake, a navigable 3D viewer).

	## How it works — the orchestra, in real time

	```mermaid
	flowchart LR
	V[🎙 Your voice] -->|NVIDIA Parakeet-CTC · transcribe| T[transcript]
	T -->|split into phrases| D[Director · MiniCPM]
	D -->|vivid scene + style per phrase| P[Painter · FLUX.2 klein · 4 steps]
	P --> I[panorama strips]
	I -->|Depth-Anything V2| Z[depth]
	I & Z -->|stream| C[🌍 living world · continuous painterly scroll]
	C -.->|at session end| A[Art Director · MiniCPM-V · names + films + 3D keepsake]
	```

	Phrases are cut from your voice as you talk (a browser‑side VAD), so the world keeps flowing
	while you narrate. One spoken phrase = one `@spaces.GPU` call that paints a few strips continuing
	the panorama (continuity is keyed per session on disk). Voice is optional: you can also **type a
	story**, which feeds the same live pipeline, phrase by phrase.
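As a rough illustration of how an RMS-based voice activity detector can cut phrases, here is a minimal Python sketch. It is not the app's code: the live VAD runs in the browser, and the function names, the `threshold` of 0.02, and the `hangover` of 3 silent frames are all assumptions chosen for the example.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one audio frame (samples in -1.0..1.0)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def cut_phrases(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.02,  # assumed loudness gate, not the app's value
    hangover: int = 3,        # assumed number of quiet frames that ends a phrase
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices per phrase, with end exclusive."""
    phrases: List[Tuple[int, int]] = []
    start = None  # index where the current phrase began, if any
    quiet = 0     # consecutive quiet frames seen inside the current phrase
    for i, frame in enumerate(frames):
        if frame_rms(frame) >= threshold:
            if start is None:
                start = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= hangover:
                # Close the phrase at the last loud frame, not at the pause.
                phrases.append((start, i - quiet + 1))
                start, quiet = None, 0
    if start is not None:
        # Audio ended mid-phrase: trim any trailing quiet frames.
        phrases.append((start, len(frames) - quiet))
    return phrases
```

The hangover counter keeps a short breath from splitting one phrase into two: only a run of `hangover` quiet frames closes the phrase, and each closed phrase would then be handed to one GPU call.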

	The live scroll shows the painted panorama with a subtle Depth‑Anything depth cue; at session end you
	can step INTO the finished world as a navigable depth‑displaced 3D mesh ("Explore in 3D",
	client‑GPU WebGL). Either way each strip is a single still image given depth — not Gaussian‑splat
	reconstruction and not video diffusion.
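To make "depth-displaced mesh" concrete, the sketch below turns a small depth grid into vertices and triangles in plain Python. It is an illustration under assumptions, not the viewer's code (which runs in WebGL on the client): the function name, the `relief` factor, and the convention that depth is normalised to 0..1 with larger values nearer the camera are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vertex = Tuple[float, float, float]
Triangle = Tuple[int, int, int]


def depth_to_mesh(
    depth: Sequence[Sequence[float]],
    relief: float = 0.3,  # assumed displacement strength
) -> Tuple[List[Vertex], List[Triangle]]:
    """Build one vertex per depth sample and two triangles per grid cell."""
    rows = len(depth)
    cols = len(depth[0]) if rows else 0
    if rows < 2 or cols < 2:
        raise ValueError("depth grid must be at least 2x2")

    vertices: List[Vertex] = []
    for r in range(rows):
        for c in range(cols):
            x = c / (cols - 1) * 2 - 1   # left edge -1, right edge +1
            y = 1 - r / (rows - 1) * 2   # top edge +1, bottom edge -1
            z = depth[r][c] * relief     # push nearer pixels toward the viewer
            vertices.append((x, y, z))

    triangles: List[Triangle] = []
    for r in range(rows - 1):
        for c in range(cols - 1):
            i = r * cols + c  # index of the cell's top-left vertex
            triangles.append((i, i + cols, i + 1))
            triangles.append((i + 1, i + cols, i + cols + 1))
    return vertices, triangles
```

Because every vertex comes from a single still image plus its depth map, orbiting the camera reveals relief but no new content behind objects, which is consistent with the "not Gaussian‑splat reconstruction" note above.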

	## The orchestra — parameter ledger (live at `/health`, ≤ 32B)

	| Model | Role | Params | License | Runtime |
	|---|---|---|---|---|
	| nvidia/parakeet-ctc-1.1b | Voice → text (CTC; cannot hallucinate filler) | 1.10B | cc-by-4.0 | ✓ |
	| openbmb/MiniCPM5-1B | Director (shot + style per scene) | 1.00B | apache-2.0 | ✓ |
	| black-forest-labs/FLUX.2-klein-4B | Painter (4-step, CFG-free strip) | 4.00B | apache-2.0 | ✓ |
	| depth-anything/Depth-Anything-V2-Small | Depth / relief | 0.025B | apache-2.0 | ✓ |
	| openbmb/MiniCPM-V-4.6 | Art Director — names + describes the finished world from its pixels (post-process) | 1.30B | apache-2.0 | ✓* |
	| openai/whisper-large-v3-turbo | ASR fallback (only if Parakeet fails to load) | 0.809B | mit | — |
	| CohereLabs/tiny-aya-global-GGUF | Translator (Cohere, evaluated — not loaded) | 3.35B | cc-by-nc-4.0 | — |
	| stabilityai/stable-audio-open-small | Ambient bed (not yet wired) | 0.341B | Stability Community | — |
	| onnx-community/silero-vad | Voice activity (browser RMS does the live VAD) | 0.002B | mit | — |

	TOTAL: 6.13B / 32B live runtime (1.10 + 1.00 + 4.00 + 0.025 = 6.125B: Parakeet + MiniCPM Director +
	klein-4B + Depth-Anything, the four models on the live slot), verifiable at `/health`. The MiniCPM‑V
	Art Director (1.30B, ✓*) loads only at session end (post‑process), never on the live painter slot, so
	the live experience is a 6.13B orchestra. Even with the Art Director counted, the total is about
	7.43B, far under 32B.
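The ledger arithmetic can be re-checked with a few lines of Python. The sketch below copies the parameter counts from the rows marked ✓ or ✓* in the table; the data layout, the `"live"`/`"post"` stage labels, and the function names are assumptions for illustration and are not the actual `/health` payload.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import Optional

# Parameter counts in millions, copied from the ledger table above.
LEDGER = [
    ("nvidia/parakeet-ctc-1.1b", 1100, "live"),
    ("openbmb/MiniCPM5-1B", 1000, "live"),
    ("black-forest-labs/FLUX.2-klein-4B", 4000, "live"),
    ("depth-anything/Depth-Anything-V2-Small", 25, "live"),
    ("openbmb/MiniCPM-V-4.6", 1300, "post"),  # loads only at session end
]
BUDGET_MILLIONS = 32_000  # the 32B cap


def total_millions(stage: Optional[str] = None) -> int:
    """Sum parameters for one stage, or for every loaded model if stage is None."""
    return sum(m for _, m, s in LEDGER if stage is None or s == stage)


def as_billions(millions: int) -> str:
    """Format millions as billions with two decimals, rounding half up."""
    value = (Decimal(millions) / 1000).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    return f"{value}B"


live = total_millions("live")
everything = total_millions()
print(as_billions(live), as_billions(everything), everything <= BUDGET_MILLIONS)
```

Working in whole millions avoids floating-point drift when the exact sum (6.125B) sits on a rounding boundary.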

	## Sponsor integrations & badges (evidence-linked)

	- OpenBMB — load-bearing twice. MiniCPM5‑1B is the live Director (it reads each phrase and
	picks the shot and the art style), and MiniCPM‑V‑4.6 is the post‑process Art Director — it
	looks at your finished painting and names it, captions it, lists what it sees, and points the
	keepsake/3D camera at the most striking region. The world's variety and its narration are MiniCPM's work.
	- Black Forest Labs — FLUX.2 [klein] 4B. The distilled 4‑step, CFG‑free painter is what makes a
	live painterly scroll possible at all (~1.3 s per spoken phrase on the ZeroGPU slot).
	- NVIDIA — Parakeet‑CTC‑1.1b. Alignment‑based CTC ASR: it emits blanks on silence and so
	structurally cannot hallucinate filler — the right tool for live, hands‑off narration.
	- Cohere. We evaluated Tiny Aya and Cohere Transcribe; the live path transcribes the spoken
	language and does not translate; Aya is in the ledger as evaluated, not loaded.
	- Modal — the cohesive painterly look. A style LoRA for FLUX.2‑klein was trained on Modal
	(`training/modal_lora/`, rank‑16, 1500 steps), published to the Hub, and **fused into the distilled
	painter at warm‑up (scale 0.75) — 0B net runtime**, since it folds into klein's existing weights
	(no new model loaded). The trigger `lghtlm style` is prepended to every painter prompt. The LoRA is gated by
	`LIGHTLOOM_STYLE_LORA` (default on), and if the adapter fails to load the app falls back to the un‑styled painter. The adapter
	is public on the Hub (`Efradeca/lightloom-style-lora`) so the fine‑tune is verifiable.
	- Off the Grid — zero cloud APIs at runtime; `/health` declares the flags and a compliance test
	greps for cloud SDKs.
	- Off‑Brand — a fully custom front end over `gradio.Server`: no stock Gradio components; the world
	is a hand‑written WebGL painterly scroll with a live model "orchestra" HUD, a cinematic keepsake
	modal, and a navigable 3D viewer.
	- Well‑Tuned — the painterly LoRA above is a fine‑tune trained on Modal and **published on the
	Hub** (`Efradeca/lightloom-style-lora`), loaded live by the app — small models, fine‑tuned, punching
	far above their weight on a 6.13B orchestra that paints, directs, depths, transcribes and (post‑hoc)
	narrates from pixels.
	- Field Notes — the build write‑up, published as a Hugging Face blog post:
	[I built a world you can talk into existence](https://huggingface.co/blog/build-small-hackathon/lightloom).
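The Modal bullet above describes three behaviours: an environment flag that defaults to on, a fallback when the adapter fails to load, and a trigger phrase prepended to prompts. A minimal sketch of that control flow follows. Only `LIGHTLOOM_STYLE_LORA`, the trigger `lghtlm style`, and the 0.75 scale come from this README; the function names, the set of "off" values, the `load_and_fuse` callable, the comma separator, and the choice to drop the trigger when un-styled are assumptions.

```python
import os
from typing import Callable, Mapping, Optional

TRIGGER = "lghtlm style"
FUSE_SCALE = 0.75
_OFF_VALUES = {"0", "false", "off", "no"}  # assumed spellings of "disabled"


def style_lora_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """The flag defaults to on; only an explicit off value disables it."""
    env = os.environ if env is None else env
    return env.get("LIGHTLOOM_STYLE_LORA", "1").strip().lower() not in _OFF_VALUES


def warm_up_painter(
    load_and_fuse: Callable[[float], None],  # hypothetical loader hook
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Try to fuse the style LoRA; return True only if the styled painter is active."""
    if not style_lora_enabled(env):
        return False
    try:
        load_and_fuse(FUSE_SCALE)
    except Exception:
        # A load hiccup must not take the app down: use the un-styled painter.
        return False
    return True


def painter_prompt(scene: str, styled: bool) -> str:
    """Prepend the trigger phrase when the styled painter is active."""
    return f"{TRIGGER}, {scene}" if styled else scene
```

Returning a boolean from warm-up lets the caller decide once whether prompts carry the trigger, rather than re-checking the environment on every phrase.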

	## Live vs pre-rendered (honesty notes)

	Everything in the scroll is generated live in this Space (~25–30 s one‑time model warm‑up, covered
	by a pre‑rendered ambient scroll, then ~1.3 s per spoken phrase). The Showcase ("watch the
	showcase") is a panorama pre‑rendered by this same engine and bundled, clearly badged, so a visitor
	who has spent their ZeroGPU quota still sees the full experience instantly. Known limits: the one‑time
	warm‑up; ZeroGPU anonymous quota is ~2 min/day; very long sessions can drift in style.

	## Run locally

	```bash
	pip install -r requirements.txt
	python app.py # serves the gradio.Server app at http://localhost:7860
	```

	Code: Apache‑2.0 (see [`LICENSE`](LICENSE)). Demo texts are original or public‑domain.

	---

	> Judges: if a live run hits the ZeroGPU quota, the on‑screen Showcase plays a full
	> pre‑rendered world instantly, and the demo video above shows the live experience end‑to‑end.
