echo-plantain
A LoRA adapter on FLUX.2 Klein (4B) that predicts the magnitude spectrogram of a room impulse response from a top-down schematic of the room. Reframes acoustic modeling as image-to-image generation: the source image is a schematic showing room geometry plus source and listener positions, the target is the RIR spectrogram in RGB, and inverting the bijective encoding recovers a mono RIR suitable for audio convolution.
This adapter tests whether the recipe from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329) extends to physics-grounded prediction tasks where the input is a 2D image and the output is a signal that captures the response of a physical system.
Method
- Reframe room acoustics as image-to-image. Source: a 768 × 768 top-down schematic of a rectangular room with the audio source rendered as a red ⊕ glyph, the listener as a blue ⊙ glyph, and floor brightness encoding surface absorption (lighter = more reflective). Target: the room impulse response, computed via the image-source method, encoded as an RGB spectrogram.
- Bijective magnitude↔RGB encoding. Linear-amplitude STFT magnitude → dB clipped to [−100, 0] → normalized u ∈ [0, 1] → 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The wide dB range, compared with typical speech encodings, captures the full RIR dynamic range from direct-arrival peak to late-reverberation tail.
- Audio params. 16 kHz, n_fft = 1024, hop = 256, 1-second clips. The STFT (513 frequency bins × 63 time frames) is placed top-left in a 768 × 768 canvas with silence padding.
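The magnitude↔RGB bijection above can be sketched as piecewise-linear interpolation along the corner path, with decoding by nearest-segment projection. This is an illustrative sketch, not the repository's exact implementation; the function names are hypothetical:

```python
import numpy as np

# Corners of the 7-segment Hamiltonian path through the RGB cube:
# black -> blue -> cyan -> green -> yellow -> red -> magenta -> white.
PATH = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=np.float64)

def db_to_rgb(db, db_min=-100.0, db_max=0.0):
    """Map dB values (clipped to [db_min, db_max]) to RGB in [0, 1]."""
    u = (np.clip(db, db_min, db_max) - db_min) / (db_max - db_min)
    t = u * (len(PATH) - 1)                       # position along the 7 segments
    i = np.minimum(t.astype(int), len(PATH) - 2)  # segment index
    frac = (t - i)[..., None]                     # position within the segment
    return PATH[i] * (1 - frac) + PATH[i + 1] * frac

def rgb_to_db(rgb, db_min=-100.0, db_max=0.0):
    """Invert by projecting each RGB value onto the nearest point of the path."""
    rgb = np.asarray(rgb, dtype=np.float64)
    best_d = np.full(rgb.shape[:-1], np.inf)
    best_u = np.zeros(rgb.shape[:-1])
    for i in range(len(PATH) - 1):
        a, b = PATH[i], PATH[i + 1]
        ab = b - a
        s = np.clip((rgb - a) @ ab / (ab @ ab), 0.0, 1.0)  # param along segment
        d = np.sum((rgb - (a + s[..., None] * ab)) ** 2, axis=-1)
        u = (i + s) / (len(PATH) - 1)
        best_u = np.where(d < best_d, u, best_u)
        best_d = np.minimum(d, best_d)
    return db_min + best_u * (db_max - db_min)
```

The projection-based decoder makes the inverse robust to small off-path colors in generated images: each pixel snaps to the closest point on the path before being mapped back to dB.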
Training data: 10,000 randomly-generated rectangular rooms via pyroomacoustics. Dimensions uniform on 3–12 m × 3–12 m with 2.4–4.0 m ceiling; surface absorption uniform on [0.05, 0.50]; source and listener positions uniform inside the room with minimum 0.5 m separation. RIRs computed by image-source method up to reflection order 6.
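The sampling step of the dataset generation can be sketched in plain numpy. The function name and the 2-D treatment of positions are illustrative (source/listener heights are not specified in this card); the actual script feeds these parameters to pyroomacoustics:

```python
import numpy as np

def sample_room(rng):
    """Sample one training-room configuration from the stated distributions."""
    lx, ly = rng.uniform(3.0, 12.0, size=2)   # floor plan, metres
    h = rng.uniform(2.4, 4.0)                 # ceiling height, metres
    absorption = rng.uniform(0.05, 0.50)      # uniform surface absorption
    while True:                               # rejection-sample positions
        src = rng.uniform([0.0, 0.0], [lx, ly])
        lst = rng.uniform([0.0, 0.0], [lx, ly])
        if np.linalg.norm(src - lst) >= 0.5:  # minimum 0.5 m separation
            return {"dims": (lx, ly, h), "absorption": absorption,
                    "source": src, "listener": lst}
```

Rejection sampling on the separation constraint keeps the position distribution uniform over the admissible pairs rather than biasing listeners away from sources.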
Status
Training in progress. Weights will be added when complete.
Training
| Setting | Value |
| --- | --- |
| Base | black-forest-labs/FLUX.2-klein-base-4B |
| Adapter | LoRA, rank 256 on transformer attention + rank 32 on text encoder |
| Resolution | 768 × 768 |
| Batch size | 4 |
| Optimizer | AdamW, lr 1e-4, cosine schedule, 300-step warmup |
| Max steps | 15,000 |
| Mixed precision | bf16 |
| Training data | 10,000 synthetic rooms (pyroomacoustics, image-source method, max order 6) |
| Audio params | 16 kHz, n_fft 1024, hop 256, 1-second RIR clips |
| Spectrogram encoding | Linear magnitude → dB clipped to [−100, 0] → 7-segment Hamiltonian RGB-cube path |
Usage
```python
import torch
from PIL import Image
from diffusers import Flux2KleinPipeline

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/echo-plantain")

# A top-down schematic of the target room (see `render_schematic.py` for the
# renderer convention: walls as outline, source as red ⊕, listener as blue ⊙,
# floor brightness encoding absorption).
schematic = Image.open("room_schematic.png").convert("RGB").resize((768, 768))

prompt = (
    "Generate a room impulse response spectrogram for the depicted space. "
    "Time on horizontal axis (early reflections at left, late reverb tail extending right), "
    "frequency on vertical axis. Energy encoded in RGB along a Hilbert path through "
    "the color cube: black is below noise floor, blue/cyan is faint reflections, "
    "green/yellow is strong reflections, red/magenta is direct-arrival energy."
)

img = pipe(
    image=schematic, prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]
```
The decoder (RGB → magnitude → mono RIR) is in `decode_rir.py`. The recovered RIR can be convolved with any dry signal to apply the predicted room reverb.
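`decode_rir.py` is not reproduced here. Because the model predicts only the magnitude, the decoder must estimate phase before inverting the STFT; one standard choice, shown here as a sketch (the repository's script may use a different method), is Griffin-Lim via scipy:

```python
import numpy as np
from scipy.signal import istft, stft

def griffin_lim(mag, fs=16000, n_fft=1024, hop=256, n_iter=32):
    """Estimate phase for an STFT magnitude by iterative reprojection,
    starting from zero phase, then invert to a time-domain signal."""
    noverlap = n_fft - hop
    phase = np.zeros_like(mag)
    for _ in range(n_iter):
        # Impose the target magnitude, go to time domain and back,
        # and keep only the resulting phase.
        _, x = istft(mag * np.exp(1j * phase), fs=fs,
                     nperseg=n_fft, noverlap=noverlap)
        _, _, spec = stft(x, fs=fs, nperseg=n_fft, noverlap=noverlap)
        phase = np.angle(spec[:, :mag.shape[1]])  # keep frame count aligned
    _, x = istft(mag * np.exp(1j * phase), fs=fs,
                 nperseg=n_fft, noverlap=noverlap)
    return x
```

For an RIR the direct-arrival transient dominates, so relatively few iterations typically suffice compared with speech or music reconstruction.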
License
The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.
Training data attribution
The training data is fully synthetic, generated at preparation time from random rectangular room geometries via the pyroomacoustics Python library (Scheibler, Bezzam, Dokmanić, 2018). Pyroomacoustics is distributed under the MIT License. No external dataset is required to reproduce the training corpus; the dataset-generation script is included in this repository.
Base model
Base model FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.
References
- Gabeur, Long, Peng, et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
- Scheibler, Bezzam, Dokmanić. Pyroomacoustics: A Python package for audio room simulation and array processing algorithms. ICASSP 2018.
- Allen, Berkley. Image method for efficiently simulating small-room acoustics. JASA 1979.