echo-plantain

A LoRA adapter on FLUX.2 Klein (4B) that predicts the magnitude spectrogram of a room impulse response (RIR) from a top-down schematic of the room. Reframes acoustic modeling as image-to-image generation: the source image is a schematic showing room geometry plus source and listener positions, the target is the RIR spectrogram in RGB, and inverting the encoding recovers the magnitude spectrogram, from which a mono RIR suitable for audio convolution is reconstructed.

This adapter tests whether the recipe from Image Generators are Generalist Vision Learners (Gabeur et al., 2026; arXiv:2604.20329) extends to physics-grounded prediction tasks where the input is a 2D image and the output is a signal that captures the response of a physical system.

Method

  1. Reframe room acoustics as image-to-image. Source: a 768 × 768 top-down schematic of a rectangular room with the audio source rendered as a red ⊕ glyph, the listener as a blue ⊙ glyph, and floor brightness encoding surface absorption (lighter = more reflective). Target: the room impulse response, computed via the image-source method, encoded as an RGB spectrogram.
  2. Bijective magnitude↔RGB encoding. Linear-amplitude STFT magnitude → dB clipped to [−100, 0] → normalized to u ∈ [0, 1] → 7-segment Hamiltonian path through the corners of the RGB cube (black → blue → cyan → green → yellow → red → magenta → white). The wider dB range relative to speech encodings captures the full RIR dynamic range from the direct-arrival peak to the late-reverberation tail.
  3. Audio params. 16 kHz, n_fft = 1024, hop = 256, 1-second clips. STFT (513 frequency bins × 63 time frames) placed top-left in a 768 × 768 canvas with silence padding.
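The forward half of the step-2 encoding can be sketched as a piecewise-linear walk along the eight cube corners. This is a minimal illustration, not the repository's actual encoder: the function name `magnitude_to_rgb`, the 1e-10 log floor, and the float-RGB output are all assumptions.

```python
import numpy as np

# Corner sequence of the 7-segment Hamiltonian path through the RGB cube:
# black → blue → cyan → green → yellow → red → magenta → white.
_CORNERS = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=np.float64)

def magnitude_to_rgb(mag, db_floor=-100.0):
    """Linear STFT magnitude → dB in [db_floor, 0] → u in [0, 1] → RGB."""
    db = 20.0 * np.log10(np.maximum(mag, 1e-10))       # avoid log(0)
    u = np.clip((db - db_floor) / -db_floor, 0.0, 1.0)
    t = u * 7.0                                        # position along 7 segments
    seg = np.minimum(t.astype(int), 6)                 # segment index 0..6
    frac = t - seg                                     # fraction within segment
    lo, hi = _CORNERS[seg], _CORNERS[seg + 1]
    return lo + frac[..., None] * (hi - lo)            # float RGB in [0, 1]
```

Silence (u = 0) maps to black and the 0 dB peak to white, so the direct arrival saturates the bright end of the path while the reverberation tail fades through the dark corners.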

Training data: 10,000 randomly generated rectangular rooms via pyroomacoustics. Dimensions uniform on 3–12 m × 3–12 m with 2.4–4.0 m ceiling; surface absorption uniform on [0.05, 0.50]; source and listener positions uniform inside the room with minimum 0.5 m separation. RIRs computed by image-source method up to reflection order 6.
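The sampling recipe above can be sketched as follows. The 0.3 m wall margin, both function names, and the deferred pyroomacoustics import are illustrative assumptions; the repository's own generation script is authoritative.

```python
import numpy as np

def sample_room_params(rng):
    """Draw one room following the corpus recipe (illustrative sketch)."""
    lx, ly = rng.uniform(3.0, 12.0, size=2)       # floor plan, metres
    lz = rng.uniform(2.4, 4.0)                    # ceiling height
    absorption = rng.uniform(0.05, 0.50)          # uniform surface absorption
    margin = 0.3                                  # assumed wall clearance
    while True:
        src = rng.uniform([margin] * 3, [lx - margin, ly - margin, lz - margin])
        mic = rng.uniform([margin] * 3, [lx - margin, ly - margin, lz - margin])
        if np.linalg.norm(src - mic) >= 0.5:      # minimum 0.5 m separation
            return (lx, ly, lz), absorption, src, mic

def simulate_rir(params, fs=16000, max_order=6):
    """Image-source RIR via pyroomacoustics (import deferred to call time)."""
    import pyroomacoustics as pra
    (lx, ly, lz), absorption, src, mic = params
    room = pra.ShoeBox([lx, ly, lz], fs=fs,
                       materials=pra.Material(absorption), max_order=max_order)
    room.add_source(src)
    room.add_microphone(mic)
    room.compute_rir()
    return np.asarray(room.rir[0][0])
```

Rejection sampling on the separation constraint keeps positions uniform over the admissible region rather than biasing them toward the walls.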

Status

Training in progress. Weights will be added when complete.

Training

Base: black-forest-labs/FLUX.2-klein-base-4B
Adapter: LoRA, rank 256 on transformer attention + rank 32 on text encoder
Resolution: 768 × 768
Batch size: 4
Optimizer: AdamW, lr 1e-4, cosine schedule, 300-step warmup
Max steps: 15 000
Mixed precision: bf16
Training data: 10 000 synthetic rooms (pyroomacoustics, image-source method, max order 6)
Audio params: 16 kHz, n_fft 1024, hop 256, 1-second RIR clips
Spectrogram encoding: linear magnitude → dB clipped to [−100, 0] → 7-segment Hamiltonian path through the RGB-cube corners

Usage

import torch
from PIL import Image
from diffusers import Flux2KleinPipeline

pipe = Flux2KleinPipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B", torch_dtype=torch.bfloat16,
).to("cuda")
pipe.load_lora_weights("phanerozoic/echo-plantain")

# A top-down schematic of the target room (see `render_schematic.py` for the
# renderer convention: walls as outline, source as red ⊕, listener as blue ⊙,
# floor brightness encoding absorption).
schematic = Image.open("room_schematic.png").convert("RGB").resize((768, 768))

prompt = (
    "Generate a room impulse response spectrogram for the depicted space. "
    "Time on horizontal axis (early reflections at left, late reverb tail extending right), "
    "frequency on vertical axis. Energy encoded in RGB along a Hamiltonian path through "
    "the corners of the color cube: black is below noise floor, blue/cyan is faint reflections, "
    "green/yellow is strong reflections, red/magenta is direct-arrival energy."
)
img = pipe(
    image=schematic, prompt=prompt, height=768, width=768,
    guidance_scale=4.0, num_inference_steps=20,
).images[0]

The decoder (RGB → magnitude → mono RIR) is in `decode_rir.py`. The recovered RIR can be convolved with any dry signal to apply the predicted room reverb.
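The repository's decoder is authoritative; as an illustration of the inversion, the sketch below projects each pixel onto the nearest segment of the corner path to recover u, maps back to linear magnitude, and then (an assumption about the phase step, which the magnitude-only encoding discards) uses librosa's Griffin-Lim for phase retrieval.

```python
import numpy as np

# Same corner order as the encoder:
# black → blue → cyan → green → yellow → red → magenta → white.
_CORNERS = np.array([
    [0, 0, 0], [0, 0, 1], [0, 1, 1], [0, 1, 0],
    [1, 1, 0], [1, 0, 0], [1, 0, 1], [1, 1, 1],
], dtype=np.float64)

def rgb_to_magnitude(rgb, db_floor=-100.0):
    """Project each pixel onto the nearest path segment, then u → dB → linear."""
    px = np.asarray(rgb, dtype=np.float64).reshape(-1, 3)
    best_u = np.zeros(len(px))
    best_d = np.full(len(px), np.inf)
    for i in range(7):                              # 7 segments between 8 corners
        a, b = _CORNERS[i], _CORNERS[i + 1]
        ab = b - a
        t = np.clip((px - a) @ ab / ab.dot(ab), 0.0, 1.0)
        proj = a + t[:, None] * ab                  # closest point on segment i
        d = np.sum((px - proj) ** 2, axis=1)
        closer = d < best_d
        best_d[closer] = d[closer]
        best_u[closer] = (i + t[closer]) / 7.0      # path position in [0, 1]
    db = db_floor * (1.0 - best_u.reshape(np.shape(rgb)[:-1]))
    return 10.0 ** (db / 20.0)

def magnitude_to_rir(mag, n_iter=32):
    """Magnitude-only spectrogram → mono RIR via Griffin-Lim (assumed librosa dep)."""
    import librosa                                  # deferred import
    return librosa.griffinlim(mag, n_iter=n_iter,
                              n_fft=1024, hop_length=256, win_length=1024)
```

Per the padding convention above, the 513 × 63 STFT region would be cropped out of the generated 768 × 768 canvas before decoding.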

License

The LoRA adapter weights in this repository are released under the Apache License 2.0, matching the license of the base model FLUX.2 Klein 4B.

Training data attribution

The training data is fully synthetic, generated at preparation time from random rectangular room geometries via the pyroomacoustics Python library (Scheibler, Bezzam, Dokmanić, 2018). Pyroomacoustics is distributed under the MIT License. No external dataset is required to reproduce the training corpus; the dataset-generation script is included in this repository.

Base model

Base model FLUX.2 Klein 4B is distributed by Black Forest Labs under the Apache License 2.0. See https://huggingface.co/black-forest-labs/FLUX.2-klein-base-4B for the original model card.

References

  • Gabeur, Long, Peng, et al. Image Generators are Generalist Vision Learners. arXiv:2604.20329 (2026).
  • Scheibler, Bezzam, Dokmanić. Pyroomacoustics: A Python package for audio room simulation and array processing algorithms. ICASSP 2018.
  • Allen, Berkley. Image method for efficiently simulating small-room acoustics. JASA 1979.