---
license: other
license_name: nscl-a2sb-and-polyform-nc
license_link: https://raw.githubusercontent.com/NVIDIA/diffusion-audio-restoration/refs/heads/main/LICENSE
tags:
- audio
- audio-restoration
- schrodinger-bridge
- diffusion
- festival-audio
- non-commercial
library_name: pytorch
pipeline_tag: audio-to-audio
---

# Soundboard

Schrödinger Bridge denoiser fine-tuned for musical recording audio restoration — recovers a soundboard-style mix from heavily corrupted audience recordings (room reverb + audience-mic blend + lossy codec artifacts).

Fine-tuned from NVIDIA's [A2SB](https://huggingface.co/nvidia/audio_to_audio_schrodinger_bridge) (`twosplit_0.5_1.0` split) on a synthetic-corruption training pipeline driven by **profile-based augmentation**: corruption parameters are calibrated from real (clean, festival-recording) pairs and sampled at training time from the recovered distribution. See [Locutius](https://github.com/protodotdesign/locutius) for the full corruption chain, profiling, and training scaffold.
## Quick facts

| | |
|---|---|
| Architecture | AttnUNetF (565.5M params) |
| Audio format | 44.1 kHz, 2-channel, 32-bit float |
| Segment length | 130560 samples (2.96 s) |
| STFT | n_fft=2048, hop=512, window=hann |
| Representation | 3-channel `[mag^0.25, cos(phase), sin(phase)]` |
| Trained at step | 50,000 |
| Base checkpoint | NVIDIA A2SB `twosplit_0.5_1.0` |
| Checkpoint size | 2.1 GB |
| Diffusion | Schrödinger Bridge, β_max=1.0 |

## Usage

Load with the [Locutius](https://github.com/protodotdesign/locutius) training package:

```python
import torch
from huggingface_hub import hf_hub_download

from locutius_train.config import TrainConfig
from locutius_train.network import AttnUNetF, SinusoidalTemporalEmbedding
from locutius_train.diffusion import Diffusion
from locutius_train.representation import WaveformToInput, InputToWaveform
from locutius_train.restore import restore_spectrogram

# Download the fine-tuned checkpoint from the Hub.
ckpt_path = hf_hub_download(repo_id="protodotdesign/Soundboard", filename="model.pt")
sd = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Rebuild the network exactly as configured for training, then load the weights.
cfg = TrainConfig()
model = AttnUNetF(
    n_updown_levels=cfg.model.n_updown_levels,
    in_channels=cfg.model.in_channels,
    hidden_channels=list(cfg.model.hidden_channels),
    out_channels=cfg.model.out_channels,
    emb_channels=cfg.diffusion.n_timestep_channels,
    band_embedding_dim=cfg.model.band_embedding_dim,
    n_attn_heads=cfg.model.n_attn_heads,
    attention_levels=list(cfg.model.attention_levels),
    use_attn_input_norm=cfg.model.use_attn_input_norm,
    num_res_blocks=cfg.model.num_res_blocks,
).to("cuda").eval()
model.load_state_dict(sd["model"])
```

See `restore.py` in the Locutius repo for a complete CLI that takes a clean source, applies the calibrated festival-corruption profile, and runs the reverse Schrödinger Bridge to produce a restored output.

## Calibrated corruption profile

This model was trained against a single calibrated profile recovered from a real (studio FLAC, festival M4A) pair via per-kick local Wiener deconvolution.
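Each training step draws a fresh corruption setting from the profile's calibrated ranges. A minimal sketch of that sampling step, with hypothetical helper names — the real Locutius corruption chain additionally applies the venue impulse response and a lossy-codec pass:

```python
import random

def sample_corruption_params(profile: dict, rng: random.Random) -> dict:
    """Draw one fresh corruption setting from the profile's calibrated ranges.

    Hypothetical helper for illustration; not the actual Locutius API.
    """
    def draw(key: str) -> float:
        lo, hi = profile[key]
        return rng.uniform(lo, hi)

    return {
        "delay_ms": draw("delay_ms_range"),        # pre-delay of the room path
        "studio_gain": draw("studio_gain_range"),  # dry/soundboard blend level
        "room_gain": draw("room_gain_range"),      # convolved-room blend level
    }

# Ranges as calibrated for the bundled profile.
profile = {
    "delay_ms_range": [15.0, 25.0],
    "studio_gain_range": [0.6, 0.7],
    "room_gain_range": [0.55, 0.65],
}
params = sample_corruption_params(profile, random.Random(0))
```

Re-sampling like this on every step is what exposes the model to many distinct delay/blend combinations while keeping the overall venue character fixed.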
The profile is bundled in `profile.json`:

```json
{
  "name": "edc_festival",
  "ir_path": "../impulses/EchoThief/Brutalism/San Diego Supercomputer Center Outdoor Patio California.wav",
  "delay_ms_range": [15.0, 25.0],
  "studio_gain_range": [0.6, 0.7],
  "room_gain_range": [0.55, 0.65]
}
```

Each training-step corruption draws fresh values from these ranges, so the model was exposed to ~50,000 distinct delay/blend combinations within the same venue character.

## Training data

Trained on a focused subset of electronic music FLACs. **No festival recordings or other licensed audio were stored or distributed** — only the studio source material was used; festival-corrupted versions were synthesized on the fly from the calibrated profile during each training step.

## Limitations

- **Single profile**: trained against one calibrated venue (`edc_festival`). Performance on festival recordings from very different venues or mix chains will degrade.
- **Electronic music bias**: the training set was EDM-heavy. Restoration quality on rock, classical, or vocal-led material may be uneven.
- **No crowd-noise model**: the calibrated profile didn't include additive crowd noise (no real crowd recordings were available during calibration). Recordings with heavy crowd vocals may show residual artifacts.
- **Non-commercial use only** — see the license below.

## License

Dual non-commercial license:

- [NVIDIA Source Code License for A2SB](LICENSE.NSCL-A2SB) — the upstream license inherited from the A2SB base checkpoint
- [PolyForm Noncommercial 1.0.0](LICENSE.PolyForm-NC) — additional terms on top (source availability + patent retaliation)

You must comply with **both** licenses. Use is restricted to research and evaluation only — no commercial use is permitted. See [LICENSING.md](https://github.com/protodotdesign/locutius/blob/main/LICENSING.md) for the full plain-English breakdown.
## Citation

If you use this model in research, please cite the upstream A2SB paper and reference this fine-tune:

```bibtex
@misc{soundboard,
  title={Soundboard: festival audio restoration via profile-calibrated Schrödinger Bridge fine-tuning},
  author={Locutius},
  year={2026},
  howpublished={\url{https://huggingface.co/protodotdesign/Soundboard}},
}
```