---
license: mit
tags:
- audio
- audio-enhancement
- speech-enhancement
- bandwidth-extension
- codec-repair
- neural-codec
- waveform-processing
- pytorch
library_name: pytorch
pipeline_tag: audio-to-audio
frameworks: PyTorch
language:
- en
---

# Brontes: Synthesis-First Waveform Enhancement

**Brontes** is a time-domain audio enhancement model designed for neural codec repair and bandwidth extension. This is the general pretrained model, trained on diverse audio data.

## Model Description

Brontes upsamples and repairs speech degraded by neural codec compression. Unlike conventional Wave U-Net approaches that rely on dense skip connections, Brontes uses a **synthesis-first architecture** with selective deep skips, forcing the model to actively reconstruct the signal rather than copy degraded input details.

### Key Capabilities

- **Neural codec repair** — removes compression artifacts from neural codec outputs
- **Bandwidth extension** — upsamples from 24 kHz to 48 kHz (2× extension)
- **Waveform-domain processing** — operates directly on audio samples, no spectrogram conversion
- **Synthesis-first design** — only the two deepest skips are retained, preventing artifact leakage
- **LSTM bottleneck** — captures long-range temporal dependencies at maximum compression

### Model Architecture

- **Type:** Encoder-decoder U-Net with selective skip connections
- **Stages:** 6 encoder stages + 6 decoder stages (4096× total compression)
- **Bottleneck:** Bidirectional LSTM for temporal modeling
- **Parameters:** ~29M
- **Input:** 24 kHz mono audio (codec-degraded)
- **Output:** 48 kHz mono audio (enhanced)

## Intended Use

This is a **general pretrained model** trained on diverse audio data. For optimal performance on your specific use case:

⚠️ **It is strongly recommended to fine-tune this model on your target dataset** using the `--pretrained` flag.
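As a sanity check on the architecture numbers above: the stated 4096× total compression is consistent with a uniform downsampling factor of 4 at each of the six encoder stages. The per-stage stride of 4 is an assumption (the repository config is authoritative), but the arithmetic works out:

```python
stages = 6       # encoder stages, as stated in the model card
stride = 4       # assumed per-stage downsampling factor; check the repo config
total_compression = stride ** stages
print(total_compression)  # 4096, matching the stated bottleneck compression
```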
### Primary Use Cases

- Repairing audio degraded by neural codecs (e.g., EnCodec, SoundStream, Lyra)
- Bandwidth extension from narrowband/wideband to fullband
- Speech enhancement and quality improvement
- Post-processing for codec-compressed audio

## Quick Start

For detailed usage instructions, training, and fine-tuning, please see the [GitHub repository](https://github.com/ZDisket/Brontes).

### Basic Inference Example

```python
import torch
import torchaudio
import yaml
from brontes import Brontes

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load config
with open('configs/config_brontes_48khz_demucs.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create model
model = Brontes(unet_config=config['model'].get('unet_config', {})).to(device)

# Load checkpoint
checkpoint = torch.load('path/to/checkpoint.pt', map_location=device)
model.load_state_dict(checkpoint['model'] if 'model' in checkpoint else checkpoint)
model.eval()

# Load audio
audio, sr = torchaudio.load('input.wav')
target_sr = config['dataset']['sample_rate']

# Resample if necessary
if sr != target_sr:
    resampler = torchaudio.transforms.Resample(sr, target_sr)
    audio = resampler(audio)

# Convert to mono and normalize
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)
max_val = audio.abs().max()
if max_val > 0:
    audio = audio / max_val

# Add batch dimension and process
audio = audio.unsqueeze(0).to(device)
with torch.no_grad():
    output, _, _, _ = model(audio)

# Save output
output = output.squeeze(0).cpu()
if output.abs().max() > 1.0:
    output = output / output.abs().max()
torchaudio.save('output.wav', output, target_sr)
```

Or use the command-line interface:

```bash
python infer_brontes.py \
    --config configs/config_brontes_48khz_demucs.yaml \
    --checkpoint path/to/checkpoint.pt \
    --input input.wav \
    --output output.wav
```

## Training Details

### Training Data

The model was trained on diverse audio data, including:

- Clean speech recordings
- Codec-degraded audio pairs
- Various acoustic conditions and speakers

### Training Procedure

- **Pretraining:** 10,000 steps of generator-only training
- **Adversarial training:** Multi-Period Discriminator (MPD) + Multi-Band Spectral Discriminator (MBSD)
- **Loss functions:** Multi-scale mel loss, pitch loss, adversarial loss, feature matching
- **Precision:** BF16 mixed precision
- **Framework:** PyTorch with a custom training loop

## Fine-tuning Recommendations

To achieve the best results on your specific dataset:

1. **Prepare paired data:** input (degraded) and target (clean) audio pairs
2. **Use the `--pretrained` flag** to load model weights without optimizer state
3. **Train for 10-50k steps**, depending on dataset size
4. **Monitor validation loss** to prevent overfitting

See the [repository README](https://github.com/ZDisket/Brontes) for detailed fine-tuning instructions.

## Limitations

- **Domain-specific performance:** the general model may not perform optimally on highly specialized audio (fine-tuning recommended)
- **Mono audio only:** currently supports single-channel audio
- **Fixed sample rates:** designed for 24 kHz input → 48 kHz output
- **Codec-specific artifacts:** performance may vary across different codec types
- **Long-form audio:** very long audio files may require chunking or sufficient GPU memory

## Ethical Considerations

- This model is designed for audio enhancement and should not be used to create misleading or deceptive content
- Users should respect privacy and consent when processing speech recordings
- Enhanced audio should be clearly labeled as processed when used in sensitive contexts

## License

Both the model weights and code are released under the MIT License.
## Additional Resources

- **GitHub Repository:** [https://github.com/ZDisket/Brontes](https://github.com/ZDisket/Brontes)
- **Technical Report:** see the repository
- **Issues & Support:** [GitHub Issues](https://github.com/ZDisket/Brontes/issues)

## Acknowledgments

Compute resources provided by Hot Aisle and AI at AMD.
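## Appendix: Chunked Inference Sketch

The Limitations section notes that very long files may require chunking. Below is a minimal overlap-add sketch of that idea. `enhance_in_chunks` is a hypothetical helper, not part of the Brontes API; it assumes the model maps a `(1, 1, n)` input tensor to a `(1, 1, 2 * n)` output (24 kHz → 48 kHz), whereas the actual Brontes forward pass returns a tuple (see the inference example above), so adapt the call accordingly:

```python
import torch

def enhance_in_chunks(model, audio, chunk_size=240_000, overlap=24_000, upsample=2):
    """Hypothetical helper: enhance a long 1-D waveform in overlapping chunks.

    Assumes `model(x)` maps a (1, 1, n) tensor to a (1, 1, upsample * n) tensor.
    Overlapping regions are blended with a linear crossfade to hide chunk seams.
    """
    hop = chunk_size - overlap            # input samples to advance per chunk
    out_overlap = overlap * upsample      # overlap length in output samples
    fade_in = torch.linspace(0.0, 1.0, out_overlap)
    fade_out = 1.0 - fade_in

    # Enhance each overlapping chunk independently.
    pieces = []
    start = 0
    while start < audio.numel():
        chunk = audio[start:start + chunk_size]
        with torch.no_grad():
            pieces.append(model(chunk.view(1, 1, -1)).view(-1))
        start += hop

    # Overlap-add: crossfade the tail of the running output into each new piece.
    out = pieces[0]
    for piece in pieces[1:]:
        n = min(out_overlap, out.numel(), piece.numel())
        blended = out[-n:] * fade_out[:n] + piece[:n] * fade_in[:n]
        out = torch.cat([out[:-n], blended, piece[n:]])
    return out
```

Because consecutive chunks see the same input samples in the overlap region, the linear crossfade suppresses boundary discontinuities without changing regions that both chunks agree on.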