# Brontes: Synthesis-First Waveform Enhancement
Brontes is a time-domain audio enhancement model designed for neural codec repair and bandwidth extension. This is the general pretrained model trained on diverse audio data.
## Model Description
Brontes upsamples and repairs speech degraded by neural codec compression. Unlike conventional Wave U-Net approaches that rely on dense skip connections, Brontes uses a synthesis-first architecture with selective deep skips, forcing the model to actively reconstruct rather than copy degraded input details.
### Key Capabilities

- Neural codec repair – removes compression artifacts from neural codec outputs
- Bandwidth extension – upsamples from 24 kHz to 48 kHz (2× extension)
- Waveform-domain processing – operates directly on audio samples, no spectrogram conversion
- Synthesis-first design – only the two deepest skips retained, preventing artifact leakage
- LSTM bottleneck – captures long-range temporal dependencies at maximum compression
## Model Architecture

- Type: Encoder-decoder U-Net with selective skip connections
- Stages: 6 encoder stages + 6 decoder stages (4096× total compression)
- Bottleneck: Bidirectional LSTM for temporal modeling
- Parameters: ~29M
- Input: 24 kHz mono audio (codec-degraded)
- Output: 48 kHz mono audio (enhanced)
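The stage and compression figures above imply a per-stage stride. A quick back-of-the-envelope check, assuming uniform strides across the six encoder stages (the card does not state the strides explicitly):

```python
# Sanity check of the architecture numbers above.
# Assumption (not stated in the card): all six encoder stages use the
# same stride, so total compression = stride ** stages.
stages = 6
total_compression = 4096

stride = round(total_compression ** (1 / stages))
print(stride)            # 4
print(stride ** stages)  # 4096, matching the stated 4096x factor

# Bottleneck frame rate, if the 4096x factor applies to the 24 kHz input:
frames_per_sec = 24_000 / total_compression
print(round(frames_per_sec, 3))  # 5.859
```

At roughly six frames per second of audio at the bottleneck, the bidirectional LSTM sees a very short sequence even for long clips, which is what makes long-range temporal modeling cheap at that depth.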
## Intended Use

This is a general pretrained model trained on diverse audio data. For optimal performance on your specific use case:

⚠️ It is strongly recommended to fine-tune this model on your target dataset using the `--pretrained` flag.
### Primary Use Cases
- Repairing audio degraded by neural codecs (e.g., EnCodec, SoundStream, Lyra)
- Bandwidth extension from narrowband/wideband to fullband
- Speech enhancement and quality improvement
- Post-processing for codec-compressed audio
## Quick Start

For detailed usage instructions, training, and fine-tuning, please see the GitHub repository.

### Basic Inference Example
```python
import torch
import torchaudio
import yaml

from brontes import Brontes

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load config
with open('configs/config_brontes_48khz_demucs.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create model
model = Brontes(unet_config=config['model'].get('unet_config', {})).to(device)

# Load checkpoint
checkpoint = torch.load('path/to/checkpoint.pt', map_location=device)
model.load_state_dict(checkpoint['model'] if 'model' in checkpoint else checkpoint)
model.eval()

# Load audio
audio, sr = torchaudio.load('input.wav')
target_sr = config['dataset']['sample_rate']

# Resample if necessary
if sr != target_sr:
    resampler = torchaudio.transforms.Resample(sr, target_sr)
    audio = resampler(audio)

# Convert to mono and normalize
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)
max_val = audio.abs().max()
if max_val > 0:
    audio = audio / max_val

# Add batch dimension and process
audio = audio.unsqueeze(0).to(device)
with torch.no_grad():
    output, _, _, _ = model(audio)

# Save output
output = output.squeeze(0).cpu()
if output.abs().max() > 1.0:
    output = output / output.abs().max()
torchaudio.save('output.wav', output, target_sr)
```
Or use the command-line interface:
```bash
python infer_brontes.py \
    --config configs/config_brontes_48khz_demucs.yaml \
    --checkpoint path/to/checkpoint.pt \
    --input input.wav \
    --output output.wav
```
## Training Details

### Training Data
The model was trained on diverse audio data including:
- Clean speech recordings
- Codec-degraded audio pairs
- Various acoustic conditions and speakers
### Training Procedure
- Pretraining: 10,000 steps generator-only training
- Adversarial training: Multi-Period Discriminator (MPD) + Multi-Band Spectral Discriminator (MBSD)
- Loss functions: Multi-scale mel loss, pitch loss, adversarial loss, feature matching
- Precision: BF16 mixed precision
- Framework: PyTorch with custom training loop
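The two-phase schedule above (generator-only warm-up, then adversarial training) can be expressed as a simple step gate. This is an illustrative sketch, not the repository's actual training-loop code:

```python
def use_adversarial_loss(step: int, pretrain_steps: int = 10_000) -> bool:
    """Two-phase schedule from the card: generator-only losses (mel,
    pitch) for the first 10k steps, after which the adversarial and
    feature-matching losses from the MPD + MBSD discriminators
    switch on."""
    return step >= pretrain_steps

print(use_adversarial_loss(5_000))   # False: still in generator-only warm-up
print(use_adversarial_loss(10_000))  # True: discriminators now active
```

Warming up the generator before enabling the discriminators is a common stabilization trick in GAN vocoder training, since an untrained generator gives the discriminators a trivially easy task.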
## Fine-tuning Recommendations

To achieve the best results on your specific dataset:

- Prepare paired data: input (degraded) and target (clean) audio pairs
- Use the `--pretrained` flag to load model weights without optimizer state
- Train for 10k–50k steps, depending on dataset size
- Monitor validation loss to prevent overfitting
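What "load model weights without optimizer state" amounts to can be sketched in plain Python; the helper name `load_pretrained_weights` is hypothetical, and the actual `--pretrained` handling lives in the repository's training script:

```python
# Hypothetical sketch of what loading with --pretrained amounts to:
# take only the model weights from a full training checkpoint and drop
# optimizer state, so fine-tuning starts with a fresh optimizer.
# The checkpoint layout mirrors the inference example above
# (a dict that may or may not contain a 'model' key).
def load_pretrained_weights(checkpoint: dict) -> dict:
    """Return model weights only, discarding any optimizer state."""
    return checkpoint.get("model", checkpoint)

full_ckpt = {
    "model": {"encoder.0.weight": "...tensors..."},
    "optimizer": {"step": 120_000, "exp_avg": "...moments..."},
}
weights = load_pretrained_weights(full_ckpt)
print("optimizer" in weights)  # False: only the weights carry over
```

Discarding optimizer moments matters because stale Adam statistics from pretraining can push early fine-tuning updates in the wrong direction on a new dataset.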
See the repository README for detailed fine-tuning instructions.
## Limitations

- Domain-specific performance: General model may not perform optimally on highly specialized audio (fine-tuning recommended)
- Mono audio only: Currently supports single-channel audio
- Fixed sample rates: Designed for 24 kHz input → 48 kHz output
- Codec-specific artifacts: Performance may vary across different codec types
- Long-form audio: Very long audio files may require chunking or sufficient GPU memory
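For the long-form audio caveat above, one workable approach (a sketch, not part of the released code) is to split the input into overlapping chunks, run each through the model, and cross-fade the overlaps when stitching outputs. The index math alone, in plain Python:

```python
def chunk_bounds(n_samples: int, chunk: int, hop: int):
    """(start, end) index pairs for overlapping chunks covering n_samples.

    Choosing hop < chunk leaves an overlap of (chunk - hop) samples that
    can be cross-faded when stitching model outputs back together.
    """
    bounds = []
    start = 0
    while start < n_samples:
        bounds.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += hop
    return bounds

# Example: a 10-second file at 24 kHz, 4-second chunks, 1-second overlap
sr = 24_000
print(chunk_bounds(10 * sr, 4 * sr, 3 * sr))
# [(0, 96000), (72000, 168000), (144000, 240000)]
```

Each chunk would be run through the model independently (as in the inference example) and the overlapping second cross-faded, which avoids audible seams at chunk boundaries while keeping peak GPU memory bounded by the chunk length.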
## Ethical Considerations
- This model is designed for audio enhancement and should not be used to create misleading or deceptive content
- Users should respect privacy and consent when processing speech recordings
- Enhanced audio should be clearly labeled as processed when used in sensitive contexts
## License
Both the model weights and code are released under the MIT License.
## Additional Resources
- GitHub Repository: https://github.com/ZDisket/Brontes
- Technical Report: See the repository
- Issues & Support: GitHub Issues
## Acknowledgments
Compute resources provided by Hot Aisle and AI at AMD.