ZDisket committed on
Commit
9491b77
·
verified ·
1 Parent(s): 7fa3276

Update README.md

  1. README.md +176 -3
README.md CHANGED
@@ -1,3 +1,176 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ tags:
+ - audio
+ - audio-enhancement
+ - speech-enhancement
+ - bandwidth-extension
+ - codec-repair
+ - neural-codec
+ - waveform-processing
+ - pytorch
+ library_name: pytorch
+ pipeline_tag: audio-to-audio
+ frameworks: PyTorch
+ language:
+ - en
+ ---
+ # Brontes: Synthesis-First Waveform Enhancement
+
+ **Brontes** is a time-domain audio enhancement model designed for neural codec repair and bandwidth extension. This is the general pretrained model, trained on diverse audio data.
+
+ ## Model Description
+
+ Brontes upsamples and repairs speech degraded by neural codec compression. Unlike conventional Wave U-Net approaches that rely on dense skip connections, Brontes uses a **synthesis-first architecture** with selective deep skips, forcing the model to actively reconstruct rather than copy degraded input details.
+
+ ### Key Capabilities
+
+ - **Neural codec repair:** removes compression artifacts from neural codec outputs
+ - **Bandwidth extension:** upsamples from 24 kHz to 48 kHz (2× extension)
+ - **Waveform-domain processing:** operates directly on audio samples, with no spectrogram conversion
+ - **Synthesis-first design:** only the two deepest skip connections are retained, preventing artifact leakage
+ - **LSTM bottleneck:** captures long-range temporal dependencies at maximum compression
+
+ ### Model Architecture
+
+ - **Type:** Encoder-decoder U-Net with selective skip connections
+ - **Stages:** 6 encoder stages + 6 decoder stages (4096× total compression)
+ - **Bottleneck:** Bidirectional LSTM for temporal modeling
+ - **Parameters:** ~29M
+ - **Input:** 24 kHz mono audio (codec-degraded)
+ - **Output:** 48 kHz mono audio (enhanced)
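
The 4096× figure in the architecture list is consistent with six stages that each downsample by 4 (4^6 = 4096). The per-stage stride is an assumption here, since this README does not state it. Under that assumption, one bottleneck frame would cover 4096 input samples, and an input whose length is not a multiple of 4096 would need padding. Whether the model pads internally is not stated either. A minimal sketch of the arithmetic, using hypothetical helper names that are not part of the Brontes code:

```python
import math

NUM_STAGES = 6        # from the architecture list above
STRIDE_PER_STAGE = 4  # assumption: equal strides, chosen because 4**6 == 4096


def total_compression(num_stages=NUM_STAGES, stride=STRIDE_PER_STAGE):
    """Overall downsampling factor of the encoder under the equal-stride assumption."""
    return stride ** num_stages


def padded_length(num_samples, factor=None):
    """Smallest multiple of the compression factor that is >= num_samples."""
    factor = factor or total_compression()
    return math.ceil(num_samples / factor) * factor


def bottleneck_frames(num_samples, factor=None):
    """Number of bottleneck time steps the LSTM would see after padding."""
    factor = factor or total_compression()
    return padded_length(num_samples, factor) // factor


if __name__ == "__main__":
    one_second = 24_000  # one second of 24 kHz input
    print(total_compression(), padded_length(one_second), bottleneck_frames(one_second))
```

With these assumptions, one second of 24 kHz input pads to 24,576 samples and reaches the bottleneck as 6 frames.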
+
+ ## Intended Use
+
+ This is a **general pretrained model** trained on diverse audio data. For optimal performance on your specific use case:
+
+ ⚠️ **It is strongly recommended to fine-tune this model on your target dataset** using the `--pretrained` flag.
+
+ ### Primary Use Cases
+
+ - Repairing audio degraded by neural codecs (e.g., EnCodec, SoundStream, Lyra)
+ - Bandwidth extension from narrowband/wideband to fullband
+ - Speech enhancement and quality improvement
+ - Post-processing for codec-compressed audio
+
+ ## Quick Start
+
+ For detailed usage instructions, training, and fine-tuning, please see the [GitHub repository](https://github.com/ZDisket/Brontes).
+
+ ### Basic Inference Example
+
+ ```python
+ import torch
+ import torchaudio
+ import yaml
+ from brontes import Brontes
+
+ # Setup device
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ # Load config
+ with open('configs/config_brontes_48khz_demucs.yaml', 'r') as f:
+     config = yaml.safe_load(f)
+
+ # Create model
+ model = Brontes(unet_config=config['model'].get('unet_config', {})).to(device)
+
+ # Load checkpoint
+ checkpoint = torch.load('path/to/checkpoint.pt', map_location=device)
+ model.load_state_dict(checkpoint['model'] if 'model' in checkpoint else checkpoint)
+ model.eval()
+
+ # Load audio
+ audio, sr = torchaudio.load('input.wav')
+ target_sr = config['dataset']['sample_rate']
+
+ # Resample if necessary
+ if sr != target_sr:
+     resampler = torchaudio.transforms.Resample(sr, target_sr)
+     audio = resampler(audio)
+
+ # Convert to mono and normalize
+ if audio.shape[0] > 1:
+     audio = audio.mean(dim=0, keepdim=True)
+ max_val = audio.abs().max()
+ if max_val > 0:
+     audio = audio / max_val
+
+ # Add batch dimension and process
+ audio = audio.unsqueeze(0).to(device)
+ with torch.no_grad():
+     output, _, _, _ = model(audio)
+
+ # Save output
+ output = output.squeeze(0).cpu()
+ if output.abs().max() > 1.0:
+     output = output / output.abs().max()
+ torchaudio.save('output.wav', output, target_sr)
+ ```
+
+ Or use the command-line interface:
+
+ ```bash
+ python infer_brontes.py \
+     --config configs/config_brontes_48khz_demucs.yaml \
+     --checkpoint path/to/checkpoint.pt \
+     --input input.wav \
+     --output output.wav
+ ```
+
+ ## Training Details
+
+ ### Training Data
+
+ The model was trained on diverse audio data including:
+ - Clean speech recordings
+ - Codec-degraded audio pairs
+ - Various acoustic conditions and speakers
+
+ ### Training Procedure
+
+ - **Pretraining:** 10,000 steps of generator-only training
+ - **Adversarial training:** Multi-Period Discriminator (MPD) + Multi-Band Spectral Discriminator (MBSD)
+ - **Loss functions:** Multi-scale mel loss, pitch loss, adversarial loss, feature matching
+ - **Precision:** BF16 mixed precision
+ - **Framework:** PyTorch with custom training loop
+
+ ## Fine-tuning Recommendations
+
+ To achieve the best results on your specific dataset:
+
+ 1. **Prepare paired data:** Input (degraded) and target (clean) audio pairs
+ 2. **Use the `--pretrained` flag** to load model weights without optimizer state
+ 3. **Train for 10k-50k steps** depending on dataset size
+ 4. **Monitor validation loss** to prevent overfitting
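
Step 4 above can be made concrete with a simple patience rule: stop fine-tuning once the validation loss has failed to improve for a set number of consecutive checks. The sketch below is a generic illustration, not the repository's training loop. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the example loss values are all hypothetical.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive validation checks without improvement."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience    # allowed number of non-improving checks
        self.min_delta = min_delta  # minimum decrease that counts as an improvement
        self.best = float("inf")
        self.bad_checks = 0

    def update(self, val_loss):
        """Record one validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


if __name__ == "__main__":
    stopper = EarlyStopping(patience=3)
    # Made-up validation losses, one per validation interval.
    for step, loss in enumerate([1.0, 0.8, 0.7, 0.71, 0.72, 0.73]):
        if stopper.update(loss):
            print(f"stop at check {step}, best validation loss {stopper.best}")
            break
```

In practice the checkpoint saved at the best validation loss, rather than the last one, is usually the one to keep.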
+
+ ## Limitations
+
+ - **Domain-specific performance:** The general model may not perform optimally on highly specialized audio (fine-tuning is recommended)
+ - **Mono audio only:** Currently supports single-channel audio
+ - **Fixed sample rates:** Designed for 24 kHz input → 48 kHz output
+ - **Codec-specific artifacts:** Performance may vary across different codec types
+ - **Long-form audio:** Very long audio files may require chunking or sufficient GPU memory
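
For the long-form case, one common approach is to process overlapping chunks and blend the overlaps afterwards. The sketch below only computes the chunk boundaries. It assumes the 2× input-to-output ratio stated above (24 kHz to 48 kHz) when mapping an input span to its output span. The function names, chunk size, and overlap are hypothetical, and the repository may handle long files differently.

```python
def chunk_spans(num_samples, chunk_size, overlap):
    """Return (start, end) input spans that cover num_samples with the given overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if num_samples <= 0:
        return []
    hop = chunk_size - overlap  # distance between consecutive chunk starts
    spans = []
    start = 0
    while True:
        end = min(start + chunk_size, num_samples)
        spans.append((start, end))
        if end >= num_samples:
            break
        start += hop
    return spans


def to_output_span(span, factor=2):
    """Map an input span to output samples, assuming a fixed upsampling factor (2x here)."""
    start, end = span
    return (start * factor, end * factor)


if __name__ == "__main__":
    # Ten seconds of 24 kHz input, 4-second chunks, 0.5-second overlap (illustrative values).
    for span in chunk_spans(240_000, 96_000, 12_000):
        print(span, "->", to_output_span(span))
```

Each chunk would then be passed through the model on its own, and the overlapping output regions crossfaded before concatenation.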
+
+ ## Ethical Considerations
+
+ - This model is designed for audio enhancement and should not be used to create misleading or deceptive content
+ - Users should respect privacy and consent when processing speech recordings
+ - Enhanced audio should be clearly labeled as processed when used in sensitive contexts
+
+
+ ## License
+
+ Both the model weights and code are released under the MIT License.
+
+ ## Additional Resources
+
+ - **GitHub Repository:** [https://github.com/ZDisket/Brontes](https://github.com/ZDisket/Brontes)
+ - **Technical Report:** See the repository
+ - **Issues & Support:** [GitHub Issues](https://github.com/ZDisket/Brontes/issues)
+
+ ## Acknowledgments
+
+ Compute resources provided by Hot Aisle and AI at AMD.