# WeSpeaker ResNet34-LM → MLX
MLX-compatible weights for WeSpeaker ResNet34-LM, converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.
## Model
WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M params) that produces 256-dimensional L2-normalized speaker embeddings from audio. Trained on VoxCeleb for speaker verification and diarization.
Architecture:

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16 kHz)
  │
  ├─ Conv2d(1→32, k=3, p=1) + ReLU
  ├─ Layer1: 3× BasicBlock(32→32)
  ├─ Layer2: 4× BasicBlock(32→64, stride=2)
  ├─ Layer3: 6× BasicBlock(64→128, stride=2)
  ├─ Layer4: 3× BasicBlock(128→256, stride=2)
  │
  ├─ Statistics Pooling: mean + std → [B, 5120]
  └─ Linear(5120→256) → L2 normalize
  │
Output: [B, 256] speaker embedding
```
BatchNorm is fused into Conv2d at conversion time, so there are no BN layers in the MLX model.
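The statistics-pooling step in the diagram above can be sketched in NumPy. The tensor shapes here are illustrative (derived from the diagram: three stride-2 stages reduce 80 mel bins to 10, with 256 channels, so each frame flattens to 2560 features; mean + std doubles that to 5120):

```python
import numpy as np

# Hypothetical feature map after Layer4: [B, T', F', C]
# 80 mel bins / 2^3 (three stride-2 stages) = 10 frequency bins, 256 channels
feats = np.random.randn(1, 25, 10, 256).astype(np.float32)

# Flatten frequency and channel dims per frame: [B, T', 2560]
B, T, F, C = feats.shape
x = feats.reshape(B, T, F * C)

# Statistics pooling: concatenate mean and std over time -> [B, 5120]
pooled = np.concatenate([x.mean(axis=1), x.std(axis=1)], axis=-1)
assert pooled.shape == (1, 5120)
```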
## Usage (Swift / MLX)

```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```
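Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. A minimal NumPy sketch of the comparison (not the library's implementation, just the math):

```python
import numpy as np

def cosine_similarity(a, b):
    # General form; for L2-normalized vectors the denominator is 1,
    # so this reduces to np.dot(a, b).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
e1 = rng.standard_normal(256); e1 /= np.linalg.norm(e1)
e2 = rng.standard_normal(256); e2 /= np.linalg.norm(e2)

assert abs(cosine_similarity(e1, e1) - 1.0) < 1e-6
# For normalized inputs, identical to the dot product:
assert abs(cosine_similarity(e1, e2) - float(np.dot(e1, e2))) < 1e-6
```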
Part of qwen3-asr-swift.
## Conversion

```bash
python3 scripts/convert_wespeaker.py --upload
```

Converts the original `pyannote/wespeaker-voxceleb-resnet34-LM` checkpoint using a custom unpickler (no `pyannote.audio` dependency required). Key transformations:
- Fuse BatchNorm into Conv2d: `w_fused = w × γ/√(σ² + ε)`, `b_fused = β − μ × γ/√(σ² + ε)`
- Transpose Conv2d weights: `[O, I, H, W]` → `[O, H, W, I]` for MLX channels-last
- Rename: strip the `resnet.` prefix, `seg_1` → `embedding`
- Drop `num_batches_tracked` keys
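The fusion and transpose steps above can be sketched in NumPy (an illustrative sketch, not the conversion script itself; the toy shapes match the `conv1` row of the weight-mapping table):

```python
import numpy as np

def fuse_bn_into_conv(w, gamma, beta, mu, var, eps=1e-5):
    """Fold BatchNorm(gamma, beta, mu, var) into a conv weight w: [O, I, H, W]."""
    scale = gamma / np.sqrt(var + eps)        # [O]
    w_fused = w * scale[:, None, None, None]  # scale each output channel
    b_fused = beta - mu * scale               # bias absorbing the BN shift
    return w_fused, b_fused

def to_mlx_layout(w):
    """Transpose [O, I, H, W] -> [O, H, W, I] for MLX channels-last convs."""
    return np.transpose(w, (0, 2, 3, 1))

# Toy check: PyTorch conv1 [32, 1, 3, 3] -> MLX [32, 3, 3, 1]
w = np.random.randn(32, 1, 3, 3).astype(np.float32)
g, b = np.ones(32, np.float32), np.zeros(32, np.float32)
m, v = np.zeros(32, np.float32), np.ones(32, np.float32)
wf, bf = fuse_bn_into_conv(w, g, b, m, v)
assert to_mlx_layout(wf).shape == (32, 3, 3, 1)
```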
## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|---|---|---|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | `[32, 3, 3, 1]` |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | `[O, 3, 3, I]` |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | `[O, 1, 1, I]` |
| `resnet.seg_1.weight` | `embedding.weight` | `[256, 5120]` |
| `resnet.seg_1.bias` | `embedding.bias` | `[256]` |
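The rename rules in the table reduce to a small key-mapping function. A hedged sketch (illustrative only; the real script also fuses BN statistics into the conv weights rather than merely renaming them):

```python
def rename_key(k: str):
    """Map a PyTorch checkpoint key to its MLX name per the table above."""
    if "num_batches_tracked" in k:
        return None                       # dropped entirely
    k = k.removeprefix("resnet.")         # strip the resnet. prefix
    k = k.replace("seg_1", "embedding")   # seg_1 -> embedding
    return k

assert rename_key("resnet.seg_1.weight") == "embedding.weight"
assert rename_key("resnet.layer2.0.conv1.weight") == "layer2.0.conv1.weight"
assert rename_key("resnet.bn1.num_batches_tracked") is None
```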
## License
The original WeSpeaker model is released under the MIT License.