---
license: mit
tags:
- mlx
- speaker-embedding
- speaker-verification
- speaker-diarization
- wespeaker
- resnet
- apple-silicon
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
library_name: mlx
pipeline_tag: audio-classification
---

# WeSpeaker ResNet34-LM — MLX

MLX-compatible weights for [WeSpeaker ResNet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM), converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.

## Model

WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M params) that produces 256-dimensional L2-normalized speaker embeddings from audio. Trained on VoxCeleb for speaker verification and diarization.

**Architecture:**

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16kHz)
│
├─ Conv2d(1→32, k=3, p=1) + ReLU
├─ Layer1: 3× BasicBlock(32→32)
├─ Layer2: 4× BasicBlock(32→64, stride=2)
├─ Layer3: 6× BasicBlock(64→128, stride=2)
├─ Layer4: 3× BasicBlock(128→256, stride=2)
│
├─ Statistics Pooling: mean + std → [B, 5120]
├─ Linear(5120→256) → L2 normalize
│
Output: [B, 256] speaker embedding
```

BatchNorm is fused into Conv2d at conversion time — no BN layers in the MLX model.

## Usage (Swift / MLX)

```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```

Part of [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift).
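The statistics pooling step in the architecture above concatenates the per-utterance mean and standard deviation over time, which is where the 5120-dimensional vector comes from (256 channels × 10 frequency bins × 2 statistics). A minimal NumPy sketch, with illustrative shapes and names rather than the model's actual code:

```python
import numpy as np

def stats_pool(feats: np.ndarray) -> np.ndarray:
    """Pool [B, T, F, C] ResNet features to [B, 2*F*C] by
    concatenating the mean and std over the time axis."""
    B, T, F, C = feats.shape
    x = feats.reshape(B, T, F * C)              # flatten freq × channel
    mean = x.mean(axis=1)                       # [B, F*C]
    std = x.std(axis=1)                         # [B, F*C]
    return np.concatenate([mean, std], axis=1)  # [B, 2*F*C]

# 80 mel bins downsampled 8x by the three stride-2 layers -> 10 freq bins,
# 256 channels: pooled vector has 2 * 10 * 256 = 5120 dims, feeding Linear(5120→256).
feats = np.random.randn(1, 25, 10, 256).astype(np.float32)
pooled = stats_pool(feats)
print(pooled.shape)  # (1, 5120)
```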
## Conversion

```bash
python3 scripts/convert_wespeaker.py --upload
```

Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:

- **Fuse BatchNorm** into Conv2d: `w_fused = w × γ/√(σ²+ε)`, `b_fused = β − μ×γ/√(σ²+ε)`
- **Transpose Conv2d** weights: `[O, I, H, W]` → `[O, H, W, I]` for MLX channels-last
- **Rename**: strip `resnet.` prefix, `seg_1` → `embedding`
- **Drop** `num_batches_tracked` keys

## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|-------------|---------|-------|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
| `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
| `resnet.seg_1.bias` | `embedding.bias` | [256] |

## License

The original WeSpeaker model is released under the [MIT License](https://github.com/wenet-e2e/wespeaker/blob/master/LICENSE).
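The BatchNorm fusion formula from the Conversion section can be checked numerically. A minimal NumPy sketch (independent of `convert_wespeaker.py`, using a 1×1 "conv" written as a matmul so the algebra is easy to see):

```python
import numpy as np

rng = np.random.default_rng(0)
O, I = 4, 3  # toy output/input channel counts

w = rng.standard_normal((O, I))       # conv weight (no bias in the original convs)
x = rng.standard_normal((I,))         # one input "pixel"
gamma, beta = rng.standard_normal(O), rng.standard_normal(O)  # BN affine params
mu, var = rng.standard_normal(O), rng.random(O) + 0.1         # BN running stats
eps = 1e-5

# Reference: conv followed by BatchNorm in inference mode
y_ref = gamma * ((w @ x) - mu) / np.sqrt(var + eps) + beta

# Fused: w_fused = w × γ/√(σ²+ε),  b_fused = β − μ×γ/√(σ²+ε)
scale = gamma / np.sqrt(var + eps)
w_fused = w * scale[:, None]
b_fused = beta - mu * scale
y_fused = w_fused @ x + b_fused

print(np.allclose(y_ref, y_fused))  # True
```

The same per-output-channel scaling applies to a real [O, H, W, I] conv weight, broadcasting `scale` over the O axis.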