---
license: mit
tags:
- mlx
- speaker-embedding
- speaker-verification
- speaker-diarization
- wespeaker
- resnet
- apple-silicon
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
library_name: mlx
pipeline_tag: audio-classification
---

# WeSpeaker ResNet34-LM — MLX

MLX-compatible weights for [WeSpeaker ResNet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM), converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.

## Model

WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M params) that produces 256-dimensional L2-normalized speaker embeddings from audio. Trained on VoxCeleb for speaker verification and diarization.

**Architecture:**

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16kHz)
│
├─ Conv2d(1→32, k=3, p=1) + ReLU
├─ Layer1: 3× BasicBlock(32→32)
├─ Layer2: 4× BasicBlock(32→64, stride=2)
├─ Layer3: 6× BasicBlock(64→128, stride=2)
├─ Layer4: 3× BasicBlock(128→256, stride=2)
│
├─ Statistics Pooling: mean + std → [B, 5120]
├─ Linear(5120→256) → L2 normalize
│
Output: [B, 256] speaker embedding
```

BatchNorm is fused into Conv2d at conversion time — no BN layers in the MLX model.

## Usage (Swift / MLX)

```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```

Part of [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift).
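The statistics pooling step in the architecture above concatenates the per-utterance mean and standard deviation over time, which is where the 5120-dimensional vector comes from (256 channels × 10 frequency bins × 2 statistics). A minimal NumPy sketch, with illustrative shapes and names rather than the model's actual code:

```python
import numpy as np

def stats_pool(feats: np.ndarray) -> np.ndarray:
    """Pool [B, T, F, C] ResNet features to [B, 2*F*C] by
    concatenating the mean and std over the time axis."""
    B, T, F, C = feats.shape
    x = feats.reshape(B, T, F * C)              # flatten freq × channel
    mean = x.mean(axis=1)                       # [B, F*C]
    std = x.std(axis=1)                         # [B, F*C]
    return np.concatenate([mean, std], axis=1)  # [B, 2*F*C]

# 80 mel bins downsampled 8x by the three stride-2 layers -> 10 freq bins,
# 256 channels: pooled vector has 2 * 10 * 256 = 5120 dims, feeding Linear(5120→256).
feats = np.random.randn(1, 25, 10, 256).astype(np.float32)
pooled = stats_pool(feats)
print(pooled.shape)  # (1, 5120)
```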
## Conversion

```bash
python3 scripts/convert_wespeaker.py --upload
```

Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:

- **Fuse BatchNorm** into Conv2d: `w_fused = w × γ/√(σ²+ε)`, `b_fused = β − μ×γ/√(σ²+ε)`
- **Transpose Conv2d** weights: `[O, I, H, W]` → `[O, H, W, I]` for MLX channels-last
- **Rename**: strip `resnet.` prefix, `seg_1` → `embedding`
- **Drop** `num_batches_tracked` keys

## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|-------------|---------|-------|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
| `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
| `resnet.seg_1.bias` | `embedding.bias` | [256] |

## License

The original WeSpeaker model is released under the [MIT License](https://github.com/wenet-e2e/wespeaker/blob/master/LICENSE).
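The BatchNorm fusion formula from the Conversion section can be checked numerically. A minimal NumPy sketch (independent of `convert_wespeaker.py`, using a 1×1 "conv" written as a matmul so the algebra is easy to see):

```python
import numpy as np

rng = np.random.default_rng(0)
O, I = 4, 3  # toy output/input channel counts

w = rng.standard_normal((O, I))       # conv weight (no bias in the original convs)
x = rng.standard_normal((I,))         # one input "pixel"
gamma, beta = rng.standard_normal(O), rng.standard_normal(O)  # BN affine params
mu, var = rng.standard_normal(O), rng.random(O) + 0.1         # BN running stats
eps = 1e-5

# Reference: conv followed by BatchNorm in inference mode
y_ref = gamma * ((w @ x) - mu) / np.sqrt(var + eps) + beta

# Fused: w_fused = w × γ/√(σ²+ε),  b_fused = β − μ×γ/√(σ²+ε)
scale = gamma / np.sqrt(var + eps)
w_fused = w * scale[:, None]
b_fused = beta - mu * scale
y_fused = w_fused @ x + b_fused

print(np.allclose(y_ref, y_fused))  # True
```

The same per-output-channel scaling applies to a real [O, H, W, I] conv weight, broadcasting `scale` over the O axis.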