---
license: mit
tags:
  - mlx
  - speaker-embedding
  - speaker-verification
  - speaker-diarization
  - wespeaker
  - resnet
  - apple-silicon
base_model: pyannote/wespeaker-voxceleb-resnet34-LM
library_name: mlx
pipeline_tag: audio-classification
---

# WeSpeaker ResNet34-LM (MLX)

MLX-compatible weights for WeSpeaker ResNet34-LM, converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.

## Model

WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M parameters) that produces 256-dimensional, L2-normalized speaker embeddings from audio. It was trained on VoxCeleb for speaker verification and diarization.

Architecture:

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16 kHz)
  │
  ├─ Conv2d(1→32, k=3, p=1) + ReLU
  ├─ Layer1: 3× BasicBlock(32→32)
  ├─ Layer2: 4× BasicBlock(32→64, stride=2)
  ├─ Layer3: 6× BasicBlock(64→128, stride=2)
  ├─ Layer4: 3× BasicBlock(128→256, stride=2)
  │
  ├─ Statistics Pooling: mean + std → [B, 5120]
  ├─ Linear(5120→256) → L2 normalize
  │
  Output: [B, 256] speaker embedding
```

BatchNorm is fused into Conv2d at conversion time; there are no BN layers in the MLX model.
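The statistics-pooling step above can be sketched in plain NumPy (an illustrative sketch, not the actual model code): after Layer4 there are 256 channels over 80/8 = 10 mel bins, i.e. 2560 values per frame, and concatenating the per-frame mean and standard deviation over time yields the 5120-dimensional utterance vector.

```python
import numpy as np

def stats_pooling(feats):
    """feats: [B, T, 2560] frame-level features -> [B, 5120] utterance vector."""
    mean = feats.mean(axis=1)          # [B, 2560] mean over time
    std = feats.std(axis=1)            # [B, 2560] std over time
    return np.concatenate([mean, std], axis=-1)

x = np.random.randn(2, 50, 2560).astype(np.float32)
pooled = stats_pooling(x)
print(pooled.shape)  # (2, 5120)
```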

## Usage (Swift / MLX)

```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```
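Because the embeddings are already L2-normalized, cosine similarity reduces to a plain dot product. A minimal Python sketch of the comparison (illustrative only, not the Swift API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embeddings; re-normalizes defensively."""
    a = np.asarray(a) / np.linalg.norm(a)
    b = np.asarray(b) / np.linalg.norm(b)
    return float(np.dot(a, b))

rng = np.random.default_rng(0)
e = rng.standard_normal(256)
print(cosine_similarity(e, e))  # ~1.0 for identical speakers
```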

Part of qwen3-asr-swift.

## Conversion

```bash
python3 scripts/convert_wespeaker.py --upload
```

Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:

- Fuse BatchNorm into Conv2d: `w_fused = w × γ/√(σ² + ε)`, `b_fused = β − μ × γ/√(σ² + ε)`
- Transpose Conv2d weights: `[O, I, H, W] → [O, H, W, I]` for MLX channels-last
- Rename: strip the `resnet.` prefix, `seg_1` → `embedding`
- Drop `num_batches_tracked` keys
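The fusion and transpose steps above can be sketched in NumPy (hedged sketch with assumed helper names; the actual `convert_wespeaker.py` may be structured differently):

```python
import numpy as np

def fuse_bn_into_conv(w, gamma, beta, mu, var, eps=1e-5):
    """w: [O, I, H, W] conv weight; BN params are per-output-channel [O]."""
    scale = gamma / np.sqrt(var + eps)          # γ / √(σ² + ε)
    w_fused = w * scale[:, None, None, None]    # scale each output channel
    b_fused = beta - mu * scale                 # β − μ × γ / √(σ² + ε)
    return w_fused, b_fused

w = np.random.randn(32, 1, 3, 3).astype(np.float32)     # stem conv: 1 → 32
gamma = np.full(32, 2.0, dtype=np.float32)
beta = np.zeros(32, dtype=np.float32)
mu = np.zeros(32, dtype=np.float32)
var = np.ones(32, dtype=np.float32)

w_fused, b_fused = fuse_bn_into_conv(w, gamma, beta, mu, var)
w_mlx = w_fused.transpose(0, 2, 3, 1)  # [O, I, H, W] → [O, H, W, I]
print(w_mlx.shape)  # (32, 3, 3, 1)
```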

## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|---|---|---|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | `[32, 3, 3, 1]` |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | `[O, 3, 3, I]` |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | `[O, 1, 1, I]` |
| `resnet.seg_1.weight` | `embedding.weight` | `[256, 5120]` |
| `resnet.seg_1.bias` | `embedding.bias` | `[256]` |
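The renaming rules in the table can be expressed as a small function (a hypothetical sketch; the real conversion script may implement this differently):

```python
def rename_key(pt_key):
    """Map a PyTorch checkpoint key to its MLX name, or None if dropped."""
    if "num_batches_tracked" in pt_key:
        return None                        # BN bookkeeping keys are dropped
    key = pt_key.removeprefix("resnet.")   # strip the resnet. prefix
    return key.replace("seg_1", "embedding")

print(rename_key("resnet.seg_1.weight"))           # embedding.weight
print(rename_key("resnet.layer2.0.conv1.weight"))  # layer2.0.conv1.weight
```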

## License

The original WeSpeaker model is released under the MIT License.