# WeSpeaker ResNet34-LM (MLX)

MLX-compatible weights for WeSpeaker ResNet34-LM, converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.

## Model

WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M params) that produces 256-dimensional L2-normalized speaker embeddings from audio. Trained on VoxCeleb for speaker verification and diarization.

Architecture:

```
Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16kHz)
  │
  ├─ Conv2d(1→32, k=3, p=1) + ReLU
  ├─ Layer1: 3× BasicBlock(32→32)
  ├─ Layer2: 4× BasicBlock(32→64, stride=2)
  ├─ Layer3: 6× BasicBlock(64→128, stride=2)
  ├─ Layer4: 3× BasicBlock(128→256, stride=2)
  │
  ├─ Statistics Pooling: mean + std → [B, 5120]
  ├─ Linear(5120→256) → L2 normalize
  │
  Output: [B, 256] speaker embedding
```

BatchNorm is fused into Conv2d at conversion time; there are no BN layers in the MLX model.
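The 5120-dimensional pooling input follows from the shapes above. A minimal NumPy sketch (illustrative only, not the actual MLX implementation) of the mean + std statistics pooling:

```python
import numpy as np

# After layer4, three stride-2 stages have reduced the 80 mel bins to
# 80 / 8 = 10 frequency bins at 256 channels, so each frame carries
# 256 * 10 = 2560 features; concatenating mean and std doubles that to 5120.
B, T, C, F = 2, 50, 256, 10
feats = np.random.randn(B, T, C * F).astype(np.float32)   # [B, T', 2560]

mean = feats.mean(axis=1)                                 # [B, 2560]
std = feats.std(axis=1)                                   # [B, 2560]
pooled = np.concatenate([mean, std], axis=1)              # [B, 5120]

assert pooled.shape == (B, 5120)  # fed to Linear(5120 -> 256)
```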

## Usage (Swift / MLX)

```swift
import SpeechVAD

// Speaker embedding
let model = try await WeSpeakerModel.fromPretrained()
let embedding = model.embed(audio: samples, sampleRate: 16000)
// embedding: [Float] of length 256, L2-normalized

// Compare speakers
let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)

// Full speaker diarization pipeline
let pipeline = try await DiarizationPipeline.fromPretrained()
let result = pipeline.diarize(audio: samples, sampleRate: 16000)
for seg in result.segments {
    print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
}
```
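For intuition about the similarity comparison, here is a NumPy sketch (not the Swift API above) of cosine similarity between two 256-d embeddings; because the model L2-normalizes its output, the cosine reduces to a plain dot product:

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length, as the model does to its embeddings."""
    return v / np.linalg.norm(v)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a = l2_normalize(rng.standard_normal(256))
b = l2_normalize(rng.standard_normal(256))

# For unit-length embeddings the cosine is just the dot product.
assert abs(cosine_similarity(a, b) - float(np.dot(a, b))) < 1e-6
assert -1.0 <= cosine_similarity(a, b) <= 1.0
```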

Part of qwen3-asr-swift.

## Conversion

```bash
python3 scripts/convert_wespeaker.py --upload
```

Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:

  • Fuse BatchNorm into Conv2d: w_fused = w × γ/√(σ² + ε), b_fused = β − μ × γ/√(σ² + ε)
  • Transpose Conv2d weights: [O, I, H, W] → [O, H, W, I] for MLX channels-last
  • Rename: strip resnet. prefix, seg_1 → embedding
  • Drop num_batches_tracked keys
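The first two transformations can be sketched in NumPy (hypothetical helper names; the actual logic lives in scripts/convert_wespeaker.py). The check below verifies that the fused convolution reproduces conv-then-BatchNorm exactly:

```python
import numpy as np

def fuse_conv_bn(w, b, gamma, beta, mu, var, eps=1e-5):
    """Fold BatchNorm stats into conv weights. w: [O, I, H, W] (PyTorch layout);
    b may be None for bias-free convs."""
    scale = gamma / np.sqrt(var + eps)          # γ/√(σ²+ε), shape [O]
    w_fused = w * scale[:, None, None, None]    # w × γ/√(σ²+ε)
    b_fused = beta - mu * scale                 # β − μ×γ/√(σ²+ε)
    if b is not None:
        b_fused += b * scale
    return w_fused, b_fused

O, I = 4, 3
rng = np.random.default_rng(0)
w = rng.standard_normal((O, I, 3, 3))
gamma, beta = rng.standard_normal(O), rng.standard_normal(O)
mu, var = rng.standard_normal(O), rng.random(O) + 0.5

w_f, b_f = fuse_conv_bn(w, None, gamma, beta, mu, var)

# Check at one spatial position: conv followed by BN == fused conv.
x = rng.standard_normal((I, 3, 3))
y = np.einsum('oihw,ihw->o', w, x)
y_bn = gamma * (y - mu) / np.sqrt(var + 1e-5) + beta
y_fused = np.einsum('oihw,ihw->o', w_f, x) + b_f
assert np.allclose(y_bn, y_fused)

# MLX channels-last transpose: [O, I, H, W] -> [O, H, W, I]
assert w_f.transpose(0, 2, 3, 1).shape == (O, 3, 3, I)
```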

## Weight Mapping

| PyTorch Key | MLX Key | Shape |
|---|---|---|
| `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
| `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
| `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
| `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
| `resnet.seg_1.bias` | `embedding.bias` | [256] |
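The renaming side of the mapping above (BN fusion aside) can be sketched as follows; this is illustrative, not the actual conversion script:

```python
def rename_key(key: str):
    """Map a PyTorch checkpoint key to its MLX name, or None if dropped."""
    if key.endswith("num_batches_tracked"):
        return None                        # BN bookkeeping, dropped
    key = key.removeprefix("resnet.")      # strip resnet. prefix
    return key.replace("seg_1", "embedding")

assert rename_key("resnet.seg_1.weight") == "embedding.weight"
assert rename_key("resnet.layer1.0.conv1.weight") == "layer1.0.conv1.weight"
assert rename_key("resnet.bn1.num_batches_tracked") is None
```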

## License

The original WeSpeaker model is released under the MIT License.
