leduclinh, aufklarer committed
Commit 626548f · 0 Parent(s)

Duplicate from aufklarer/WeSpeaker-ResNet34-LM-MLX

Co-authored-by: Ivan <aufklarer@users.noreply.huggingface.co>

Files changed (4)
  1. .gitattributes +35 -0
  2. README.md +91 -0
  3. config.json +19 -0
  4. model.safetensors +3 -0
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,91 @@
+ ---
+ license: mit
+ tags:
+ - mlx
+ - speaker-embedding
+ - speaker-verification
+ - speaker-diarization
+ - wespeaker
+ - resnet
+ - apple-silicon
+ base_model: pyannote/wespeaker-voxceleb-resnet34-LM
+ library_name: mlx
+ pipeline_tag: audio-classification
+ ---
+
+ # WeSpeaker ResNet34-LM (MLX)
+
+ MLX-compatible weights for [WeSpeaker ResNet34-LM](https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM), converted from the pyannote speaker embedding model with BatchNorm fused into Conv2d.
+
+ ## Model
+
+ WeSpeaker ResNet34-LM is a speaker embedding model (~6.6M params) that produces 256-dimensional L2-normalized speaker embeddings from audio. It is trained on VoxCeleb for speaker verification and diarization.
+
+ **Architecture:**
+
+ ```
+ Input: [B, T, 80, 1] log-mel spectrogram (80 fbank, 16 kHz)
+ │
+ ├─ Conv2d(1→32, k=3, p=1) + ReLU
+ ├─ Layer1: 3× BasicBlock(32→32)
+ ├─ Layer2: 4× BasicBlock(32→64, stride=2)
+ ├─ Layer3: 6× BasicBlock(64→128, stride=2)
+ ├─ Layer4: 3× BasicBlock(128→256, stride=2)
+ │
+ ├─ Statistics Pooling: mean + std → [B, 5120]
+ ├─ Linear(5120→256) → L2 normalize
+ │
+ Output: [B, 256] speaker embedding
+ ```
+
+ BatchNorm is fused into Conv2d at conversion time, so the MLX model contains no BN layers.
+
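The 5120-dimensional pooling output follows from the layer strides in the diagram. The sketch below reproduces that arithmetic; it assumes the stride-2 stages use 3×3 kernels with padding 1 (the kernel size matches the weight table, the padding is an assumption) and that statistics pooling concatenates mean and standard deviation over time for every channel and frequency position.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Output length of a convolution along one axis (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


n_mels = 80
channels = [32, 64, 128, 256]
strides = [1, 2, 2, 2]  # Layer1 keeps the resolution; Layers 2-4 halve it

freq = conv_out(n_mels)  # stem conv: k=3, p=1, stride 1 leaves 80 bins
for stride in strides:
    freq = conv_out(freq, stride=stride)

# Mean and std are concatenated, hence the factor of 2.
pooling_dim = 2 * channels[-1] * freq
print(freq, pooling_dim)  # 10 5120
```

The result agrees with `pooling_output_dim` in `config.json` and with the `[256, 5120]` shape of `embedding.weight`.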
+ ## Usage (Swift / MLX)
+
+ ```swift
+ import SpeechVAD
+
+ // Speaker embedding
+ let model = try await WeSpeakerModel.fromPretrained()
+ let embedding = model.embed(audio: samples, sampleRate: 16000)
+ // embedding: [Float] of length 256, L2-normalized
+
+ // Compare speakers
+ let similarity = WeSpeakerModel.cosineSimilarity(embeddingA, embeddingB)
+
+ // Full speaker diarization pipeline
+ let pipeline = try await DiarizationPipeline.fromPretrained()
+ let result = pipeline.diarize(audio: samples, sampleRate: 16000)
+ for seg in result.segments {
+     print("Speaker \(seg.speakerId): \(seg.startTime)s - \(seg.endTime)s")
+ }
+ ```
+
+ Part of [qwen3-asr-swift](https://github.com/ivan-digital/qwen3-asr-swift).
+
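Because the embeddings are L2-normalized, cosine similarity between two of them reduces to a plain dot product. The following Python sketch illustrates that property with toy 4-dimensional vectors; `l2_normalize` and `cosine_similarity` are illustrative helpers, not the Swift library's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors stand in for real 256-dimensional embeddings.
emb_a = l2_normalize([1.0, 2.0, 3.0, 4.0])
emb_b = l2_normalize([4.0, 3.0, 2.0, 1.0])

dot = sum(x * y for x, y in zip(emb_a, emb_b))
cos = cosine_similarity(emb_a, emb_b)
print(abs(dot - cos) < 1e-9)  # True: both norms are 1, so the two agree
```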
66
+ ## Conversion
67
+
68
+ ```bash
69
+ python3 scripts/convert_wespeaker.py --upload
70
+ ```
71
+
72
+ Converts the original pyannote/wespeaker-voxceleb-resnet34-LM checkpoint using a custom unpickler (no pyannote.audio dependency required). Key transformations:
73
+
74
+ - **Fuse BatchNorm** into Conv2d: `w_fused = w Γ— Ξ³/√(σ²+Ξ΅)`, `b_fused = Ξ² βˆ’ ΞΌΓ—Ξ³/√(σ²+Ξ΅)`
75
+ - **Transpose Conv2d** weights: `[O, I, H, W]` β†’ `[O, H, W, I]` for MLX channels-last
76
+ - **Rename**: strip `resnet.` prefix, `seg_1` β†’ `embedding`
77
+ - **Drop** `num_batches_tracked` keys
78
+
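The fusion and transpose steps listed above can be sketched in plain Python on nested lists. This is not the code in `scripts/convert_wespeaker.py`: the function names, the `eps=1e-5` default (PyTorch's BatchNorm default), and the absence of a bias on the original convolution are assumptions made for illustration.

```python
import math


def fuse_bn(w, gamma, beta, mean, var, eps=1e-5):
    """Fold per-output-channel BatchNorm statistics into conv weights.

    w is indexed [O][I][H][W]; the other arguments have length O.
    Returns (fused weights, fused bias).
    """
    fused_w, fused_b = [], []
    for o, kernel in enumerate(w):
        scale = gamma[o] / math.sqrt(var[o] + eps)
        fused_w.append([[[v * scale for v in row] for row in plane] for plane in kernel])
        fused_b.append(beta[o] - mean[o] * scale)
    return fused_w, fused_b


def to_channels_last(w):
    """Reorder [O][I][H][W] into [O][H][W][I]."""
    return [
        [[[kernel[i][h][x] for i in range(len(kernel))]
          for x in range(len(kernel[0][0]))]
         for h in range(len(kernel[0]))]
        for kernel in w
    ]


def conv1x1(w, x, bias=None):
    """Apply a 1x1 convolution to a single pixel with input channels x."""
    out = [sum(w[o][i][0][0] * x[i] for i in range(len(x))) for o in range(len(w))]
    return out if bias is None else [y + b for y, b in zip(out, bias)]


# Check on a toy 1x1 conv that conv followed by BN equals the fused conv.
eps = 1e-5
w = [[[[1.0]], [[2.0]]], [[[-1.0]], [[0.5]]]]
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, var = [0.3, -0.4], [4.0, 0.25]
x = [0.5, -1.0]

reference = [
    gamma[o] * (y - mean[o]) / math.sqrt(var[o] + eps) + beta[o]
    for o, y in enumerate(conv1x1(w, x))
]
fw, fb = fuse_bn(w, gamma, beta, mean, var, eps)
fused = conv1x1(fw, x, fb)
max_err = max(abs(a - b) for a, b in zip(reference, fused))

transposed = to_channels_last([[[[1, 2, 3]], [[4, 5, 6]]]])
```

Fusing before transposing keeps the per-output-channel scale on the leading axis, which both layouts share.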
+ ## Weight Mapping
+
+ | PyTorch Key | MLX Key | Shape |
+ |-------------|---------|-------|
+ | `resnet.conv1.weight` + `resnet.bn1.*` | `conv1.weight` | [32, 3, 3, 1] |
+ | `resnet.layer{L}.{B}.conv{1,2}.weight` + `bn{1,2}.*` | `layer{L}.{B}.conv{1,2}.weight` | [O, 3, 3, I] |
+ | `resnet.layer{L}.0.shortcut.0.weight` + `shortcut.1.*` | `layer{L}.0.shortcut.weight` | [O, 1, 1, I] |
+ | `resnet.seg_1.weight` | `embedding.weight` | [256, 5120] |
+ | `resnet.seg_1.bias` | `embedding.bias` | [256] |
+
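The renaming rules in the table can be expressed as a small key-mapping function. This is a sketch of the mapping as the table describes it, with a hypothetical `map_key` helper; it returns `None` for keys that have no counterpart of their own because they are dropped or absorbed by BatchNorm fusion.

```python
import re


def map_key(key):
    """Return the MLX-side name for a PyTorch key, or None if it has none."""
    if key.endswith("num_batches_tracked"):
        return None  # dropped outright
    if key.startswith("resnet."):
        key = key[len("resnet."):]
    # BatchNorm tensors (bn1.*, bn2.*, shortcut.1.*) are folded into the
    # preceding convolution, so they produce no key here.
    if re.search(r"(^|\.)bn\d+\.", key) or ".shortcut.1." in key:
        return None
    key = key.replace(".shortcut.0.", ".shortcut.")
    if key.startswith("seg_1."):
        key = "embedding." + key[len("seg_1."):]
    return key


for name in [
    "resnet.conv1.weight",
    "resnet.layer2.0.shortcut.0.weight",
    "resnet.layer2.0.bn1.running_mean",
    "resnet.seg_1.bias",
]:
    print(name, "->", map_key(name))
```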
+ ## License
+
+ The original WeSpeaker model is released under the [MIT License](https://github.com/wenet-e2e/wespeaker/blob/master/LICENSE).
config.json ADDED
@@ -0,0 +1,19 @@
+ {
+   "model_type": "wespeaker-resnet34-lm",
+   "sample_rate": 16000,
+   "n_mels": 80,
+   "embedding_dim": 256,
+   "layers": [
+     3,
+     4,
+     6,
+     3
+   ],
+   "channels": [
+     32,
+     64,
+     128,
+     256
+   ],
+   "pooling_output_dim": 5120
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f56204883f2de969f584af7893e5373575556b422190c14c206a5fd94f3d7fe6
+ size 26526952
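The three lines above are a Git LFS pointer, not the weights themselves. As a rough plausibility check, the sketch below parses the pointer and estimates the parameter count from the file size. The `parse_lfs_pointer` helper is hypothetical, and the estimate assumes float32 storage (4 bytes per value) and ignores the safetensors header, neither of which the commit states.

```python
def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:f56204883f2de969f584af7893e5373575556b422190c14c206a5fd94f3d7fe6\n"
    "size 26526952\n"
)

# Assumption: float32 weights, header size ignored.
approx_params = pointer["size"] / 4
print(f"{approx_params / 1e6:.2f}M values")
```

Under those assumptions the size corresponds to roughly 6.6 million values, in line with the parameter count quoted in the README.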