bwarzecha committed on
Commit 95c7b88 · verified · 1 Parent(s): 3617748

Add README with model card and license attribution

Files changed (1)
  1. README.md +68 -0
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ license: other
+ license_name: mixed
+ license_link: LICENSE.md
+ tags:
+ - speaker-diarization
+ - coreml
+ - apple
+ - macos
+ - ios
+ - sortformer
+ - wespeaker
+ - speaker-embedding
+ language:
+ - en
+ pipeline_tag: audio-classification
+ ---
+
+ # Speaker Diarization CoreML Models
+
+ CoreML conversions of speaker diarization and speaker embedding models for on-device inference on Apple platforms.
+
+ ## Models
+
+ | Model | Original | Size | Description |
+ |-------|----------|------|-------------|
+ | `sortformer_4spk_v21.mlpackage` | [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1) | 441 MB | Sortformer diarization model: end-to-end neural speaker diarization for up to 4 speakers, streaming-capable |
+ | `wespeaker_resnet34.mlpackage` | [WeSpeaker ResNet34](https://github.com/wenet-e2e/wespeaker) | 25 MB | ResNet34 speaker embedding model: extracts 256-dim speaker embeddings for speaker verification and identification |
+
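The table mentions using the 256-dim embeddings for speaker verification. A common way to compare two embeddings is cosine similarity; the sketch below illustrates that idea in plain Swift. The function names and the 0.5 threshold are assumptions made for illustration and do not come from this repository or from WeSpeaker.

```swift
import Foundation

/// Cosine similarity between two equal-length embedding vectors.
/// Returns 0 when the lengths differ, the vectors are empty, or either norm is zero.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    guard a.count == b.count, !a.isEmpty else { return 0 }
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denom = normA.squareRoot() * normB.squareRoot()
    return denom > 0 ? dot / denom : 0
}

/// Hypothetical decision rule: treat two embeddings as the same speaker
/// when their similarity meets the threshold. 0.5 is a placeholder, not a tuned value.
func isSameSpeaker(_ a: [Float], _ b: [Float], threshold: Float = 0.5) -> Bool {
    return cosineSimilarity(a, b) >= threshold
}
```

In practice the threshold would be tuned on labeled pairs from the target domain, since a good cut-off depends on the recording conditions.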
+ ## Format
+
+ Both models are in Apple `.mlpackage` format (FP32). On first load, CoreML compiles them to `.mlmodelc` and caches the compiled version so that subsequent loads are fast.
+
+ - **Sortformer**: Input `mel_features (B, 128, T)` → Output `speaker_probs (B, T/8, 4)`, sigmoid probabilities per speaker per frame
+ - **ResNet34**: Input `fbank_features (1, 80, T)` → Output `embedding (1, 256)`, a speaker embedding vector
+
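To make the Sortformer shapes above concrete, here is a minimal sketch of how the `(T/8, 4)` probability grid for one batch item could be turned into per-frame speaker activity. Because the outputs are independent sigmoids, more than one speaker can be active in a frame. The helper names, the use of integer division for `T/8`, and the 0.5 threshold are assumptions for illustration, not part of the model or the library.

```swift
import Foundation

/// Number of output frames for `melFrames` input frames, following the T/8 relation.
/// Integer division is an assumption; the model may treat edge frames differently.
func outputFrameCount(melFrames: Int) -> Int {
    return melFrames / 8
}

/// For each output frame, the indices of speakers whose probability meets the threshold.
/// `probs` is one batch item laid out as [frame][speaker], i.e. shape (T/8, 4).
func activeSpeakers(probs: [[Float]], threshold: Float = 0.5) -> [[Int]] {
    return probs.map { frame in
        frame.enumerated()
            .filter { $0.element >= threshold }
            .map { $0.offset }
    }
}

// Toy example: 3 output frames, 4 speaker slots.
let exampleProbs: [[Float]] = [
    [0.9, 0.1, 0.0, 0.0],  // only speaker 0 is active
    [0.8, 0.7, 0.1, 0.0],  // speakers 0 and 1 overlap
    [0.2, 0.1, 0.3, 0.0],  // nobody crosses the threshold
]
print(outputFrameCount(melFrames: 800))      // 100
print(activeSpeakers(probs: exampleProbs))   // [[0], [0, 1], []]
```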
+ ## Usage
+
+ These models are designed for use with the [AxiiDiarization](https://github.com/AugustDev/AxiiDiarization) Swift library:
+
+ ```swift
+ import AxiiDiarization
+
+ let pipeline = try DiarizationPipeline(
+     sortformerModelPath: "path/to/sortformer_4spk_v21.mlpackage",
+     embModelPath: "path/to/wespeaker_resnet34.mlpackage"
+ )
+
+ let result = try pipeline.run(samples: audioSamples)
+ for segment in result.segments {
+     print("\(segment.speaker.label): \(segment.start)s - \(segment.end)s")
+ }
+ ```
+
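As a follow-up to the usage example, the sketch below shows one way the resulting segments might be summarized into total speaking time per speaker. The `Segment` struct here is a hypothetical stand-in with a flattened `speakerLabel` field; AxiiDiarization's real types are not shown in this README and may differ.

```swift
import Foundation

/// Hypothetical stand-in for a diarization segment (not the library's actual type).
struct Segment {
    let speakerLabel: String
    let start: Double  // seconds
    let end: Double    // seconds
}

/// Sums segment durations per speaker label, ignoring negative durations.
func speakingTime(_ segments: [Segment]) -> [String: Double] {
    var totals: [String: Double] = [:]
    for segment in segments {
        totals[segment.speakerLabel, default: 0] += max(0, segment.end - segment.start)
    }
    return totals
}

// Toy input with two speakers.
let exampleSegments = [
    Segment(speakerLabel: "A", start: 0.0, end: 1.5),
    Segment(speakerLabel: "B", start: 1.5, end: 2.0),
    Segment(speakerLabel: "A", start: 2.0, end: 3.5),
]
let totals = speakingTime(exampleSegments)
print(totals["A"] ?? 0)  // 3.0
print(totals["B"] ?? 0)  // 0.5
```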
+ ## Licenses
+
+ The models in this repository are covered by separate licenses from their original authors:
+
+ - **Sortformer v2.1**: Licensed by NVIDIA Corporation under the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Commercial use permitted. See original model card: [nvidia/diar_streaming_sortformer_4spk-v2.1](https://huggingface.co/nvidia/diar_streaming_sortformer_4spk-v2.1)
+
+ - **WeSpeaker ResNet34**: Licensed under [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0). See original project: [wenet-e2e/wespeaker](https://github.com/wenet-e2e/wespeaker)
+
+ The CoreML conversion code and this repository are MIT-licensed.
+
+ ## Acknowledgments
+
+ - NVIDIA NeMo team for the Sortformer diarization model
+ - WeSpeaker / WeNet team for the ResNet34 speaker embedding model