Clément Duhamel commited on
Commit
ea65e74
·
1 Parent(s): 1ea97a6

Switch to mlprogram backend with ANE/GPU support

Browse files

Changes:
- Converted both encoders to CoreML mlprogram (modern backend)
- Disabled fuse_transpose_matmul pass to fix NaN on transformer attention
- Forced FLOAT32 precision to avoid convolution overflow
- Target: macOS 14+ / iOS 17+ for full ANE compatibility
- Numerically verified: audio cos_sim = 1.00000036, text cos_sim = 1.00000012
- Updated model card (README.md) and Git LFS tracking for mlprogram

Files changed (2) hide show
  1. .gitattributes +2 -2
  2. README.md +13 -7
.gitattributes CHANGED
@@ -1,6 +1,6 @@
1
  # CoreML models are large binaries — track via Git LFS
2
  *.mlmodel filter=lfs diff=lfs merge=lfs -text
3
- *.mlpackage/**/weight.bin filter=lfs diff=lfs merge=lfs -text
4
- *.mlpackage/**/model.mlmodel filter=lfs diff=lfs merge=lfs -text
5
  **/weight.bin filter=lfs diff=lfs merge=lfs -text
6
  **/model.mlmodel filter=lfs diff=lfs merge=lfs -text
 
1
  # CoreML models are large binaries — track via Git LFS
2
  *.mlmodel filter=lfs diff=lfs merge=lfs -text
3
+ *.mlpackage/Data/com.apple.CoreML/weights/weight.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.mlpackage/Data/com.apple.CoreML/model.mlmodel filter=lfs diff=lfs merge=lfs -text
5
  **/weight.bin filter=lfs diff=lfs merge=lfs -text
6
  **/model.mlmodel filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -18,8 +18,8 @@ This repository contains the student tinyCLAP encoder pair converted to CoreML:
18
 
19
  | File | Size | Purpose |
20
  |------|------|---------|
21
- | `TinyCLAP_AudioEncoder.mlmodel` | ~17 MB | PhiNet-based audio encoder. Consumes log-mel spectrograms and emits 1024-dim L2-normalized embeddings. |
22
- | `TinyCLAP_TextEncoder.mlmodel` | ~422 MB | BERT-base text encoder. Consumes tokenized text and emits 1024-dim L2-normalized embeddings. |
23
  | `mel_filter_bank.json` | ~170 KB | Pre-computed 64-bin mel filter bank (Slaney, 50–14000 Hz) for audio preprocessing. |
24
  | `text_embeddings.json` | ~220 KB | Sample text embeddings for sanity-checking the model output. |
25
 
@@ -32,7 +32,7 @@ STFT (n_fft=1024, hop=320, win=1024, Hann)
32
 
33
  Log-mel spectrogram (64 bins, 50–14000 Hz)
34
 
35
- TinyCLAP_AudioEncoder.mlmodel
36
 
37
  1024-dim L2-normalised embedding
38
  ```
@@ -42,7 +42,7 @@ Text query
42
 
43
  BERT tokenization (max_length=100)
44
 
45
- TinyCLAP_TextEncoder.mlmodel
46
 
47
  1024-dim L2-normalised embedding
48
  ```
@@ -71,12 +71,18 @@ Cosine similarity between the two embeddings yields semantic relevance scores.
71
 
72
  ## CoreML conversion details
73
 
74
- - **Backend**: `neuralnetwork` (`convert_to="neuralnetwork"`)
75
- - **Compute units**: `CPU_AND_GPU`
 
 
76
  - **Audio encoder input**: flexible time dimension (640 frames nominal)
77
  - **Text encoder input**: fixed `[1, 100]` for both `input_ids` and `attention_mask`
78
 
79
- The `mlprogram` backend was attempted but produced NaN outputs with PhiNet's BatchNorm statistics; `neuralnetwork` gives bit-exact results.
 
 
 
 
80
 
81
  ## Original work
82
 
 
18
 
19
  | File | Size | Purpose |
20
  |------|------|---------|
21
+ | `TinyCLAP_AudioEncoder.mlpackage` | ~17 MB | PhiNet-based audio encoder. Consumes log-mel spectrograms and emits 1024-dim L2-normalized embeddings. |
22
+ | `TinyCLAP_TextEncoder.mlpackage` | ~422 MB | BERT-base text encoder. Consumes tokenized text and emits 1024-dim L2-normalized embeddings. |
23
  | `mel_filter_bank.json` | ~170 KB | Pre-computed 64-bin mel filter bank (Slaney, 50–14000 Hz) for audio preprocessing. |
24
  | `text_embeddings.json` | ~220 KB | Sample text embeddings for sanity-checking the model output. |
25
 
 
32
 
33
  Log-mel spectrogram (64 bins, 50–14000 Hz)
34
 
35
+ TinyCLAP_AudioEncoder.mlpackage
36
 
37
  1024-dim L2-normalised embedding
38
  ```
 
42
 
43
  BERT tokenization (max_length=100)
44
 
45
+ TinyCLAP_TextEncoder.mlpackage
46
 
47
  1024-dim L2-normalised embedding
48
  ```
 
71
 
72
  ## CoreML conversion details
73
 
74
+ - **Backend**: `mlprogram` (`convert_to="mlprogram"`)
75
+ - **Compute units**: `ALL` (CPU + GPU + Apple Neural Engine)
76
+ - **Precision**: `FLOAT32`
77
+ - **Minimum deployment target**: macOS 14 / iOS 17
78
  - **Audio encoder input**: flexible time dimension (640 frames nominal)
79
  - **Text encoder input**: fixed `[1, 100]` for both `input_ids` and `attention_mask`
80
 
81
+ The `neuralnetwork` backend was used initially but forced CPU-only execution due to internal shape-validation warnings. The switch to `mlprogram` with the `fuse_transpose_matmul` optimization disabled and `FLOAT32` precision enabled achieves bit-exact numerical parity with PyTorch while unlocking ANE/GPU acceleration.
82
+
83
+ Numerical verification:
84
+ - Audio encoder: cosine similarity ≈ 1.00000036 vs PyTorch
85
+ - Text encoder: cosine similarity ≈ 1.00000012 vs PyTorch
86
 
87
  ## Original work
88