Clément Duhamel committed on
Commit · ea65e74
1 Parent(s): 1ea97a6
Switch to mlprogram backend with ANE/GPU support
Changes:
- Converted both encoders to CoreML mlprogram (modern backend)
- Disabled fuse_transpose_matmul pass to fix NaN on transformer attention
- Forced FLOAT32 precision to avoid convolution overflow
- Target: macOS 14+ / iOS 17+ for full ANE compatibility
- Numerically verified: audio cos_sim = 1.00000036, text cos_sim = 1.00000012
- Updated model card (README.md) and Git LFS tracking for mlprogram
- .gitattributes +2 -2
- README.md +13 -7
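The conversion settings listed in the commit message could be expressed roughly as follows. This is a minimal sketch, not the author's actual script: it assumes `coremltools` 7 or later, and the names `convert_encoder`, `traced_model`, and `inputs` are hypothetical. `ct.target.iOS17` is used as the deployment target on the assumption that it also covers the macOS 14 requirement.

```python
# Sketch only: requires coremltools >= 7 and an already-traced PyTorch module.
# The pass name is taken from the commit message; the helper name is hypothetical.
DISABLED_PASSES = ["common::fuse_transpose_matmul"]


def convert_encoder(traced_model, inputs):
    """Convert a traced PyTorch encoder to a CoreML mlprogram model."""
    import coremltools as ct  # imported lazily so this module loads without it

    # Start from the default optimization pipeline and drop the pass that
    # the commit message reports as causing NaNs in transformer attention.
    pipeline = ct.PassPipeline.DEFAULT
    pipeline.remove_passes(DISABLED_PASSES)

    return ct.convert(
        traced_model,
        inputs=inputs,
        convert_to="mlprogram",
        compute_precision=ct.precision.FLOAT32,  # avoid FP16 overflow
        compute_units=ct.ComputeUnit.ALL,        # CPU + GPU + ANE
        minimum_deployment_target=ct.target.iOS17,
        pass_pipeline=pipeline,
    )
```

The returned model would then be written out with `.save("<name>.mlpackage")`. The `inputs` list is expected to hold `ct.TensorType` entries describing each encoder's input shapes, which the sketch leaves to the caller.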
.gitattributes CHANGED
@@ -1,6 +1,6 @@
 # CoreML models are large binaries — track via Git LFS
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.mlpackage/
-*.mlpackage/
+*.mlpackage/Data/com.apple.CoreML/weights/weight.bin filter=lfs diff=lfs merge=lfs -text
+*.mlpackage/Data/com.apple.CoreML/model.mlmodel filter=lfs diff=lfs merge=lfs -text
 **/weight.bin filter=lfs diff=lfs merge=lfs -text
 **/model.mlmodel filter=lfs diff=lfs merge=lfs -text
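To see which paths the explicit `*.mlpackage/...` patterns cover compared with the `**/` fallbacks, the matching rules can be approximated in a few lines. This is a simplified sketch that handles only `*`, a leading `**/`, and literal characters. It is not a full implementation of Git's pattern syntax, and the helper names are hypothetical. In a real checkout, `git check-attr filter -- <path>` is the authoritative check.

```python
import re


def pattern_to_regex(pattern):
    """Translate a small subset of gitattributes glob syntax to a regex."""
    prefix = ""
    if pattern.startswith("**/"):
        # A leading '**/' matches in all directories.
        prefix = r"(?:.*/)?"
        pattern = pattern[3:]
    elif "/" not in pattern:
        # Patterns without a slash match the basename at any depth.
        prefix = r"(?:.*/)?"
    # Otherwise the pattern is anchored to the .gitattributes directory.
    # '*' never crosses a directory separator.
    body = "".join("[^/]*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(prefix + body + r"\Z")


def matches(pattern, path):
    return pattern_to_regex(pattern).match(path) is not None


weights = "TinyCLAP_AudioEncoder.mlpackage/Data/com.apple.CoreML/weights/weight.bin"
explicit = "*.mlpackage/Data/com.apple.CoreML/weights/weight.bin"

top_level = matches(explicit, weights)                 # package at repo root
nested = matches(explicit, "models/" + weights)        # package in a subfolder
fallback = matches("**/weight.bin", "models/" + weights)
```

Under these rules the explicit pattern covers only packages at the repository root, while the `**/weight.bin` fallback also catches packages nested in subdirectories.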
README.md CHANGED
@@ -18,8 +18,8 @@ This repository contains the student tinyCLAP encoder pair converted to CoreML:
 
 | File | Size | Purpose |
 |------|------|---------|
-| `TinyCLAP_AudioEncoder.
-| `TinyCLAP_TextEncoder.
+| `TinyCLAP_AudioEncoder.mlpackage` | ~17 MB | PhiNet-based audio encoder. Consumes log-mel spectrograms and emits 1024-dim L2-normalized embeddings. |
+| `TinyCLAP_TextEncoder.mlpackage` | ~422 MB | BERT-base text encoder. Consumes tokenized text and emits 1024-dim L2-normalized embeddings. |
 | `mel_filter_bank.json` | ~170 KB | Pre-computed 64-bin mel filter bank (Slaney, 50–14000 Hz) for audio preprocessing. |
 | `text_embeddings.json` | ~220 KB | Sample text embeddings for sanity-checking the model output. |
 
@@ -32,7 +32,7 @@ STFT (n_fft=1024, hop=320, win=1024, Hann)
 ↓
 Log-mel spectrogram (64 bins, 50–14000 Hz)
 ↓
-TinyCLAP_AudioEncoder.
+TinyCLAP_AudioEncoder.mlpackage
 ↓
 1024-dim L2-normalised embedding
 ```
@@ -42,7 +42,7 @@ Text query
 ↓
 BERT tokenization (max_length=100)
 ↓
-TinyCLAP_TextEncoder.
+TinyCLAP_TextEncoder.mlpackage
 ↓
 1024-dim L2-normalised embedding
 ```
@@ -71,12 +71,18 @@ Cosine similarity between the two embeddings yields semantic relevance scores.
 
 ## CoreML conversion details
 
-- **Backend**: `
-- **Compute units**: `
+- **Backend**: `mlprogram` (`convert_to="mlprogram"`)
+- **Compute units**: `ALL` (CPU + GPU + Apple Neural Engine)
+- **Precision**: `FLOAT32`
+- **Minimum deployment target**: macOS 14 / iOS 17
 - **Audio encoder input**: flexible time dimension (640 frames nominal)
 - **Text encoder input**: fixed `[1, 100]` for both `input_ids` and `attention_mask`
 
-The `
+The `neuralnetwork` backend was used initially but forced CPU-only execution due to internal shape-validation warnings. The switch to `mlprogram` with the `fuse_transpose_matmul` optimization disabled and `FLOAT32` precision enabled achieves bit-exact numerical parity with PyTorch while unlocking ANE/GPU acceleration.
+
+Numerical verification:
+- Audio encoder: cosine similarity ≈ 1.00000036 vs PyTorch
+- Text encoder: cosine similarity ≈ 1.00000012 vs PyTorch
 
 ## Original work
 
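The numerical verification reported in the README hunk can be reproduced with a check along these lines. This is a minimal sketch using only the standard library: the function names and the `1e-5` tolerance are assumptions, and the embeddings are assumed to have been exported as flat lists of floats from the PyTorch and CoreML runs. Values slightly above 1, such as 1.00000036, are consistent with float32 rounding (float32 epsilon is about 1.19e-7). A cosine near 1 shows that the embedding directions agree, but it does not by itself establish bit-identical outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embeddings_match(reference, candidate, tol=1e-5):
    """True if the two embeddings point in the same direction within tol.

    `tol` is an assumed threshold, not a value taken from the commit.
    """
    return abs(cosine_similarity(reference, candidate) - 1.0) <= tol
```

Here `reference` would hold the PyTorch embedding and `candidate` the CoreML one for the same input. Comparing the maximum absolute element-wise difference as well gives a stricter check than cosine similarity alone.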