StephanST committed on
Commit 3a7fe72 · verified · 1 Parent(s): ae3a673

Upload README.md with huggingface_hub

  1. README.md +35 -8
README.md CHANGED
@@ -30,6 +30,8 @@ https://github.com/stephansturges/c-radio_v4_MLX
30
  | --- | --- | --- | --- |
31
  | `so400m/8bit-affine` | `nvidia/C-RADIOv4-SO400M` | 8-bit affine, group size 64 | Compact/high-precision |
32
  | `h/8bit-affine` | `nvidia/C-RADIOv4-H` | 8-bit affine, group size 64 | Compact/high-precision |
33
  | `so400m/mxfp8` | `nvidia/C-RADIOv4-SO400M` | `mxfp8`, group size 32 | Experimental/lower precision |
34
  | `h/mxfp8` | `nvidia/C-RADIOv4-H` | `mxfp8`, group size 32 | Experimental/lower precision |
35
 
@@ -58,23 +60,35 @@ Measured against local bf16 MLX bundles at `512x512` on 12 WALDO crop images.
58
  | `so400m/mxfp8` | 0.989820 / 0.950717 | 0.993502 / 0.977879 |
59
  | `h/mxfp8` | 0.990217 / 0.974710 | 0.988696 / 0.976071 |
60
 
61
- The 8-bit affine bundles are the recommended compact/high-precision artifacts. The
62
- `mxfp8` bundles are included for experimentation and are lower precision in these checks.
63
 
64
  ## Speed Summary
65
 
66
- Fast-kernel compiled-forward MLX measurements on Apple Silicon at `512x512`, batch 1:
67
 
68
  | Bundle | p50 latency | Throughput |
69
  | --- | ---: | ---: |
70
- | `so400m/8bit-affine` | 47.1 ms | 21.2 images/s |
71
- | `h/8bit-affine` | 58.8 ms | 17.0 images/s |
72
  | `so400m/mxfp8` | 49.8 ms | 20.1 images/s |
73
  | `h/mxfp8` | 52.6 ms | 19.0 images/s |
74
 
75
- These are packed low-bit runtime measurements. The quantized bundles prioritize compact
76
- storage and lower runtime weight memory. For latency-sensitive inference, the bf16 bundles
77
- in the implementation repo remain faster when they fit.
78
 
79
  ## Usage
80
 
@@ -103,6 +117,19 @@ cradio-mlx embed \
103
  --save-npz embedding.npz
104
  ```
105
 
106
  ## License
107
 
108
  The implementation code in `c-radio_v4_MLX` is MIT licensed. The model weights and these
 
30
  | --- | --- | --- | --- |
31
  | `so400m/8bit-affine` | `nvidia/C-RADIOv4-SO400M` | 8-bit affine, group size 64 | Compact/high-precision |
32
  | `h/8bit-affine` | `nvidia/C-RADIOv4-H` | 8-bit affine, group size 64 | Compact/high-precision |
33
+ | `so400m/cider-w8a8` | `nvidia/C-RADIOv4-SO400M` | Cider W8A8, per-channel | M5+ compact/runtime low-bit |
34
+ | `h/cider-w8a8` | `nvidia/C-RADIOv4-H` | Cider W8A8, per-channel | M5+ compact/runtime low-bit |
35
  | `so400m/mxfp8` | `nvidia/C-RADIOv4-SO400M` | `mxfp8`, group size 32 | Experimental/lower precision |
36
  | `h/mxfp8` | `nvidia/C-RADIOv4-H` | `mxfp8`, group size 32 | Experimental/lower precision |
37
 
 
60
  | `so400m/mxfp8` | 0.989820 / 0.950717 | 0.993502 / 0.977879 |
61
  | `h/mxfp8` | 0.990217 / 0.974710 | 0.988696 / 0.976071 |
62
 
63
+ The 8-bit affine bundles are the recommended compact/high-precision artifacts. Cider W8A8
64
+ is a real low-bit runtime path for Apple M5+ machines and trades a little more embedding
65
+ drift for lower memory and modest speedups in some cells. The `mxfp8` bundles are included
66
+ for experimentation and are lower precision in these checks.
67
+
68
+ Smoke-image Cider W8A8 precision versus local bf16 MLX at `512x512`:
69
+
70
+ | Bundle | Summary cosine | Spatial cosine |
71
+ | --- | ---: | ---: |
72
+ | `so400m/cider-w8a8` | 0.998164 | 0.998889 |
73
+ | `h/cider-w8a8` | 0.997202 | 0.996210 |
74
 
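The README does not say how the summary and spatial cosine figures are computed. As a minimal sketch, assuming the summary embedding is a single vector and the spatial features are a `(tokens, dim)` array, a comparison between a quantized bundle and the bf16 reference could look like this. The function names and the synthetic arrays are illustrative only and are not part of `cradio-mlx`.

```python
import numpy as np


def cosine(a, b) -> float:
    """Cosine similarity between two arrays, each flattened to one vector."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mean_token_cosine(a, b) -> float:
    """Mean of per-token cosine similarities for (tokens, dim) feature maps."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return float(np.mean(num / den))


# Synthetic stand-ins for a bf16 reference and a slightly perturbed
# "quantized" output; real inputs would come from two model runs.
rng = np.random.default_rng(0)
reference = rng.standard_normal((16, 8))
quantized = reference + 0.01 * rng.standard_normal((16, 8))

print("summary-style cosine:", cosine(reference[0], quantized[0]))
print("spatial-style cosine:", mean_token_cosine(reference, quantized))
```

Values close to 1.0 indicate that the quantized output points in nearly the same direction as the reference.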
75
  ## Speed Summary
76
 
77
+ MLX measurements on Apple M5 Max at `512x512`, batch 1:
78
 
79
  | Bundle | p50 latency | Throughput |
80
  | --- | ---: | ---: |
81
+ | `so400m/8bit-affine` | 49.6 ms | 20.2 images/s |
82
+ | `h/8bit-affine` | 74.2 ms | 13.5 images/s |
83
+ | `so400m/cider-w8a8` | 32.5 ms | 30.8 images/s |
84
+ | `h/cider-w8a8` | 47.1 ms | 21.2 images/s |
85
  | `so400m/mxfp8` | 49.8 ms | 20.1 images/s |
86
  | `h/mxfp8` | 52.6 ms | 19.0 images/s |
87
 
88
+ These are packed low-bit runtime measurements. The MLX affine and `mxfp8` bundles
89
+ prioritize compact storage and lower runtime weight memory over throughput. Cider W8A8 is
90
+ the faster low-bit runtime path found so far, but it requires Apple M5+ hardware and the
91
+ optional Cider package.
92
 
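The throughput column in the updated table is consistent with `1000 / p50 latency` at batch 1. The short sketch below reproduces that arithmetic from the latencies in the table and also prints the latency ratio between the 8-bit affine and Cider W8A8 bundles, which comes out at roughly 1.5x for both model sizes. The helper name is illustrative.

```python
def images_per_second(p50_latency_ms: float, batch_size: int = 1) -> float:
    """Throughput implied by a per-batch latency given in milliseconds."""
    return batch_size * 1000.0 / p50_latency_ms


# p50 latencies (ms) copied from the updated Speed Summary table.
latencies_ms = {
    "so400m/8bit-affine": 49.6,
    "h/8bit-affine": 74.2,
    "so400m/cider-w8a8": 32.5,
    "h/cider-w8a8": 47.1,
    "so400m/mxfp8": 49.8,
    "h/mxfp8": 52.6,
}

for name, ms in latencies_ms.items():
    print(f"{name}: {images_per_second(ms):.1f} images/s")

# Latency ratio of 8-bit affine to Cider W8A8 for each model size.
for model in ("so400m", "h"):
    ratio = latencies_ms[f"{model}/8bit-affine"] / latencies_ms[f"{model}/cider-w8a8"]
    print(f"{model}: Cider W8A8 is {ratio:.2f}x faster than 8-bit affine")
```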
93
  ## Usage
94
 
 
117
  --save-npz embedding.npz
118
  ```
119
 
120
+ Cider W8A8 bundles require Python `>=3.12`, Apple M5+ hardware, and Cider:
121
+
122
+ ```sh
123
+ python -m pip install "cider @ git+https://github.com/Mininglamp-AI/cider.git"
124
+ cradio-mlx embed \
125
+ --backend mlx-h \
126
+ --checkpoint h/cider-w8a8 \
127
+ --image image.jpg \
128
+ --image-size 512 \
129
+ --dtype bfloat16 \
130
+ --save-npz embedding.npz
131
+ ```
132
+
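The commands above write their output with `--save-npz`. The README does not list the array names stored in that archive, so the sketch below makes no assumption about them and simply reports whatever arrays it finds, using NumPy's standard `.npz` loader. The helper name `describe_npz` is illustrative.

```python
import os

import numpy as np


def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }


# Inspect the file produced by `cradio-mlx embed ... --save-npz embedding.npz`,
# if it exists in the current directory.
if os.path.exists("embedding.npz"):
    for name, (shape, dtype) in describe_npz("embedding.npz").items():
        print(f"{name}: shape={shape}, dtype={dtype}")
```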
133
  ## License
134
 
135
  The implementation code in `c-radio_v4_MLX` is MIT licensed. The model weights and these