Commit ce1c84b (verified), committed by StephanST
1 Parent(s): c445f5f

Upload README.md with huggingface_hub

Files changed (1): README.md (+14 -21)
@@ -53,10 +53,10 @@ Measured against local bf16 MLX bundles at `512x512` on 12 WALDO crop images.
 
 | Bundle | Summary cosine mean/min | Spatial cosine mean/min |
 | --- | ---: | ---: |
-| `so400m/8bit-affine` | 0.999913 / 0.999885 | 0.999927 / 0.999881 |
-| `h/8bit-affine` | 0.999907 / 0.999884 | 0.999828 / 0.999761 |
-| `so400m/mxfp8` | 0.989676 / 0.949449 | 0.993379 / 0.978096 |
-| `h/mxfp8` | 0.990272 / 0.974978 | 0.988784 / 0.976665 |
+| `so400m/8bit-affine` | 0.999907 / 0.999868 | 0.999930 / 0.999876 |
+| `h/8bit-affine` | 0.999899 / 0.999878 | 0.999830 / 0.999764 |
+| `so400m/mxfp8` | 0.989820 / 0.950717 | 0.993502 / 0.977879 |
+| `h/mxfp8` | 0.990217 / 0.974710 | 0.988696 / 0.976071 |
 
 The 8-bit affine bundles are the recommended compact/high-precision artifacts. The
 `mxfp8` bundles are included for experimentation and are lower precision in these checks.
@@ -65,21 +65,16 @@ The 8-bit affine bundles are the recommended compact/high-precision artifacts. T
 
 Fast-kernel compiled-forward MLX measurements on Apple Silicon at `512x512`, batch 1:
 
-| Bundle | Runtime | p50 latency | Throughput |
-| --- | --- | ---: | ---: |
-| `so400m/8bit-affine` | packed | 47.1 ms | 21.2 images/s |
-| `so400m/8bit-affine` | dequantize at load | 32.4 ms | 30.9 images/s |
-| `h/8bit-affine` | packed | 58.8 ms | 17.0 images/s |
-| `h/8bit-affine` | dequantize at load | 45.5 ms | 22.0 images/s |
-| `so400m/mxfp8` | packed | 49.8 ms | 20.1 images/s |
-| `so400m/mxfp8` | dequantize at load | 32.5 ms | 30.8 images/s |
-| `h/mxfp8` | packed | 52.6 ms | 19.0 images/s |
-| `h/mxfp8` | dequantize at load | 45.4 ms | 22.0 images/s |
-
-`packed` keeps weights low-bit during inference and reduces runtime weight memory, but it
-is slower than dense bf16 on this ViT encoder. `dequantize at load` expands the compact
-artifact to bf16 weights once during load, then uses the dense MLX kernels; it recovers
-bf16-class throughput while using bf16 runtime weight memory.
+| Bundle | p50 latency | Throughput |
+| --- | ---: | ---: |
+| `so400m/8bit-affine` | 47.1 ms | 21.2 images/s |
+| `h/8bit-affine` | 58.8 ms | 17.0 images/s |
+| `so400m/mxfp8` | 49.8 ms | 20.1 images/s |
+| `h/mxfp8` | 52.6 ms | 19.0 images/s |
+
+These are packed low-bit runtime measurements. The quantized bundles prioritize compact
+storage and lower runtime weight memory. For latency-sensitive inference, the bf16 bundles
+in the implementation repo remain faster when they fit.
 
 ## Usage
 
@@ -93,7 +88,6 @@ cradio-mlx embed \
 --image image.jpg \
 --image-size 512 \
 --dtype bfloat16 \
---quantized-runtime dequantize \
 --save-npz embedding.npz
 ```
 
@@ -106,7 +100,6 @@ cradio-mlx embed \
 --image image.jpg \
 --image-size 512 \
 --dtype bfloat16 \
---quantized-runtime dequantize \
 --save-npz embedding.npz
 ```
 
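
For readers who want to sanity-check numbers like those in the README tables, the following is a minimal sketch of how a "cosine mean/min" pair could be computed between reference (bf16) and quantized embeddings, and how batch-1 throughput relates to p50 latency. The function names `cosine_stats` and `throughput_from_latency_ms` are hypothetical, the inputs are plain NumPy arrays rather than anything read from the bundles, and flattening each image's features into one vector before comparing is an assumption; the README does not say how its spatial cosine is aggregated.

```python
import numpy as np


def cosine_stats(reference, candidate, eps=1e-12):
    """Return (mean, min) cosine similarity across images.

    Both inputs have shape (num_images, ...). Each image's features are
    flattened to a single vector before comparison (an assumption, not
    necessarily how the README's figures were produced).
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError("reference and candidate must have the same shape")
    ref = ref.reshape(ref.shape[0], -1)
    cand = cand.reshape(cand.shape[0], -1)
    dots = np.sum(ref * cand, axis=1)
    norms = np.linalg.norm(ref, axis=1) * np.linalg.norm(cand, axis=1)
    cos = dots / np.maximum(norms, eps)  # eps guards against zero vectors
    return float(cos.mean()), float(cos.min())


def throughput_from_latency_ms(latency_ms):
    """Images per second at batch size 1, given per-image latency in ms."""
    return 1000.0 / latency_ms
```

At batch 1 the throughput column is consistent with the reciprocal of the p50 latency: `throughput_from_latency_ms(47.1)` is about 21.2 images/s, matching the `so400m/8bit-affine` row.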