StephanST committed on
Commit c445f5f · verified · 1 Parent(s): 5a9f3f3

Upload h/mxfp8/README.md with huggingface_hub

Files changed (1)
  1. h/mxfp8/README.md +15 -2
h/mxfp8/README.md CHANGED
@@ -45,11 +45,23 @@ Against the local bf16 MLX bundle at `512x512` on 12 WALDO crop images:
 
  | Metric | Mean | Min |
  | --- | ---: | ---: |
- | Summary cosine | 0.990217 | 0.974710 |
- | Spatial cosine | 0.988696 | 0.976071 |
+ | Summary cosine | 0.990272 | 0.974978 |
+ | Spatial cosine | 0.988784 | 0.976665 |
 
  This is lower precision than the 8-bit affine bundle. Treat this as experimental.
 
+ ## Measured Speed
+
+ Fast-kernel compiled-forward MLX measurements at `512x512`, batch 1:
+
+ | Runtime | p50 latency | Throughput |
+ | --- | ---: | ---: |
+ | packed | 52.6 ms | 19.0 images/s |
+ | dequantize at load | 45.4 ms | 22.0 images/s |
+
+ `packed` keeps weights low-bit at runtime but is slower for this ViT encoder. Use
+ `--quantized-runtime dequantize` when latency matters; it expands weights to bf16 at load.
+
  ## Usage
 
  ```sh
@@ -59,6 +71,7 @@ cradio-mlx embed \
  --image image.jpg \
  --image-size 512 \
  --dtype bfloat16 \
+ --quantized-runtime dequantize \
  --save-npz embedding.npz
  ```
 
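
The speed table added in this commit reports p50 latency and throughput at batch 1, where the two should be roughly reciprocal. The sketch below is a quick sanity check of that relationship using only the figures from the table. It assumes the throughput column is consistent with the p50 latency (the README does not say how throughput was measured), and the helper names `throughput_from_latency_ms` and `check_table` are illustrative, not part of `cradio-mlx`.

```python
# Figures copied from the "Measured Speed" table in this commit (512x512, batch 1).
P50_LATENCY_MS = {"packed": 52.6, "dequantize at load": 45.4}
REPORTED_IMAGES_PER_S = {"packed": 19.0, "dequantize at load": 22.0}


def throughput_from_latency_ms(latency_ms: float, batch_size: int = 1) -> float:
    """Images per second implied by a per-batch latency (hypothetical helper)."""
    return batch_size * 1000.0 / latency_ms


def check_table(tolerance: float = 0.1) -> dict:
    """Return implied throughput per runtime, asserting it matches the table.

    Assumption: throughput is the reciprocal of p50 latency at batch 1.
    """
    implied = {}
    for runtime, latency in P50_LATENCY_MS.items():
        implied[runtime] = throughput_from_latency_ms(latency)
        assert abs(implied[runtime] - REPORTED_IMAGES_PER_S[runtime]) <= tolerance
    return implied


if __name__ == "__main__":
    for runtime, images_per_s in check_table().items():
        print(f"{runtime}: {images_per_s:.1f} images/s")
    # Fraction of p50 latency saved by expanding weights to bf16 at load.
    saved = 1.0 - P50_LATENCY_MS["dequantize at load"] / P50_LATENCY_MS["packed"]
    print(f"latency reduction from dequantizing at load: {saved:.1%}")
```

Both rows agree with the reciprocal relationship to within rounding, and the latency figures imply that `--quantized-runtime dequantize` cuts p50 latency by roughly 14% relative to `packed` for this measurement.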