StephanST committed on
Commit 5a9f3f3 · verified · 1 Parent(s): 0acd7c6

Upload so400m/mxfp8/README.md with huggingface_hub

Files changed (1)
  1. so400m/mxfp8/README.md +15 -2
so400m/mxfp8/README.md CHANGED
@@ -45,11 +45,23 @@ Against the local bf16 MLX bundle at `512x512` on 12 WALDO crop images:
   
  | Metric | Mean | Min |
  | --- | ---: | ---: |
- | Summary cosine | 0.989820 | 0.950717 |
- | Spatial cosine | 0.993502 | 0.977879 |
+ | Summary cosine | 0.989676 | 0.949449 |
+ | Spatial cosine | 0.993379 | 0.978096 |
   
  This is lower precision than the 8-bit affine bundle. Treat this as experimental.
   
+ ## Measured Speed
+
+ Fast-kernel compiled-forward MLX measurements at `512x512`, batch 1:
+
+ | Runtime | p50 latency | Throughput |
+ | --- | ---: | ---: |
+ | packed | 49.8 ms | 20.1 images/s |
+ | dequantize at load | 32.5 ms | 30.8 images/s |
+
+ `packed` keeps weights low-bit at runtime but is slower for this ViT encoder. Use
+ `--quantized-runtime dequantize` when latency matters; it expands weights to bf16 at load.
+
  ## Usage
   
  ```sh
@@ -59,6 +71,7 @@ cradio-mlx embed \
  --image image.jpg \
  --image-size 512 \
  --dtype bfloat16 \
+ --quantized-runtime dequantize \
  --save-npz embedding.npz
  ```
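
Reviewer note: the Throughput column in the added Measured Speed table looks consistent with taking the reciprocal of the p50 latency at batch 1. The README does not say how throughput was computed, so that relationship is an assumption here. A minimal sketch of the arithmetic follows; the dictionary and the helper name `images_per_second` are illustrative and not part of `cradio-mlx`:

```python
# p50 latencies in milliseconds, copied from the Measured Speed table.
P50_LATENCY_MS = {"packed": 49.8, "dequantize at load": 32.5}


def images_per_second(latency_ms: float, batch_size: int = 1) -> float:
    """Assumed relation: throughput = batch size / latency (hypothetical helper)."""
    return batch_size * 1000.0 / latency_ms


# Round to one decimal place to match the precision used in the table.
throughput = {
    name: round(images_per_second(ms), 1) for name, ms in P50_LATENCY_MS.items()
}

# Latency ratio between the two runtimes (packed relative to dequantize at load).
speedup = P50_LATENCY_MS["packed"] / P50_LATENCY_MS["dequantize at load"]

for name, value in throughput.items():
    print(f"{name}: {value} images/s")
print(f"dequantize at load is about {speedup:.2f}x faster than packed")
```

Under that assumption the computed values match the table (20.1 and 30.8 images/s), and `dequantize at load` comes out roughly 1.5x faster than `packed` at p50.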