How to use StephanST/C-radiov4_quantized with MLX. Download the model from the Hub:

pip install "huggingface_hub[hf_xet]"
huggingface-cli download --local-dir C-radiov4_quantized StephanST/C-radiov4_quantized
Upload README.md with huggingface_hub
README.md
CHANGED
@@ -53,10 +53,10 @@ Measured against local bf16 MLX bundles at `512x512` on 12 WALDO crop images.

 | Bundle | Summary cosine mean/min | Spatial cosine mean/min |
 | --- | ---: | ---: |
-| `so400m/8bit-affine` | 0.
-| `h/8bit-affine` | 0.
-| `so400m/mxfp8` | 0.
-| `h/mxfp8` | 0.

 The 8-bit affine bundles are the recommended compact/high-precision artifacts. The
 `mxfp8` bundles are included for experimentation and are lower precision in these checks.
@@ -65,21 +65,16 @@ The 8-bit affine bundles are the recommended compact/high-precision artifacts. T

 Fast-kernel compiled-forward MLX measurements on Apple Silicon at `512x512`, batch 1:

-| Bundle |
-| --- | ---
-| `so400m/8bit-affine` |
-| `
-| `
-| `h/
-
-
-
-
-
-`packed` keeps weights low-bit during inference and reduces runtime weight memory, but it
-is slower than dense bf16 on this ViT encoder. `dequantize at load` expands the compact
-artifact to bf16 weights once during load, then uses the dense MLX kernels; it recovers
-bf16-class throughput while using bf16 runtime weight memory.

 ## Usage

@@ -93,7 +88,6 @@ cradio-mlx embed \
   --image image.jpg \
   --image-size 512 \
   --dtype bfloat16 \
-  --quantized-runtime dequantize \
   --save-npz embedding.npz
 ```

@@ -106,7 +100,6 @@ cradio-mlx embed \
   --image image.jpg \
   --image-size 512 \
   --dtype bfloat16 \
-  --quantized-runtime dequantize \
   --save-npz embedding.npz
 ```

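The paragraph removed in the second hunk contrasts `packed` (low-bit weights stay resident) with `dequantize at load` (weights are expanded to bf16 once). A back-of-envelope bit count shows the memory side of that trade-off. This is not code from the cradio-mlx implementation, and it rests on several assumptions. Affine 8-bit weights are assumed to use groups of 64 with one bf16 scale and one bf16 bias per group, which matches MLX's default group size, but the group size used for these bundles is not stated here. The mxfp8 layout is taken from the OCP Microscaling spec, where 32 elements share one 8-bit scale. The parameter count of 400M is hypothetical, and tensors left unquantized, such as norms, are ignored.

```python
# Back-of-envelope weight-memory estimate. All constants are assumptions,
# not values read from the C-radiov4_quantized bundles.


def bits_per_weight(mode: str, group_size: int = 64, scale_bits: int = 16) -> float:
    """Average storage bits per weight, including per-group metadata.

    - "bf16":    dense 16-bit weights (what `dequantize at load` keeps resident).
    - "affine8": 8-bit codes + one scale and one bias per group, each
                 `scale_bits` wide (assumed bf16 and group_size=64).
    - "mxfp8":   8-bit FP elements + one shared 8-bit scale per 32 elements
                 (OCP Microscaling block layout).
    """
    if mode == "bf16":
        return 16.0
    if mode == "affine8":
        return 8 + 2 * scale_bits / group_size
    if mode == "mxfp8":
        return 8 + 8 / 32
    raise ValueError(f"unknown mode: {mode!r}")


def weight_bytes(n_params: int, mode: str, **kwargs) -> float:
    """Bytes needed to hold `n_params` weights stored in `mode`."""
    return n_params * bits_per_weight(mode, **kwargs) / 8


if __name__ == "__main__":
    # Hypothetical parameter count, suggested only by the "so400m" name.
    n = 400_000_000
    for mode in ("bf16", "affine8", "mxfp8"):
        gib = weight_bytes(n, mode) / 2**30
        print(f"{mode:8s} {bits_per_weight(mode):5.2f} bits/weight  {gib:.2f} GiB")
```

Under these assumptions, `packed` keeps about 8.5 bits per weight in memory, roughly half of bf16. `dequantize at load` keeps the compact file on disk but pays the full 16 bits per weight at runtime.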
@@ -53,10 +53,10 @@ Measured against local bf16 MLX bundles at `512x512` on 12 WALDO crop images.

 | Bundle | Summary cosine mean/min | Spatial cosine mean/min |
 | --- | ---: | ---: |
+| `so400m/8bit-affine` | 0.999907 / 0.999868 | 0.999930 / 0.999876 |
+| `h/8bit-affine` | 0.999899 / 0.999878 | 0.999830 / 0.999764 |
+| `so400m/mxfp8` | 0.989820 / 0.950717 | 0.993502 / 0.977879 |
+| `h/mxfp8` | 0.990217 / 0.974710 | 0.988696 / 0.976071 |

 The 8-bit affine bundles are the recommended compact/high-precision artifacts. The
 `mxfp8` bundles are included for experimentation and are lower precision in these checks.
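The cosine table compares quantized outputs against bf16 references over 12 images. The README does not say exactly how the mean/min columns are aggregated. This sketch assumes one plausible reading:

- **Summary cosine:** one cosine per image.
- **Spatial cosine:** per-token cosines averaged within each image.

In both cases the mean and min are then taken over images. The helper names are hypothetical. Because NumPy has no native bfloat16, real MLX outputs would need to be cast to float32 before conversion.

```python
import numpy as np


def cosine_last_axis(a: np.ndarray, b: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Cosine similarity along the last axis; all leading axes are kept."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    dot = np.sum(a * b, axis=-1)
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return dot / np.maximum(norms, eps)  # eps guards against zero vectors


def summary_cosine(ref, quant):
    """ref, quant: (num_images, dim) summary vectors -> (mean, min) over images."""
    per_image = cosine_last_axis(ref, quant)
    return float(per_image.mean()), float(per_image.min())


def spatial_cosine(ref, quant):
    """ref, quant: (num_images, num_tokens, dim) spatial features.

    Assumption: token cosines are averaged within each image before the
    mean/min over images is taken.
    """
    per_image = cosine_last_axis(ref, quant).mean(axis=-1)
    return float(per_image.mean()), float(per_image.min())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 12 images as in the README; feature sizes are placeholders.
    ref_summary = rng.standard_normal((12, 256))
    ref_spatial = rng.standard_normal((12, 64, 256))
    # Small additive noise stands in for quantization error.
    q_summary = ref_summary + 0.01 * rng.standard_normal(ref_summary.shape)
    q_spatial = ref_spatial + 0.01 * rng.standard_normal(ref_spatial.shape)
    print("summary mean/min:", summary_cosine(ref_summary, q_summary))
    print("spatial mean/min:", spatial_cosine(ref_spatial, q_spatial))
```

Reporting the min alongside the mean exposes a single badly reconstructed image that an average would hide. That matters for `so400m/mxfp8`, whose summary min of 0.950717 sits well below its mean of 0.989820.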
@@ -65,21 +65,16 @@ The 8-bit affine bundles are the recommended compact/high-precision artifacts. T

 Fast-kernel compiled-forward MLX measurements on Apple Silicon at `512x512`, batch 1:

+| Bundle | p50 latency | Throughput |
+| --- | ---: | ---: |
+| `so400m/8bit-affine` | 47.1 ms | 21.2 images/s |
+| `h/8bit-affine` | 58.8 ms | 17.0 images/s |
+| `so400m/mxfp8` | 49.8 ms | 20.1 images/s |
+| `h/mxfp8` | 52.6 ms | 19.0 images/s |
+
+These are packed low-bit runtime measurements. The quantized bundles prioritize compact
+storage and lower runtime weight memory. For latency-sensitive inference, the bf16 bundles
+in the implementation repo remain faster when they fit.

 ## Usage

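At batch 1, throughput is the reciprocal of per-image latency, so each Throughput entry should be about 1000 / p50 in milliseconds. The published rows are consistent with that; for example, 1000 / 47.1 ≈ 21.2. The sketch below shows that arithmetic and a generic median-latency timer. The helper names and the warmup and iteration counts are hypothetical, not the harness that produced these numbers.

```python
import statistics
import time
from typing import Callable


def throughput_from_p50(p50_ms: float, batch_size: int = 1) -> float:
    """Images per second implied by a median per-batch latency in ms."""
    return batch_size * 1000.0 / p50_ms


def time_p50_ms(fn: Callable[[], object], warmup: int = 5, iters: int = 50) -> float:
    """Median wall-clock latency of fn() in milliseconds.

    fn must block until its result is materialized. MLX evaluates lazily,
    so an MLX forward pass would need mx.eval(...) on its output inside fn.
    """
    for _ in range(warmup):  # untimed runs absorb compilation and cache warmup
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)


if __name__ == "__main__":
    # p50 latencies copied from the README table (batch 1).
    published = [
        ("so400m/8bit-affine", 47.1),
        ("h/8bit-affine", 58.8),
        ("so400m/mxfp8", 49.8),
        ("h/mxfp8", 52.6),
    ]
    for name, p50 in published:
        print(f"{name:20s} {throughput_from_p50(p50):5.1f} images/s")
```

Running it prints 21.2, 17.0, 20.1 and 19.0 images/s, which matches the Throughput column. The median is used rather than the mean so that occasional slow iterations do not skew the figure.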
@@ -93,7 +88,6 @@ cradio-mlx embed \
   --image image.jpg \
   --image-size 512 \
   --dtype bfloat16 \
   --save-npz embedding.npz
 ```

@@ -106,7 +100,6 @@ cradio-mlx embed \
   --image image.jpg \
   --image-size 512 \
   --dtype bfloat16 \
   --save-npz embedding.npz
 ```

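The usage commands write their result with `--save-npz embedding.npz`. This diff does not document the array names stored in that archive, so instead of guessing them, this sketch lists every stored array with its shape and dtype using NumPy's `np.load`. The helper `describe_npz` is hypothetical.

```python
import os

import numpy as np


def describe_npz(path: str) -> dict[str, tuple[tuple[int, ...], str]]:
    """Map each array name in an .npz archive to its (shape, dtype)."""
    with np.load(path) as data:  # NpzFile supports the context-manager protocol
        return {name: (data[name].shape, str(data[name].dtype)) for name in data.files}


if __name__ == "__main__":
    path = "embedding.npz"  # file written by the --save-npz option above
    if os.path.exists(path):
        for name, (shape, dtype) in describe_npz(path).items():
            print(f"{name}: shape={shape} dtype={dtype}")
    else:
        print(f"{path} not found; run the embed command first")
```

Once the key names are known, the arrays can be loaded directly with `np.load(path)[name]`.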