Upload README.md with huggingface_hub
README.md CHANGED
@@ -13,7 +13,7 @@ tags:
 - optiq
 ---
 
-# Qwen3.5-2B-
+# Qwen3.5-2B-OptiQ-4bit
 
 > Mixed-precision quantized with OptiQ — sensitivity-driven quantization for Apple Silicon
 
@@ -65,7 +65,7 @@ At the 2B scale, there is a **sharp quality cliff between 4-bit and 3-bit** —
 - **At 3.0 BPW:** The floor bit-width matters enormously. With a 3-bit floor [3,4,8], OptiQ matches uniform 3-bit (6.0%). With a 2-bit floor [2,4], quality collapses to 2.0% — even mixed-precision can't save layers quantized to 2-bit at this scale.
 - **Uniform 2-bit** is essentially random (0.5%).
 
-**Recommendation:** Use this 4.5 BPW model for the best quality-size tradeoff at 2B scale. For smaller models where mixed-precision shows dramatic benefits, see [Qwen3.5-0.8B-
+**Recommendation:** Use this 4.5 BPW model for the best quality-size tradeoff at 2B scale. For smaller models where mixed-precision shows dramatic benefits, see [Qwen3.5-0.8B-OptiQ-4bit](https://huggingface.co/mlx-community/Qwen3.5-0.8B-OptiQ-4bit) where OptiQ more than doubles uniform 4-bit accuracy (27% vs 11.5%).
 
 ## Usage
 
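For readers skimming the BPW numbers in the hunk above: the effective bits-per-weight of a mixed-precision model is just the parameter-weighted average of the per-layer bit-widths. A minimal sketch of that arithmetic (the layer split and parameter counts below are invented for illustration; OptiQ derives the real assignment from its sensitivity analysis):

```python
def average_bpw(assignment):
    """Effective bits per weight for a mixed-precision assignment.

    `assignment` maps a group name to (parameter_count, bit_width).
    """
    total_bits = sum(n * bits for n, bits in assignment.values())
    total_params = sum(n for n, _ in assignment.values())
    return total_bits / total_params

# Hypothetical split: keep the most sensitive 12.5% of weights at 8-bit,
# quantize the rest to 4-bit. These numbers are illustrative only.
mix = {
    "sensitive": (0.25e9, 8),  # 0.25B params at 8-bit
    "rest": (1.75e9, 4),       # 1.75B params at 4-bit
}
print(average_bpw(mix))  # 4.5
```

The same arithmetic shows why a 3.0 BPW target with a [2,4] palette pushes half of the weights down to 2-bit, which is exactly where the README says quality collapses at this scale.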
@@ -74,7 +74,7 @@ This model works with standard `mlx-lm` — no special code needed:
 ```python
 from mlx_lm import load, generate
 
-model, tokenizer = load("mlx-community/Qwen3.5-2B-
+model, tokenizer = load("mlx-community/Qwen3.5-2B-OptiQ-4bit")
 
 prompt = "Explain quantum computing in simple terms:"
 response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
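One note beyond the diff: the snippet above sends a raw completion prompt. For chat-style use, the usual pattern is to run the request through the tokenizer's chat template first. A minimal sketch with the same `mlx-lm` API (repo ID taken from the diff; `apply_chat_template` is the standard Hugging Face tokenizer method, which `mlx-lm`'s tokenizer wrapper exposes):

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen3.5-2B-OptiQ-4bit")

# Format the request with the model's chat template instead of a raw prompt.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=200)
print(response)
```

`generate` returns the completed text as a string, so it can be printed or post-processed directly.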