---
license: apache-2.0
language:
- en
tags:
- diffusion
- llada
- gguf
- diffuse-cpp
base_model: GSAI-ML/LLaDA-8B-Instruct
quantized_by: Carmenest
pipeline_tag: text-generation
---

# LLaDA-8B-Instruct GGUF

GGUF quantized versions of [GSAI-ML/LLaDA-8B-Instruct](https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct) for use with [diffuse-cpp](https://github.com/iafiscal1212/diffuse-cpp).

LLaDA is a **diffusion language model** that generates text by iterative unmasking rather than autoregressive token-by-token prediction.
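For intuition, the unmasking loop can be sketched in a few lines of Python. This is a toy illustration of the idea behind a confidence-based unmasking schedule, not diffuse-cpp's actual code: `toy_model` is a hypothetical stand-in that returns random tokens and confidences instead of running a transformer.

```python
import random

MASK = "<mask>"

def toy_model(seq):
    # Stand-in for the transformer: returns a (token, confidence) guess for
    # every position. A real diffusion LM runs a full forward pass here.
    vocab = ["the", "capital", "of", "France", "is", "Paris", "."]
    return [(random.choice(vocab), random.random()) for _ in seq]

def diffusion_generate(length=8, steps=4):
    seq = [MASK] * length          # start from a fully masked sequence
    per_step = length // steps     # how many positions to commit each step
    for _ in range(steps):
        preds = toy_model(seq)
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        # Commit the positions the model is most confident about;
        # the rest stay masked and are revisited in later steps.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            seq[i] = preds[i][0]
    return seq

print(" ".join(diffusion_generate()))
```

Every step refines all masked positions in parallel, which is what schedulers like `low_confidence` and `entropy_exit` exploit for throughput.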
## Available Quantizations

| File | Quant | Size | Description |
|------|-------|------|-------------|
| llada-8b-q4km.gguf | Q4_K_M | 5.1 GB | **Recommended**; best throughput |
| llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
| llada-8b-f16.gguf | F16 | 14.9 GB | Full-precision reference |
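To pick a file programmatically, the sizes above can be treated as a rough lower bound on required memory. A sketch (the helper name is ours, and real memory use is somewhat higher once context buffers are added):

```python
# File sizes from the table above, in GB.
QUANTS = [
    ("llada-8b-q4km.gguf", 5.1),
    ("llada-8b-q8_0.gguf", 8.4),
    ("llada-8b-f16.gguf", 14.9),
]

def pick_quant(ram_gb):
    # Largest (highest-quality) file that fits the budget; fall back to the
    # smallest quant if nothing does.
    fitting = [name for name, size in QUANTS if size <= ram_gb]
    return fitting[-1] if fitting else QUANTS[0][0]

print(pick_quant(16))  # llada-8b-f16.gguf
print(pick_quant(6))   # llada-8b-q4km.gguf
```

The returned filename can then be fetched with `huggingface_hub.hf_hub_download` or `huggingface-cli download`.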
## Benchmark (24-core Xeon, 64 tokens)

| Model | Scheduler | tok/s | Speedup vs F16 |
|-------|-----------|-------|----------------|
| F16 | low_confidence | 1.64 | 1.00x |
| F16 | entropy_exit | 8.74 | 5.32x |
| Q8_0 | low_confidence | 1.86 | 1.13x |
| Q8_0 | entropy_exit | 10.09 | 6.14x |
| Q4_K_M | low_confidence | 2.48 | 1.51x |
| Q4_K_M | entropy_exit | **13.59** | **8.27x** |
**Q4_K_M + entropy_exit = 13.59 tok/s** (1.6x llama.cpp on the same hardware).
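The speedup column is each configuration's tok/s divided by the F16 + low_confidence baseline (1.64 tok/s). Recomputing from the rounded tok/s values reproduces the table to within a couple of hundredths (the published column was presumably derived from unrounded timings):

```python
BASELINE = 1.64  # F16 + low_confidence, tok/s

results = [
    ("F16", "entropy_exit", 8.74),
    ("Q8_0", "low_confidence", 1.86),
    ("Q8_0", "entropy_exit", 10.09),
    ("Q4_K_M", "low_confidence", 2.48),
    ("Q4_K_M", "entropy_exit", 13.59),
]

for model, sched, toks in results:
    print(f"{model:7s} {sched:15s} {toks / BASELINE:.2f}x")
```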
## Usage
```bash
# Build diffuse-cpp from source
git clone https://github.com/iafiscal1212/diffuse-cpp
cd diffuse-cpp && mkdir build && cd build && cmake .. && make -j

# Run inference with the Q4_K_M quant
./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12
```
|