Carmenest committed on
Commit deb1e7f · verified · 1 Parent(s): 6ce1b66

Update benchmarks with rigorous real-prompt results (buffer 1.5x fix)

Files changed (1)
  1. README.md +39 -13
README.md CHANGED
@@ -26,23 +26,49 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
  | llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
  | llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |

- ## Benchmark (24-core Xeon, 64 tokens)

- | Model | Scheduler | tok/s | Speedup vs F16 |
- |-------|-----------|-------|----------------|
- | F16 | low_confidence | 1.64 | 1.00x |
- | F16 | entropy_exit | 8.74 | 5.32x |
- | Q8_0 | low_confidence | 1.86 | 1.13x |
- | Q8_0 | entropy_exit | 10.09 | 6.14x |
- | Q4_K_M | low_confidence | 2.48 | 1.51x |
- | Q4_K_M | entropy_exit | **13.59** | **8.27x** |

- **Q4_K_M + entropy_exit = 13.59 tok/s** (1.6x llama.cpp on same hardware)

  ## Usage

  ```bash
- git clone https://github.com/iafiscal1212/diffuse-cpp
- cd diffuse-cpp && mkdir build && cd build && cmake .. && make -j
- ./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12
  ```
 
+ ## Benchmark (AMD EPYC 4465P 12-Core, 64 tokens, steps=16, threads=12)
+
+ ### Real Prompt Performance (Q4_K_M + entropy_exit)
+
+ | Prompt type | tok/s | Steps used | Speedup |
+ |---|---|---|---|
+ | Factual ("Capital of France?") | **9.22** | 4 | 3.9x |
+ | Translation ("Translate to French") | **10.23** | 3 | 4.6x |
+ | Arithmetic ("15 x 23?") | **11.49** | 3 | 5.5x |
+ | Code (is_prime function) | **2.53** | 15 | 1.1x |
+ | Creative (poem, explanation) | 2.33 | 17 | 1.0x |
+
+ entropy_exit adapts to prompt difficulty: 3-4 steps for easy prompts, 15-17 for hard ones, and it is never slower than the baseline.
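The adaptive behavior described above can be sketched as follows. This is an illustrative sketch only, not the repo's actual scheduler: `mean_entropy`, `generate_with_entropy_exit`, and the 0.1-nat threshold are hypothetical names and values chosen for the example.

```python
import math

def mean_entropy(dists):
    """Mean Shannon entropy (nats) over the still-masked positions."""
    h = 0.0
    for probs in dists:
        h += -sum(p * math.log(p) for p in probs if p > 0)
    return h / len(dists)

def generate_with_entropy_exit(step_fn, max_steps=16, threshold=0.1):
    """Run up to max_steps unmasking steps, stopping early once the
    model's mean entropy over masked positions falls below threshold."""
    steps_used = 0
    for _ in range(max_steps):
        dists = step_fn()  # one denoising step -> per-position distributions
        steps_used += 1
        if mean_entropy(dists) < threshold:
            break  # model is confident enough: exit early
    return steps_used

# An "easy" prompt: confidence rises quickly, so generation stops early.
easy = iter([[[0.5, 0.5]], [[0.8, 0.2]], [[0.99, 0.01]], [[0.999, 0.001]]])
print(generate_with_entropy_exit(lambda: next(easy)))  # 3
```

On a "hard" prompt whose entropy never drops below the threshold, the loop simply runs all `max_steps` steps, which is why the scheduler is never slower than the fixed-step baseline.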
+
+ ### Quantization Comparison (low_confidence baseline)
+
+ | Model | Size | tok/s | vs F16 |
+ |-------|------|-------|--------|
+ | F16 | 14.9 GB | 1.64 | 1.00x |
+ | Q8_0 | 8.4 GB | 1.84 | 1.12x |
+ | Q4_K_M | 5.1 GB | 2.52 | 1.54x |
+
+ ### Summary
+
+ - **~10 tok/s on easy real prompts** (Q4_K_M + entropy_exit)
+ - **~6x faster than F16 baseline** on factual/translation tasks
+ - **7.5x thread scaling** from 1 to 12 threads
+ - **40+ tok/s peak** on synthetic benchmarks (single forward pass)
+
+ Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
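The `vs F16` column in the quantization table is simply each model's throughput divided by the F16 reference, which the rows above bear out:

```python
F16_TPS = 1.64  # tok/s of the F16 reference row

for name, tps in [("Q8_0", 1.84), ("Q4_K_M", 2.52)]:
    print(f"{name}: {tps / F16_TPS:.2f}x")  # Q8_0: 1.12x, Q4_K_M: 1.54x
```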

  ## Usage

  ```bash
+ git clone --recursive https://github.com/iafiscal1212/diffuse-cpp
+ cd diffuse-cpp
+ cmake -B build -DCMAKE_BUILD_TYPE=Release
+ cmake --build build -j$(nproc)
+
+ # Generate with entropy_exit (recommended)
+ python tools/generate.py \
+     --model-dir /path/to/LLaDA-8B-Instruct \
+     --gguf llada-8b-q4km.gguf \
+     -p "What is the capital of France?" \
+     -s 16 -t 12 --remasking entropy_exit
  ```
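For reproducing the tok/s figures above, throughput is just generated tokens over wall-clock time. A minimal helper along these lines could wrap any generation call; `tokens_per_second` is a hypothetical name, not part of the repo:

```python
import time

def tokens_per_second(generate, n_tokens):
    """Time one generation call and return its throughput in tok/s."""
    start = time.perf_counter()
    generate()  # any callable that produces n_tokens tokens
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

For example, a run that emits 64 tokens in 6.4 seconds would report 10 tok/s.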