Carmenest committed on
Commit deb1e7f · verified · 1 Parent(s): 6ce1b66

Update benchmarks with rigorous real-prompt results (buffer 1.5x fix)

Files changed (1)
  1. README.md +39 -13
README.md CHANGED
@@ -26,23 +26,49 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
  | llada-8b-q8_0.gguf | Q8_0 | 8.4 GB | High quality, good throughput |
  | llada-8b-f16.gguf | F16 | 14.9 GB | Full precision reference |

- ## Benchmark (24-core Xeon, 64 tokens)

- | Model | Scheduler | tok/s | Speedup vs F16 |
- |-------|-----------|-------|----------------|
- | F16 | low_confidence | 1.64 | 1.00x |
- | F16 | entropy_exit | 8.74 | 5.32x |
- | Q8_0 | low_confidence | 1.86 | 1.13x |
- | Q8_0 | entropy_exit | 10.09 | 6.14x |
- | Q4_K_M | low_confidence | 2.48 | 1.51x |
- | Q4_K_M | entropy_exit | **13.59** | **8.27x** |

- **Q4_K_M + entropy_exit = 13.59 tok/s** (1.6x llama.cpp on same hardware)

  ## Usage

  ```bash
- git clone https://github.com/iafiscal1212/diffuse-cpp
- cd diffuse-cpp && mkdir build && cd build && cmake .. && make -j
- ./diffuse-cli -m llada-8b-q4km.gguf -p "What is the capital of France?" -s 16 -t 12
  ```
 
+ ## Benchmark (AMD EPYC 4465P 12-Core, 64 tokens, steps=16, threads=12)
+
+ ### Real Prompt Performance (Q4_K_M + entropy_exit)
+
+ | Prompt type | tok/s | Steps used | Speedup |
+ |---|---|---|---|
+ | Factual ("Capital of France?") | **9.22** | 4 | 3.9x |
+ | Translation ("Translate to French") | **10.23** | 3 | 4.6x |
+ | Arithmetic ("15 x 23?") | **11.49** | 3 | 5.5x |
+ | Code (is_prime function) | **2.53** | 15 | 1.1x |
+ | Creative (poem, explanation) | 2.33 | 17 | 1.0x |
+
+ entropy_exit adapts to prompt difficulty: 3-4 steps for easy prompts, 15-17 for hard ones, and it is never slower than the baseline.
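The adaptive behavior described above can be sketched as follows. This is an illustrative sketch only, not the repo's actual scheduler: `mean_entropy`, `generate_with_entropy_exit`, and the 0.1-nat threshold are hypothetical names and values chosen for the example.

```python
import math

def mean_entropy(dists):
    """Mean Shannon entropy (nats) over the still-masked positions."""
    h = 0.0
    for probs in dists:
        h += -sum(p * math.log(p) for p in probs if p > 0)
    return h / len(dists)

def generate_with_entropy_exit(step_fn, max_steps=16, threshold=0.1):
    """Run up to max_steps unmasking steps, stopping early once the
    model's mean entropy over masked positions falls below threshold."""
    steps_used = 0
    for _ in range(max_steps):
        dists = step_fn()  # one denoising step -> per-position distributions
        steps_used += 1
        if mean_entropy(dists) < threshold:
            break  # model is confident enough: exit early
    return steps_used

# An "easy" prompt: confidence rises quickly, so generation stops early.
easy = iter([[[0.5, 0.5]], [[0.8, 0.2]], [[0.99, 0.01]], [[0.999, 0.001]]])
print(generate_with_entropy_exit(lambda: next(easy)))  # 3
```

On a "hard" prompt whose entropy never drops below the threshold, the loop simply runs all `max_steps` steps, which is why the scheduler is never slower than the fixed-step baseline.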
+
+ ### Quantization Comparison (low_confidence baseline)
+
+ | Model | Size | tok/s | vs F16 |
+ |-------|------|-------|--------|
+ | F16 | 14.9 GB | 1.64 | 1.00x |
+ | Q8_0 | 8.4 GB | 1.84 | 1.12x |
+ | Q4_K_M | 5.1 GB | 2.52 | 1.54x |
+
+ ### Summary
+
+ - **~10 tok/s on easy real prompts** (Q4_K_M + entropy_exit)
+ - **~6x faster than F16 baseline** on factual/translation tasks
+ - **7.5x thread scaling** from 1 to 12 threads
+ - **40+ tok/s peak** on synthetic benchmarks (single forward pass)
+
+ Full results: [research/benchmark/RESULTS.md](https://github.com/iafiscal1212/diffuse-cpp/blob/main/research/benchmark/RESULTS.md)
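The `vs F16` column in the quantization table is simply each model's throughput divided by the F16 reference, which the rows above bear out:

```python
F16_TPS = 1.64  # tok/s of the F16 reference row

for name, tps in [("Q8_0", 1.84), ("Q4_K_M", 2.52)]:
    print(f"{name}: {tps / F16_TPS:.2f}x")  # Q8_0: 1.12x, Q4_K_M: 1.54x
```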

  ## Usage

  ```bash
+ git clone --recursive https://github.com/iafiscal1212/diffuse-cpp
+ cd diffuse-cpp
+ cmake -B build -DCMAKE_BUILD_TYPE=Release
+ cmake --build build -j$(nproc)
+
+ # Generate with entropy_exit (recommended)
+ python tools/generate.py \
+     --model-dir /path/to/LLaDA-8B-Instruct \
+     --gguf llada-8b-q4km.gguf \
+     -p "What is the capital of France?" \
+     -s 16 -t 12 --remasking entropy_exit
  ```
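For reproducing the tok/s figures above, throughput is just generated tokens over wall-clock time. A minimal helper along these lines could wrap any generation call; `tokens_per_second` is a hypothetical name, not part of the repo:

```python
import time

def tokens_per_second(generate, n_tokens):
    """Time one generation call and return its throughput in tok/s."""
    start = time.perf_counter()
    generate()  # any callable that produces n_tokens tokens
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed
```

For example, a run that emits 64 tokens in 6.4 seconds would report 10 tok/s.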