Carmenest committed on
Commit 9dc8866 · verified · 1 Parent(s): e00b07f

Update model card with inter-step cache benchmark results (v0.2.0)

Files changed (1)
  1. README.md +18 -17
README.md CHANGED
@@ -30,22 +30,21 @@ LLaDA is a **diffusion language model** that generates text by iterative unmasking
 
 ## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
 
-### Real Prompt Performance (Q4_K_M + entropy_exit)
+### Real Prompt Performance (Q4_K_M + entropy_exit + inter-step cache, B=256)
 
-| Prompt | B=64 tok/s | B=256 tok/s | Steps | vs llama.cpp |
+| Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
 |---|---|---|---|---|
-| Capital of France? | 9.22 | **15.60** | 4 | 1.8x |
-| Translate to French | 10.23 | **21.78** | 3 | 2.6x |
-| 15 × 23? | 11.49 | **11.45** | 5 | 1.3x |
-| Translate to Spanish | 4.59 | **7.17** | 8 | 0.8x |
-| Python is_prime() | 2.53 | **3.12** | 17 | 0.4x |
-| Poem about ocean | 2.33 | **3.10** | 17 | 0.4x |
-| Why is sky blue? | 2.21 | **3.18** | 17 | 0.4x |
-| List the planets | 2.33 | **3.19** | 17 | 0.4x |
-
-*B = generation buffer size (tokens generated per call). llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware).*
-
-entropy_exit adapts to prompt difficulty: 3–4 steps for easy, 16 for hard. Never slower than baseline.
+| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
+| Translate to French | 25.9 | **27.7** | 2 | 3.3x |
+| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
+| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
+| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
+| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
+| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
+| List the planets | 3.3 | **9.4** | 15 | 1.1x |
+| **Average** | **9.6** | **15.3** | | **1.8x** |
+
+*llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware). Cache enabled by default. 6 of 8 prompts outperform llama.cpp; the remaining 2 (code generation, creative writing) stay slower because they require all 16 steps.*
 
 ### Quantization Comparison (low_confidence baseline, B=64)
 
@@ -57,8 +56,10 @@ entropy_exit adapts to prompt difficulty: 3–4 steps for easy, 16 for hard. Never slower than baseline.
 
 ### Summary
 
-- **11–22 tok/s on easy real prompts** (Q4_K_M + entropy_exit, B=256)
-- **Up to 2.6x faster than llama.cpp** on the same hardware
+- **15–28 tok/s on easy real prompts** (Q4_K_M + entropy_exit + inter-step cache, B=256)
+- **Up to 3.3x faster than llama.cpp** on the same hardware
+- **Inter-step KV cache**: 1.6x average speedup with no quality degradation
+- **6 of 8 real prompts outperform llama.cpp** (vs 3 of 8 without cache)
 - **256-token generation** with 20% lower per-token cost vs 64-token batches
 - **7.5x thread scaling** from 1 to 12 threads
 
@@ -72,7 +73,7 @@ cd diffuse-cpp
 cmake -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
 
-# Generate with entropy_exit (recommended)
+# Generate with entropy_exit + cache (recommended, cache is ON by default)
 python tools/generate.py \
   --model-dir /path/to/LLaDA-8B-Instruct \
   --gguf llada-8b-q4km.gguf \
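
The entropy_exit behavior the diff describes (2–4 steps for easy prompts, all 16 for hard ones) amounts to an early-exit check on the model's per-token uncertainty. The sketch below is illustrative only, not the project's actual implementation: the function name, the threshold value, and the surrounding loop shape are assumptions.

```python
import numpy as np

def entropy_exit(step_logits, threshold=0.35):
    """Decide whether to stop the diffusion loop early.

    step_logits: (positions, vocab) logits for the still-masked positions
    at the current step. Returns True when the mean per-position entropy
    falls below `threshold`, i.e. the model is already confident.
    The threshold value here is an illustrative placeholder.
    """
    # numerically stable softmax over the vocabulary axis
    z = step_logits - step_logits.max(axis=-1, keepdims=True)
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    # Shannon entropy (nats) per masked position
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return float(ent.mean()) < threshold

# Hypothetical use inside a fixed-budget diffusion loop:
# for step in range(16):
#     logits = model(tokens)                      # assumed forward call
#     tokens = unmask_most_confident(tokens, logits)
#     if entropy_exit(logits):
#         break                                   # easy prompts exit early
```

Peaked distributions (easy prompts) trip the exit; near-uniform ones (hard prompts) keep the loop running to its full step budget.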
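
The inter-step cache this commit benchmarks exploits the fact that, between consecutive diffusion steps, most positions keep the same token, so their per-position work can be reused. The toy sketch below illustrates the bookkeeping with a stand-in `encode_fn`; the real project caches transformer K/V tensors, where reuse is approximate because attention mixes positions. All names here are hypothetical.

```python
import numpy as np

def cached_forward(tokens, prev_tokens, cache, encode_fn):
    """Recompute only positions whose token changed since the last step.

    tokens:      current token ids (one per position)
    prev_tokens: token ids from the previous diffusion step, or None
    cache:       dict position -> cached vector, mutated in place
    encode_fn:   stand-in for the expensive per-position computation
    """
    out = []
    for i, tok in enumerate(tokens):
        if prev_tokens is not None and tok == prev_tokens[i] and i in cache:
            out.append(cache[i])        # cache hit: token unchanged, reuse
        else:
            vec = encode_fn(tok)        # cache miss: recompute and store
            cache[i] = vec
            out.append(vec)
    return np.stack(out)
```

With only one token unmasked per step, each step pays for one recomputation instead of a full pass, which is where the reported per-step speedup would come from.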