Carmenest committed on Mar 22

Commit 9dc8866 (verified) · 1 Parent(s): e00b07f

Update model card with inter-step cache benchmark results (v0.2.0)

Files changed (1)

README.md +18 -17

@@ -30,22 +30,21 @@ LLaDA is a **diffusion language model** that generates text by iterative unmaski
 ## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
-### Real Prompt Performance (Q4_K_M + entropy_exit)
-| Prompt | B=64 tok/s | B=256 tok/s | Steps | vs llama.cpp |
 |---|---|---|---|---|
-| Capital of France? | 9.22 | **15.60** | 4 | 1.8x |
-| Translate to French | 10.23 | **21.78** | 3 | 2.6x |
-| 15 × 23? | 11.49 | **11.45** | 5 | 1.3x |
-| Translate to Spanish | 4.59 | **7.17** | 8 | 0.8x |
-| Python is_prime() | 2.53 | **3.12** | 17 | 0.4x |
-| Poem about ocean | 2.33 | **3.10** | 17 | 0.4x |
-| Why is sky blue? | 2.21 | **3.18** | 17 | 0.4x |
-| List the planets | 2.33 | **3.19** | 17 | 0.4x |
-*B = generation buffer size (tokens generated per call). llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware).*
-entropy_exit adapts to prompt difficulty: 3–4 steps for easy, 16 for hard. Never slower than baseline.
 ### Quantization Comparison (low_confidence baseline, B=64)
@@ -57,8 +56,10 @@ entropy_exit adapts to prompt difficulty: 3–4 steps for easy, 16 for hard. Nev
 ### Summary
-- **11–22 tok/s on easy real prompts** (Q4_K_M + entropy_exit, B=256)
-- **Up to 2.6x faster than llama.cpp** on the same hardware
 - **256-token generation** with 20% lower per-token cost vs 64-token batches
 - **7.5x thread scaling** from 1 to 12 threads
@@ -72,7 +73,7 @@ cd diffuse-cpp
 cmake -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
-# Generate with entropy_exit (recommended)
 python tools/generate.py \
     --model-dir /path/to/LLaDA-8B-Instruct \
     --gguf llada-8b-q4km.gguf \

 ## Benchmark (AMD EPYC 4465P 12-Core, steps=16, threads=12)
+### Real Prompt Performance (Q4_K_M + entropy_exit + inter-step cache, B=256)
+| Prompt | No-Cache tok/s | Cache tok/s | Steps | vs llama.cpp |
 |---|---|---|---|---|
+| Capital of France? | 17.5 | **24.4** | 3 | 2.9x |
+| Translate to French | 25.9 | **27.7** | 2 | 3.3x |
+| 15 x 23? | 12.8 | **15.7** | 4 | 1.8x |
+| Translate to Spanish | 7.6 | **22.9** | 7 | 2.7x |
+| Python is_prime() | 3.2 | **4.9** | 16 | 0.6x |
+| Poem about ocean | 3.2 | **5.3** | 16 | 0.6x |
+| Why is sky blue? | 3.3 | **12.0** | 16 | 1.4x |
+| List the planets | 3.3 | **9.4** | 15 | 1.1x |
+| **Average** | **9.6** | **15.3** | | **1.8x** |
+*llama.cpp baseline: 8.51 tok/s (Llama-3-8B Q4_K_M, same hardware). Cache enabled by default. 6 of 8 prompts outperform llama.cpp; 2 (code generation, creative writing) remain slower due to requiring all 16 steps.*
 ### Quantization Comparison (low_confidence baseline, B=64)
 ### Summary
+- **15-28 tok/s on easy real prompts** (Q4_K_M + entropy_exit + inter-step cache, B=256)
+- **Up to 3.2x faster than llama.cpp** on the same hardware
+- **Inter-step KV cache**: 1.6x average speedup with no quality degradation
+- **6 of 8 real prompts outperform llama.cpp** (vs 3 of 8 without cache)
 - **256-token generation** with 20% lower per-token cost vs 64-token batches
 - **7.5x thread scaling** from 1 to 12 threads
 cmake -B build -DCMAKE_BUILD_TYPE=Release
 cmake --build build -j$(nproc)
+# Generate with entropy_exit + cache (recommended, cache is ON by default)
 python tools/generate.py \
     --model-dir /path/to/LLaDA-8B-Instruct \
     --gguf llada-8b-q4km.gguf \
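
The `vs llama.cpp` column and the Summary figures in the new table can be re-derived from the raw tok/s numbers. The sketch below is a reader-side check, not part of the commit. It assumes each ratio is simply the measured tok/s divided by the 8.51 tok/s llama.cpp baseline, rounded to one decimal place; the variable and function names are illustrative.

```python
# Reader-side sanity check of the new benchmark table (not part of the commit).
# Assumption: "vs llama.cpp" = tok/s divided by the baseline, rounded to 1 decimal.

BASELINE_TOK_S = 8.51  # llama.cpp, Llama-3-8B Q4_K_M, same hardware (from the table note)

# prompt -> (no-cache tok/s, cache tok/s), copied from the "+" rows of the diff
results = {
    "Capital of France?": (17.5, 24.4),
    "Translate to French": (25.9, 27.7),
    "15 x 23?": (12.8, 15.7),
    "Translate to Spanish": (7.6, 22.9),
    "Python is_prime()": (3.2, 4.9),
    "Poem about ocean": (3.2, 5.3),
    "Why is sky blue?": (3.3, 12.0),
    "List the planets": (3.3, 9.4),
}


def ratio(tok_s, baseline=BASELINE_TOK_S):
    """Speed relative to the baseline, rounded the way the table appears to be."""
    return round(tok_s / baseline, 1)


# Per-prompt ratio for the cached runs (the "vs llama.cpp" column)
cache_ratios = {prompt: ratio(cache) for prompt, (_, cache) in results.items()}

# Averages behind the "Average" row
avg_no_cache = sum(no_cache for no_cache, _ in results.values()) / len(results)
avg_cache = sum(cache for _, cache in results.values()) / len(results)

# Average speedup attributed to the inter-step cache
cache_speedup = avg_cache / avg_no_cache

# Number of prompts that beat the llama.cpp baseline, with and without the cache
faster_with_cache = sum(cache > BASELINE_TOK_S for _, cache in results.values())
faster_without_cache = sum(no_cache > BASELINE_TOK_S for no_cache, _ in results.values())

for prompt, r in cache_ratios.items():
    print(f"{prompt:22s} {r}x")
print(f"average: {avg_no_cache:.1f} -> {avg_cache:.1f} tok/s "
      f"({cache_speedup:.1f}x from cache, {ratio(avg_cache)}x vs llama.cpp)")
print(f"faster than llama.cpp: {faster_with_cache} of 8 with cache, "
      f"{faster_without_cache} of 8 without")
```

Under that assumption the script reproduces every row of the table, the 9.6 and 15.3 tok/s averages, the 1.6x cache speedup, and the 6-of-8 versus 3-of-8 counts. The largest ratio is 27.7 / 8.51 ≈ 3.255, which rounds to the 3.3x shown in the table row; the "Up to 3.2x" in the Summary bullet matches only if that value is truncated rather than rounded.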