LLaDA-8B-Quantized
World's first INT8 and INT4 weight-only quantization for LLaDA, a masked diffusion large language model trained from scratch at 8B scale.
Code & full documentation: github.com/qubitronlabsdev/llada-quantization
Model Description
LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens in parallel via iterative masked denoising, unlike autoregressive models (GPT, LLaMA) that generate one token at a time.
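To make "iterative masked denoising" concrete, here is a toy sketch in plain Python. It is not LLaDA's sampler: `toy_predict` is a hypothetical stand-in that returns random confidences, and the schedule (reveal an even share of masked positions per step, most confident first) is an assumption chosen for illustration.

```python
import random

MASK = None  # placeholder standing in for a mask token


def toy_predict(seq, rng):
    """Hypothetical stand-in for the model: (token, confidence) per masked slot."""
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(seq) if t is MASK}


def toy_denoise(length, steps, seed=0):
    """Start fully masked and unmask several positions per step."""
    rng = random.Random(seed)
    seq = [MASK] * length
    for step in range(steps):
        proposals = toy_predict(seq, rng)  # every masked slot is predicted at once
        if not proposals:
            break
        remaining_steps = steps - step
        k = -(-len(proposals) // remaining_steps)  # ceiling division
        # Keep the k most confident proposals; the rest stay masked for later steps.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:k]
        for i in best:
            seq[i] = proposals[i][0]
    return seq


print(toy_denoise(length=8, steps=4))
```

The point of the sketch is the loop shape: every step predicts all masked positions together, so the number of model calls is set by `steps` rather than by the sequence length.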
This repository provides two post-training quantized variants of GSAI-ML/LLaDA-8B-Instruct:
| File | Quantization | Size | Memory Saved | Speed (A100) |
|---|---|---|---|---|
| llada_int8_quantized.pt | INT8 per-row | 8.54 GB | 47% | 9.64 tok/s |
| llada_int4_quantized.pt | INT4 packed | 4.79 GB | 70% | 3.39 tok/s |
Original model (bfloat16): 16.13 GB
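The "Memory Saved" figures follow directly from the file sizes. A short check, assuming the percentage means one minus the ratio of quantized size to the bfloat16 size (the helper name `memory_saved` is only for illustration):

```python
BASELINE_GB = 16.13  # bfloat16 checkpoint size stated above


def memory_saved(size_gb, baseline_gb=BASELINE_GB):
    """Fraction of the baseline size that the quantized file saves."""
    return 1.0 - size_gb / baseline_gb


for name, size in [("INT8", 8.54), ("INT4", 4.79)]:
    print(f"{name}: {memory_saved(size):.0%} saved")
# INT8: 47% saved
# INT4: 70% saved
```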
How It Works
All nn.Linear layers are replaced with custom quantized layers:
- INT8: weights scaled per-row to [-127, 127] integers. Scale factors stored in float32. ~1 byte per weight.
- INT4: weights scaled per-row to [-8, 7] integers. Two values packed per byte (uint8). ~0.5 bytes per weight.
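The two schemes can be sketched in plain Python on a single weight row. This is an illustration of the arithmetic, not the repository's implementation: the function names, the symmetric scale `max_abs / qmax`, and the low-nibble-first packing order are all assumptions.

```python
def quantize_row(row, qmin, qmax):
    """Symmetric per-row quantization: returns (integer codes, float scale)."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in row]
    return codes, scale


def pack_int4(codes):
    """Pack pairs of signed 4-bit codes (-8..7) into bytes, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad odd-length rows with a zero code
    return bytes(((hi & 0xF) << 4) | (lo & 0xF)
                 for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit codes."""
    codes = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            codes.append(nibble - 16 if nibble >= 8 else nibble)
    return codes[:count]


def dequantize_row(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


row = [0.5, -1.0, 0.25, 0.0]
q8, s8 = quantize_row(row, -127, 127)   # one integer per weight
q4, s4 = quantize_row(row, -8, 7)
packed = pack_int4(q4)                  # two weights per byte
assert unpack_int4(packed, len(q4)) == q4
print(len(q8), len(packed))             # 4 codes vs 2 bytes for the same row
```

With round-to-nearest and no clipping, the reconstruction error per weight is at most half the row's scale, which is why the coarser INT4 grid loses more precision than INT8.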
Both variants dequantize weights on-the-fly during the forward pass. No changes to model architecture or generation logic.
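A minimal sketch of on-the-fly dequantization, again in plain Python rather than PyTorch. `QuantLinearSketch` is a hypothetical class, not the repository's layer: it keeps only INT8 codes and one scale per output row, and applies the scale inside `forward` instead of materializing a float weight matrix.

```python
class QuantLinearSketch:
    """Toy linear layer storing INT8 codes plus one float scale per output row."""

    def __init__(self, weight, bias=None):
        self.codes, self.scales = [], []
        for row in weight:
            max_abs = max(abs(w) for w in row)
            scale = max_abs / 127 if max_abs > 0 else 1.0
            self.codes.append([round(w / scale) for w in row])
            self.scales.append(scale)
        self.bias = bias if bias is not None else [0.0] * len(weight)

    def forward(self, x):
        # y_i = scale_i * sum_j(code_ij * x_j) + b_i
        # The scale is applied per row at call time; no float weights are stored.
        return [
            s * sum(c * xj for c, xj in zip(row, x)) + b
            for row, s, b in zip(self.codes, self.scales, self.bias)
        ]


layer = QuantLinearSketch([[1.0, -0.5], [0.25, 0.75]])
print(layer.forward([2.0, 4.0]))  # close to the exact result [0.0, 3.5]
```

Because the layer exposes the same input and output shapes as the float layer it replaces, the rest of the model can stay unchanged, which matches the statement above.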
Usage
Installation
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
Load and Generate
from inference import load_quantized, generate
from transformers import AutoTokenizer
# The tokenizer comes from the base model, not from this repo.
tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True,
)
# Download the weights from this repo first, then load ONE of the variants below.
# Option A: INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda",
)
# Option B: INT4 (if both calls run, this one replaces the INT8 model)
model = load_quantized(
    "llada_int4_quantized.pt",
    mode="int4",
    device="cuda",
)
output = generate(model, tokenizer, "What is machine learning?")
print(output)
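The download step mentioned in the comments can be scripted with `hf_hub_download` from the `huggingface_hub` package. This is a sketch under assumptions: `download_weights` and `WEIGHT_FILES` are hypothetical names, and the files are assumed to sit at the root of the `qubitron/LLaDA-8B-Quantized` repo under the names listed in the table above.

```python
WEIGHT_FILES = {
    "int8": "llada_int8_quantized.pt",
    "int4": "llada_int4_quantized.pt",
}


def download_weights(mode, repo_id="qubitron/LLaDA-8B-Quantized"):
    """Fetch one checkpoint from the Hub and return its local path."""
    if mode not in WEIGHT_FILES:
        raise ValueError(f"mode must be one of {sorted(WEIGHT_FILES)}, got {mode!r}")
    # Imported lazily so the helper can be defined without the dependency installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename=WEIGHT_FILES[mode])
```

The returned path can then be passed to `load_quantized` in place of the bare file name, for example `load_quantized(download_weights("int8"), mode="int8", device="cuda")`.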
Quantize from Scratch
from quantize import run_and_save
run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
Hardware Requirements
| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |
Tested on: NVIDIA A100 80GB, NVIDIA H100
Limitations
- INT4 introduces slightly more quantization error than INT8
- Generation speed depends on sequence length and number of diffusion steps
- English only (inherited from base model)
Citation
If you use this work, please cite:
@misc{llada-quantization-2026,
title = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
author = {Dhiraj Choudhary},
year = {2026},
url = {https://github.com/qubitronlabsdev/llada-quantization}
}
Original LLaDA paper:
@article{nie2025large,
title = {Large Language Diffusion Models},
author = {Nie, Shen and others},
year = {2025},
url = {https://arxiv.org/abs/2502.09992}
}
License
Apache 2.0, same as the original LLaDA model.