---
license: apache-2.0
language:
- en
base_model:
- GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
tags:
- diffusion-language-model
- quantization
library_name: transformers
---

# LLaDA-8B-Quantized

**World's first INT8 and INT4 weight-only quantization for [LLaDA](https://github.com/ML-GSAI/LLaDA), a masked diffusion large language model trained from scratch at 8B scale.**

> Code & full documentation: [github.com/qubitronlabsdev/llada-quantization](https://github.com/qubitronlabsdev/llada-quantization)

---

## Model Description

LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens **in parallel** via iterative masked denoising, unlike autoregressive models (GPT, LLaMA), which generate one token at a time.
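
To make "iterative masked denoising" concrete, here is a toy sketch in plain Python. It is not LLaDA's sampler: `toy_predict` is a hypothetical stand-in that reads a known answer and invents a confidence score (the word length), whereas a real model predicts tokens from context. The sketch only illustrates the control flow, where every position starts masked and a batch of positions is committed at each step, in any order.

```python
MASK = "<mask>"


def toy_predict(sequence, answer):
    """Stand-in for the model: a (token, confidence) guess for every masked slot.

    A real model would predict from context. This toy reads the known answer
    and uses the word length as an arbitrary, deterministic confidence.
    """
    return {
        i: (answer[i], float(len(answer[i])))
        for i, token in enumerate(sequence)
        if token == MASK
    }


def denoise(answer, steps):
    """Start fully masked and commit a batch of positions at every step."""
    sequence = [MASK] * len(answer)
    per_step = -(-len(answer) // steps)  # ceiling division
    history = [list(sequence)]
    for _ in range(steps):
        guesses = toy_predict(sequence, answer)
        # Keep the most confident guesses; the rest stay masked for later steps.
        keep = sorted(guesses, key=lambda i: guesses[i][1], reverse=True)[:per_step]
        for i in keep:
            sequence[i] = guesses[i][0]
        history.append(list(sequence))
    return history


for state in denoise(["the", "diffusion", "of", "tokens"], steps=2):
    print(" ".join(state))
```

Each printed line is the sequence after one step. Positions are filled by confidence rather than left to right, which is the contrast with autoregressive decoding drawn above.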

This repository provides two post-training quantized variants of `GSAI-ML/LLaDA-8B-Instruct`:

| File | Quantization | Size | Memory Saved | Speed (A100) |
|---|---|---|---|---|
| `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | **47%** | **9.64 tok/s** |
| `llada_int4_quantized.pt` | INT4 packed | 4.79 GB | **70%** | 3.39 tok/s |

Original model (bfloat16): 16.13 GB
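
The "Memory Saved" column follows directly from the file sizes. The short check below recomputes it from the numbers in the table; the function name `memory_saved` is only illustrative.

```python
# Sizes in GB, copied from the table above.
BF16_GB = 16.13
VARIANTS = {"int8": 8.54, "int4": 4.79}


def memory_saved(size_gb, baseline_gb=BF16_GB):
    """Fraction of the baseline size that the quantized file avoids."""
    return 1.0 - size_gb / baseline_gb


for name, size in VARIANTS.items():
    print(f"{name}: {memory_saved(size):.1%} smaller than bfloat16")
```

Rounded to whole percentages, this reproduces the 47% and 70% figures.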

---

## How It Works

All `nn.Linear` layers are replaced with custom quantized layers:

- **INT8**: weights are scaled per row to integers in `[-127, 127]`. Scale factors are stored in float32. ~1 byte per weight.
- **INT4**: weights are scaled per row to integers in `[-8, 7]`. Two values are packed per byte (uint8). ~0.5 bytes per weight.
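
The INT4 bullet above can be sketched in plain Python as follows. Several details are assumptions, since the card does not specify them: the scale is taken as `max_abs / 7`, values are shifted by 8 into `[0, 15]` before packing, and the first value of each pair goes into the low nibble. The repository's layers operate on PyTorch tensors and may differ on all three points.

```python
def quantize_row_int4(row):
    """Map one row of float weights to integers in [-8, 7] plus a float scale."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0  # assumed symmetric scaling
    return [max(-8, min(7, round(w / scale))) for w in row], scale


def pack_int4(values):
    """Pack pairs of int4 values into bytes, first value in the low nibble."""
    if len(values) % 2:
        values = values + [0]  # pad odd-length rows with a zero
    packed = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        packed.append((lo + 8) | ((hi + 8) << 4))  # shift [-8, 7] to [0, 15]
    return bytes(packed)


def unpack_int4(packed, count):
    """Invert pack_int4 and return the first `count` values."""
    values = []
    for byte in packed:
        values.append((byte & 0x0F) - 8)
        values.append((byte >> 4) - 8)
    return values[:count]


row = [1.4, -0.6, 0.25, -1.4]
q, scale = quantize_row_int4(row)
packed = pack_int4(q)
restored = [v * scale for v in unpack_int4(packed, len(q))]
print(q, len(packed), restored)
```

Four weights fit in two bytes, which is where the "~0.5 bytes per weight" figure comes from. The per-row scale adds a small fixed overhead on top.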

Both variants dequantize weights on the fly during the forward pass. The model architecture and generation logic are unchanged.
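
As an illustration of weight-only INT8 with on-the-fly dequantization, the toy layer below stores integer rows and one scale per row, then applies the scale during the matrix-vector product. The class name `Int8Linear` and the plain-list representation are assumptions made for readability; the actual layers in the repository replace `nn.Linear` and work on tensors.

```python
class Int8Linear:
    """Toy weight-only INT8 layer: integer rows plus one float scale per row."""

    def __init__(self, weight):
        self.q_rows, self.scales = [], []
        for row in weight:
            max_abs = max(abs(w) for w in row)
            scale = max_abs / 127.0 if max_abs > 0 else 1.0
            self.scales.append(scale)
            self.q_rows.append([max(-127, min(127, round(w / scale))) for w in row])

    def forward(self, x):
        """Compute y = W x from the stored integers.

        Dequantization happens here: multiplying the integer dot product by
        the row scale equals summing (q * scale) * x over the row.
        """
        out = []
        for q_row, scale in zip(self.q_rows, self.scales):
            out.append(scale * sum(q * xi for q, xi in zip(q_row, x)))
        return out


weight = [[0.25, -1.0], [0.02, -0.005]]
layer = Int8Linear(weight)
print(layer.forward([1.0, 2.0]))  # close to the float result [-1.75, 0.01]
```

The second row has much smaller weights than the first. Because each row gets its own scale, it still uses the full `[-127, 127]` range instead of collapsing toward zero.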

---

## Usage

### Installation

```bash
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
```

### Load and Generate

```python
from inference import load_quantized, generate
from transformers import AutoTokenizer

# The tokenizer is loaded from the base model repository.
tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True
)

# Download weights from this repo first, then load ONE of the two variants.

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda"
)

# INT4 (alternative to INT8; uncomment to load it instead)
# model = load_quantized(
#     "llada_int4_quantized.pt",
#     mode="int4",
#     device="cuda"
# )

output = generate(model, tokenizer, "What is machine learning?")
print(output)
```
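
One way to handle the "download weights first" step is `hf_hub_download` from the `huggingface_hub` package. The sketch below assumes the two `.pt` files sit at the root of the `qubitron/LLaDA-8B-Quantized` repository under the names listed in the table; the helper `download_weights` is hypothetical and not part of the project's code.

```python
REPO_ID = "qubitron/LLaDA-8B-Quantized"  # assumed repository id for the weights
WEIGHT_FILES = {
    "int8": "llada_int8_quantized.pt",
    "int4": "llada_int4_quantized.pt",
}


def download_weights(mode):
    """Fetch one checkpoint into the local Hugging Face cache and return its path."""
    if mode not in WEIGHT_FILES:
        raise ValueError(f"mode must be one of {sorted(WEIGHT_FILES)}, got {mode!r}")
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=WEIGHT_FILES[mode])
```

The returned path can then be passed as the first argument to `load_quantized`, for example `load_quantized(download_weights("int8"), mode="int8", device="cuda")`.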

### Quantize from Scratch

```python
from quantize import run_and_save

run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
```

---

## Hardware Requirements

| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |

Tested on: NVIDIA A100 80GB, NVIDIA H100
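
The minimum-VRAM column can be turned into a small selection helper. This is only a sketch: `pick_variant` is a hypothetical name, and the thresholds are the table's minimums, which leave no headroom for long sequences or other processes on the GPU.

```python
# Minimum VRAM in GB, copied from the table above.
MIN_VRAM_GB = {"int8": 12, "int4": 8}


def pick_variant(free_vram_gb):
    """Return the higher-precision variant that fits, or None if neither does."""
    for mode in ("int8", "int4"):
        if free_vram_gb >= MIN_VRAM_GB[mode]:
            return mode
    return None


print(pick_variant(24), pick_variant(10), pick_variant(6))
```

On a CUDA machine, the free memory could be read with `torch.cuda.mem_get_info()`, which returns free and total bytes, and divided by `1024**3` before calling the helper.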

---

## Limitations

- INT4 introduces slightly more quantization error than INT8
- Generation speed depends on sequence length and number of diffusion steps
- English only (inherited from base model)

---

## Citation

If you use this work, please cite:

```bibtex
@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}
```

Original LLaDA paper:

```bibtex
@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}
```

---

## License

Apache 2.0, the same as the original LLaDA model.