---
license: apache-2.0
language:
- en
base_model:
- GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
tags:
- diffusion-language-model
- quantization
library_name: transformers
---
# LLaDA-8B-Quantized
**World's first INT8 and INT4 weight-only quantization for [LLaDA](https://github.com/ML-GSAI/LLaDA) — a masked diffusion large language model trained from scratch at 8B scale.**
> Code & full documentation: [github.com/qubitronlabsdev/llada-quantization](https://github.com/qubitronlabsdev/llada-quantization)
---
## Model Description
LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens **in parallel** via iterative masked denoising — unlike autoregressive models (GPT, LLaMA) that generate one token at a time.
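For intuition, here is a minimal sketch of that decoding loop. It is illustrative only: the function name, the `MASK_ID` value, and the HF-style `.logits` access are assumptions, and the real LLaDA sampler uses a more careful remasking schedule than plain greedy unmasking.
```python
import torch

MASK_ID = 126336  # LLaDA's [MASK] token id (verify against the base repo)

@torch.no_grad()
def masked_denoise(model, prompt_ids, gen_len=128, steps=64):
    """Start from an all-[MASK] completion and iteratively commit the
    most confident predictions until every position is unmasked."""
    masks = torch.full((1, gen_len), MASK_ID,
                       dtype=torch.long, device=prompt_ids.device)
    x = torch.cat([prompt_ids, masks], dim=1)
    per_step = gen_len // steps                      # assumes steps divides gen_len
    for _ in range(steps):
        logits = model(x).logits                     # predicts all positions at once
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(x != MASK_ID, -1.0)  # only fill masked slots
        idx = conf.topk(per_step, dim=-1).indices    # most confident masked positions
        x.scatter_(1, idx, pred.gather(1, idx))
    return x
```
Each step commits several tokens at once, which is why throughput depends on the number of diffusion steps rather than on output length alone.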
This repository provides two post-training quantized variants of `GSAI-ML/LLaDA-8B-Instruct`:
| File | Quantization | Size | Memory saved vs. bf16 | Throughput (A100) |
|---|---|---|---|---|
| `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | **47%** | **9.64 tok/s** |
| `llada_int4_quantized.pt` | INT4 packed | 4.79 GB | **70%** | 3.39 tok/s |
Original model (bfloat16): 16.13 GB
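The savings follow directly from that baseline: 1 − 8.54/16.13 ≈ 47% and 1 − 4.79/16.13 ≈ 70%.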
---
## How It Works
All `nn.Linear` layers are replaced with custom quantized layers:
- **INT8** — weights scaled per-row to `[-127, 127]` integers. Scale factors stored in float32. ~1 byte per weight.
- **INT4** — weights scaled per-row to `[-8, 7]` integers. Two values packed per byte (uint8). ~0.5 bytes per weight. Both schemes are sketched below.
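A minimal sketch of the two schemes (illustrative, not the repo's exact code; the function names are made up and edge cases such as all-zero rows are skipped):
```python
import torch

def quantize_int8_row(w: torch.Tensor):
    """Symmetric per-row INT8: one float32 scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0            # (out, 1)
    q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)  # ~1 byte/weight
    return q, scale.float()

def quantize_int4_row(w: torch.Tensor):
    """Symmetric per-row INT4: two values packed into each uint8."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = (torch.round(w / scale).clamp_(-8, 7) + 8).to(torch.uint8)  # shift to [0, 15]
    packed = q[:, 0::2] | (q[:, 1::2] << 4)  # 2 nibbles/byte; assumes even in_features
    return packed, scale.float()
```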
Both variants dequantize weights on the fly during the forward pass; no changes to the model architecture or generation logic are required.
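A sketch of the corresponding drop-in layer (again illustrative; the class name and constructor are assumptions, and an INT4 version would first unpack the two nibbles from each byte and subtract the zero point of 8 before scaling):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Int8Linear(nn.Module):
    """Stands in for nn.Linear: stores INT8 weights plus per-row scales
    and dequantizes to the activation dtype inside each forward pass."""
    def __init__(self, q_weight: torch.Tensor, scale: torch.Tensor, bias=None):
        super().__init__()
        self.register_buffer("q_weight", q_weight)  # (out, in) int8
        self.register_buffer("scale", scale)        # (out, 1) float32
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly; no persistent full-precision weight copy.
        w = self.q_weight.to(x.dtype) * self.scale.to(x.dtype)
        return F.linear(x, w, self.bias)
```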
---
## Usage
### Installation
```bash
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
```
### Load and Generate
```python
from inference import load_quantized, generate
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True,
)

# Download the weights from this repo first, then load one variant:

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda",
)

# INT4
model = load_quantized(
    "llada_int4_quantized.pt",
    mode="int4",
    device="cuda",
)
output = generate(model, tokenizer, "What is machine learning?")
print(output)
```
### Quantize from Scratch
```python
from quantize import run_and_save

# Quantize the base model's linear layers and save each packed checkpoint
run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
```
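The resulting checkpoints should match the sizes in the table above (about 8.54 GB for INT8 and 4.79 GB for INT4).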
---
## Hardware Requirements
| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |
Tested on: NVIDIA A100 80GB, NVIDIA H100
---
## Limitations
- INT4 introduces slightly more quantization error than INT8
- Generation speed depends on sequence length and number of diffusion steps
- English only (inherited from base model)
---
## Citation
If you use this work, please cite:
```bibtex
@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}
```
Original LLaDA paper:
```bibtex
@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}
```
---
## License
Apache 2.0 — same as the original LLaDA model.