---
license: apache-2.0
language:
- en
base_model:
- GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
tags:
- diffusion-language-model
- quantization
library_name: transformers
---
# LLaDA-8B-Quantized

**World's first INT8 and INT4 weight-only quantization for [LLaDA](https://github.com/ML-GSAI/LLaDA) — a masked diffusion large language model trained from scratch at 8B scale.**

> Code & full documentation: [github.com/qubitronlabsdev/llada-quantization](https://github.com/qubitronlabsdev/llada-quantization)

---

## Model Description

LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens **in parallel** via iterative masked denoising — unlike autoregressive models (GPT, LLaMA) that generate one token at a time.
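
To make "parallel generation via iterative masked denoising" concrete, here is a toy sketch in plain Python. It is not LLaDA's sampler: `toy_predict` is a hypothetical stand-in for the model that returns a fixed target with a random confidence, and the rule of committing the most confident predictions at each step is an assumption chosen for illustration.

```python
import random

MASK = "<mask>"
# Fixed "answer" the toy predictor returns; a real model would predict tokens.
TARGET = ["diffusion", "models", "fill", "in", "masked", "tokens", "in", "parallel"]

def toy_predict(position: int, rng: random.Random):
    """Hypothetical stand-in for the model: (token, confidence) for one position."""
    return TARGET[position], rng.random()

def masked_denoise(length: int, steps: int, seed: int = 0):
    """Start fully masked and unmask a share of positions at every step."""
    rng = random.Random(seed)
    seq = [MASK] * length
    for step in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Predict every masked position in the same step (the "parallel" part).
        preds = {i: toy_predict(i, rng) for i in masked}
        # Commit the most confident predictions; the rest stay masked.
        remaining_steps = steps - step
        keep = -(-len(masked) // remaining_steps)  # ceiling division
        ranked = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in ranked[:keep]:
            seq[i] = preds[i][0]
    return seq

result = masked_denoise(length=len(TARGET), steps=4)
print(" ".join(result))
```

With 8 positions and 4 steps, two positions are committed per step, whereas an autoregressive decoder would need 8 sequential steps for the same sequence.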

This repository provides two post-training quantized variants of `GSAI-ML/LLaDA-8B-Instruct`:

| File | Quantization | Size | Memory Saved | Speed (A100) |
|---|---|---|---|---|
| `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | **47%** | **9.64 tok/s** |
| `llada_int4_quantized.pt` | INT4 packed | 4.79 GB | **70%** | 3.39 tok/s |

Original model (bfloat16): 16.13 GB
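
The "Memory Saved" column follows directly from the file sizes. A minimal check, using only the numbers from the table and the baseline above (the function name is illustrative):

```python
# Checkpoint sizes in GB, copied from the table above.
BASELINE_GB = 16.13  # original bfloat16 model
SIZES_GB = {"int8": 8.54, "int4": 4.79}

def memory_saved_percent(quantized_gb: float, baseline_gb: float = BASELINE_GB) -> float:
    """Return the size reduction relative to the baseline, in percent."""
    return (1.0 - quantized_gb / baseline_gb) * 100.0

for name, size in SIZES_GB.items():
    print(f"{name}: {memory_saved_percent(size):.1f}% smaller than bfloat16")
```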

---

## How It Works

All `nn.Linear` layers are replaced with custom quantized layers:

- **INT8**: each weight row is scaled and rounded to integers in `[-127, 127]`. Scale factors are stored in float32. ~1 byte per weight.
- **INT4**: each weight row is scaled and rounded to integers in `[-8, 7]`. Two values are packed into each byte (uint8). ~0.5 bytes per weight.
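
The two storage formats can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the repository's code: the symmetric max-abs scaling, the low-nibble-first packing order, and all function names are hypothetical choices.

```python
import numpy as np

def quantize_per_row(w: np.ndarray, qmax: int, qmin: int):
    """Symmetric per-row quantization: one float32 scale per output row."""
    # Assumption: the largest magnitude in each row maps to qmax.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = (np.maximum(max_abs, 1e-8) / qmax).astype(np.float32)
    q = np.clip(np.round(w / scale), qmin, qmax).astype(np.int8)
    return q, scale

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack pairs of values in [-8, 7] into one uint8 each (low nibble first)."""
    assert q.shape[1] % 2 == 0, "this sketch assumes an even number of columns"
    u = (q.astype(np.int16) + 8).astype(np.uint8)  # shift to [0, 15]
    return (u[:, 0::2] | (u[:, 1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover int8 values in [-8, 7]."""
    low = (packed & 0x0F).astype(np.int8) - 8
    high = (packed >> 4).astype(np.int8) - 8
    out = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.int8)
    out[:, 0::2] = low
    out[:, 1::2] = high
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # synthetic weights

q8, s8 = quantize_per_row(w, qmax=127, qmin=-127)  # 1 byte per weight
q4, s4 = quantize_per_row(w, qmax=7, qmin=-8)
packed = pack_int4(q4)                             # 0.5 bytes per weight
print(q8.nbytes, packed.nbytes)
```

Packing halves the byte count relative to INT8, which is where the "~0.5 bytes per weight" figure comes from; the per-row scales add a small overhead on top.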

Both variants dequantize weights on the fly during the forward pass. Apart from the replaced linear layers, the model architecture and generation logic are unchanged.
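
On-the-fly dequantization means the integer weights stay in memory and a float copy exists only for the duration of one forward call. A minimal sketch of that idea, written in NumPy to stay dependency-light; `QuantLinearInt8` is a hypothetical name and does not reflect the repository's actual layer classes:

```python
import numpy as np

class QuantLinearInt8:
    """Hypothetical stand-in for a quantized linear layer (not the repo's class)."""

    def __init__(self, weight: np.ndarray, bias=None):
        # Assumption: symmetric per-row scaling to [-127, 127].
        max_abs = np.abs(weight).max(axis=1, keepdims=True)
        self.scale = (np.maximum(max_abs, 1e-8) / 127.0).astype(np.float32)
        self.q_weight = np.clip(np.round(weight / self.scale), -127, 127).astype(np.int8)
        self.bias = bias

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Dequantize for this call only; the float copy is discarded afterwards.
        w = self.q_weight.astype(np.float32) * self.scale
        y = x @ w.T
        return y if self.bias is None else y + self.bias

rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 32)).astype(np.float32)  # synthetic weights
x = rng.standard_normal((2, 32)).astype(np.float32)

layer = QuantLinearInt8(weight)
y_quant = layer.forward(x)
y_ref = x @ weight.T  # full-precision reference
max_abs_diff = float(np.abs(y_quant - y_ref).max())
print(f"max |quantized - reference| = {max_abs_diff:.4f}")
```

The trade-off in this design is extra compute per call in exchange for lower resident memory, since the dequantized matrix is rebuilt every time the layer runs.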

---

## Usage

### Installation

```bash
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
```

### Load and Generate

```python
from inference import load_quantized, generate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True
)

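# Download the weights file from this repo first, then load ONE variant.
# Assigning both to `model` in sequence loads the checkpoint twice and
# briefly keeps both copies in GPU memory.

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda"
)

# INT4 (alternative: use this call instead of the INT8 one)
# model = load_quantized(
#     "llada_int4_quantized.pt",
#     mode="int4",
#     device="cuda"
# )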

output = generate(model, tokenizer, "What is machine learning?")
print(output)
```

### Quantize from Scratch

```python
from quantize import run_and_save

run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
```

---

## Hardware Requirements

| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |

Tested on: NVIDIA A100 80GB, NVIDIA H100
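
The table can be turned into a simple selection rule. The helper below is a hypothetical convenience, not part of the repository; it prefers INT8 when it fits because that is the faster variant in the benchmark table. On a real machine, the available memory could be read with `torch.cuda.get_device_properties(0).total_memory` (in bytes) and converted to GB before calling it.

```python
from typing import Optional

# Minimum VRAM in GB per variant, copied from the table above.
MIN_VRAM_GB = {"int8": 12, "int4": 8}

def pick_variant(available_vram_gb: float) -> Optional[str]:
    """Prefer INT8 when it fits, fall back to INT4, else return None."""
    for mode in ("int8", "int4"):
        if available_vram_gb >= MIN_VRAM_GB[mode]:
            return mode
    return None

for vram in (24, 10, 6):
    print(vram, "GB ->", pick_variant(vram))
```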

---

## Limitations

- INT4 introduces more quantization error than INT8, since each weight has 16 representable levels (`[-8, 7]`) instead of 255 (`[-127, 127]`)
- Generation speed depends on sequence length and number of diffusion steps
- English only (inherited from base model)
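
The difference in quantization error between the two variants can be measured on a weight matrix with a quantize/dequantize round trip. The sketch below uses synthetic Gaussian weights and assumes symmetric per-row max-abs scaling, so the numbers it prints say nothing about LLaDA's actual weights or its generation quality; it only shows how such a comparison could be set up.

```python
import numpy as np

def roundtrip_error(w: np.ndarray, qmax: int, qmin: int) -> float:
    """Relative Frobenius error after per-row quantize -> dequantize."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / qmax
    q = np.clip(np.round(w / scale), qmin, qmax)
    return float(np.linalg.norm(q * scale - w) / np.linalg.norm(w))

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # synthetic, not LLaDA weights

err_int8 = roundtrip_error(w, qmax=127, qmin=-127)
err_int4 = roundtrip_error(w, qmax=7, qmin=-8)
print(f"INT8 relative error: {err_int8:.4f}")
print(f"INT4 relative error: {err_int4:.4f}")
```

Running the same function over a real layer's weights would give a per-layer error profile, which is one way to decide which layers tolerate INT4.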

---

## Citation

If you use this work, please cite:

```bibtex
@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}
```

Original LLaDA paper:

```bibtex
@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}
```

---

## License

Apache 2.0 — same as the original LLaDA model.