---
license: apache-2.0
language:
- en
base_model:
- GSAI-ML/LLaDA-8B-Instruct
pipeline_tag: text-generation
tags:
- diffusion-language-model
- quantization
library_name: transformers
---
# LLaDA-8B-Quantized
**World's first INT8 and INT4 weight-only quantization for [LLaDA](https://github.com/ML-GSAI/LLaDA) — a masked diffusion large language model trained from scratch at 8B scale.**
> Code & full documentation: [github.com/qubitronlabsdev/llada-quantization](https://github.com/qubitronlabsdev/llada-quantization)
---
## Model Description
LLaDA (Large Language Diffusion with mAsking) is a diffusion-based language model that generates tokens **in parallel** via iterative masked denoising — unlike autoregressive models (GPT, LLaMA) that generate one token at a time.
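For intuition, here is a minimal sketch of that decoding loop. It is illustrative only: the function name, the `MASK_ID` value, and the HF-style `.logits` access are assumptions, and the real LLaDA sampler uses a more careful remasking schedule than plain greedy unmasking.
```python
import torch

MASK_ID = 126336  # LLaDA's [MASK] token id (verify against the base repo)

@torch.no_grad()
def masked_denoise(model, prompt_ids, gen_len=128, steps=64):
    """Start from an all-[MASK] completion and iteratively commit the
    most confident predictions until every position is unmasked."""
    masks = torch.full((1, gen_len), MASK_ID,
                       dtype=torch.long, device=prompt_ids.device)
    x = torch.cat([prompt_ids, masks], dim=1)
    per_step = gen_len // steps                      # assumes steps divides gen_len
    for _ in range(steps):
        logits = model(x).logits                     # predicts all positions at once
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        conf = conf.masked_fill(x != MASK_ID, -1.0)  # only fill masked slots
        idx = conf.topk(per_step, dim=-1).indices    # most confident masked positions
        x.scatter_(1, idx, pred.gather(1, idx))
    return x
```
Each step commits several tokens at once, which is why throughput depends on the number of diffusion steps rather than on output length alone.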
This repository provides two post-training quantized variants of `GSAI-ML/LLaDA-8B-Instruct`:
| File | Quantization | Size | Memory saved vs. bf16 | Throughput (A100) |
|---|---|---|---|---|
| `llada_int8_quantized.pt` | INT8 per-row | 8.54 GB | **47%** | **9.64 tok/s** |
| `llada_int4_quantized.pt` | INT4 packed | 4.79 GB | **70%** | 3.39 tok/s |
Original model (bfloat16): 16.13 GB
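The savings follow directly from that baseline: 1 − 8.54/16.13 ≈ 47% and 1 − 4.79/16.13 ≈ 70%.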
---
## How It Works
All `nn.Linear` layers are replaced with custom quantized layers:
- **INT8** — weights scaled per-row to `[-127, 127]` integers. Scale factors stored in float32. ~1 byte per weight.
- **INT4** — weights scaled per-row to `[-8, 7]` integers. Two values packed per byte (uint8). ~0.5 bytes per weight. Both schemes are sketched below.
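A minimal sketch of the two schemes (illustrative, not the repo's exact code; the function names are made up and edge cases such as all-zero rows are skipped):
```python
import torch

def quantize_int8_row(w: torch.Tensor):
    """Symmetric per-row INT8: one float32 scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0            # (out, 1)
    q = torch.round(w / scale).clamp_(-127, 127).to(torch.int8)  # ~1 byte/weight
    return q, scale.float()

def quantize_int4_row(w: torch.Tensor):
    """Symmetric per-row INT4: two values packed into each uint8."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7.0
    q = (torch.round(w / scale).clamp_(-8, 7) + 8).to(torch.uint8)  # shift to [0, 15]
    packed = q[:, 0::2] | (q[:, 1::2] << 4)  # 2 nibbles/byte; assumes even in_features
    return packed, scale.float()
```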
Both variants dequantize weights on the fly during the forward pass; no changes to the model architecture or generation logic are required.
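A sketch of the corresponding drop-in layer (again illustrative; the class name and constructor are assumptions, and an INT4 version would first unpack the two nibbles from each byte and subtract the zero point of 8 before scaling):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Int8Linear(nn.Module):
    """Stands in for nn.Linear: stores INT8 weights plus per-row scales
    and dequantizes to the activation dtype inside each forward pass."""
    def __init__(self, q_weight: torch.Tensor, scale: torch.Tensor, bias=None):
        super().__init__()
        self.register_buffer("q_weight", q_weight)  # (out, in) int8
        self.register_buffer("scale", scale)        # (out, 1) float32
        self.bias = bias

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Dequantize on the fly; no persistent full-precision weight copy.
        w = self.q_weight.to(x.dtype) * self.scale.to(x.dtype)
        return F.linear(x, w, self.bias)
```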
---
## Usage
### Installation
```bash
git clone https://github.com/qubitronlabsdev/llada-quantization
cd llada-quantization
pip install -r requirements.txt
```
### Load and Generate
```python
from inference import load_quantized, generate
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    trust_remote_code=True,
)

# Download the weights from this repo first, then load one variant:

# INT8
model = load_quantized(
    "llada_int8_quantized.pt",
    mode="int8",
    device="cuda",
)

# INT4
model = load_quantized(
    "llada_int4_quantized.pt",
    mode="int4",
    device="cuda",
)
output = generate(model, tokenizer, "What is machine learning?")
print(output)
```
### Quantize from Scratch
```python
from quantize import run_and_save

# Quantize the base model's linear layers and save each packed checkpoint
run_and_save(mode="int8", save_path="llada_int8_quantized.pt")
run_and_save(mode="int4", save_path="llada_int4_quantized.pt")
```
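The resulting checkpoints should match the sizes in the table above (about 8.54 GB for INT8 and 4.79 GB for INT4).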
---
## Hardware Requirements
| Variant | Min VRAM | Recommended |
|---|---|---|
| INT8 | 12 GB | A100 / H100 |
| INT4 | 8 GB | RTX 3090 / A100 |
Tested on: NVIDIA A100 80GB, NVIDIA H100
---
## Limitations
- INT4 introduces slightly more quantization error than INT8
- Generation speed depends on sequence length and number of diffusion steps
- English only (inherited from base model)
---
## Citation
If you use this work, please cite:
```bibtex
@misc{llada-quantization-2026,
  title  = {LLaDA Quantization: INT8 and INT4 for Diffusion Language Models},
  author = {Dhiraj Choudhary},
  year   = {2026},
  url    = {https://github.com/qubitronlabsdev/llada-quantization}
}
```
Original LLaDA paper:
```bibtex
@article{nie2025large,
  title  = {Large Language Diffusion Models},
  author = {Nie, Shen and others},
  year   = {2025},
  url    = {https://arxiv.org/abs/2502.09992}
}
```
---
## License
Apache 2.0 — same as the original LLaDA model.