# Dusk-8B-INT4

Dusk-8B-INT4 is a 4-bit quantized version of Dusk-8B, a Masked Diffusion Language Model (MDM) capable of high-quality, coherent text generation.
Unlike autoregressive LLMs (GPT, Llama, etc.) that generate text left-to-right one token at a time, Dusk generates all tokens in parallel through iterative denoising. This enables global planning, bidirectional reasoning, and holistic output generation.
This INT4 model was quantized using optimum-quanto, making it device-agnostic: it runs on CPU, CUDA, and Apple Silicon MPS without any dequantization spikes.
## Model Details
| Property | Value |
|---|---|
| Architecture | Masked Diffusion Language Model |
| Parameters | 8B |
| Quantization | INT4 (via optimum-quanto) |
| Base Model | GSAI-ML/LLaDA-8B-Instruct |
| Memory (INT4) | ~4-5 GB RAM |
| Memory (FP16) | ~16 GB RAM |
| License | Apache 2.0 |
## Quick Start

### Installation

```bash
pip install transformers accelerate optimum-quanto sentencepiece
```
### Load & Generate

```python
import sys

import torch
from transformers import AutoTokenizer
from optimum.quanto import quantize, freeze, qint4

# --- Load model (you need the models/ package from our repo) ---
# git clone https://github.com/QubitronLabs/dusk && cd dusk
# pip install -r requirements.txt
sys.path.insert(0, ".")  # project root
from models import DuskModelLM

MODEL_ID = "QubitronLabs/dusk-8b-int4"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = DuskModelLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cpu",  # change to "cuda" or "mps" if available
    trust_remote_code=True,
)

# Re-apply quantization at runtime
quantize(model, weights=qint4, exclude=["lm_head"])
freeze(model)
model.eval()
```
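After `freeze()`, it can be worth confirming the quantized model really lands in the advertised ~4-5 GB range. The helper below is our own sketch (the name `param_gb` is not from the repo); it just sums the bytes held by parameters and buffers, demonstrated here on a tiny stand-in layer.

```python
from torch import nn

# Rough sanity check (helper name `param_gb` is ours, not from the repo):
# sum the bytes of every parameter and buffer tensor in the model.
def param_gb(model: nn.Module) -> float:
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / 1024**3

# Demo on a tiny stand-in layer: a 1024x1024 fp32 weight is exactly 4 MiB
print(f"{param_gb(nn.Linear(1024, 1024, bias=False)):.8f} GB")
```

Run `param_gb(model)` on the real quantized model instead of the demo layer; note this counts weights only, not activation memory at inference time.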
### Chat Inference

```python
# Import the custom MDM generate function (from generate.py in the repo)
from generate import generate

MASK_ID = 126336  # [MASK] token id

prompt = "Explain the difference between supervised and unsupervised learning."
messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
input_ids = tokenizer(formatted, return_tensors="pt")["input_ids"].to(model.device)

with torch.no_grad():
    out = generate(
        model,
        input_ids,
        steps=128,
        gen_length=128,
        block_length=128,
        temperature=0.,
        cfg_scale=0.,
        remasking="low_confidence",
        mask_id=MASK_ID,
    )

response = tokenizer.batch_decode(out[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(response)
```
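How `steps`, `gen_length`, and `block_length` interact is worth spelling out. In samplers of this family (e.g. LLaDA's reference implementation; we have not verified this against the repo's `generate.py`), generation proceeds block by block and the step budget is split evenly across blocks:

```python
# Hedged sketch of how the sampler budget divides across blocks
# (assumption based on common MDM samplers, not verified against generate.py)
gen_length, block_length, steps = 128, 128, 128

num_blocks = gen_length // block_length  # 1 block when the two are equal
steps_per_block = steps // num_blocks    # all 128 denoising steps on that block
print(num_blocks, steps_per_block)
```

Under this reading, setting `block_length=32` would give 4 blocks of 32 steps each, trading global parallelism for a more left-to-right refinement order.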
## Why Masked Diffusion?
| Feature | Autoregressive (GPT, Llama) | Dusk (MDM) |
|---|---|---|
| Generation direction | Left → Right | All tokens simultaneously |
| Can revise earlier tokens | ❌ No | ✅ Yes |
| Global planning | Limited | Native |
| Bidirectional context | Partial (decoder-only) | Full |
| Speed (parallel hardware) | Sequential bottleneck | Highly parallelisable |
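To make the "low_confidence" remasking idea in the table concrete, here is a toy single-step sketch (our own illustration, not the repo's `generate.py`): at each denoising step the model commits only the masked positions it is most confident about and leaves the rest masked for later steps.

```python
import torch

# Toy illustration of one low-confidence remasking step (our own sketch,
# not the repo's implementation). `logits` is [seq, vocab]; `tokens` is
# [seq] with masked positions holding `mask_id`.
def denoise_step(logits, tokens, mask_id, n_keep):
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)           # per-position confidence & argmax
    masked = tokens == mask_id
    # Only masked positions compete; already-committed slots rank lowest
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    keep = conf.topk(min(n_keep, int(masked.sum()))).indices
    out = tokens.clone()
    out[keep] = pred[keep]                   # commit the most confident predictions
    return out

tokens = torch.tensor([9, 9, 2])             # positions 0 and 1 masked (mask_id=9)
logits = torch.tensor([[5.0, 0.0, 0.0],      # pos 0: very confident -> token 0
                       [0.1, 0.2, 0.0],      # pos 1: uncertain, stays masked
                       [0.0, 0.0, 1.0]])     # pos 2: already committed
print(denoise_step(logits, tokens, mask_id=9, n_keep=1))  # tensor([0, 9, 2])
```

Because every position's logits are available at every step, an uncertain early token can be revised after later context has been filled in, which is exactly what an autoregressive decoder cannot do.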
## Why INT4?

Standard FP16 Dusk-8B requires ~16 GB of RAM, too large for most consumer hardware. INT4 quantization via optimum-quanto reduces this to **~4-5 GB** while preserving generation quality.
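The back-of-envelope arithmetic behind those numbers: 8B parameters at 2 bytes each (FP16) versus 0.5 bytes each (INT4). The real footprint is slightly higher because `lm_head` stays unquantized and buffers and activations add overhead.

```python
# Back-of-envelope weight memory for 8B parameters
params = 8e9
fp16_gib = params * 2 / 1024**3    # 2 bytes per param   -> ~14.9 GiB
int4_gib = params * 0.5 / 1024**3  # 0.5 bytes per param -> ~3.7 GiB
print(f"FP16: {fp16_gib:.1f} GiB, INT4: {int4_gib:.1f} GiB")
```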
### Why optimum-quanto over bitsandbytes/GPTQ/AWQ?

- `quanto` is device-agnostic: saved INT4 weights load on CPU, CUDA, and Apple MPS
- bitsandbytes/GPTQ/AWQ are CUDA-only and won't run on a Mac
- No FP16 dequantisation spikes during loading
## Hardware Requirements
| Hardware | Supported | Notes |
|---|---|---|
| NVIDIA GPU (CUDA) | ✅ | Use `device_map="cuda"` |
| Apple Silicon MPS | ✅ | Use `device_map="mps"` |
| CPU only | ✅ | Use `device_map="cpu"`; slow but works |
| Min RAM (INT4) | 6 GB | 8 GB+ recommended |
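Rather than hard-coding `device_map`, you can pick the best available backend at runtime. A small helper of our own (the name `pick_device` is not from the repo):

```python
import torch

# Choose the best available device for device_map (helper is our own)
def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

Pass the result as `device_map=pick_device()` in the loading snippet above.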
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{dusk2026,
  title  = {Dusk: A Masked Diffusion Language Model},
  author = {Qubitron Labs},
  year   = {2026},
  url    = {https://huggingface.co/QubitronLabs/dusk-8b-int4}
}
```
## License

Apache License 2.0; see LICENSE for details.