# Dusk-8B-INT4

Dusk-8B-INT4 is a 4-bit quantized version of Dusk-8B, a Masked Diffusion Language Model (MDM) capable of high-quality, coherent text generation.
Unlike autoregressive LLMs (GPT, Llama, etc.) that generate text left-to-right one token at a time, Dusk generates all tokens in parallel through iterative denoising. This enables global planning, bidirectional reasoning, and holistic output generation.
This INT4 model was quantized using optimum-quanto, making it device-agnostic: it runs on CPU, CUDA, and Apple Silicon MPS without any dequantization spikes.
## Model Details
| Property | Value |
|---|---|
| Architecture | Masked Diffusion Language Model |
| Parameters | 8B |
| Quantization | INT4 (via optimum-quanto) |
| Base Model | GSAI-ML/LLaDA-8B-Instruct |
| Memory (INT4) | ~4-5 GB RAM |
| Memory (FP16) | ~16 GB RAM |
| License | Apache 2.0 |
## Quick Start

### Installation

```bash
pip install transformers accelerate optimum-quanto sentencepiece
```
### Load & Generate

```python
import sys

import torch
from transformers import AutoTokenizer
from optimum.quanto import quantize, freeze, qint4

# --- Load model (you need the models/ package from our repo) ---
# git clone https://github.com/QubitronLabs/dusk && cd dusk
# pip install -r requirements.txt
sys.path.insert(0, ".")  # project root
from models import DuskModelLM

MODEL_ID = "QubitronLabs/dusk-8b-int4"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = DuskModelLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="cpu",  # change to "cuda" or "mps" if available
    trust_remote_code=True,
)

# Re-apply quantization at runtime
quantize(model, weights=qint4, exclude=["lm_head"])
freeze(model)
model.eval()
```
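After `freeze()`, it can be worth confirming the quantized model really lands in the advertised ~4-5 GB range. The helper below is our own sketch (the name `param_gb` is not from the repo); it just sums the bytes held by parameters and buffers, demonstrated here on a tiny stand-in layer.

```python
from torch import nn

# Rough sanity check (helper name `param_gb` is ours, not from the repo):
# sum the bytes of every parameter and buffer tensor in the model.
def param_gb(model: nn.Module) -> float:
    n_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    n_bytes += sum(b.numel() * b.element_size() for b in model.buffers())
    return n_bytes / 1024**3

# Demo on a tiny stand-in layer: a 1024x1024 fp32 weight is exactly 4 MiB
print(f"{param_gb(nn.Linear(1024, 1024, bias=False)):.8f} GB")
```

Run `param_gb(model)` on the real quantized model instead of the demo layer; note this counts weights only, not activation memory at inference time.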
### Chat Inference

```python
# Import the custom MDM generate function (from generate.py in the repo)
from generate import generate

MASK_ID = 126336  # [MASK] token id

prompt = "Explain the difference between supervised and unsupervised learning."
messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
input_ids = tokenizer(formatted, return_tensors="pt")["input_ids"].to(model.device)

with torch.no_grad():
    out = generate(
        model,
        input_ids,
        steps=128,
        gen_length=128,
        block_length=128,
        temperature=0.,
        cfg_scale=0.,
        remasking="low_confidence",
        mask_id=MASK_ID,
    )

response = tokenizer.batch_decode(out[:, input_ids.shape[1]:], skip_special_tokens=True)[0]
print(response)
```
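How `steps`, `gen_length`, and `block_length` interact is worth spelling out. In samplers of this family (e.g. LLaDA's reference implementation; we have not verified this against the repo's `generate.py`), generation proceeds block by block and the step budget is split evenly across blocks:

```python
# Hedged sketch of how the sampler budget divides across blocks
# (assumption based on common MDM samplers, not verified against generate.py)
gen_length, block_length, steps = 128, 128, 128

num_blocks = gen_length // block_length  # 1 block when the two are equal
steps_per_block = steps // num_blocks    # all 128 denoising steps on that block
print(num_blocks, steps_per_block)
```

Under this reading, setting `block_length=32` would give 4 blocks of 32 steps each, trading global parallelism for a more left-to-right refinement order.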
## Why Masked Diffusion?
| Feature | Autoregressive (GPT, Llama) | Dusk (MDM) |
|---|---|---|
| Generation direction | Left → Right | All tokens simultaneously |
| Can revise earlier tokens | ❌ No | ✅ Yes |
| Global planning | Limited | Native |
| Bidirectional context | Partial (decoder-only) | Full |
| Speed (parallel hardware) | Sequential bottleneck | Highly parallelisable |
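To make the "low_confidence" remasking idea in the table concrete, here is a toy single-step sketch (our own illustration, not the repo's `generate.py`): at each denoising step the model commits only the masked positions it is most confident about and leaves the rest masked for later steps.

```python
import torch

# Toy illustration of one low-confidence remasking step (our own sketch,
# not the repo's implementation). `logits` is [seq, vocab]; `tokens` is
# [seq] with masked positions holding `mask_id`.
def denoise_step(logits, tokens, mask_id, n_keep):
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)           # per-position confidence & argmax
    masked = tokens == mask_id
    # Only masked positions compete; already-committed slots rank lowest
    conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
    keep = conf.topk(min(n_keep, int(masked.sum()))).indices
    out = tokens.clone()
    out[keep] = pred[keep]                   # commit the most confident predictions
    return out

tokens = torch.tensor([9, 9, 2])             # positions 0 and 1 masked (mask_id=9)
logits = torch.tensor([[5.0, 0.0, 0.0],      # pos 0: very confident -> token 0
                       [0.1, 0.2, 0.0],      # pos 1: uncertain, stays masked
                       [0.0, 0.0, 1.0]])     # pos 2: already committed
print(denoise_step(logits, tokens, mask_id=9, n_keep=1))  # tensor([0, 9, 2])
```

Because every position's logits are available at every step, an uncertain early token can be revised after later context has been filled in, which is exactly what an autoregressive decoder cannot do.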
## Why INT4?

Standard FP16 Dusk-8B requires ~16 GB of RAM, too large for most consumer hardware. INT4 quantization via optimum-quanto reduces this to **~4-5 GB** while preserving generation quality.
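The back-of-envelope arithmetic behind those numbers: 8B parameters at 2 bytes each (FP16) versus 0.5 bytes each (INT4). The real footprint is slightly higher because `lm_head` stays unquantized and buffers and activations add overhead.

```python
# Back-of-envelope weight memory for 8B parameters
params = 8e9
fp16_gib = params * 2 / 1024**3    # 2 bytes per param   -> ~14.9 GiB
int4_gib = params * 0.5 / 1024**3  # 0.5 bytes per param -> ~3.7 GiB
print(f"FP16: {fp16_gib:.1f} GiB, INT4: {int4_gib:.1f} GiB")
```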
### Why optimum-quanto over bitsandbytes/GPTQ/AWQ?

- `quanto` is device-agnostic: saved INT4 weights load on CPU, CUDA, and Apple MPS
- bitsandbytes/GPTQ/AWQ are CUDA-only and won't run on a Mac
- No FP16 dequantisation spikes during loading
## Hardware Requirements
| Hardware | Supported | Notes |
|---|---|---|
| NVIDIA GPU (CUDA) | ✅ | Use `device_map="cuda"` |
| Apple Silicon MPS | ✅ | Use `device_map="mps"` |
| CPU only | ✅ | Use `device_map="cpu"`; slow but works |
| Min RAM (INT4) | 6 GB | 8 GB+ recommended |
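Rather than hard-coding `device_map`, you can pick the best available backend at runtime. A small helper of our own (the name `pick_device` is not from the repo):

```python
import torch

# Choose the best available device for device_map (helper is our own)
def pick_device() -> str:
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

print(pick_device())
```

Pass the result as `device_map=pick_device()` in the loading snippet above.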
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{dusk2026,
  title  = {Dusk: A Masked Diffusion Language Model},
  author = {Qubitron Labs},
  year   = {2026},
  url    = {https://huggingface.co/QubitronLabs/dusk-8b-int4}
}
```
## License

Apache License 2.0; see LICENSE for details.