CBD-LLM: Causal Block Diffusion Language Model (PoC)

CBD-LLM (Causal Block Diffusion) is an experimental hybrid Diffusion–Autoregressive language model that enables block-parallel text generation while retaining standard causal attention, KV caching, and compatibility with pretrained AR weights.

This repository hosts a Proof of Concept (PoC) checkpoint demonstrating the feasibility of parallel decoding with causal attention, trained efficiently on consumer hardware using LoRA.


🔍 Model Overview

| Attribute | Description |
|---|---|
| Model Type | Causal Block Diffusion LLM |
| Base Model | Qwen2.5 |
| Parameters | ~1B (base), LoRA fine-tuned |
| Attention | Standard causal attention |
| Decoding | Block-parallel diffusion |
| Training Stage | Proof of Concept (Research Preview) |
| License | MIT |

Key Idea

CBD-LLM bridges the gap between:

  • Autoregressive LLMs (low data cost, KV-cache friendly, but serial decoding)
  • Diffusion LLMs (parallel decoding, but high training cost and no KV cache)

By combining topological token reordering with block-wise diffusion, CBD-LLM achieves:

  • Parallel generation
  • Low VRAM usage
  • Compatibility with FlashAttention and KV caching
  • Efficient fine-tuning from pretrained AR models

Architecture Summary

1. Topological Reordering (Causal-Friendly Diffusion)

Diffusion models require masked tokens to attend to future context, normally forcing bidirectional attention.

CBD-LLM avoids this by:

  • Physically moving observed tokens to the front
  • Moving masked tokens to the back
  • Preserving original positional IDs (RoPE)

This allows masked tokens to attend to observed tokens using a standard causal mask.

Logical:  [The] [quick] [brown] [fox]
Masked:   [The] [MASK]  [MASK]  [fox]

Physical: [The] [fox] [MASK] [MASK]
Pos IDs:   0     3     1      2

Result: causal attention + KV cache remain intact.
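The reordering above can be sketched in a few lines of Python. This is an illustrative toy (function and variable names are hypothetical, not from the actual codebase): observed tokens move to the front of the block, masked tokens to the back, and every token keeps its original position ID so RoPE embeddings are unaffected.

```python
def reorder_block(tokens, mask_flags):
    """Topological reordering sketch: observed tokens first, masked tokens last.

    tokens:     list of tokens in logical order
    mask_flags: mask_flags[i] is True if position i is masked
    Returns the physically reordered tokens plus their ORIGINAL position IDs.
    """
    observed = [(tok, pos) for pos, (tok, m) in enumerate(zip(tokens, mask_flags)) if not m]
    masked = [(tok, pos) for pos, (tok, m) in enumerate(zip(tokens, mask_flags)) if m]
    physical = observed + masked  # observed prefix, masked suffix
    new_tokens = [tok for tok, _ in physical]
    position_ids = [pos for _, pos in physical]  # original RoPE positions preserved
    return new_tokens, position_ids

phys, pos_ids = reorder_block(
    ["The", "[MASK]", "[MASK]", "fox"],
    [False, True, True, False],
)
# phys    -> ["The", "fox", "[MASK]", "[MASK]"]
# pos_ids -> [0, 3, 1, 2]
```

Because every masked token now sits physically *after* all observed tokens, a plain lower-triangular causal mask lets it attend to the full observed context.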


2. Block-Wise Variable Noise Diffusion

Instead of diffusing entire sequences:

  • Text is generated in fixed-size blocks (e.g., 64 tokens)
  • Each block undergoes multiple denoising steps
  • The full block is refined in parallel

The model learns both:

  • Drafting from noise
  • Refinement from partial context
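A minimal sketch of the block-wise denoising loop, assuming a confidence-style schedule that unmasks a fraction of positions per step (all names are hypothetical, and the transformer predictor is replaced by a stub):

```python
MASK = -1  # sentinel for a masked token


def denoise_block(block, predict, num_steps=4):
    """Fill masked positions over num_steps refinement passes.

    In the real model each pass predicts all masked tokens in parallel and
    commits the most confident ones; here we commit the earliest masked
    positions for determinism.
    """
    block = list(block)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break
        # Unmask an even share of the remaining positions each step.
        k = max(1, len(masked) // (num_steps - step))
        for i in masked[:k]:
            block[i] = predict(block, i)
    return block


# Stub predictor: "predicts" the position index as the token.
result = denoise_block([MASK] * 8, lambda blk, i: i)
# result -> [0, 1, 2, 3, 4, 5, 6, 7]; every MASK has been replaced
```

With fixed-size blocks, the same loop runs once per block while previous blocks stay frozen in the KV cache, which is what makes the scheme cache-friendly.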

Intended Use: research and experimentation only

Recommended use cases:

  • Parallel decoding research
  • Diffusion–AR hybrid modeling
  • Efficient LLM inference studies
  • Architecture prototyping

Not recommended for:

  • Production deployment
  • Safety-critical applications

References

This model is inspired by:

  1. Fast-dLLM v2: Efficient Block-Diffusion LLM (2025)
  2. WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention (2025)
