# CBD-LLM: Causal Block Diffusion Language Model (PoC)
CBD-LLM (Causal Block Diffusion) is an experimental hybrid Diffusion–Autoregressive language model that enables block-parallel text generation while retaining standard causal attention, KV caching, and compatibility with pretrained AR weights.
This repository hosts a Proof of Concept (PoC) checkpoint demonstrating the feasibility of parallel decoding with causal attention, trained efficiently on consumer hardware using LoRA.
## 🔍 Model Overview
| Attribute | Description |
|---|---|
| Model Type | Causal Block Diffusion LLM |
| Base Model | Qwen2.5 |
| Parameters | ~1B (base), LoRA fine-tuned |
| Attention | Standard causal attention |
| Decoding | Block-parallel diffusion |
| Training Stage | Proof of Concept (Research Preview) |
| License | MIT |
## Key Idea
CBD-LLM bridges the gap between:
- Autoregressive LLMs (low data cost, KV-cache friendly, but serial decoding)
- Diffusion LLMs (parallel decoding, but high training cost and no KV cache)
By combining topological token reordering with block-wise diffusion, CBD-LLM achieves:
- Parallel generation
- Low VRAM usage
- Compatibility with FlashAttention and KV caching
- Efficient fine-tuning from pretrained AR models
## Architecture Summary
### 1. Topological Reordering (Causal-Friendly Diffusion)
Diffusion models require masked tokens to attend to future context, normally forcing bidirectional attention.
CBD-LLM avoids this by:
- Physically moving observed tokens to the front
- Moving masked tokens to the back
- Preserving original positional IDs (RoPE)
This allows masked tokens to attend to observed tokens using a standard causal mask.
```
Logical:  [The] [quick] [brown] [fox]
Masked:   [The] [MASK]  [MASK]  [fox]

Physical: [The] [fox]  [MASK] [MASK]
Pos IDs:    0     3      1      2
```
Result: causal attention + KV cache remain intact.
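The reordering step can be sketched in a few lines of plain Python. This is an illustrative helper (`topological_reorder` is a hypothetical name, not the repository's actual API): observed tokens move to the front, masked tokens to the back, and every token keeps its original position ID so RoPE still sees the logical order.

```python
MASK = "[MASK]"

def topological_reorder(tokens):
    """Return (physical_tokens, position_ids) with observed tokens first.

    Masked tokens are pushed to the back of the physical sequence, so a
    standard causal mask lets them attend to all observed tokens, while
    the preserved position IDs keep RoPE consistent with the logical order.
    """
    observed = [(i, t) for i, t in enumerate(tokens) if t != MASK]
    masked = [(i, t) for i, t in enumerate(tokens) if t == MASK]
    ordered = observed + masked
    physical = [t for _, t in ordered]
    pos_ids = [i for i, _ in ordered]
    return physical, pos_ids

physical, pos_ids = topological_reorder(["The", MASK, MASK, "fox"])
print(physical)  # ['The', 'fox', '[MASK]', '[MASK]']
print(pos_ids)   # [0, 3, 1, 2]
```

Because the observed prefix is contiguous in physical order, its KV cache entries can be computed once and reused across denoising steps.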
### 2. Block-Wise Variable Noise Diffusion
Instead of diffusing entire sequences:
- Text is generated in fixed-size blocks (e.g., 64 tokens)
- Each block undergoes multiple denoising steps
- The full block is refined in parallel
The model learns both:
- Drafting from noise
- Refinement from partial context
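The block-wise decoding loop above can be sketched as follows. This is a toy, model-free illustration, assuming a confidence-based commit schedule (common in diffusion LM decoding); `mock_denoiser`, the block size, and the step count are all stand-ins, not the repository's actual implementation.

```python
import random

BLOCK_SIZE = 8   # toy block size (the text uses e.g. 64)
NUM_STEPS = 4    # denoising steps per block
MASK = None      # None marks a still-masked slot

def mock_denoiser(block, context):
    # Stand-in for the model: propose a token and a confidence score for
    # every masked slot. A real model would score all slots in parallel
    # from logits over the vocabulary.
    return {
        i: (f"tok{len(context) + i}", random.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }

def generate_block(context):
    """Draft a block from noise, then refine it over NUM_STEPS passes."""
    block = [MASK] * BLOCK_SIZE
    per_step = BLOCK_SIZE // NUM_STEPS  # slots committed per denoising step
    for _ in range(NUM_STEPS):
        proposals = mock_denoiser(block, context)
        # Commit the highest-confidence proposals; leave the rest masked
        # so the next pass can refine them with more observed context.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            block[i] = proposals[i][0]
    return block

block = generate_block(context=["The", "quick"])
assert MASK not in block  # all slots are filled after NUM_STEPS refinements
```

Each pass fills in only the most confident slots, so later passes see progressively more observed context inside the block, which is what lets the model learn both drafting from noise and refinement from partial context.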
## Intended Use
This checkpoint is a research preview, intended for research and experimentation only.
Recommended use cases:
- Parallel decoding research
- Diffusion–AR hybrid modeling
- Efficient LLM inference studies
- Architecture prototyping
Not recommended for:
- Production deployment
- Safety-critical applications
## References
This model is inspired by:
- Fast-dLLM v2: Efficient Block-Diffusion LLM (2025)
- WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention (2025)