CBD-LLM: Causal Block Diffusion Language Model (PoC)

CBD-LLM (Causal Block Diffusion) is an experimental hybrid Diffusion–Autoregressive language model that enables block-parallel text generation while retaining standard causal attention, KV caching, and compatibility with pretrained AR weights.

This repository hosts a Proof of Concept (PoC) checkpoint demonstrating the feasibility of parallel decoding with causal attention, trained efficiently on consumer hardware using LoRA.


🔍 Model Overview

| Attribute | Description |
|---|---|
| Model Type | Causal Block Diffusion LLM |
| Base Model | Qwen2.5 |
| Parameters | ~1B (base), LoRA fine-tuned |
| Attention | Standard causal attention |
| Decoding | Block-parallel diffusion |
| Training Stage | Proof of Concept (Research Preview) |
| License | MIT |

Key Idea

CBD-LLM bridges the gap between:

  • Autoregressive LLMs (low data cost, KV-cache friendly, but serial decoding)
  • Diffusion LLMs (parallel decoding, but high training cost and no KV cache)

By combining topological token reordering with block-wise diffusion, CBD-LLM achieves:

  • Parallel generation
  • Low VRAM usage
  • Compatibility with FlashAttention and KV caching
  • Efficient fine-tuning from pretrained AR models

Architecture Summary

1. Topological Reordering (Causal-Friendly Diffusion)

Diffusion models require masked tokens to attend to future context, normally forcing bidirectional attention.

CBD-LLM avoids this by:

  • Physically moving observed tokens to the front
  • Moving masked tokens to the back
  • Preserving original positional IDs (RoPE)

This allows masked tokens to attend to observed tokens using a standard causal mask.

Logical:  [The] [quick] [brown] [fox]
Masked:   [The] [MASK]  [MASK]  [fox]

Physical: [The] [fox] [MASK] [MASK]
Pos IDs:   0     3     1      2

Result: causal attention + KV cache remain intact.
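The reordering above can be sketched in a few lines of Python. This is an illustrative toy (function and variable names are hypothetical, not from the actual codebase): observed tokens move to the front of the block, masked tokens to the back, and every token keeps its original position ID so RoPE embeddings are unaffected.

```python
def reorder_block(tokens, mask_flags):
    """Topological reordering sketch: observed tokens first, masked tokens last.

    tokens:     list of tokens in logical order
    mask_flags: mask_flags[i] is True if position i is masked
    Returns the physically reordered tokens plus their ORIGINAL position IDs.
    """
    observed = [(tok, pos) for pos, (tok, m) in enumerate(zip(tokens, mask_flags)) if not m]
    masked = [(tok, pos) for pos, (tok, m) in enumerate(zip(tokens, mask_flags)) if m]
    physical = observed + masked  # observed prefix, masked suffix
    new_tokens = [tok for tok, _ in physical]
    position_ids = [pos for _, pos in physical]  # original RoPE positions preserved
    return new_tokens, position_ids

phys, pos_ids = reorder_block(
    ["The", "[MASK]", "[MASK]", "fox"],
    [False, True, True, False],
)
# phys    -> ["The", "fox", "[MASK]", "[MASK]"]
# pos_ids -> [0, 3, 1, 2]
```

Because every masked token now sits physically *after* all observed tokens, a plain lower-triangular causal mask lets it attend to the full observed context.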


2. Block-Wise Variable Noise Diffusion

Instead of diffusing entire sequences:

  • Text is generated in fixed-size blocks (e.g., 64 tokens)
  • Each block undergoes multiple denoising steps
  • The full block is refined in parallel

The model learns both:

  • Drafting from noise
  • Refinement from partial context
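A minimal sketch of the block-wise denoising loop, assuming a confidence-style schedule that unmasks a fraction of positions per step (all names are hypothetical, and the transformer predictor is replaced by a stub):

```python
MASK = -1  # sentinel for a masked token


def denoise_block(block, predict, num_steps=4):
    """Fill masked positions over num_steps refinement passes.

    In the real model each pass predicts all masked tokens in parallel and
    commits the most confident ones; here we commit the earliest masked
    positions for determinism.
    """
    block = list(block)
    for step in range(num_steps):
        masked = [i for i, t in enumerate(block) if t == MASK]
        if not masked:
            break
        # Unmask an even share of the remaining positions each step.
        k = max(1, len(masked) // (num_steps - step))
        for i in masked[:k]:
            block[i] = predict(block, i)
    return block


# Stub predictor: "predicts" the position index as the token.
result = denoise_block([MASK] * 8, lambda blk, i: i)
# result -> [0, 1, 2, 3, 4, 5, 6, 7]; every MASK has been replaced
```

With fixed-size blocks, the same loop runs once per block while previous blocks stay frozen in the KV cache, which is what makes the scheme cache-friendly.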

Intended Use: research and experimentation only

Recommended use cases:

  • Parallel decoding research
  • Diffusion–AR hybrid modeling
  • Efficient LLM inference studies
  • Architecture prototyping

Not recommended for:

  • Production deployment
  • Safety-critical applications

References

This model is inspired by:

  1. Fast-dLLM v2: Efficient Block-Diffusion LLM (2025)
  2. WeDLM: Reconciling Diffusion Language Models with Standard Causal Attention (2025)
