---
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
tags:
- text-diffusion
- discrete-diffusion
- pytorch
- mdlm
- seed-diffusion
- generative-ai
model-index:
- name: diffusionGPT
  results: []
custom_pipelines:
  text-diffusion:
    impl: pipeline.TextDiffusionPipeline
    pt:
    - AutoModelForMaskedLM
---

# diffusionGPT

[**GitHub Repository**](https://github.com/JorgeVanco/diffusionGPT) | [**Model License: MIT**](https://opensource.org/licenses/MIT)

DiffusionGPT is a **Discrete Diffusion Language Model** based on the MDLM (Masked Diffusion Language Model) formulation and fine-tuned for conversational AI. Unlike autoregressive models (such as GPT-4 or Llama), which predict text one token at a time from left to right, DiffusionGPT generates text through an iterative denoising process that starts from masked tokens and refines them over several steps.

This approach allows for parallel decoding, flexible text infilling, and "Seed Diffusion" editing capabilities.

## Key Features

* **Parallel Decoding:** Generates and refines tokens simultaneously across the sequence.
* **Seed Diffusion Editing:** Implements advanced editing logic (per [arXiv:2508.02193](https://arxiv.org/pdf/2508.02193)) to refine existing text while maintaining context.
* **Semi-Autoregressive Generation:** Supports block-wise generation for long-form content, combining the strengths of diffusion with the length-scaling of autoregression.
* **Custom Pipeline:** Ships with a `TextDiffusionPipeline` that handles ancestral sampling and confidence-based unmasking automatically.

---

## Quickstart

To use this model, you need the `pipeline.py` file from the repository. Hugging Face downloads it automatically when you pass `trust_remote_code=True`; otherwise, keep a copy in your local directory.

### 1. Basic Chat Completion
```python
from transformers import pipeline

pipe = pipeline(
  "text-diffusion",
  model="JorgeVanco/diffusionGPT",
  trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain diffusion models in simple terms."}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate using standard diffusion
result = pipe(prompt, num_steps=50)
print(result["decoded_texts"][0])
```

### 2. Streaming Intermediate Denoising
Watch the model "think" as it refines the text from masks to a final response.
```python
for partial_text in pipe.stream_generation(prompt, num_steps=32):
  print(f"\033[H\033[J{partial_text}") # Clears terminal for animation effect
```

### 3. Block-wise (Semi-Autoregressive) Generation
For longer responses that exceed the standard sequence length:
```python
response = pipe.stream_semi_autoregressive_generate(
  input_text=prompt,
  block_size=64,
  max_length=256,
  num_steps=32
)

for step in response:
  print(step)
```
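
The control flow behind block-wise generation can be sketched without the model itself. The snippet below is a toy under stated assumptions, not the pipeline's implementation: `semi_autoregressive` and `toy_denoiser` are hypothetical names, `max_length` is assumed to count newly generated tokens, and early stopping on an end-of-sequence token is not modeled.

```python
def semi_autoregressive(prompt_tokens, denoise_block, block_size, max_length):
    """Append blocks of up to block_size tokens until max_length new tokens exist."""
    sequence = list(prompt_tokens)
    generated = 0
    while generated < max_length:
        size = min(block_size, max_length - generated)
        # In a real model, the diffusion steps would run here, denoising
        # `size` masked tokens conditioned on everything generated so far.
        block = denoise_block(sequence, size)
        sequence.extend(block)
        generated += size
    return sequence


def toy_denoiser(context, size):
    """Hypothetical stand-in for the diffusion model: emits placeholder tokens."""
    return [f"tok{len(context) + i}" for i in range(size)]


out = semi_autoregressive(["<prompt>"], toy_denoiser, block_size=64, max_length=256)
print(len(out))  # 257: one prompt token plus four blocks of 64
```

Each block only sees the prompt and earlier blocks, which is what lets the total length grow beyond what a single diffusion pass covers.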

## Technical Details

### Model Architecture
The backbone is a Transformer Encoder (`AutoModelForMaskedLM`) configured for discrete diffusion.
- **Training Objective:** Multi-step corruption and reconstruction (MDLM formulation).
- **Corruption Strategy:** Uses a `DiscreteDiffusionCollator` which applies random masking and optional "Insertion Corruption" using a `<|delete|>` token.
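
As a rough illustration of the random-masking half of this corruption, the sketch below replaces each token with a mask independently with probability `t`. It is a framework-free toy, not the repository's `DiscreteDiffusionCollator`: the `corrupt` helper and the `<mask>` string are hypothetical, and the optional insertion corruption is not modeled.

```python
import random

MASK = "<mask>"  # hypothetical mask token string, for illustration only


def corrupt(tokens, t, rng):
    """Replace each token with MASK independently with probability t (0 <= t <= 1)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
tokens = "diffusion models denoise text step by step".split()
for t in (0.0, 0.5, 1.0):
    # t = 0.0 leaves the text intact; t = 1.0 masks every token.
    print(t, corrupt(tokens, t, rng))
```

During training, the model sees the corrupted sequence and is asked to reconstruct the original tokens at the masked positions.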

### Sampling Parameters
When calling `pipe()`, you can tune generation with the following parameters:
- `num_steps`: More steps generally yield higher quality at the cost of slower inference.
- `use_confidence`: When `True`, the model uses confidence-based unmasking (Top-K) instead of random unmasking.
- `allow_edits`: Enables Seed Diffusion logic to refine previously "visible" tokens (leave at `True` for better generation).
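
The idea behind `use_confidence` can be sketched in plain Python: at each step, only the `k` masked positions where the model is most confident are revealed, and the rest stay masked for later steps. The helper name, the `<mask>` string, and the toy predictions are assumptions for illustration; the pipeline's actual selection rule may differ.

```python
def unmask_top_k(tokens, predictions, confidences, k, mask="<mask>"):
    """Reveal the k masked positions with the highest confidence."""
    masked = [i for i, tok in enumerate(tokens) if tok == mask]
    # Rank masked positions by confidence, most confident first.
    chosen = sorted(masked, key=lambda i: confidences[i], reverse=True)[:k]
    out = list(tokens)
    for i in chosen:
        out[i] = predictions[i]
    return out


tokens = ["The", "<mask>", "<mask>", "<mask>"]
predictions = ["The", "cat", "sat", "down"]   # toy model guesses per position
confidences = [1.0, 0.9, 0.4, 0.7]            # toy confidence per position

print(unmask_top_k(tokens, predictions, confidences, k=2))
# ['The', 'cat', '<mask>', 'down']
```

Random unmasking would instead pick the `k` positions uniformly from the masked set, ignoring the confidences.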

## Training Setup
The model was trained using the `DiffusionTrainer` class provided in the [source repository](https://github.com/JorgeVanco/diffusionGPT).

### Hardware & Config:
- **Optimizer:** AdamW with a linear learning-rate schedule.
- **Loss:** Time-weighted Cross-Entropy (MDLM).
- **Curriculum:** Includes a `SeedDiffusionCurriculumCallback` that introduces corruption stages gradually to improve model robustness.
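
To make the time-weighted loss concrete, here is a minimal sketch assuming a linear noise schedule, under which the MDLM objective reduces to cross-entropy on the masked positions scaled by `1/t`. The function name and inputs are hypothetical, and the exact weighting and normalization used by `DiffusionTrainer` may differ.

```python
import math


def time_weighted_ce(log_probs_true, is_masked, t):
    """Sum negative log-likelihood over masked positions, weighted by 1/t.

    log_probs_true[i] is the model's log-probability of the correct token
    at position i; is_masked[i] says whether position i was masked;
    t in (0, 1] is the corruption level sampled for this sequence.
    """
    nll = sum(-lp for lp, masked in zip(log_probs_true, is_masked) if masked)
    return nll / t


# Two positions; only the first was masked, and the model gave the
# correct token probability 0.5 there.
loss = time_weighted_ce([math.log(0.5), math.log(0.25)], [True, False], t=0.5)
print(loss)
```

The `1/t` factor gives more weight to lightly corrupted sequences, where few tokens are masked and each contributes more to the objective.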

### Example Training Command:
```bash
uv run train.py \
  --num_hidden_layers 12 \
  --hidden_size 768 \
  --num_diffusion_steps 32 \
  --max_seq_length 128 \
  --target_param_data_ratio 20
```

## ⚠️ Limitations & Bias
- **Factual Accuracy:** Like all LLMs, this model can hallucinate. It is not optimized for factual retrieval.
- **Coherence:** The model is aimed at short-to-medium chat responses. Long-range coherence is still under development and currently depends on the semi-autoregressive block method.
- **Special Tokens:** The model relies on specific tokens like `<|im_start|>` and `<|im_end|>` for chat structure.

## Citation & Acknowledgments
This implementation is inspired by recent research in discrete diffusion for language:
- **MDLM:** [Simple and Effective Masked Diffusion Language Models](https://s-sahoo.com/mdlm/)
- **Seed Diffusion:** [Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference](https://seed.bytedance.com/en/seed_diffusion)

## License
This model and its associated code are released under the **MIT License**.