---
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
tags:
- text-diffusion
- discrete-diffusion
- pytorch
- mdlm
- seed-diffusion
- generative-ai
model_index:
- name: diffusionGPT
  results: []
custom_pipelines:
  text-diffusion:
    impl: pipeline.TextDiffusionPipeline
    pt:
    - AutoModelForMaskedLM
---
# diffusionGPT
[**GitHub Repository**](https://github.com/JorgeVanco/diffusionGPT) | [**Model License: MIT**](https://opensource.org/licenses/MIT)
DiffusionGPT is a **Masked Diffusion Language Model (MDLM)**, a type of discrete diffusion model, fine-tuned for conversational AI. Unlike autoregressive models (such as GPT-4 or Llama), which predict text one token at a time from left to right, DiffusionGPT generates text through iterative denoising: it starts from masked positions and progressively refines them into a final response.
This approach allows for parallel decoding, flexible text infilling, and "Seed Diffusion" editing capabilities.
## Key Features
* **Parallel Decoding:** Generates and refines tokens simultaneously across the sequence.
* **Seed Diffusion Editing:** Implements the editing logic described in Seed Diffusion ([arXiv:2508.02193](https://arxiv.org/pdf/2508.02193)), which lets the model revise tokens that are already visible instead of only filling in masks.
* **Semi-Autoregressive Generation:** Supports block-wise generation for long-form content, combining the strengths of diffusion with the length-scaling of autoregression.
* **Custom Pipeline:** Ships with `TextDiffusionPipeline`, which handles ancestral sampling and confidence-based unmasking automatically.
---
## Quickstart
The model relies on the custom pipeline defined in `pipeline.py`. With `trust_remote_code=True`, Hugging Face downloads this file automatically; otherwise, make sure a copy from the repository is in your local directory.
### 1. Basic Chat Completion
```python
from transformers import pipeline
pipe = pipeline(
    "text-diffusion",
    model="JorgeVanco/diffusionGPT",
    trust_remote_code=True,
)
messages = [{"role": "user", "content": "Explain diffusion models in simple terms."}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Generate using standard diffusion
result = pipe(prompt, num_steps=50)
print(result["decoded_texts"][0])
```
### 2. Streaming Intermediate Denoising
Watch the model "think" as it refines the text from masks to a final response.
```python
for partial_text in pipe.stream_generation(prompt, num_steps=32):
    # ANSI escape codes: move the cursor home and clear the screen for an animation effect
    print(f"\033[H\033[J{partial_text}")
```
### 3. Block-wise (Semi-Autoregressive) Generation
For longer responses that exceed the standard sequence length:
```python
response = pipe.stream_semi_autoregressive_generate(
    input_text=prompt,
    block_size=64,
    max_length=256,
    num_steps=32,
)
for step in response:
    print(step)
```
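The control flow behind block-wise generation can be illustrated with a toy sketch in plain Python. It does not call the real pipeline: `denoise_block` and `toy_denoiser` are hypothetical stand-ins for the diffusion sampler, and `MASK` is a placeholder symbol rather than the model's actual mask token.

```python
from typing import Callable, List

MASK = "<mask>"  # placeholder symbol, not the model's real mask token

def semi_autoregressive_generate(
    prompt: List[str],
    denoise_block: Callable[[List[str], int], List[str]],
    block_size: int,
    max_length: int,
) -> List[str]:
    """Grow the sequence one block at a time. Each new block starts fully
    masked and is denoised while conditioning on all earlier tokens."""
    tokens = list(prompt)
    while len(tokens) < max_length:
        size = min(block_size, max_length - len(tokens))
        context_len = len(tokens)
        # Append a masked block, then let the denoiser fill it in.
        tokens = denoise_block(tokens + [MASK] * size, context_len)
    return tokens

def toy_denoiser(tokens: List[str], context_len: int) -> List[str]:
    # Hypothetical stand-in: replaces every mask with a position-tagged token.
    return [f"tok{i}" if t == MASK else t for i, t in enumerate(tokens)]

out = semi_autoregressive_generate(["hello"], toy_denoiser, block_size=4, max_length=10)
print(len(out), out[:3])  # 10 ['hello', 'tok1', 'tok2']
```

In the real pipeline, each block would go through `num_steps` denoising iterations instead of a single fill-in call.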
## Technical Details
### Model Architecture
The backbone is a Transformer Encoder (`AutoModelForMaskedLM`) configured for discrete diffusion.
- **Training Objective:** Multi-step corruption and reconstruction (MDLM formulation).
- **Corruption Strategy:** Uses a `DiscreteDiffusionCollator` which applies random masking and optional "Insertion Corruption" using a `<|delete|>` token.
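As a rough illustration of the random-masking part of the corruption strategy, the sketch below samples a noise level and masks each token independently with that probability. It is not the `DiscreteDiffusionCollator` itself: the function name `corrupt`, the `mask_id` value, and the uniform noise schedule are assumptions, and insertion corruption is left out.

```python
import random
from typing import List, Tuple

def corrupt(
    token_ids: List[int], mask_id: int, rng: random.Random
) -> Tuple[List[int], List[bool], float]:
    """Sample a noise level t in (0, 1] and mask each token with probability t."""
    t = 1.0 - rng.random()  # random() is in [0, 1), so t is in (0, 1]
    is_masked = [rng.random() < t for _ in token_ids]
    noisy = [mask_id if m else tok for tok, m in zip(token_ids, is_masked)]
    return noisy, is_masked, t

rng = random.Random(0)
original = [11, 12, 13, 14, 15, 16]
noisy, is_masked, t = corrupt(original, mask_id=0, rng=rng)
print(f"t={t:.2f}", noisy)
```

During training, the model sees `noisy` and is asked to reconstruct the original tokens at the masked positions.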
### Sampling Parameters
When calling `pipe()`, you can tune generation with:
- `num_steps`: More steps generally yield higher quality at the cost of slower inference.
- `use_confidence`: When `True`, the model uses confidence-based unmasking (Top-K) instead of random unmasking.
- `allow_edits`: Enables Seed Diffusion logic to refine previously "visible" tokens (leave at `True` for better generation).
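To show what confidence-based (Top-K) unmasking means in practice, here is a minimal sketch of a single unmasking step in plain Python. The function `unmask_top_k` and its data layout are illustrative assumptions, not the pipeline's actual implementation: masked positions are represented as `None`, and each prediction is a `(token, confidence)` pair.

```python
from typing import List, Optional, Tuple

def unmask_top_k(
    tokens: List[Optional[str]],
    predictions: List[Tuple[str, float]],
    k: int,
) -> List[Optional[str]]:
    """Reveal the k masked positions (None) whose predictions are most confident."""
    masked = [i for i, tok in enumerate(tokens) if tok is None]
    # Sort masked positions by confidence, highest first, and keep the top k.
    chosen = sorted(masked, key=lambda i: predictions[i][1], reverse=True)[:k]
    result = list(tokens)
    for i in chosen:
        result[i] = predictions[i][0]
    return result

tokens = ["The", None, None, None]
predictions = [("The", 0.99), ("cat", 0.40), ("sat", 0.90), ("down", 0.70)]
step1 = unmask_top_k(tokens, predictions, k=2)
print(step1)  # ['The', None, 'sat', 'down']
```

Random unmasking would pick the `k` positions uniformly at random instead of sorting by confidence.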
## Training Setup
The model was trained using the `DiffusionTrainer` class provided in the [source repository](https://github.com/JorgeVanco/diffusionGPT).
### Hardware & Config:
- **Optimizer:** AdamW with linear schedule.
- **Loss:** Time-weighted Cross-Entropy (MDLM).
- **Curriculum:** Includes a `SeedDiffusionCurriculumCallback` that introduces corruption stages gradually to improve model robustness.
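The time-weighted loss can be sketched as follows. This assumes the linear noise schedule from the MDLM paper, where the probability of a token staying unmasked is `1 - t`, so the per-sample weight reduces to `1 / t`. Whether `DiffusionTrainer` uses exactly this schedule or normalization is not stated here, and the function name `mdlm_loss` is hypothetical.

```python
import math
from typing import List

def mdlm_loss(target_probs: List[float], is_masked: List[bool], t: float) -> float:
    """Cross-entropy summed over masked positions, scaled by 1 / t.

    target_probs[i] is the probability the model assigns to the true token
    at position i. Unmasked positions do not contribute to the loss.
    """
    nll = sum(-math.log(p) for p, m in zip(target_probs, is_masked) if m)
    return nll / t

loss = mdlm_loss([0.9, 0.5, 0.25], [False, True, True], t=0.5)
print(round(loss, 4))  # 4.1589
```

Samples with a small `t` (few masked tokens) receive a larger weight, which balances their contribution against heavily masked samples.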
### Example Training Command:
```bash
uv run train.py \
    --num_hidden_layers 12 \
    --hidden_size 768 \
    --num_diffusion_steps 32 \
    --max_seq_length 128 \
    --target_param_data_ratio 20
```
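For a sense of scale, the sketch below estimates the model size and token budget this command implies. It rests on two assumptions that the source does not confirm: that `--target_param_data_ratio` means training tokens per parameter, and that the standard rule of thumb of roughly `12 * d^2` non-embedding parameters per Transformer layer applies to this architecture.

```python
def approx_transformer_params(num_layers: int, hidden_size: int) -> int:
    """Rough non-embedding parameter count: about 12 * d^2 per layer
    (4 * d^2 for attention, 8 * d^2 for a 4x-wide feed-forward block)."""
    return 12 * num_layers * hidden_size ** 2

params = approx_transformer_params(num_layers=12, hidden_size=768)
token_budget = 20 * params  # assumed meaning of --target_param_data_ratio 20
print(f"{params:,} parameters -> {token_budget:,} training tokens")
# 84,934,656 parameters -> 1,698,693,120 training tokens
```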
## ⚠️ Limitations & Bias
- **Factual Accuracy:** Like all LLMs, this model can hallucinate. It is not optimized for factual retrieval.
- **Coherence:** The model works best for short-to-medium chat turns. Long-range coherence is still being improved through the semi-autoregressive block method.
- **Special Tokens:** The model relies on specific tokens like `<|im_start|>` and `<|im_end|>` for chat structure.
## Citation & Acknowledgments
This implementation is inspired by recent research in discrete diffusion for language:
- **MDLM:** [Simple and Effective Masked Diffusion Language Models](https://s-sahoo.com/mdlm/)
- **Seed Diffusion:** [Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference](https://seed.bytedance.com/en/seed_diffusion)
## License
This model and its associated code are released under the **MIT License**.