---
language:
- en
license: mit
pipeline_tag: text-generation
library_name: transformers
tags:
- text-diffusion
- discrete-diffusion
- pytorch
- mdlm
- seed-diffusion
- generative-ai
model_index:
- name: diffusionGPT
  results: []
custom_pipelines:
  text-diffusion:
    impl: pipeline.TextDiffusionPipeline
    pt:
    - AutoModelForMaskedLM
---

# diffusionGPT

[**GitHub Repository**](https://github.com/JorgeVanco/diffusionGPT) | [**Model License: MIT**](https://opensource.org/licenses/MIT)

DiffusionGPT is a **discrete diffusion language model** based on the MDLM (Masked Diffusion Language Model) formulation and fine-tuned for conversational AI. Unlike traditional autoregressive models (such as GPT-4 or Llama), which predict text one token at a time from left to right, DiffusionGPT generates text through an iterative denoising process. This approach allows for parallel decoding, flexible text infilling, and "Seed Diffusion" editing capabilities.

## Key Features

* **Parallel Decoding:** Generates and refines tokens simultaneously across the sequence.
* **Seed Diffusion Editing:** Implements editing logic (per [arXiv:2508.02193](https://arxiv.org/pdf/2508.02193)) to refine existing text while maintaining context.
* **Semi-Autoregressive Generation:** Supports block-wise generation for long-form content, combining the strengths of diffusion with the length-scaling of autoregression.
* **Custom Pipeline:** Built-in support for `TextDiffusionPipeline`, which handles ancestral sampling and confidence-based unmasking automatically.

---

## Quickstart

To use this model, you need the `pipeline.py` file from the repository. Hugging Face downloads it automatically when `trust_remote_code=True`; otherwise, place it in your local directory.

### 1. Basic Chat Completion

```python
from transformers import pipeline

pipe = pipeline(
    "text-diffusion",
    model="JorgeVanco/diffusionGPT",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain diffusion models in simple terms."}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate using standard diffusion
result = pipe(prompt, num_steps=50)
print(result["decoded_texts"][0])
```

### 2. Streaming Intermediate Denoising

Watch the model "think" as it refines the text from masks to a final response.

```python
for partial_text in pipe.stream_generation(prompt, num_steps=32):
    print(f"\033[H\033[J{partial_text}")  # Clears terminal for animation effect
```

### 3. Block-wise (Semi-Autoregressive) Generation

For longer responses that exceed the standard sequence length:

```python
response = pipe.stream_semi_autoregressive_generate(
    input_text=prompt,
    block_size=64,
    max_length=256,
    num_steps=32
)

for step in response:
    print(step)
```

## Technical Details

### Model Architecture

The backbone is a Transformer encoder (`AutoModelForMaskedLM`) configured for discrete diffusion.

- **Training Objective:** Multi-step corruption and reconstruction (MDLM formulation).
- **Corruption Strategy:** Uses a `DiscreteDiffusionCollator`, which applies random masking and optional "Insertion Corruption" using a `<|delete|>` token.

### Sampling Parameters

When calling `pipe()`, you can tune generation with:

- `num_steps`: More steps generally lead to higher quality but slower inference.
- `use_confidence`: When `True`, the model uses confidence-based unmasking (Top-K) instead of random unmasking.
- `allow_edits`: Enables Seed Diffusion logic to refine previously "visible" tokens (leave at `True` for better generation).

## Training Setup

The model was trained using the `DiffusionTrainer` class provided in the [source repository](https://github.com/JorgeVanco/diffusionGPT).

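
The corruption-and-reconstruction objective can be illustrated with a toy example. The following is a minimal sketch, not the repository's `DiffusionTrainer` or `DiscreteDiffusionCollator`: it assumes a linear schedule in which each token is masked independently with probability `t`, and it weights the masked-token cross-entropy by `1/t`, as the MDLM formulation does under that schedule. The `<mask>` string, the `corrupt` and `time_weighted_nll` helpers, and the constant stand-in probabilities are all hypothetical.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token; the real tokenizer's mask token may differ


def corrupt(tokens, t, rng):
    """Independently replace each token with MASK with probability t."""
    noisy, is_masked = [], []
    for tok in tokens:
        masked = rng.random() < t
        noisy.append(MASK if masked else tok)
        is_masked.append(masked)
    return noisy, is_masked


def time_weighted_nll(token_probs, is_masked, t):
    """Sum -log p(true token) over masked positions, scaled by 1/t."""
    nll = sum(-math.log(p) for p, m in zip(token_probs, is_masked) if m)
    return nll / t


rng = random.Random(0)
tokens = "diffusion models denoise text step by step".split()

# Sample a noise level, kept away from 0 so that 1/t stays finite.
t = rng.uniform(0.05, 1.0)
noisy, is_masked = corrupt(tokens, t, rng)

# Stand-in for model output: probability assigned to the true token at each position.
token_probs = [0.5] * len(tokens)
loss = time_weighted_nll(token_probs, is_masked, t)
print(noisy, round(loss, 3))
```

Dividing by `t` gives each masked token more weight in lightly corrupted samples, which contain few masked positions to learn from.
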
### Hardware & Config:

- **Optimizer:** AdamW with a linear schedule.
- **Loss:** Time-weighted Cross-Entropy (MDLM).
- **Curriculum:** Includes a `SeedDiffusionCurriculumCallback` that introduces corruption stages gradually to improve model robustness.

### Example Training Command:

```bash
uv run train.py \
    --num_hidden_layers 12 \
    --hidden_size 768 \
    --num_diffusion_steps 32 \
    --max_seq_length 128 \
    --target_param_data_ratio 20
```

## ⚠️ Limitations & Bias

- **Factual Accuracy:** Like all LLMs, this model can hallucinate. It is not optimized for factual retrieval.
- **Coherence:** The model is tuned for short-to-medium chat. Very long-range coherence is still under development through the semi-autoregressive block method.
- **Special Tokens:** The model relies on specific tokens such as `<|im_start|>` and `<|im_end|>` for chat structure.

## Citation & Acknowledgments

This implementation is inspired by recent research in discrete diffusion for language:

- **MDLM:** [Simple and Effective Masked Diffusion Language Models](https://s-sahoo.com/mdlm/)
- **Seed Diffusion:** [Seed Diffusion: Continuous Training of Discrete Diffusion Language Models](https://seed.bytedance.com/en/seed_diffusion)

## License

This model and its associated code are released under the **MIT License**.