     impl: pipeline.TextDiffusionPipeline
     pt:
     - AutoModelForMaskedLM
+---
+# diffusionGPT
+[**GitHub Repository**](https://github.com/JorgeVanco/diffusionGPT) | [**Model License: MIT**](https://opensource.org/licenses/MIT)
+DiffusionGPT is a **Discrete Diffusion Language Model**, based on the masked diffusion (MDLM) formulation and fine-tuned for conversational AI. Unlike traditional autoregressive models (such as GPT-4 or Llama), which predict text one token at a time from left to right, DiffusionGPT generates text through an iterative denoising process.
+This approach allows for parallel decoding, flexible text infilling, and "Seed Diffusion" editing capabilities.
+## Key Features
+* **Parallel Decoding:** Generates and refines tokens simultaneously across the sequence.
+* **Seed Diffusion Editing:** Implements advanced editing logic (per [arXiv:2508.02193](https://arxiv.org/pdf/2508.02193)) to refine existing text while maintaining context.
+* **Semi-Autoregressive Generation:** Supports block-wise generation for long-form content, combining the strengths of diffusion with the length-scaling of autoregression.
+* **Custom Pipeline:** Built-in support for `TextDiffusionPipeline`, which handles ancestral sampling and confidence-based unmasking automatically.
+---
+## Quickstart
+To use this model, you need the `pipeline.py` file from the repository. Hugging Face downloads it automatically when you pass `trust_remote_code=True`; otherwise, keep a copy in your local directory.
+### 1. Basic Chat Completion
+```python
+from transformers import pipeline
+pipe = pipeline(
+  "text-diffusion",
+  model="JorgeVanco/diffusionGPT",
+  trust_remote_code=True
+)
+messages = [{"role": "user", "content": "Explain diffusion models in simple terms."}]
+prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+# Generate using standard diffusion
+result = pipe(prompt, num_steps=50)
+print(result["decoded_texts"][0])
+```
+### 2. Streaming Intermediate Denoising
+Watch the model "think" as it refines the text from masks to a final response.
+```python
+for partial_text in pipe.stream_generation(prompt, num_steps=32):
+  print(f"\033[H\033[J{partial_text}") # Clears terminal for animation effect
+```
+### 3. Block-wise (Semi-Autoregressive) Generation
+For longer responses that exceed the standard sequence length:
+```python
+response = pipe.stream_semi_autoregressive_generate(
+  input_text=prompt,
+  block_size=64,
+  max_length=256,
+  num_steps=32
+)
+for step in response:
+  print(step)
+```
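The control flow behind block-wise generation can be sketched independently of the model. The example below is a toy illustration, not the pipeline's implementation: `generate_blockwise` and `dummy_denoiser` are hypothetical names, and it assumes that `max_length` counts the prompt tokens and that each block is fully denoised before the next one starts.

```python
def generate_blockwise(prompt_ids, denoise_block, block_size, max_length):
    """Append denoised blocks to the prompt until max_length tokens exist.

    denoise_block(context, n) stands in for a full diffusion run that
    returns n new token ids conditioned on everything generated so far.
    """
    sequence = list(prompt_ids)
    while len(sequence) < max_length:
        # The last block may be shorter than block_size.
        n = min(block_size, max_length - len(sequence))
        sequence.extend(denoise_block(sequence, n))
    return sequence


def dummy_denoiser(context, n):
    # Stand-in for the model: emits consecutive ids after the last context token.
    start = context[-1] + 1
    return list(range(start, start + n))


out = generate_blockwise([0, 1, 2], dummy_denoiser, block_size=4, max_length=12)
print(out)
```

Each block sees all previously generated tokens as fixed context, which is what lets the total length grow beyond what a single diffusion pass covers.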
+## Technical Details
+### Model Architecture
+The backbone is a Transformer Encoder (`AutoModelForMaskedLM`) configured for discrete diffusion.
+- **Training Objective:** Multi-step corruption and reconstruction (MDLM formulation).
+- **Corruption Strategy:** Uses a `DiscreteDiffusionCollator` which applies random masking and optional "Insertion Corruption" using a `<|delete|>` token.
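The random-masking half of the corruption strategy can be sketched as follows. This is a minimal illustration, not the repository's `DiscreteDiffusionCollator`: the function name `mask_tokens`, the `MASK_ID` value, and masking each token independently with probability `t` are assumptions made for the example, and insertion corruption with `<|delete|>` is not shown.

```python
import random

MASK_ID = 0  # hypothetical mask token id, chosen only for this example


def mask_tokens(token_ids, t, rng):
    """Replace each token with MASK_ID independently with probability t.

    Returns the corrupted sequence and a 0/1 list marking masked positions,
    which a training loop could use to restrict the loss to masked tokens.
    """
    corrupted, is_masked = [], []
    for tok in token_ids:
        masked = rng.random() < t
        corrupted.append(MASK_ID if masked else tok)
        is_masked.append(int(masked))
    return corrupted, is_masked


rng = random.Random(0)
tokens = [11, 12, 13, 14, 15, 16, 17, 18]
t = rng.uniform(0.0, 1.0)  # one noise level sampled per sequence
corrupted, is_masked = mask_tokens(tokens, t, rng)
print(f"t={t:.2f}", corrupted, is_masked)
```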
+### Sampling Parameters
+When calling `pipe()`, you can tune generation with:
+- `num_steps`: More steps generally improve quality but slow down inference.
+- `use_confidence`: When `True`, the model uses confidence-based unmasking (Top-K) instead of random unmasking.
+- `allow_edits`: Enables Seed Diffusion logic to refine previously "visible" tokens (leave at `True` for better generation).
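To make the idea of confidence-based unmasking concrete, here is a toy sketch of a single unmasking step. It is not the pipeline's code: `unmask_top_k`, the `MASK` placeholder, and the hand-written probability table are all hypothetical, and it assumes that "confidence" means the probability of the most likely token at each masked position.

```python
MASK = None  # placeholder for a masked position in this toy example


def unmask_top_k(sequence, probs, k):
    """Fill the k masked positions whose best candidate token is most probable.

    sequence: list of token ids, with MASK at positions still hidden.
    probs:    per-position list of token probabilities (index = token id).
    """
    candidates = []
    for pos, tok in enumerate(sequence):
        if tok is MASK:
            best = max(range(len(probs[pos])), key=lambda i: probs[pos][i])
            candidates.append((probs[pos][best], pos, best))
    # Highest confidence first; keep only the top k.
    candidates.sort(key=lambda c: c[0], reverse=True)
    updated = list(sequence)
    for _, pos, best in candidates[:k]:
        updated[pos] = best
    return updated


seq = [5, MASK, MASK, MASK]
probs = [
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # already visible, ignored
    [0.1, 0.7, 0.1, 0.1, 0.0, 0.0],  # confident: token 1
    [0.3, 0.3, 0.2, 0.2, 0.0, 0.0],  # uncertain, stays masked when k=2
    [0.0, 0.0, 0.9, 0.1, 0.0, 0.0],  # most confident: token 2
]
print(unmask_top_k(seq, probs, k=2))  # [5, 1, None, 2]
```

Random unmasking would instead pick the `k` positions uniformly among the masked ones, regardless of how certain the model is about them.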
+## Training Setup
+The model was trained using the `DiffusionTrainer` class provided in the [source repository](https://github.com/JorgeVanco/diffusionGPT).
+### Hardware & Config:
+- **Optimizer:** AdamW with linear schedule.
+- **Loss:** Time-weighted Cross-Entropy (MDLM).
+- **Curriculum:** Includes a `SeedDiffusionCurriculumCallback` that introduces corruption stages gradually to improve model robustness.
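As a rough illustration of a time-weighted cross-entropy, the sketch below follows the MDLM objective under a linear noise schedule, where the per-sequence loss over masked positions is scaled by `1/t`. The function name and inputs are hypothetical, and the weighting actually used by `DiffusionTrainer` may differ.

```python
import math


def time_weighted_ce(target_probs, is_masked, t):
    """Cross-entropy over masked positions, scaled by 1/t.

    target_probs: model probability assigned to the true token at each position.
    is_masked:    1 where the input token was masked, 0 where it was visible.
    t:            noise level in (0, 1]; under a linear schedule the masking
                  probability is t and the loss weight is 1/t.
    """
    ce = sum(-math.log(p) for p, m in zip(target_probs, is_masked) if m)
    return ce / t


loss = time_weighted_ce([0.9, 0.5, 0.25, 0.8], [0, 1, 1, 0], t=0.5)
print(round(loss, 4))
```

Visible positions contribute nothing to the loss, and lightly corrupted sequences (small `t`) receive a larger weight per masked token.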
+### Example Training Command:
+```bash
+uv run train.py \
+  --num_hidden_layers 12 \
+  --hidden_size 768 \
+  --num_diffusion_steps 32 \
+  --max_seq_length 128 \
+  --target_param_data_ratio 20
+```
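For a sense of scale, the configuration above can be turned into a back-of-the-envelope size estimate. The sketch uses the common `12 * layers * hidden_size**2` approximation for non-embedding Transformer parameters, and it assumes that `--target_param_data_ratio` means training tokens per parameter; that reading of the flag is an assumption, not something the model card states.

```python
def approx_transformer_params(num_layers, hidden_size):
    # Per layer: 4*d^2 for attention (Q, K, V, output) + 8*d^2 for a 4x-wide MLP.
    # Biases, layer norms, and embeddings are ignored.
    return 12 * num_layers * hidden_size ** 2


params = approx_transformer_params(num_layers=12, hidden_size=768)
tokens = 20 * params  # assumed meaning of --target_param_data_ratio 20

print(f"{params:,} non-embedding parameters")
print(f"{tokens:,} training tokens at a ratio of 20")
```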
+## ⚠️ Limitations & Bias
+- **Factual Accuracy:** Like all LLMs, this model can hallucinate. It is not optimized for factual retrieval.
+- **Coherence:** The model performs best on short-to-medium chat. Very long-range coherence, handled through the semi-autoregressive block method, is still under development.
+- **Special Tokens:** The model relies on specific tokens like `<|im_start|>` and `<|im_end|>` for chat structure.
+## Citation & Acknowledgments
+This implementation is inspired by recent research in discrete diffusion for language:
+- **MDLM:** [Simple and Effective Masked Diffusion Language Models](https://s-sahoo.com/mdlm/)
+- **Seed Diffusion:** [Seed Diffusion: Continuous Training of Discrete Diffusion Language Models](https://seed.bytedance.com/en/seed_diffusion)
+## License
+This model and its associated code are released under the **MIT License**.