Commit 74076a7 (verified) by JorgeVanco · Parent: 6989842

Update README.md

Files changed (1): README.md (+104 −1)
  impl: pipeline.TextDiffusionPipeline
  pt:
  - AutoModelForMaskedLM
---

# diffusionGPT

[**GitHub Repository**](https://github.com/JorgeVanco/diffusionGPT) | [**Model License: MIT**](https://opensource.org/licenses/MIT)

DiffusionGPT is a discrete **Masked Diffusion Language Model (MDLM)** fine-tuned for conversational AI. Unlike traditional autoregressive models (such as GPT-4 or Llama) that predict text one token at a time from left to right, DiffusionGPT generates text through an iterative denoising process.

This approach allows for parallel decoding, flexible text infilling, and "Seed Diffusion" editing capabilities.
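
To build intuition for the denoising process, here is a minimal, model-free sketch: the sequence starts fully masked, and each step commits a few more positions until none remain. The scoring here is random and stands in for real model logits; all names are illustrative, not the repository's API.

```python
import random

MASK = "<mask>"

def toy_denoise(length: int, num_steps: int, vocab=("the", "cat", "sat")) -> list[str]:
    """Iteratively replace masked positions, a few per step, until none remain."""
    random.seed(0)  # deterministic for demonstration
    seq = [MASK] * length
    per_step = max(1, length // num_steps)  # how many tokens to reveal each step
    for _ in range(num_steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # Stand-in for confidence-based selection: pick positions at random here.
        for i in random.sample(masked, min(per_step, len(masked))):
            seq[i] = random.choice(vocab)
    # Reveal any stragglers so the output is fully denoised.
    for i, tok in enumerate(seq):
        if tok == MASK:
            seq[i] = random.choice(vocab)
    return seq

print(toy_denoise(length=8, num_steps=4))
```

A real diffusion model replaces the random choices with predictions from its denoising network, but the loop structure is the same.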

## Key Features

* **Parallel Decoding:** Generates and refines tokens simultaneously across the sequence.
* **Seed Diffusion Editing:** Implements advanced editing logic (per [arXiv:2508.02193](https://arxiv.org/pdf/2508.02193)) to refine existing text while maintaining context.
* **Semi-Autoregressive Generation:** Supports block-wise generation for long-form content, combining the strengths of diffusion with the length-scaling of autoregression.
* **Custom Pipeline:** Built-in support for `TextDiffusionPipeline`, which handles the complex ancestral sampling and confidence-based unmasking automatically.

---

## Quickstart

To use this model, ensure you have the `pipeline.py` file from the repository in your local directory (Hugging Face will download it automatically if `trust_remote_code=True`).

### 1. Basic Chat Completion
```python
from transformers import pipeline

pipe = pipeline(
    "text-diffusion",
    model="JorgeVanco/diffusionGPT",
    trust_remote_code=True
)

messages = [{"role": "user", "content": "Explain diffusion models in simple terms."}]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate using standard diffusion
result = pipe(prompt, num_steps=50)
print(result["decoded_texts"][0])
```

### 2. Streaming Intermediate Denoising
Watch the model "think" as it refines the text from masks to a final response.
```python
for partial_text in pipe.stream_generation(prompt, num_steps=32):
    print(f"\033[H\033[J{partial_text}")  # Clears the terminal for an animation effect
```

### 3. Block-wise (Semi-Autoregressive) Generation
For longer responses that exceed the standard sequence length:
```python
response = pipe.stream_semi_autoregressive_generate(
    input_text=prompt,
    block_size=64,
    max_length=256,
    num_steps=32
)

for step in response:
    print(step)
```

## Technical Details

### Model Architecture
The backbone is a Transformer encoder (`AutoModelForMaskedLM`) configured for discrete diffusion.
- **Training Objective:** Multi-step corruption and reconstruction (MDLM formulation).
- **Corruption Strategy:** Uses a `DiscreteDiffusionCollator`, which applies random masking and optional "Insertion Corruption" using a `<|delete|>` token.
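
To make the corruption step concrete, here is a minimal sketch of timestep-dependent random masking in the spirit of the collator described above. The function name and signature are illustrative, not the repository's actual `DiscreteDiffusionCollator` API.

```python
import random

def mask_tokens(token_ids: list[int], t: float, mask_id: int, seed: int = 0) -> list[int]:
    """Corrupt a sequence by masking each token independently with probability t.

    t is the diffusion timestep in [0, 1]: t=0 leaves the sequence clean,
    t=1 masks everything, matching the usual MDLM noise schedule.
    """
    rng = random.Random(seed)
    return [mask_id if rng.random() < t else tok for tok in token_ids]

corrupted = mask_tokens([5, 12, 7, 99, 3], t=0.5, mask_id=0)
```

During training, the model sees pairs of (corrupted, clean) sequences at many values of `t` and learns to reconstruct the original tokens.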

### Sampling Parameters
When calling `pipe()`, you can tune generation with:
- `num_steps`: More steps generally yield higher quality at the cost of slower inference.
- `use_confidence`: When `True`, the model uses confidence-based (Top-K) unmasking instead of random unmasking.
- `allow_edits`: Enables the Seed Diffusion logic to refine previously "visible" tokens (leave at `True` for better generation).
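
Confidence-based unmasking can be sketched as follows: at each step, only the masked positions where the model is most confident get committed. This toy version takes per-position confidences and predicted tokens directly instead of computing them from real model logits; all names here are illustrative.

```python
def topk_unmask(tokens, confidences, predictions, mask_id, k):
    """Commit the k masked positions with the highest model confidence.

    tokens: current (partially masked) token ids
    confidences/predictions: per-position max probability and argmax token,
    standing in for a real model's softmax output.
    """
    masked = [i for i, t in enumerate(tokens) if t == mask_id]
    # Sort masked positions by confidence, highest first, and keep the top k.
    chosen = sorted(masked, key=lambda i: confidences[i], reverse=True)[:k]
    out = list(tokens)
    for i in chosen:
        out[i] = predictions[i]
    return out

step = topk_unmask(
    tokens=[0, 0, 42, 0],
    confidences=[0.2, 0.9, 1.0, 0.6],
    predictions=[7, 8, 42, 9],
    mask_id=0,
    k=2,
)
# Positions 1 and 3 have the highest confidence among the masked slots,
# so they are unmasked first: [0, 8, 42, 9].
```

Random unmasking (`use_confidence=False`) would instead pick `k` masked positions uniformly at random.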

## Training Setup
The model was trained using the `DiffusionTrainer` class provided in the [source repository](https://github.com/JorgeVanco/diffusionGPT).
### Training Configuration
- **Optimizer:** AdamW with a linear schedule.
- **Loss:** Time-weighted cross-entropy (MDLM).
- **Curriculum:** A `SeedDiffusionCurriculumCallback` introduces corruption stages gradually to improve model robustness.
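
A corruption curriculum of the kind described above can be as simple as ramping the corruption strength over training. This is an illustrative schedule, not the repository's actual `SeedDiffusionCurriculumCallback`:

```python
def corruption_strength(step: int, warmup_steps: int, max_strength: float = 1.0) -> float:
    """Linearly ramp corruption strength from 0 to max_strength over warmup_steps."""
    if warmup_steps <= 0:
        return max_strength
    return min(max_strength, max_strength * step / warmup_steps)

# Early in training the model sees mild corruption; later, the full schedule.
mild = corruption_strength(100, warmup_steps=1000)   # 0.1
full = corruption_strength(2000, warmup_steps=1000)  # capped at 1.0
```

Starting with light corruption lets the model learn easy reconstructions first before facing heavily masked inputs.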

### Example Training Command
```bash
uv run train.py \
    --num_hidden_layers 12 \
    --hidden_size 768 \
    --num_diffusion_steps 32 \
    --max_seq_length 128 \
    --target_param_data_ratio 20
```

## ⚠️ Limitations & Bias
- **Factual Accuracy:** Like all LLMs, this model can hallucinate. It is not optimized for factual retrieval.
- **Coherence:** The model handles short-to-medium chat responses well, but very long-range coherence is still a work in progress via the semi-autoregressive block method.
- **Special Tokens:** The model relies on specific tokens such as `<|im_start|>` and `<|im_end|>` for chat structure.
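
If you build prompts manually instead of via `apply_chat_template`, they must follow the same special-token structure. Assuming a standard ChatML-style layout (verify against the model's actual chat template before relying on this sketch):

```python
def chatml_prompt(user_message: str) -> str:
    """Wrap a user message in ChatML-style turns and open the assistant turn."""
    return (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

prompt = chatml_prompt("Explain diffusion models in simple terms.")
```

Using the tokenizer's own chat template, as in the Quickstart, is the safer option since it is guaranteed to match training.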

## Citation & Acknowledgments
This implementation is inspired by recent research in discrete diffusion for language:
- **MDLM:** [Simple and Effective Masked Diffusion Language Models](https://s-sahoo.com/mdlm/)
- **Seed Diffusion:** [Seed Diffusion: Continuous Training of Discrete Diffusion Language Models](https://seed.bytedance.com/en/seed_diffusion)

## License
This model and its associated code are released under the **MIT License**.