# adhd-diffusion
A discrete diffusion language model for code generation, based on the CoDA (Coding LM via Diffusion Adaptation) architecture.
> ⚠️ **Note:** This is an intermediate checkpoint (step 12,000) from an interrupted training run. The model may not be fully trained.
## Model Details
| Property | Value |
|---|---|
| Architecture | DiffusionQwen3 (Bidirectional Transformer) |
| Base Model | Qwen-based architecture |
| Hidden Size | 1536 |
| Layers | 28 |
| Attention Heads | 12 |
| KV Heads | 2 (GQA) |
| Intermediate Size | 8960 |
| Max Position Embeddings | 32,768 |
| Vocab Size | 151,666 |
| Training Checkpoint | 12,000 steps |
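As a sanity check, the dimensions in the table can be combined into a rough parameter count. The sketch below assumes a standard Qwen-style layout (GQA attention with head_dim = 1536 / 12 = 128 and a SwiGLU MLP with gate/up/down projections) and ignores the small norm terms; whether the `lm_head` shares weights with the embedding is not stated here, so both totals are shown.

```python
# Rough parameter estimate from the configuration table above.
# Assumes head_dim = hidden // heads and a SwiGLU MLP; norm weights
# (a few thousand parameters) are ignored.
hidden, layers, heads, kv_heads = 1536, 28, 12, 2
intermediate, vocab = 8960, 151666
head_dim = hidden // heads  # 128

embed = vocab * hidden
attn = hidden * heads * head_dim           # q_proj
attn += 2 * hidden * kv_heads * head_dim   # k_proj, v_proj (GQA)
attn += heads * head_dim * hidden          # o_proj
mlp = 3 * hidden * intermediate            # gate_proj, up_proj, down_proj
per_layer = attn + mlp

total_tied = embed + layers * per_layer    # lm_head tied to embeddings
total_untied = total_tied + vocab * hidden # separate lm_head

print(f"~{total_tied / 1e9:.2f}B (tied) / ~{total_untied / 1e9:.2f}B (untied)")
```

Either way the result lands in the 1.5-1.8B range, consistent with a Qwen-class model of this width and depth.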
## How Diffusion LMs Work
Unlike autoregressive models that generate tokens left-to-right, this model uses discrete diffusion:
- Start with all `<mask>` tokens in the generation region
- Iteratively unmask tokens based on model confidence
- Higher-confidence predictions are revealed first
- The process repeats until all tokens are generated
This enables bidirectional context during generation, potentially improving coherence for code.
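The loop above can be sketched in a few lines. This is a toy illustration rather than the project's `inference.py`: the "model" here is a fixed target sequence, and random scores stand in for the per-position confidences a real diffusion LM would derive from its output distribution at every step.

```python
import math
import random

MASK = "<mask>"

def toy_unmask(target, steps, seed=0):
    """Toy confidence-based unmasking. `target` stands in for the model's
    predictions; a real diffusion LM re-queries the transformer each step."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    per_step = math.ceil(len(target) / steps)  # positions revealed per step
    while MASK in seq:
        # Score every still-masked position; unmasked positions sort last.
        scores = [(rng.random() if tok == MASK else -1.0, i)
                  for i, tok in enumerate(seq)]
        # Reveal the highest-confidence masked positions first.
        for _, i in sorted(scores, reverse=True)[:per_step]:
            if seq[i] == MASK:
                seq[i] = target[i]
    return seq

print(toy_unmask(list("hello"), steps=5))
```

With `steps` equal to the sequence length, one token is revealed per step; fewer steps reveal several tokens at once, trading quality for speed.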
## Usage
### Installation
```bash
pip install torch transformers
```
### Inference
```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("shouryamaanjain/adhd-diffusion", trust_remote_code=True)

# Load model (see inference.py for full diffusion generation logic)
# The model uses the custom DiffusionQwen3Model class
```
For full inference with diffusion sampling, use the included `inference.py` script:

```bash
# Single prompt
python inference.py --checkpoint /path/to/model --prompt "def fibonacci(n):"

# Interactive chat
python inference.py --checkpoint /path/to/model --mode chat

# With custom parameters
python inference.py --checkpoint /path/to/model \
    --prompt "Write a function to sort a list" \
    --steps 128 \
    --temperature 0.0 \
    --max-tokens 256 \
    --alg entropy
```
## Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| `steps` | 128 | Number of diffusion denoising steps |
| `temperature` | 0.0 | Sampling temperature (0 = greedy) |
| `top_p` | None | Nucleus sampling threshold |
| `top_k` | None | Top-k sampling cutoff |
| `alg` | `entropy` | Unmasking algorithm: `origin`, `entropy`, `maskgit_plus`, `topk_margin` |
| `alg_temp` | 0.1 | Algorithm-specific confidence temperature |
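For intuition, two of the listed algorithms can be illustrated with common definitions of per-position confidence (these are typical formulations for such samplers, not verified against `inference.py`): `entropy` favors positions whose predicted distribution is peaked, while `topk_margin` favors positions with a large gap between the top two candidates.

```python
import math

def entropy_confidence(probs):
    """Negative entropy: peaked distributions score higher."""
    return sum(p * math.log(p) for p in probs if p > 0.0)

def topk_margin_confidence(probs):
    """Margin between the two most likely tokens."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

peaked = [0.90, 0.05, 0.05]   # model is nearly sure
flat   = [0.34, 0.33, 0.33]   # model is unsure

# Under both scores, the peaked position would be unmasked first.
assert entropy_confidence(peaked) > entropy_confidence(flat)
assert topk_margin_confidence(peaked) > topk_margin_confidence(flat)
```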
## Model Architecture
The model is a bidirectional transformer (non-causal attention) trained with discrete diffusion objectives:
```
DiffusionQwen3Model(
  (model): Qwen2Model with bidirectional attention
  (lm_head): Linear(1536, 151666)
)
```
### Training Objective
- **Forward process:** randomly mask tokens with probability σ ~ U[ε, 1]
- **Reverse process:** predict the original tokens from the masked input
- **Loss weighting:** 1/σ (ELBO-derived)
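A minimal sketch of the data side of this objective (the mask id and ε here are placeholders; the real training code operates on batched token-id tensors):

```python
import random

MASK_ID = -1  # placeholder; the real mask token id comes from the tokenizer

def forward_mask(tokens, eps=1e-3, seed=0):
    """Forward process: draw sigma ~ U[eps, 1], then mask each token
    independently with probability sigma."""
    rng = random.Random(seed)
    sigma = rng.uniform(eps, 1.0)
    noisy = [MASK_ID if rng.random() < sigma else t for t in tokens]
    return noisy, sigma

def loss_weight(sigma):
    """ELBO-derived weighting: lightly-masked samples (small sigma,
    few supervised positions) are up-weighted by 1/sigma."""
    return 1.0 / sigma

noisy, sigma = forward_mask(list(range(100)))
print(f"sigma={sigma:.3f}, masked={noisy.count(MASK_ID)}/100, weight={loss_weight(sigma):.3f}")
```

The reverse process is then just the bidirectional transformer predicting the original token at each masked position.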
## Files
- `pytorch_model.bin` - Model weights
- `config.json` - Model configuration
- `tokenizer.json`, `vocab.json`, `merges.txt` - Tokenizer files
- `inference.py` - Standalone inference script
- `modeling_diffusion_qwen3.py` - Model class definition
## Citation
Based on CoDA by Salesforce AI Research:
```bibtex
@article{coda2024,
  title={CoDA: Coding LM via Diffusion Adaptation},
  author={Salesforce AI Research},
  journal={arXiv preprint},
  year={2024}
}
```
## License
Please refer to the base Qwen model license for usage terms.