# adhd-diffusion
A discrete diffusion language model for code generation, based on the CoDA (Coding LM via Diffusion Adaptation) architecture.
> ⚠️ **Note:** This is an intermediate checkpoint (step 12,000) from an interrupted training run. The model may not be fully trained.
## Model Details
| Property | Value |
|---|---|
| Architecture | DiffusionQwen3 (Bidirectional Transformer) |
| Base Model | Qwen-based architecture |
| Hidden Size | 1536 |
| Layers | 28 |
| Attention Heads | 12 |
| KV Heads | 2 (GQA) |
| Intermediate Size | 8960 |
| Max Position Embeddings | 32,768 |
| Vocab Size | 151,666 |
| Training Checkpoint | 12,000 steps |
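As a sanity check, the dimensions in the table can be combined into a rough parameter count. The sketch below assumes a standard Qwen-style layout (GQA attention with head_dim = 1536 / 12 = 128 and a SwiGLU MLP with gate/up/down projections) and ignores the small norm terms; whether the `lm_head` shares weights with the embedding is not stated here, so both totals are shown.

```python
# Rough parameter estimate from the configuration table above.
# Assumes head_dim = hidden // heads and a SwiGLU MLP; norm weights
# (a few thousand parameters) are ignored.
hidden, layers, heads, kv_heads = 1536, 28, 12, 2
intermediate, vocab = 8960, 151666
head_dim = hidden // heads  # 128

embed = vocab * hidden
attn = hidden * heads * head_dim           # q_proj
attn += 2 * hidden * kv_heads * head_dim   # k_proj, v_proj (GQA)
attn += heads * head_dim * hidden          # o_proj
mlp = 3 * hidden * intermediate            # gate_proj, up_proj, down_proj
per_layer = attn + mlp

total_tied = embed + layers * per_layer    # lm_head tied to embeddings
total_untied = total_tied + vocab * hidden # separate lm_head

print(f"~{total_tied / 1e9:.2f}B (tied) / ~{total_untied / 1e9:.2f}B (untied)")
```

Either way the result lands in the 1.5-1.8B range, consistent with a Qwen-class model of this width and depth.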
## How Diffusion LMs Work
Unlike autoregressive models that generate tokens left-to-right, this model uses discrete diffusion:
- Start with all `<mask>` tokens in the generation region
- Iteratively unmask tokens based on model confidence
- Higher-confidence predictions are revealed first
- The process repeats until all tokens are generated
This enables bidirectional context during generation, potentially improving coherence for code.
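The loop above can be sketched in a few lines. This is a toy illustration rather than the project's `inference.py`: the "model" here is a fixed target sequence, and random scores stand in for the per-position confidences a real diffusion LM would derive from its output distribution at every step.

```python
import math
import random

MASK = "<mask>"

def toy_unmask(target, steps, seed=0):
    """Toy confidence-based unmasking. `target` stands in for the model's
    predictions; a real diffusion LM re-queries the transformer each step."""
    rng = random.Random(seed)
    seq = [MASK] * len(target)
    per_step = math.ceil(len(target) / steps)  # positions revealed per step
    while MASK in seq:
        # Score every still-masked position; unmasked positions sort last.
        scores = [(rng.random() if tok == MASK else -1.0, i)
                  for i, tok in enumerate(seq)]
        # Reveal the highest-confidence masked positions first.
        for _, i in sorted(scores, reverse=True)[:per_step]:
            if seq[i] == MASK:
                seq[i] = target[i]
    return seq

print(toy_unmask(list("hello"), steps=5))
```

With `steps` equal to the sequence length, one token is revealed per step; fewer steps reveal several tokens at once, trading quality for speed.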
## Usage
### Installation
```bash
pip install torch transformers
```
### Inference
```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("shouryamaanjain/adhd-diffusion", trust_remote_code=True)

# Load model (see inference.py for full diffusion generation logic)
# The model uses the custom DiffusionQwen3Model class
```
For full inference with diffusion sampling, use the included `inference.py` script:

```bash
# Single prompt
python inference.py --checkpoint /path/to/model --prompt "def fibonacci(n):"

# Interactive chat
python inference.py --checkpoint /path/to/model --mode chat

# With custom parameters
python inference.py --checkpoint /path/to/model \
    --prompt "Write a function to sort a list" \
    --steps 128 \
    --temperature 0.0 \
    --max-tokens 256 \
    --alg entropy
```
## Generation Parameters
| Parameter | Default | Description |
|---|---|---|
| `steps` | 128 | Number of diffusion denoising steps |
| `temperature` | 0.0 | Sampling temperature (0 = greedy) |
| `top_p` | None | Nucleus sampling threshold |
| `top_k` | None | Top-k sampling cutoff |
| `alg` | `entropy` | Unmasking algorithm: `origin`, `entropy`, `maskgit_plus`, `topk_margin` |
| `alg_temp` | 0.1 | Algorithm-specific confidence temperature |
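For intuition, two of the listed algorithms can be illustrated with common definitions of per-position confidence (these are typical formulations for such samplers, not verified against `inference.py`): `entropy` favors positions whose predicted distribution is peaked, while `topk_margin` favors positions with a large gap between the top two candidates.

```python
import math

def entropy_confidence(probs):
    """Negative entropy: peaked distributions score higher."""
    return sum(p * math.log(p) for p in probs if p > 0.0)

def topk_margin_confidence(probs):
    """Margin between the two most likely tokens."""
    top2 = sorted(probs, reverse=True)[:2]
    return top2[0] - top2[1]

peaked = [0.90, 0.05, 0.05]   # model is nearly sure
flat   = [0.34, 0.33, 0.33]   # model is unsure

# Under both scores, the peaked position would be unmasked first.
assert entropy_confidence(peaked) > entropy_confidence(flat)
assert topk_margin_confidence(peaked) > topk_margin_confidence(flat)
```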
## Model Architecture
The model is a bidirectional transformer (non-causal attention) trained with discrete diffusion objectives:
```
DiffusionQwen3Model(
  (model): Qwen2Model with bidirectional attention
  (lm_head): Linear(1536, 151666)
)
```
### Training Objective
- **Forward process:** randomly mask tokens with probability σ ~ U[ε, 1]
- **Reverse process:** predict the original tokens from the masked input
- **Loss weighting:** 1/σ (ELBO-derived)
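A minimal sketch of the data side of this objective (the mask id and ε here are placeholders; the real training code operates on batched token-id tensors):

```python
import random

MASK_ID = -1  # placeholder; the real mask token id comes from the tokenizer

def forward_mask(tokens, eps=1e-3, seed=0):
    """Forward process: draw sigma ~ U[eps, 1], then mask each token
    independently with probability sigma."""
    rng = random.Random(seed)
    sigma = rng.uniform(eps, 1.0)
    noisy = [MASK_ID if rng.random() < sigma else t for t in tokens]
    return noisy, sigma

def loss_weight(sigma):
    """ELBO-derived weighting: lightly-masked samples (small sigma,
    few supervised positions) are up-weighted by 1/sigma."""
    return 1.0 / sigma

noisy, sigma = forward_mask(list(range(100)))
print(f"sigma={sigma:.3f}, masked={noisy.count(MASK_ID)}/100, weight={loss_weight(sigma):.3f}")
```

The reverse process is then just the bidirectional transformer predicting the original token at each masked position.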
## Files
- `pytorch_model.bin` - Model weights
- `config.json` - Model configuration
- `tokenizer.json`, `vocab.json`, `merges.txt` - Tokenizer files
- `inference.py` - Standalone inference script
- `modeling_diffusion_qwen3.py` - Model class definition
## Citation
Based on CoDA by Salesforce AI Research:
```bibtex
@article{coda2024,
  title={CoDA: Coding LM via Diffusion Adaptation},
  author={Salesforce AI Research},
  journal={arXiv preprint},
  year={2024}
}
```
## License
Please refer to the base Qwen model license for usage terms.