# adhd-diffusion

A discrete diffusion language model for code generation, based on the CoDA (Coding LM via Diffusion Adaptation) architecture.

> ⚠️ **Note:** This is an intermediate checkpoint (step 12,000) from an interrupted training run. The model may not be fully trained.

## Model Details

| Property | Value |
|---|---|
| Architecture | DiffusionQwen3 (bidirectional transformer) |
| Base Model | Qwen-based architecture |
| Hidden Size | 1536 |
| Layers | 28 |
| Attention Heads | 12 |
| KV Heads | 2 (GQA) |
| Intermediate Size | 8960 |
| Max Position Embeddings | 32,768 |
| Vocab Size | 151,666 |
| Training Checkpoint | 12,000 steps |

## How Diffusion LMs Work

Unlike autoregressive models that generate tokens left-to-right, this model uses discrete diffusion:

1. Start with all `<mask>` tokens in the generation region
2. Iteratively unmask tokens based on model confidence
3. Higher-confidence predictions are revealed first
4. The process repeats until all tokens are generated

This enables bidirectional context during generation, potentially improving coherence for code.
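The loop above can be sketched in a few lines of PyTorch. This is an illustrative sketch only, not the repository's `inference.py`: `model` here is any callable returning per-position logits, and the reveal schedule is simplified to an equal share per step.

```python
import torch

def diffusion_generate(model, tokens, mask_id, steps=128):
    """Iteratively unmask a fully-masked sequence, revealing the
    highest-confidence predictions first (illustrative sketch)."""
    tokens = tokens.clone()
    for _ in range(steps):
        masked = tokens == mask_id
        if not masked.any():
            break
        logits = model(tokens)                  # [seq, vocab], bidirectional pass
        probs = torch.softmax(logits, dim=-1)
        conf, pred = probs.max(dim=-1)          # per-position confidence and argmax
        conf[~masked] = -1.0                    # never re-reveal settled positions
        # reveal roughly an equal share of the remaining masks each step
        k = int(masked.sum().item()) // steps + 1
        idx = conf.topk(k).indices
        tokens[idx] = pred[idx]
    return tokens
```

With `steps=1` this degenerates to revealing everything at once; more steps let later reveals condition on earlier ones through the bidirectional attention.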

## Usage

### Installation

```bash
pip install torch transformers
```

### Inference

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("shouryamaanjain/adhd-diffusion", trust_remote_code=True)

# Load model (see inference.py for full diffusion generation logic);
# the repo ships a custom DiffusionQwen3Model class, so trust_remote_code
# is required, assuming the config maps to it
model = AutoModel.from_pretrained("shouryamaanjain/adhd-diffusion", trust_remote_code=True)
```

For full inference with diffusion sampling, use the included `inference.py` script:

```bash
# Single prompt
python inference.py --checkpoint /path/to/model --prompt "def fibonacci(n):"

# Interactive chat
python inference.py --checkpoint /path/to/model --mode chat

# With custom parameters
python inference.py --checkpoint /path/to/model \
    --prompt "Write a function to sort a list" \
    --steps 128 \
    --temperature 0.0 \
    --max-tokens 256 \
    --alg entropy
```

## Generation Parameters

| Parameter | Default | Description |
|---|---|---|
| `steps` | 128 | Number of diffusion denoising steps |
| `temperature` | 0.0 | Sampling temperature (0 = greedy) |
| `top_p` | None | Nucleus sampling threshold |
| `top_k` | None | Top-k sampling cutoff |
| `alg` | entropy | Sampling algorithm: `origin`, `entropy`, `maskgit_plus`, `topk_margin` |
| `alg_temp` | 0.1 | Algorithm-specific confidence temperature |
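For intuition, the `entropy` and `topk_margin` confidence measures might be sketched as below. These definitions are assumptions based on common practice in diffusion LM samplers, not the actual code in `inference.py`:

```python
import torch

def entropy_confidence(logits):
    """'entropy'-style score: negative entropy of the softmax distribution,
    so peaked (confident) distributions rank highest. (Assumed definition.)"""
    probs = torch.softmax(logits, dim=-1)
    return (probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)

def topk_margin_confidence(logits):
    """'topk_margin'-style score: gap between the top-2 probabilities,
    large when the model clearly prefers one token. (Assumed definition.)"""
    top2 = torch.softmax(logits, dim=-1).topk(2, dim=-1).values
    return top2[..., 0] - top2[..., 1]
```

Either score can replace the raw max-probability confidence when ranking which masked positions to reveal next.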

## Model Architecture

The model is a bidirectional transformer (non-causal attention) trained with a discrete diffusion objective:

```
DiffusionQwen3Model(
  (model): Qwen2Model with bidirectional attention
  (lm_head): Linear(1536, 151666)
)
```

## Training Objective

- **Forward process:** randomly mask tokens with probability σ ~ U[ε, 1]
- **Reverse process:** predict the original tokens from the masked input
- **Loss weighting:** 1/σ (ELBO-derived)
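A minimal PyTorch sketch of this objective, assuming one masking rate σ per sequence and cross-entropy computed only on masked positions; `diffusion_loss` and its signature are invented for illustration, and the real training code may batch and weight differently:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, tokens, mask_id, eps=1e-3):
    """One step of the masked-diffusion objective (illustrative sketch):
    mask each token with probability sigma ~ U[eps, 1], predict the
    originals, and weight the cross-entropy by 1/sigma (ELBO-derived)."""
    sigma = torch.empty(tokens.shape[0]).uniform_(eps, 1.0)   # one rate per sequence
    mask = torch.rand_like(tokens, dtype=torch.float) < sigma[:, None]
    noisy = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = model(noisy)                                     # [batch, seq, vocab]
    ce = F.cross_entropy(logits.transpose(1, 2), tokens, reduction="none")
    weighted = (ce * mask.float()) / sigma[:, None]           # loss on masked slots only
    return weighted.sum() / mask.float().sum().clamp_min(1.0)
```

The 1/σ weight compensates for the fact that heavily-masked sequences contribute many loss terms while lightly-masked ones contribute few.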

## Files

- `pytorch_model.bin` – model weights
- `config.json` – model configuration
- `tokenizer.json`, `vocab.json`, `merges.txt` – tokenizer files
- `inference.py` – standalone inference script
- `modeling_diffusion_qwen3.py` – model class definition

## Citation

Based on CoDA by Salesforce AI Research:

```bibtex
@article{coda2024,
  title={CoDA: Coding LM via Diffusion Adaptation},
  author={Salesforce AI Research},
  journal={arXiv preprint},
  year={2024}
}
```

## License

Please refer to the base Qwen model license for usage terms.
