---
language:
- en
tags:
- dllm
- diffusion-language-model
- text-generation
- diffusion
- language-model
license: apache-2.0
---

# HDLM-Epsilon: Hybrid Diffusion Language Model
|
|
[Paper (arXiv:2504.06416)](https://arxiv.org/abs/2504.06416)
[Code (GitHub)](https://github.com/ServiceNow/hdlm)
|
|
This model card is for the **hdlm-base model with ε = 0.0**.
|
|
## Model Description
|
|
HDLM-Epsilon is a hybrid diffusion language model that unifies autoregressive and diffusion-based sequence generation through epsilon-hybrid noising. This model interpolates evolution operators between absorbing and uniform processes, making it conceptually closer to MDLM (Sahoo et al., 2024) while maintaining the benefits of both paradigms.
|
|
The epsilon parameter (ε) controls the blend between absorbing and uniform processes during training: smaller values emphasize the absorbing (masking) process, while larger values incorporate more uniform noise.
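
As a rough illustration of this idea (a sketch, not the repository's actual implementation), a hybrid corruption step might look like the following; the function name and the exact way ε splits probability mass between masking and random-token replacement are assumptions for exposition:

```python
import torch

def hybrid_corrupt(tokens, noise_level, epsilon, mask_id, vocab_size):
    """Illustrative epsilon-hybrid corruption (hypothetical sketch).

    Each token is corrupted with probability `noise_level`; a corrupted
    token becomes a uniformly random token with probability `epsilon`,
    and the absorbing (mask) token otherwise.
    """
    corrupt = torch.rand_like(tokens, dtype=torch.float) < noise_level
    uniform = torch.rand_like(tokens, dtype=torch.float) < epsilon
    random_tokens = torch.randint_like(tokens, vocab_size)
    masked = torch.full_like(tokens, mask_id)
    noised = torch.where(uniform, random_tokens, masked)
    return torch.where(corrupt, noised, tokens)

# Example: GPT-2 vocabulary (50,257 tokens) plus mask id 50257
x = torch.randint(0, 50257, (1, 8))
x_noisy = hybrid_corrupt(x, noise_level=0.5, epsilon=0.01,
                         mask_id=50257, vocab_size=50257)
```

With ε = 0.01 almost every corrupted position becomes the mask token, which is why small ε stays close to a pure absorbing process.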
|
|
## Model Architecture
|
|
- **Base Model**: Transformer architecture with custom conditioning layers
- **Vocabulary Size**: 50,258 tokens (GPT-2 vocabulary + absorbing token)
- **Context Length**: 1024 tokens
- **Training**: Hybrid loss combining token masking with random token corruption
- **Inference**: Supports multiple sampling algorithms, including ACS (Adaptive Correction Sampler)
|
|
## Usage
|
|
### Quick Start
|
|
```python
from hdlm.hf_utils import smart_model_loader
from hdlm.epsilon_hybrid.sample import full_diff
from transformers import GPT2TokenizerFast
import torch

# Load model using the smart loader (automatically detects the model type)
model, cfg, device, accelerator, metaschedule = smart_model_loader(
    model_path="hdlm-group/hdlm-base-epsilon-0.0",
    model_type="auto",  # automatically detects epsilon_hybrid
    device="cuda"
)

# Load tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

# Generate text
prompt = "The future of artificial intelligence"
prompt_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

# Full diffusion sampling
generated = full_diff(
    model=model,
    prompt=prompt_ids,
    batch_size=1,
    alg='acs',  # or 'original', 'remask', 'remdm'
    steps=512,
    temperature=1.0,
    context_length=1024,
    device=device
)

# Decode generated text
generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)
print(generated_text)
```
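
As a general rule for diffusion samplers, `steps` sets the number of denoising iterations, so fewer steps run faster at some cost in sample quality, while `temperature` scales the logits before token sampling, with values below 1.0 making generations more conservative.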
|
|
### Evaluation
|
|
```bash
# Text generation evaluation
python hdlm/eval_generation.py \
    --checkpoint_path hdlm-group/hdlm-base-epsilon-0.0 \
    --sampling_method full_diff \
    --algorithm acs \
    --save_samples

# Perplexity evaluation
python hdlm/eval_modeling.py \
    --checkpoint_path hdlm-group/hdlm-base-epsilon-0.0 \
    --work_dir "./logs/eval_modeling_epsilon" \
    --dataset ptb
```
|
|
## Training Details
|
|
- **Dataset**: OpenWebText
- **Batch Size**: 512
- **Learning Rate**: 3e-4 with cosine scheduling
- **Epsilon (ε)**: 0.01 (controls the hybrid noising blend)
- **Lambda (λ)**: 1.0 (weighting factor for unmasked tokens; see the loss sketch below)
- **Loss Type**: Hybrid loss combining masking and random token corruption
- **Training Steps**: 1M iterations
- **Warmup**: 50K steps
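
To make the role of λ concrete, here is a minimal sketch of how a hybrid objective might weight masked versus unmasked (randomly corrupted) positions. This is an illustration under assumptions, not the repository's actual loss; the function name and the exact weighting scheme are hypothetical:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits, targets, was_masked, lam=1.0):
    """Hypothetical hybrid objective: per-token cross-entropy, with
    unmasked (randomly corrupted) positions weighted by `lam`.

    logits:     (batch, seq, vocab) model predictions
    targets:    (batch, seq) original clean tokens
    was_masked: (batch, seq) bool, True where the input was the absorbing token
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction='none'
    )  # (batch, seq)
    # Weight 1.0 on masked positions, lam on unmasked positions
    weights = per_token.new_full(per_token.shape, lam)
    weights = weights.masked_fill(was_masked, 1.0)
    return (weights * per_token).mean()
```

With λ = 1.0, as reported above, masked and unmasked positions would contribute equally under this scheme.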
|
|
## Sampling Algorithms
|
|
The model supports several sampling algorithms, selected via the `alg` argument of `full_diff` (a comparison sketch follows the list):
|
|
- **`original`**: Standard diffusion sampling
- **`acs`**: Adaptive Correction Sampler with error correction
- **`remask`**: Remasking strategy for improved quality
- **`remdm`**: ReMDM-style sampling with probability mixing
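
Building on the Quick Start snippet above, the algorithms can be compared side by side on the same prompt (this assumes `model`, `prompt_ids`, `tokenizer`, and `device` are already set up as shown there):

```python
# Generate with each algorithm and print the decoded outputs
for alg in ['original', 'acs', 'remask', 'remdm']:
    generated = full_diff(
        model=model,
        prompt=prompt_ids,
        batch_size=1,
        alg=alg,
        steps=512,
        temperature=1.0,
        context_length=1024,
        device=device
    )
    print(f"--- {alg} ---")
    print(tokenizer.decode(generated[0], skip_special_tokens=True))
```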
|
|
## Model Variants
|
|
Available epsilon values and their characteristics (a loading sketch follows the list):
|
|
- **ε = 0.01**: Minimal uniform noise, closest to a pure absorbing process
- **ε = 0.1**: Moderate hybrid behavior
- **ε = 0.5**: Balanced absorbing-uniform blend
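
To try a different variant, point `smart_model_loader` at the corresponding repository. The repository id below is an assumption inferred from this card's `hdlm-base-epsilon-<value>` naming pattern; check the hdlm-group organization on the Hub for the exact identifiers:

```python
# Hypothetical repo id for the ε = 0.1 variant, following the naming pattern above
model, cfg, device, accelerator, metaschedule = smart_model_loader(
    model_path="hdlm-group/hdlm-base-epsilon-0.1",
    model_type="auto",
    device="cuda"
)
```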
|
|
## Citation
|
|
```bibtex
@article{fathi2025unifying,
  title={Unifying autoregressive and diffusion-based sequence generation},
  author={Fathi, Nima and Scholak, Torsten and No{\"e}l, Pierre-Andr{\'e}},
  journal={arXiv preprint arXiv:2504.06416},
  year={2025}
}
```
|
|
## License
|
|
This model is released under the Apache 2.0 license, the same license as the original HDLM codebase. Please refer to the [GitHub repository](https://github.com/ServiceNow/hdlm) for details.
|
|