
	---
	language: en
	tags:
	- mask-predict
	- diffusion
	- masked-lm
	library_name: transformers
	base_model: answerdotai/ModernBERT-base
	pipeline_tag: fill-mask
	---

	# modernbert-diffusion-instruct

	## Model Summary
	A diffusion-style masked language model fine-tuned in `instruct` mode using a discrete denoising objective.

	## Model Details
	- Model ID: philipp-zettl/modernbert-diffusion-instruct
	- Base model: answerdotai/ModernBERT-base
	- Training mode: instruct
	- Task type: Masked token denoising / diffusion-style infilling
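To make the task type concrete, here is a minimal sketch of the corruption step commonly used in masked-token denoising: each token is replaced by a mask with some probability, and the model is trained to recover the originals at the masked positions. The `corrupt` helper, the literal `[MASK]` string, and independent per-token masking are illustrative assumptions, not taken from this model's training code.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol (assumption; the real tokenizer defines its own)

def corrupt(tokens, mask_ratio, rng):
    """Mask each token independently with probability mask_ratio."""
    noisy, targets = [], []
    for tok in tokens:
        if rng.random() < mask_ratio:
            noisy.append(MASK)
            targets.append(tok)    # loss would be computed on this position
        else:
            noisy.append(tok)
            targets.append(None)   # unmasked positions carry no target
    return noisy, targets

rng = random.Random(0)
tokens = "diffusion models denoise text step by step".split()
noisy, targets = corrupt(tokens, mask_ratio=0.5, rng=rng)
print(noisy)
```

A higher `mask_ratio` corresponds to a noisier input; at `1.0` every token is masked and the model has to reconstruct the whole sequence.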

	## Intended Use
	Intended for instruction-style infilling in chat-like prompts: given a `User:` turn followed by `AI:`, the model fills in the response tokens.

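	Example:
	```python
	from refinebert.diffusion_engine import MaskedDiffusionEngine

	# Load the fine-tuned checkpoint into the diffusion engine.
	engine = MaskedDiffusionEngine("philipp-zettl/modernbert-diffusion-instruct")
	# Chat-style prompt; the newline is escaped so the string literal stays valid.
	prompt = "User: What is diffusion?\nAI:"
	# Request 30 new tokens with steps=12 and guidance_scale=3.0.
	output = engine.generate(prompt, num_new_tokens=30, steps=12, guidance_scale=3.0)
	print(output)
	```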
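The `guidance_scale` and `steps` arguments above can be read through the usual classifier-free guidance and iterative-unmasking lens. The sketch below shows the standard guidance formula on plain lists of logits, and one simple way to spread `num_new_tokens` over `steps`. Both helpers (`apply_guidance`, `tokens_per_step`) are hypothetical; whether `MaskedDiffusionEngine` uses exactly this combination or schedule is an assumption.

```python
def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def tokens_per_step(num_new_tokens, steps):
    """Split the tokens to unmask as evenly as possible across the steps."""
    base, extra = divmod(num_new_tokens, steps)
    return [base + 1 if i < extra else base for i in range(steps)]

cond = [2.0, 0.5, -1.0]    # logits with the prompt visible
uncond = [1.0, 0.5, 0.0]   # logits with the prompt dropped
guided = apply_guidance(cond, uncond, guidance_scale=3.0)
print(guided)                    # [4.0, 0.5, -3.0]
print(tokens_per_step(30, 12))   # [3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2]
```

With a scale of `1.0` the guided logits equal the conditional ones; larger values push the prediction further toward what the prompt favors.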

	## Training Data
	Datasets are streamed from Hugging Face and mixed by mode.

	### Dataset Mix
	| Dataset | Percentage | Purpose |
	| --- | --- | --- |
	| HuggingFaceH4/ultrachat_200k (train_sft) | 100% | Instruction chat |



	## Training Procedure
	- Steps: 50000
	- Batch size: 4
	- Sequence length: 256
	- Learning rate: 5e-05
	- CFG dropout probability: 0.1
	- Samples loaded into RAM: 100000
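As a rough reading of these settings, the snippet below works out the training budget they imply and sketches what a CFG dropout probability of 0.1 typically means. It assumes no gradient accumulation, one sequence per loaded sample, and that dropping the condition means masking the prompt tokens. None of these are confirmed by the card, and `maybe_drop_condition` is a hypothetical helper.

```python
import random

steps, batch_size, seq_len = 50_000, 4, 256
samples_in_ram = 100_000

sequences_seen = steps * batch_size            # 200,000 sequences
max_token_positions = sequences_seen * seq_len # upper bound: every sequence full length
passes_over_data = sequences_seen / samples_in_ram

def maybe_drop_condition(prompt_tokens, drop_prob, rng, mask_token="[MASK]"):
    """With probability drop_prob, hide the prompt so the model also learns
    an unconditional prediction (needed for guidance at inference time)."""
    if rng.random() < drop_prob:
        return [mask_token] * len(prompt_tokens)
    return list(prompt_tokens)

print(sequences_seen, max_token_positions, passes_over_data)
print(maybe_drop_condition(["User:", "Hi"], drop_prob=0.1, rng=random.Random(0)))
```

Under these assumptions the run covers the 100,000 loaded samples about twice.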

	## Training Time & Hardware
	- Duration: 2h 34m 9s
	- Hardware: NVIDIA GeForce RTX 2060 x1 (CUDA available)
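For a sense of scale, the reported duration can be turned into throughput using the step count and batch size listed under Training Procedure. This is simple arithmetic on the card's numbers; it assumes the duration covers only the 50,000 optimizer steps, which the card does not state.

```python
duration_s = 2 * 3600 + 34 * 60 + 9   # 2h 34m 9s = 9249 seconds
steps = 50_000
batch_size = 4

steps_per_s = steps / duration_s
sequences_per_s = steps_per_s * batch_size

print(f"{steps_per_s:.2f} steps/s, {sequences_per_s:.1f} sequences/s")
```

That works out to roughly 5.4 steps per second on the single RTX 2060.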

	## Metrics (Training)
	| Metric | Value |
	| --- | --- |
	| Training loss (latest) | 4.9687 |
	| Training loss (mean) | 3.7032 |
	| Training step | 50000 / 50000 |

	## Limitations & Considerations
	- The model is trained with a masked-token diffusion objective and may not behave like an autoregressive LM.
	- Data sources may have licensing or content constraints; review the source dataset cards before deployment.
	- Performance can vary substantially with prompt structure; this checkpoint was trained in `instruct` mode only.