---
language: en
tags:
- mask-predict
- diffusion
- masked-lm
library_name: transformers
base_model: answerdotai/ModernBERT-base
pipeline_tag: fill-mask
---
# modernbert-diffusion-universal
## Model Summary
A diffusion-style masked language model fine-tuned in `universal` mode using a discrete denoising objective.
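The card does not document the exact sampler, but diffusion-style masked LMs are typically decoded with an iterative mask-predict loop: start from a fully masked canvas, predict every masked position, commit only the most confident predictions per a schedule, and re-mask the rest. A minimal, model-agnostic sketch (the `MASK` sentinel, `predict` callback, and cosine schedule here are illustrative assumptions, not the engine's actual API):

```python
import math

MASK = "<mask>"  # illustrative sentinel; the real tokenizer defines its own mask token


def masked_fraction(step: int, total_steps: int) -> float:
    # Cosine schedule (MaskGIT-style): fraction of positions that should
    # remain masked after completing `step` of `total_steps`.
    return math.cos(math.pi / 2 * step / total_steps)


def denoise(seq_len: int, total_steps: int, predict):
    # `predict(tokens, i)` is a hypothetical callback returning a
    # (token, confidence) pair for masked position `i`.
    tokens = [MASK] * seq_len
    for step in range(1, total_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = {i: predict(tokens, i) for i in masked}
        # How many positions to commit so the remaining masked count
        # matches the schedule for this step.
        keep = len(masked) - round(seq_len * masked_fraction(step, total_steps))
        for i in sorted(masked, key=lambda j: -preds[j][1])[:max(keep, 0)]:
            tokens[i] = preds[i][0]
    return tokens
```

At the final step the schedule reaches zero, so every remaining masked position is filled.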
## Model Details
- **Model ID:** philipp-zettl/modernbert-diffusion-universal
- **Base model:** answerdotai/ModernBERT-base
- **Training mode:** universal
- **Task type:** Masked token denoising / diffusion-style infilling
## Intended Use
Intended as a general-purpose infilling model across text, code, JSON, and chat formats.
**Example**
```python
from refinebert.diffusion_engine import MaskedDiffusionEngine

# Load the fine-tuned checkpoint from the Hugging Face Hub.
engine = MaskedDiffusionEngine("philipp-zettl/modernbert-diffusion-universal")

prompt = "def generate_json(data):"
# Infill 25 new tokens over 12 denoising steps, with classifier-free guidance.
output = engine.generate(prompt, num_new_tokens=25, steps=12, guidance_scale=3.0)
print(output)
```
## Training Data
Datasets are streamed from the Hugging Face Hub and mixed according to the training mode.
### Dataset Mix
| Dataset | Percentage | Purpose |
| --- | --- | --- |
| HuggingFaceFW/fineweb-edu (sample-10BT) | 40% | General web/edu text |
| bigcode/the-stack-dedup (python) | 30% | Python code |
| bigcode/the-stack-dedup (json) | 15% | Structured JSON |
| HuggingFaceH4/ultrachat_200k (train_sft) | 15% | Instruction chat |
Fallbacks: FineWeb-Edu may fall back to Wikitext-103, and The Stack may fall back to CodeParrot depending on availability.
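The card does not say how the percentages above are enforced during streaming; one common approach is a weighted interleave that samples each source in proportion to its mix weight. A minimal sketch (the function name and exhaustion behavior are assumptions, not the training code):

```python
import random


def mix_streams(streams, weights, seed=0):
    # Weighted interleave over (possibly streaming) datasets: at each draw,
    # pick a source with probability proportional to its mix percentage.
    rng = random.Random(seed)
    iters = [iter(s) for s in streams]
    weights = list(weights)  # copy so the caller's list is untouched
    while iters:
        idx = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            # Drop an exhausted source; remaining weights renormalize implicitly.
            del iters[idx]
            del weights[idx]
```

With infinite streams the generator never exhausts a source and the draw frequencies converge to the table's percentages; `datasets.interleave_datasets` offers comparable behavior out of the box.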
## Training Procedure
- **Steps:** 500,000
- **Batch size:** 16
- **Sequence length:** 256
- **Learning rate:** 5e-5
- **CFG dropout probability:** 0.1
- **Samples loaded into RAM:** 100,000
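The CFG dropout probability above and the `guidance_scale` in the usage example fit the standard classifier-free guidance recipe: occasionally drop the conditioning during training so the model also learns an unconditional path, then extrapolate between the two at inference. A sketch of both halves (the function names are illustrative, not the training code):

```python
import random


def maybe_drop_condition(prompt_tokens, drop_prob=0.1, rng=random):
    # CFG training side: with probability `drop_prob`, replace the
    # conditioning prompt with nothing so an unconditional path is learned.
    return [] if rng.random() < drop_prob else prompt_tokens


def guided_logits(cond_logits, uncond_logits, guidance_scale=3.0):
    # CFG inference side: push predictions toward the conditional
    # distribution by scale * (conditional - unconditional).
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]
```

With `guidance_scale=1.0` this reduces to the conditional logits; larger values trade diversity for prompt adherence.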
## Training Time & Hardware
- **Duration:** 46h 38m 11s
- **Hardware:** NVIDIA GeForce RTX 4070 Laptop GPU x1 (CUDA available)
## Metrics (Training)
| Metric | Value |
| --- | --- |
| Training loss (latest) | 4.2869 |
| Training loss (mean) | 3.5010 |
| Training step | 500,000 / 500,000 |
## Limitations & Considerations
- The model is trained with a masked-token diffusion objective and may not behave like an autoregressive LM.
- Data sources may have licensing or content constraints; review the source dataset cards before deployment.
- Performance can vary substantially with the training mode (`universal` here) and with prompt structure.