---
language: en
tags:
  - mask-predict
  - diffusion
  - masked-lm
library_name: transformers
base_model: answerdotai/ModernBERT-base
pipeline_tag: fill-mask
---

# modernbert-diffusion-universal

## Model Summary

A diffusion-style masked language model fine-tuned in universal mode using a discrete denoising objective.

## Model Details

- **Model ID:** philipp-zettl/modernbert-diffusion-universal
- **Base model:** answerdotai/ModernBERT-base
- **Training mode:** universal
- **Task type:** masked token denoising / diffusion-style infilling

## Intended Use

Intended as a general-purpose infilling model across text, code, JSON, and chat formats.

## Example

```python
from refinebert.diffusion_engine import MaskedDiffusionEngine

# Load the fine-tuned model from the Hub
engine = MaskedDiffusionEngine("philipp-zettl/modernbert-diffusion-universal")

prompt = "def generate_json(data):"
output = engine.generate(prompt, num_new_tokens=25, steps=12, guidance_scale=3.0)
print(output)
```
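Conceptually, mask-predict generation starts from a fully masked span and, over a fixed number of steps, commits the highest-confidence predictions while re-masking the rest. The following is a minimal, self-contained sketch of that loop with a random stand-in scorer in place of the model; all names here are illustrative, not the library's API:

```python
import random

MASK = "[MASK]"

def toy_scorer(tokens):
    """Stand-in for the model: returns (token, confidence) per position.
    A real engine would run a forward pass and take the argmax per mask."""
    vocab = ["the", "cat", "sat", "on", "mat"]
    return [(random.choice(vocab), random.random()) for _ in tokens]

def mask_predict(num_new_tokens=8, steps=4, seed=0):
    random.seed(seed)
    tokens = [MASK] * num_new_tokens
    conf = [0.0] * num_new_tokens
    for step in range(steps):
        preds = toy_scorer(tokens)
        # Predict only at currently masked positions; keep committed tokens.
        for i in range(num_new_tokens):
            if tokens[i] == MASK:
                tokens[i], conf[i] = preds[i]
        # Linear schedule: re-mask the lowest-confidence positions, so fewer
        # positions remain masked at each successive step (zero at the end).
        num_masked = int(num_new_tokens * (1 - (step + 1) / steps))
        for i in sorted(range(num_new_tokens), key=lambda i: conf[i])[:num_masked]:
            tokens[i] = MASK
    return tokens

print(mask_predict())
```

The `steps` parameter in `engine.generate` plays the role of the schedule length above: more steps mean fewer positions are committed per iteration.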

## Training Data

Datasets are streamed from the Hugging Face Hub and mixed according to the selected training mode.

### Dataset Mix

| Dataset | Percentage | Purpose |
| --- | --- | --- |
| `HuggingFaceFW/fineweb-edu` (sample-10BT) | 40% | General web/edu text |
| `bigcode/the-stack-dedup` (Python) | 30% | Python code |
| `bigcode/the-stack-dedup` (JSON) | 15% | Structured JSON |
| `HuggingFaceH4/ultrachat_200k` (train_sft) | 15% | Instruction chat |

Fallbacks: FineWeb-Edu may fall back to Wikitext-103, and The Stack may fall back to CodeParrot depending on availability.
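A mix like the one above is typically realized by weighted sampling across the individual streams on every draw. A minimal sketch under that assumption (toy in-memory stands-ins for the four streaming datasets; the real pipeline would use `datasets` streaming):

```python
import random

def mixed_stream(streams, weights, n, seed=0):
    """Yield n samples, picking a source stream per draw with the given weights."""
    rng = random.Random(seed)
    names = list(streams)
    iters = {name: iter(streams[name]) for name in names}
    for _ in range(n):
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, next(iters[name])

# Toy stand-ins for the four streamed datasets.
streams = {
    "fineweb-edu": (f"web-{i}" for i in range(10**6)),
    "stack-python": (f"py-{i}" for i in range(10**6)),
    "stack-json": (f"json-{i}" for i in range(10**6)),
    "ultrachat": (f"chat-{i}" for i in range(10**6)),
}
sample = list(mixed_stream(streams, weights=[0.40, 0.30, 0.15, 0.15], n=5))
print(sample)
```

Sampling per draw (rather than concatenating shards) keeps the mixture ratio stable even when the underlying streams have very different sizes.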

## Training Procedure

- **Steps:** 500,000
- **Batch size:** 16
- **Sequence length:** 256
- **Learning rate:** 5e-5
- **CFG dropout probability:** 0.1
- **Samples loaded into RAM:** 100,000
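The CFG dropout probability refers to classifier-free guidance: during training the conditioning (prompt) is dropped about 10% of the time, so the model also learns an unconditional distribution; at inference, the conditional and unconditional predictions are combined using `guidance_scale`. A sketch of the standard combination step, with plain-Python stand-in logits (this is the generic CFG formula, not necessarily the library's exact code):

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance: move predictions away from the unconditional
    distribution, scaled by guidance_scale (scale 1.0 returns cond unchanged)."""
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]

cond = [2.0, 0.5, -1.0]    # logits from the prompted forward pass
uncond = [1.0, 1.0, 0.0]   # logits with the prompt dropped
print(apply_cfg(cond, uncond, 3.0))  # → [4.0, -0.5, -3.0]
```

This is why `guidance_scale=3.0` in the example above sharpens prompt adherence: differences between conditional and unconditional predictions are amplified threefold.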

## Training Time & Hardware

- **Duration:** 46h 38m 11s
- **Hardware:** 1× NVIDIA GeForce RTX 4070 Laptop GPU (CUDA)

## Metrics (Training)

| Metric | Value |
| --- | --- |
| Training loss (latest) | 4.2869 |
| Training loss (mean) | 3.5010 |
| Training step | 500,000 / 500,000 |

## Limitations & Considerations

- The model is trained with a masked-token diffusion objective and may not behave like an autoregressive LM.
- Data sources may have licensing or content constraints; review the source dataset cards before deployment.
- Performance can vary substantially with the training mode (here, universal) and prompt structure.