---
language: en
tags:
- mask-predict
- diffusion
- masked-lm
library_name: transformers
base_model: answerdotai/ModernBERT-base
pipeline_tag: fill-mask
---
# modernbert-diffusion-universal
## Model Summary
A diffusion-style masked language model fine-tuned in `universal` mode using a discrete denoising objective.
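The card does not document the exact sampler, but diffusion-style masked LMs are typically decoded with an iterative mask-predict loop: start from a fully masked canvas, predict every masked position, commit only the most confident predictions per a schedule, and re-mask the rest. A minimal, model-agnostic sketch (the `MASK` sentinel, `predict` callback, and cosine schedule here are illustrative assumptions, not the engine's actual API):

```python
import math

MASK = "<mask>"  # illustrative sentinel; the real tokenizer defines its own mask token


def masked_fraction(step: int, total_steps: int) -> float:
    # Cosine schedule (MaskGIT-style): fraction of positions that should
    # remain masked after completing `step` of `total_steps`.
    return math.cos(math.pi / 2 * step / total_steps)


def denoise(seq_len: int, total_steps: int, predict):
    # `predict(tokens, i)` is a hypothetical callback returning a
    # (token, confidence) pair for masked position `i`.
    tokens = [MASK] * seq_len
    for step in range(1, total_steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = {i: predict(tokens, i) for i in masked}
        # How many positions to commit so the remaining masked count
        # matches the schedule for this step.
        keep = len(masked) - round(seq_len * masked_fraction(step, total_steps))
        for i in sorted(masked, key=lambda j: -preds[j][1])[:max(keep, 0)]:
            tokens[i] = preds[i][0]
    return tokens
```

At the final step the schedule reaches zero, so every remaining masked position is filled.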
## Model Details
- **Model ID:** philipp-zettl/modernbert-diffusion-universal
- **Base model:** answerdotai/ModernBERT-base
- **Training mode:** universal
- **Task type:** Masked token denoising / diffusion-style infilling
## Intended Use
Intended as a general-purpose infilling model across text, code, JSON, and chat formats.
**Example**
```python
from refinebert.diffusion_engine import MaskedDiffusionEngine

# Load the fine-tuned checkpoint from the Hugging Face Hub.
engine = MaskedDiffusionEngine("philipp-zettl/modernbert-diffusion-universal")

prompt = "def generate_json(data):"
# Infill 25 new tokens over 12 denoising steps, with classifier-free guidance.
output = engine.generate(prompt, num_new_tokens=25, steps=12, guidance_scale=3.0)
print(output)
```
## Training Data
Datasets are streamed from the Hugging Face Hub and mixed according to the training mode.
### Dataset Mix
| Dataset | Percentage | Purpose |
| --- | --- | --- |
| HuggingFaceFW/fineweb-edu (sample-10BT) | 40% | General web/edu text |
| bigcode/the-stack-dedup (python) | 30% | Python code |
| bigcode/the-stack-dedup (json) | 15% | Structured JSON |
| HuggingFaceH4/ultrachat_200k (train_sft) | 15% | Instruction chat |
Fallbacks: FineWeb-Edu may fall back to Wikitext-103, and The Stack may fall back to CodeParrot depending on availability.
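The card does not say how the percentages above are enforced during streaming; one common approach is a weighted interleave that samples each source in proportion to its mix weight. A minimal sketch (the function name and exhaustion behavior are assumptions, not the training code):

```python
import random


def mix_streams(streams, weights, seed=0):
    # Weighted interleave over (possibly streaming) datasets: at each draw,
    # pick a source with probability proportional to its mix percentage.
    rng = random.Random(seed)
    iters = [iter(s) for s in streams]
    weights = list(weights)  # copy so the caller's list is untouched
    while iters:
        idx = rng.choices(range(len(iters)), weights=weights, k=1)[0]
        try:
            yield next(iters[idx])
        except StopIteration:
            # Drop an exhausted source; remaining weights renormalize implicitly.
            del iters[idx]
            del weights[idx]
```

With infinite streams the generator never exhausts a source and the draw frequencies converge to the table's percentages; `datasets.interleave_datasets` offers comparable behavior out of the box.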
## Training Procedure
- **Steps:** 500,000
- **Batch size:** 16
- **Sequence length:** 256
- **Learning rate:** 5e-5
- **CFG dropout probability:** 0.1
- **Samples loaded into RAM:** 100,000
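The CFG dropout probability above and the `guidance_scale` in the usage example fit the standard classifier-free guidance recipe: occasionally drop the conditioning during training so the model also learns an unconditional path, then extrapolate between the two at inference. A sketch of both halves (the function names are illustrative, not the training code):

```python
import random


def maybe_drop_condition(prompt_tokens, drop_prob=0.1, rng=random):
    # CFG training side: with probability `drop_prob`, replace the
    # conditioning prompt with nothing so an unconditional path is learned.
    return [] if rng.random() < drop_prob else prompt_tokens


def guided_logits(cond_logits, uncond_logits, guidance_scale=3.0):
    # CFG inference side: push predictions toward the conditional
    # distribution by scale * (conditional - unconditional).
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]
```

With `guidance_scale=1.0` this reduces to the conditional logits; larger values trade diversity for prompt adherence.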
## Training Time & Hardware
- **Duration:** 46h 38m 11s
- **Hardware:** NVIDIA GeForce RTX 4070 Laptop GPU x1 (CUDA available)
## Metrics (Training)
| Metric | Value |
| --- | --- |
| Training loss (latest) | 4.2869 |
| Training loss (mean) | 3.5010 |
| Training step | 500,000 / 500,000 |
## Limitations & Considerations
- The model is trained with a masked-token diffusion objective and may not behave like an autoregressive LM.
- Data sources may have licensing or content constraints; review the source dataset cards before deployment.
- Performance can vary substantially with the training mode (`universal` here) and with prompt structure.