---
language: en
tags:
- mask-predict
- diffusion
- masked-lm
library_name: transformers
base_model: answerdotai/ModernBERT-base
pipeline_tag: fill-mask
---

# modernbert-diffusion-universal

## Model Summary

A diffusion-style masked language model fine-tuned in `universal` mode using a discrete denoising objective.
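
For intuition, the discrete denoising objective can be sketched as follows. This is a minimal illustration, not the repository's training code; `MASK_ID`, `corrupt`, and the token IDs are hypothetical names chosen for the example:

```python
import random

# Hypothetical mask token ID for illustration only.
MASK_ID = -1

def corrupt(token_ids, mask_ratio, rng):
    """Replace a randomly chosen mask_ratio fraction of tokens with MASK_ID.

    The model is then trained to predict the original tokens at the masked
    positions -- the discrete analogue of adding noise in a diffusion model.
    """
    n_mask = max(1, int(len(token_ids) * mask_ratio))
    positions = rng.sample(range(len(token_ids)), n_mask)
    corrupted = list(token_ids)
    for p in positions:
        corrupted[p] = MASK_ID
    return corrupted, sorted(positions)

rng = random.Random(0)
tokens = [101, 7592, 2088, 2003, 1037, 3231, 102]
noisy, targets = corrupt(tokens, mask_ratio=0.4, rng=rng)
print(noisy, targets)
```

Sampling a fresh mask ratio per example (rather than a fixed 15% as in classic MLM pretraining) is what lets the model denoise at every corruption level during generation.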

## Model Details

- **Model ID:** philipp-zettl/modernbert-diffusion-universal
- **Base model:** answerdotai/ModernBERT-base
- **Training mode:** universal
- **Task type:** Masked token denoising / diffusion-style infilling

## Intended Use

Intended as a general-purpose infilling model across text, code, JSON, and chat formats.

**Example**

```python
from refinebert.diffusion_engine import MaskedDiffusionEngine

engine = MaskedDiffusionEngine("philipp-zettl/modernbert-diffusion-universal")
prompt = "def generate_json(data):"
output = engine.generate(prompt, num_new_tokens=25, steps=12, guidance_scale=3.0)
print(output)
```
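
The `generate` call refines masked positions over multiple rounds in the style of mask-predict decoding. A rough, self-contained sketch of that loop, with a toy stand-in for the denoiser (`toy_model`, the token names, and the linear re-masking schedule are illustrative assumptions, not the engine's actual internals):

```python
import random

MASK = "<mask>"

def toy_model(seq):
    """Stand-in for the denoiser: one (token, confidence) guess per mask.

    A real engine would score every masked position with the MLM head.
    """
    rng = random.Random(len([t for t in seq if t == MASK]))
    return {i: (f"tok{i}", rng.random()) for i, t in enumerate(seq) if t == MASK}

def generate(prompt_tokens, num_new_tokens, steps):
    # Append fully masked positions after the prompt, then iteratively unmask.
    seq = prompt_tokens + [MASK] * num_new_tokens
    for step in range(steps):
        preds = toy_model(seq)
        for i, (tok, _conf) in preds.items():
            seq[i] = tok
        # Re-mask the least confident predictions; fewer masks remain each step.
        keep_masked = int(num_new_tokens * (1 - (step + 1) / steps))
        if keep_masked > 0:
            worst = sorted(preds, key=lambda i: preds[i][1])[:keep_masked]
            for i in worst:
                seq[i] = MASK
    return seq

out = generate(["def", "f", "(", ")", ":"], num_new_tokens=6, steps=3)
print(out)
```

More `steps` means more refinement rounds per generated span, trading latency for quality, which is why the example above uses `steps=12` for 25 new tokens.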

## Training Data

Datasets are streamed from Hugging Face and mixed by mode.

### Dataset Mix

| Dataset | Percentage | Purpose |
| --- | --- | --- |
| HuggingFaceFW/fineweb-edu (sample-10BT) | 40% | General web/edu text |
| bigcode/the-stack-dedup (python) | 30% | Python code |
| bigcode/the-stack-dedup (json) | 15% | Structured JSON |
| HuggingFaceH4/ultrachat_200k (train_sft) | 15% | Instruction chat |

Fallbacks: FineWeb-Edu may fall back to WikiText-103, and The Stack may fall back to CodeParrot, depending on availability.
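
Percentage-based mixing of streamed sources can be sketched with weighted sampling. The loader below is illustrative only (the actual streaming code is not part of this card); the short source names mirror the table above:

```python
import random

# Mixture weights from the Dataset Mix table (must sum to 1.0).
MIX = {
    "fineweb-edu": 0.40,
    "the-stack-python": 0.30,
    "the-stack-json": 0.15,
    "ultrachat": 0.15,
}

def sample_sources(n, rng):
    """Draw n source names with probability proportional to MIX weights."""
    names = list(MIX)
    weights = [MIX[k] for k in names]
    return rng.choices(names, weights=weights, k=n)

rng = random.Random(42)
draws = sample_sources(10_000, rng)
shares = {k: draws.count(k) / len(draws) for k in MIX}
print(shares)
```

In a real streaming setup each draw would pull the next example from the chosen source's iterator rather than returning the name.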

## Training Procedure

- **Steps:** 500,000
- **Batch size:** 16
- **Sequence length:** 256
- **Learning rate:** 5e-05
- **CFG dropout probability:** 0.1
- **Samples loaded into RAM:** 100,000
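
The CFG dropout probability relates to the `guidance_scale` parameter in the generation example: during training the conditioning is dropped 10% of the time so one network learns both conditional and unconditional denoising, and at inference the two logit sets are blended. A minimal sketch under that assumption (function names here are hypothetical):

```python
import random

CFG_DROPOUT = 0.1  # matches the training hyperparameter above

def maybe_drop_condition(prompt, rng):
    """With probability CFG_DROPOUT, train this step unconditionally."""
    return None if rng.random() < CFG_DROPOUT else prompt

def guided_logits(cond_logits, uncond_logits, guidance_scale):
    """Classifier-free guidance blend: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

rng = random.Random(0)
dropped = sum(maybe_drop_condition("p", rng) is None for _ in range(100_000)) / 100_000
logits = guided_logits([2.0, 0.5], [1.0, 1.0], guidance_scale=3.0)
print(dropped, logits)
```

A `guidance_scale` of 1.0 recovers the plain conditional prediction; values above 1.0 (such as 3.0 in the usage example) push the output further toward the prompt-conditioned distribution.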

## Training Time & Hardware

- **Duration:** 46h 38m 11s
- **Hardware:** 1× NVIDIA GeForce RTX 4070 Laptop GPU (CUDA available)

## Metrics (Training)

| Metric | Value |
| --- | --- |
| Training loss (latest) | 4.2869 |
| Training loss (mean) | 3.5010 |
| Training step | 500,000 / 500,000 |

## Limitations & Considerations

- The model is trained with a masked-token diffusion objective and may not behave like an autoregressive LM.
- Data sources may carry licensing or content constraints; review the source dataset cards before deployment.
- Performance can vary substantially by training mode (here, `universal`) and by prompt structure.