language: en
tags:
  - mask-predict
  - diffusion
  - masked-lm
library_name: transformers
base_model: answerdotai/ModernBERT-base
pipeline_tag: fill-mask

modernbert-diffusion-code

Model Summary

A diffusion-style masked language model, fine-tuned from answerdotai/ModernBERT-base on Python code (training mode: code) with a discrete denoising objective.

Model Details

  • Model ID: philipp-zettl/modernbert-diffusion-code
  • Base model: answerdotai/ModernBERT-base
  • Training mode: code
  • Task type: Masked token denoising / diffusion-style infilling
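As a rough illustration of what "masked token denoising" means, the sketch below corrupts a token sequence by replacing a random fraction of it with a mask token. It is an assumption-laden toy, not the training code for this model: the `corrupt` helper, the whitespace tokenization, and the uniformly sampled noise level are all hypothetical. In a setup like this, the model would typically be trained to predict the original tokens at the masked positions.

```python
import random

MASK = "[MASK]"


def corrupt(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and the sorted positions that were masked.
    At least one token is always masked so there is something to predict.
    """
    n_mask = min(len(tokens), max(1, round(mask_ratio * len(tokens))))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = MASK
    return corrupted, positions


rng = random.Random(0)
tokens = "def add ( a , b ) : return a + b".split()

# Hypothetical noise schedule: one noise level per example, drawn uniformly.
noise_level = rng.random()
corrupted, targets = corrupt(tokens, noise_level, rng)

print(" ".join(corrupted))
print("positions to reconstruct:", targets)
```

Sampling a different noise level per example is what distinguishes this style of objective from classic masked-LM training with a fixed masking rate.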

Intended Use

Intended for code completion, infilling, and refactoring tasks on Python-like code.

Example

from refinebert.diffusion_engine import MaskedDiffusionEngine

engine = MaskedDiffusionEngine("philipp-zettl/modernbert-diffusion-code")
prompt = "def fibonacci(n):"
output = engine.generate(prompt, num_new_tokens=20, steps=12, guidance_scale=3.0)
print(output)
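The call above appends `num_new_tokens` positions and fills them over `steps` denoising rounds. One common way to do this is confidence-based progressive unmasking, sketched below with a toy predictor. This is not the `refinebert` implementation: `toy_predict` and `progressive_unmask` are hypothetical names, and the real engine may schedule or re-mask tokens differently.

```python
import math

MASK = "[MASK]"


def toy_predict(tokens):
    """Hypothetical stand-in for the model.

    Returns {position: (predicted_token, confidence)} for every masked position.
    Earlier positions are given higher confidence purely for illustration.
    """
    return {
        i: (f"tok{i}", 1.0 / (1 + i))
        for i, tok in enumerate(tokens)
        if tok == MASK
    }


def progressive_unmask(prompt_tokens, num_new_tokens, steps, predict=toy_predict):
    """Fill appended MASK slots over `steps` rounds, most confident first."""
    tokens = list(prompt_tokens) + [MASK] * num_new_tokens
    for step in range(steps):
        preds = predict(tokens)
        if not preds:
            break  # nothing left to fill
        # Commit an even share of the remaining masks in each remaining step.
        n_commit = math.ceil(len(preds) / (steps - step))
        best = sorted(preds, key=lambda i: preds[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = preds[i][0]
    return tokens


prompt = ["def", "fibonacci", "(", "n", ")", ":"]
result = progressive_unmask(prompt, num_new_tokens=20, steps=12)
print(result)
```

Because every position is predicted in parallel at each round, later tokens can be committed before earlier ones, which is one reason output may differ from left-to-right generation.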

Training Data

Datasets are streamed from Hugging Face and mixed by mode.

Dataset Mix

Dataset | Percentage | Purpose
bigcode/the-stack-dedup (python) | 100% | Python code

Fallback: if The Stack is unavailable at training time, the loader may fall back to CodeParrot.

Training Procedure

  • Steps: 150000
  • Batch size: 4
  • Sequence length: 256
  • Learning rate: 5e-05
  • CFG dropout probability: 0.1
  • Samples loaded into RAM: 100000
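The CFG dropout probability above pairs with the `guidance_scale` argument in the usage example. A minimal sketch of how classifier-free guidance is commonly set up follows, under the assumption that "dropping" the condition means masking the prompt; the helper names are hypothetical and the actual training code may differ.

```python
import random

MASK = "[MASK]"


def maybe_drop_condition(prompt_tokens, p_drop, rng):
    """With probability p_drop, hide the prompt so the model also learns
    an unconditional prediction (assumed mechanism, for illustration)."""
    if rng.random() < p_drop:
        return [MASK] * len(prompt_tokens)
    return list(prompt_tokens)


def guided_logits(cond, uncond, guidance_scale):
    """Standard classifier-free guidance mix: uncond + scale * (cond - uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


# Training side: roughly 10% of prompts end up fully masked.
rng = random.Random(0)
trials = 10_000
dropped = sum(
    maybe_drop_condition(["def", "f"], 0.1, rng) == [MASK, MASK]
    for _ in range(trials)
)
drop_rate = dropped / trials

# Inference side: a scale above 1 pushes logits further toward the conditional prediction.
mixed = guided_logits(cond=[2.0, 0.0], uncond=[1.0, 1.0], guidance_scale=3.0)
print(drop_rate, mixed)
```

With `guidance_scale=1.0` the mix reduces to the conditional logits, so larger values trade diversity for closer adherence to the prompt.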

Training Time & Hardware

  • Duration: 7h 50m 28s
  • Hardware: NVIDIA GeForce RTX 2060 x1 (CUDA available)
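For a sense of scale, the reported duration and hyperparameters imply the throughput computed below. The calculation assumes one batch of 4 sequences per step with no gradient accumulation, and treats 256 tokens per sequence as an upper bound, since shorter samples would be padded.

```python
steps = 150_000
batch_size = 4
seq_len = 256
duration_s = 7 * 3600 + 50 * 60 + 28  # 7h 50m 28s

samples_seen = steps * batch_size          # sequences processed in total
max_tokens_seen = samples_seen * seq_len   # upper bound on tokens processed
steps_per_s = steps / duration_s
samples_per_s = samples_seen / duration_s

print(f"duration: {duration_s} s")
print(f"samples seen: {samples_seen:,}")
print(f"tokens seen (upper bound): {max_tokens_seen:,}")
print(f"throughput: {steps_per_s:.2f} steps/s, {samples_per_s:.2f} samples/s")
```

If the 100000 samples held in RAM are reused cyclically, which is an assumption, 600,000 processed sequences would correspond to about six passes over that buffer.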

Metrics (Training)

Metric | Value
Training loss (latest) | 3.2864
Training loss (mean) | 3.1062
Training step | 150000 / 150000
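If the reported loss is a mean cross-entropy in nats over masked tokens (an assumption, since the card does not state the loss definition), it can be converted to a perplexity for easier interpretation:

```python
import math

latest_loss = 3.2864
mean_loss = 3.1062

# Perplexity is exp(cross-entropy) when the loss is measured in nats.
ppl_latest = math.exp(latest_loss)
ppl_mean = math.exp(mean_loss)

print(f"perplexity (latest): {ppl_latest:.2f}")
print(f"perplexity (mean):   {ppl_mean:.2f}")
```

These are training-set figures at randomly sampled masking levels, so they are not directly comparable to perplexities reported for autoregressive models.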

Limitations & Considerations

  • The model is trained with a masked-token diffusion objective and may not behave like an autoregressive LM.
  • Data sources may have licensing or content constraints; review the source dataset cards before deployment.
  • This checkpoint was trained in code mode only, and performance can vary substantially with prompt structure.