	---
	language: en
	tags:
	- mask-predict
	- diffusion
	- masked-lm
	library_name: transformers
	base_model: answerdotai/ModernBERT-base
	pipeline_tag: fill-mask
	---

	# modernbert-diffusion-code

	## Model Summary
	A diffusion-style masked language model fine-tuned in `code` mode using a discrete denoising objective.

	## Model Details
	- Model ID: philipp-zettl/modernbert-diffusion-code
	- Base model: answerdotai/ModernBERT-base
	- Training mode: code
	- Task type: Masked token denoising / diffusion-style infilling
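
The card does not describe the decoding loop, so the following is only a minimal sketch of how mask-predict style infilling generally works: start from a prefix followed by mask tokens, score every masked position, and commit the most confident predictions a few at a time. `toy_denoiser`, `MASK`, and `mask_predict` are hypothetical names, and the denoiser returns random probabilities in place of the real model.

```python
import numpy as np

MASK = -1  # hypothetical mask id for this toy example, not the real tokenizer's


def toy_denoiser(tokens, vocab_size, rng):
    """Stand-in for the model: random per-position token probabilities."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def mask_predict(prefix, num_new_tokens, steps, vocab_size=50, seed=0):
    """Fill `num_new_tokens` masked slots after `prefix` in at most `steps` passes."""
    rng = np.random.default_rng(seed)
    tokens = np.array(list(prefix) + [MASK] * num_new_tokens)
    for step in range(steps):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_denoiser(tokens, vocab_size, rng)
        best_ids = probs[masked].argmax(axis=-1)
        best_conf = probs[masked].max(axis=-1)
        # Commit an even share of the remaining masks, most confident first,
        # so that the final pass fills whatever is left.
        k = int(np.ceil(masked.size / (steps - step)))
        keep = np.argsort(-best_conf)[:k]
        tokens[masked[keep]] = best_ids[keep]
    return tokens.tolist()


filled = mask_predict([1, 2, 3], num_new_tokens=20, steps=12)
```

Unlike left-to-right decoding, positions are filled in confidence order, which is why such a model can also infill the middle of a sequence.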

	## Intended Use
	Intended for code completion, infilling, and refactoring tasks on Python-like code.

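	Example, using `MaskedDiffusionEngine` from the `refinebert` package:
	```python
	from refinebert.diffusion_engine import MaskedDiffusionEngine

	# Load the fine-tuned checkpoint by its Hugging Face model ID.
	engine = MaskedDiffusionEngine("philipp-zettl/modernbert-diffusion-code")

	# The prompt is the code prefix to continue.
	prompt = "def fibonacci(n):"

	# Request 20 new tokens over 12 steps with a guidance scale of 3.0.
	output = engine.generate(prompt, num_new_tokens=20, steps=12, guidance_scale=3.0)
	print(output)
	```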
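
The `guidance_scale` argument, together with the CFG dropout listed under Training Procedure, suggests classifier-free guidance. How `MaskedDiffusionEngine` applies it internally is not documented here, so the sketch below only shows the usual formulation as an assumption: run the model once with the prompt and once without, then extrapolate from the unconditional logits toward the conditional ones. `apply_guidance` is a hypothetical helper.

```python
import numpy as np


def apply_guidance(cond_logits, uncond_logits, guidance_scale):
    """Standard classifier-free guidance mix (assumed, not taken from refinebert)."""
    cond_logits = np.asarray(cond_logits, dtype=float)
    uncond_logits = np.asarray(uncond_logits, dtype=float)
    # scale = 0 gives the unconditional logits, scale = 1 the conditional ones,
    # and scale > 1 pushes further in the direction the prompt suggests.
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)


guided = apply_guidance([2.0, 0.0], [1.0, 0.0], guidance_scale=3.0)
```

Under this formulation a larger scale makes the output follow the prompt more closely, usually at some cost in diversity.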

	## Training Data
	Datasets are streamed from Hugging Face and mixed by mode.

	### Dataset Mix
	| Dataset | Percentage | Purpose |
	| --- | --- | --- |
	| bigcode/the-stack-dedup (python) | 100% | Python code |

	Fallback: if `bigcode/the-stack-dedup` is unavailable, the data loader may fall back to CodeParrot.
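
The fallback logic itself is not shown in this card. A minimal sketch of the pattern, assuming a list of candidate loaders tried in order, might look like this. `first_available` and `hub_loaders` are hypothetical helpers, and `codeparrot/codeparrot-clean` is only a guess at which CodeParrot dataset is meant.

```python
def first_available(loaders):
    """Return (name, dataset) from the first loader that does not raise."""
    errors = []
    for name, load in loaders:
        try:
            return name, load()
        except Exception as exc:  # broad on purpose: gated, offline, or missing
            errors.append(f"{name}: {exc}")
    raise RuntimeError("no dataset available: " + "; ".join(errors))


def hub_loaders():
    """Candidate streaming loaders; the dataset choices here are assumptions."""
    # Imported lazily so first_available works without `datasets` installed.
    from datasets import load_dataset

    return [
        ("bigcode/the-stack-dedup",
         lambda: load_dataset("bigcode/the-stack-dedup", data_dir="data/python",
                              split="train", streaming=True)),
        ("codeparrot/codeparrot-clean",
         lambda: load_dataset("codeparrot/codeparrot-clean",
                              split="train", streaming=True)),
    ]
```

Calling `first_available(hub_loaders())` would then return whichever source responds first. The Stack is a gated dataset, so the first loader fails unless access has been granted and a token is configured.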

	## Training Procedure
	- Steps: 150000
	- Batch size: 4
	- Sequence length: 256
	- Learning rate: 5e-05
	- CFG dropout probability: 0.1
	- Samples loaded into RAM: 100000
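
The training script is not part of this card. As a rough illustration of what a masked-token denoising step with CFG dropout can look like, the sketch below masks a random fraction of each sequence and, with probability 0.1, also hides the prompt so the model sees unconditional examples. The per-sequence uniform masking ratio, the `prompt_mask` input, and the `corrupt_batch` name are all assumptions. The label value `-100` is the ignore index conventionally used with PyTorch cross-entropy.

```python
import numpy as np


def corrupt_batch(input_ids, prompt_mask, mask_id, cfg_dropout=0.1, rng=None):
    """Return (corrupted_ids, labels) for one denoising training step (assumed scheme)."""
    if rng is None:
        rng = np.random.default_rng()
    input_ids = np.asarray(input_ids)
    prompt_mask = np.asarray(prompt_mask, dtype=bool)
    batch, seq_len = input_ids.shape

    # One masking ratio per sequence; prompt tokens are never prediction targets.
    ratio = rng.uniform(0.0, 1.0, size=(batch, 1))
    noise_mask = (rng.uniform(size=(batch, seq_len)) < ratio) & ~prompt_mask

    # CFG dropout: for a fraction of sequences, hide the prompt as well.
    drop = rng.uniform(size=(batch, 1)) < cfg_dropout
    hidden = noise_mask | (drop & prompt_mask)

    corrupted = np.where(hidden, mask_id, input_ids)
    labels = np.where(noise_mask, input_ids, -100)  # loss only on noised positions
    return corrupted, labels


ids = np.arange(4 * 256).reshape(4, 256) + 10  # batch size 4, sequence length 256
prompt = np.zeros((4, 256), dtype=bool)
prompt[:, :16] = True  # pretend the first 16 tokens are the prompt
corrupted, labels = corrupt_batch(ids, prompt, mask_id=3)
```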

	## Training Time & Hardware
	- Duration: 7h 50m 28s
	- Hardware: NVIDIA GeForce RTX 2060 x1 (CUDA available)
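
For a sense of scale, the figures above can be combined into a rough throughput estimate. This assumes one optimizer step per batch (no gradient accumulation) and treats every sequence as filling all 256 positions, so the token count is an upper bound.

```python
steps, batch_size, seq_len = 150_000, 4, 256
duration_s = 7 * 3600 + 50 * 60 + 28  # 7h 50m 28s = 28,228 seconds

sequences_seen = steps * batch_size          # 600,000 sequences
max_tokens_seen = sequences_seen * seq_len   # 153,600,000 token positions at most
steps_per_second = steps / duration_s        # roughly 5.3 steps per second

print(sequences_seen, max_tokens_seen, round(steps_per_second, 2))
```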

	## Metrics (Training)
	| Metric | Value |
	| --- | --- |
	| Training loss (latest) | 3.2864 |
	| Training loss (mean) | 3.1062 |
	| Training step | 150000 / 150000 |

	## Limitations & Considerations
	- The model is trained with a masked-token diffusion objective and may not behave like an autoregressive LM.
	- Data sources may have licensing or content constraints. Review the source dataset cards before deployment.
	- Performance can vary substantially by mode (code) and prompt structure.