---
datasets:
  - turkish-nlp-suite/Havadis
  - turkish-nlp-suite/temiz-OSCAR
  - wikimedia/wikipedia
language:
  - tr
license: apache-2.0
metrics:
  - perplexity
library_name: transformers
pipeline_tag: text-generation
tags:
  - turkish
  - diffusion
  - masked-diffusion
  - non-autoregressive
  - foundation-model
  - dllm
---

# DiffutronLM-0.3B-Base

**DiffutronLM-0.3B-Base** is the foundational Masked Diffusion Language Model (MDLM) of the Diffutron series, tailored specifically for the Turkish language.

This model is presented in the paper [Diffutron: A Masked Diffusion Language Model for Turkish Language](https://arxiv.org/abs/2603.20466). It is the output of the Continual Pre-training (CPT) phase, which adapted the multilingual representations of its backbone to the agglutinative complexity and morphological nuances of Turkish.

โš ๏ธ Note: This is a base foundation model. It has not been instruction-tuned or aligned for chat capabilities. If you are looking for a model that follows prompts and answers questions, please use DiffutronLM-0.3B-Instruct.

## 📌 Model Details

- **Model Type:** Masked Diffusion Language Model (MDLM), base
- **Base Architecture:** `jhu-clsp/mmBERT-base` (ModernBERT-based architecture)
- **Language:** Turkish
- **Parameter Count:** 307M (0.3B)
- **Context Length:** 512 tokens
- **Training Libraries:** `dllm`, PyTorch, `transformers`
- **Status:** Foundation / base model (post-CPT)

## 🚀 Architecture & Continual Pre-training (CPT)

Unlike standard autoregressive models, Diffutron models text generation as a discrete diffusion process. To align the base encoder's latent space with the Turkish target distribution while preserving cross-lingual reasoning, this model underwent a specialized CPT pipeline:

- **Data Curation:** Trained on a composite dataset of approximately 2 million sequences (max length 512) sourced from:
  - **Havadis:** comprehensive Turkish news articles.
  - **Temiz-OSCAR:** a cleaned, filtered subset of the Common Crawl-based Turkish OSCAR corpus.
  - **Turkish Wikipedia:** high-quality encyclopedic sequences.
- **Efficient Adaptation via LoRA:** Instead of full-parameter fine-tuning, which risks catastrophic forgetting, we applied Low-Rank Adaptation (LoRA) with a high rank ($r=256$, $\alpha=256$) targeting all linear modules (attention Q, K, V, O and the MLP input/output projections). This left approximately 14.94% of parameters trainable.
- **Objective:** Masked Language Modeling (MLM).
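For intuition on why a high-rank LoRA still trains only a small fraction of the weights, the back-of-the-envelope arithmetic below counts the adapter parameters added to one transformer layer. The hidden and intermediate sizes are illustrative assumptions, not the exact mmBERT-base configuration:

```python
# Back-of-the-envelope LoRA parameter count for one transformer layer.
# HIDDEN and INTERMEDIATE are assumed illustrative dimensions, not the
# exact mmBERT-base config; R matches the rank reported for the CPT run.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter on a (d_out x d_in) linear layer adds two low-rank
    factors: A with shape (r x d_in) and B with shape (d_out x r)."""
    return r * d_in + d_out * r

R = 256               # LoRA rank used during CPT (alpha = 256 as well)
HIDDEN = 768          # assumed hidden size
INTERMEDIATE = 1152   # assumed MLP intermediate size

# Targeted modules per layer: attention Q, K, V, O plus MLP in/out projections.
per_layer = (
    4 * lora_params(HIDDEN, HIDDEN, R)        # Q, K, V, O projections
    + lora_params(HIDDEN, INTERMEDIATE, R)    # MLP input projection
    + lora_params(INTERMEDIATE, HIDDEN, R)    # MLP output projection
)

print(f"LoRA params per layer at r={R}: {per_layer:,}")  # ~2.6M per layer
```

Because the frozen base weights (including the large multilingual embedding table) dominate the total count, even this relatively heavy adapter stays well under the reported ~15% trainable fraction.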

## 📊 Intrinsic Evaluation

To quantify the improvements gained from the CPT phase, we conducted an intrinsic evaluation using perplexity on the Bilkent Turkish Writings Dataset (evaluated with a masked language modeling probability of 0.15).

The CPT process resulted in a significant reduction in perplexity, indicating a strong alignment with Turkish linguistic structures:

| Model | Perplexity ↓ |
|---|---|
| `jhu-clsp/mmBERT-base` (pre-CPT) | 3.42 |
| DiffutronLM-0.3B-Base (post-CPT) | 2.75 |

(Note: Downstream task evaluations on the CETVEL benchmark were conducted on the Instruct-tuned versions of this model.)
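For reference, masked-LM perplexity is the exponential of the mean negative log-likelihood over the masked positions. The sketch below shows the computation on hypothetical per-token losses (the numbers are made up for illustration, not taken from the evaluation above):

```python
import math

def mlm_perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood over masked tokens)."""
    return math.exp(sum(nlls) / len(nlls))

# Hypothetical per-masked-token NLLs (in nats) collected over an eval set.
nlls = [0.9, 1.1, 1.0, 1.2, 0.8]
ppl = mlm_perplexity(nlls)
print(round(ppl, 2))  # mean NLL is 1.0, so perplexity is e^1.0 ≈ 2.72
```

A lower mean NLL over masked Turkish tokens therefore translates directly into the perplexity drop reported for the CPT checkpoint.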

## 💻 Usage

As a base masked diffusion model, this checkpoint is ideal for:

1. **Further fine-tuning:** a starting point for domain-specific continued pre-training or custom instruction tuning.
2. **Masked token prediction:** filling in blanks or reconstructing corrupted text.
3. **Unconditional/conditional generation:** generating text with a discrete diffusion sampling loop (e.g., via the `dllm` library).

Because the model is non-autoregressive, standard `AutoModelForCausalLM.generate()` pipelines will not work; use a discrete diffusion generation strategy instead.
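To illustrate what such a strategy looks like, the toy loop below sketches the confidence-based iterative unmasking typical of masked diffusion decoding: start from an all-`[MASK]` sequence and repeatedly commit the most confident predictions. The `toy_model` scorer, vocabulary, and schedule are all stand-in assumptions; a real run would score slots with DiffutronLM logits via the `dllm` library rather than random guesses:

```python
import random

random.seed(0)
VOCAB = ["merhaba", "dünya", "güzel", "bir", "gün"]  # toy stand-in vocabulary
MASK = "[MASK]"

def toy_model(seq):
    """Stand-in for the denoiser: for each masked slot, return a
    (token, confidence) guess. A real run would use model logits."""
    return {i: (random.choice(VOCAB), random.random())
            for i, t in enumerate(seq) if t == MASK}

def diffusion_generate(length=8, steps=4):
    seq = [MASK] * length  # start from a fully masked sequence
    for _ in range(steps):
        preds = toy_model(seq)
        if not preds:
            break
        # Unmask the most confident half of the remaining slots this step.
        k = max(1, len(preds) // 2)
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _) in ranked[:k]:
            seq[i] = tok
    # Fill any remaining masks with the final step's best guesses.
    for i, (tok, _) in toy_model(seq).items():
        seq[i] = tok
    return seq

print(" ".join(diffusion_generate()))
```

The key design choice is the unmasking schedule: committing only the highest-confidence tokens each step lets later steps condition on earlier commitments, which is what distinguishes diffusion decoding from single-shot mask filling.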

โš ๏ธ Limitations

- **No instruction tuning:** will not respond naturally to QA prompts or instructions.
- **Multilingual backbone:** while heavily adapted to Turkish, it remains built on a multilingual encoder.
- **Context window:** restricted to 512 tokens during the base phase.

๐Ÿ“ Citation

```bibtex
@misc{diffutron2026,
      title={Diffutron: A Masked Diffusion Language Model for Turkish Language},
      author={Şuayp Talha Kocabay and Talha Rüzgar Akkuş},
      year={2026},
      eprint={2603.20466},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.20466},
}
```