---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
---

# Should We Still Pretrain Encoders with Masked Language Modeling?

This repository contains a model artifact related to the paper "Should We Still Pretrain Encoders with Masked Language Modeling?".

The paper explores the fundamental question of whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is more effective for pretraining encoder models to learn high-quality text representations. Through extensive large-scale ablations (30 models from 210M to 1B parameters, 15,000+ fine-tuning runs), the authors find that while MLM generally yields better performance, CLM-trained models are more data-efficient and offer improved fine-tuning stability. A key finding is that a biphasic training strategy (CLM then MLM) achieves optimal performance, especially when initializing from existing pretrained CLM models.

## Model Description

This model is an encoder-only transformer (SLModel architecture) with a hidden size of 1152, 26 hidden layers, and a vocabulary size of 128256, as derived from its config.json. It is designed for text representation tasks and feature extraction.
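For reference, the fields mentioned above would appear in `config.json` roughly as in the excerpt below. This is an illustrative sketch assuming the standard `transformers` configuration key names; consult the repository's actual `config.json` for the authoritative values.

```json
{
  "architectures": ["SLModel"],
  "hidden_size": 1152,
  "num_hidden_layers": 26,
  "vocab_size": 128256,
  "torch_dtype": "bfloat16"
}
```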

## Usage

This model can be used for feature extraction with the transformers library. Since SLModel is a custom architecture, trust_remote_code=True is required. The model's torch_dtype is bfloat16 as specified in its configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Replace "MLMvsCLM/your-model-id" with the actual repository ID for this model.
# Example from the research project: "MLMvsCLM/mlm-clm-210m"
model_id = "MLMvsCLM/mlm-clm-210m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Move the model and inputs to GPU if available
if torch.cuda.is_available():
    model.to("cuda")
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state contains the contextualized embedding for each token.
# For sentence embeddings, common practice is to average the token embeddings
# or to use the CLS token if present.
last_hidden_state = outputs.last_hidden_state
print(f"Shape of last_hidden_state: {last_hidden_state.shape}")
```
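As a concrete illustration of the mean-pooling approach mentioned above, the sketch below averages token embeddings while masking out padding positions. The `mean_pool` helper is our own illustration, not part of the model's API, and the tensors are dummy stand-ins shaped like this model's hidden size (1152).

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence, ignoring padding positions."""
    # Expand the mask to the hidden dimension: (batch, seq_len, 1)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero for empty masks
    return summed / counts

# Dummy example: batch of 2 sequences of length 10, hidden size 1152
hidden = torch.randn(2, 10, 1152)
mask = torch.ones(2, 10, dtype=torch.long)
mask[1, 5:] = 0  # second sequence is padded after 5 tokens
embeddings = mean_pool(hidden, mask)
print(embeddings.shape)  # torch.Size([2, 1152])
```

In practice you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]` from the snippet above instead of random tensors.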

## Citation

If you use this model or the findings from the paper in your research, please cite:

```bibtex
@misc{boizard2025should,
      title={Should We Still Pretrain Encoders with Masked Language Modeling?},
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and André Martins and Ricardo Rei and Patrick Fernandes and Ayoub Hammal and João Alves and Duarte M. Alves and Céline Hudelot and Etienne Malaboeuf and Emmanuel Malherbe and Fanny Jourdan and Gabriel Hautreux and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Pierre Colombo and Caio Corro},
      year={2025},
      eprint={2507.00994},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.00994},
}
```