Add comprehensive model card
This PR adds a comprehensive model card for the model.
It includes:
- A link to the paper: [Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)
- A link to the project page: [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
- A link to the GitHub repository: [https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)
- Relevant metadata (`license: apache-2.0`, `library_name: transformers`, `pipeline_tag: feature-extraction`) for proper discoverability on the Hugging Face Hub.
- A basic Python usage example for feature extraction.
This improves the model's documentation and usability.
README.md
ADDED
@@ -0,0 +1,66 @@
---
license: apache-2.0
library_name: transformers
pipeline_tag: feature-extraction
---

# Should We Still Pretrain Encoders with Masked Language Modeling?

This repository contains a model artifact related to the paper **[Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)**.

The paper explores whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is more effective for pretraining encoder models to learn high-quality text representations. Through extensive large-scale ablations (30 models from 210M to 1B parameters, 15,000+ fine-tuning runs), the authors find that while MLM generally yields better performance, CLM-trained models are more data-efficient and offer improved fine-tuning stability. A key finding is that a biphasic training strategy (CLM first, then MLM) achieves optimal performance, especially when initializing from existing pretrained CLM models.

* **Project Page:** [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
* **Codebase (Optimus Training Library):** The models discussed in the paper were trained using the [Optimus Training Library](https://github.com/Nicolas-BZRD/EuroBERT), a flexible and scalable framework for training language models.

## Model Description

This model is an encoder-only transformer (`SLModel`, a custom architecture) with a hidden size of 1152, 26 hidden layers, and a vocabulary size of 128256, as recorded in its `config.json`. It is intended for text representation and feature extraction tasks.

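As a quick sanity check, these dimensions can be read back from the configuration without downloading the weights. This is a minimal sketch, assuming the example repository ID from the usage section below and standard `transformers` config attribute names:

```python
from transformers import AutoConfig

# Example repository ID from the usage section below; substitute the actual ID.
model_id = "MLMvsCLM/mlm-clm-210m"

# trust_remote_code=True is required because SLModel is a custom architecture.
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)

# Attribute names assume the standard transformers conventions.
print(config.hidden_size)        # expected: 1152
print(config.num_hidden_layers)  # expected: 26
print(config.vocab_size)         # expected: 128256
```
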
## Usage

This model can be used for feature extraction with the `transformers` library. Since `SLModel` is a custom architecture, `trust_remote_code=True` is required. The model's `torch_dtype` is `bfloat16`, as specified in its configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Replace "MLMvsCLM/your-model-id" with the actual repository ID for this model.
# Example from the research project: "MLMvsCLM/mlm-clm-210m"
model_id = "MLMvsCLM/mlm-clm-210m"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, torch_dtype=torch.bfloat16, trust_remote_code=True)

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

# Move the model and inputs to the GPU if one is available.
if torch.cuda.is_available():
    model.to("cuda")
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state contains the contextualized embedding of each token.
# For sentence embeddings, common practice is to average the token embeddings
# (see the pooling sketch below) or use the CLS token if present.
last_hidden_state = outputs.last_hidden_state
print(f"Shape of last_hidden_state: {last_hidden_state.shape}")
```
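
The comment above mentions averaging token embeddings. The sketch below shows one common, attention-mask-aware way to do that, continuing from the snippet above; the pooling choice is an illustration, not a strategy prescribed by the paper:

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings, ignoring padding positions."""
    # Expand the mask to (batch, seq_len, 1) and match the hidden-state dtype.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden_size)
    counts = mask.sum(dim=1).clamp(min=1e-9)        # guard against empty masks
    return summed / counts

# Continuing from the feature-extraction snippet:
sentence_embedding = mean_pool(last_hidden_state, inputs["attention_mask"])
print(f"Shape of sentence_embedding: {sentence_embedding.shape}")  # e.g. (1, 1152)
```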

## Citation

If you use this model or the findings from the paper in your research, please cite:

```bibtex
@misc{boizard2025should,
  title={Should We Still Pretrain Encoders with Masked Language Modeling?},
  author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and André Martins and Ricardo Rei and Patrick Fernandes and Ayoub Hammal and João Alves and Duarte M. Alves and Céline Hudelot and Etienne Malaboeuf and Emmanuel Malherbe and Fanny Jourdan and Gabriel Hautreux and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Pierre Colombo and Caio Corro},
  year={2025},
  eprint={2507.00994},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.00994},
}
```