Add comprehensive model card
This PR adds a comprehensive model card for the model in this repository.
It includes:
- Metadata: `pipeline_tag: feature-extraction`, `library_name: transformers`, and `license: apache-2.0`, along with descriptive tags. This ensures better discoverability and integration within the Hugging Face Hub.
- Links: To the associated paper ([Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)), the project page ([https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)), and the GitHub repository for the `Optimus` training library ([https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)).
- Model Description: A summary of the paper's key findings regarding MLM, CLM, and biphasic pretraining strategies for encoders.
- Sample Usage: A clear Python code snippet demonstrating how to use the model with the `transformers` library for feature extraction, including the necessary `trust_remote_code=True` parameter.
- Citation: The BibTeX entry for the paper.
@@ -0,0 +1,76 @@
---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
tags:
- encoder
- text-representation
- mlm
- clm
---

This repository contains an encoder model presented in the paper [Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994).

The project page for this research can be found at: [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
The code repository for the Optimus training library, used to train this model, is available at: [https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)

## Model Description

This model is part of a large-scale study investigating whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is more effective for pretraining text encoders to learn high-quality text representations. The research involved training 30 models ranging from 210 million to 1 billion parameters.

Key findings indicate that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these observations, the paper proposes and experimentally shows that a biphasic training strategy, which sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. This strategy is particularly appealing when initializing from readily available pretrained CLM models in the existing LLM ecosystem, reducing the computational burden of training best-in-class encoder models. All project artifacts are released to foster further research.
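
For a rough sense of what the two phases optimize, the sketch below contrasts the CLM and MLM label schemes using `transformers`' `DataCollatorForLanguageModeling`. It is only an illustration with a placeholder tokenizer (`bert-base-uncased`), not the paper's pretraining pipeline, which is implemented in the Optimus library linked above.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder tokenizer for illustration only; not the tokenizer used in the paper.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Phase 1 objective (CLM): labels are a copy of the inputs; a causal LM head shifts
# them internally so each position predicts the next token.
clm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Phase 2 objective (MLM): a fraction of tokens is replaced by the mask token, and
# only those positions contribute to the loss.
mlm_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("Encoders learn bidirectional text representations.")
print(clm_collator([encoding])["labels"])  # labels mirror the input ids
print(mlm_collator([encoding])["labels"])  # -100 everywhere except the masked positions
```

In the biphasic setup described above, the MLM phase continues from the weights produced by the CLM phase rather than starting from scratch.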

## Usage

You can load and use this model with the `transformers` library for feature extraction. Because the model uses a custom architecture (`SLModel`), pass `trust_remote_code=True` when loading both the tokenizer and the model.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Replace "your_model_id" with the actual ID of this model repository on the Hugging Face Hub
model_id = "your_model_id"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

text = "Learning high-quality text representations is fundamental to a wide range of NLP tasks."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Move the model and inputs to the appropriate device (e.g., GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state contains the contextualized embedding of each token
# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_state = outputs.last_hidden_state
print(f"Shape of last_hidden_state: {last_hidden_state.shape}")

# A simple pooled sentence embedding: take the first token (e.g., the [BOS]/[CLS] token).
# Mean pooling over non-padding tokens is another common choice (see the sketch below).
pooled_output = last_hidden_state[:, 0, :]
print(f"Shape of pooled_output (first token): {pooled_output.shape}")
```
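
The snippet above uses first-token pooling. If you prefer a mean-pooled sentence embedding, a minimal follow-on sketch is shown below; it reuses `last_hidden_state` and `inputs` from the snippet and excludes padding positions via the attention mask. This is a generic pooling recipe, not one prescribed by the paper.

```python
# Mean pooling over non-padding tokens (reuses `last_hidden_state` and `inputs` from above)
attention_mask = inputs["attention_mask"].unsqueeze(-1)   # (batch_size, sequence_length, 1)
summed = (last_hidden_state * attention_mask).sum(dim=1)  # sum of real-token embeddings
counts = attention_mask.sum(dim=1).clamp(min=1)           # number of real tokens per example
mean_pooled = summed / counts                             # (batch_size, hidden_size)
print(f"Shape of mean_pooled: {mean_pooled.shape}")
```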

## Citation

If you use this model in your research, please cite the corresponding paper:

```bibtex
@misc{boizard2025shouldwe,
      title={Should We Still Pretrain Encoders with Masked Language Modeling?},
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and Andr{\'e} Martins and Ayoub Hammal and Caio Corro and C{\'e}line Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and Jo{\~a}o Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2507.00994},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.00994},
}
```