Add model card for EuroBERT-210M related to MLM vs CLM paper

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +68 -0
README.md ADDED
---
pipeline_tag: feature-extraction
library_name: transformers
license: apache-2.0
---

# EuroBERT-210M: Investigating Pretraining Strategies for Encoders

This repository contains `EuroBERT-210M`, an encoder model trained as part of the research presented in the paper **"[Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)"**.

The paper addresses a fundamental question in NLP: whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is the more effective pretraining objective for learning high-quality text representations with encoders. Through a series of large-scale, carefully controlled pretraining ablations, the authors show that while MLM generally yields better performance on text representation tasks, CLM-trained models are more data-efficient and more stable during fine-tuning. Building on these insights, the paper introduces a biphasic training strategy that applies CLM first and MLM second, achieving the best performance under a fixed computational budget. This strategy is particularly appealing when initializing from readily available pretrained CLM models in the existing LLM ecosystem, as it reduces the compute needed to train best-in-class encoder models.
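
To make the two objectives concrete, below is a minimal sketch of how their training targets differ. It uses toy tensors and placeholder values (`vocab_size`, `mask_id`, the 15% masking ratio), and is not the paper's actual training code:

```python
import torch

# Toy batch of token ids; vocab_size and mask_id are illustrative placeholders.
vocab_size, mask_id, ignore_index = 100, 99, -100
input_ids = torch.randint(0, vocab_size - 2, (2, 8))  # (batch, seq_len)

# MLM: corrupt a random subset of positions and predict only those tokens.
mask_ratio = 0.15  # a conventional value; the paper ablates the masking ratio
is_masked = torch.rand(input_ids.shape) < mask_ratio
mlm_inputs = input_ids.masked_fill(is_masked, mask_id)
mlm_labels = input_ids.masked_fill(~is_masked, ignore_index)  # loss only on masked positions

# CLM: inputs stay intact and every position predicts the next token.
clm_inputs = input_ids[:, :-1]
clm_labels = input_ids[:, 1:]  # labels are the inputs shifted by one

print(mlm_inputs.shape, mlm_labels.shape)  # torch.Size([2, 8]) torch.Size([2, 8])
print(clm_inputs.shape, clm_labels.shape)  # torch.Size([2, 7]) torch.Size([2, 7])
```

Under the biphasic strategy, a model is first pretrained on the CLM targets (or initialized from an existing CLM checkpoint) and then continued on the MLM targets.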

This model is part of the broader `EuroBERT` project, which utilizes the `Optimus` training library for efficient language model development.

- **Paper**: [Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)
- **Project Page**: [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
- **Code**: [https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)

## Usage

You can load and use this model for feature extraction with the `transformers` library. Note that `trust_remote_code=True` is required due to custom architectural components (`SLModel`).

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Replace with the actual model ID if different (e.g., 'org/repo-name')
model_id = "EuroBERT/EuroBERT-210m"  # Assuming this is the model ID for this repository

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

text = "This is an example sentence to obtain embeddings for."
inputs = tokenizer(text, return_tensors="pt")

# Move to GPU if available
# if torch.cuda.is_available():
#     model = model.to("cuda")
#     inputs = {k: v.to("cuda") for k, v in inputs.items()}

with torch.no_grad():
    outputs = model(**inputs)

# The last hidden state contains the token embeddings.
# For EuroBERT-210M (hidden_size=768), the shape for a single input is (1, num_tokens, 768).
last_hidden_states = outputs.last_hidden_state
print(f"Shape of last hidden states: {last_hidden_states.shape}")

# For sentence-level embeddings, a common approach is mean pooling:
sentence_embedding = last_hidden_states.mean(dim=1)
print(f"Shape of sentence embedding: {sentence_embedding.shape}")
```
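
When embedding several sentences in one batch, padding tokens make a plain `.mean(dim=1)` misleading. Here is a minimal sketch of attention-mask-aware mean pooling, using the same assumed model ID as above:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"  # assumed model ID, as above
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
if tokenizer.pad_token is None:  # defensive: make sure padding is defined
    tokenizer.pad_token = tokenizer.eos_token

texts = [
    "A short sentence.",
    "A noticeably longer sentence that will receive padding in the batch.",
]
inputs = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Zero out padding positions before averaging, then divide by the true lengths.
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
sentence_embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
print(sentence_embeddings.shape)  # (batch, hidden)
```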

## Citation

If you use this model or the associated research in your work, please cite the original paper:

```bibtex
@misc{boizard2025shouldwe,
      title={Should We Still Pretrain Encoders with Masked Language Modeling?},
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and Andr{\'e} Martins and Ayoub Hammal and Caio Corro and C{\'e}line Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and Jo{\~a}o Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2507.00994},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.00994},
}
```