nielsr (HF Staff) committed
Commit bc8cf81 · verified · 1 Parent(s): 295e82c

Add model card for EuroBERT-210M related to MLM vs CLM paper


This PR adds a comprehensive model card for the `EuroBERT-210M` model. This model is associated with the paper "[Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)", which investigates optimal pretraining strategies for encoders.

The model card now includes:
- Essential metadata (`pipeline_tag`, `library_name`, `license`).
- A concise summary of the paper's key findings and contributions.
- Direct links to the paper, the project page (`https://hf.co/MLMvsCLM`), and the underlying GitHub repository (`https://github.com/Nicolas-BZRD/EuroBERT`).
- A practical usage example demonstrating how to load the model and extract text features using the `transformers` library, including the necessary `trust_remote_code=True` for this custom architecture.
- A BibTeX citation for the paper.

This update significantly improves the model's discoverability and provides users with critical information for understanding, using, and citing the artifact.

Files changed (1)
  1. README.md +68 -0
README.md ADDED
@@ -0,0 +1,68 @@
+ ---
+ pipeline_tag: feature-extraction
+ library_name: transformers
+ license: apache-2.0
+ ---
+
+ # EuroBERT-210M: Investigating Pretraining Strategies for Encoders
+
+ This repository contains the `EuroBERT-210M` model, an encoder model trained in the context of the research presented in the paper **"[Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)"**.
+
+ The paper addresses a fundamental question in NLP: whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is the more effective pretraining objective for learning high-quality text representations in encoders. Through a series of large-scale, carefully controlled pretraining ablations, the research demonstrates that while MLM generally yields better performance across text representation tasks, CLM-trained models exhibit superior data efficiency and improved fine-tuning stability. Building on these insights, the paper introduces a biphasic training strategy that sequentially applies CLM followed by MLM, achieving optimal performance under a fixed computational budget. This strategy is particularly appealing when initializing from readily available pretrained CLM models from the existing LLM ecosystem, reducing the computational burden required to train best-in-class encoder models.
+
+ This model is part of the broader `EuroBERT` project, which utilizes the `Optimus` training library for efficient language model development.
+
+ - **Paper**: [Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)
+ - **Project Page**: [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
+ - **Code**: [https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)
+
+ ## Usage
+
+ You can load and use this model for feature extraction with the `transformers` library. Note that `trust_remote_code=True` is required due to custom architectural components (`SLModel`).
+
+ ```python
+ from transformers import AutoModel, AutoTokenizer
+ import torch
+
+ # Replace with the actual model ID if different (e.g., 'org/repo-name')
+ model_id = "EuroBERT/EuroBERT-210m"  # Assuming this is the model ID for this repository
+
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
+
+ text = "This is an example sentence to obtain embeddings for."
+ inputs = tokenizer(text, return_tensors="pt")
+
+ # Move to GPU if available
+ # if torch.cuda.is_available():
+ #     model = model.to("cuda")
+ #     inputs = {k: v.to("cuda") for k, v in inputs.items()}
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # The last hidden state contains the token embeddings,
+ # with shape (batch_size, num_tokens, hidden_size)
+ last_hidden_states = outputs.last_hidden_state
+ print(f"Shape of last hidden states: {last_hidden_states.shape}")
+
+ # For sentence-level embeddings, a common approach is mean pooling:
+ sentence_embedding = last_hidden_states.mean(dim=1)
+ print(f"Shape of sentence embedding: {sentence_embedding.shape}")
+ ```
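+
+ If you encode a batch of sentences with padding, a common refinement is to average only over non-padding tokens. The snippet below is a minimal sketch of attention-mask-aware mean pooling, continuing from the example above; it assumes the tokenizer exposes (or can be assigned) a padding token, and the example sentences are placeholders.
+
+ ```python
+ # Ensure a padding token is available (assumption: the eos token can serve as padding)
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+
+ batch = tokenizer(
+     ["A short example sentence.", "A noticeably longer example sentence that forces padding."],
+     padding=True,
+     return_tensors="pt",
+ )
+
+ with torch.no_grad():
+     hidden = model(**batch).last_hidden_state         # (batch_size, seq_len, hidden_size)
+
+ # Zero out padding positions, then average over the remaining tokens
+ mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch_size, seq_len, 1)
+ summed = (hidden * mask).sum(dim=1)
+ counts = mask.sum(dim=1).clamp(min=1e-9)              # number of real tokens per sentence
+ batch_embeddings = summed / counts                    # (batch_size, hidden_size)
+ print(f"Shape of batch embeddings: {batch_embeddings.shape}")
+ ```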
+
+ ## Citation
+
+ If you use this model or the associated research in your work, please cite the original paper:
+
+ ```bibtex
+ @misc{boizard2025shouldwe,
+       title={Should We Still Pretrain Encoders with Masked Language Modeling?},
+       author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and Andr{\'e} Martins and Ayoub Hammal and Caio Corro and C{\'e}line Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and Jo{\~a}o Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
+       year={2025},
+       eprint={2507.00994},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2507.00994},
+ }
+ ```