nielsr HF Staff committed on
Commit de5f12e · verified · 1 Parent(s): 23c0dec

Add comprehensive model card


This PR adds a comprehensive model card for the model in this repository.
It includes:
- Metadata: `pipeline_tag: feature-extraction`, `library_name: transformers`, and `license: apache-2.0`, along with descriptive tags. This ensures better discoverability and integration within the Hugging Face Hub.
- Links: To the associated paper ([Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)), the project page ([https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)), and the GitHub repository for the `Optimus` training library ([https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)).
- Model Description: A summary of the paper's key findings regarding MLM, CLM, and biphasic pretraining strategies for encoders.
- Sample Usage: A clear Python code snippet demonstrating how to use the model with the `transformers` library for feature extraction, including the necessary `trust_remote_code=True` parameter.
- Citation: The BibTeX entry for the paper.

Files changed (1)
  1. README.md +76 -0
README.md ADDED
@@ -0,0 +1,76 @@
---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
tags:
- encoder
- text-representation
- mlm
- clm
---

This repository contains an encoder model presented in the paper [Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994).

The project page for this research can be found at: [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
The code repository for the Optimus training library, used to train this model, is available at: [https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)

## Model Description

This model is part of a large-scale study investigating whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is more effective for pretraining text encoders to learn high-quality text representations. The study trained 30 models ranging from 210 million to 1 billion parameters.

Key findings indicate that while MLM pretraining generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and exhibit improved fine-tuning stability. Building on these observations, the paper proposes, and experimentally validates, a biphasic training strategy that sequentially applies CLM and then MLM, achieving optimal performance under a fixed computational training budget. This strategy is particularly appealing when initializing from readily available pretrained CLM models in the existing LLM ecosystem, reducing the computational cost of training best-in-class encoder models. All project artifacts are released to foster further research.
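
As a rough illustration of the biphasic recipe described above, the sketch below switches the pretraining objective from CLM to MLM partway through a fixed step budget using Hugging Face data collators. The tokenizer, step counts, and split ratio are placeholders chosen for illustration; this is not the paper's Optimus training code, which is linked above.

```python
# Illustrative sketch only: hypothetical budget split, not the paper's training code.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer

TOTAL_STEPS = 100_000         # fixed training budget (hypothetical)
CLM_STEPS = TOTAL_STEPS // 2  # steps spent on CLM before switching (hypothetical split)

# Phase 1: causal language modeling, labels are the (shifted) input tokens.
clm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Phase 2: masked language modeling, a fraction of tokens is masked and predicted.
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

def collator_for_step(step: int):
    """Select the pretraining objective for the current step."""
    return clm_collator if step < CLM_STEPS else mlm_collator
```
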
## Usage

You can load and use this model with the `transformers` library for feature extraction. Because the model uses a custom architecture (`SLModel`), pass `trust_remote_code=True` when loading it.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Replace "your_model_id" with the actual ID of this model repository on the Hugging Face Hub
model_id = "your_model_id"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

text = "Learning high-quality text representations is fundamental to a wide range of NLP tasks."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Move the model and inputs to the appropriate device (e.g., GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state contains the contextualized embedding of each token
# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_state = outputs.last_hidden_state
print(f"Shape of last_hidden_state: {last_hidden_state.shape}")

# For a pooled sentence embedding, take the first token's embedding (e.g., the [BOS] token),
# or apply mean pooling across tokens (excluding padding), as sketched below
pooled_output = last_hidden_state[:, 0, :]
print(f"Shape of pooled_output (first token): {pooled_output.shape}")
```

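The snippet above uses first-token pooling. The following short sketch shows attention-mask-aware mean pooling as an alternative; the pooling choice is an assumption for illustration, not something prescribed by the model card or the paper.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Expand the attention mask to the hidden dimension: (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    # Sum the embeddings of real (non-padding) tokens and divide by their count
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Using the variables from the snippet above:
sentence_embedding = mean_pool(last_hidden_state, inputs["attention_mask"])
print(f"Shape of mean-pooled embedding: {sentence_embedding.shape}")
```
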
## Citation

If you use this model in your research, please cite the corresponding paper:

```bibtex
@misc{boizard2025shouldwe,
      title={Should We Still Pretrain Encoders with Masked Language Modeling?},
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and Andr{\'e} Martins and Ayoub Hammal and Caio Corro and C{\'e}line Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and Jo{\~a}o Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2507.00994},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.00994},
}
```