nielsr HF Staff committed on
Commit de5f12e · verified · 1 Parent(s): 23c0dec

Add comprehensive model card


This PR adds a comprehensive model card for the model in this repository.
It includes:
- Metadata: `pipeline_tag: feature-extraction`, `library_name: transformers`, and `license: apache-2.0`, along with descriptive tags. This ensures better discoverability and integration within the Hugging Face Hub.
- Links: To the associated paper ([Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)), the project page ([https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)), and the GitHub repository for the `Optimus` training library ([https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)).
- Model Description: A summary of the paper's key findings regarding MLM, CLM, and biphasic pretraining strategies for encoders.
- Sample Usage: A clear Python code snippet demonstrating how to use the model with the `transformers` library for feature extraction, including the necessary `trust_remote_code=True` parameter.
- Citation: The BibTeX entry for the paper.

Files changed (1)
  1. README.md +76 -0
README.md ADDED
@@ -0,0 +1,76 @@
---
license: apache-2.0
pipeline_tag: feature-extraction
library_name: transformers
tags:
- encoder
- text-representation
- mlm
- clm
---

This repository contains an encoder model presented in the paper [Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994).

The project page for this research can be found at: [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
The code repository for the Optimus training library, used to train this model, is available at: [https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)

## Model Description

This model is part of a large-scale study investigating whether Masked Language Modeling (MLM) or Causal Language Modeling (CLM) is more effective for pretraining text encoders to learn high-quality text representations. The study trained 30 models ranging from 210 million to 1 billion parameters.

Key findings indicate that while MLM pretraining generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and exhibit improved fine-tuning stability. Building on these observations, the paper proposes, and experimentally validates, a biphasic training strategy that sequentially applies CLM and then MLM, achieving optimal performance under a fixed computational training budget. This strategy is particularly appealing when initializing from readily available pretrained CLM models in the existing LLM ecosystem, reducing the computational cost of training best-in-class encoder models. All project artifacts are released to foster further research.
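
As a rough illustration of the biphasic recipe described above, the sketch below switches the pretraining objective from CLM to MLM partway through a fixed step budget using Hugging Face data collators. The tokenizer, step counts, and split ratio are placeholders chosen for illustration; this is not the paper's Optimus training code, which is linked above.

```python
# Illustrative sketch only: hypothetical budget split, not the paper's training code.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer

TOTAL_STEPS = 100_000         # fixed training budget (hypothetical)
CLM_STEPS = TOTAL_STEPS // 2  # steps spent on CLM before switching (hypothetical split)

# Phase 1: causal language modeling, labels are the (shifted) input tokens.
clm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Phase 2: masked language modeling, a fraction of tokens is masked and predicted.
mlm_collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

def collator_for_step(step: int):
    """Select the pretraining objective for the current step."""
    return clm_collator if step < CLM_STEPS else mlm_collator
```
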
## Usage

You can load and use this model with the `transformers` library for feature extraction. Because the model uses a custom architecture (`SLModel`), pass `trust_remote_code=True` when loading it.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Replace "your_model_id" with the actual ID of this model repository on the Hugging Face Hub
model_id = "your_model_id"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

text = "Learning high-quality text representations is fundamental to a wide range of NLP tasks."

# Tokenize the input text
inputs = tokenizer(text, return_tensors="pt")

# Move the model and inputs to the appropriate device (e.g., GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Get model outputs
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state contains the contextualized embedding of each token
# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_state = outputs.last_hidden_state
print(f"Shape of last_hidden_state: {last_hidden_state.shape}")

# For a pooled sentence embedding, take the first token's embedding (e.g., the [BOS] token),
# or apply mean pooling across tokens (excluding padding), as sketched below
pooled_output = last_hidden_state[:, 0, :]
print(f"Shape of pooled_output (first token): {pooled_output.shape}")
```

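The snippet above uses first-token pooling. The following short sketch shows attention-mask-aware mean pooling as an alternative; the pooling choice is an assumption for illustration, not something prescribed by the model card or the paper.

```python
import torch

def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Expand the attention mask to the hidden dimension: (batch, seq_len, hidden)
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    # Sum the embeddings of real (non-padding) tokens and divide by their count
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

# Using the variables from the snippet above:
sentence_embedding = mean_pool(last_hidden_state, inputs["attention_mask"])
print(f"Shape of mean-pooled embedding: {sentence_embedding.shape}")
```
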
## Citation

If you use this model in your research, please cite the corresponding paper:

```bibtex
@misc{boizard2025shouldwe,
      title={Should We Still Pretrain Encoders with Masked Language Modeling?},
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and Andr{\'e} Martins and Ayoub Hammal and Caio Corro and C{\'e}line Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and Jo{\~a}o Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2507.00994},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.00994},
}
```