Add link to GitHub repository and refine usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +4 -4
README.md CHANGED
```diff
@@ -1,7 +1,7 @@
 ---
-pipeline_tag: feature-extraction
 library_name: transformers
 license: apache-2.0
+pipeline_tag: feature-extraction
 ---
 
 # Overview
@@ -9,6 +9,7 @@ license: apache-2.0
 This repository contains an encoder model, part of the research presented in the paper *Should We Still Pretrain Encoders with Masked Language Modeling?* (Gisserot-Boukhlef et al.).
 
 * **Paper:** [Should We Still Pretrain Encoders with Masked Language Modeling?](https://huggingface.co/papers/2507.00994)
+* **Code:** [https://github.com/Nicolas-BZRD/EuroBERT](https://github.com/Nicolas-BZRD/EuroBERT)
 * **Blog post:** [Link](https://huggingface.co/blog/Nicolas-BZRD/encoders-should-not-be-only-pre-trained-with-mlm)
 * **Project page:** [https://hf.co/MLMvsCLM](https://hf.co/MLMvsCLM)
 
@@ -37,9 +38,8 @@ You can use this model for feature extraction with the Hugging Face `transformers`
 from transformers import AutoTokenizer, AutoModel
 import torch
 
-# Replace with the actual model ID if different, e.g., "AhmedAliHassan/MLMvsCLM-Biphasic-210M"
-# This placeholder assumes the current repository is the model you want to load.
-model_name = "<YOUR_MODEL_ID_HERE>"
+# This example uses a representative model ID from the paper's artifacts.
+model_name = "AhmedAliHassan/MLMvsCLM-Biphasic-210M"
 
 # Load the tokenizer and model, ensuring trust_remote_code for custom architectures
 tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
```
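
For context, here is a minimal sketch of how the refined usage example could be completed end to end. It reuses the representative model ID and the `trust_remote_code=True` loading shown in the diff; the input sentence and the mean-pooling step are illustrative assumptions, not something the model card prescribes.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Representative model ID taken from the diff above.
model_name = "AhmedAliHassan/MLMvsCLM-Biphasic-210M"

# Load the tokenizer and model, ensuring trust_remote_code for custom architectures.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = ["Should we still pretrain encoders with masked language modeling?"]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool token embeddings over non-padding positions (one common pooling choice).
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (batch_size, hidden_size)
```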