---
datasets:
- tiiuae/falcon-refinedweb
language:
- en
library_name: transformers
license: mit
pipeline_tag: feature-extraction
---

# NeoBERT

[NeoBERT on the Hugging Face Hub](https://huggingface.co/chandar-lab/NeoBERT)

NeoBERT is a **next-generation encoder** model for English text representation, pre-trained from scratch on the RefinedWeb dataset. It integrates state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. It is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an **optimal depth-to-width ratio**, and leverages an extended context length of **4,096 tokens**. Despite its compact 250M parameter footprint, it is the most efficient model of its kind and achieves **state-of-the-art results** on the massive MTEB benchmark, outperforming BERT-large, RoBERTa-large, NomicBERT, and ModernBERT under identical fine-tuning conditions.

- Paper: [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)
- Repository: [GitHub](https://github.com/chandar-lab/NeoBERT)

## Get started

Ensure you have the following dependencies installed:

```bash
pip install transformers torch xformers==0.0.28.post3
```

If you would like to use sequence packing (un-padding), you will also need to install flash-attention:

```bash
pip install transformers torch xformers==0.0.28.post3 flash_attn
```
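
To confirm that the dependencies resolved correctly before loading the model, a minimal sanity check (assuming only the packages from the commands above) could look like this:

```python
# Minimal check that the required and optional dependencies are importable.
import torch
import transformers
import xformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("xformers:", xformers.__version__)

try:
    import flash_attn  # optional, only needed for sequence packing (un-padding)
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed")
```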

## How to use

Load the model using Hugging Face Transformers:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings and use the first ([CLS]) token as the sequence representation
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)
```
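
Building on the snippet above, here is a sketch of how these [CLS] embeddings could be used for sentence similarity. The batching, truncation to the 4,096-token context, and cosine similarity are illustrative choices rather than an official recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = [
    "NeoBERT is a next-generation encoder model.",
    "NeoBERT is an efficient text encoder.",
    "The weather in Montreal is cold in winter.",
]

# Batch-tokenize with padding; inputs longer than the 4,096-token context are truncated.
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] (first) token as the sentence embedding and L2-normalize it.
embeddings = outputs.last_hidden_state[:, 0, :]
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

# Cosine similarity between the first sentence and the other two.
similarities = embeddings[0] @ embeddings[1:].T
print(similarities)
```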

## Features

| **Feature**              | **NeoBERT**    |
|--------------------------|----------------|
| `Depth-to-width`         | 28 × 768       |
| `Parameter count`        | 250M           |
| `Activation`             | SwiGLU         |
| `Positional embeddings`  | RoPE           |
| `Normalization`          | Pre-RMSNorm    |
| `Data Source`            | RefinedWeb     |
| `Data Size`              | 2.8 TB         |
| `Tokenizer`              | google/bert    |
| `Context length`         | 4,096          |
| `MLM Masking Rate`       | 20%            |
| `Optimizer`              | AdamW          |
| `Scheduler`              | CosineDecay    |
| `Training Tokens`        | 2.1 T          |
| `Efficiency`             | FlashAttention |
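
The architectural choices above are recorded in the configuration shipped with the checkpoint. A minimal way to inspect them and to verify the advertised ~250M parameter footprint (the exact config field names depend on the model's remote code, so treat this as a sketch):

```python
from transformers import AutoConfig, AutoModel

model_name = "chandar-lab/NeoBERT"

# Print the checkpoint's configuration (depth, width, context length, ...).
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
print(config)

# Count parameters to check the ~250M figure from the table.
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.0f}M")
```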

## License

The model weights and the code repository are licensed under the permissive MIT license.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{breton2025neobertnextgenerationbert,
  title={NeoBERT: A Next-Generation BERT},
  author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
  year={2025},
  eprint={2502.19587},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.19587},
}
```

## Contact

For questions, do not hesitate to reach out by opening an issue here or on our **[GitHub](https://github.com/chandar-lab/NeoBERT)**.

---