---
datasets:
- tiiuae/falcon-refinedweb
language:
- en
library_name: transformers
license: mit
pipeline_tag: feature-extraction
---

# NeoBERT

[NeoBERT on the Hugging Face Hub](https://huggingface.co/chandar-lab/NeoBERT)

NeoBERT is a **next-generation encoder** model for English text representation, pre-trained from scratch on the RefinedWeb dataset. It integrates state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. It is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an **optimal depth-to-width ratio**, and leverages an extended context length of **4,096 tokens**. Despite its compact 250M parameter footprint, it is the most efficient model of its kind and achieves **state-of-the-art results** on the massive MTEB benchmark, outperforming BERT-large, RoBERTa-large, NomicBERT, and ModernBERT under identical fine-tuning conditions.

- Paper: [arXiv:2502.19587](https://arxiv.org/abs/2502.19587)
- Repository: [GitHub](https://github.com/chandar-lab/NeoBERT)

## Get started

Ensure you have the following dependencies installed:

```bash
pip install transformers torch xformers==0.0.28.post3
```

If you would like to use sequence packing (un-padding), you will also need to install flash-attention:

```bash
pip install transformers torch xformers==0.0.28.post3 flash_attn
```
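
To confirm that the dependencies resolved correctly before loading the model, a minimal sanity check (assuming only the packages from the commands above) could look like this:

```python
# Minimal check that the required and optional dependencies are importable.
import torch
import transformers
import xformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("xformers:", xformers.__version__)

try:
    import flash_attn  # optional, only needed for sequence packing (un-padding)
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed")
```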

## How to use

Load the model using Hugging Face Transformers:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# Tokenize input text
text = "NeoBERT is the most efficient model of its kind!"
inputs = tokenizer(text, return_tensors="pt")

# Generate embeddings and use the first ([CLS]) token as the sequence representation
outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0, :]
print(embedding.shape)
```
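
Building on the snippet above, here is a sketch of how these [CLS] embeddings could be used for sentence similarity. The batching, truncation to the 4,096-token context, and cosine similarity are illustrative choices rather than an official recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "chandar-lab/NeoBERT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()

sentences = [
    "NeoBERT is a next-generation encoder model.",
    "NeoBERT is an efficient text encoder.",
    "The weather in Montreal is cold in winter.",
]

# Batch-tokenize with padding; inputs longer than the 4,096-token context are truncated.
inputs = tokenizer(sentences, padding=True, truncation=True, max_length=4096, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Take the [CLS] (first) token as the sentence embedding and L2-normalize it.
embeddings = outputs.last_hidden_state[:, 0, :]
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

# Cosine similarity between the first sentence and the other two.
similarities = embeddings[0] @ embeddings[1:].T
print(similarities)
```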

## Features

| **Feature**              | **NeoBERT**    |
|--------------------------|----------------|
| `Depth-to-width`         | 28 × 768       |
| `Parameter count`        | 250M           |
| `Activation`             | SwiGLU         |
| `Positional embeddings`  | RoPE           |
| `Normalization`          | Pre-RMSNorm    |
| `Data Source`            | RefinedWeb     |
| `Data Size`              | 2.8 TB         |
| `Tokenizer`              | google/bert    |
| `Context length`         | 4,096          |
| `MLM Masking Rate`       | 20%            |
| `Optimizer`              | AdamW          |
| `Scheduler`              | CosineDecay    |
| `Training Tokens`        | 2.1 T          |
| `Efficiency`             | FlashAttention |
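
The architectural choices above are recorded in the configuration shipped with the checkpoint. A minimal way to inspect them and to verify the advertised ~250M parameter footprint (the exact config field names depend on the model's remote code, so treat this as a sketch):

```python
from transformers import AutoConfig, AutoModel

model_name = "chandar-lab/NeoBERT"

# Print the checkpoint's configuration (depth, width, context length, ...).
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
print(config)

# Count parameters to check the ~250M figure from the table.
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.0f}M")
```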

## License

The model weights and the code repository are licensed under the permissive MIT license.

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{breton2025neobertnextgenerationbert,
  title={NeoBERT: A Next-Generation BERT},
  author={Lola Le Breton and Quentin Fournier and Mariam El Mezouar and Sarath Chandar},
  year={2025},
  eprint={2502.19587},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.19587},
}
```

## Contact

For questions, do not hesitate to reach out by opening an issue here or on our **[GitHub](https://github.com/chandar-lab/NeoBERT)**.

---