---
license: mit
language:
- en
---

# mmBERT Checkpoints

[License: MIT](https://opensource.org/licenses/MIT)
[Paper](https://arxiv.org/abs/2509.06888)
[Models](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[Code](https://github.com/jhu-clsp/mmBERT)

This repository contains the raw training checkpoints for the mmBERT models. Each model has three subfolders, one per training phase: `pretrain` (pre-training), `ext` (context extension / mid-training), and `decay` (decay phase).
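
If you only need one phase for one model, the relevant files can be fetched without cloning the whole repository. Below is a minimal sketch using `huggingface_hub`; the `repo_id` and glob pattern are placeholders, so adjust them to this repository's actual id and the model/phase you want:

```python
from huggingface_hub import snapshot_download

# Fetch only the decay-phase checkpoints for one model.
# NOTE: repo_id and the folder glob are placeholders -- point them at
# this repository's actual id and the model/phase you want.
local_dir = snapshot_download(
    repo_id="jhu-clsp/mmbert-checkpoints",   # hypothetical repo id
    repo_type="model",
    allow_patterns=["mmbert-base/decay/*"],  # hypothetical model/phase path
)
print("checkpoints downloaded to:", local_dir)
```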
These files are in Composer's checkpoint format and contain all state needed to resume pre-training (model weights plus optimizer, scheduler, and RNG state). Please see the [ModernBERT repository](https://github.com/AnswerDotAI/ModernBERT) for usage details.
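
As a rough sketch of what resumption looks like, assuming Composer's standard `load_path` mechanism: the checkpoint filename, model id, and toy dataloader below are placeholders, and the ModernBERT training code linked above is the authoritative entry point.

```python
from composer import Trainer
from composer.models import HuggingFaceModel
from torch.utils.data import DataLoader
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Placeholder model (id assumed); the real runs use the ModernBERT
# training code, and the architecture must match the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
hf_model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")
model = HuggingFaceModel(hf_model, tokenizer=tokenizer)

# Toy MLM-style batch so the sketch is self-contained.
enc = tokenizer(["hello world"] * 8, return_tensors="pt", padding=True)
enc["labels"] = enc["input_ids"].clone()
dataset = [{k: v[i] for k, v in enc.items()} for i in range(8)]

trainer = Trainer(
    model=model,
    train_dataloader=DataLoader(dataset, batch_size=4),
    max_duration="10ba",  # toy budget; set to your real target
    # load_path restores model/optimizer/scheduler/RNG state from a
    # downloaded Composer checkpoint (filename assumed).
    load_path="pretrain/latest-rank0.pt",
)
trainer.fit()
```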
## Related Resources

- **Models**: [mmBERT Model Suite](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
- **Phase 1**: [Pre-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) (2.3T tokens)
- **Phase 2**: [Mid-training Data](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) (600B tokens)
- **Phase 3**: [Decay Phase Data](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) (100B tokens)
- **Paper**: [arXiv:2509.06888](https://arxiv.org/abs/2509.06888)
- **Code**: [GitHub Repository](https://github.com/jhu-clsp/mmBERT)

## Citation

```bibtex
@misc{marone2025mmbertmodernmultilingualencoder,
      title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning},
      author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme},
      year={2025},
      eprint={2509.06888},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.06888},
}
```