# HalleluBERT: Let every token that has meaning bear its weight
HalleluBERT is a family of RoBERTa-based Modern Hebrew language models pre-trained from scratch on ~49.1 GB of deduplicated Hebrew web text (the HeDC4 / HeRo corpus) plus Hebrew Wikipedia. The models aim to provide the first fully converged Hebrew RoBERTa encoder family, including a large variant, and to advance the state of the art on core Hebrew benchmarks.
We release two variants:
- HalleluBERT-base: 126M parameters (fp32)
- HalleluBERT-large: 357M parameters (fp32)
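A minimal usage sketch with the Hugging Face `transformers` library. The hub ID below is an assumption based on the model name; substitute the actual repository path.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "HalleluBERT-base"  # hypothetical hub ID; replace with the real path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# RoBERTa-style models use "<mask>" as the mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill("<mask> היא בירת ישראל."):  # "<mask> is the capital of Israel."
    print(pred["token_str"], round(pred["score"], 3))
```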
## Model Details
| Detail | HalleluBERT-base | HalleluBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | GPT-2-style byte-level BPE, 52,009-token vocabulary | Same |
| Pretraining corpus | HeDC4 (mC4 + OSCAR22) + Hebrew Wikipedia (~49.1 GB) | Same |
| Objective | Masked Language Modeling | Same |
| Training steps | 100k updates, global batch size 8k | Same |
| LR schedule | 10k warmup + polynomial decay | Same |
| Peak learning rate | 0.0004 | 0.00015 |
| Training time | ~30.2 hours (TPUv4-128 pod) | ~6.0 days (TPUv4-128 pod) |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |
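For illustration, here is a sketch of the masked-language-modeling objective listed above, using the `transformers` data collator rather than the authors' fairseq setup. The 15% masking rate is the standard RoBERTa recipe and an assumption here, and the hub ID is hypothetical as before.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("HalleluBERT-base")  # hypothetical hub ID

# Standard RoBERTa-style MLM: randomly mask tokens and train the model to
# recover them. The 15% rate is the usual default, assumed here.
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

batch = collator([tok("טקסט עברי לדוגמה")])  # "Hebrew example text"
print(batch["input_ids"])  # some positions replaced by tok.mask_token_id
print(batch["labels"])     # -100 everywhere except the masked positions
```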
## Downstream Evaluation
We evaluate HalleluBERT on three Hebrew benchmarks (following the HeRo suite, restricted to NER + sentiment):
- NER (BMC split 1): micro-F1
- NER (NEMO², token-level): micro-F1
- Sentiment (SMCD, deduplicated): macro-F1
We select the best configuration by validation performance and report the best score out of 10 runs on the official test set; the selection protocol is sketched below.
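A minimal sketch of that protocol, assuming a fine-tuning routine that returns gold and predicted test labels; `train_and_eval` below is a dummy stand-in, not the actual harness.

```python
import random

from sklearn.metrics import f1_score

def train_and_eval(seed: int):
    """Hypothetical stand-in for one seeded fine-tuning run; it should return
    (gold_labels, predicted_labels) on the official test set."""
    rng = random.Random(seed)
    gold = [rng.randint(0, 2) for _ in range(100)]
    pred = [g if rng.random() < 0.85 else rng.randint(0, 2) for g in gold]
    return gold, pred

# micro-F1 for the NER tasks; use average="macro" for SMCD sentiment.
scores = [f1_score(*train_and_eval(seed), average="micro") for seed in range(10)]
print(f"best of 10 runs: {max(scores):.4f}")
```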
## Evaluation Results
Legend: **bold** = best, <u>underline</u> = second-best within each model size group.
| Model | BMC (micro-F1) | NEMO (micro-F1) | AVG NER | SMCD (macro-F1) | AVG (all) |
|---|---|---|---|---|---|
| **Large models** | | | | | |
| HalleluBERT_large | **93.23** | **88.70** | **90.96** | **84.91** | **88.95** |
| XLM-RoBERTa_large | <u>92.31</u> | <u>86.41</u> | <u>89.36</u> | <u>83.74</u> | <u>87.49</u> |
| **Base models** | | | | | |
| HeBERT | 89.33 | 76.16 | 82.74 | 82.64 | 82.71 |
| AlephBERT | 91.36 | 81.52 | 86.44 | **83.66** | 85.51 |
| HeRo | 92.00 | 83.35 | 87.68 | 80.95 | 85.43 |
| HalleluBERT_base | **93.33** | **87.06** | **90.20** | 83.09 | **87.83** |
| mmBERT_small | 83.96 | 71.95 | 77.96 | 81.89 | 79.27 |
| AlephBERT-Gimmel | <u>92.46</u> | <u>85.86</u> | <u>89.16</u> | 82.66 | <u>86.99</u> |
| XLM-RoBERTa_base | 86.32 | 79.37 | 82.84 | 82.07 | 82.59 |
| mmBERT_base | 84.61 | 77.97 | 81.29 | <u>83.55</u> | 82.04 |
## Fairseq Checkpoint
Get the fairseq checkpoint here.
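A sketch of loading the raw fairseq checkpoint, assuming the download unpacks into a directory containing the checkpoint and tokenizer files; the directory and file names below are assumptions.

```python
from fairseq.models.roberta import RobertaModel

# Hypothetical local paths: point these at the unpacked checkpoint directory.
roberta = RobertaModel.from_pretrained(
    "hallelubert-base-fairseq",
    checkpoint_file="checkpoint_best.pt",
)
roberta.eval()  # disable dropout for inference

tokens = roberta.encode("שלום עולם")         # BPE-encode "Hello world"
features = roberta.extract_features(tokens)  # (1, seq_len, hidden_size)
print(features.shape)
```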
## Citation
If you use HalleluBERT in your research, please cite:
```bibtex
@misc{scheibleschmitt2025hallelubertlettokenmeaning,
  title={HalleluBERT: Let every token that has meaning bear its weight},
  author={Raphael Scheible-Schmitt},
  year={2025},
  eprint={2510.21372},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.21372},
}
```
## License
MIT License