# HalleluBERT: Let every token that has meaning bear its weight
HalleluBERT is a family of RoBERTa-based Modern Hebrew language models pre-trained from scratch on ~49.1 GB of deduplicated Hebrew web text (the HeDC4 / HeRo corpus) plus Hebrew Wikipedia. The models aim to provide the first fully converged Hebrew RoBERTa encoder family, including a large variant, and to advance the state of the art on core Hebrew benchmarks.
We release two variants:
- HalleluBERT-base: 126M parameters (fp32)
- HalleluBERT-large: 357M parameters (fp32)
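A minimal usage sketch with the Hugging Face `transformers` library. The hub ID below is an assumption based on the model name; substitute the actual repository path.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "HalleluBERT-base"  # hypothetical hub ID; replace with the real path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# RoBERTa-style models use "<mask>" as the mask token.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill("<mask> היא בירת ישראל."):  # "<mask> is the capital of Israel."
    print(pred["token_str"], round(pred["score"], 3))
```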
## Model Details
| Detail | HalleluBERT-base | HalleluBERT-large |
|---|---|---|
| Architecture | RoBERTa-base | RoBERTa-large |
| Parameters | ~126M | ~357M |
| Tokenizer | GPT-2-style byte-level BPE, 52,009-token vocabulary | Same |
| Pretraining corpus | HeDC4 (mC4 + OSCAR22) + Hebrew Wikipedia (~49.1 GB) | Same |
| Objective | Masked Language Modeling | Same |
| Training steps | 100k updates, global batch size 8k | Same |
| LR schedule | 10k warmup + polynomial decay | Same |
| Peak learning rate | 0.0004 | 0.00015 |
| Training time | ~30.2 hours (TPUv4-128 pod) | ~6.0 days (TPUv4-128 pod) |
| Precision | fp32 | fp32 |
| Framework | fairseq | fairseq |
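For illustration, here is a sketch of the masked-language-modeling objective listed above, using the `transformers` data collator rather than the authors' fairseq setup. The 15% masking rate is the standard RoBERTa recipe and an assumption here, and the hub ID is hypothetical as before.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tok = AutoTokenizer.from_pretrained("HalleluBERT-base")  # hypothetical hub ID

# Standard RoBERTa-style MLM: randomly mask tokens and train the model to
# recover them. The 15% rate is the usual default, assumed here.
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

batch = collator([tok("טקסט עברי לדוגמה")])  # "Hebrew example text"
print(batch["input_ids"])  # some positions replaced by tok.mask_token_id
print(batch["labels"])     # -100 everywhere except the masked positions
```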
## Downstream Evaluation
We evaluate HalleluBERT on three Hebrew benchmarks (following the HeRo suite, restricted to NER + sentiment):
- NER (BMC split 1): micro-F1
- NER (NEMO², token-level): micro-F1
- Sentiment (SMCD, deduplicated): macro-F1
We select the best configuration by validation performance and report the best score out of 10 runs on the official test set; the selection protocol is sketched below.
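A minimal sketch of that protocol, assuming a fine-tuning routine that returns gold and predicted test labels; `train_and_eval` below is a dummy stand-in, not the actual harness.

```python
import random

from sklearn.metrics import f1_score

def train_and_eval(seed: int):
    """Hypothetical stand-in for one seeded fine-tuning run; it should return
    (gold_labels, predicted_labels) on the official test set."""
    rng = random.Random(seed)
    gold = [rng.randint(0, 2) for _ in range(100)]
    pred = [g if rng.random() < 0.85 else rng.randint(0, 2) for g in gold]
    return gold, pred

# micro-F1 for the NER tasks; use average="macro" for SMCD sentiment.
scores = [f1_score(*train_and_eval(seed), average="micro") for seed in range(10)]
print(f"best of 10 runs: {max(scores):.4f}")
```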
## Evaluation Results
Legend: **bold** = best, <u>underline</u> = second-best within each model size group.
| Model | BMC (micro-F1) | NEMO (micro-F1) | AVG NER | SMCD (macro-F1) | AVG (all) |
|---|---|---|---|---|---|
| **Large models** | | | | | |
| HalleluBERT_large | **93.23** | **88.70** | **90.96** | **84.91** | **88.95** |
| XLM-RoBERTa_large | <u>92.31</u> | <u>86.41</u> | <u>89.36</u> | <u>83.74</u> | <u>87.49</u> |
| **Base models** | | | | | |
| HeBERT | 89.33 | 76.16 | 82.74 | 82.64 | 82.71 |
| AlephBERT | 91.36 | 81.52 | 86.44 | **83.66** | 85.51 |
| HeRo | 92.00 | 83.35 | 87.68 | 80.95 | 85.43 |
| HalleluBERT_base | **93.33** | **87.06** | **90.20** | 83.09 | **87.83** |
| mmBERT_small | 83.96 | 71.95 | 77.96 | 81.89 | 79.27 |
| AlephBERT-Gimmel | <u>92.46</u> | <u>85.86</u> | <u>89.16</u> | 82.66 | <u>86.99</u> |
| XLM-RoBERTa_base | 86.32 | 79.37 | 82.84 | 82.07 | 82.59 |
| mmBERT_base | 84.61 | 77.97 | 81.29 | <u>83.55</u> | 82.04 |
## Fairseq Checkpoint
Get the fairseq checkpoint here.
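A sketch of loading the raw fairseq checkpoint, assuming the download unpacks into a directory containing the checkpoint and tokenizer files; the directory and file names below are assumptions.

```python
from fairseq.models.roberta import RobertaModel

# Hypothetical local paths: point these at the unpacked checkpoint directory.
roberta = RobertaModel.from_pretrained(
    "hallelubert-base-fairseq",
    checkpoint_file="checkpoint_best.pt",
)
roberta.eval()  # disable dropout for inference

tokens = roberta.encode("שלום עולם")         # BPE-encode "Hello world"
features = roberta.extract_features(tokens)  # (1, seq_len, hidden_size)
print(features.shape)
```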
## Citation
If you use HalleluBERT in your research, please cite:
```bibtex
@misc{scheibleschmitt2025hallelubertlettokenmeaning,
  title={HalleluBERT: Let every token that has meaning bear its weight},
  author={Raphael Scheible-Schmitt},
  year={2025},
  eprint={2510.21372},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.21372},
}
```
## License
MIT License