---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- transformers
- modernbert
- fill-mask
- masked-language-model
pipeline_tag: fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-2m-base
  results:
  - task:
      type: word-similarity
    dataset:
      name: SimLex-999
      type: simlex999
    metrics:
    - type: spearman
      value: 0.162
---

# OGBert-2M-Base

A tiny (2.1M parameter) ModernBERT-based masked language model for glossary and domain-specific text.

**Related models:**

- [mjbommar/ogbert-2m-sentence](https://huggingface.co/mjbommar/ogbert-2m-sentence) - sentence embedding version with mean pooling + L2 normalization

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
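
These values can be sanity-checked against the hosted checkpoint. A minimal sketch (assumes the standard `transformers` config attribute names; the parameter count is computed from the loaded weights):

```python
from transformers import AutoConfig, AutoModelForMaskedLM

# Inspect the architecture hyperparameters from the hosted config
config = AutoConfig.from_pretrained('mjbommar/ogbert-2m-base')
print(config.hidden_size, config.num_hidden_layers,
      config.num_attention_heads, config.vocab_size)

# Count parameters directly from the loaded weights
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-2m-base')
print(sum(p.numel() for p in model.parameters()))  # roughly 2.1M
```
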
## Training

- **Task**: Masked Language Modeling (MLM)
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Masking**: Standard 15% token masking (see the sketch after this list)
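
For reference, 15% masking is what the default `transformers` data collator applies; a minimal sketch of that setup (illustrative only, not necessarily the exact training script used for this model):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')

# Dynamically masks 15% of tokens in each batch - the standard MLM recipe
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
```
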
## Performance

### Word Similarity (SimLex-999)

**SimLex-999** is a benchmark of 999 word pairs annotated with human similarity judgments; the score is the Spearman correlation between the model's cosine similarities and those judgments. Higher values indicate closer alignment with human perception of word similarity.

| Model | Params | SimLex-999 (ρ) |
|-------|--------|----------------|
| **OGBert-2M-Base** | **2.1M** | **0.162** |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |

OGBert-2M-Base achieves a **2.3x higher** SimLex-999 correlation than BERT-base with **52x fewer** parameters.
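
A sketch of how such a score can be computed (assumes each word is embedded by mean-pooling the model's last hidden states over its tokens; the exact evaluation protocol may differ):

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-base')

def embed(word):
    # Mean-pool the last hidden state over the word's tokens
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def simlex_spearman(pairs):
    # pairs: iterable of (word1, word2, human_score) from SimLex-999
    sims = [torch.cosine_similarity(embed(w1), embed(w2), dim=0).item()
            for w1, w2, _ in pairs]
    gold = [score for _, _, score in pairs]
    return spearmanr(sims, gold)[0]
```
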
## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-2m-base')

# This tokenizer uses <|mask|> as its mask token
result = fill_mask('The financial <|mask|> was approved.')

for prediction in result:
    print(prediction['token_str'], prediction['score'])
```

**Output:**

| Rank | Token | Score |
|------|-------|-------|
| 1 | report | 0.031 |
| 2 | transaction | 0.025 |
| 3 | system | 0.021 |
| 4 | audit | 0.019 |
| 5 | account | 0.017 |

### Direct Model Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-2m-base')

# Run a forward pass over a sentence containing the <|mask|> token
inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
outputs = model(**inputs)
```
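
To read out the model's top prediction at the masked position, decode the highest-scoring vocabulary entry from `outputs.logits` (a minimal sketch continuing the snippet above):

```python
# Locate the mask position(s) and take the argmax over the vocabulary
mask_positions = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_token_ids = outputs.logits[0, mask_positions].argmax(dim=-1)
print(tokenizer.decode(top_token_ids))
```
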
### For Sentence Embeddings

Use [mjbommar/ogbert-2m-sentence](https://huggingface.co/mjbommar/ogbert-2m-sentence) instead, which includes mean pooling and L2 normalization for optimal similarity search.
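
If you only have this base checkpoint available, that recipe can be approximated by hand (a sketch assuming mean pooling over non-padding tokens followed by L2 normalization, which is what the sentence model packages):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-base')

def sentence_embedding(text):
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state    # (1, seq_len, 128)
    mask = inputs['attention_mask'].unsqueeze(-1)     # ignore padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, p=2, dim=1)            # unit-length vector
```
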
## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

## License

Apache 2.0