---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- transformers
- modernbert
- fill-mask
- masked-language-model
pipeline_tag: fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-110m-base
results:
- task:
type: word-similarity
dataset:
name: SimLex-999
type: simlex999
metrics:
- type: spearman
value: 0.345
---
# OGBert-110M-Base
A 110M-parameter, ModernBERT-based masked language model trained on glossary and domain-specific text.
**Related models:**
- [mjbommar/ogbert-110m-sentence](https://huggingface.co/mjbommar/ogbert-110m-sentence) - Sentence embedding version with mean pooling + L2 normalization
## Model Details
| Property | Value |
|----------|-------|
| Architecture | ModernBERT |
| Parameters | 110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocab size | 32,768 |
| Max sequence length | 1,024 tokens |
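These values can be read directly off the published config; for example, with `AutoConfig`:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained('mjbommar/ogbert-110m-base')
# Should print 768, 12, 12, 32768, matching the table above
print(config.hidden_size, config.num_hidden_layers,
      config.num_attention_heads, config.vocab_size)
```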
## Training
- **Task**: Masked Language Modeling (MLM)
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Masking**: Standard 15% token masking (see the sketch after this list)
- **Training steps**: 8,000 steps (selected for optimal downstream performance)
- **Tokens processed**: ~4.5B
- **Batch size**: 1,024
- **Peak learning rate**: 3e-4
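The original training code is not part of this card, but standard 15% masking can be reproduced with transformers' `DataCollatorForLanguageModeling`. The sketch below is illustrative, not the exact training setup:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')

# Standard MLM recipe: mask 15% of tokens and predict them
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = collator([tokenizer('A glossary maps terms to definitions.')])

print(batch['input_ids'])  # some positions replaced by the mask token
print(batch['labels'])     # -100 everywhere except the masked positions
```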
## Performance
### Word Similarity (SimLex-999)
**SimLex-999** measures Spearman correlation between model cosine similarities and human judgments on 999 word pairs. Higher = better alignment with human perception of word similarity.
| Model | Params | SimLex-999 (ρ) |
|-------|--------|----------------|
| **OGBert-110M-Base** | **110M** | **0.345** |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |
OGBert-110M-Base achieves a roughly **5× higher** correlation than BERT-base at the same parameter count (0.345 vs. 0.070).
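For reference, this kind of evaluation can be sketched as follows. The pooling strategy (mean over each word's tokens) is an assumption, since the card does not specify the exact protocol:

```python
import torch
from scipy.stats import spearmanr
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-110m-base')

def embed(word):
    # Mean-pool the last hidden state over the word's tokens (assumed pooling)
    inputs = tokenizer(word, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

def simlex_spearman(pairs):
    # pairs: [(word1, word2, human_score), ...] from the SimLex-999 data
    sims = [torch.cosine_similarity(embed(a), embed(b), dim=0).item()
            for a, b, _ in pairs]
    gold = [s for _, _, s in pairs]
    return spearmanr(sims, gold).correlation
```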
### Document Clustering
Evaluated on 80 domain-specific documents across 10 categories using KMeans.
| Model | Params | ARI | Cluster Acc |
|-------|--------|-----|-------------|
| **OGBert-110M-Base** | **110M** | **0.941** | **0.975** |
| BERT-base | 110M | 0.896 | 0.950 |
| RoBERTa-base | 125M | 0.941 | 0.975 |
OGBert-110M-Base matches RoBERTa-base and exceeds BERT-base on this clustering benchmark.
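A minimal sketch of this evaluation, assuming precomputed document embeddings and gold category labels (the exact pooling and KMeans settings used here are not specified):

```python
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, confusion_matrix

def cluster_scores(embeddings, labels, n_clusters=10):
    # embeddings: (n_docs, hidden_size) document vectors; labels: gold category ids
    preds = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    ari = adjusted_rand_score(labels, preds)
    # Cluster accuracy: best one-to-one assignment of clusters to categories
    cm = confusion_matrix(labels, preds)
    rows, cols = linear_sum_assignment(-cm)
    acc = cm[rows, cols].sum() / cm.sum()
    return ari, acc
```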
## Usage
### Fill-Mask Pipeline
```python
from transformers import pipeline

# This tokenizer uses <|mask|> as its mask token
fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-110m-base')
result = fill_mask('The financial <|mask|> was approved.')
print(result[0]['token_str'], result[0]['score'])  # top prediction and its probability
```
### Direct Model Usage
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-110m-base')
inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
# Logits over the 32,768-token vocabulary at every position
print(outputs.logits.shape)
```
### For Sentence Embeddings
Use [mjbommar/ogbert-110m-sentence](https://huggingface.co/mjbommar/ogbert-110m-sentence) instead, which includes mean pooling and L2 normalization for optimal similarity search.
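If you need embeddings from the base model directly, a minimal sketch of the same recipe (masked mean pooling over the last hidden state, then L2 normalization) looks like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')
model = AutoModel.from_pretrained('mjbommar/ogbert-110m-base')

def encode(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    # Mean-pool over non-padding tokens, then L2-normalize
    mask = inputs['attention_mask'].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return torch.nn.functional.normalize(pooled, p=2, dim=1)

print(encode(['A glossary entry.', 'A dictionary definition.']).shape)  # torch.Size([2, 768])
```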
## Citation
If you use this model, please cite the OpenGloss dataset:
```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```
## License
Apache 2.0