---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- transformers
- modernbert
- fill-mask
- masked-language-model
pipeline_tag: fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-110m-base
  results:
  - task:
      type: word-similarity
    dataset:
      name: SimLex-999
      type: simlex999
    metrics:
    - type: spearman
      value: 0.345
---

# OGBert-110M-Base

A 110M-parameter ModernBERT-based masked language model trained on glossary and domain-specific text.

**Related models:**

- [mjbommar/ogbert-110m-sentence](https://huggingface.co/mjbommar/ogbert-110m-sentence) - Sentence embedding version with mean pooling + L2 normalization

## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT |
| Parameters | 110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocab size | 32,768 |
| Max sequence | 1,024 tokens |

## Training

- **Task**: Masked Language Modeling (MLM)
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Masking**: Standard 15% token masking
- **Training steps**: 8,000 (selected for optimal downstream performance)
- **Tokens processed**: ~4.5B
- **Batch size**: 1,024
- **Peak learning rate**: 3e-4

## Performance

### Word Similarity (SimLex-999)

**SimLex-999** measures the Spearman correlation between model cosine similarities and human judgments on 999 word pairs. Higher values indicate better alignment with human perception of word similarity.

| Model | Params | SimLex-999 (ρ) |
|-------|--------|----------------|
| **OGBert-110M-Base** | **110M** | **0.345** |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |

OGBert-110M-Base achieves roughly **5x** the word-similarity correlation of BERT-base at the same parameter count.

### Document Clustering

Evaluated on 80 domain-specific documents across 10 categories using KMeans.
| Model | Params | ARI | Cluster Acc |
|-------|--------|-----|-------------|
| **OGBert-110M-Base** | **110M** | **0.941** | **0.975** |
| BERT-base | 110M | 0.896 | 0.950 |
| RoBERTa-base | 125M | 0.941 | 0.975 |

OGBert-110M-Base matches or exceeds RoBERTa-base on clustering tasks.

## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-110m-base')
result = fill_mask('The financial <|mask|> was approved.')
```

### Direct Model Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-110m-base')

inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
outputs = model(**inputs)
```

### For Sentence Embeddings

Use [mjbommar/ogbert-110m-sentence](https://huggingface.co/mjbommar/ogbert-110m-sentence) instead, which includes mean pooling and L2 normalization for optimal similarity search.

## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

## License

Apache 2.0
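## Appendix: SimLex-Style Evaluation Sketch

The SimLex-999 protocol reported under Performance can be sketched end-to-end: embed each word, score each word pair by cosine similarity, then compute the Spearman correlation between the model's scores and human judgments. The sketch below uses made-up word pairs, embeddings, and human scores purely for illustration — these are not actual SimLex-999 entries or OGBert outputs, and a real evaluation would pull embeddings from the model instead.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the ranks (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    vy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (vx * vy)


# Hypothetical toy embeddings standing in for model word vectors.
emb = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "coast": [0.1, 0.9, 0.2],
    "shore": [0.2, 0.8, 0.3],
    "stock": [0.5, 0.1, 0.8],
    "jaguar": [0.0, 0.6, 0.1],
}

# Hypothetical pairs with made-up human scores (SimLex uses a 0-10 scale).
pairs = [("car", "automobile"), ("coast", "shore"), ("stock", "jaguar")]
human = [9.2, 8.7, 0.9]

model_sim = [cosine(emb[a], emb[b]) for a, b in pairs]
rho = spearman(model_sim, human)
# rho == 1.0 here: the toy model ranks the pairs exactly as the toy humans do.
print(f"Spearman rho: {rho:.3f}")
```

The reported 0.345 means OGBert's similarity ranking over the 999 real pairs agrees with the human ranking far more often than chance, while BERT-base (0.070) and RoBERTa-base (-0.061) are close to uncorrelated.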