---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- transformers
- modernbert
- fill-mask
- masked-language-model
pipeline_tag: fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
model-index:
- name: ogbert-110m-base
  results:
  - task:
      type: word-similarity
    dataset:
      name: SimLex-999
      type: simlex999
    metrics:
    - type: spearman
      value: 0.345
---

# OGBert-110M-Base

A 110M-parameter ModernBERT masked language model trained on glossary and domain-specific text.

**Related models:**
- [mjbommar/ogbert-110m-sentence](https://huggingface.co/mjbommar/ogbert-110m-sentence) - sentence embedding version with mean pooling + L2 normalization
|
## Model Details

| Property | Value |
|----------|-------|
| Architecture | ModernBERT |
| Parameters | 110M |
| Hidden size | 768 |
| Layers | 12 |
| Attention heads | 12 |
| Vocab size | 32,768 |
| Max sequence length | 1,024 tokens |
|
## Training

- **Task**: Masked Language Modeling (MLM)
- **Dataset**: [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm) - derived from [OpenGloss](https://arxiv.org/abs/2511.18622), a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- **Masking**: standard BERT-style masking of 15% of tokens
- **Training steps**: 8,000 (checkpoint selected for best downstream performance)
- **Tokens processed**: ~4.5B
- **Batch size**: 1,024
- **Peak learning rate**: 3e-4
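The standard 15% masking scheme can be sketched as follows. This is a minimal illustration of BERT-style masking, not the training code used here; `MASK_ID` is a hypothetical placeholder (the real id comes from the tokenizer), while `VOCAB_SIZE` matches the vocab size listed above.

```python
import random

MASK_ID = 4          # hypothetical [MASK] token id; the real id comes from the tokenizer
VOCAB_SIZE = 32768   # matches the vocab size in Model Details

def mlm_mask(token_ids, mask_prob=0.15, seed=None):
    """BERT-style MLM masking: select ~15% of positions; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Unselected positions get label -100 so the loss ignores them."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok              # the model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = MASK_ID      # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: leave the token as-is
    return inputs, labels
```

The 80/10/10 split inside the selected positions prevents the model from only ever seeing `[MASK]` at prediction targets.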
|
## Performance

### Word Similarity (SimLex-999)

**SimLex-999** measures the Spearman correlation between a model's cosine similarities and human similarity judgments over 999 word pairs. Higher values indicate closer alignment with human perception of word similarity.

| Model | Params | SimLex-999 (ρ) |
|-------|--------|----------------|
| **OGBert-110M-Base** | **110M** | **0.345** |
| BERT-base | 110M | 0.070 |
| RoBERTa-base | 125M | -0.061 |
|
With the same parameter count, OGBert-110M-Base achieves a SimLex-999 correlation **roughly 5× higher** than BERT-base (0.345 vs. 0.070).
|
### Document Clustering

Evaluated on 80 domain-specific documents across 10 categories using KMeans.

| Model | Params | ARI | Cluster Acc |
|-------|--------|-----|-------------|
| **OGBert-110M-Base** | **110M** | **0.941** | **0.975** |
| BERT-base | 110M | 0.896 | 0.950 |
| RoBERTa-base | 125M | 0.941 | 0.975 |

OGBert-110M-Base matches RoBERTa-base and exceeds BERT-base on this clustering task.
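A minimal sketch of this kind of evaluation, using synthetic 3-cluster data in place of the model's document embeddings (the cluster count, dimensionality, and separation are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Toy stand-in for document embeddings: 3 well-separated Gaussian clusters
rng = np.random.default_rng(0)
centers = rng.standard_normal((3, 32)) * 10
X = np.vstack([c + rng.standard_normal((20, 32)) for c in centers])
labels_true = np.repeat(np.arange(3), 20)

labels_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(labels_true, labels_pred)

# Cluster accuracy: best one-to-one matching of predicted to true
# cluster labels, found with the Hungarian algorithm
cm = np.zeros((3, 3), dtype=int)
for t, p in zip(labels_true, labels_pred):
    cm[t, p] += 1
rows, cols = linear_sum_assignment(-cm)
acc = cm[rows, cols].sum() / len(labels_true)
```

ARI is permutation-invariant by construction; cluster accuracy needs the explicit label matching because KMeans assigns cluster ids arbitrarily.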
|
## Usage

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='mjbommar/ogbert-110m-base')
result = fill_mask('The financial <|mask|> was approved.')
# each entry in result has 'token_str', 'score', and the filled-in 'sequence'
```
|
### Direct Model Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-110m-base')
model = AutoModelForMaskedLM.from_pretrained('mjbommar/ogbert-110m-base')

inputs = tokenizer('The <|mask|> definition is clear.', return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# Decode the top predictions at the mask position
mask_pos = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = outputs.logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```
|
### For Sentence Embeddings

Use [mjbommar/ogbert-110m-sentence](https://huggingface.co/mjbommar/ogbert-110m-sentence) instead, which adds mean pooling and L2 normalization for similarity search.
|
## Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```
|
## License

Apache 2.0
|