MARS — Shared Cross-Attention (ModernBERT-base)
Best single-model checkpoint from the MARS (Masked Accuracy Recovery Score) research project: a shared-encoder cross-attention model for evaluating text summarization quality through masked entity recovery.
A summary is "good" if a separate language model can recover the entities that were masked out of the original article using only the summary as context. The MARS score is a composite metric computed over those reconstructions.
Results
Evaluated on a 1,500-sample CNN/DailyMail subset (seed=42):
| Model | MARSv2 | Notes |
|---|---|---|
| baseline (merged input, non-shared) | ~52 | original reference |
| shared cross-attention (this model) | 57.02 | best single model, 2L × 3 epochs |
| 4-way ensemble (1L+2L+6L+baseline) | ~59.0 | aggregate champion |
Single-model copy-ceiling analysis: this checkpoint achieves 56.17% strict entity recall on the eval subset, with an estimated 72% achievable if a perfect copy-from-summary mechanism were attached.
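One way to read the copy-ceiling figure: an entity counts as recoverable if the model already predicted it, or if its surface form appears verbatim in the summary, so that an ideal copy mechanism could have picked it up. The sketch below computes both numbers under that assumption. The function name and the substring-matching rule are illustrative and are not taken from the MARS codebase.

```python
def recall_and_copy_ceiling(examples):
    """examples: iterable of (summary, predictions, golds) tuples.

    Returns (strict_recall, copy_ceiling) as fractions in [0, 1].
    Assumption: an entity is "copyable" if its lowercased text occurs
    verbatim in the lowercased summary.
    """
    total = hits = ceiling_hits = 0
    for summary, preds, golds in examples:
        summary_lc = summary.lower()
        for pred, gold in zip(preds, golds):
            gold_lc = gold.strip().lower()
            correct = pred.strip().lower() == gold_lc
            copyable = gold_lc in summary_lc
            total += 1
            hits += correct
            ceiling_hits += correct or copyable
    if total == 0:
        return 0.0, 0.0
    return hits / total, ceiling_hits / total


# Toy values for illustration only, not model output.
examples = [
    ("Obama spoke in Paris on Monday.",
     ["Obama", "London", "Tuesday"],   # predictions
     ["Obama", "Paris", "Friday"]),    # gold entities
]
strict, ceiling = recall_and_copy_ceiling(examples)
print(f"strict recall: {strict:.2%}, copy ceiling: {ceiling:.2%}")
```

The gap between the two numbers is the share of misses that a copy mechanism could in principle fix; the remainder are entities the summary never mentions.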
Architecture
- Single answerdotai/ModernBERT-base encoder (149M params), shared between the summary and the masked-text streams.
- 2 cross-attention layers where masked-text queries attend to summary keys/values.
- Linear vocabulary projection head over the resized vocab (50,368 base + 21 special tokens = 50,389).
Total: ~190M parameters, single safetensors-equivalent torch checkpoint
(model.pt, ~770 MB fp32).
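The totals above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is an estimate, not a parameter count read from the checkpoint: it assumes a hidden size of 768 (ModernBERT-base), an untied linear vocabulary head with bias, and cross-attention layers consisting only of query, key, value and output projections. Any feed-forward or normalization parameters in those layers are ignored.

```python
HIDDEN = 768            # ModernBERT-base hidden size
BASE_VOCAB = 50_368     # base vocabulary size quoted above
NUM_SPECIAL = 21        # added special tokens
ENCODER_PARAMS = 149e6  # encoder size quoted above

vocab = BASE_VOCAB + NUM_SPECIAL

# Assumption: the head is an untied Linear(HIDDEN, vocab) with bias.
head_params = HIDDEN * vocab + vocab

# Assumption: each of the 2 cross-attention layers holds 4 projections
# (Q, K, V, output), each HIDDEN x HIDDEN plus a bias vector.
cross_attn_params = 2 * 4 * (HIDDEN * HIDDEN + HIDDEN)

total = ENCODER_PARAMS + head_params + cross_attn_params
size_mb = total * 4 / 1e6  # fp32 stores 4 bytes per parameter

print(f"vocab size:       {vocab}")
print(f"estimated params: {total / 1e6:.1f}M")
print(f"estimated fp32:   {size_mb:.0f} MB")
```

Under these assumptions the estimate lands close to the quoted ~190M parameters and ~770 MB, with most of the parameters beyond the encoder coming from the vocabulary head.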
Special tokens
- [ENTMASK]: generic mask
- [ENTSTART] / [ENTEND]: multi-token entity boundaries
- [ENTMASK_<TYPE>] for 18 spaCy NER types: PERSON, ORG, GPE, LOC, DATE, TIME, MONEY, QUANTITY, PERCENT, CARDINAL, ORDINAL, EVENT, WORK_OF_ART, LAW, LANGUAGE, FAC, PRODUCT, NORP
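As a quick check on the count of 21, the token list can be generated from the type names. This is a minimal sketch; the helper name and the token ordering are assumptions, and the order used in the actual checkpoint may differ.

```python
SPACY_NER_TYPES = [
    "PERSON", "ORG", "GPE", "LOC", "DATE", "TIME", "MONEY", "QUANTITY",
    "PERCENT", "CARDINAL", "ORDINAL", "EVENT", "WORK_OF_ART", "LAW",
    "LANGUAGE", "FAC", "PRODUCT", "NORP",
]


def build_special_tokens(types=SPACY_NER_TYPES):
    """Return the 3 generic tokens followed by one typed mask per NER type."""
    generic = ["[ENTMASK]", "[ENTSTART]", "[ENTEND]"]
    typed = [f"[ENTMASK_{t}]" for t in types]
    return generic + typed


special_tokens = build_special_tokens()
print(len(special_tokens), "special tokens")

# With a Hugging Face tokenizer and a PreTrainedModel, registration would
# typically look like this (not executed here; `tokenizer` and `model` are
# assumed to exist):
#   tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})
#   model.resize_token_embeddings(len(tokenizer))
```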
Usage
Install
pip install transformers torch huggingface_hub
Download the two helper files from this repo: modeling_mars.py and
inference.py (they are not auto-loaded by AutoModel because the
architecture is custom).
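The two helper files can also be fetched programmatically with huggingface_hub. The sketch below assumes both files sit at the root of the repo; `download_helpers` is a hypothetical convenience function, not part of the repo.

```python
REPO_ID = "Glazkov/mars-shared-cross-attention-modernbert"
HELPER_FILES = ["modeling_mars.py", "inference.py"]


def download_helpers(target_dir="."):
    """Fetch the helper modules into target_dir and return their local paths.

    Assumption: both files are stored at the top level of the repo.
    """
    # Imported lazily so the snippet loads even before the package is installed.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=target_dir)
        for name in HELPER_FILES
    ]


# Usage (requires network access):
#   paths = download_helpers()
```

Downloading into the working directory keeps `from inference import MarsInference` importable without touching `sys.path`.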
Quick example
from inference import MarsInference

inf = MarsInference("Glazkov/mars-shared-cross-attention-modernbert")

summary = (
    "The president announced a new climate policy in Washington on Tuesday, "
    "promising to cut emissions by 40% by 2030."
)
masked_text = (
    "<mask> announced a new climate policy in <mask> on <mask>, "
    "promising to cut emissions by <mask> by <mask>."
)
# One entity type per <mask>, in order of appearance
entity_types = ["PERSON", "GPE", "DATE", "PERCENT", "DATE"]

predictions, confidences = inf.predict(
    summary, masked_text,
    entity_types=entity_types,
    return_confidence=True,
)
for t, p, c in zip(entity_types, predictions, confidences):
    print(f" [{t}] -> {p!r} (conf={c:.2f})")
Manual loading
import torch
from modeling_mars import load_model_from_checkpoint

model, tokenizer, device = load_model_from_checkpoint(
    "Glazkov/mars-shared-cross-attention-modernbert"
)

# Both inputs go through the SAME encoder; the masked stream cross-attends
# to the summary stream via the 2 cross-attention layers.
summary_enc = tokenizer("the summary text", return_tensors="pt").to(device)
masked_enc = tokenizer(
    "the original text with [ENTSTART] [ENTMASK_PERSON] removed",
    return_tensors="pt",
).to(device)

# Inference only, so gradient tracking is disabled
with torch.no_grad():
    out = model(
        summary_input_ids=summary_enc["input_ids"],
        summary_attention_mask=summary_enc["attention_mask"],
        masked_input_ids=masked_enc["input_ids"],
        masked_attention_mask=masked_enc["attention_mask"],
    )
logits = out.logits  # [batch, seq_len, vocab]
Scoring a summary (MARS-style)
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def mask_entities(text: str):
    """Replace every spaCy entity in text with <mask>; return (masked, types, golds)."""
    doc = nlp(text)
    masked, types, golds = text, [], []
    # iterate in reverse so character offsets remain valid
    for ent in sorted(doc.ents, key=lambda e: -e.start_char):
        masked = masked[:ent.start_char] + "<mask>" + masked[ent.end_char:]
        types.insert(0, ent.label_)
        golds.insert(0, ent.text)
    return masked, types, golds

article = "..."
summary = "..."
masked_text, types, gold = mask_entities(article)
# inf is the MarsInference instance created in the quick example
preds = inf.predict(summary, masked_text, entity_types=types)
# Case-insensitive exact match between predicted and gold entities
recall = sum(p.lower() == g.lower() for p, g in zip(preds, gold)) / max(1, len(gold))
print(f"Entity recall: {recall:.2%}")
Higher recall means the summary preserves more of the original article's factual content. This is the core signal behind the MARS metric.
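A single recall number hides which entity types the summary fails to support. The sketch below breaks recall down by spaCy label, using the same case-insensitive exact match as the scoring code. `recall_by_type` is a hypothetical helper, and the inputs are toy values rather than model output.

```python
from collections import defaultdict


def recall_by_type(preds, golds, types):
    """Return {entity_type: (correct, total, recall)} using
    case-insensitive exact match between prediction and gold."""
    counts = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for pred, gold, ent_type in zip(preds, golds, types):
        counts[ent_type][1] += 1
        if pred.strip().lower() == gold.strip().lower():
            counts[ent_type][0] += 1
    return {t: (c, n, c / n) for t, (c, n) in counts.items()}


# Toy values for illustration only.
preds = ["Biden", "Paris", "Monday", "Obama"]
golds = ["Biden", "London", "Monday", "Trump"]
types = ["PERSON", "GPE", "DATE", "PERSON"]

breakdown = recall_by_type(preds, golds, types)
for ent_type, (correct, total, recall) in sorted(breakdown.items()):
    print(f"{ent_type:<8} {correct}/{total}  recall={recall:.0%}")
```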
Training data
Train splits of four English summarization datasets:
- CNN/DailyMail
- XSum
- Multi-News
- SAMSum
Entities were extracted with spaCy en_core_web_sm NER and replaced with
typed mask tokens. The model was trained for 3 epochs at LR 5e-5,
batch size 8, on a single A100 (~24 h wall time).
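The masking step described above can be sketched without spaCy by passing entity spans in directly. This is an illustration under assumptions: it substitutes one typed mask token per entity and does not use [ENTSTART] / [ENTEND], because the exact multi-token format is not specified here. `apply_typed_masks` is a hypothetical helper.

```python
def apply_typed_masks(text, spans):
    """Replace each (start, end, label) character span with [ENTMASK_<label>].

    Returns (masked_text, gold_entities). Spans must not overlap.
    Assumption: one typed mask token per entity, regardless of its length.
    """
    masked, golds = text, []
    # Work right-to-left so earlier character offsets stay valid.
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        golds.insert(0, text[start:end])
        masked = masked[:start] + f"[ENTMASK_{label}]" + masked[end:]
    return masked, golds


text = "Angela Merkel visited Paris in 2019."
# Character spans as an NER system might report them.
spans = [(0, 13, "PERSON"), (22, 27, "GPE"), (31, 35, "DATE")]

masked, golds = apply_typed_masks(text, spans)
print(masked)
print(golds)
```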
Limitations
- English only. Multilingual transfer was not tested.
- Max sequence length 1024 (ModernBERT). Long articles get truncated.
- The model exhibits "confident hallucination" of plausible-but-wrong same-type entities (e.g. PERSON → wrong person). PERSON error rate is ~60% on held-out eval.
- Best as a relative-comparison metric across summaries, not as an absolute factuality judgment on any single summary.
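Relative comparison can be sketched as ranking several candidate summaries of the same article by entity recall. In the sketch below, `rank_summaries` and `toy_predict` are hypothetical; `predict_fn` stands in for a predictor with the same call shape as `inf.predict`, and the toy predictor only exists so the example runs without the model.

```python
def rank_summaries(summaries, masked_text, types, golds, predict_fn):
    """Score each candidate summary by entity recall.

    predict_fn(summary, masked_text, entity_types=types) must return one
    predicted string per mask. Returns (recall, summary) pairs, best first.
    """
    scored = []
    for summary in summaries:
        preds = predict_fn(summary, masked_text, entity_types=types)
        hits = sum(p.lower() == g.lower() for p, g in zip(preds, golds))
        scored.append((hits / max(1, len(golds)), summary))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_predict(summary, masked_text, entity_types):
    """Stand-in predictor: recovers an entity only if the summary contains it."""
    known = ["Paris", "Monday"]
    return [g if g in summary else "" for g in known]


ranking = rank_summaries(
    [
        "Talks were held.",
        "Talks were held in Paris on Monday.",
        "Talks were held in Paris.",
    ],
    masked_text="Talks were held in <mask> on <mask>.",
    types=["GPE", "DATE"],
    golds=["Paris", "Monday"],
    predict_fn=toy_predict,
)
for recall, summary in ranking:
    print(f"{recall:.0%}  {summary}")
```

Because every candidate is scored against the same masked article, systematic model errors affect all candidates alike, which is why the ordering is more trustworthy than any single score.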
Citation
Internal research project. If you use this checkpoint, please cite it as:
@misc{mars2026,
  title  = {MARS: Masked Accuracy Recovery Score for Summarization},
  author = {Glazkov, Nikita},
  year   = {2026},
  url    = {https://huggingface.co/Glazkov/mars-shared-cross-attention-modernbert}
}