MARS — Shared Cross-Attention (ModernBERT-base)

Best single-model checkpoint from the MARS (Masked Accuracy Recovery Score) research project: a shared-encoder cross-attention model for evaluating text summarization quality through masked entity recovery.

A summary is considered "good" if a separate language model can recover the entities that were masked out of the original article using only the summary as context. The MARS score is a composite metric computed over those reconstructions.

Results

Evaluated on a 1,500-sample CNN/DailyMail subset (seed=42):

Model	MARSv2	Notes
baseline (merged input, non-shared)	~52	original reference
shared cross-attention (this model)	57.02	best single model, 2L × 3 epochs
4-way ensemble (1L+2L+6L+baseline)	~59.0	aggregate champion

Single-model copy-ceiling analysis: this checkpoint achieves 56.17% strict entity recall on the eval subset, with an estimated 72% achievable if a perfect copy-from-summary mechanism were attached.
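One way to read the copy-ceiling figure, sketched under stated assumptions: a gold entity counts as recoverable if the model predicted it exactly, or if its surface form appears verbatim in the summary (so an ideal copy mechanism could have produced it). The function name `copy_ceiling` and the case-insensitive substring check are illustrative choices, not necessarily how the project derived its 72% estimate.

```python
def copy_ceiling(preds, golds, summary):
    """Return (strict_recall, ceiling) for one example.

    strict_recall: fraction of gold entities predicted exactly (case-insensitive).
    ceiling: fraction that were either predicted exactly or appear verbatim
             in the summary. The substring test is an assumption of this sketch.
    """
    summary_lc = summary.lower()
    n = max(1, len(golds))
    strict = 0
    recoverable = 0
    for p, g in zip(preds, golds):
        g_norm = g.strip().lower()
        hit = p.strip().lower() == g_norm
        copyable = bool(g_norm) and g_norm in summary_lc
        strict += int(hit)
        recoverable += int(hit or copyable)
    return strict / n, recoverable / n


strict, ceiling = copy_ceiling(
    preds=["Obama", "Paris", "Monday"],
    golds=["Obama", "Washington", "Tuesday"],
    summary="Obama spoke in Washington about climate policy.",
)
print(f"strict={strict:.2f} ceiling={ceiling:.2f}")
```

In this toy case only "Obama" is predicted correctly, but "Washington" is present in the summary, so the ceiling is higher than the strict recall. "Tuesday" is neither predicted nor copyable, so it stays out of both counts.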

Architecture

Single answerdotai/ModernBERT-base encoder (149M params), shared between the summary and the masked-text streams.
2 cross-attention layers where masked-text queries attend to summary keys/values.
Linear vocabulary projection head over the resized vocab (50,368 base + 21 special tokens = 50,389).

Total: ~190M parameters, single safetensors-equivalent torch checkpoint (model.pt, ~770 MB fp32).
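The size figures above can be sanity-checked with a little arithmetic. This is a back-of-the-envelope sketch: the ~190M total is taken from this card as given, decimal megabytes are assumed, and the constant names are illustrative.

```python
BASE_VOCAB = 50_368        # ModernBERT-base vocabulary size, per this card
STRUCTURAL_TOKENS = 3      # [ENTMASK], [ENTSTART], [ENTEND]
TYPED_MASK_TOKENS = 18     # one [ENTMASK_<TYPE>] per spaCy NER type

vocab_size = BASE_VOCAB + STRUCTURAL_TOKENS + TYPED_MASK_TOKENS
print(vocab_size)  # 50389

TOTAL_PARAMS = 190e6       # approximate total from this card
BYTES_PER_FP32 = 4
checkpoint_mb = TOTAL_PARAMS * BYTES_PER_FP32 / 1e6
print(f"{checkpoint_mb:.0f} MB")  # 760 MB, in line with the ~770 MB file
```

The small gap between 760 MB and ~770 MB is what one would expect from a rounded parameter count plus checkpoint metadata.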

Special tokens

[ENTMASK] — generic mask
[ENTSTART] / [ENTEND] — multi-token entity boundaries
[ENTMASK_<TYPE>] for 18 spaCy NER types: PERSON, ORG, GPE, LOC, DATE, TIME, MONEY, QUANTITY, PERCENT, CARDINAL, ORDINAL, EVENT, WORK_OF_ART, LAW, LANGUAGE, FAC, PRODUCT, NORP
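For illustration, the full set of 21 special tokens can be enumerated as follows. The order in which the checkpoint's tokenizer registers them is not stated in this card, so the ordering below is an assumption, and `NER_TYPES` and `SPECIAL_TOKENS` are hypothetical names.

```python
NER_TYPES = [
    "PERSON", "ORG", "GPE", "LOC", "DATE", "TIME", "MONEY", "QUANTITY",
    "PERCENT", "CARDINAL", "ORDINAL", "EVENT", "WORK_OF_ART", "LAW",
    "LANGUAGE", "FAC", "PRODUCT", "NORP",
]

# 3 structural tokens + 18 typed mask tokens = 21 additions to the base vocab.
SPECIAL_TOKENS = ["[ENTMASK]", "[ENTSTART]", "[ENTEND]"] + [
    f"[ENTMASK_{t}]" for t in NER_TYPES
]

print(len(SPECIAL_TOKENS))  # 21
```

With a Hugging Face tokenizer, tokens like these are typically registered via `tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_TOKENS})` followed by `model.resize_token_embeddings(len(tokenizer))`. The tokenizer shipped with this checkpoint presumably already includes them, so this step should only matter when rebuilding from the base model.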

Usage

Install

pip install transformers torch huggingface_hub

Download the two helper files from this repo: modeling_mars.py and inference.py (they are not auto-loaded by AutoModel because the architecture is custom).

Quick example

from inference import MarsInference

inf = MarsInference("Glazkov/mars-shared-cross-attention-modernbert")

summary = (
    "The president announced a new climate policy in Washington on Tuesday, "
    "promising to cut emissions by 40% by 2030."
)
masked_text = (
    "<mask> announced a new climate policy in <mask> on <mask>, "
    "promising to cut emissions by <mask> by <mask>."
)
entity_types = ["PERSON", "GPE", "DATE", "PERCENT", "DATE"]

predictions, confidences = inf.predict(
    summary, masked_text,
    entity_types=entity_types,
    return_confidence=True,
)
for t, p, c in zip(entity_types, predictions, confidences):
    print(f"  [{t}] -> {p!r}  (conf={c:.2f})")

Manual loading

import torch

from modeling_mars import load_model_from_checkpoint

model, tokenizer, device = load_model_from_checkpoint(
    "Glazkov/mars-shared-cross-attention-modernbert"
)

# Both inputs go through the SAME encoder; the masked stream cross-attends
# to the summary stream via the 2 cross-attention layers.
summary_enc = tokenizer("the summary text", return_tensors="pt").to(device)
masked_enc = tokenizer(
    "the original text with [ENTSTART] [ENTMASK_PERSON] removed",
    return_tensors="pt",
).to(device)

with torch.no_grad():
    out = model(
        summary_input_ids=summary_enc["input_ids"],
        summary_attention_mask=summary_enc["attention_mask"],
        masked_input_ids=masked_enc["input_ids"],
        masked_attention_mask=masked_enc["attention_mask"],
    )
logits = out.logits   # [batch, seq_len, vocab]

Scoring a summary (MARS-style)

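# Extra dependencies: pip install spacy && python -m spacy download en_core_web_sm
import spacy

from inference import MarsInference

nlp = spacy.load("en_core_web_sm")
inf = MarsInference("Glazkov/mars-shared-cross-attention-modernbert")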

def mask_entities(text: str):
    doc = nlp(text)
    masked, types, golds = text, [], []
    # iterate in reverse so character offsets remain valid
    for ent in sorted(doc.ents, key=lambda e: -e.start_char):
        masked = masked[:ent.start_char] + "<mask>" + masked[ent.end_char:]
        types.insert(0, ent.label_)
        golds.insert(0, ent.text)
    return masked, types, golds

article = "..."
summary = "..."

masked_text, types, gold = mask_entities(article)
preds = inf.predict(summary, masked_text, entity_types=types)
recall = sum(p.lower() == g.lower() for p, g in zip(preds, gold)) / max(1, len(gold))
print(f"Entity recall: {recall:.2%}")

Higher recall means the summary preserves more of the original article's factual content; this is the core signal behind the MARS metric.
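Because errors are not spread evenly across entity types, it can help to break recall down per type. The following is a minimal sketch that reuses the case-insensitive exact match from the snippet above; `recall_by_type` is a hypothetical helper, and whether the full MARS metric normalizes strings the same way is an assumption.

```python
from collections import defaultdict


def recall_by_type(preds, golds, types):
    """Case-insensitive exact-match recall, grouped by entity type."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for p, g, t in zip(preds, golds, types):
        totals[t] += 1
        hits[t] += int(p.strip().lower() == g.strip().lower())
    return {t: hits[t] / totals[t] for t in totals}


scores = recall_by_type(
    preds=["Joe Biden", "Washington", "Monday"],
    golds=["Barack Obama", "Washington", "monday"],
    types=["PERSON", "GPE", "DATE"],
)
print(scores)
```

In practice the three lists would be the `preds`, `gold`, and `types` values produced by the scoring snippet above.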

Training data

Train splits of four English summarization datasets:

CNN/DailyMail
XSum
Multi-News
SAMSum

Entities were extracted with spaCy en_core_web_sm NER and replaced with typed mask tokens. The model was trained for 3 epochs at LR 5e-5, batch size 8, on a single A100 (~24 h wall time).
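The replacement step could look roughly like this. The sketch takes entity spans as plain `(start, end, label)` tuples so it runs without spaCy; with spaCy they would come from `ent.start_char`, `ent.end_char`, and `ent.label_`. This card does not say whether training wrapped masks in [ENTSTART] / [ENTEND] or used one mask token per entity, so the single-token replacement here is an assumption, as is the helper name `apply_typed_masks`.

```python
def apply_typed_masks(text, spans):
    """Replace each (start, end, label) character span with [ENTMASK_<label>].

    Spans are processed right to left so earlier offsets stay valid
    while the string changes length.
    """
    masked = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + f"[ENTMASK_{label}]" + masked[end:]
    return masked


text = "Obama visited Paris on Tuesday."
spans = [(0, 5, "PERSON"), (14, 19, "GPE"), (23, 30, "DATE")]
masked = apply_typed_masks(text, spans)
print(masked)
# [ENTMASK_PERSON] visited [ENTMASK_GPE] on [ENTMASK_DATE].
```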

Limitations

English only. Multilingual transfer was not tested.
Max sequence length is 1024 tokens in this setup (ModernBERT itself supports longer contexts). Longer articles get truncated.
The model exhibits "confident hallucination" of plausible-but-wrong same-type entities (e.g. PERSON → wrong person). PERSON error rate is ~60% on held-out eval.
Best as a relative-comparison metric across summaries, not as an absolute factuality judgment on any single summary.

Citation

Internal research project. If you use this checkpoint, please cite it as:

@misc{mars2026,
  title  = {MARS: Masked Accuracy Recovery Score for Summarization},
  author = {Glazkov, Nikita},
  year   = {2026},
  url    = {https://huggingface.co/Glazkov/mars-shared-cross-attention-modernbert}
}
