- Paper: arXiv:2606.19468
- Collection: Narratives in LLM Pretraining Data
narrative-event-relation-roberta
RoBERTa-base fine-tuned for event-relation classification. It has two binary heads for a pair of event spans that you mark in the text with [E1]…[/E1] and [E2]…[/E2]:
- temporal: whether the two events stand in a temporal/sequential relation
- causal: whether the two events stand in a causal relation
Trained on LLM (Gemma) pseudo-labels and evaluated against held-out human gold. Part of the NarraBert suite from Characterizing Narrative Content in Web-Scale LLM Pretraining Data.
Note: Extended model card with full training details coming soon.
⚠️ Performance & intended use
This initial release (v0.1) of the event-relation model is the weakest model in the NarraBert suite, and its predictions should be treated as noisy.
- Against held-out human gold (paper, Tab. A3), it reaches F1 ≈ 0.58 (temporal) and ≈ 0.68 (causal), for a macro F1 ≈ 0.63, below its Gemma teacher (≈ 0.78). The single test_f1_gold (0.805) in the config below is a weighted aggregate that is inflated by the dominant temporal class; the per-task figures are the better guide.
- The gap is driven by severe class imbalance in the training labels: ~95% of event pairs are temporally related and ~75% are not causally related. The minority classes (non-temporal, causal) are therefore the least reliable.
- Recommended use: aggregate, corpus-level signals (e.g., mean causal density over many passages), not high-stakes per-pair decisions. Individual predictions — especially minority-class ones — carry real noise.
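As a rough illustration of why an aggregate score can look healthy under this imbalance, the sketch below computes the positive-class F1 of a trivial classifier that always predicts "temporally related" at the ~95% base rate quoted above, next to the macro average of the two per-task gold figures. This is not how the paper's numbers were computed, and it assumes the evaluation data shares the training base rate; the helper name `f1_from_pr` is illustrative.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Trivial baseline: always predict "temporally related".
# With ~95% positives (assumed base rate), precision = 0.95 and recall = 1.0.
always_temporal_f1 = f1_from_pr(precision=0.95, recall=1.0)

# Macro average of the two per-task gold F1 scores reported above.
macro_f1 = (0.58 + 0.68) / 2

print(f"always-temporal baseline F1 = {always_temporal_f1:.3f}")
print(f"macro F1 over both heads    = {macro_f1:.2f}")
```

Under these assumptions a classifier with no discriminative ability already scores about 0.97 on the temporal positive class, which is why the per-task figures against human gold are more informative than any aggregate dominated by that class.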
Prerequisites: detect event spans first
This model classifies the relation between two event spans that you provide; it does not detect events itself. You must run an event-span detector first, then wrap an adjacent pair of spans in the marker tokens before calling this model.
The pipeline used in the paper:
- Detect event-trigger spans with a DeBERTa event detector fine-tuned on LitBank (Sims et al., 2019); ≈ F1 0.85 on our web-scale data.
- Add verb spans via spaCy en_core_web_trf, discarding any that overlap a detected event span.
- Select one adjacent span pair and wrap each span: [E1]…[/E1] for the first, [E2]…[/E2] for the second.
- Run this model on the marked text.
Any reasonable event/trigger detector works; the only requirement is that the two spans are wrapped in the [E1]/[E2] markers exactly as during training. The provided tokenizer/ is set up to match that training format; if you reconstruct the tokenizer, make sure the four marker strings tokenize the same way they did in training.
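The span-merging and pair-selection steps of the pipeline can be sketched in plain Python as below. This is a minimal sketch, not the paper's code: it assumes spans are (start, end) character offsets with an exclusive end, and it reads "adjacent" as "consecutive in text order", which the card does not spell out. The function names are illustrative.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive (assumed)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_spans(event_spans: List[Span], verb_spans: List[Span]) -> List[Span]:
    """Keep all detected event spans, plus verb spans that overlap none of them."""
    kept = [v for v in verb_spans if not any(overlaps(v, e) for e in event_spans)]
    return sorted(event_spans + kept)


def adjacent_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Pair each span with the next one in text order (assumed meaning of 'adjacent')."""
    return list(zip(spans, spans[1:]))


# Offsets for "She dropped her phone. The screen cracked."
event_spans = [(4, 11)]            # "dropped", from the event detector
verb_spans = [(4, 11), (34, 41)]   # "dropped", "cracked", from the verb tagger
spans = merge_spans(event_spans, verb_spans)
pairs = adjacent_pairs(spans)
print(pairs)
```

The duplicate verb span for "dropped" is discarded because it overlaps a detected event span, leaving one adjacent pair to mark up and classify.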
Input format
She [E1]dropped[/E1] her phone. The screen [E2]cracked[/E2].
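One way to produce this format from raw text and two character-offset spans is sketched below. The helper `wrap_pair` is hypothetical (it is not part of this repo) and assumes the spans are non-empty, ordered, and non-overlapping, with exclusive end offsets.

```python
from typing import Tuple


def wrap_pair(text: str, first: Tuple[int, int], second: Tuple[int, int]) -> str:
    """Wrap two (start, end) spans in [E1]...[/E1] and [E2]...[/E2] markers."""
    (s1, e1), (s2, e2) = first, second
    if not (0 <= s1 < e1 <= s2 < e2 <= len(text)):
        raise ValueError("spans must be non-empty, in order, and non-overlapping")
    return (
        text[:s1] + "[E1]" + text[s1:e1] + "[/E1]"
        + text[e1:s2] + "[E2]" + text[s2:e2] + "[/E2]"
        + text[e2:]
    )


raw = "She dropped her phone. The screen cracked."
i = raw.index("dropped")
j = raw.index("cracked")
marked = wrap_pair(raw, (i, i + len("dropped")), (j, j + len("cracked")))
print(marked)  # She [E1]dropped[/E1] her phone. The screen [E2]cracked[/E2].
```

Building the string in a single pass from the original offsets avoids the offset shift that occurs if the first pair of markers is inserted before the second span is located.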
Loading
Download model.pt and tokenizer/ from this repo, then:
import torch
from transformers import AutoModel, AutoTokenizer
from torch import nn

ENTITY_MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]


class EventRelationRoBERTa(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        self.temporal_head = nn.Linear(hidden, 1)
        self.causal_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        cls = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0, :]
        return self.temporal_head(cls), self.causal_head(cls)


tokenizer = AutoTokenizer.from_pretrained("tokenizer/")
model = EventRelationRoBERTa("roberta-base")
model.load_state_dict(torch.load("model.pt", map_location="cpu", weights_only=True))
model.eval()
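If you rebuild the tokenizer yourself, a quick check like the one below can show how each marker string is tokenized. It assumes the markers were registered as single added tokens during training, which this card does not state explicitly, so treat a non-empty report as a prompt to compare against the provided tokenizer/ rather than as proof of a mismatch. The helper name `split_markers` is illustrative; it only relies on the tokenizer's `tokenize` method.

```python
ENTITY_MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]


def split_markers(tokenizer, markers=ENTITY_MARKERS):
    """Return {marker: pieces} for every marker that does not map to exactly one token."""
    report = {}
    for marker in markers:
        pieces = tokenizer.tokenize(marker)
        if len(pieces) != 1:
            report[marker] = pieces
    return report


# Usage with the tokenizer loaded above:
# bad = split_markers(tokenizer)
# if bad:
#     print("markers not treated as single tokens:", bad)
```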
Inference
# 1. run your event detector, pick an adjacent span pair
# 2. wrap the two spans with the markers:
text = "She [E1]dropped[/E1] her phone. The screen [E2]cracked[/E2]."
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    temporal_logit, causal_logit = model(enc["input_ids"], enc["attention_mask"])
temporal = torch.sigmoid(temporal_logit).item() # P(temporal/sequential relation)
causal = torch.sigmoid(causal_logit).item() # P(causal relation)
print(f"temporal={temporal:.2f} causal={causal:.2f}")
To reproduce the paper's passage-level scores, run this over every adjacent event pair in a passage: temporal sequencing is the fraction of pairs with a temporal relation, and causal density is the fraction with a causal relation. Because per-pair predictions are noisy, these are most meaningful averaged over many pairs/passages.
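The aggregation described above can be sketched as follows. The 0.5 decision threshold is an assumption (the card does not state the cutoff used in the paper), the function name `passage_scores` is illustrative, and the probabilities in the example are made up for demonstration rather than model outputs.

```python
from typing import Dict, List, Tuple


def passage_scores(
    pair_probs: List[Tuple[float, float]], threshold: float = 0.5
) -> Dict[str, float]:
    """Fraction of adjacent pairs whose temporal / causal probability meets the threshold.

    pair_probs holds one (P(temporal), P(causal)) tuple per adjacent event pair.
    """
    if not pair_probs:
        # No event pairs in the passage: the fractions are undefined.
        return {"temporal_sequencing": float("nan"), "causal_density": float("nan")}
    n = len(pair_probs)
    temporal_hits = sum(1 for t, _ in pair_probs if t >= threshold)
    causal_hits = sum(1 for _, c in pair_probs if c >= threshold)
    return {
        "temporal_sequencing": temporal_hits / n,
        "causal_density": causal_hits / n,
    }


# Illustrative (made-up) per-pair probabilities for one passage.
probs = [(0.97, 0.81), (0.97, 0.12), (0.40, 0.05), (0.88, 0.60)]
scores = passage_scores(probs)
print(scores)  # {'temporal_sequencing': 0.75, 'causal_density': 0.5}
```

For corpus-level signals, these per-passage fractions would then be averaged over many passages, in line with the recommended use above.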
Config
{
  "model_name": "roberta-base",
  "max_len": 256,
  "dims": [
    "temporal_sequential",
    "causal"
  ],
  "data_source": "gemma-4-31b-it pseudo-labels (internal)",
  "n_train": 6219,
  "n_val": 690,
  "val_frac": 0.1,
  "best_epoch": 4,
  "seed": 42,
  "test_f1_gold": 0.805
}
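A short sketch of how these fields relate to the inference code: `max_len` matches the `max_length=256` passed to the tokenizer, and `dims` appears to list the two heads in the order the model returns them (temporal first, then causal), though that ordering is an inference from the code above rather than something the config guarantees. The sketch also checks that `n_val` is consistent with `val_frac`, assuming `n_train` and `n_val` are the two parts of a single split.

```python
import json

# Config copied from above; inlined so the example does not assume a file name.
config_text = """
{
  "model_name": "roberta-base",
  "max_len": 256,
  "dims": ["temporal_sequential", "causal"],
  "data_source": "gemma-4-31b-it pseudo-labels (internal)",
  "n_train": 6219,
  "n_val": 690,
  "val_frac": 0.1,
  "best_epoch": 4,
  "seed": 42,
  "test_f1_gold": 0.805
}
"""
config = json.loads(config_text)

max_length = config["max_len"]   # value to pass as the tokenizer's max_length
head_names = config["dims"]      # assumed order of the two output logits

# Sanity check: validation share of the pseudo-labelled pairs.
total = config["n_train"] + config["n_val"]
observed_val_frac = config["n_val"] / total

print(f"max_length={max_length}, heads={head_names}")
print(f"validation share = {observed_val_frac:.4f} (config val_frac = {config['val_frac']})")
```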
Model tree for teagrjohnson/narrative-event-relation-roberta
Base model
FacebookAI/roberta-base