Paper: arXiv:2606.19468
Collection: Narratives in LLM Pretraining Data

narrative-event-relation-roberta

RoBERTa-base fine-tuned for event-relation classification. It has two binary heads for a pair of event spans that you mark in the text with [E1]…[/E1] and [E2]…[/E2]:

temporal: whether the two events stand in a temporal/sequential relation
causal: whether the two events stand in a causal relation

Trained on LLM (Gemma) pseudo-labels and evaluated against held-out human gold. Part of the NarraBERT suite from Characterizing Narrative Content in Web-Scale LLM Pretraining Data.

Note: Extended model card with full training details coming soon.

⚠️ Performance & intended use

This initial version (v0.1) of the event-relation model is the weakest model in the NarraBERT suite, and its predictions should be treated as noisy.

Against held-out human gold (paper, Tab. A3), it reaches F1 ≈ 0.58 (temporal) and ≈ 0.68 (causal), a macro F1 of ≈ 0.63, which is below its Gemma teacher (≈ 0.78). The single test_f1_gold (0.805) in the config below is a weighted aggregate inflated by the dominant temporal class; the per-task figures are the better guide.
The gap is driven by severe class imbalance in the training labels: ~95% of event pairs are temporally related and ~75% are not causally related. The minority classes (non-temporal, causal) are therefore the least reliable.
Recommended use: aggregate, corpus-level signals (e.g., mean causal density over many passages), not high-stakes per-pair decisions. Individual predictions — especially minority-class ones — carry real noise.
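
To see why a single aggregate score can look healthy under this kind of imbalance, consider a hypothetical baseline that always predicts the majority label. The sketch below is illustrative arithmetic only: the 95/5 split mirrors the approximate temporal label ratio quoted above, the counts are made up, and none of the numbers are results from the paper. How test_f1_gold itself was aggregated is not specified here, so the support-weighted average is an assumption used purely for contrast with the macro average.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 for one class from raw counts; returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical evaluation set of 1000 pairs: 950 temporally related, 50 not.
n_pos, n_neg = 950, 50

# Baseline that labels every pair "temporal".
f1_pos = f1(tp=n_pos, fp=n_neg, fn=0)  # majority class scores very high
f1_neg = f1(tp=0, fp=0, fn=n_neg)      # minority class is never predicted

# Support-weighted average vs. unweighted (macro) average over the two classes.
weighted = (n_pos * f1_pos + n_neg * f1_neg) / (n_pos + n_neg)
macro = (f1_pos + f1_neg) / 2

print(f"F1(temporal)={f1_pos:.3f}  F1(non-temporal)={f1_neg:.3f}")
print(f"weighted={weighted:.3f}  macro={macro:.3f}")
```

The weighted figure stays high even though the baseline never identifies a single non-temporal pair, which is the pattern to keep in mind when reading any one-number summary for this model.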

Prerequisites: detect event spans first

This model classifies the relation between two event spans that you provide; it does not detect events itself. You must run an event-span detector first, then wrap an adjacent pair of spans in the marker tokens before calling this model.

The pipeline used in the paper:

1. Detect event-trigger spans with a DeBERTa event detector fine-tuned on LitBank (Sims et al., 2019); F1 ≈ 0.85 on our web-scale data.
2. Add verb spans via spaCy en_core_web_trf, discarding any that overlap a detected event span.
3. Select one adjacent span pair and wrap each span: [E1]…[/E1] for the first, [E2]…[/E2] for the second.
4. Run this model on the marked text.

Any reasonable event/trigger detector works; the only requirement is that the two spans are wrapped in the [E1]/[E2] markers exactly as during training. The provided tokenizer/ is set up to match that training format; if you reconstruct the tokenizer, make sure the four marker strings tokenize the same way they did in training.
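
The span-merging and pair-selection steps of this pipeline can be sketched in plain Python. This is a minimal illustration, not the paper's code: it assumes spans are half-open character offsets, that "adjacent" means consecutive in textual order, and that the detector and spaCy outputs have already been converted to offset tuples. The helper names (overlaps, merge_spans, adjacent_pairs, span_of) are hypothetical.

```python
from typing import Iterator, List, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)

def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def merge_spans(event_spans: List[Span], verb_spans: List[Span]) -> List[Span]:
    """Keep every detected event span; add only verb spans that overlap none of them."""
    extra = [v for v in verb_spans if not any(overlaps(v, e) for e in event_spans)]
    return sorted(set(event_spans) | set(extra))

def adjacent_pairs(spans: List[Span]) -> Iterator[Tuple[Span, Span]]:
    """Yield consecutive spans (in textual order) as (first, second) pairs."""
    for first, second in zip(spans, spans[1:]):
        yield first, second

text = "She dropped her phone. The screen cracked and she sighed."

def span_of(word: str) -> Span:
    """Offsets of the first occurrence of `word` in `text` (toy stand-in for a detector)."""
    start = text.index(word)
    return (start, start + len(word))

events = [span_of("dropped")]                                       # as if from the event detector
verbs = [span_of("dropped"), span_of("cracked"), span_of("sighed")]  # as if from spaCy
spans = merge_spans(events, verbs)
pairs = list(adjacent_pairs(spans))
print(pairs)
```

The duplicate "dropped" verb span is discarded because it overlaps a detected event span, leaving three spans and therefore two adjacent pairs to score.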

Input format

She [E1]dropped[/E1] her phone. The screen [E2]cracked[/E2].
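
A marked string like the one above can be built from raw text plus two character-offset spans. The helper below is a minimal sketch: wrap_event_pair is a hypothetical name, and it assumes half-open offsets with the first span ending before the second begins.

```python
from typing import Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)

def wrap_event_pair(text: str, first: Span, second: Span) -> str:
    """Wrap two non-overlapping spans in [E1]/[E2] markers; `first` must precede `second`."""
    (s1, e1), (s2, e2) = first, second
    if not (0 <= s1 < e1 <= s2 < e2 <= len(text)):
        raise ValueError("spans must be non-empty, in order, and non-overlapping")
    return (
        text[:s1] + "[E1]" + text[s1:e1] + "[/E1]"
        + text[e1:s2] + "[E2]" + text[s2:e2] + "[/E2]"
        + text[e2:]
    )

raw = "She dropped her phone. The screen cracked."
start1 = raw.index("dropped")
start2 = raw.index("cracked")
marked = wrap_event_pair(raw, (start1, start1 + len("dropped")), (start2, start2 + len("cracked")))
print(marked)  # She [E1]dropped[/E1] her phone. The screen [E2]cracked[/E2].
```

Building the string from offsets in a single pass avoids the index shifting that happens if the markers are inserted one at a time.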

Loading

Download model.pt and tokenizer/ from this repo, then:

import torch
from transformers import AutoModel, AutoTokenizer
from torch import nn
 
ENTITY_MARKERS = ["[E1]", "[/E1]", "[E2]", "[/E2]"]
 
class EventRelationRoBERTa(nn.Module):
    def __init__(self, model_name):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        hidden = self.backbone.config.hidden_size
        self.temporal_head = nn.Linear(hidden, 1)
        self.causal_head   = nn.Linear(hidden, 1)
    def forward(self, input_ids, attention_mask):
        cls = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0, :]
        return self.temporal_head(cls), self.causal_head(cls)
 
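tokenizer = AutoTokenizer.from_pretrained("tokenizer/")
model = EventRelationRoBERTa("roberta-base")
# Guard (assumption): if tokenizer/ carries the four markers as added tokens, the
# checkpoint's embedding matrix is larger than stock roberta-base. Resize so the
# shapes match before loading; the resize is skipped when the sizes already agree.
if len(tokenizer) != model.backbone.get_input_embeddings().num_embeddings:
    model.backbone.resize_token_embeddings(len(tokenizer))
model.load_state_dict(torch.load("model.pt", map_location="cpu", weights_only=True))
model.eval()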

Inference

# 1. run your event detector, pick an adjacent span pair
# 2. wrap the two spans with the markers:
text = "She [E1]dropped[/E1] her phone. The screen [E2]cracked[/E2]."
 
enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    temporal_logit, causal_logit = model(enc["input_ids"], enc["attention_mask"])
 
temporal = torch.sigmoid(temporal_logit).item()  # P(temporal/sequential relation)
causal   = torch.sigmoid(causal_logit).item()    # P(causal relation)
print(f"temporal={temporal:.2f}  causal={causal:.2f}")

To reproduce the paper's passage-level scores, run this over every adjacent event pair in a passage: temporal sequencing is the fraction of pairs with a temporal relation, and causal density is the fraction with a causal relation. Because per-pair predictions are noisy, these are most meaningful averaged over many pairs/passages.
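
The passage-level aggregation described above can be sketched as follows. This is an illustration under stated assumptions: passage_scores is a hypothetical helper, the 0.5 decision threshold is an assumption (this card does not state the threshold used in the paper), and the probabilities are made up rather than model outputs.

```python
from typing import Dict, List, Tuple

def passage_scores(
    pair_probs: List[Tuple[float, float]], threshold: float = 0.5
) -> Dict[str, float]:
    """Fraction of adjacent pairs whose temporal / causal probability reaches `threshold`."""
    if not pair_probs:
        raise ValueError("need at least one event pair")
    n = len(pair_probs)
    n_temporal = sum(1 for t, _ in pair_probs if t >= threshold)
    n_causal = sum(1 for _, c in pair_probs if c >= threshold)
    return {"temporal_sequencing": n_temporal / n, "causal_density": n_causal / n}

# Made-up (temporal, causal) probabilities for four adjacent pairs in one passage.
probs = [(0.97, 0.81), (0.97, 0.12), (0.40, 0.05), (0.88, 0.55)]
scores = passage_scores(probs)
print(scores)  # {'temporal_sequencing': 0.75, 'causal_density': 0.5}
```

In practice each tuple would come from the sigmoid outputs in the inference snippet, collected over every adjacent pair in the passage, and the resulting scores would then be averaged over many passages.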

Config

{
  "model_name": "roberta-base",
  "max_len": 256,
  "dims": [
    "temporal_sequential",
    "causal"
  ],
  "data_source": "gemma-4-31b-it pseudo-labels (internal)",
  "n_train": 6219,
  "n_val": 690,
  "val_frac": 0.1,
  "best_epoch": 4,
  "seed": 42,
  "test_f1_gold": 0.805
}
