              precision    recall  f1-score   support

           O       0.99      0.95      0.97     75421
      B-SUBJ       0.45      0.42      0.43       445
      I-SUBJ       0.46      0.67      0.55      2120
       B-OBJ       0.45      0.45      0.45       461
       I-OBJ       0.45      0.81      0.58      2321

    accuracy                           0.93     80768
   macro avg       0.56      0.66      0.60     80768
weighted avg       0.95      0.93      0.94     80768
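
# The two average rows above differ only in how classes are weighted. A minimal
# sketch that recomputes them from the per-class rows; the helper names
# (macro_avg, weighted_avg) are illustrative and not part of the original run.
# Because the inputs are already rounded to two decimals, the results agree
# with the report only up to rounding.

```python
# Per-class rows copied from the report above: (precision, recall, f1, support).
rows = {
    "O":      (0.99, 0.95, 0.97, 75421),
    "B-SUBJ": (0.45, 0.42, 0.43, 445),
    "I-SUBJ": (0.46, 0.67, 0.55, 2120),
    "B-OBJ":  (0.45, 0.45, 0.45, 461),
    "I-OBJ":  (0.45, 0.81, 0.58, 2321),
}


def macro_avg(rows, col):
    """Unweighted mean of one metric column: every class counts equally."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(rows, col):
    """Support-weighted mean of one metric column: frequent classes dominate."""
    total = sum(r[3] for r in rows.values())
    return sum(r[col] * r[3] for r in rows.values()) / total


total_support = sum(r[3] for r in rows.values())
macro_f1 = macro_avg(rows, 2)
weighted_f1 = weighted_avg(rows, 2)

print(f"support={total_support} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

# The O class holds 75421 of the 80768 tokens, so the weighted average mostly
# reflects O, while the macro average is pulled down by the four entity classes.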

# Notes
# Best dev macro F1: 0.7000 (epoch 6)
# Model: dslim/bert-large-NER, 10 epochs, BIO token classification
# Train/Dev/Test rows: 6791 / 627 / 631
# Label scheme: O, B-SUBJ, I-SUBJ, B-OBJ, I-OBJ
# Known token-level pattern: I-class F1 (0.55, 0.58) > B-class F1 (0.43, 0.45), so
# span starts are less reliable than span interiors and spans may be off by one
# token at boundaries. Resolver should expand spans to token boundaries when
# matching to coref clusters by char-offset overlap.
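
# A minimal sketch of the resolver step described in the note above, under
# stated assumptions: spans and tokens are half-open character offsets
# [start, end), coref clusters are lists of mention offsets keyed by an integer
# id, and the cluster with the largest character overlap wins. All names here
# (expand_to_token_boundaries, match_cluster, the sample offsets) are
# hypothetical; the note only specifies expansion to token boundaries and
# matching by char-offset overlap.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def expand_to_token_boundaries(span: Span, token_offsets: List[Span]) -> Span:
    """Widen a span so it covers every token it touches in full."""
    start, end = span
    touched = [(s, e) for s, e in token_offsets if s < end and e > start]
    if not touched:
        return span  # span touches no token; leave it unchanged
    return (min(s for s, _ in touched), max(e for _, e in touched))


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two half-open spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def match_cluster(span: Span, clusters: Dict[int, List[Span]]) -> Optional[int]:
    """Return the id of the cluster with the largest mention overlap, or None."""
    best_id, best_overlap = None, 0
    for cluster_id, mentions in clusters.items():
        for mention in mentions:
            shared = overlap(span, mention)
            if shared > best_overlap:
                best_id, best_overlap = cluster_id, shared
    return best_id


# Illustrative offsets for "The old senator praised her" (not from the dataset).
tokens = [(0, 3), (4, 7), (8, 15), (16, 23), (24, 27)]
clusters = {0: [(0, 15), (24, 27)], 1: [(16, 23)]}

predicted = (5, 15)  # starts mid-token and misses the leading "The"
expanded = expand_to_token_boundaries(predicted, tokens)
cluster_id = match_cluster(expanded, clusters)
```

# Expansion repairs character-level drift inside a token; the overlap match is
# what tolerates a span that is short by a whole token, since the predicted
# span still shares most of its characters with the gold mention.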