              precision    recall  f1-score   support

           O       0.99      0.95      0.97     75421
      B-SUBJ       0.45      0.42      0.43       445
      I-SUBJ       0.46      0.67      0.55      2120
       B-OBJ       0.45      0.45      0.45       461
       I-OBJ       0.45      0.81      0.58      2321

    accuracy                           0.93     80768
   macro avg       0.56      0.66      0.60     80768
weighted avg       0.95      0.93      0.94     80768

# Notes
# Best dev macro F1: 0.7000 (epoch 6)
# Model: dslim/bert-large-NER, 10 epochs, BIO token classification
# Train/Dev/Test rows: 6791 / 627 / 631
# Label scheme: O, B-SUBJ, I-SUBJ, B-OBJ, I-OBJ
# Known span-level pattern: I-class F1 > B-class F1, so spans may be off-by-one
# at boundaries. Resolver should expand spans to token boundaries when matching
# to coref clusters by char-offset overlap.
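#
# Resolver sketch (illustrative only, not the pipeline's actual code): one way
# the boundary expansion and overlap matching described above could look.
# Assumptions: spans and coref mentions are half-open char offsets [start, end);
# token_offsets holds word-level char offsets (for example, derived from a fast
# tokenizer's offset mapping); a span is assigned to the cluster whose mention
# overlaps it by the most characters. The names expand_to_token_boundaries,
# char_overlap and match_cluster are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open character offsets [start, end)


def expand_to_token_boundaries(span: Span, token_offsets: List[Span]) -> Span:
    """Widen a span so it fully covers every token it touches."""
    start, end = span
    touched = [(s, e) for s, e in token_offsets if s < end and e > start]
    if not touched:
        return span  # no token overlaps the span; leave it unchanged
    return (min(s for s, _ in touched), max(e for _, e in touched))


def char_overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two half-open spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def match_cluster(span: Span, clusters: List[List[Span]]) -> Optional[int]:
    """Index of the cluster with the largest single-mention overlap, or None."""
    best_idx, best_score = None, 0
    for idx, mentions in enumerate(clusters):
        score = max((char_overlap(span, m) for m in mentions), default=0)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# Toy example (made-up sentence, not taken from the dataset).
text = "The old senator praised her."
token_offsets = [(0, 3), (4, 7), (8, 15), (16, 23), (24, 27), (27, 28)]
clusters = [[(0, 15)], [(24, 27)]]  # "The old senator", "her"

predicted = (9, 15)  # predicted span clipped at its left boundary
expanded = expand_to_token_boundaries(predicted, token_offsets)
cluster_id = match_cluster(expanded, clusters)
print(expanded, cluster_id)  # (8, 15) 0
```

# Expanding before matching means a span clipped inside a token still covers the
# whole token when overlap is computed. A minimum-overlap threshold could be
# added if short accidental overlaps turn out to be a problem.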