| label        | precision | recall | f1-score | support |
| O            | 0.99      | 0.95   | 0.97     | 75421   |
| B-SUBJ       | 0.45      | 0.42   | 0.43     | 445     |
| I-SUBJ       | 0.46      | 0.67   | 0.55     | 2120    |
| B-OBJ        | 0.45      | 0.45   | 0.45     | 461     |
| I-OBJ        | 0.45      | 0.81   | 0.58     | 2321    |
| accuracy     |           |        | 0.93     | 80768   |
| macro avg    | 0.56      | 0.66   | 0.60     | 80768   |
| weighted avg | 0.95      | 0.93   | 0.94     | 80768   |
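The gap between the macro and weighted averages follows directly from the class imbalance: the O tag holds most of the support, so it dominates the weighted figures, while the macro figures weight all five tags equally. A minimal sketch that recomputes both averages from the per-class rows is shown here; the names `rows`, `macro_avg`, and `weighted_avg` are illustrative and not part of the original run.

```python
# Per-class rows copied from the report: (precision, recall, f1, support).
rows = {
    "O":      (0.99, 0.95, 0.97, 75421),
    "B-SUBJ": (0.45, 0.42, 0.43, 445),
    "I-SUBJ": (0.46, 0.67, 0.55, 2120),
    "B-OBJ":  (0.45, 0.45, 0.45, 461),
    "I-OBJ":  (0.45, 0.81, 0.58, 2321),
}

PRECISION, RECALL, F1, SUPPORT = 0, 1, 2, 3


def macro_avg(col: int) -> float:
    """Unweighted mean over classes: every tag counts equally."""
    return sum(r[col] for r in rows.values()) / len(rows)


def weighted_avg(col: int) -> float:
    """Mean over classes weighted by support: O dominates."""
    total = sum(r[SUPPORT] for r in rows.values())
    return sum(r[col] * r[SUPPORT] for r in rows.values()) / total


total_support = sum(r[SUPPORT] for r in rows.values())
macro_f1 = macro_avg(F1)
weighted_f1 = weighted_avg(F1)

print(f"support={total_support} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

Because the inputs are already rounded to two decimals, the recomputed values agree with the report only to that precision. The weighted recall also matches the accuracy row, as expected for single-label token classification.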
| # Notes | |
| # Best dev macro F1: 0.7000 (epoch 6) | |
| # Model: dslim/bert-large-NER, 10 epochs, BIO token classification | |
| # Train/Dev/Test rows: 6791 / 627 / 631 | |
| # Label scheme: O, B-SUBJ, I-SUBJ, B-OBJ, I-OBJ | |
| # Known pattern affecting spans: I-tag F1 (0.55, 0.58) is higher than B-tag F1 (0.43, 0.45), | |
| # so predicted spans may be off by one token at their boundaries. The resolver should expand | |
| # each span to token boundaries before matching it to coref clusters by char-offset overlap. | |
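The resolver behaviour described in the notes could look roughly like the following. This is a sketch under stated assumptions, not the project's actual resolver: spans and tokens are assumed to be half-open character ranges `(start, end)`, coref clusters are assumed to be a mapping from a cluster id to a list of mention ranges, and the names `snap_to_tokens`, `overlap`, and `best_cluster` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Span = Tuple[int, int]  # half-open character range [start, end)


def snap_to_tokens(span: Span, token_offsets: List[Span]) -> Span:
    """Expand a character span outward to cover every token it touches."""
    start, end = span
    touched = [(s, e) for s, e in token_offsets if s < end and e > start]
    if not touched:
        return span  # no token overlaps; leave the span unchanged
    return (min(s for s, _ in touched), max(e for _, e in touched))


def overlap(a: Span, b: Span) -> int:
    """Number of characters shared by two half-open ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def best_cluster(
    span: Span,
    clusters: Dict[str, List[Span]],
    token_offsets: List[Span],
) -> Optional[str]:
    """Return the id of the cluster whose mention overlaps the span most.

    Both the predicted span and each mention are snapped to token
    boundaries first, so a boundary error inside a token does not
    reduce the measured overlap. Returns None when nothing overlaps.
    """
    snapped = snap_to_tokens(span, token_offsets)
    best_id, best_chars = None, 0
    for cluster_id, mentions in clusters.items():
        for mention in mentions:
            chars = overlap(snapped, snap_to_tokens(mention, token_offsets))
            if chars > best_chars:
                best_id, best_chars = cluster_id, chars
    return best_id


# Toy example (assumed data): "The old senator met her yesterday."
token_offsets = [(0, 3), (4, 7), (8, 15), (16, 19), (20, 23), (24, 33), (33, 34)]
clusters = {"c0": [(4, 15)], "c1": [(20, 23)]}

predicted_subj = (9, 15)  # starts one character late, inside "senator"
print(snap_to_tokens(predicted_subj, token_offsets))
print(best_cluster(predicted_subj, clusters, token_offsets))
```

Snapping only repairs boundary errors that fall inside a token. A span that misses a whole leading or trailing token still matches by overlap, but the overlap count is smaller, so a minimum-overlap threshold may be worth adding if clusters sit close together.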