permutans committed on
Commit 6d23e34 · verified · 1 Parent(s): e4690d4

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +103 -121
  2. config.json +0 -1
  3. model.safetensors +2 -2
README.md CHANGED
@@ -25,10 +25,10 @@ model-index:
25
  name: Marker Subtype Classification
26
  metrics:
27
  - type: f1
28
- value: 0.5320
29
  name: F1 (macro)
30
  - type: accuracy
31
- value: 0.517
32
  name: Accuracy
33
  ---
34
 
@@ -46,8 +46,9 @@ This is the finest level of the Havelock span classification hierarchy. Given a
46
  | Architecture | `BertForSequenceClassification` |
47
  | Task | Multi-class classification (71 classes) |
48
  | Max sequence length | 128 tokens |
49
- | Best F1 (macro) | **0.5320** |
50
- | Best Accuracy | **0.517** |
 
51
  | Parameters | ~109M |
52
 
53
  ## Usage
@@ -78,12 +79,12 @@ print(f"Marker subtype: {model.config.id2label[pred]}")
78
  | **Repetition & Pattern** | `anaphora`, `epistrophe`, `parallelism`, `tricolon`, `lexical_repetition`, `refrain` |
79
  | **Sound & Rhythm** | `alliteration`, `assonance`, `rhyme`, `rhythm` |
80
  | **Address & Interaction** | `vocative`, `imperative`, `second_person`, `inclusive_we`, `rhetorical_question`, `audience_response`, `phatic_check`, `phatic_filler` |
81
- | **Conjunction** | `polysyndeton`, `asyndeton`, `simple_conjunction`, `binomial_expression` |
82
  | **Formulas** | `discourse_formula`, `proverb`, `religious_formula`, `epithet` |
83
  | **Narrative** | `named_individual`, `specific_place`, `temporal_anchor`, `sensory_detail`, `embodied_action`, `everyday_example` |
84
  | **Performance** | `dramatic_pause`, `self_correction`, `conflict_frame`, `us_them`, `intensifier_doubling`, `antithesis` |
85
 
86
- ### Literate Subtypes (36)
87
 
88
  | Category | Subtypes |
89
  |----------|----------|
@@ -99,129 +100,111 @@ print(f"Marker subtype: {model.config.id2label[pred]}")
99
 
100
  ### Data
101
 
102
- Span-level annotations from the Havelock corpus. Each span carries a `marker_subtype` field. Only subtypes with ≥15 examples in the full dataset are included. The corpus draws from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages.
103
-
104
- A stratified 80/20 train/test split was used (random seed 42). The test set contains 4,608 spans.
105
 
106
  ### Hyperparameters
107
 
108
  | Parameter | Value |
109
  |-----------|-------|
110
- | Epochs | 10 |
111
- | Batch size | 256 |
112
- | Learning rate | 1.5e-4 |
113
- | Optimizer | AdamW |
114
- | LR schedule | Linear warmup (10% of total steps) |
115
  | Gradient clipping | 1.0 |
116
- | Loss | Cross-entropy with class weights (range 0.23–4.33) |
117
- | Min examples per class | 15 |
118
-
119
- ### Training Metrics
120
-
121
- | Epoch | Loss | Accuracy | F1 (macro) |
122
- |-------|------|----------|------------|
123
- | 1 | 3.7795 | 0.3249 | 0.1618 |
124
- | 2 | 2.3703 | 0.4918 | 0.4254 |
125
- | 3 | 1.5864 | 0.5139 | 0.4964 |
126
- | 4 | 1.0582 | 0.5195 | 0.5238 |
127
- | 5 | 0.6955 | 0.5189 | 0.5196 |
128
- | 6 | 0.4761 | 0.5148 | 0.5227 |
129
- | 7 | 0.3279 | 0.5178 | **0.5320** |
130
- | 8 | 0.2419 | 0.5119 | 0.5213 |
131
- | 9 | 0.1885 | 0.5206 | 0.5283 |
132
- | 10 | 0.1454 | 0.5169 | 0.5250 |
133
-
134
- Best checkpoint selected by F1 at epoch 7. Accuracy plateaus from epoch 3 onward while F1 continues improving through rare-class gains.
135
 
136
  ### Test Set Classification Report
137
 
138
  <details><summary>Click to expand per-class precision/recall/F1/support</summary>
139
-
140
  ```
141
  precision recall f1-score support
142
 
143
- abstract_noun 0.315 0.312 0.314 144
144
- additive_formal 0.478 0.423 0.449 26
145
- agent_demoted 0.909 0.645 0.755 31
146
- agentless_passive 0.533 0.543 0.538 105
147
- alliteration 0.632 0.400 0.490 30
148
- anaphora 0.526 0.466 0.494 88
149
- antithesis 0.641 0.806 0.714 31
150
- aside 0.261 0.218 0.238 55
151
- assonance 0.917 1.000 0.957 33
152
- asyndeton 0.677 0.700 0.689 30
153
- audience_response 0.808 0.700 0.750 30
154
- categorical_statement 0.329 0.245 0.281 98
155
- causal_chain 0.442 0.425 0.433 80
156
- causal_explicit 0.406 0.406 0.406 69
157
- citation 0.646 0.627 0.636 67
158
- conceptual_metaphor 0.298 0.233 0.262 73
159
- concessive 0.690 0.659 0.674 88
160
- concessive_connector 0.920 0.742 0.821 31
161
- conditional 0.620 0.684 0.650 155
162
- conflict_frame 0.833 0.806 0.820 31
163
- contrastive 0.463 0.543 0.500 116
164
- cross_reference 0.538 0.412 0.467 34
165
- definitional_move 0.300 0.308 0.304 39
166
- discourse_formula 0.559 0.565 0.562 276
167
- dramatic_pause 0.781 0.806 0.794 31
168
- embodied_action 0.333 0.362 0.347 69
169
- enumeration 0.607 0.600 0.604 85
170
- epistemic_hedge 0.491 0.554 0.521 101
171
- epistrophe 0.867 0.812 0.839 32
172
- epithet 0.424 0.519 0.467 27
173
- everyday_example 0.361 0.317 0.338 41
174
- evidential 0.526 0.556 0.541 54
175
- footnote_reference 0.615 0.533 0.571 15
176
- imperative 0.659 0.753 0.703 146
177
- inclusive_we 0.613 0.608 0.611 120
178
- institutional_subject 0.600 0.581 0.590 31
179
- intensifier_doubling 0.833 0.667 0.741 30
180
- lexical_repetition 0.486 0.564 0.522 94
181
- list_structure 0.286 0.278 0.282 36
182
- metadiscourse 0.320 0.276 0.296 87
183
- methodological_framing 0.269 0.219 0.241 32
184
- named_individual 0.364 0.436 0.397 55
185
- nested_clauses 0.370 0.310 0.338 87
186
- nominalization 0.377 0.433 0.403 120
187
- objectifying_stance 0.125 0.233 0.163 43
188
- parallelism 0.218 0.293 0.250 58
189
- phatic_check 0.636 0.667 0.651 21
190
- phatic_filler 0.333 0.400 0.364 30
191
- polysyndeton 0.964 0.844 0.900 32
192
- probability 0.574 0.551 0.562 49
193
- proverb 0.304 0.226 0.259 31
194
- qualified_assertion 0.219 0.233 0.226 60
195
- refrain 0.818 0.600 0.692 30
196
- relative_chain 0.558 0.504 0.530 115
197
- religious_formula 0.840 0.656 0.737 32
198
- rhetorical_question 0.686 0.745 0.714 161
199
- rhyme 0.480 0.375 0.421 32
200
- rhythm 0.778 0.875 0.824 32
201
- second_person 0.543 0.596 0.568 235
202
- self_correction 0.826 0.633 0.717 30
203
- sensory_detail 0.387 0.324 0.353 37
204
- simple_conjunction 0.222 0.195 0.208 41
205
- specific_place 0.526 0.385 0.444 26
206
- technical_abbreviation 0.278 0.263 0.270 19
207
- technical_term 0.615 0.466 0.530 161
208
- temporal_anchor 0.404 0.429 0.416 49
209
- temporal_embedding 0.438 0.519 0.475 81
210
- third_person_reference 0.788 0.839 0.812 31
211
- tricolon 0.607 0.567 0.586 30
212
- us_them 0.606 0.645 0.625 31
213
- vocative 0.643 0.621 0.632 58
214
-
215
- accuracy 0.517 4608
216
- macro avg 0.540 0.517 0.525 4608
217
- weighted avg 0.522 0.517 0.517 4608
218
  ```
219
 
220
  </details>
221
 
222
- **Top performing subtypes (F1 > 0.75):** `assonance` (0.957), `polysyndeton` (0.900), `epistrophe` (0.839), `rhythm` (0.824), `concessive_connector` (0.821), `conflict_frame` (0.820), `third_person_reference` (0.812), `dramatic_pause` (0.794), `agent_demoted` (0.755), `audience_response` (0.750).
223
 
224
- **Weakest subtypes (F1 < 0.25):** `objectifying_stance` (0.163), `simple_conjunction` (0.208), `qualified_assertion` (0.226), `aside` (0.238), `methodological_framing` (0.241), `parallelism` (0.250). These tend to be semantically diffuse classes that overlap heavily with neighbouring subtypes.
225
 
226
  ## Class Distribution
227
 
@@ -229,17 +212,16 @@ The test set exhibits significant imbalance across 71 classes:
229
 
230
  | Support Range | Classes | % of Total |
231
  |---------------|---------|------------|
232
- | >200 | 2 (`discourse_formula`, `second_person`) | 3% |
233
- | 100–200 | 11 | 15% |
234
- | 50–100 | 18 | 25% |
235
- | 25–50 | 41 | 57% |
236
 
237
  ## Limitations
238
 
239
- - **Accuracy plateau with F1 headroom**: Accuracy saturated around 0.52 from epoch 3 while F1 continued climbing through epoch 7, suggesting the model is still finding better decision boundaries for rare classes. Further training with LR decay or curriculum strategies may help.
240
- - **71-way classification on ~23k spans**: The data budget per class is thin, particularly for classes near the 15-example minimum. More data or class consolidation would help.
241
  - **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs `anaphora` vs `tricolon`; `epistemic_hedge` vs `qualified_assertion` vs `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
242
- - **Recall-precision tradeoff**: Many rare classes show high precision but lower recall (e.g., `polysyndeton`: P=0.964, R=0.844; `agent_demoted`: P=0.909, R=0.645), suggesting the model learns narrow prototypes but misses variation.
243
  - **Span-level only**: Requires pre-extracted spans. Does not detect boundaries.
244
  - **128-token context window**: Longer spans are truncated.
245
 
@@ -252,8 +234,8 @@ The 71 subtypes represent the full granularity of the Havelock taxonomy, operati
252
  | Model | Task | Classes | F1 |
253
  |-------|------|---------|-----|
254
  | [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
255
- | [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 25 | 0.449 |
256
- | **This model** | Fine-grained subtype | 71 | 0.532 |
257
  | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
258
  | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
259
 
@@ -273,4 +255,4 @@ The 71 subtypes represent the full granularity of the Havelock taxonomy, operati
273
 
274
  ---
275
 
276
- *Model version: da931b4a · Trained: February 2026*
 
25
  name: Marker Subtype Classification
26
  metrics:
27
  - type: f1
28
+ value: 0.500
29
  name: F1 (macro)
30
  - type: accuracy
31
+ value: 0.498
32
  name: Accuracy
33
  ---
34
 
 
46
  | Architecture | `BertForSequenceClassification` |
47
  | Task | Multi-class classification (71 classes) |
48
  | Max sequence length | 128 tokens |
49
+ | Test F1 (macro) | **0.500** |
50
+ | Test Accuracy | **0.498** |
51
+ | Missing labels (test) | 1/71 (`rhyme`) |
52
  | Parameters | ~109M |
53
 
54
  ## Usage
 
79
  | **Repetition & Pattern** | `anaphora`, `epistrophe`, `parallelism`, `tricolon`, `lexical_repetition`, `refrain` |
80
  | **Sound & Rhythm** | `alliteration`, `assonance`, `rhyme`, `rhythm` |
81
  | **Address & Interaction** | `vocative`, `imperative`, `second_person`, `inclusive_we`, `rhetorical_question`, `audience_response`, `phatic_check`, `phatic_filler` |
82
+ | **Conjunction** | `polysyndeton`, `asyndeton`, `simple_conjunction` |
83
  | **Formulas** | `discourse_formula`, `proverb`, `religious_formula`, `epithet` |
84
  | **Narrative** | `named_individual`, `specific_place`, `temporal_anchor`, `sensory_detail`, `embodied_action`, `everyday_example` |
85
  | **Performance** | `dramatic_pause`, `self_correction`, `conflict_frame`, `us_them`, `intensifier_doubling`, `antithesis` |
86
 
87
+ ### Literate Subtypes (35)
88
 
89
  | Category | Subtypes |
90
  |----------|----------|
 
100
 
101
  ### Data
102
 
103
+ Span-level annotations from the Havelock corpus with marker types normalized against a canonical taxonomy at build time. Each span carries a `marker_subtype` field. Only subtypes with ≥50 examples are included. A stratified 80/10/10 train/val/test split was used with swap-based optimization to balance label distributions across splits. The test set contains 2,357 spans.
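A minimal sketch of a stratified 80/10/10 split, assuming the labels are available as a plain list. The function name `stratified_split`, the seed, and the toy labels are illustrative and not taken from the Havelock training code, and the swap-based balancing step described above is not reproduced here.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, splitting each label separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        # Val and test sizes are rounded per label; the remainder goes to train.
        n_val = round(len(idxs) * fractions[1])
        n_test = round(len(idxs) * fractions[2])
        val.extend(idxs[:n_val])
        test.extend(idxs[n_val:n_val + n_test])
        train.extend(idxs[n_val + n_test:])
    return train, val, test


# Toy labels only: 100 spans for each of two subtypes.
labels = ["rhyme"] * 100 + ["proverb"] * 100
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))
```

Splitting per label keeps every subtype represented in each split, which matters when many classes sit close to the minimum example count.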
 
 
104
 
105
  ### Hyperparameters
106
 
107
  | Parameter | Value |
108
  |-----------|-------|
109
+ | Epochs | 20 |
110
+ | Batch size | 16 |
111
+ | Learning rate | 3e-5 |
112
+ | Optimizer | AdamW (weight decay 0.01) |
113
+ | LR schedule | Cosine with 10% warmup |
114
  | Gradient clipping | 1.0 |
115
+ | Loss | Focal loss (γ=2.0) + class weights |
116
+ | Mixout | 0.1 |
117
+ | Mixed precision | FP16 |
118
+ | Min examples per class | 50 |
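The loss row can be read as the standard focal loss with a per-class weight, FL = -w_y (1 - p_y)^γ log p_y. The sketch below computes it for a single example in plain Python. The helper name `focal_loss` and the example logits and weights are illustrative assumptions, and the actual training code may differ (for example, in how the class weights are normalised).

```python
import math


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one example: -w_y * (1 - p_y)**gamma * log(p_y)."""
    # Numerically stable log-softmax for the target class.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    log_p = logits[target] - log_sum
    p = math.exp(log_p)
    return -class_weights[target] * (1.0 - p) ** gamma * log_p


# Illustrative values only (3 classes instead of 71).
logits = [2.0, 0.5, -1.0]
weights = [1.0, 2.0, 0.5]
easy = focal_loss(logits, target=0, class_weights=weights)  # high-probability target
hard = focal_loss(logits, target=1, class_weights=weights)  # low-probability target
print(f"easy={easy:.4f} hard={hard:.4f}")
```

With γ=0 the expression reduces to class-weighted cross-entropy; larger γ shrinks the contribution of examples the model already classifies confidently.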
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
119
 
120
  ### Test Set Classification Report
121
 
122
  <details><summary>Click to expand per-class precision/recall/F1/support</summary>
 
123
  ```
124
  precision recall f1-score support
125
 
126
+ abstract_noun 0.376 0.364 0.370 88
127
+ additive_formal 0.455 0.417 0.435 12
128
+ agent_demoted 0.533 0.800 0.640 10
129
+ agentless_passive 0.542 0.456 0.495 57
130
+ alliteration 0.714 0.500 0.588 10
131
+ anaphora 0.490 0.585 0.533 41
132
+ antithesis 0.947 0.818 0.878 22
133
+ aside 0.225 0.243 0.234 37
134
+ assonance 0.926 1.000 0.962 25
135
+ asyndeton 0.583 0.500 0.538 14
136
+ audience_response 0.778 0.700 0.737 10
137
+ categorical_statement 0.209 0.450 0.286 20
138
+ causal_chain 0.425 0.405 0.415 42
139
+ causal_explicit 0.537 0.468 0.500 47
140
+ citation 0.794 0.587 0.675 46
141
+ conceptual_metaphor 0.176 0.077 0.107 39
142
+ concessive 0.617 0.644 0.630 45
143
+ concessive_connector 0.833 0.833 0.833 18
144
+ conditional 0.582 0.655 0.616 87
145
+ conflict_frame 0.588 0.667 0.625 15
146
+ contrastive 0.442 0.557 0.493 61
147
+ cross_reference 0.733 0.458 0.564 24
148
+ definitional_move 0.333 0.200 0.250 10
149
+ discourse_formula 0.485 0.424 0.452 118
150
+ dramatic_pause 0.875 0.700 0.778 10
151
+ embodied_action 0.271 0.310 0.289 42
152
+ enumeration 0.556 0.581 0.568 43
153
+ epistemic_hedge 0.206 0.500 0.292 14
154
+ epistrophe 0.778 0.875 0.824 16
155
+ epithet 0.385 0.417 0.400 12
156
+ everyday_example 0.278 0.179 0.217 28
157
+ evidential 0.606 0.541 0.571 37
158
+ footnote_reference 0.444 0.400 0.421 10
159
+ imperative 0.628 0.590 0.608 100
160
+ inclusive_we 0.561 0.627 0.592 59
161
+ institutional_subject 0.947 0.857 0.900 21
162
+ intensifier_doubling 0.905 0.864 0.884 22
163
+ lexical_repetition 0.447 0.467 0.457 45
164
+ list_structure 0.190 0.174 0.182 23
165
+ metadiscourse 0.073 0.182 0.104 22
166
+ methodological_framing 0.500 0.238 0.323 21
167
+ named_individual 0.455 0.333 0.385 30
168
+ nested_clauses 0.294 0.326 0.309 46
169
+ nominalization 0.353 0.429 0.387 56
170
+ objectifying_stance 0.167 0.300 0.214 10
171
+ parallelism 0.188 0.222 0.203 27
172
+ phatic_check 0.444 0.364 0.400 11
173
+ phatic_filler 0.300 0.600 0.400 10
174
+ polysyndeton 1.000 0.833 0.909 24
175
+ probability 0.500 0.682 0.577 22
176
+ proverb 0.059 0.100 0.074 10
177
+ qualified_assertion 0.280 0.241 0.259 29
178
+ refrain 0.850 0.708 0.773 24
179
+ relative_chain 0.431 0.455 0.442 55
180
+ religious_formula 1.000 0.688 0.815 16
181
+ rhetorical_question 0.646 0.738 0.689 84
182
+ rhyme 0.000 0.000 0.000 10
183
+ rhythm 1.000 0.625 0.769 16
184
+ second_person 0.573 0.474 0.519 116
185
+ self_correction 0.952 0.500 0.656 40
186
+ sensory_detail 0.538 0.350 0.424 20
187
+ simple_conjunction 0.133 0.200 0.160 10
188
+ specific_place 0.625 0.278 0.385 18
189
+ technical_abbreviation 0.818 0.321 0.462 28
190
+ technical_term 0.438 0.432 0.435 74
191
+ temporal_anchor 0.472 0.500 0.486 34
192
+ temporal_embedding 0.475 0.604 0.532 48
193
+ third_person_reference 0.692 0.900 0.783 10
194
+ tricolon 0.667 0.667 0.667 18
195
+ us_them 0.750 0.500 0.600 18
196
+ vocative 0.414 0.600 0.490 20
197
+
198
+ accuracy 0.498 2357
199
+ macro avg 0.528 0.497 0.500 2357
200
+ weighted avg 0.525 0.498 0.502 2357
201
  ```
202
 
203
  </details>
204
 
205
+ **Top performing subtypes (F1 ≥ 0.75):** `assonance` (0.962), `polysyndeton` (0.909), `institutional_subject` (0.900), `intensifier_doubling` (0.884), `antithesis` (0.878), `concessive_connector` (0.833), `epistrophe` (0.824), `religious_formula` (0.815), `third_person_reference` (0.783), `dramatic_pause` (0.778), `refrain` (0.773), `rhythm` (0.769).
206
 
207
+ **Weakest subtypes (F1 < 0.20):** `rhyme` (0.000), `proverb` (0.074), `metadiscourse` (0.104), `simple_conjunction` (0.160), `list_structure` (0.182). These tend to be semantically diffuse classes that overlap heavily with neighbouring subtypes or have very low test support.
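To see why a zero-scoring class such as `rhyme` costs macro F1 more than weighted F1, the sketch below averages three rows copied from the test set report. Restricting to three rows is an illustrative simplification, so the results do not match the 71-class averages.

```python
# (subtype, f1, support) rows copied from the test set report; a 3-row subset only.
rows = [
    ("assonance", 0.962, 25),
    ("discourse_formula", 0.452, 118),
    ("rhyme", 0.000, 10),
]

# Macro average: every class counts equally, regardless of support.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: each class contributes in proportion to its support.
total_support = sum(n for _, _, n in rows)
weighted_f1 = sum(f1 * n for _, f1, n in rows) / total_support

print(f"macro={macro_f1:.3f} weighted={weighted_f1:.3f}")
```

Because `rhyme` has only 10 test spans, it barely moves the weighted average but pulls the macro average down by a full one-third share in this subset.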
208
 
209
  ## Class Distribution
210
 
 
212
 
213
  | Support Range | Classes | % of Total |
214
  |---------------|---------|------------|
215
+ | >100 | 3 (`discourse_formula`, `second_person`, `imperative`) | 4% |
216
+ | 50–100 | 11 | 15% |
217
+ | 25–50 | 26 | 37% |
218
+ | 10–25 | 31 | 44% |
219
 
220
  ## Limitations
221
 
222
+ - **71-way classification on ~22k spans**: The data budget per class is thin, particularly for classes near the 50-example minimum. More data or class consolidation would help.
 
223
  - **Semantic overlap**: Some subtypes are difficult to distinguish from surface text alone (e.g., `parallelism` vs `anaphora` vs `tricolon`; `epistemic_hedge` vs `qualified_assertion` vs `probability`). The model may benefit from hierarchical classification that conditions on type-level predictions.
224
+ - **Recall-precision tradeoff on rare classes**: Many rare classes show high precision but lower recall (e.g., `self_correction`: P=0.952, R=0.500; `religious_formula`: P=1.000, R=0.688), suggesting the model learns narrow prototypes but misses variation.
225
  - **Span-level only**: Requires pre-extracted spans. Does not detect boundaries.
226
  - **128-token context window**: Longer spans are truncated.
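As a worked example of the precision/recall gap, the counts below are inferred from the rounded report figures for `self_correction` (support 40, P=0.952, R=0.500). They are an assumption consistent with those figures, not values read from a confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw prediction counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Inferred counts: 20 of 40 gold spans found, with 1 false positive.
p, r, f1 = precision_recall_f1(tp=20, fp=1, fn=20)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Under these counts the model predicts `self_correction` rarely but almost always correctly, which is the narrow-prototype pattern described in the list above.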
227
 
 
234
  | Model | Task | Classes | F1 |
235
  |-------|------|---------|-----|
236
  | [`HavelockAI/bert-marker-category`](https://huggingface.co/HavelockAI/bert-marker-category) | Binary (oral/literate) | 2 | 0.875 |
237
+ | [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.583 |
238
+ | **This model** | Fine-grained subtype | 71 | 0.500 |
239
  | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
240
  | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
241
 
 
255
 
256
  ---
257
 
258
+ *Model version: b31f147d · Trained: February 2026*
config.json CHANGED
@@ -168,7 +168,6 @@
168
  "num_hidden_layers": 12,
169
  "pad_token_id": 0,
170
  "position_embedding_type": "absolute",
171
- "problem_type": "single_label_classification",
172
  "tie_word_embeddings": true,
173
  "transformers_version": "5.0.0",
174
  "type_vocab_size": 2,
 
168
  "num_hidden_layers": 12,
169
  "pad_token_id": 0,
170
  "position_embedding_type": "absolute",
 
171
  "tie_word_embeddings": true,
172
  "transformers_version": "5.0.0",
173
  "type_vocab_size": 2,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:908751e3d1db4b122a3c05ea81d50dfbef7cacda40e601e6b09905b8aa7fb99f
3
- size 438170868
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1ff78a23e1f73a3c2b1b41f7b253d652d236d03395d41483f87deba0000c9124
3
+ size 780277732