permutans committed
Commit da248ed · verified · 1 parent: 23afe9e

Upload folder using huggingface_hub

Files changed (5):
1. README.md +55 -22
2. config.json +66 -18
3. model.safetensors +2 -2
4. tokenizer.json +0 -0
5. tokenizer_config.json +7 -5
README.md CHANGED
@@ -2,7 +2,7 @@
 license: mit
 tags:
 - text-classification
-- bert
+- modernbert
 - orality
 - linguistics
 - rhetorical-analysis
@@ -12,7 +12,7 @@ metrics:
 - f1
 - accuracy
 base_model:
-- google-bert/bert-base-uncased
+- answerdotai/ModernBERT-base
 pipeline_tag: text-classification
 library_name: transformers
 datasets:
@@ -25,16 +25,16 @@ model-index:
 name: Oral/Literate Span Classification
 metrics:
 - type: f1
-  value: 0.835
+  value: 0.804
   name: F1 (macro)
 - type: accuracy
-  value: 0.858
+  value: 0.825
   name: Accuracy
 ---
 
 # Havelock Marker Category Classifier
 
-BERT-based binary classifier that determines whether a rhetorical span is **oral** or **literate**, grounded in Walter Ong's *Orality and Literacy* (1982).
+ModernBERT-based binary classifier that determines whether a rhetorical span is **oral** or **literate**, grounded in Walter Ong's *Orality and Literacy* (1982).
 
 This is the coarsest level of the Havelock span classification hierarchy. Given a text span that has been identified as a rhetorical marker, the model classifies it into one of two categories: oral (characteristic of spoken, performative discourse) or literate (characteristic of written, analytic discourse).
 
@@ -42,15 +42,15 @@ This is the coarsest level of the Havelock span classification hierarchy. Given
 
 | Property | Value |
 |----------|-------|
-| Base model | `bert-base-uncased` |
-| Architecture | `BertForSequenceClassification` |
+| Base model | `answerdotai/ModernBERT-base` |
+| Architecture | `ModernBertForSequenceClassification` |
 | Task | Binary classification |
 | Labels | 2 (`oral`, `literate`) |
 | Max sequence length | 128 tokens |
-| Test F1 (macro) | **0.835** |
-| Test Accuracy | **0.858** |
+| Test F1 (macro) | **0.804** |
+| Test Accuracy | **0.825** |
 | Missing labels | 0/2 |
-| Parameters | ~109M |
+| Parameters | ~149M |
 
 ## Usage
 ```python
@@ -76,7 +76,7 @@ print(f"Category: {label_map[pred]}")
 
 ### Data
 
-Span-level annotations from the Havelock corpus with marker types normalized against a canonical taxonomy at build time. Spans are drawn from documents sourced from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages. A stratified 80/10/10 train/val/test split was used with swap-based optimization. The test set contains 1,609 spans (1,162 oral, 447 literate).
+22,367 span-level annotations from the Havelock corpus with marker types normalized against a canonical taxonomy at build time. Spans are drawn from documents sourced from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages. A stratified 80/10/10 train/val/test split was used with swap-based optimization. The test set contains 1,609 spans (1,162 oral, 447 literate).
 
 ### Hyperparameters
 
@@ -84,7 +84,7 @@ Span-level annotations from the Havelock corpus with marker types normalized aga
 |-----------|-------|
 | Epochs | 20 |
 | Batch size | 16 |
-| Learning rate | 3e-5 |
+| Learning rate | 2e-5 |
 | Optimizer | AdamW (weight decay 0.01) |
 | LR schedule | Cosine with 10% warmup |
 | Gradient clipping | 1.0 |
@@ -92,19 +92,50 @@ Span-level annotations from the Havelock corpus with marker types normalized aga
 | Mixout | 0.1 |
 | Mixed precision | FP16 |
 
+### Training Metrics
+
+Best checkpoint selected at epoch 13 by missing-label-primary, F1-tiebreaker (0 missing, F1 0.850).
+
+<details><summary>Click to show per-epoch metrics</summary>
+
+| Epoch | Loss | Val F1 | F1 range |
+|-------|------|--------|----------|
+| 1 | 0.1231 | 0.815 | 0.786–0.843 |
+| 2 | 0.0785 | 0.829 | 0.795–0.863 |
+| 3 | 0.0599 | 0.835 | 0.804–0.866 |
+| 4 | 0.0457 | 0.816 | 0.788–0.844 |
+| 5 | 0.0356 | 0.826 | 0.794–0.857 |
+| 6 | 0.0290 | 0.834 | 0.787–0.881 |
+| 7 | 0.0235 | 0.836 | 0.802–0.869 |
+| 8 | 0.0188 | 0.837 | 0.799–0.876 |
+| 9 | 0.0175 | 0.840 | 0.805–0.875 |
+| 10 | 0.0162 | 0.839 | 0.802–0.875 |
+| 11 | 0.0115 | 0.834 | 0.796–0.872 |
+| 12 | 0.0103 | 0.836 | 0.801–0.870 |
+| **13** | **0.0097** | **0.850** | **0.812–0.887** |
+| 14 | 0.0086 | 0.827 | 0.794–0.861 |
+| 15 | 0.0075 | 0.835 | 0.799–0.871 |
+| 16 | 0.0074 | 0.828 | 0.794–0.862 |
+| 17 | 0.0071 | 0.830 | 0.796–0.863 |
+| 18 | 0.0073 | 0.840 | 0.804–0.877 |
+| 19 | 0.0068 | 0.843 | 0.806–0.880 |
+| 20 | 0.0070 | 0.844 | 0.808–0.880 |
+
+</details>
+
 ### Test Set Classification Report
 ```
               precision    recall  f1-score   support
 
-        oral      0.945     0.853     0.896      1162
-    literate      0.695     0.870     0.773       447
+        oral      0.953     0.798     0.868      1162
+    literate      0.631     0.897     0.741       447
 
-    accuracy                          0.858      1609
-   macro avg      0.820     0.862     0.835      1609
-weighted avg      0.875     0.858     0.862      1609
+    accuracy                          0.825      1609
+   macro avg      0.792     0.847     0.804      1609
+weighted avg      0.863     0.825     0.833      1609
 ```
 
-The model achieves high precision on oral spans (0.945) and high recall on literate spans (0.870). The precision gap on literate (0.695) indicates some oral spans are misclassified as literate — expected given the class imbalance (72% oral in test).
+The model achieves high precision on oral spans (0.953) and high recall on literate spans (0.897). The precision gap on literate (0.631) indicates some oral spans are misclassified as literate — expected given the class imbalance (72% oral in test).
 
 ## Limitations
 
@@ -121,9 +152,9 @@ The oral–literate distinction follows Ong's framework. Oral markers include fe
 
 | Model | Task | Classes | F1 |
 |-------|------|---------|-----|
-| **This model** | Binary (oral/literate) | 2 | 0.835 |
-| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.583 |
-| [`HavelockAI/bert-marker-subtype`](https://huggingface.co/HavelockAI/bert-marker-subtype) | Fine-grained subtype | 71 | 0.500 |
+| **This model** | Binary (oral/literate) | 2 | 0.804 |
+| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.573 |
+| [`HavelockAI/bert-marker-subtype`](https://huggingface.co/HavelockAI/bert-marker-subtype) | Fine-grained subtype | 71 | 0.493 |
 | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
 | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
@@ -140,7 +171,9 @@ The oral–literate distinction follows Ong's framework. Oral markers include fe
 ## References
 
 - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
+- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
+- Warner, A. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.
 
 ---
 
-*Model version: b31f147d · Trained: February 2026*
+*Trained: February 2026*
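As a quick consistency check on the updated classification report above, the aggregate rows can be recomputed from the per-class figures. A minimal sketch (numbers taken from the new README; tolerances allow for rounding in the report):

```python
# Per-class figures from the updated test-set classification report.
support = {"oral": 1162, "literate": 447}
f1 = {"oral": 0.868, "literate": 0.741}
recall = {"oral": 0.798, "literate": 0.897}

n = sum(support.values())                                    # 1609 test spans
macro_f1 = sum(f1.values()) / len(f1)                        # unweighted mean of class F1s
weighted_f1 = sum(f1[c] * support[c] for c in f1) / n        # support-weighted mean
# Overall accuracy: per-class correct counts are recall * support.
accuracy = sum(recall[c] * support[c] for c in recall) / n

assert abs(macro_f1 - 0.804) < 2e-3
assert abs(weighted_f1 - 0.833) < 2e-3
assert abs(accuracy - 0.825) < 2e-3
```

The reported macro F1 (0.804), weighted F1 (0.833), and accuracy (0.825) all reproduce from the per-class rows to within rounding.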
config.json CHANGED
@@ -1,30 +1,78 @@
 {
-  "add_cross_attention": false,
   "architectures": [
-    "BertForSequenceClassification"
+    "ModernBertForSequenceClassification"
   ],
-  "attention_probs_dropout_prob": 0.1,
-  "bos_token_id": null,
-  "classifier_dropout": null,
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 50281,
+  "classifier_activation": "gelu",
+  "classifier_bias": false,
+  "classifier_dropout": 0.0,
+  "classifier_pooling": "mean",
+  "cls_token_id": 50281,
+  "decoder_bias": true,
+  "deterministic_flash_attn": false,
   "dtype": "float32",
-  "eos_token_id": null,
+  "embedding_dropout": 0.0,
+  "eos_token_id": 50282,
+  "global_attn_every_n_layers": 3,
   "gradient_checkpointing": false,
-  "hidden_act": "gelu",
-  "hidden_dropout_prob": 0.1,
+  "hidden_activation": "gelu",
   "hidden_size": 768,
+  "initializer_cutoff_factor": 2.0,
   "initializer_range": 0.02,
-  "intermediate_size": 3072,
-  "is_decoder": false,
-  "layer_norm_eps": 1e-12,
-  "max_position_embeddings": 512,
-  "model_type": "bert",
+  "intermediate_size": 1152,
+  "layer_norm_eps": 1e-05,
+  "layer_types": [
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention"
+  ],
+  "local_attention": 128,
+  "max_position_embeddings": 8192,
+  "mlp_bias": false,
+  "mlp_dropout": 0.0,
+  "model_type": "modernbert",
+  "norm_bias": false,
+  "norm_eps": 1e-05,
   "num_attention_heads": 12,
-  "num_hidden_layers": 12,
-  "pad_token_id": 0,
+  "num_hidden_layers": 22,
+  "pad_token_id": 50283,
   "position_embedding_type": "absolute",
+  "repad_logits_with_grad": false,
+  "rope_parameters": {
+    "full_attention": {
+      "rope_theta": 160000.0,
+      "rope_type": "default"
+    },
+    "sliding_attention": {
+      "rope_theta": 10000.0,
+      "rope_type": "default"
+    }
+  },
+  "sep_token_id": 50282,
+  "sparse_pred_ignore_index": -100,
+  "sparse_prediction": false,
   "tie_word_embeddings": true,
   "transformers_version": "5.0.0",
-  "type_vocab_size": 2,
-  "use_cache": true,
-  "vocab_size": 30522
+  "vocab_size": 50368
 }
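The long `layer_types` array in the new config follows directly from `global_attn_every_n_layers: 3` and `num_hidden_layers: 22`: every third layer is full attention, the rest use sliding-window attention (`local_attention: 128`). A minimal sketch of the pattern (the modulo rule is inferred from the values shown here, not taken from the ModernBERT source):

```python
# Reconstruct layer_types from the two scalar config values above.
num_hidden_layers = 22
global_attn_every_n_layers = 3

layer_types = [
    # Layers 0, 3, 6, ... get full (global) attention; the rest are local.
    "full_attention" if i % global_attn_every_n_layers == 0 else "sliding_attention"
    for i in range(num_hidden_layers)
]

assert len(layer_types) == 22
assert layer_types[0] == layer_types[3] == layer_types[21] == "full_attention"
assert layer_types.count("full_attention") == 8   # matches the array in config.json
```

Note the list both starts and ends on a full-attention layer, since 21 is divisible by 3.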
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d59549d144944b5af2731a495cf0bc210fdfbda0825c20917d016f0f2921a121
-size 780065488
+oid sha256:0b38e8b6bcb2ea4652922602925a3a332772ab650ba7f1644b9b7e00e7d32d3d
+size 1039637504
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,14 +1,16 @@
 {
   "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
   "cls_token": "[CLS]",
-  "do_lower_case": true,
   "is_local": false,
   "mask_token": "[MASK]",
-  "model_max_length": 512,
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
-  "strip_accents": null,
-  "tokenize_chinese_chars": true,
-  "tokenizer_class": "BertTokenizer",
+  "tokenizer_class": "TokenizersBackend",
   "unk_token": "[UNK]"
 }
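The README in this commit lists the learning-rate schedule as "Cosine with 10% warmup" at a peak rate of 2e-5. A minimal sketch of that schedule, assuming linear warmup and cosine decay to zero (the trainer implementation itself is not part of this commit):

```python
import math

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Cosine LR schedule with linear warmup over the first 10% of steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

assert abs(lr_at(50, 1000) - 1e-5) < 1e-12    # halfway through warmup
assert abs(lr_at(100, 1000) - 2e-5) < 1e-12   # peak at end of warmup
assert lr_at(1000, 1000) < 1e-9               # decayed to ~0 at the end
```

This mirrors what `transformers.get_cosine_schedule_with_warmup` produces for `warmup_ratio=0.1`, up to step discretization.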