Upload folder using huggingface_hub

Files changed:
- README.md (+55 −22)
- config.json (+66 −18)
- model.safetensors (+2 −2)
- tokenizer.json (+0 −0)
- tokenizer_config.json (+7 −5)
README.md
CHANGED

````diff
@@ -2,7 +2,7 @@
 license: mit
 tags:
 - text-classification
-
+- modernbert
 - orality
 - linguistics
 - rhetorical-analysis
@@ -12,7 +12,7 @@ metrics:
 - f1
 - accuracy
 base_model:
-
+- answerdotai/ModernBERT-base
 pipeline_tag: text-classification
 library_name: transformers
 datasets:
@@ -25,16 +25,16 @@ model-index:
       name: Oral/Literate Span Classification
       metrics:
       - type: f1
-        value: 0.
+        value: 0.804
         name: F1 (macro)
       - type: accuracy
-        value: 0.
+        value: 0.825
         name: Accuracy
 ---
 
 # Havelock Marker Category Classifier
 
-
+ModernBERT-based binary classifier that determines whether a rhetorical span is **oral** or **literate**, grounded in Walter Ong's *Orality and Literacy* (1982).
 
 This is the coarsest level of the Havelock span classification hierarchy. Given a text span that has been identified as a rhetorical marker, the model classifies it into one of two categories: oral (characteristic of spoken, performative discourse) or literate (characteristic of written, analytic discourse).
 
@@ -42,15 +42,15 @@ This is the coarsest level of the Havelock span classification hierarchy. Given
 
 | Property | Value |
 |----------|-------|
-| Base model | `
-| Architecture | `
+| Base model | `answerdotai/ModernBERT-base` |
+| Architecture | `ModernBertForSequenceClassification` |
 | Task | Binary classification |
 | Labels | 2 (`oral`, `literate`) |
 | Max sequence length | 128 tokens |
-| Test F1 (macro) | **0.
-| Test Accuracy | **0.
+| Test F1 (macro) | **0.804** |
+| Test Accuracy | **0.825** |
 | Missing labels | 0/2 |
-| Parameters | ~
+| Parameters | ~149M |
 
 ## Usage
 ```python
@@ -76,7 +76,7 @@ print(f"Category: {label_map[pred]}")
 
 ### Data
 
-
+22,367 span-level annotations from the Havelock corpus with marker types normalized against a canonical taxonomy at build time. Spans are drawn from documents sourced from Project Gutenberg, textfiles.com, Reddit, and Wikipedia talk pages. A stratified 80/10/10 train/val/test split was used with swap-based optimization. The test set contains 1,609 spans (1,162 oral, 447 literate).
 
 ### Hyperparameters
 
@@ -84,7 +84,7 @@ Span-level annotations from the Havelock corpus with marker types normalized aga
 |-----------|-------|
 | Epochs | 20 |
 | Batch size | 16 |
-| Learning rate |
+| Learning rate | 2e-5 |
 | Optimizer | AdamW (weight decay 0.01) |
 | LR schedule | Cosine with 10% warmup |
 | Gradient clipping | 1.0 |
@@ -92,19 +92,50 @@ Span-level annotations from the Havelock corpus with marker types normalized aga
 | Mixout | 0.1 |
 | Mixed precision | FP16 |
 
+### Training Metrics
+
+Best checkpoint selected at epoch 13 by missing-label-primary, F1-tiebreaker (0 missing, F1 0.850).
+
+<details><summary>Click to show per-epoch metrics</summary>
+
+| Epoch | Loss | Val F1 | F1 range |
+|-------|------|--------|----------|
+| 1 | 0.1231 | 0.815 | 0.786–0.843 |
+| 2 | 0.0785 | 0.829 | 0.795–0.863 |
+| 3 | 0.0599 | 0.835 | 0.804–0.866 |
+| 4 | 0.0457 | 0.816 | 0.788–0.844 |
+| 5 | 0.0356 | 0.826 | 0.794–0.857 |
+| 6 | 0.0290 | 0.834 | 0.787–0.881 |
+| 7 | 0.0235 | 0.836 | 0.802–0.869 |
+| 8 | 0.0188 | 0.837 | 0.799–0.876 |
+| 9 | 0.0175 | 0.840 | 0.805–0.875 |
+| 10 | 0.0162 | 0.839 | 0.802–0.875 |
+| 11 | 0.0115 | 0.834 | 0.796–0.872 |
+| 12 | 0.0103 | 0.836 | 0.801–0.870 |
+| **13** | **0.0097** | **0.850** | **0.812–0.887** |
+| 14 | 0.0086 | 0.827 | 0.794–0.861 |
+| 15 | 0.0075 | 0.835 | 0.799–0.871 |
+| 16 | 0.0074 | 0.828 | 0.794–0.862 |
+| 17 | 0.0071 | 0.830 | 0.796–0.863 |
+| 18 | 0.0073 | 0.840 | 0.804–0.877 |
+| 19 | 0.0068 | 0.843 | 0.806–0.880 |
+| 20 | 0.0070 | 0.844 | 0.808–0.880 |
+
+</details>
+
 ### Test Set Classification Report
 ```
               precision    recall  f1-score   support
 
-        oral      0.
-    literate      0.
+        oral        0.953     0.798     0.868      1162
+    literate        0.631     0.897     0.741       447
 
-    accuracy      0.
-   macro avg      0.
-weighted avg      0.
+    accuracy                            0.825      1609
+   macro avg        0.792     0.847     0.804      1609
+weighted avg        0.863     0.825     0.833      1609
 ```
 
-The model achieves high precision on oral spans (0.
+The model achieves high precision on oral spans (0.953) and high recall on literate spans (0.897). The precision gap on literate (0.631) indicates some oral spans are misclassified as literate — expected given the class imbalance (72% oral in test).
 
 ## Limitations
 
@@ -121,9 +152,9 @@ The oral–literate distinction follows Ong's framework. Oral markers include fe
 
 | Model | Task | Classes | F1 |
 |-------|------|---------|-----|
-| **This model** | Binary (oral/literate) | 2 | 0.
+| **This model** | Binary (oral/literate) | 2 | 0.804 |
-| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.
+| [`HavelockAI/bert-marker-type`](https://huggingface.co/HavelockAI/bert-marker-type) | Functional type | 18 | 0.573 |
-| [`HavelockAI/bert-marker-subtype`](https://huggingface.co/HavelockAI/bert-marker-subtype) | Fine-grained subtype | 71 | 0.
+| [`HavelockAI/bert-marker-subtype`](https://huggingface.co/HavelockAI/bert-marker-subtype) | Fine-grained subtype | 71 | 0.493 |
 | [`HavelockAI/bert-orality-regressor`](https://huggingface.co/HavelockAI/bert-orality-regressor) | Document-level score | Regression | MAE 0.079 |
 | [`HavelockAI/bert-token-classifier`](https://huggingface.co/HavelockAI/bert-token-classifier) | Span detection (BIO) | 145 | 0.500 |
 
@@ -140,7 +171,9 @@ The oral–literate distinction follows Ong's framework. Oral markers include fe
 ## References
 
 - Ong, Walter J. *Orality and Literacy: The Technologizing of the Word*. Routledge, 1982.
+- Lee, C. et al. "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models." ICLR 2020.
+- Warner, A. et al. "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference." 2024.
 
 ---
 
-*
+*Trained: February 2026*
````
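The aggregate rows of the test-set classification report in the README can be recomputed from the per-class rows. This is a plain-arithmetic sanity check, with all numbers copied from the report (no model inference involved): macro averages are unweighted means over the two classes, weighted averages use the supports, and accuracy equals support-weighted recall in single-label classification.

```python
# Per-class rows copied from the README's test-set classification report.
report = {
    "oral":     {"precision": 0.953, "recall": 0.798, "f1": 0.868, "support": 1162},
    "literate": {"precision": 0.631, "recall": 0.897, "f1": 0.741, "support": 447},
}
n = sum(c["support"] for c in report.values())  # 1609 test spans

# Macro average: unweighted mean over classes.
macro_f1 = sum(c["f1"] for c in report.values()) / len(report)
# Weighted average: support-weighted mean over classes.
weighted_f1 = sum(c["f1"] * c["support"] for c in report.values()) / n
# Accuracy equals support-weighted recall for single-label classification.
accuracy = sum(c["recall"] * c["support"] for c in report.values()) / n

assert abs(macro_f1 - 0.804) < 1e-3
assert abs(weighted_f1 - 0.833) < 1e-3
assert abs(accuracy - 0.825) < 1e-3
```

The checks pass, so the aggregate rows are internally consistent with the per-class rows, including the 0.804 macro F1 reported in the model card metadata.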
config.json
CHANGED

```diff
@@ -1,30 +1,78 @@
 {
-  "add_cross_attention": false,
   "architectures": [
-    "
+    "ModernBertForSequenceClassification"
   ],
-  "
-  "
-  "
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": 50281,
+  "classifier_activation": "gelu",
+  "classifier_bias": false,
+  "classifier_dropout": 0.0,
+  "classifier_pooling": "mean",
+  "cls_token_id": 50281,
+  "decoder_bias": true,
+  "deterministic_flash_attn": false,
   "dtype": "float32",
-  "
+  "embedding_dropout": 0.0,
+  "eos_token_id": 50282,
+  "global_attn_every_n_layers": 3,
   "gradient_checkpointing": false,
-  "
-  "hidden_dropout_prob": 0.1,
+  "hidden_activation": "gelu",
   "hidden_size": 768,
+  "initializer_cutoff_factor": 2.0,
   "initializer_range": 0.02,
-  "intermediate_size":
-  "
-  "
+  "intermediate_size": 1152,
+  "layer_norm_eps": 1e-05,
+  "layer_types": [
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention",
+    "sliding_attention",
+    "sliding_attention",
+    "full_attention"
+  ],
+  "local_attention": 128,
+  "max_position_embeddings": 8192,
+  "mlp_bias": false,
+  "mlp_dropout": 0.0,
+  "model_type": "modernbert",
+  "norm_bias": false,
+  "norm_eps": 1e-05,
   "num_attention_heads": 12,
-  "num_hidden_layers":
-  "pad_token_id":
+  "num_hidden_layers": 22,
+  "pad_token_id": 50283,
   "position_embedding_type": "absolute",
+  "repad_logits_with_grad": false,
+  "rope_parameters": {
+    "full_attention": {
+      "rope_theta": 160000.0,
+      "rope_type": "default"
+    },
+    "sliding_attention": {
+      "rope_theta": 10000.0,
+      "rope_type": "default"
+    }
+  },
+  "sep_token_id": 50282,
+  "sparse_pred_ignore_index": -100,
+  "sparse_prediction": false,
   "tie_word_embeddings": true,
   "transformers_version": "5.0.0",
-  "
-  "use_cache": true,
-  "vocab_size": 30522
+  "vocab_size": 50368
 }
```
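The 22-entry `layer_types` list in the new config is not arbitrary: it follows from `global_attn_every_n_layers: 3`, meaning every third layer (counting from layer 0) uses full attention and the rest use the 128-token sliding window (`local_attention: 128`). A minimal sketch reconstructing the list from those two scalars, with the layer count and stride taken from the config above:

```python
# Derive ModernBERT's attention layout from the two config scalars.
num_hidden_layers = 22
global_attn_every_n_layers = 3

layer_types = [
    # Layer i is global (full) attention when i is a multiple of the stride,
    # otherwise it uses sliding-window (local) attention.
    "full_attention" if i % global_attn_every_n_layers == 0 else "sliding_attention"
    for i in range(num_hidden_layers)
]

assert len(layer_types) == 22
assert layer_types[0] == "full_attention"      # first entry in the config list
assert layer_types[-1] == "full_attention"     # last entry in the config list
assert layer_types.count("full_attention") == 8
```

The reconstructed list matches the explicit one in the diff: 8 full-attention layers interleaved with 14 sliding-window layers.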
model.safetensors
CHANGED

```diff
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:0b38e8b6bcb2ea4652922602925a3a332772ab650ba7f1644b9b7e00e7d32d3d
+size 1039637504
```
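The model.safetensors entry above is a Git LFS pointer, not the weights themselves: the repo stores only the hash algorithm, digest, and byte size, while the payload lives in LFS storage. A small sketch parsing the exact pointer shown in the diff (the field layout follows the git-lfs pointer spec referenced on its first line):

```python
# Parse the three-line git-lfs pointer from the diff above.
pointer = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:0b38e8b6bcb2ea4652922602925a3a332772ab650ba7f1644b9b7e00e7d32d3d\n"
    "size 1039637504\n"
)

# Each line is "key value"; split on the first space only.
fields = dict(line.split(" ", 1) for line in pointer.strip().splitlines())
algo, digest = fields["oid"].split(":", 1)
size_bytes = int(fields["size"])

assert algo == "sha256" and len(digest) == 64  # sha256 hex digest
assert size_bytes == 1039637504                # ~1.04 GB checkpoint
```

Tools that clone the repo without LFS installed see only these three lines, which is why the file shows as a +2/−2 text diff here.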
tokenizer.json
CHANGED

The diff for this file is too large to render. See raw diff.
tokenizer_config.json
CHANGED

```diff
@@ -1,14 +1,16 @@
 {
   "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
   "cls_token": "[CLS]",
-  "do_lower_case": true,
   "is_local": false,
   "mask_token": "[MASK]",
-  "
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
   "pad_token": "[PAD]",
   "sep_token": "[SEP]",
-  "
-  "tokenize_chinese_chars": true,
-  "tokenizer_class": "BertTokenizer",
+  "tokenizer_class": "TokenizersBackend",
   "unk_token": "[UNK]"
 }
```
|