davanstrien (HF Staff) committed
Commit 517f8c6 · verified · 1 Parent(s): ab7544d

Replace auto-generated card with full model card

Files changed (1)
  1. README.md +113 -43
README.md CHANGED
@@ -1,73 +1,143 @@
  ---
- library_name: transformers
  license: apache-2.0
  base_model: answerdotai/ModernBERT-base
  tags:
  - generated_from_trainer
  metrics:
- - accuracy
  - f1
- - precision
- - recall
  model-index:
  - name: dhd-demo
-   results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
 
- # dhd-demo

- This model is a fine-tuned version of [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 0.0949
- - Accuracy: 0.9832
- - F1: 0.9709
- - Precision: 0.9615
- - Recall: 0.9804
- - F1 Macro: 0.9796
- - Roc Auc: 0.9980

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters
 
- The following hyperparameters were used during training:
- - learning_rate: 5e-05
- - train_batch_size: 16
- - eval_batch_size: 32
- - seed: 42
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: linear
- - lr_scheduler_warmup_steps: 50
- - num_epochs: 4

- ### Training results

- | Training Loss | Epoch | Step | Validation Loss | Accuracy | F1 | Precision | Recall | F1 Macro | Roc Auc |
- |:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|:---------:|:------:|:--------:|:-------:|
- | 0.0856 | 1.0 | 101 | 0.1061 | 0.9553 | 0.9273 | 0.8644 | 1.0 | 0.9475 | 0.9960 |
- | 0.0353 | 2.0 | 202 | 0.0538 | 0.9777 | 0.9615 | 0.9434 | 0.9804 | 0.9729 | 0.9989 |
- | 0.0015 | 3.0 | 303 | 0.1310 | 0.9777 | 0.96 | 0.9796 | 0.9412 | 0.9722 | 0.9980 |
- | 0.0019 | 4.0 | 404 | 0.0949 | 0.9832 | 0.9709 | 0.9615 | 0.9804 | 0.9796 | 0.9980 |


- ### Framework versions

  - Transformers 5.7.0
- - Pytorch 2.11.0+cu130
  - Datasets 4.8.5
  - Tokenizers 0.22.2
 
  ---
  license: apache-2.0
  base_model: answerdotai/ModernBERT-base
+ datasets:
+ - biglam/on_the_books
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: text-classification
  tags:
+ - text-classification
+ - legal
+ - glam
+ - digital-humanities
+ - jim-crow
+ - north-carolina
+ - legislation
  - generated_from_trainer
  metrics:
  - f1
+ - accuracy
+ - roc_auc
  model-index:
  - name: dhd-demo
+   results:
+   - task:
+       type: text-classification
+       name: Text Classification
+     dataset:
+       name: biglam/on_the_books
+       type: biglam/on_the_books
+       split: train (held-out 10%)
+     metrics:
+     - type: accuracy
+       value: 0.9832
+     - type: f1
+       value: 0.9709
+     - type: precision
+       value: 0.9615
+     - type: recall
+       value: 0.9804
+     - type: f1_macro
+       value: 0.9796
+     - type: roc_auc
+       value: 0.9980
  ---
 
+ # dhd-demo: ModernBERT Jim Crow law classifier
+
+ Fine-tuned [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) on
+ [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) to classify North
+ Carolina session-law sections (1866–1967) as Jim Crow laws or not.
+
+ Built as a live demo for the *Digital Humanities & Discovery* webinar
+ (2026-05-05) showing end-to-end fine-tuning via `hf jobs`.
+
+ ## Labels
+
+ - `0` = `no_jim_crow`
+ - `1` = `jim_crow`
+
+ ## Training data
+
+ [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books): 1,785 expert-labeled chapter/section pairs from NC session
+ laws, 512 positive / 1,273 negative. Split 90/10 (stratified) for train/eval.
+ Class imbalance handled with inverse-frequency cross-entropy weights.
+
+ ## Training setup
 
+ | | |
+ |---|---|
+ | Base model | `answerdotai/ModernBERT-base` |
+ | Epochs | 4 |
+ | Batch size | 16 |
+ | Learning rate | 5e-5 |
+ | Warmup steps | 50 |
+ | Weight decay | 0.01 |
+ | Max sequence length | 1024 |
+ | Precision | bf16 |
+ | Loss | weighted cross-entropy |
+ | Seed | 42 |
+ | Hardware | 1× NVIDIA L4 (24 GB) via `hf jobs` |
+ | Train runtime | 223 s |

+ ## Evaluation (held-out 10% split, n=179)

+ | Metric | Value |
+ |---|---|
+ | Accuracy | 0.9832 |
+ | F1 (positive class) | 0.9709 |
+ | Precision | 0.9615 |
+ | Recall | 0.9804 |
+ | F1 (macro) | 0.9796 |
+ | ROC-AUC | 0.9980 |

+ ### Per-epoch results

+ | Epoch | Train loss | Val loss | Accuracy | F1 | Precision | Recall | ROC-AUC |
+ |------:|-----------:|---------:|---------:|----:|----------:|-------:|--------:|
+ | 1 | 0.0856 | 0.1061 | 0.9553 | 0.9273 | 0.8644 | 1.0000 | 0.9960 |
+ | 2 | 0.0353 | 0.0538 | 0.9777 | 0.9615 | 0.9434 | 0.9804 | 0.9989 |
+ | 3 | 0.0015 | 0.1310 | 0.9777 | 0.9600 | 0.9796 | 0.9412 | 0.9980 |
+ | 4 | 0.0019 | 0.0949 | **0.9832** | **0.9709** | 0.9615 | 0.9804 | 0.9980 |
 
+ ## Usage

+ ```python
+ from transformers import pipeline

+ clf = pipeline("text-classification", model="davanstrien/dhd-demo")
+ clf("All schools for the white and colored races shall be kept separate.")
+ ```

+ ## Limitations

+ - Trained on **North Carolina** laws, 1866–1967. Will not transfer cleanly to
+ other jurisdictions or modern legal language.
+ - The training labels reflect what named expert sources / project staff
+ flagged. The negative class is "not flagged," not "verified
+ non-discriminatory."
+ - OCR noise from period scans is present in training and will be present at
+ inference time on similar corpora.
+ - Eval set is small (n=179); treat the high metrics as encouraging but
+ bounded by sample size.
 
+ See the [dataset card](https://huggingface.co/datasets/biglam/on_the_books) for full
+ context, including the *Algorithms of Resistance* framing of the original
+ **On the Books** project at UNC Chapel Hill Libraries.

+ ## Citation

+ Please cite the original project:

+ > On the Books: Jim Crow and Algorithms of Resistance.
+ > University of North Carolina at Chapel Hill Libraries.
+ > https://onthebooks.lib.unc.edu (DOI: https://doi.org/10.17615/5c4g-sd44)

+ ## Framework versions

  - Transformers 5.7.0
+ - PyTorch 2.11.0+cu130
  - Datasets 4.8.5
  - Tokenizers 0.22.2
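
Note on the "inverse-frequency cross-entropy weights" mentioned under Training data: the card does not say how the weights were normalized or which split the counts came from. The sketch below is one common reading, not the author's confirmed method. It assumes the "balanced" formula `weight_c = N / (K * n_c)` and uses the full-dataset counts from the card (1,273 negative, 512 positive) purely for illustration. The helper name `inverse_frequency_weights` is hypothetical.

```python
# Class counts as reported in the card (0 = no_jim_crow, 1 = jim_crow).
# Assumption: full-dataset counts; the actual run may have used train-split counts.
class_counts = {0: 1273, 1: 512}


def inverse_frequency_weights(counts):
    """Return weight_c = N / (K * n_c), so rarer classes get larger weights.

    N is the total number of examples and K the number of classes.
    This normalization is an assumption; the card does not specify one.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(class_counts)
for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.4f}")
```

With this normalization each class contributes the same total weight to the loss. In PyTorch, weights like these could be passed in class-index order as a tensor to `torch.nn.CrossEntropyLoss(weight=...)`; whether the training script did exactly that is not stated in the card.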