---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
datasets:
- biglam/on_the_books
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- legal
- glam
- digital-humanities
- jim-crow
- north-carolina
- legislation
- generated_from_trainer
metrics:
- f1
- accuracy
- roc_auc
model-index:
- name: dhd-demo
results:
- task:
type: text-classification
name: Text Classification
dataset:
name: biglam/on_the_books
type: biglam/on_the_books
split: train (held-out 10%)
metrics:
- type: accuracy
value: 0.9832
- type: f1
value: 0.9709
- type: precision
value: 0.9615
- type: recall
value: 0.9804
- type: f1_macro
value: 0.9796
- type: roc_auc
value: 0.9980
---
# dhd-demo: ModernBERT Jim Crow law classifier
Fine-tuned [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) on
[`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) to classify North
Carolina session-law sections (1866–1967) as Jim Crow laws or not.
Built as a live demo for the *Digital Humanities & Discovery* webinar
(2026-05-05), demonstrating end-to-end fine-tuning with `hf jobs`.
## Labels
- `0` = `no_jim_crow`
- `1` = `jim_crow`
## Training data
[`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) — 1,785 expert-labeled chapter/section pairs from NC session
laws, 512 positive / 1,273 negative. Split 90/10 (stratified) for train/eval.
Class imbalance handled with inverse-frequency cross-entropy weights.
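The inverse-frequency weights follow directly from the label counts above. A minimal sketch (the counts come from this card; the code itself is illustrative, not the actual training script):

```python
import torch

# Label counts from the card: 1,273 negative (no_jim_crow), 512 positive (jim_crow).
counts = {0: 1273, 1: 512}
total = sum(counts.values())

# Inverse-frequency weighting: weight = total / (n_classes * class_count),
# so the rarer positive class contributes more to the loss.
weights = torch.tensor([total / (len(counts) * counts[c]) for c in sorted(counts)])

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```

With these counts the positive class ends up weighted roughly 2.5× the negative class.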
## Training setup
| Setting | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Max sequence length | 1024 |
| Precision | bf16 |
| Loss | weighted cross-entropy |
| Seed | 42 |
| Hardware | 1× NVIDIA L4 (24 GB) via `hf jobs` |
| Train runtime | 223 s |
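The hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a hedged sketch, not the exact training script; `output_dir` and the tokenization call in the comment are assumptions:

```python
from transformers import TrainingArguments

# Mirrors the training-setup table; output_dir is illustrative.
args = TrainingArguments(
    output_dir="dhd-demo",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    bf16=True,  # matches the L4 GPU used via `hf jobs`
    seed=42,
)

# The max sequence length (1024) is applied at tokenization time, e.g.:
# tokenizer(batch["text"], truncation=True, max_length=1024)
```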
## Evaluation (held-out 10% split, n=179)
| Metric | Value |
|---|---|
| Accuracy | 0.9832 |
| F1 (positive class) | 0.9709 |
| Precision | 0.9615 |
| Recall | 0.9804 |
| F1 (macro) | 0.9796 |
| ROC-AUC | 0.9980 |
### Per-epoch results
| Epoch | Train loss | Val loss | Accuracy | F1 | Precision | Recall | ROC-AUC |
|------:|-----------:|---------:|---------:|----:|----------:|-------:|--------:|
| 1 | 0.0856 | 0.1061 | 0.9553 | 0.9273 | 0.8644 | 1.0000 | 0.9960 |
| 2 | 0.0353 | 0.0538 | 0.9777 | 0.9615 | 0.9434 | 0.9804 | 0.9989 |
| 3 | 0.0015 | 0.1310 | 0.9777 | 0.9600 | 0.9796 | 0.9412 | 0.9980 |
| 4 | 0.0019 | 0.0949 | **0.9832** | **0.9709** | 0.9615 | 0.9804 | 0.9980 |
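Metrics like those in the tables above can be computed from predictions with scikit-learn. A sketch with toy labels standing in for the real 179-example eval set (not the actual eval run):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy data: y_score is P(jim_crow) from the model; threshold at 0.5.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in y_score]

acc = accuracy_score(y_true, y_pred)
f1_pos = f1_score(y_true, y_pred)                      # positive-class F1
f1_macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean over classes
auc = roc_auc_score(y_true, y_score)                   # uses scores, not hard labels
```

Note that ROC-AUC is computed from the continuous scores, while the other metrics use the thresholded predictions.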
## Usage
```python
from transformers import pipeline
clf = pipeline("text-classification", model="davanstrien/dhd-demo")
clf("All schools for the white and colored races shall be kept separate.")
```
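The pipeline returns the top label and its score. To see how raw logits map onto the two labels, a small sketch with hypothetical logits (the `id2label` mapping matches the Labels section above; a real run would take logits from the model's classification head):

```python
import torch

id2label = {0: "no_jim_crow", 1: "jim_crow"}

# Hypothetical logits for one section.
logits = torch.tensor([[-2.1, 3.4]])
probs = torch.softmax(logits, dim=-1)

pred = id2label[int(probs.argmax(dim=-1))]
score = float(probs.max())
```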
## Limitations
- Trained on **North Carolina** laws, 1866–1967. Will not transfer cleanly to
other jurisdictions or modern legal language.
- The training labels reflect what named expert sources / project staff
flagged. The negative class is "not flagged," not "verified
non-discriminatory."
- OCR noise from period scans is present in training and will be present at
inference time on similar corpora.
- Eval set is small (n=179); treat the high metrics as encouraging but
bounded by sample size.
See the [dataset card](https://huggingface.co/datasets/biglam/on_the_books) for full
context, including the *Algorithms of Resistance* framing of the original
**On the Books** project at UNC Chapel Hill Libraries.
## Citation
Please cite the original project:
> On the Books: Jim Crow and Algorithms of Resistance.
> University of North Carolina at Chapel Hill Libraries.
> https://onthebooks.lib.unc.edu — DOI: https://doi.org/10.17615/5c4g-sd44
## Framework versions
- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2