license: apache-2.0
base_model: answerdotai/ModernBERT-base
datasets:
  - biglam/on_the_books
language:
  - en
library_name: transformers
pipeline_tag: text-classification
tags:
  - text-classification
  - legal
  - glam
  - digital-humanities
  - jim-crow
  - north-carolina
  - legislation
  - generated_from_trainer
metrics:
  - f1
  - accuracy
  - roc_auc
model-index:
  - name: dhd-demo
    results:
      - task:
          type: text-classification
          name: Text Classification
        dataset:
          name: biglam/on_the_books
          type: biglam/on_the_books
          split: train (held-out 10%)
        metrics:
          - type: accuracy
            value: 0.9832
          - type: f1
            value: 0.9709
          - type: precision
            value: 0.9615
          - type: recall
            value: 0.9804
          - type: f1_macro
            value: 0.9796
          - type: roc_auc
            value: 0.998

dhd-demo: ModernBERT Jim Crow law classifier

This model is answerdotai/ModernBERT-base fine-tuned on biglam/on_the_books to classify North Carolina session-law sections (1866–1967) as Jim Crow laws or not.

It was built as a live demo for the Digital Humanities & Discovery webinar (2026-05-05) to show end-to-end fine-tuning via hf jobs.

Labels

0 = no_jim_crow
1 = jim_crow

Training data

biglam/on_the_books: 1,785 expert-labeled chapter/section pairs from NC session laws (512 positive, 1,273 negative). The data was split 90/10 (stratified) into train and eval sets. Class imbalance was handled with inverse-frequency class weights in the cross-entropy loss.
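The inverse-frequency weighting described above can be sketched as follows. This is a minimal illustration, not the training script: it assumes the common "balanced" heuristic, weight = N / (K × n_c), and it uses the full-dataset counts from this card rather than the exact counts of the 90% training split. The function name `inverse_frequency_weights` is hypothetical.

```python
def inverse_frequency_weights(counts):
    """Return one weight per class: total / (num_classes * class_count).

    Assumption: this is the "balanced" heuristic; the actual training
    script may normalize its weights differently.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


# Class counts as reported on this card (full dataset, before the 90/10 split).
counts = {"no_jim_crow": 1273, "jim_crow": 512}
weights = inverse_frequency_weights(counts)

for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
```

Under this assumption the minority class receives about 2.5 times the weight of the majority class (1273 / 512). In PyTorch, weights like these are typically passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`, ordered by label id.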

Training setup


| Setting | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Max sequence length | 1024 |
| Precision | bf16 |
| Loss | weighted cross-entropy |
| Seed | 42 |
| Hardware | 1× NVIDIA L4 (24 GB) via `hf jobs` |
| Train runtime | 223 s |
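To put the settings listed above in context, the sketch below works out how many optimizer steps the run would take and what share of them is warmup. The dictionary keys follow `transformers.TrainingArguments` field names, but the dictionary itself is illustrative and is not the original training configuration. The step counts rest on assumptions not stated on this card: a single GPU, no gradient accumulation, and a kept final partial batch.

```python
import math

# Hyperparameters copied from this card, keyed by the matching
# transformers.TrainingArguments field names.
hparams = {
    "num_train_epochs": 4,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
    "warmup_steps": 50,
    "weight_decay": 0.01,
    "bf16": True,
    "seed": 42,
}

# Assumptions: 1,785 examples minus the 179 held out for eval, one GPU,
# no gradient accumulation, and the last partial batch is kept.
train_examples = 1785 - 179
steps_per_epoch = math.ceil(train_examples / hparams["per_device_train_batch_size"])
total_steps = steps_per_epoch * hparams["num_train_epochs"]
warmup_fraction = hparams["warmup_steps"] / total_steps

print(steps_per_epoch, total_steps, round(warmup_fraction, 3))
```

Under these assumptions the run is about 400 optimizer steps, with roughly the first eighth spent warming up the learning rate.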

Evaluation (held-out 10% split, n=179)

| Metric | Value |
|---|---|
| Accuracy | 0.9832 |
| F1 (positive class) | 0.9709 |
| Precision | 0.9615 |
| Recall | 0.9804 |
| F1 (macro) | 0.9796 |
| ROC-AUC | 0.9980 |
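As a sanity check on how these metrics relate to each other, the sketch below recomputes them from a confusion matrix. The counts (50 true positives, 2 false positives, 1 false negative, 126 true negatives) are an assumption: they are not published on this card and were inferred as the counts that sum to n=179 and reproduce the reported values. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, positive-class F1, and macro F1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_pos = 2 * precision * recall / (precision + recall)

    # F1 for the negative class, treating "no_jim_crow" as the target.
    neg_precision = tn / (tn + fn)
    neg_recall = tn / (tn + fp)
    f1_neg = 2 * neg_precision * neg_recall / (neg_precision + neg_recall)

    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_pos,
        "f1_macro": (f1_pos + f1_neg) / 2,
    }


# Assumed counts, inferred from the reported metrics (not published on the card).
metrics = binary_metrics(tp=50, fp=2, fn=1, tn=126)

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

If the inferred counts are right, the model made three errors on the held-out split: one missed Jim Crow section and two false alarms. ROC-AUC is omitted because it depends on the per-example scores, not just the counts.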

Per-epoch results

| Epoch | Train loss | Val loss | Accuracy | F1 | Precision | Recall | ROC-AUC |
|---|---|---|---|---|---|---|---|
| 1 | 0.0856 | 0.1061 | 0.9553 | 0.9273 | 0.8644 | 1.0000 | 0.9960 |
| 2 | 0.0353 | 0.0538 | 0.9777 | 0.9615 | 0.9434 | 0.9804 | 0.9989 |
| 3 | 0.0015 | 0.1310 | 0.9777 | 0.9600 | 0.9796 | 0.9412 | 0.9980 |
| 4 | 0.0019 | 0.0949 | 0.9832 | 0.9709 | 0.9615 | 0.9804 | 0.9980 |

Usage

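```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hugging Face Hub.
clf = pipeline("text-classification", model="davanstrien/dhd-demo")

# Returns a list with one dict per input, holding the predicted label and its score.
result = clf("All schools for the white and colored races shall be kept separate.")
print(result)
```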
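Because the classifier is best treated as a triage aid, one option is to queue high-scoring positives for human review. The sketch below post-processes results shaped like the pipeline's output for a list of texts (one dict with `label` and `score` per input). The helper `needs_review`, the threshold, and the example scores are all hypothetical, and the label strings are assumed to match the Labels section above; the hosted model's config may expose different names such as `LABEL_1`.

```python
def needs_review(predictions, positive_label="jim_crow", threshold=0.5):
    """Return indices of inputs predicted positive with score >= threshold."""
    return [
        i
        for i, pred in enumerate(predictions)
        if pred["label"] == positive_label and pred["score"] >= threshold
    ]


# Hypothetical scores, shaped like pipeline output for three input texts.
example_predictions = [
    {"label": "jim_crow", "score": 0.98},
    {"label": "no_jim_crow", "score": 0.91},
    {"label": "jim_crow", "score": 0.55},
]

flagged = needs_review(example_predictions, threshold=0.9)
print(flagged)
```

A stricter threshold trades recall for fewer items to review; the right value depends on the corpus and is not something this card specifies.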

Limitations

Trained on North Carolina laws, 1866–1967. Will not transfer cleanly to other jurisdictions or modern legal language.
The training labels reflect what named expert sources / project staff flagged. The negative class is "not flagged," not "verified non-discriminatory."
OCR noise from period scans is present in training and will be present at inference time on similar corpora.
Eval set is small (n=179); treat the high metrics as encouraging but bounded by sample size.
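To make "bounded by sample size" concrete, the sketch below computes a 95% Wilson score interval for the reported accuracy. It assumes 176 correct predictions out of 179, which is the count implied by an accuracy of 0.9832 at n=179 but is not stated on this card. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed count: 176 of 179 correct, implied by the reported accuracy.
low, high = wilson_interval(successes=176, n=179)
print(f"{low:.3f} to {high:.3f}")
```

Under that assumption the interval runs from roughly 0.95 to 0.99, so the true accuracy on similar data could plausibly sit a few points below the headline figure.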

See the dataset card for full context, including the Algorithms of Resistance framing of the original On the Books project at UNC Chapel Hill Libraries.

Citation

Please cite the original project:

On the Books: Jim Crow and Algorithms of Resistance. University of North Carolina at Chapel Hill Libraries. https://onthebooks.lib.unc.edu — DOI: https://doi.org/10.17615/5c4g-sd44

Framework versions

Transformers 5.7.0
PyTorch 2.11.0+cu130
Datasets 4.8.5
Tokenizers 0.22.2