---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
datasets:
- biglam/on_the_books
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- legal
- glam
- digital-humanities
- jim-crow
- north-carolina
- legislation
- generated_from_trainer
metrics:
- f1
- accuracy
- roc_auc
model-index:
- name: dhd-demo
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: biglam/on_the_books
      type: biglam/on_the_books
      split: train (held-out 10%)
    metrics:
    - type: accuracy
      value: 0.9832
    - type: f1
      value: 0.9709
    - type: precision
      value: 0.9615
    - type: recall
      value: 0.9804
    - type: f1_macro
      value: 0.9796
    - type: roc_auc
      value: 0.9980
---

# dhd-demo: ModernBERT Jim Crow law classifier

Fine-tuned [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) on [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) to classify North Carolina session-law sections (1866–1967) as Jim Crow laws or not. Built as a live demo for the *Digital Humanities & Discovery* webinar (2026-05-05) showing end-to-end fine-tuning via `hf jobs`.

## Labels

- `0` = `no_jim_crow`
- `1` = `jim_crow`

## Training data

[`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) — 1,785 expert-labeled chapter/section pairs from NC session laws: 512 positive, 1,273 negative. Split 90/10 (stratified) for train/eval. Class imbalance handled with inverse-frequency cross-entropy weights.
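The exact weighting formula is not documented with the model; as a sketch, a common inverse-frequency scheme applied to the label counts above would look like this:

```python
import torch

# Label counts from the dataset card. The normalization below
# (weight_c = N / (K * n_c)) is one common inverse-frequency convention,
# assumed here rather than confirmed by the training script.
counts = {"no_jim_crow": 1273, "jim_crow": 512}
total = sum(counts.values())          # 1,785 examples
num_classes = len(counts)             # 2 classes

# Rarer classes receive proportionally larger weight.
weights = torch.tensor(
    [total / (num_classes * n) for n in counts.values()],
    dtype=torch.float32,
)

# Weighted cross-entropy: pass the weights to the loss used during training.
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```

With these counts the minority `jim_crow` class is weighted roughly 2.5× the majority class, which counteracts the ~1:2.5 imbalance.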
## Training setup

| Setting | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Max sequence length | 1024 |
| Precision | bf16 |
| Loss | weighted cross-entropy |
| Seed | 42 |
| Hardware | 1× NVIDIA L4 (24 GB) via `hf jobs` |
| Train runtime | 223 s |

## Evaluation (held-out 10% split, n=179)

| Metric | Value |
|---|---|
| Accuracy | 0.9832 |
| F1 (positive class) | 0.9709 |
| Precision | 0.9615 |
| Recall | 0.9804 |
| F1 (macro) | 0.9796 |
| ROC-AUC | 0.9980 |

### Per-epoch results

| Epoch | Train loss | Val loss | Accuracy | F1 | Precision | Recall | ROC-AUC |
|------:|-----------:|---------:|---------:|----:|----------:|-------:|--------:|
| 1 | 0.0856 | 0.1061 | 0.9553 | 0.9273 | 0.8644 | 1.0000 | 0.9960 |
| 2 | 0.0353 | 0.0538 | 0.9777 | 0.9615 | 0.9434 | 0.9804 | 0.9989 |
| 3 | 0.0015 | 0.1310 | 0.9777 | 0.9600 | 0.9796 | 0.9412 | 0.9980 |
| 4 | 0.0019 | 0.0949 | **0.9832** | **0.9709** | 0.9615 | 0.9804 | 0.9980 |

## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="davanstrien/dhd-demo")
clf("All schools for the white and colored races shall be kept separate.")
```

## Limitations

- Trained on **North Carolina** laws, 1866–1967. Will not transfer cleanly to other jurisdictions or modern legal language.
- The training labels reflect what named expert sources / project staff flagged. The negative class is "not flagged," not "verified non-discriminatory."
- OCR noise from period scans is present in training and will be present at inference time on similar corpora.
- The eval set is small (n=179); treat the high metrics as encouraging but bounded by sample size.

See the [dataset card](https://huggingface.co/datasets/biglam/on_the_books) for full context, including the *Algorithms of Resistance* framing of the original **On the Books** project at UNC Chapel Hill Libraries.
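As a sanity check on that small eval set, the headline metrics are mutually consistent with a single confusion matrix. The cell counts below are an inference from the reported values (51 positives, 128 negatives in a stratified 10% split), not figures published with the model:

```python
# Implied held-out confusion matrix (inferred, not released by the authors):
# 50 true positives, 1 false negative, 2 false positives, 126 true negatives.
tp, fn, fp, tn = 50, 1, 2, 126  # n = 179

precision = tp / (tp + fp)                            # 50/52  ≈ 0.9615
recall = tp / (tp + fn)                               # 50/51  ≈ 0.9804
f1 = 2 * precision * recall / (precision + recall)    #        ≈ 0.9709
accuracy = (tp + tn) / (tp + fn + fp + tn)            # 176/179 ≈ 0.9832
```

In other words, the reported numbers correspond to the model missing one Jim Crow section and over-flagging two neutral ones on the held-out split, which puts the metrics' precision (four decimals on 179 samples) in perspective.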
## Citation

Please cite the original project:

> On the Books: Jim Crow and Algorithms of Resistance.
> University of North Carolina at Chapel Hill Libraries.
> https://onthebooks.lib.unc.edu (DOI: https://doi.org/10.17615/5c4g-sd44)

## Framework versions

- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2