---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
datasets:
- biglam/on_the_books
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- legal
- glam
- digital-humanities
- jim-crow
- north-carolina
- legislation
- generated_from_trainer
metrics:
- f1
- accuracy
- roc_auc
model-index:
- name: dhd-demo
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      name: biglam/on_the_books
      type: biglam/on_the_books
      split: train (held-out 10%)
    metrics:
    - type: accuracy
      value: 0.9832
    - type: f1
      value: 0.9709
    - type: precision
      value: 0.9615
    - type: recall
      value: 0.9804
    - type: f1_macro
      value: 0.9796
    - type: roc_auc
      value: 0.9980
---

# dhd-demo: ModernBERT Jim Crow law classifier

Fine-tuned [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) on
[`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) to classify North
Carolina session-law sections (1866–1967) as Jim Crow laws or not.

Built as a live demo for the *Digital Humanities & Discovery* webinar
(2026-05-05) showing end-to-end fine-tuning via `hf jobs`.

## Labels

- `0` = `no_jim_crow`
- `1` = `jim_crow`

## Training data

[`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) — 1,785 expert-labeled chapter/section pairs from NC session
laws, 512 positive / 1,273 negative. Split 90/10 (stratified) for train/eval.
Class imbalance handled with inverse-frequency cross-entropy weights.
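
The card does not give the exact weighting formula, so the following is only a sketch of one common inverse-frequency convention, `total / (num_classes * class_count)`. It uses the full-dataset counts quoted above; because the split is stratified, the training-split ratio should be nearly the same. The function name `inverse_frequency_weights` is illustrative, not taken from the training script.

```python
# Class counts as reported in this card (0 = no_jim_crow, 1 = jim_crow).
class_counts = {0: 1273, 1: 512}


def inverse_frequency_weights(counts):
    """Return weight[c] = total / (num_classes * counts[c]) for each class c.

    This is an assumed convention; the actual training run may have used a
    different normalisation.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


weights = inverse_frequency_weights(class_counts)
for label, w in sorted(weights.items()):
    print(f"class {label}: weight {w:.4f}")
```

Under this convention the minority `jim_crow` class gets the larger weight. Weights like these are typically passed as the `weight` tensor of `torch.nn.CrossEntropyLoss`.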

## Training setup

| | |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Max sequence length | 1024 |
| Precision | bf16 |
| Loss | weighted cross-entropy |
| Seed | 42 |
| Hardware | 1× NVIDIA L4 (24 GB) via `hf jobs` |
| Train runtime | 223 s |
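
To put the warmup setting in context, the step count implied by the table can be estimated as below. This is a back-of-the-envelope sketch: it assumes no gradient accumulation and that the last partial batch of each epoch is kept, neither of which the card states.

```python
import math

total_examples = 1785  # dataset size from this card
eval_examples = 179    # held-out split size from this card
batch_size = 16
epochs = 4
warmup_steps = 50

# Assumptions (not stated in the card): no gradient accumulation, and the
# final partial batch of each epoch is not dropped.
train_examples = total_examples - eval_examples
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"train examples:  {train_examples}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"warmup fraction: {warmup_fraction:.1%}")
```

Under these assumptions, warmup covers roughly the first half of epoch 1.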

## Evaluation (held-out 10% split, n=179)

| Metric | Value |
|---|---|
| Accuracy | 0.9832 |
| F1 (positive class) | 0.9709 |
| Precision | 0.9615 |
| Recall | 0.9804 |
| F1 (macro) | 0.9796 |
| ROC-AUC | 0.9980 |
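
The card does not publish a confusion matrix. The sketch below shows how the metrics above relate to one, using counts (50 true positives, 2 false positives, 1 false negative, 126 true negatives) that were inferred from the reported values and n=179 rather than taken from the training logs.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    # For the negative class, the roles of the counts swap: tn acts as its
    # "true positives", fn as its false positives, fp as its false negatives.
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_pos,
        "f1_macro": (f1_pos + f1_neg) / 2,
    }


# Inferred counts (an assumption, not published in the card); they sum to 179.
metrics = binary_metrics(tp=50, fp=2, fn=1, tn=126)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

These counts reproduce the table to four decimal places, which suggests the final checkpoint made three errors on the held-out split.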

### Per-epoch results

| Epoch | Train loss | Val loss | Accuracy | F1 | Precision | Recall | ROC-AUC |
|------:|-----------:|---------:|---------:|----:|----------:|-------:|--------:|
| 1 | 0.0856 | 0.1061 | 0.9553 | 0.9273 | 0.8644 | 1.0000 | 0.9960 |
| 2 | 0.0353 | 0.0538 | 0.9777 | 0.9615 | 0.9434 | 0.9804 | 0.9989 |
| 3 | 0.0015 | 0.1310 | 0.9777 | 0.9600 | 0.9796 | 0.9412 | 0.9980 |
| 4 | 0.0019 | 0.0949 | **0.9832** | **0.9709** | 0.9615 | 0.9804 | 0.9980 |

## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="davanstrien/dhd-demo")
clf("All schools for the white and colored races shall be kept separate.")
```
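
The pipeline returns a list of dicts with `label` and `score` keys. For archival review it can help to convert that into a flag with an adjustable threshold. The helper below is a hypothetical sketch: the name `needs_review` is illustrative, the label strings depend on the model's `id2label` config (both `jim_crow` and the generic `LABEL_1` are accepted here as an assumption), and it assumes the two class scores sum to 1.

```python
# Label strings are an assumption; check the model config's id2label mapping.
POSITIVE_LABELS = {"jim_crow", "LABEL_1"}


def needs_review(prediction, threshold=0.5):
    """Return True if the positive-class score meets the threshold.

    `prediction` is one dict from the pipeline output, e.g.
    {"label": "jim_crow", "score": 0.98}. Assumes a two-class softmax,
    so the positive score is 1 - score when the top label is negative.
    """
    is_positive = prediction["label"] in POSITIVE_LABELS
    positive_score = prediction["score"] if is_positive else 1.0 - prediction["score"]
    return positive_score >= threshold


# Hand-written example outputs, not real model predictions.
examples = [
    {"label": "jim_crow", "score": 0.97},
    {"label": "no_jim_crow", "score": 0.62},
    {"label": "no_jim_crow", "score": 0.99},
]
for pred in examples:
    print(pred["label"], pred["score"], needs_review(pred, threshold=0.3))
```

Lowering the threshold trades precision for recall, which may suit workflows where a human reads every flagged section.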

## Limitations

- Trained on **North Carolina** laws, 1866–1967. Will not transfer cleanly to
  other jurisdictions or modern legal language.
- The training labels reflect what named expert sources / project staff
  flagged. The negative class is "not flagged," not "verified
  non-discriminatory."
- OCR noise from period scans is present in training and will be present at
  inference time on similar corpora.
- Eval set is small (n=179, roughly 51 positive examples), so a single
  misclassified positive moves recall by about 2 points; treat the high
  metrics as encouraging but bounded by sample size.
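
One way to quantify the sample-size caveat is a Wilson score interval on accuracy. The sketch below assumes 176 correct predictions out of 179, a count inferred from the reported accuracy of 0.9832 rather than stated in the card, and uses z = 1.96 for an approximate 95% interval.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 176 of 179 correct is inferred from accuracy 0.9832 (an assumption).
low, high = wilson_interval(176, 179)
print(f"accuracy 95% interval: {low:.3f} to {high:.3f}")
```

The interval spans several points of accuracy, which is the practical meaning of "bounded by sample size" here.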

See the [dataset card](https://huggingface.co/datasets/biglam/on_the_books) for full
context, including the *Algorithms of Resistance* framing of the original
**On the Books** project at UNC Chapel Hill Libraries.

## Citation

Please cite the original project:

> On the Books: Jim Crow and Algorithms of Resistance.
> University of North Carolina at Chapel Hill Libraries.
> https://onthebooks.lib.unc.edu — DOI: https://doi.org/10.17615/5c4g-sd44

## Framework versions

- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2