---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
datasets:
- biglam/on_the_books
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- legal
- glam
- digital-humanities
- jim-crow
- north-carolina
- legislation
- generated_from_trainer
metrics:
- f1
- accuracy
- roc_auc
model-index:
- name: dhd-demo
results:
- task:
type: text-classification
name: Text Classification
dataset:
name: biglam/on_the_books
type: biglam/on_the_books
split: train (held-out 10%)
metrics:
- type: accuracy
value: 0.9832
- type: f1
value: 0.9709
- type: precision
value: 0.9615
- type: recall
value: 0.9804
- type: f1_macro
value: 0.9796
- type: roc_auc
value: 0.9980
---
# dhd-demo: ModernBERT Jim Crow law classifier
Fine-tuned [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) on
[`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) to classify North
Carolina session-law sections (1866–1967) as Jim Crow laws or not.
Built as a live demo for the *Digital Humanities & Discovery* webinar
(2026-05-05), demonstrating end-to-end fine-tuning with `hf jobs`.
## Labels
- `0` = `no_jim_crow`
- `1` = `jim_crow`
## Training data
[`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) — 1,785 expert-labeled chapter/section pairs from NC session
laws, 512 positive / 1,273 negative. Split 90/10 (stratified) for train/eval.
Class imbalance handled with inverse-frequency cross-entropy weights.
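The inverse-frequency weights follow directly from the label counts above. A minimal sketch (the counts come from this card; the code itself is illustrative, not the actual training script):

```python
import torch

# Label counts from the card: 1,273 negative (no_jim_crow), 512 positive (jim_crow).
counts = {0: 1273, 1: 512}
total = sum(counts.values())

# Inverse-frequency weighting: weight = total / (n_classes * class_count),
# so the rarer positive class contributes more to the loss.
weights = torch.tensor([total / (len(counts) * counts[c]) for c in sorted(counts)])

loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```

With these counts the positive class ends up weighted roughly 2.5× the negative class.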
## Training setup
| Setting | Value |
|---|---|
| Base model | `answerdotai/ModernBERT-base` |
| Epochs | 4 |
| Batch size | 16 |
| Learning rate | 5e-5 |
| Warmup steps | 50 |
| Weight decay | 0.01 |
| Max sequence length | 1024 |
| Precision | bf16 |
| Loss | weighted cross-entropy |
| Seed | 42 |
| Hardware | 1× NVIDIA L4 (24 GB) via `hf jobs` |
| Train runtime | 223 s |
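The hyperparameters above map onto `transformers.TrainingArguments` roughly as follows. This is a hedged sketch, not the exact training script; `output_dir` and the tokenization call in the comment are assumptions:

```python
from transformers import TrainingArguments

# Mirrors the training-setup table; output_dir is illustrative.
args = TrainingArguments(
    output_dir="dhd-demo",
    num_train_epochs=4,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    warmup_steps=50,
    weight_decay=0.01,
    bf16=True,  # matches the L4 GPU used via `hf jobs`
    seed=42,
)

# The max sequence length (1024) is applied at tokenization time, e.g.:
# tokenizer(batch["text"], truncation=True, max_length=1024)
```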
## Evaluation (held-out 10% split, n=179)
| Metric | Value |
|---|---|
| Accuracy | 0.9832 |
| F1 (positive class) | 0.9709 |
| Precision | 0.9615 |
| Recall | 0.9804 |
| F1 (macro) | 0.9796 |
| ROC-AUC | 0.9980 |
### Per-epoch results
| Epoch | Train loss | Val loss | Accuracy | F1 | Precision | Recall | ROC-AUC |
|------:|-----------:|---------:|---------:|----:|----------:|-------:|--------:|
| 1 | 0.0856 | 0.1061 | 0.9553 | 0.9273 | 0.8644 | 1.0000 | 0.9960 |
| 2 | 0.0353 | 0.0538 | 0.9777 | 0.9615 | 0.9434 | 0.9804 | 0.9989 |
| 3 | 0.0015 | 0.1310 | 0.9777 | 0.9600 | 0.9796 | 0.9412 | 0.9980 |
| 4 | 0.0019 | 0.0949 | **0.9832** | **0.9709** | 0.9615 | 0.9804 | 0.9980 |
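Metrics like those in the tables above can be computed from predictions with scikit-learn. A sketch with toy labels standing in for the real 179-example eval set (not the actual eval run):

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Toy data: y_score is P(jim_crow) from the model; threshold at 0.5.
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_score = [0.1, 0.3, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7]
y_pred = [int(s >= 0.5) for s in y_score]

acc = accuracy_score(y_true, y_pred)
f1_pos = f1_score(y_true, y_pred)                      # positive-class F1
f1_macro = f1_score(y_true, y_pred, average="macro")   # unweighted mean over classes
auc = roc_auc_score(y_true, y_score)                   # uses scores, not hard labels
```

Note that ROC-AUC is computed from the continuous scores, while the other metrics use the thresholded predictions.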
## Usage
```python
from transformers import pipeline
clf = pipeline("text-classification", model="davanstrien/dhd-demo")
clf("All schools for the white and colored races shall be kept separate.")
```
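The pipeline returns the top label and its score. To see how raw logits map onto the two labels, a small sketch with hypothetical logits (the `id2label` mapping matches the Labels section above; a real run would take logits from the model's classification head):

```python
import torch

id2label = {0: "no_jim_crow", 1: "jim_crow"}

# Hypothetical logits for one section.
logits = torch.tensor([[-2.1, 3.4]])
probs = torch.softmax(logits, dim=-1)

pred = id2label[int(probs.argmax(dim=-1))]
score = float(probs.max())
```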
## Limitations
- Trained on **North Carolina** laws, 1866–1967. Will not transfer cleanly to
other jurisdictions or modern legal language.
- The training labels reflect what named expert sources / project staff
flagged. The negative class is "not flagged," not "verified
non-discriminatory."
- OCR noise from period scans is present in training and will be present at
inference time on similar corpora.
- Eval set is small (n=179); treat the high metrics as encouraging but
bounded by sample size.
See the [dataset card](https://huggingface.co/datasets/biglam/on_the_books) for full
context, including the *Algorithms of Resistance* framing of the original
**On the Books** project at UNC Chapel Hill Libraries.
## Citation
Please cite the original project:
> On the Books: Jim Crow and Algorithms of Resistance.
> University of North Carolina at Chapel Hill Libraries.
> https://onthebooks.lib.unc.edu — DOI: https://doi.org/10.17615/5c4g-sd44
## Framework versions
- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2