	---
	license: apache-2.0
	base_model: answerdotai/ModernBERT-base
	tags:
	- text-classification
	- modernbert
	- legal
	- glam
	- jim-crow
	- north-carolina
	- history
	- generated_from_trainer
	datasets:
	- biglam/on_the_books
	language:
	- en
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	- roc_auc
	model-index:
	- name: jim-crow-laws-claude-code
	  results:
	  - task:
	      type: text-classification
	      name: Binary text classification
	    dataset:
	      name: biglam/on_the_books
	      type: biglam/on_the_books
	      split: train (held-out 20% stratified)
	    metrics:
	    - type: accuracy
	      value: 0.9776
	    - type: f1
	      value: 0.9619
	      name: F1 (jim_crow class)
	    - type: precision
	      value: 0.9352
	      name: Precision (jim_crow class)
	    - type: recall
	      value: 0.9902
	      name: Recall (jim_crow class)
	    - type: roc_auc
	      value: 0.9965
	---

	# jim-crow-laws-claude-code

	A binary text classifier that flags whether a North Carolina session-law section
	(1866–1967) is a Jim Crow law. Fine-tuned from
	[`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
	on [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books),
	the labeled training set from UNC Chapel Hill Libraries' *On the Books: Jim Crow
	and Algorithms of Resistance* project.

	## Intended use

	- Surface candidate Jim Crow laws within historical NC session-law corpora to
	support archival, library, and digital-humanities work.
	- Reproduce / extend the On the Books methodology on related corpora.
	- Teaching: ML-for-cultural-heritage, computational legal history, OCR-tolerant
	text classification.

	The original On the Books project trained a classifier on this data and ran it
	over its full corpus, which spans roughly a century of session laws. This model
	retrains that approach with a modern long-context encoder (ModernBERT) and is
	intended to be applied the same way: as a retrieval / triage tool whose flagged
	outputs are then reviewed by domain experts.

	## Out-of-scope / limitations

	- Jurisdiction: trained on North Carolina session laws only. Patterns
	will not transfer cleanly to other states without adaptation.
	- Period: 1866–1967 legal language. Modern statutes differ substantially.
	- OCR noise: training texts contain OCR errors from digitized period print;
	expect degraded performance on cleaner or differently OCR'd inputs.
	- Label scope: the negative class means *"not flagged by the project's
	labeling process"* — laws with discriminatory effect that the source
	compilations did not catalogue may be present in the negatives. Treat model
	predictions as candidates for review, not ground truth.
	- Class imbalance: training data is ~29% positive; trained with
	inverse-frequency class weights to compensate.

	Per the dataset's authors, the texts include slurs and dehumanising language
	present in the historical record. Downstream users should preserve the
	project's framing and not strip the historical context.

	## How to use

	```python
	from transformers import pipeline

	clf = pipeline(
	    "text-classification",
	    model="davanstrien/jim-crow-laws-claude-code",
	)

	text = "..."  # text of a single law section
	print(clf(text))
	# Example output shape: [{'label': 'jim_crow', 'score': 0.99}]
	```

	Labels: `no_jim_crow` (0) and `jim_crow` (1).
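
Because the model is meant for triage, its outputs can be turned into a ranked review queue. The following is a minimal sketch, assuming top-1 pipeline results of the shape shown above; `review_queue`, the threshold value, and the mock scores are hypothetical and not part of this model's tooling.

```python
def review_queue(texts, predictions, threshold=0.5):
    """Return (score, text) pairs predicted `jim_crow`, highest score first."""
    flagged = [
        (pred["score"], text)
        for text, pred in zip(texts, predictions)
        if pred["label"] == "jim_crow" and pred["score"] >= threshold
    ]
    return sorted(flagged, key=lambda pair: pair[0], reverse=True)


# Stand-in predictions with the same shape as the pipeline's top-1 results.
# The scores are invented for illustration; they are not real model outputs.
sections = ["section A", "section B", "section C"]
mock_predictions = [
    {"label": "jim_crow", "score": 0.97},
    {"label": "no_jim_crow", "score": 0.97},
    {"label": "jim_crow", "score": 0.61},
]

queue = review_queue(sections, mock_predictions, threshold=0.9)
print(queue)
```

A lower threshold sends more sections to reviewers and a higher one sends fewer; the right value depends on how much expert review time is available.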

	## Training data

	- Dataset: [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books)
	(1,785 rows; single `train` split).
	- Input field used: `section_text` (the OCR text of the labeled section).
	`chapter_text` and `source` were ignored — `source` would leak the label
	(`paschal` is 100% positive, `murray` is 92% positive).
	- Split: stratified 80/20 train/eval split (seed 42) — 1,428 train / 357
	eval, preserving the ~29% positive rate in both.
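
The split sizes above can be checked with a small standard-library sketch of a stratified split. This card does not say which tool produced the actual split, so the code will not reproduce the same row assignment; the class counts (512 positive, 1,273 negative) are an assumption inferred from the 1,785 rows and the ~29% positive rate, and `stratified_split` is a hypothetical helper.

```python
import random


def stratified_split(labels, eval_fraction=0.2, seed=42):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    train_idx, eval_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_eval = round(len(idx) * eval_fraction)  # per-class eval share
        eval_idx.extend(idx[:n_eval])
        train_idx.extend(idx[n_eval:])
    return sorted(train_idx), sorted(eval_idx)


# Assumed class counts: 512 positive + 1,273 negative = 1,785 rows.
labels = [1] * 512 + [0] * 1273
train_idx, eval_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
eval_pos = sum(labels[i] for i in eval_idx)
print(len(train_idx), len(eval_idx), train_pos, eval_pos)
```

Under these assumed counts the sketch yields 1,428 train and 357 eval rows, with about 29% positives in each part.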

	## Training procedure

	- Base model: `answerdotai/ModernBERT-base` (~150M params, 8K context).
	- Max sequence length: 1024 tokens (covers roughly the 95th percentile of
	`section_text` token lengths; longer sections are truncated).
	- Loss: cross-entropy with inverse-frequency class weights computed
	from the training split (`[0.701, 1.741]`) to handle class imbalance.
	- Hardware: trained on a single L4 GPU via `hf jobs uv run`.
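
The reported class weights are consistent with the common "balanced" formula, weight = n_samples / (n_classes × class_count). A minimal sketch follows; the training-split counts (1,018 negative, 410 positive) are an assumption inferred from the 1,428 training rows and the reported weights, not figures stated in this card.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * count) for count in class_counts]


# Assumed training-split counts: 1,018 `no_jim_crow` and 410 `jim_crow` rows.
weights = inverse_frequency_weights([1018, 410])
print([round(w, 3) for w in weights])  # [0.701, 1.741]
```

Weights like these are typically passed to the loss, for example as the `weight` argument of `torch.nn.CrossEntropyLoss`; whether the training script did exactly that is not stated here.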

	### Hyperparameters

	| Hyperparameter | Value |
	|---|---|
	| Optimizer | AdamW (fused), β=(0.9, 0.999), ε=1e-8 |
	| Learning rate | 3e-5 |
	| LR schedule | Linear with 10% warmup |
	| Weight decay | 0.01 |
	| Train batch size | 16 |
	| Eval batch size | 32 |
	| Epochs | 5 |
	| Precision | bf16 |
	| Seed | 42 |
	| Best-model selection | F1 on `jim_crow` class |
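
To make the schedule concrete, here is a sketch of a linear schedule with warmup at the settings above. It assumes 450 total optimizer steps (the final step in the per-epoch results) and 45 warmup steps (10% of 450); `linear_warmup_lr` is a hypothetical helper, not the trainer's own code.

```python
def linear_warmup_lr(step, base_lr=3e-5, total_steps=450, warmup_steps=45):
    """Learning rate at `step`: linear ramp up, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Start of training, end of warmup (peak), epoch-3 checkpoint, final step.
for step in (0, 45, 270, 450):
    print(step, linear_warmup_lr(step))
```

Under these assumptions the rate peaks at 3e-5 at step 45 and has decayed to a little under half of that by the selected epoch-3 checkpoint (step 270).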

	### Training results

	Best checkpoint selected by `f1_jim_crow` on the held-out eval split (epoch 3):

	| Metric | Value |
	|---|---|
	| Accuracy | 0.9776 |
	| Precision (jim_crow) | 0.9352 |
	| Recall (jim_crow) | 0.9902 |
	| F1 (jim_crow) | 0.9619 |
	| F1 (macro) | 0.9730 |
	| ROC AUC | 0.9965 |

	Per-epoch eval:

	| Training Loss | Epoch | Step | Val Loss | Accuracy | Precision (jim_crow) | Recall (jim_crow) | F1 (jim_crow) | F1 macro | ROC AUC |
	|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
	| 0.2893 | 1 | 90 | 0.1920 | 0.9524 | 0.8972 | 0.9412 | 0.9187 | 0.9425 | 0.9913 |
	| 0.0716 | 2 | 180 | 0.0793 | 0.9776 | 0.9519 | 0.9706 | 0.9612 | 0.9727 | 0.9971 |
	| 0.1101 | 3 | 270 | 0.1205 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9965 |
	| 0.0027 | 4 | 360 | 0.1251 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9958 |
	| 0.0001 | 5 | 450 | 0.1231 | 0.9748 | 0.9346 | 0.9804 | 0.9569 | 0.9696 | 0.9960 |

	Held-out eval is small (357 rows; 102 positive), so a single changed
	prediction moves accuracy by about 0.003 and `jim_crow` recall by about 0.01.
	Treat small differences between epochs as noise.
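
As a sanity check, the epoch-3 metrics can be recomputed from confusion-matrix counts. The counts below (101 true positives, 1 false negative, 7 false positives, 248 true negatives) are not reported in this card; they are inferred as the integer counts consistent with 357 eval rows, 102 positives, and the reported precision and recall.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute the card's headline metrics from confusion-matrix counts."""
    f1_positive = 2 * tp / (2 * tp + fp + fn)
    # F1 with `no_jim_crow` as the target class (roles of tp and tn swapped).
    f1_negative = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": f1_positive,
        "f1_macro": (f1_positive + f1_negative) / 2,
    }


# Inferred epoch-3 counts: 101 of 102 positives found, 7 of 255 negatives flagged.
metrics = binary_metrics(tp=101, fp=7, fn=1, tn=248)
print({name: round(value, 4) for name, value in metrics.items()})
```

Rounded to four decimals these match the table above, which also shows how coarse the metrics are: each value is a ratio of small integers.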

	## Citation

	Please cite the original On the Books project for the data and methodology:

	```
	On the Books: Jim Crow and Algorithms of Resistance.
	University of North Carolina at Chapel Hill Libraries.
	https://onthebooks.lib.unc.edu
	DOI: https://doi.org/10.17615/5c4g-sd44
	```

	### Framework versions

	- Transformers 5.7.0
	- PyTorch 2.11.0+cu130
	- Datasets 4.8.5
	- Tokenizers 0.22.2