---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- text-classification
- modernbert
- legal
- glam
- jim-crow
- north-carolina
- history
- generated_from_trainer
datasets:
- biglam/on_the_books
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
model-index:
- name: jim-crow-laws-claude-code
  results:
  - task:
      type: text-classification
      name: Binary text classification
    dataset:
      name: biglam/on_the_books
      type: biglam/on_the_books
      split: train (held-out 20% stratified)
    metrics:
    - type: accuracy
      value: 0.9776
    - type: f1
      value: 0.9619
      name: F1 (jim_crow class)
    - type: precision
      value: 0.9352
      name: Precision (jim_crow class)
    - type: recall
      value: 0.9902
      name: Recall (jim_crow class)
    - type: roc_auc
      value: 0.9965
---
# jim-crow-laws-claude-code
A binary text classifier that flags whether a North Carolina session-law section
(1866–1967) is a **Jim Crow law**. Fine-tuned from
[`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
on [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books),
the labeled training set from UNC Chapel Hill Libraries' *On the Books: Jim Crow
and Algorithms of Resistance* project.
## Intended use
- Surface candidate Jim Crow laws within historical NC session-law corpora to
support archival, library, and digital-humanities work.
- Reproduce / extend the *On the Books* methodology on related corpora.
- Teaching: ML-for-cultural-heritage, computational legal history, OCR-tolerant
text classification.
The original *On the Books* project trained a classifier on this data and ran it
over the **full corpus**, which spans roughly a century of session laws. This
model revisits that approach with a modern long-context encoder (ModernBERT) and
is intended to be applied the same way: as a *retrieval / triage* tool whose
flagged outputs are then reviewed by domain experts.
## Out-of-scope / limitations
- **Jurisdiction:** trained on **North Carolina** session laws only. Patterns
will not transfer cleanly to other states without adaptation.
- **Period:** 1866–1967 legal language. Modern statutes differ substantially.
- **OCR noise:** training texts contain OCR errors typical of scanned period
documents; expect degraded performance on cleaner text or on text produced by
a different OCR pipeline.
- **Label scope:** the negative class means *"not flagged by the project's
labeling process"* — laws with discriminatory effect that the source
compilations did not catalogue may be present in the negatives. Treat model
predictions as candidates for review, not ground truth.
- **Class imbalance:** training data is ~29% positive; the model was trained
with inverse-frequency class weights to compensate.
Per the dataset's authors, the texts include slurs and dehumanising language
present in the historical record. Downstream users should preserve the
project's framing and not strip the historical context.
## How to use
```python
from transformers import pipeline

# Load the fine-tuned classifier from the Hub
clf = pipeline(
    "text-classification",
    model="davanstrien/jim-crow-laws-claude-code",
)

text = "..."  # text of a single law section
print(clf(text))
# Output has this shape (label and score depend on the input):
# [{'label': 'jim_crow', 'score': 0.99}]
```
Labels: `no_jim_crow` (0) and `jim_crow` (1).
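
Since the model is meant for retrieval and triage, one way to use the predictions is to turn them into a review queue ranked by the probability of the `jim_crow` class. The sketch below is illustrative only: `triage`, `jim_crow_probability`, and the stand-in `fake_classify` (with made-up scores) are hypothetical helpers, not part of this repository, and the 0.5 threshold is an arbitrary assumption.

```python
from typing import Callable, Iterable


def jim_crow_probability(pred: dict) -> float:
    """Convert a top-1 binary prediction into P(jim_crow)."""
    score = float(pred["score"])
    return score if pred["label"] == "jim_crow" else 1.0 - score


def triage(sections: Iterable[str], classify: Callable, threshold: float = 0.5):
    """Return (probability, text) pairs at or above threshold, highest first."""
    texts = list(sections)
    preds = classify(texts)  # expected: one {'label', 'score'} dict per input
    scored = [(jim_crow_probability(p), t) for p, t in zip(preds, texts)]
    flagged = [pair for pair in scored if pair[0] >= threshold]
    return sorted(flagged, key=lambda pair: pair[0], reverse=True)


def fake_classify(texts):
    """Stand-in for the real pipeline; the scores are invented for illustration."""
    canned = [
        {"label": "jim_crow", "score": 0.97},
        {"label": "no_jim_crow", "score": 0.88},
        {"label": "jim_crow", "score": 0.61},
    ]
    return canned[: len(texts)]


queue = triage(["section A", "section B", "section C"], fake_classify)
for prob, text in queue:
    print(f"{prob:.2f}  {text}")
```

With the real model, `classify` could be `lambda texts: clf(texts, truncation=True, max_length=1024)`, where `clf` is the pipeline from the snippet above; this assumes the pipeline forwards those tokenizer arguments. Everything in the queue is a candidate for expert review, not a final label.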
## Training data
- **Dataset:** [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books)
(1,785 rows; single `train` split).
- **Input field used:** `section_text` (the OCR text of the labeled section).
`chapter_text` and `source` were ignored — `source` would leak the label
(`paschal` is 100% positive, `murray` is 92% positive).
- **Split:** stratified 80/20 train/eval split (seed 42) — 1,428 train / 357
eval, preserving the ~29% positive rate in both.
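
A stratified split of this kind can be sketched in plain Python as follows. This is not necessarily how the split was produced for this model; `stratified_split` is a hypothetical helper, the labels are synthetic, and the total of 512 positives is an assumption chosen to be consistent with the ~29% positive rate.

```python
import random
from collections import defaultdict


def stratified_split(labels, eval_fraction=0.2, seed=42):
    """Split row indices so each class keeps about the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, eval_idx = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # shuffle within the class only
        n_eval = round(len(idxs) * eval_fraction)
        eval_idx.extend(idxs[:n_eval])
        train_idx.extend(idxs[n_eval:])
    return sorted(train_idx), sorted(eval_idx)


# Synthetic labels: 1,785 rows, of which 512 are assumed positive (~29%)
labels = [1] * 512 + [0] * (1785 - 512)
train_idx, eval_idx = stratified_split(labels)
eval_positives = sum(labels[i] for i in eval_idx)
print(len(train_idx), len(eval_idx), eval_positives)
```

Under these assumed counts the sketch yields 1,428 training rows and 357 eval rows, with 102 positives in the eval part. In practice, library routines such as `Dataset.train_test_split(..., stratify_by_column=...)` in `datasets` do the same job.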
## Training procedure
- **Base model:** `answerdotai/ModernBERT-base` (~150M params, 8K context).
- **Max sequence length:** 1024 tokens, which covers roughly the 95th
percentile of `section_text` token lengths; longer sections are truncated.
- **Loss:** cross-entropy with **inverse-frequency class weights** computed
from the training split (`[0.701, 1.741]`) to handle class imbalance.
- **Hardware:** trained on a single L4 GPU via `hf jobs uv run`.
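
The card does not spell out the weighting formula. A common inverse-frequency scheme, `n_samples / (n_classes * class_count)`, reproduces the reported weights if the training split holds 1,018 negatives and 410 positives. Those counts are an assumption back-computed from the weights and the split size, not figures stated here, and `inverse_frequency_weights` is a hypothetical helper.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * count) for count in class_counts]


# Assumed training-split counts, ordered [no_jim_crow, jim_crow]
train_counts = [1018, 410]
weights = inverse_frequency_weights(train_counts)
print([round(w, 3) for w in weights])
```

In PyTorch, weights like these can be passed as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))`; whether the training script did exactly this is not stated in the card.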
### Hyperparameters
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused), β=(0.9, 0.999), ε=1e-8 |
| Learning rate | 3e-5 |
| LR schedule | Linear with 10% warmup |
| Weight decay | 0.01 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Epochs | 5 |
| Precision | bf16 |
| Seed | 42 |
| Best-model selection | F1 on `jim_crow` class |
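
To make the learning-rate schedule concrete, the sketch below computes a linear schedule with warmup, assuming 450 total optimizer steps (the final step in the per-epoch table) and a warmup of 10% of those steps, i.e. 45. `linear_warmup_lr` is a hypothetical helper that mirrors the usual linear-with-warmup shape; the exact warmup rounding used in training is an assumption.

```python
def linear_warmup_lr(step, base_lr=3e-5, total_steps=450, warmup_ratio=0.10):
    """Learning rate at a given step: linear ramp up, then linear decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp from 0 to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Decay from base_lr at the end of warmup to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 45, 90, 270, 450):
    print(step, f"{linear_warmup_lr(step):.2e}")
```

Under these assumptions the peak rate of 3e-5 is reached at step 45, halfway through the first epoch, and the rate falls to zero at step 450.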
### Training results
Best checkpoint selected by `f1_jim_crow` on the held-out eval split (epoch 3):
| Metric | Value |
|---|---|
| Accuracy | 0.9776 |
| Precision (jim_crow) | 0.9352 |
| Recall (jim_crow) | 0.9902 |
| F1 (jim_crow) | 0.9619 |
| F1 (macro) | 0.9730 |
| ROC AUC | 0.9965 |
Per-epoch eval:
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision (jim_crow) | Recall (jim_crow) | F1 (jim_crow) | F1 (macro) | ROC AUC |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.2893 | 1 | 90 | 0.1920 | 0.9524 | 0.8972 | 0.9412 | 0.9187 | 0.9425 | 0.9913 |
| 0.0716 | 2 | 180 | 0.0793 | 0.9776 | 0.9519 | 0.9706 | 0.9612 | 0.9727 | 0.9971 |
| 0.1101 | 3 | 270 | 0.1205 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9965 |
| 0.0027 | 4 | 360 | 0.1251 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9958 |
| 0.0001 | 5 | 450 | 0.1231 | 0.9748 | 0.9346 | 0.9804 | 0.9569 | 0.9696 | 0.9960 |
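Held-out eval is small (357 rows; 102 positive), so a single example shifts
accuracy by about 0.003 and `jim_crow` recall by about 0.01. Treat small
differences between checkpoints, and certainly anything in the fourth decimal,
as noise.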
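
As a sanity check on the reported numbers, the metrics can be recomputed from a confusion matrix. The counts below (101 true positives, 7 false positives, 1 false negative, 248 true negatives) are an assumption inferred from the epoch-3 row and the 357/102 eval sizes, not figures published in this card; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, positive-class precision/recall/F1, and macro F1 from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    # F1 for the negative class: swap the roles of the two classes
    f1_neg = 2 * tn / (2 * tn + fn + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1_pos,
        "f1_macro": (f1_pos + f1_neg) / 2,
    }


# Assumed counts for the selected checkpoint (102 positives, 255 negatives)
metrics = binary_metrics(tp=101, fp=7, fn=1, tn=248)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

# How much one eval example moves accuracy and positive-class recall
print(f"one example = {1 / 357:.4f} accuracy, {1 / 102:.4f} recall")
```

Under these assumed counts, the difference between the selected checkpoint and the epoch-2 checkpoint comes down to a handful of individual predictions.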
## Citation
Please cite the original *On the Books* project for the data and methodology:
```
On the Books: Jim Crow and Algorithms of Resistance.
University of North Carolina at Chapel Hill Libraries.
https://onthebooks.lib.unc.edu
DOI: https://doi.org/10.17615/5c4g-sd44
```
### Framework versions
- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2