---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- text-classification
- modernbert
- legal
- glam
- jim-crow
- north-carolina
- history
- generated_from_trainer
datasets:
- biglam/on_the_books
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
model-index:
- name: jim-crow-laws-claude-code
  results:
  - task:
      type: text-classification
      name: Binary text classification
    dataset:
      name: biglam/on_the_books
      type: biglam/on_the_books
      split: train (held-out 20% stratified)
    metrics:
    - type: accuracy
      value: 0.9776
    - type: f1
      value: 0.9619
      name: F1 (jim_crow class)
    - type: precision
      value: 0.9352
      name: Precision (jim_crow class)
    - type: recall
      value: 0.9902
      name: Recall (jim_crow class)
    - type: roc_auc
      value: 0.9965
---

# jim-crow-laws-claude-code

A binary text classifier that flags whether a North Carolina session-law section (1866–1967) is a **Jim Crow law**. Fine-tuned from [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base) on [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books), the labeled training set from UNC Chapel Hill Libraries' *On the Books: Jim Crow and Algorithms of Resistance* project.

## Intended use

- Surface candidate Jim Crow laws within historical NC session-law corpora to support archival, library, and digital-humanities work.
- Reproduce or extend the *On the Books* methodology on related corpora.
- Teaching: ML-for-cultural-heritage, computational legal history, OCR-tolerant text classification.

The original *On the Books* project trained a classifier on this data and ran it over the **full corpus**, which spans roughly a century. This model re-trains that idea with a modern long-context encoder (ModernBERT) and is intended to be applied the same way: as a *retrieval / triage* tool whose flagged outputs are then reviewed by domain experts.

## Out-of-scope / limitations

- **Jurisdiction:** trained on **North Carolina** session laws only.
  Patterns will not transfer cleanly to other states without adaptation.
- **Period:** 1866–1967 legal language. Modern statutes differ substantially.
- **OCR noise:** training texts contain period-OCR errors; expect degraded performance on cleaner or differently-OCR'd inputs.
- **Label scope:** the negative class means *"not flagged by the project's labeling process"*. Laws with discriminatory effect that the source compilations did not catalogue may be present in the negatives. Treat model predictions as candidates for review, not ground truth.
- **Class imbalance:** training data is ~29% positive; trained with inverse-frequency class weights to compensate.

Per the dataset's authors, the texts include slurs and dehumanising language present in the historical record. Downstream users should preserve the project's framing and not strip the historical context.

## How to use

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="davanstrien/jim-crow-laws-claude-code",
)

text = "..."  # text of a single law section

# Truncate long sections to the 1024-token length used in training.
print(clf(text, truncation=True, max_length=1024))
# e.g. [{'label': 'jim_crow', 'score': 0.99}]
```

Labels: `no_jim_crow` (0) and `jim_crow` (1).

## Training data

- **Dataset:** [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books) (1,785 rows; single `train` split).
- **Input field used:** `section_text` (the OCR text of the labeled section). `chapter_text` and `source` were ignored, because `source` would leak the label (`paschal` is 100% positive, `murray` is 92% positive).
- **Split:** stratified 80/20 train/eval split (seed 42), giving 1,428 train / 357 eval rows and preserving the ~29% positive rate in both.

## Training procedure

- **Base model:** `answerdotai/ModernBERT-base` (~150M params, 8K context).
- **Max sequence length:** 1024 tokens (covers roughly the 95th percentile of `section_text` token lengths; longer sections are truncated).
- **Loss:** cross-entropy with **inverse-frequency class weights** computed from the training split (`[0.701, 1.741]`) to handle class imbalance.
- **Hardware:** trained on a single L4 GPU via `hf jobs uv run`.

### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused), β=(0.9, 0.999), ε=1e-8 |
| Learning rate | 3e-5 |
| LR schedule | Linear with 10% warmup |
| Weight decay | 0.01 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Epochs | 5 |
| Precision | bf16 |
| Seed | 42 |
| Best-model selection | F1 on `jim_crow` class |

### Training results

Best checkpoint selected by `f1_jim_crow` on the held-out eval split (epoch 3):

| Metric | Value |
|---|---|
| Accuracy | 0.9776 |
| Precision (jim_crow) | 0.9352 |
| Recall (jim_crow) | 0.9902 |
| F1 (jim_crow) | 0.9619 |
| F1 (macro) | 0.9730 |
| ROC AUC | 0.9965 |

Per-epoch eval:

| Training Loss | Epoch | Step | Val Loss | Accuracy | Precision (jim_crow) | Recall (jim_crow) | F1 (jim_crow) | F1 macro | ROC AUC |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.2893 | 1 | 90 | 0.1920 | 0.9524 | 0.8972 | 0.9412 | 0.9187 | 0.9425 | 0.9913 |
| 0.0716 | 2 | 180 | 0.0793 | 0.9776 | 0.9519 | 0.9706 | 0.9612 | 0.9727 | 0.9971 |
| 0.1101 | 3 | 270 | 0.1205 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9965 |
| 0.0027 | 4 | 360 | 0.1251 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9958 |
| 0.0001 | 5 | 450 | 0.1231 | 0.9748 | 0.9346 | 0.9804 | 0.9569 | 0.9696 | 0.9960 |

Held-out eval is small (357 rows; 102 positive). Treat differences in the fourth decimal as noise.

## Citation

Please cite the original *On the Books* project for the data and methodology:

```
On the Books: Jim Crow and Algorithms of Resistance.
University of North Carolina at Chapel Hill Libraries.
https://onthebooks.lib.unc.edu
DOI: https://doi.org/10.17615/5c4g-sd44
```

### Framework versions

- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2
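
### Class-weight sketch

The inverse-frequency class weights mentioned under Training procedure can be illustrated with a small, dependency-free sketch. The card does not state the exact formula or the per-class counts, so both are assumptions here: the formula is the common "balanced" heuristic `n_samples / (n_classes * class_count)`, and the counts (1,018 negative, 410 positive) are inferred from the 1,428-row training split and its ~29% positive rate. The helper name `inverse_frequency_weights` is hypothetical and is not taken from the training script.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, ordered by class id.

    Uses the "balanced" heuristic: n_samples / (n_classes * class_count).
    This formula is an assumption; the model card only reports the
    resulting weights.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return [n_samples / (n_classes * counts[c]) for c in sorted(counts)]


# Assumed training-split label counts (not stated in the card):
# 1,018 `no_jim_crow` (0) and 410 `jim_crow` (1) out of 1,428 rows.
labels = [0] * 1018 + [1] * 410

weights = inverse_frequency_weights(labels)
print([round(w, 3) for w in weights])  # [0.701, 1.741]
```

Under these assumptions the result matches the reported `[0.701, 1.741]`. In a PyTorch training loop, weights like these would typically be passed as a tensor to `torch.nn.CrossEntropyLoss(weight=...)`; whether the original run did exactly that is not stated in the card.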