---
license: apache-2.0
base_model: answerdotai/ModernBERT-base
tags:
- text-classification
- modernbert
- legal
- glam
- jim-crow
- north-carolina
- history
- generated_from_trainer
datasets:
- biglam/on_the_books
language:
- en
metrics:
- accuracy
- f1
- precision
- recall
- roc_auc
model-index:
- name: jim-crow-laws-claude-code
  results:
  - task:
      type: text-classification
      name: Binary text classification
    dataset:
      name: biglam/on_the_books
      type: biglam/on_the_books
      split: train (held-out 20% stratified)
    metrics:
    - type: accuracy
      value: 0.9776
    - type: f1
      value: 0.9619
      name: F1 (jim_crow class)
    - type: precision
      value: 0.9352
      name: Precision (jim_crow class)
    - type: recall
      value: 0.9902
      name: Recall (jim_crow class)
    - type: roc_auc
      value: 0.9965
---

# jim-crow-laws-claude-code

A binary text classifier that flags whether a North Carolina session-law section
(1866–1967) is a **Jim Crow law**. Fine-tuned from
[`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
on [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books),
the labeled training set from UNC Chapel Hill Libraries' *On the Books: Jim Crow
and Algorithms of Resistance* project.

## Intended use

- Surface candidate Jim Crow laws within historical NC session-law corpora to
  support archival, library, and digital-humanities work.
- Reproduce or extend the *On the Books* methodology on related corpora.
- Teaching: ML for cultural heritage, computational legal history, OCR-tolerant
  text classification.

The original *On the Books* project trained a classifier on this data and ran it
over the **full corpus**, which spans roughly a century. This model re-trains
that idea with a modern long-context encoder (ModernBERT) and is intended to be
applied the same way: as a *retrieval / triage* tool whose flagged outputs are
then reviewed by domain experts.

## Out-of-scope / limitations

- **Jurisdiction:** trained on **North Carolina** session laws only. Patterns
  will not transfer cleanly to other states without adaptation.
- **Period:** 1866–1967 legal language. Modern statutes differ substantially.
- **OCR noise:** training texts contain OCR errors typical of historical print;
  expect degraded performance on cleaner or differently OCR'd inputs.
- **Label scope:** the negative class means *"not flagged by the project's
  labeling process"*. Laws with discriminatory effect that the source
  compilations did not catalogue may be present in the negatives. Treat model
  predictions as candidates for review, not ground truth.
- **Class imbalance:** training data is ~29% positive; the model was trained
  with inverse-frequency class weights to compensate.

Per the dataset's authors, the texts include slurs and dehumanising language
present in the historical record. Downstream users should preserve the
project's framing and not strip the historical context.

## How to use

```python
from transformers import pipeline

clf = pipeline(
    "text-classification",
    model="davanstrien/jim-crow-laws-claude-code",
)

text = "..."  # text of a single law section
print(clf(text))
# e.g. [{'label': 'jim_crow', 'score': 0.99}]
```

Labels: `no_jim_crow` (0) and `jim_crow` (1).
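
Because the model is meant for triage, it can help to request scores for both labels and sort sections by their `jim_crow` score before handing them to reviewers. The sketch below is a minimal, hypothetical helper: `rank_for_review`, `jim_crow_score`, the section ids, and the 0.5 threshold are illustrative and not part of the project. It assumes the pipeline is called on a list of texts with `top_k=None`, which in recent Transformers releases returns one list of `{'label', 'score'}` dicts per input. The scores shown are invented placeholders so the example runs offline.

```python
# Hypothetical triage helper; names and threshold are illustrative assumptions.

def jim_crow_score(label_scores):
    """Return the 'jim_crow' score from one pipeline result (a list of dicts)."""
    for item in label_scores:
        if item["label"] == "jim_crow":
            return item["score"]
    raise ValueError("no 'jim_crow' label in pipeline output")


def rank_for_review(section_ids, outputs, threshold=0.5):
    """Keep sections scoring at or above the threshold, highest score first."""
    scored = [(sid, jim_crow_score(out)) for sid, out in zip(section_ids, outputs)]
    flagged = [pair for pair in scored if pair[1] >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# With the real model, `outputs` would come from something like:
#   outputs = clf(texts, top_k=None, truncation=True, max_length=1024)
# The values below are invented placeholders, not model predictions.
section_ids = ["sec-a", "sec-b", "sec-c"]
outputs = [
    [{"label": "jim_crow", "score": 0.97}, {"label": "no_jim_crow", "score": 0.03}],
    [{"label": "no_jim_crow", "score": 0.88}, {"label": "jim_crow", "score": 0.12}],
    [{"label": "jim_crow", "score": 0.61}, {"label": "no_jim_crow", "score": 0.39}],
]

for sid, score in rank_for_review(section_ids, outputs):
    print(f"{sid}\t{score:.2f}")
```

Lowering the threshold trades reviewer time for recall, which suits a tool whose output is checked by domain experts.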

## Training data

- **Dataset:** [`biglam/on_the_books`](https://huggingface.co/datasets/biglam/on_the_books)
  (1,785 rows; single `train` split).
- **Input field used:** `section_text` (the OCR text of the labeled section).
  `chapter_text` and `source` were ignored; `source` would leak the label
  (`paschal` is 100% positive, `murray` is 92% positive).
- **Split:** stratified 80/20 train/eval split (seed 42): 1,428 train / 357
  eval, preserving the ~29% positive rate in both.
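
A stratified split samples each class separately so that both subsets keep the same positive rate. The following is a standard-library sketch of that idea, not the code used to train this model; the actual split was presumably made with a library routine, and its row assignment will differ even with the same seed. The class totals of 512 positive and 1,273 negative rows are an assumption inferred from the reported split sizes and class weights, not figures stated by the dataset.

```python
import random


def stratified_split(labels, eval_fraction=0.2, seed=42):
    """Split row indices into train/eval, sampling each class separately."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, eval_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_eval = round(len(indices) * eval_fraction)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return sorted(train_idx), sorted(eval_idx)


# Assumed class totals (inferred, see above): 1,273 negative + 512 positive = 1,785.
labels = [0] * 1273 + [1] * 512
train_idx, eval_idx = stratified_split(labels)

eval_pos = sum(labels[i] for i in eval_idx)
print(len(train_idx), len(eval_idx), eval_pos)  # 1428 357 102
```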

## Training procedure

- **Base model:** `answerdotai/ModernBERT-base` (~150M params, 8K context).
- **Max sequence length:** 1024 tokens (covers ~95th percentile of
  `section_text` token lengths; long-tail truncated).
- **Loss:** cross-entropy with **inverse-frequency class weights** computed
  from the training split (`[0.701, 1.741]`) to handle class imbalance.
- **Hardware:** trained on a single L4 GPU via `hf jobs uv run`.
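
Inverse-frequency ("balanced") class weights are commonly computed as `n_samples / (n_classes * class_count)`. The sketch below applies that formula to training-split counts of 1,018 negative and 410 positive rows. Those counts are an assumption, back-calculated from the reported weights and the 1,428-row training split, and the formula is the common convention rather than a confirmed detail of the training script.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by n_samples / (n_classes * count)."""
    n_samples = sum(class_counts)
    n_classes = len(class_counts)
    return [n_samples / (n_classes * count) for count in class_counts]


# Assumed training-split counts, ordered [no_jim_crow, jim_crow].
train_counts = [1018, 410]
weights = inverse_frequency_weights(train_counts)
print([round(w, 3) for w in weights])  # [0.701, 1.741]
```

In PyTorch, weights like these would typically be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.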

### Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused), β=(0.9, 0.999), ε=1e-8 |
| Learning rate | 3e-5 |
| LR schedule | Linear with 10% warmup |
| Weight decay | 0.01 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Epochs | 5 |
| Precision | bf16 |
| Seed | 42 |
| Best-model selection | F1 on `jim_crow` class |
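
To make the schedule concrete: with 1,428 training rows and a batch size of 16 there are 90 optimizer steps per epoch, so 450 steps over 5 epochs, of which 10% (45 steps) are warmup. The sketch below reproduces the usual linear-warmup, linear-decay rule in plain Python. It assumes a single device and no gradient accumulation, which the card does not state explicitly, and `linear_lr` is an illustrative name rather than a library function.

```python
import math

TRAIN_ROWS = 1428
BATCH_SIZE = 16
EPOCHS = 5
PEAK_LR = 3e-5
WARMUP_RATIO = 0.10

steps_per_epoch = math.ceil(TRAIN_ROWS / BATCH_SIZE)  # 90
total_steps = steps_per_epoch * EPOCHS                # 450
warmup_steps = round(total_steps * WARMUP_RATIO)      # 45


def linear_lr(step):
    """Learning rate at an optimizer step: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


for step in (0, 45, 270, 450):
    print(step, f"{linear_lr(step):.2e}")
```

The peak rate is reached at step 45, halfway through the first epoch, and the rate falls to zero at the final step.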

### Training results

Best checkpoint selected by `f1_jim_crow` on the held-out eval split (epoch 3):

| Metric | Value |
|---|---|
| Accuracy | 0.9776 |
| Precision (jim_crow) | 0.9352 |
| Recall (jim_crow) | 0.9902 |
| F1 (jim_crow) | 0.9619 |
| F1 (macro) | 0.9730 |
| ROC AUC | 0.9965 |

Per-epoch eval:

| Training Loss | Epoch | Step | Val Loss | Accuracy | Precision (jim_crow) | Recall (jim_crow) | F1 (jim_crow) | F1 macro | ROC AUC |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.2893 | 1 | 90 | 0.1920 | 0.9524 | 0.8972 | 0.9412 | 0.9187 | 0.9425 | 0.9913 |
| 0.0716 | 2 | 180 | 0.0793 | 0.9776 | 0.9519 | 0.9706 | 0.9612 | 0.9727 | 0.9971 |
| 0.1101 | 3 | 270 | 0.1205 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9965 |
| 0.0027 | 4 | 360 | 0.1251 | 0.9776 | 0.9352 | 0.9902 | 0.9619 | 0.9730 | 0.9958 |
| 0.0001 | 5 | 450 | 0.1231 | 0.9748 | 0.9346 | 0.9804 | 0.9569 | 0.9696 | 0.9960 |

Held-out eval is small (357 rows; 102 positive). A single example shifts
accuracy by about 0.003 and `jim_crow` recall by about 0.01, so treat
differences of that size or smaller as noise.
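
The headline metrics are consistent with a single confusion matrix on the 357-row eval split. The counts used below (101 true positives, 7 false positives, 1 false negative, 248 true negatives) are not reported in the card; they are back-calculated from the epoch-3 precision, recall, and accuracy, so treat them as an inferred illustration.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, positive-class precision/recall/F1, and macro F1 from counts."""
    f1_pos = 2 * tp / (2 * tp + fp + fn)
    f1_neg = 2 * tn / (2 * tn + fp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "f1": f1_pos,
        "f1_macro": (f1_pos + f1_neg) / 2,
    }


# Inferred counts for the selected checkpoint (102 positives, 255 negatives).
metrics = binary_metrics(tp=101, fp=7, fn=1, tn=248)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Rounded to four decimals, these reproduce the accuracy, precision, recall, F1, and macro F1 values in the table for the selected checkpoint.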

## Citation

Please cite the original *On the Books* project for the data and methodology:

```
On the Books: Jim Crow and Algorithms of Resistance.
University of North Carolina at Chapel Hill Libraries.
https://onthebooks.lib.unc.edu
DOI: https://doi.org/10.17615/5c4g-sd44
```

### Framework versions

- Transformers 5.7.0
- PyTorch 2.11.0+cu130
- Datasets 4.8.5
- Tokenizers 0.22.2