---
license: mit
language:
- en
tags:
- text-classification
- document-classification
- binary-classification
- legal-documents
- hoa
- property-management
- ccr
- declaration-of-covenants
- sklearn
- logistic-regression
pipeline_tag: text-classification
library_name: scikit-learn
metrics:
- f1
- accuracy
- roc_auc
model-index:
- name: ccr-binary-logreg
  results:
  - task:
      type: text-classification
      name: Document Binary Classification
    metrics:
    - type: f1
      value: 0.940
      name: Test F1
    - type: roc_auc
      value: 0.955
      name: Test ROC AUC
    - type: accuracy
      value: 0.908
      name: Test Accuracy
---

# CCR Binary Classifier (LogReg over OpenAI Embeddings)

Document-level binary classifier for distinguishing actual **Declarations of Covenants, Conditions & Restrictions (CC&Rs)** from auxiliary HOA governance documents that share legal-document vocabulary (bylaws, articles of incorporation, rules & regulations, amendments, board resolutions, CDD bond disclosures, ordinances, easement policies).

> **Note:** This README is the HuggingFace model card. The canonical deployed version is at https://huggingface.co/GoverningDocs/ccr-binary-logreg

## Model Description

This is a **scikit-learn `LogisticRegression`** trained on **mean-pooled OpenAI `text-embedding-3-small` embeddings** (1536 dimensions) of substantive document pages. It serves as the upstream "second-opinion" gate in the CCR report pipeline: it decides whether a document tagged as CCR by the upstream XGBoost page classifier is actually a Declaration of Covenants worth dispatching CCR extraction on.

### Why this model exists

The XGBoost page classifier (`GoverningDocs/xgboost-page-classifier`) labels individual pages with one of 12 categories. CCR is the largest source of overdetection: CDD bond disclosures, municipal ordinances, easement policies, HOA bylaws, and articles of incorporation all use legal-document phrasing that XGBoost has learned correlates with CCR. The result is a ~73-85% positive-pattern false-positive rate on auxiliary HOA documents, which means the CCR report pipeline frequently runs on garbage input, producing fabricated red flags or degraded reports.

This binary classifier sits AFTER the page classifier and asks: "Yes, the page classifier flagged some pages as CCR, but is this DOCUMENT actually a Declaration of Covenants?" If not, the CCR pipeline never runs on that document.

### Architecture

- **Embedding model:** OpenAI `text-embedding-3-small` (1536 dims), the same encoder used by the production vectorstore (`langchain_pg_embedding`)
- **Aggregation:** mean-pool the embeddings of up to 20 substantive (non-boilerplate) pages per document into a single 1536-dim doc vector
- **Classifier head:** sklearn `LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)`
- **Operating threshold:** 0.436 (F1-maximizing on the validation set)
- **Output:** `P(is_actual_declaration_of_covenants)` ∈ [0, 1]

Why LogReg over MLP / SetFit / BGE fine-tune: Phase 1 trained both LogReg and a shallow MLP (1×512 ReLU). Both converged to the same operating point at their best thresholds. LogReg won on the simplicity tiebreak (smaller artifact, natively calibrated probabilities, no GPU needed). Phase 0 produced ~7,100 labeled examples, well above the few-shot regime where SetFit's contrastive head adds value.

## Training Data

Trained on **7,129 high-confidence labeled pages from 465 unique HOA / CCR-adjacent documents**, produced by a multi-signal corpus relabeling pipeline:

| Signal | Coverage |
|---|---|
| Signature anti-patterns (CDD, ordinance, bond, policy markers) | Auto-labels obvious non-CCR cases |
| Page-structural heuristics (TOC, recording stamps, signature blocks, blanks) | Auto-labels boilerplate at high confidence |
| Claude Opus subagent verification | Verifies all DECLARATION-tentative + INDETERMINATE pages with rubric-based reasoning |

### Page-level label distribution

| Class | Count | % |
|---|---|---|
| DECLARATION_OF_COVENANTS | 3,014 | 42.3% |
| AUXILIARY_HOA_DOC | 1,551 | 21.8% |
| BOILERPLATE | 2,564 | 35.9% |

### Document-level binary labels

A document is labeled `is_declaration_of_covenants = 1` if `count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5`. This handles multi-document composites (Declaration + embedded Bylaws + Amendments + signatures bundled in one PDF) by requiring that the document is DOMINANTLY a real Declaration, not just one that contains some Declaration content.

Final binary class balance: 42.3% positive / 57.7% negative across 465 documents.
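
The dominance rule can be sketched in a few lines (an illustrative helper, not the production code; the function name is an assumption, while the label strings follow this card's page classes):

```python
def label_document(page_labels: list[str]) -> int:
    """Return 1 if DECLARATION pages make up >= 50% of non-boilerplate pages."""
    substantive = [p for p in page_labels if p != "BOILERPLATE"]
    if not substantive:
        return 0  # an all-boilerplate document is never a Declaration
    declaration = sum(p == "DECLARATION_OF_COVENANTS" for p in substantive)
    return int(declaration / len(substantive) >= 0.5)

# Composite PDF: 4 Declaration pages, 5 Bylaws pages, 3 boilerplate pages
pages = (["DECLARATION_OF_COVENANTS"] * 4
         + ["AUXILIARY_HOA_DOC"] * 5
         + ["BOILERPLATE"] * 3)
print(label_document(pages))  # 4/9 < 0.5, so the document is labeled 0
```

Note that boilerplate pages are excluded from the denominator, so recording stamps and signature blocks never dilute the dominance ratio.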
|
|
## Performance

### Test set (held-out, 65 documents, 47 positives)

| Metric | Value |
|---|---|
| **F1** | **0.940** |
| ROC AUC | 0.955 |
| Accuracy | 0.908 |
| Confusion matrix | `[[12 TN, 6 FP], [0 FN, 47 TP]]` |
| **Recall** | **100%**: never misses a real Declaration in this test composition |
| Precision | 88.7% |
| Brier score | 0.134 |
| ECE (10-bin) | 0.278 |
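
For reference, the Brier score and a 10-bin ECE of the kind reported above can be computed along these lines (a sketch with toy data; the equal-width binning convention here is an assumption and may differ from the evaluation script's):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def ece_10bin(y_true, y_prob, n_bins=10):
    """Expected calibration error with equal-width bins:
    occupancy-weighted mean of |bin accuracy - bin confidence|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bins[i], bins[i + 1]
        # last bin is closed on the right so probability 1.0 is counted
        in_bin = (y_prob >= lo) & (y_prob < hi) if i < n_bins - 1 else (y_prob >= lo)
        if in_bin.any():
            ece += in_bin.mean() * abs(y_true[in_bin].mean() - y_prob[in_bin].mean())
    return ece

# Toy example: confident, correct predictions give near-zero values for both
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.99, 0.01, 0.98, 0.97, 0.02])
print(brier_score_loss(y, p), ece_10bin(y, p))
```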
|
|
### Validation set (75 documents, 39 positives)

| Metric | Value |
|---|---|
| F1 | 0.894 |
| ROC AUC | 0.875 |
| Brier score | 0.191 |
| ECE | 0.207 |

The val/test difference reflects differing class composition across the stratified-by-document splits: val landed on a harder subset (52% positive), test on an easier one (72% positive). Each is an unbiased estimate for its own sample composition.

### Cross-architecture comparison

All three Phase 1 candidates converged to the same operating point:

| Model | Val F1 | Test F1 | Test AUC |
|---|---|---|---|
| LogReg (default 0.5) | 0.85 | n/a | 0.875 |
| **LogReg (threshold 0.436)** | **0.894** | **0.940** | **0.955** |
| LogReg + Platt-calibrated | 0.894 | n/a | 0.874 |
| MLP (1×512 ReLU) at best threshold | 0.892 | n/a | 0.847 |

LogReg wins on the simplicity tiebreak.

## Intended Use

### Primary use case

Upstream gate in the CCR report pipeline. After the XGBoost page classifier flags pages as CCR, this model evaluates whether the parent document is actually a Declaration of Covenants worth running CCR extraction on. Decision bands (recalibrated empirically; the original `(0.30, 0.85)` plan-time bands left FAST_PASS empty in production because real Declarations score 0.45-0.70 raw):

- **Score < 0.25**: confident NOT-CCR. Skip the CCR pipeline entirely; the document is removed from CCR dispatch.
- **Score >= 0.55**: confident IS-CCR. Trust the classifier; the fast path bypasses the more expensive agentic `detect_ccr` validator.
- **0.25 <= Score < 0.55**: ambiguous. Escalate to agentic `detect_ccr` for a deeper look.

### Out-of-scope use

- Per-page CCR detection (this is a document-level model)
- Multi-class document categorization (use `GoverningDocs/xgboost-page-classifier` for that)
- Standalone use without the embedding pipeline (requires OpenAI `text-embedding-3-small` inference)

## Limitations

### Calibration

The raw LogReg artifact has ECE 0.19-0.28 on validation/test: predicted probabilities are systematically miscalibrated. The decision-band thresholds `(0.25, 0.55)` above are **empirically tuned on the production score distribution, not probability-calibrated**.

A separate isotonic calibrator artifact (`ccr_binary_isotonic_calibrator.joblib`) ships in the same repo and reduces test-set ECE from 0.278 to 0.087 (a 3.2x improvement). It is **purely additive metadata**: the production gate still consumes raw scores. Use the calibrator if you need probability-calibrated outputs for drift monitoring, signal combination with other classifiers, or user-facing confidence display. See the "Calibration Support" section below for details.

### Sample size

Trained on 465 unique documents (325 train / 75 val / 65 test). This is reasonable for binary classification on dense semantic embeddings, but generalization to far-out-of-distribution document types (e.g., CC&Rs in languages other than English, or document structures not represented in the training corpus) is unknown.

### Geographic distribution

The training corpus is heavily weighted toward Florida and California. CC&Rs from underrepresented states (NY, IL, AZ, OH) may have lower recall. The model serves as the gate, not the only safety net: downstream agentic validation and the output-layer grounding filter (PR-B) catch edge cases.

### Composite documents

The "is dominantly a Declaration" rule (>=50% of substantive pages are DECLARATION) means a long bundled package with a small Declaration section and a large Bylaws section will be classified as NOT-CCR. This is operationally correct for the upstream-gate use case (don't run CCR extraction on a doc that's mostly bylaws), but it means some real Declaration content gets filtered out of the CCR pipeline. The downstream pipelines for Bylaws, Articles, etc. would handle those documents.

## How to Use

### Inference

```python
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
from langchain_openai import OpenAIEmbeddings

# Load model + threshold + config
model_path = hf_hub_download(
    repo_id="GoverningDocs/ccr-binary-logreg",
    filename="ccr_binary_logreg_tuned.joblib",
)
artifact = joblib.load(model_path)
model = artifact["model"]
threshold = artifact["threshold"]  # 0.436
cfg = artifact["config"]

# Compute mean-pooled embedding for a document's first N substantive pages
embeddings = OpenAIEmbeddings(model=cfg["embedding_model"])
page_texts = [...]  # first ~5-20 substantive (non-boilerplate) pages
page_vectors = embeddings.embed_documents(page_texts)
doc_vector = np.mean(page_vectors, axis=0).reshape(1, -1)

# Predict
score = model.predict_proba(doc_vector)[0, 1]

# Three-band decision (recalibrated production bands)
if score < 0.25:
    decision = "REJECT"  # confident not a Declaration; skip CCR pipeline
elif score >= 0.55:
    decision = "FAST_PASS"  # confident Declaration; bypass agentic validator
else:
    decision = "ESCALATE"  # ambiguous; run agentic detect_ccr
```

### Calibration Support

An optional isotonic calibrator (`ccr_binary_isotonic_calibrator.joblib`) maps raw scores to probability-calibrated outputs.

```python
calibrator_path = hf_hub_download(
    repo_id="GoverningDocs/ccr-binary-logreg",
    filename="ccr_binary_isotonic_calibrator.joblib",
)
cal_artifact = joblib.load(calibrator_path)
calibrator = cal_artifact["calibrator"]

# With cv="prefit", method="isotonic", and a binary target, the inner
# IsotonicRegression was fit on raw predict_proba scores, so it can be
# applied directly to a float score (here, `score` from the inference snippet)
inner = calibrator.calibrated_classifiers_[0].calibrators[0]
calibrated = float(inner.predict([score])[0])
```

**Caveats:**
- The shipped isotonic was fit on a small (~70-doc) validation split and produces roughly three plateau outputs (0.737, 0.833, 1.000). Treat calibrated scores as three-level (low / med / high) confidence rather than fine-grained probabilities.
- The calibrator's `shipped_model_filename` field MUST match the model file you loaded. Cross-check before use to guard against artifact mismatch.
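
The filename cross-check in the second caveat can be as simple as the following (a sketch; the `shipped_model_filename` key is documented above, while the guard function itself is illustrative):

```python
MODEL_FILENAME = "ccr_binary_logreg_tuned.joblib"

def check_artifact_pairing(cal_artifact: dict, model_filename: str) -> None:
    """Fail fast if the calibrator was fit against a different model artifact."""
    shipped = cal_artifact.get("shipped_model_filename")
    if shipped != model_filename:
        raise ValueError(
            f"Calibrator was fit for {shipped!r}, but {model_filename!r} is loaded"
        )

# With a matching pair this is a no-op; a mismatch raises before any scoring.
check_artifact_pairing({"shipped_model_filename": MODEL_FILENAME}, MODEL_FILENAME)
```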
|
|
### Files in this repo

- `ccr_binary_logreg_tuned.joblib`: pickled dict containing `model` (sklearn LogisticRegression) and `config` (dict with `embedding_model`, `max_pages_per_doc`, `skip_boilerplate` flags). The `threshold` field (0.436) is a Phase 1 artifact; production uses bands, not a single threshold.
- `ccr_binary_isotonic_calibrator.joblib`: pickled dict containing `calibrator` (sklearn `CalibratedClassifierCV` with `cv="prefit"`, `method="isotonic"`), `shipped_model_filename` (paired model artifact), and ECE before/after metadata.
- `config.json`: JSON-readable summary of the model configuration, decision bands, and calibrator metadata.

## Training Procedure

### Data preparation

1. Source: `setfit_experiments` PostgreSQL DB; 16,896 unlabeled pages from CCR-tagged documents.
2. Phase 0 corpus relabeling (4 stages):
   - Deterministic signal pass: signature anti-patterns + positive patterns + page-structural heuristics → 4,049 pages auto-labeled (BOILERPLATE / AUXILIARY / DECLARATION-tentative)
   - Opus subagent verification stage 1: 75 batches × 50 pages, prioritized by class-balance need
   - Opus subagent verification stage 2 (deferred-pages pass): 28 batches × 50 pages; all 1,392 deterministic DECLARATION-tentative pages verified
   - Curation: 7,129 high-confidence pages retained
3. Page → document aggregation: `is_declaration = (count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5)`
4. Per-document substantive-page sampling: up to 20 pages per doc, BOILERPLATE filtered out
5. Mean-pool embeddings of the first 5-20 substantive pages → 1536-dim doc vector

### Training

- Stratified split by `document_id`: 70% train / 15% val / 15% test (325 / 75 / 65 docs)
- `LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)`
- Training time: ~1 second on CPU
- Threshold selection: F1-maximizing on the validation set → 0.436
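
F1-maximizing threshold selection of this kind can be reproduced with `precision_recall_curve` (a sketch on synthetic scores; the actual training script may implement it differently):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_val: np.ndarray, scores: np.ndarray) -> float:
    """Pick the decision threshold that maximizes F1 on validation scores."""
    precision, recall, thresholds = precision_recall_curve(y_val, scores)
    # precision/recall have one more entry than thresholds; drop the final
    # (precision=1, recall=0) point so arrays align with thresholds
    denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
    f1 = 2 * precision[:-1] * recall[:-1] / denom
    return float(thresholds[np.argmax(f1)])

# Toy validation set: positives cluster high, one negative sits at 0.5
y = np.array([0, 0, 0, 1, 1, 1])
s = np.array([0.1, 0.3, 0.5, 0.45, 0.7, 0.9])
print(best_f1_threshold(y, s))
```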
|
|
### Reproducibility

Training script: https://github.com/governingdocs/backend (path: `experiments/setfit_ccr_binary/scripts/train_tier1_v2.py`)

Phase 0 relabeling pipeline: same repo, `experiments/setfit_ccr_binary/scripts/opus_verify.py` and `experiments/setfit_ccr_binary/scripts/heuristics/`

Findings document with full Phase 0 + Phase 1 results: `experiments/setfit_ccr_binary/PHASE1_TIER1_FINDINGS.md`

## Citation

If you use this model in research or production, please cite:

```
@misc{ccr-binary-logreg-2026,
  title = {CCR Binary Classifier: Document-Level Detection of Declarations of Covenants, Conditions, and Restrictions},
  author = {GoverningDocs Engineering},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GoverningDocs/ccr-binary-logreg}}
}
```

## Versioning

Model artifacts are versioned via HuggingFace commit history. `config.json` includes the corpus snapshot commit hash for reproducibility.

## Maintenance

This model is part of the T18 plan (CCR Upstream Input Hardening) in the GoverningDocs platform. See `plans/T18_CCR_UPSTREAM_INPUT_HARDENING_PLAN.md` (v2.2.1, Completed) in the product repo for design rationale, alternatives considered (page-classifier retrain, agentic-only, signature patterns), and the Phase 2 wire-in.

The calibrator artifact was added per `plans/CCR_BINARY_ISOTONIC_RECALIBRATION_PLAN.md` (v1.4.0). Phase 1 findings: `experiments/setfit_ccr_binary/ISOTONIC_CALIBRATION_FINDINGS.md`.
|
|