+---
+license: mit
+language:
+- en
+tags:
+- text-classification
+- document-classification
+- binary-classification
+- legal-documents
+- hoa
+- property-management
+- ccr
+- declaration-of-covenants
+- sklearn
+- logistic-regression
+pipeline_tag: text-classification
+library_name: scikit-learn
+metrics:
+- f1
+- accuracy
+- roc_auc
+model-index:
+- name: ccr-binary-logreg
+  results:
+  - task:
+      type: text-classification
+      name: Document Binary Classification
+    metrics:
+    - type: f1
+      value: 0.940
+      name: Test F1
+    - type: roc_auc
+      value: 0.955
+      name: Test ROC AUC
+    - type: accuracy
+      value: 0.908
+      name: Test Accuracy
+---
+# CCR Binary Classifier (LogReg over OpenAI Embeddings)
+Document-level binary classifier for distinguishing actual **Declarations of Covenants, Conditions & Restrictions (CC&Rs)** from auxiliary HOA governance documents that share legal-document vocabulary (bylaws, articles of incorporation, rules & regulations, amendments, board resolutions, CDD bond disclosures, ordinances, easement policies).
+> **Note:** This README is the HuggingFace model card. The canonical deployed version is at https://huggingface.co/GoverningDocs/ccr-binary-logreg
+## Model Description
+This is a **scikit-learn `LogisticRegression`** trained on **mean-pooled OpenAI `text-embedding-3-small` embeddings** (1536 dimensions) of substantive document pages. It serves as a "second-opinion" gate in the CCR report pipeline, downstream of the XGBoost page classifier and upstream of CCR extraction: it decides whether a document that the page classifier tagged as CCR is actually a Declaration of Covenants worth dispatching CCR extraction on.
+### Why this model exists
+The XGBoost page classifier (`GoverningDocs/xgboost-page-classifier`) labels individual pages with one of 12 categories. CCR is the largest source of overdetection: CDD bond disclosures, municipal ordinances, easement policies, HOA bylaws, and articles of incorporation all use legal-document phrasing that XGBoost has learned correlates with CCR. The result: ~73-85% positive-pattern false-positive rate on auxiliary HOA documents — which means the CCR report pipeline frequently runs on garbage input, producing fabricated red flags or degraded reports.
+This binary classifier sits AFTER the page classifier and asks: "Yes, the page classifier flagged some pages as CCR, but is this DOCUMENT actually a Declaration of Covenants?" If not, the CCR pipeline never runs on that document.
+### Architecture
+- **Embedding model:** OpenAI `text-embedding-3-small` (1536 dims) — same encoder used by the production vectorstore (`langchain_pg_embedding`)
+- **Aggregation:** mean-pool the embeddings of up to 20 substantive (non-boilerplate) pages per document into a single 1536-dim doc vector
+- **Classifier head:** sklearn `LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)`
+- **Operating threshold:** 0.436 (F1-maximizing on validation set)
+- **Output:** `P(is_actual_declaration_of_covenants)` ∈ [0, 1]
+Why LogReg over MLP / SetFit / BGE fine-tune: Phase 1 trained both LogReg and a shallow MLP (1×512 ReLU). Both reached nearly the same validation F1 at their best thresholds (0.894 vs. 0.892). LogReg won on the simplicity tiebreak (smaller artifact, probability outputs without an extra calibration stage, no GPU needed); see Limitations for the measured calibration error. Phase 0 produced ~7,100 labeled examples, well above the few-shot regime where SetFit's contrastive fine-tuning adds value.
+## Training Data
+Trained on **7,129 high-confidence labeled pages from 465 unique HOA / CCR-adjacent documents**, produced by a multi-signal corpus relabeling pipeline:
+| Signal | Coverage |
+|---|---|
+| Signature anti-patterns (CDD, ordinance, bond, policy markers) | Auto-labels obvious non-CCR cases |
+| Page-structural heuristics (TOC, recording stamps, signature blocks, blanks) | Auto-labels boilerplate at high confidence |
+| Claude Opus subagent verification | Verifies all DECLARATION-tentative + INDETERMINATE pages with rubric-based reasoning |
+### Page-level label distribution
+| Class | Count | % |
+|---|---|---|
+| DECLARATION_OF_COVENANTS | 3,014 | 42.3% |
+| AUXILIARY_HOA_DOC | 1,551 | 21.8% |
+| BOILERPLATE | 2,564 | 35.9% |
+### Document-level binary labels
+A document is labeled `is_declaration_of_covenants = 1` if `count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5`. This handles multi-document composites (Declaration + embedded Bylaws + Amendments + signatures bundled in one PDF) by requiring that the document is DOMINANTLY a real Declaration, not just one that contains some Declaration content.
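The aggregation rule above can be sketched as a small function. This is a minimal illustration, not the project's actual code: the function name, the `min_share` parameter, and the handling of documents with zero substantive pages are assumptions.

```python
from collections import Counter

DECLARATION = "DECLARATION_OF_COVENANTS"
BOILERPLATE = "BOILERPLATE"


def is_declaration_of_covenants(page_labels, min_share=0.5):
    """Return 1 if DECLARATION pages are at least `min_share` of the
    non-BOILERPLATE pages, else 0."""
    counts = Counter(page_labels)
    substantive = sum(n for label, n in counts.items() if label != BOILERPLATE)
    if substantive == 0:
        # Assumption: a document with no substantive pages is labeled negative.
        return 0
    return int(counts[DECLARATION] / substantive >= min_share)


# Composite PDF: 3 Declaration pages, 5 Bylaws pages, 2 boilerplate pages.
composite = [DECLARATION] * 3 + ["AUXILIARY_HOA_DOC"] * 5 + [BOILERPLATE] * 2
print(is_declaration_of_covenants(composite))  # 3 / 8 = 0.375 -> prints 0
```

Boilerplate pages are excluded from the denominator, so signature blocks and recording stamps cannot dilute a real Declaration below the 0.5 cutoff.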
+Final binary class balance: 42.3% positive / 57.7% negative across 465 documents.
+## Performance
+### Test set (held-out, 65 documents, 47 positives)
+| Metric | Value |
+|---|---|
+| **F1** | **0.940** |
+| ROC AUC | 0.955 |
+| Accuracy | 0.908 |
+| Confusion matrix | `[[12 TN, 6 FP], [0 FN, 47 TP]]` |
+| **Recall** | **100%** — never misses a real Declaration in this composition |
+| Precision | 88.7% |
+| Brier score | 0.134 |
+| ECE (10-bin) | 0.278 |
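As a sanity check, the F1, recall, precision, and accuracy rows above can be recomputed from the confusion matrix alone. The snippet below is plain arithmetic on the four reported counts; the variable names are illustrative.

```python
# Counts from the reported test confusion matrix.
tn, fp, fn, tp = 12, 6, 0, 47

precision = tp / (tp + fp)                          # 47 / 53
recall = tp / (tp + fn)                             # 47 / 47
f1 = 2 * precision * recall / (precision + recall)  # 94 / 100
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 59 / 65

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# prints: precision=0.887 recall=1.000 f1=0.940 accuracy=0.908
```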
+### Validation set (75 documents, 39 positives)
+| Metric | Value |
+|---|---|
+| F1 | 0.894 |
+| ROC AUC | 0.875 |
+| Brier score | 0.191 |
+| ECE | 0.207 |
+The val/test gap reflects different class composition across the document-level splits: val landed on a harder subset (52% positive), test on an easier one (72% positive). With only 75 and 65 documents respectively, treat each figure as a noisy estimate rather than a single definitive score.
+### Cross-architecture comparison
+The Phase 1 candidates reached nearly the same validation F1 at their best thresholds. AUC does not depend on the decision threshold, so the validation and test AUC values are listed in separate columns:
+| Model | Val F1 | Test F1 | Val AUC | Test AUC |
+|---|---|---|---|---|
+| LogReg (default 0.5) | 0.85 | — | 0.875 | — |
+| **LogReg (threshold 0.436)** | **0.894** | **0.940** | 0.875 | **0.955** |
+| LogReg + Platt-calibrated | 0.894 | — | 0.874 | — |
+| MLP (1×512 ReLU) at best threshold | 0.892 | — | 0.847 | — |
+LogReg wins on the simplicity tiebreak.
+## Intended Use
+### Primary use case
+Upstream gate in the CCR report pipeline. After the XGBoost page classifier flags pages as CCR, this model evaluates whether the parent document is actually a Declaration of Covenants worth running CCR extraction on. Decision band:
+- **Score < 0.30**: confident NOT-CCR. Skip CCR pipeline entirely. Removes the document from CCR dispatch.
+- **Score >= 0.85**: confident IS-CCR. Trust the classifier and take the fast path, bypassing the more expensive agentic `detect_ccr` validator.
+- **0.30 <= Score < 0.85**: ambiguous. Escalate to agentic `detect_ccr` for a deeper look.
+### Out-of-scope use
+- Per-page CCR detection (this is a document-level model)
+- Multi-class document categorization (use `GoverningDocs/xgboost-page-classifier` for that)
+- Standalone use without the embedding pipeline (requires OpenAI text-embedding-3-small inference)
+## Limitations
+### Calibration
+The model's predicted probabilities are poorly calibrated (ECE 0.21-0.28 on validation/test) and systematically over-confident. The decision-band thresholds (0.30 / 0.85) above are **operationally tuned, not probability-calibrated**.
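For readers who want to reproduce a 10-bin ECE on their own scores, here is a minimal sketch. It assumes the binary variant that compares the mean predicted positive-class probability with the observed positive rate in equal-width bins; the card does not state which ECE definition was used, so results may differ from the figures reported above.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE for positive-class probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so every score maps to a bin index in 0..n_bins-1.
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # weight each bin by its share of samples
    return float(ece)


# Toy scores (not real model outputs): each bin is off by 0.05.
print(expected_calibration_error([1, 1, 0, 0], [0.95, 0.95, 0.05, 0.05]))
```

ECE is an unsigned quantity, so it reports the size of the calibration gap but not its direction; a reliability diagram is needed to see whether scores run high or low.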
+If you redeploy with new thresholds based on different operating goals, derive them empirically from the test set's reliability diagram. Wrap with isotonic recalibration if probability calibration matters for your use case.
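One way to wrap the scores with isotonic recalibration is scikit-learn's `IsotonicRegression`, fitted on held-out raw scores. The sketch below uses synthetic scores and labels as a stand-in for real held-out data, and the helper name `calibrated_score` is hypothetical.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for held-out raw scores and labels (not real data).
raw_scores = rng.uniform(0.0, 1.0, size=500)
# Over-confident by construction: the true positive rate is pulled toward 0.5.
true_rate = 0.5 + 0.5 * (raw_scores - 0.5)
labels = (rng.uniform(size=500) < true_rate).astype(int)

# Monotone map from raw score to empirical positive rate, clipped to [0, 1].
calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_scores, labels)


def calibrated_score(raw):
    """Map raw predict_proba outputs to recalibrated probabilities."""
    return calibrator.predict(np.atleast_1d(raw))


print(calibrated_score([0.10, 0.50, 0.95]))
```

Because the mapping changes the score scale, any decision-band thresholds would need to be re-derived on the recalibrated scores rather than reused as-is.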
+### Sample size
+Trained on 465 unique documents (325 train / 75 val / 65 test). Reasonable for binary classification on dense semantic embeddings, but generalization to extreme-out-of-distribution document types (e.g., CCRs in languages other than English, or document structures not represented in the training corpus) is unknown.
+### Geographic distribution
+Training corpus is weighted heavily toward Florida and California. CCRs from underrepresented states (NY, IL, AZ, OH) may have lower recall. The model serves as the gate, not the only safety net: downstream agentic validation and the output-layer grounding filter (PR-B) catch edge cases.
+### Composite documents
+The "is dominantly a Declaration" rule (>=50% of substantive pages are DECLARATION) means a long bundled package with a small Declaration section + large Bylaws section will be classified as NOT-CCR. This is operationally correct for the upstream-gate use case (don't run CCR extraction on a doc that's mostly bylaws), but means some real Declaration content gets filtered out of the CCR pipeline. The downstream pipelines for Bylaws, Articles, etc. would handle those.
+## How to Use
+### Inference
+```python
+import joblib
+import numpy as np
+from huggingface_hub import hf_hub_download
+from langchain_openai import OpenAIEmbeddings
+# Load model + threshold + config
+model_path = hf_hub_download(
+    repo_id="GoverningDocs/ccr-binary-logreg",
+    filename="ccr_binary_logreg_tuned.joblib",
+)
+artifact = joblib.load(model_path)
+model = artifact["model"]
+threshold = artifact["threshold"]  # 0.436, the F1-maximizing binary cutoff; the three-band logic below uses 0.30 / 0.85 instead
+cfg = artifact["config"]
+# Compute mean-pooled embedding for a document's first N substantive pages
+embeddings = OpenAIEmbeddings(model=cfg["embedding_model"])
+page_texts = [...]  # first ~5-20 substantive (non-boilerplate) pages
+page_vectors = embeddings.embed_documents(page_texts)
+doc_vector = np.mean(page_vectors, axis=0).reshape(1, -1)
+# Predict
+score = model.predict_proba(doc_vector)[0, 1]
+# Three-band decision
+if score < 0.30:
+    decision = "REJECT"  # confident not a Declaration; skip CCR pipeline
+elif score >= 0.85:
+    decision = "FAST_PASS"  # confident Declaration; bypass agentic validator
+else:
+    decision = "ESCALATE"  # ambiguous; run agentic detect_ccr
+```
+### Files in this repo
+- `ccr_binary_logreg_tuned.joblib` — pickled dict containing `model` (sklearn LogisticRegression), `threshold` (float, 0.436), and `config` (dict with `embedding_model`, `max_pages_per_doc`, `skip_boilerplate` flags)
+- `config.json` — JSON-readable summary of the model configuration
+## Training Procedure
+### Data preparation
+1. Source: `setfit_experiments` PostgreSQL DB. 16,896 unlabeled pages from CCR-tagged documents.
+2. Phase 0 corpus relabeling (4 stages):
+   - Deterministic signal pass: signature anti-patterns + positive patterns + page-structural heuristics → 4,049 pages auto-labeled (BOILERPLATE / AUXILIARY / DECLARATION-tentative)
+   - Opus subagent verification stage 1: 75 batches × 50 pages, prioritized by class-balance need
+   - Opus subagent verification stage 2 (deferred-pages pass): 28 batches × 50 pages, all 1,392 deterministic-DECLARATION-tentative pages verified
+   - Curation: 7,129 high-confidence pages retained
+3. Page → document aggregation: `is_declaration = (count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5)`
+4. Per-document substantive-page sampling: up to 20 pages per doc, BOILERPLATE filtered out
+5. Mean-pool embeddings of first 5-20 substantive pages → 1536-dim doc vector
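Steps 4 and 5 can be sketched as follows. This is an illustration under assumptions: pages are represented as hypothetical `(label, vector)` tuples in page order, the first 20 substantive pages are taken (the actual sampling strategy may differ), and random vectors stand in for real embeddings.

```python
import numpy as np

EMBED_DIM = 1536  # text-embedding-3-small output size
MAX_PAGES = 20    # per-document cap on substantive pages


def pool_document(pages, max_pages=MAX_PAGES):
    """Mean-pool the vectors of the first `max_pages` non-BOILERPLATE pages.

    `pages` is a list of (label, vector) tuples in page order.
    """
    vectors = [vec for label, vec in pages if label != "BOILERPLATE"][:max_pages]
    if not vectors:
        raise ValueError("document has no substantive pages to embed")
    return np.mean(np.asarray(vectors, dtype=float), axis=0)


rng = np.random.default_rng(0)
pages = [("BOILERPLATE", rng.normal(size=EMBED_DIM)) for _ in range(2)]
pages += [("DECLARATION_OF_COVENANTS", rng.normal(size=EMBED_DIM)) for _ in range(25)]

doc_vector = pool_document(pages)
print(doc_vector.shape)  # (1536,)
```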
+### Training
+- Stratified split by `document_id`: 70% train / 15% val / 15% test (325 / 75 / 65 docs)
+- `LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)`
+- Training time: ~1 second on CPU
+- Threshold selection: F1-maximizing on validation set → 0.436
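The fit and the F1-maximizing threshold search can be sketched with scikit-learn's `precision_recall_curve`. The data here is synthetic (32-dimensional random vectors, not the 1536-dimensional embeddings), and the `make_split` helper is hypothetical, so the printed threshold will not match 0.436.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)


def make_split(n_docs, dim=32):
    """Synthetic doc vectors whose first feature weakly predicts the label."""
    y = rng.integers(0, 2, size=n_docs)
    X = rng.normal(size=(n_docs, dim))
    X[:, 0] += 1.5 * y
    return X, y


X_train, y_train = make_split(325)
X_val, y_val = make_split(75)

model = LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)
model.fit(X_train, y_train)

val_scores = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, val_scores)
# precision/recall carry one extra trailing point (1, 0) with no threshold; drop it.
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)
f1 = 2 * precision[:-1] * recall[:-1] / denom
best_threshold = float(thresholds[np.argmax(f1)])
print(f"F1-maximizing validation threshold: {best_threshold:.3f}")
```

Selecting the threshold on the validation split keeps the test split untouched for the final reported metrics.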
+### Reproducibility
+Training script: https://github.com/governingdocs/backend (path: `experiments/setfit_ccr_binary/scripts/train_tier1_v2.py`)
+Phase 0 relabeling pipeline: same repo, `experiments/setfit_ccr_binary/scripts/opus_verify.py` and `experiments/setfit_ccr_binary/scripts/heuristics/`
+Findings document with full Phase 0 + Phase 1 results: `experiments/setfit_ccr_binary/PHASE1_TIER1_FINDINGS.md`
+## Citation
+If you use this model in research or production, please cite:
+```
+@misc{ccr-binary-logreg-2026,
+  title  = {CCR Binary Classifier: Document-Level Detection of Declarations of Covenants, Conditions, and Restrictions},
+  author = {GoverningDocs Engineering},
+  year   = {2026},
+  publisher = {HuggingFace},
+  howpublished = {\url{https://huggingface.co/GoverningDocs/ccr-binary-logreg}}
+}
+```
+## Versioning
+Model artifacts are versioned via HuggingFace commit history. `config.json` includes the corpus snapshot commit hash for reproducibility.
+## Maintenance
+This model is part of the T18 plan (CCR Upstream Input Hardening) in the GoverningDocs platform. See `plans/T18_CCR_UPSTREAM_INPUT_HARDENING_PLAN.md` (v2.1.1) in the product repo for design rationale, alternatives considered (page-classifier retrain, agentic-only, signature patterns), and Phase 2 wire-in plans.