hslee1981 committed
Commit 76932f1 · verified · 1 Parent(s): 729504d

T18 Phase 1 Tier 1: model card

Files changed (1)
  1. README.md +257 -0
README.md ADDED
@@ -0,0 +1,257 @@
---
license: mit
language:
- en
tags:
- text-classification
- document-classification
- binary-classification
- legal-documents
- hoa
- property-management
- ccr
- declaration-of-covenants
- sklearn
- logistic-regression
pipeline_tag: text-classification
library_name: scikit-learn
metrics:
- f1
- accuracy
- roc_auc
model-index:
- name: ccr-binary-logreg
  results:
  - task:
      type: text-classification
      name: Document Binary Classification
    metrics:
    - type: f1
      value: 0.940
      name: Test F1
    - type: roc_auc
      value: 0.955
      name: Test ROC AUC
    - type: accuracy
      value: 0.908
      name: Test Accuracy
---

# CCR Binary Classifier (LogReg over OpenAI Embeddings)

Document-level binary classifier that distinguishes actual **Declarations of Covenants, Conditions & Restrictions (CC&Rs)** from auxiliary HOA governance documents that share legal-document vocabulary (bylaws, articles of incorporation, rules & regulations, amendments, board resolutions, CDD bond disclosures, ordinances, easement policies).

> **Note:** This README is the HuggingFace model card. The canonical deployed version is at https://huggingface.co/GoverningDocs/ccr-binary-logreg

## Model Description

This is a **scikit-learn `LogisticRegression`** trained on **mean-pooled OpenAI `text-embedding-3-small` embeddings** (1536 dimensions) of substantive document pages. It serves as the "second-opinion" gate in front of the CCR report pipeline: it decides whether a document tagged as CCR by the upstream XGBoost page classifier is actually a Declaration of Covenants worth dispatching CCR extraction on.

### Why this model exists

The XGBoost page classifier (`GoverningDocs/xgboost-page-classifier`) labels individual pages with one of 12 categories. CCR is the largest source of overdetection: CDD bond disclosures, municipal ordinances, easement policies, HOA bylaws, and articles of incorporation all use legal-document phrasing that XGBoost has learned to associate with CCR. The result is a ~73-85% positive-pattern false-positive rate on auxiliary HOA documents, which means the CCR report pipeline frequently runs on input that is not a Declaration, producing fabricated red flags or degraded reports.

This binary classifier sits after the page classifier and asks: "The page classifier flagged some pages as CCR, but is this document actually a Declaration of Covenants?" If not, the CCR pipeline never runs on that document.

### Architecture

- **Embedding model:** OpenAI `text-embedding-3-small` (1536 dims), the same encoder used by the production vectorstore (`langchain_pg_embedding`)
- **Aggregation:** mean-pool the embeddings of up to 20 substantive (non-boilerplate) pages per document into a single 1536-dim doc vector
- **Classifier head:** sklearn `LogisticRegression(class_weight="balanced", C=1.0, max_iter=2000)`
- **Operating threshold:** 0.436 (F1-maximizing on validation set)
- **Output:** `P(is_actual_declaration_of_covenants)` ∈ [0, 1]

Why LogReg over MLP / SetFit / BGE fine-tune: Phase 1 trained both LogReg and a shallow MLP (1×512 ReLU), and both converged to the same operating point at their best thresholds. LogReg won on the simplicity tiebreak (smaller artifact, native probability outputs, no GPU needed). Phase 0 produced ~7,100 labeled examples, well above the few-shot regime where SetFit's contrastive head adds value.
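
The aggregation step listed above can be illustrated with a minimal, dependency-free sketch. The function name `mean_pool` and the toy 3-dim vectors are assumptions made for illustration; the actual training code is not shown in this card and may differ.

```python
from typing import List, Sequence

MAX_PAGES = 20  # cap stated in the card: up to 20 substantive pages per document


def mean_pool(page_vectors: Sequence[Sequence[float]], max_pages: int = MAX_PAGES) -> List[float]:
    """Average the first `max_pages` page embeddings into one document vector.

    Hypothetical helper: assumes boilerplate pages were already filtered out.
    """
    pages = list(page_vectors)[:max_pages]
    if not pages:
        raise ValueError("need at least one substantive page embedding")
    dim = len(pages[0])
    if any(len(vec) != dim for vec in pages):
        raise ValueError("all page embeddings must share one dimensionality")
    # zip(*pages) walks the vectors dimension by dimension.
    return [sum(column) / len(pages) for column in zip(*pages)]


# Toy 3-dim vectors stand in for the real 1536-dim embeddings.
doc_vector = mean_pool([[1.0, 0.0, 2.0], [3.0, 2.0, 2.0]])
print(doc_vector)  # [2.0, 1.0, 2.0]
```

With NumPy available, `np.mean(page_vectors, axis=0)` computes the same average in one call.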

## Training Data

Trained on **7,129 high-confidence labeled pages from 465 unique HOA / CCR-adjacent documents**, produced by a multi-signal corpus relabeling pipeline:

| Signal | Coverage |
|---|---|
| Signature anti-patterns (CDD, ordinance, bond, policy markers) | Auto-labels obvious non-CCR cases |
| Page-structural heuristics (TOC, recording stamps, signature blocks, blanks) | Auto-labels boilerplate at high confidence |
| Claude Opus subagent verification | Verifies all DECLARATION-tentative + INDETERMINATE pages with rubric-based reasoning |

### Page-level label distribution

| Class | Count | % |
|---|---|---|
| DECLARATION_OF_COVENANTS | 3,014 | 42.3% |
| AUXILIARY_HOA_DOC | 1,551 | 21.8% |
| BOILERPLATE | 2,564 | 36.0% |

### Document-level binary labels

A document is labeled `is_declaration_of_covenants = 1` if `count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5`. This handles multi-document composites (Declaration + embedded Bylaws + Amendments + signatures bundled in one PDF) by requiring that the document is predominantly a real Declaration, not just one that contains some Declaration content.

Final binary class balance: 42.3% positive / 57.7% negative across 465 documents.
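
The document-level labeling rule can be sketched as follows. The function name `label_document` and the choice to return `None` for documents that contain only boilerplate pages are assumptions; the card does not say how that edge case is handled.

```python
from collections import Counter
from typing import Iterable, Optional

DECLARATION = "DECLARATION_OF_COVENANTS"
BOILERPLATE = "BOILERPLATE"


def label_document(page_labels: Iterable[str]) -> Optional[int]:
    """Return 1 if at least half of the non-boilerplate pages are DECLARATION pages, else 0."""
    counts = Counter(page_labels)
    substantive = sum(n for label, n in counts.items() if label != BOILERPLATE)
    if substantive == 0:
        return None  # assumption: a boilerplate-only document gets no label
    return int(counts[DECLARATION] / substantive >= 0.5)


# Composite PDF: 6 Declaration pages, 4 bylaws pages, 5 boilerplate pages -> 6/10 = 0.6
composite = [DECLARATION] * 6 + ["AUXILIARY_HOA_DOC"] * 4 + [BOILERPLATE] * 5
print(label_document(composite))  # 1

# Mostly bylaws with a short Declaration excerpt -> 3/12 = 0.25
mostly_bylaws = [DECLARATION] * 3 + ["AUXILIARY_HOA_DOC"] * 9
print(label_document(mostly_bylaws))  # 0
```

Boilerplate pages are excluded from the denominator, so long signature or recording sections do not dilute the Declaration share.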

## Performance

### Test set (held-out, 65 documents, 47 positives)

| Metric | Value |
|---|---|
| **F1** | **0.940** |
| ROC AUC | 0.955 |
| Accuracy | 0.908 |
| Confusion matrix | `[[12 TN, 6 FP], [0 FN, 47 TP]]` |
| **Recall** | **100%** (no real Declaration missed on this test split) |
| Precision | 88.7% |
| Brier score | 0.134 |
| ECE (10-bin) | 0.278 |
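
As a sanity check, the threshold-dependent rows of the table follow directly from the confusion matrix. The helper name `binary_metrics` is illustrative only.

```python
from typing import Dict


def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Derive precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Counts from the test-set confusion matrix: [[12 TN, 6 FP], [0 FN, 47 TP]]
metrics = binary_metrics(tn=12, fp=6, fn=0, tp=47)
print({name: round(value, 3) for name, value in metrics.items()})
# {'precision': 0.887, 'recall': 1.0, 'f1': 0.94, 'accuracy': 0.908}
```

ROC AUC, Brier score, and ECE depend on the raw scores, so they cannot be recomputed from these four counts.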

### Validation set (75 documents, 39 positives)

| Metric | Value |
|---|---|
| F1 | 0.894 |
| ROC AUC | 0.875 |
| Brier score | 0.191 |
| ECE | 0.207 |

The val/test difference reflects different class composition across the stratified-by-document splits: val landed on a harder subset (52% positive), test on an easier one (72% positive). Both are unbiased estimates, but each describes a different class mix.

### Cross-architecture comparison

All three Phase 1 candidates converged to the same operating point:

| Model | Val F1 | Test F1 | Test AUC |
|---|---|---|---|
| LogReg (default 0.5) | 0.85 | n/a | 0.875 |
| **LogReg (threshold 0.436)** | **0.894** | **0.940** | **0.955** |
| LogReg + Platt-calibrated | 0.894 | n/a | 0.874 |
| MLP (1×512 ReLU) at best threshold | 0.892 | n/a | 0.847 |

LogReg wins on the simplicity tiebreak.

## Intended Use

### Primary use case

Upstream gate in the CCR report pipeline. After the XGBoost page classifier flags pages as CCR, this model evaluates whether the parent document is actually a Declaration of Covenants worth running CCR extraction on. Decision bands:

- **Score < 0.30**: confident NOT-CCR. Skip the CCR pipeline entirely and remove the document from CCR dispatch.
- **Score >= 0.85**: confident IS-CCR. Trust the classifier; the fast path bypasses the more expensive agentic `detect_ccr` validator.
- **0.30 <= Score < 0.85**: ambiguous. Escalate to agentic `detect_ccr` for a deeper look.

### Out-of-scope use

- Per-page CCR detection (this is a document-level model)
- Multi-class document categorization (use `GoverningDocs/xgboost-page-classifier` for that)
- Standalone use without the embedding pipeline (requires OpenAI `text-embedding-3-small` inference)

## Limitations

### Calibration

An ECE of 0.21-0.28 on validation/test means the model's predicted probabilities are poorly calibrated (systematically over-confident). The decision-band thresholds (0.30 / 0.85) above are **operationally tuned, not probability-calibrated**.

If you redeploy with new thresholds based on different operating goals, derive them empirically from the test set's reliability diagram. Wrap the model with isotonic recalibration if probability calibration matters for your use case.
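
A reliability diagram and its ECE summary can be computed in a few lines. This is a sketch under assumptions: the card does not define its ECE variant, so the code uses the common positive-class form with equal-width bins, and the function names and toy scores are illustrative.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> List[Tuple[int, float, float]]:
    """Return (count, mean predicted probability, observed positive rate) per non-empty bin."""
    buckets: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 falls in the last bin
        buckets[index].append((label, prob))
    rows = []
    for bucket in buckets:
        if bucket:
            count = len(bucket)
            mean_prob = sum(prob for _, prob in bucket) / count
            positive_rate = sum(label for label, _ in bucket) / count
            rows.append((count, mean_prob, positive_rate))
    return rows


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Weighted average gap between predicted probability and observed positive rate."""
    rows = reliability_bins(y_true, y_prob, n_bins)
    total = sum(count for count, _, _ in rows)
    return sum(count / total * abs(rate - mean_prob) for count, mean_prob, rate in rows)


# Toy scores: three documents scored 0.9 (two are positive) and one scored 0.1 (negative).
ece = expected_calibration_error([1, 1, 0, 0], [0.9, 0.9, 0.9, 0.1])
print(round(ece, 3))  # 0.2
```

Each row from `reliability_bins` is one point on the reliability diagram: bins where the mean predicted probability exceeds the observed positive rate are over-confident.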

### Sample size

The corpus has 465 unique documents (325 train / 75 val / 65 test). This is reasonable for binary classification on dense semantic embeddings, but generalization to far out-of-distribution document types (e.g., CCRs in languages other than English, or document structures not represented in the training corpus) is unknown.

### Geographic distribution

The training corpus is heavily weighted toward Florida and California. CCRs from underrepresented states (NY, IL, AZ, OH) may have lower recall. The model serves as the gate, not the only safety net: downstream agentic validation and the output-layer grounding filter (PR-B) catch edge cases.

### Composite documents

The "is dominantly a Declaration" rule (>=50% of substantive pages are DECLARATION) means a long bundled package with a small Declaration section and a large Bylaws section will be classified as NOT-CCR. This is operationally correct for the upstream-gate use case (don't run CCR extraction on a doc that's mostly bylaws), but it means some real Declaration content gets filtered out of the CCR pipeline. The downstream pipelines for Bylaws, Articles, etc. would handle those documents.

## How to Use

### Inference

```python
import joblib
import numpy as np
from huggingface_hub import hf_hub_download
from langchain_openai import OpenAIEmbeddings

# Load model + threshold + config
model_path = hf_hub_download(
    repo_id="GoverningDocs/ccr-binary-logreg",
    filename="ccr_binary_logreg_tuned.joblib",
)
artifact = joblib.load(model_path)
model = artifact["model"]
threshold = artifact["threshold"]  # 0.436
cfg = artifact["config"]

# Compute mean-pooled embedding for a document's first N substantive pages
# (OpenAIEmbeddings reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings(model=cfg["embedding_model"])
page_texts = [...]  # placeholder: first ~5-20 substantive (non-boilerplate) page texts
page_vectors = embeddings.embed_documents(page_texts)
doc_vector = np.mean(page_vectors, axis=0).reshape(1, -1)

# Predict P(document is a Declaration of Covenants)
score = model.predict_proba(doc_vector)[0, 1]

# Three-band decision
if score < 0.30:
    decision = "REJECT"  # confident not a Declaration; skip CCR pipeline
elif score >= 0.85:
    decision = "FAST_PASS"  # confident Declaration; bypass agentic validator
else:
    decision = "ESCALATE"  # ambiguous; run agentic detect_ccr
```

### Files in this repo

- `ccr_binary_logreg_tuned.joblib`: pickled dict containing `model` (sklearn LogisticRegression), `threshold` (float, 0.436), and `config` (dict with `embedding_model`, `max_pages_per_doc`, `skip_boilerplate` flags)
- `config.json`: JSON-readable summary of the model configuration

## Training Procedure

### Data preparation

1. Source: `setfit_experiments` PostgreSQL DB. 16,896 unlabeled pages from CCR-tagged documents.
2. Phase 0 corpus relabeling (4 stages):
   - Deterministic signal pass: signature anti-patterns + positive patterns + page-structural heuristics → 4,049 pages auto-labeled (BOILERPLATE / AUXILIARY / DECLARATION-tentative)
   - Opus subagent verification stage 1: 75 batches × 50 pages, prioritized by class-balance need
   - Opus subagent verification stage 2 (deferred-pages pass): 28 batches × 50 pages, all 1,392 deterministic-DECLARATION-tentative pages verified
   - Curation: 7,129 high-confidence pages retained
3. Page → document aggregation: `is_declaration = (count(DECLARATION pages) / count(non-BOILERPLATE pages) >= 0.5)`
4. Per-document substantive-page sampling: up to 20 pages per doc, BOILERPLATE filtered out
5. Mean-pool embeddings of first 5-20 substantive pages → 1536-dim doc vector

### Training

- Stratified split by `document_id`: roughly 70% train / 15% val / 15% test (325 / 75 / 65 docs)
- `LogisticRegression(C=1.0, class_weight="balanced", max_iter=2000, random_state=42)`
- Training time: ~1 second on CPU
- Threshold selection: F1-maximizing on validation set → 0.436
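
The threshold-selection step can be sketched as a sweep over the observed validation scores. This is an illustration, not the training script: the function names, the `score >= threshold` prediction rule, the lowest-threshold tie-break, and the toy data are all assumptions.

```python
from typing import Sequence


def f1_at(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """F1 when a document is predicted positive if its score is >= threshold (assumed rule)."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    fn = sum(1 for y, p in zip(y_true, y_prob) if p < threshold and y == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Pick the observed score that maximizes F1; ties go to the lowest threshold (assumption)."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: (f1_at(y_true, y_prob, t), -t))


# Toy validation scores, not the real validation set.
y_true = [0, 0, 1, 0, 1, 1]
y_prob = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
best = best_f1_threshold(y_true, y_prob)
print(f"{best:.2f} {f1_at(y_true, y_prob, best):.3f}")  # 0.40 0.857
```

Selecting the threshold on the validation split keeps the test-set metrics an honest estimate at that operating point.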

### Reproducibility

Training script: https://github.com/governingdocs/backend (path: `experiments/setfit_ccr_binary/scripts/train_tier1_v2.py`)

Phase 0 relabeling pipeline: same repo, `experiments/setfit_ccr_binary/scripts/opus_verify.py` and `experiments/setfit_ccr_binary/scripts/heuristics/`

Findings document with full Phase 0 + Phase 1 results: `experiments/setfit_ccr_binary/PHASE1_TIER1_FINDINGS.md`

## Citation

If you use this model in research or production, please cite:

```
@misc{ccr-binary-logreg-2026,
  title = {CCR Binary Classifier: Document-Level Detection of Declarations of Covenants, Conditions, and Restrictions},
  author = {GoverningDocs Engineering},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/GoverningDocs/ccr-binary-logreg}}
}
```

## Versioning

Model artifacts are versioned via HuggingFace commit history. `config.json` includes the corpus snapshot commit hash for reproducibility.

## Maintenance

This model is part of the T18 plan (CCR Upstream Input Hardening) in the GoverningDocs platform. See `plans/T18_CCR_UPSTREAM_INPUT_HARDENING_PLAN.md` (v2.1.1) in the product repo for design rationale, alternatives considered (page-classifier retrain, agentic-only, signature patterns), and Phase 2 wire-in plans.