collapseindex committed on
Commit ef045e6 · verified · 1 Parent(s): 7f76451

added repro

Files changed (1)
  1. README.md +41 -9
README.md CHANGED
@@ -36,9 +36,9 @@ ProBERT classifies text into three patterns:
 
 Use it to flag risky language in LLM outputs, documentation, support tickets, or any text where confident assertions without reasoning could cause problems.
 
- **Trained on just 450 examples (150 per class).** Achieves 95.6% accuracy and generalizes perfectly to real-world domains it never saw during training. When tested on Yelp reviews, ProBERT and untrained base DistilBERT disagree 84% of the time—proving the training added real capability, not just noise.
 
- **Why safety teams care:** When you evaluate ProBERT itself under perturbation testing (the Collapse Index protocol), it exhibits **zero Type I errors**—predictions that are stable, confident, and wrong. Most models have 5-15% Type I errors. ProBERT: 0. This makes it a reliable signal for downstream safety systems.
 
 ---
 
 
@@ -92,15 +92,20 @@ A 66M-parameter DistilBERT specialist trained to detect rhetorical overconfidence
 | Collapse Index (CI) — Behavioral Stability | 0.003 |
 | Structural Retention (SRI) — Decision Coherence | 0.997 |
 | Type I Errors (Stable + Confident + Wrong) | 0 |
 
- ### Baseline Comparison: ProBERT vs. Vanilla DistilBERT
 
 **The Question:** Is ProBERT just a renamed DistilBERT, or did training actually matter?
 
- **The Test:** ProBERT trained on **450 synthetic examples** vs. vanilla DistilBERT with a **random 3-class classification head** (untrained baseline). Both tested on three real-world datasets they never saw during training (zero-shot, no fine-tuning):
 
- | Dataset | Domain | ProBERT Conf | Base Conf | Agreement | Training Impact |
- |---------|--------|--------------|-----------|-----------|-----------------|
 | **Python Code** | Clear technical | 0.744 | 0.359 | **94%** | 2x confidence boost - Base has weak signal, ProBERT makes it decisive |
 | **Dolly-15k** | Mixed instructions | 0.413 | 0.361 | **43%** | Pattern recognition - Training teaches structure on general content |
 | **Yelp Reviews** | Ambiguous narrative | 0.412 | 0.356 | **16%** | Essential learning - Base completely lost, ProBERT learned the pattern |
@@ -110,10 +115,10 @@ A 66M-parameter DistilBERT specialist trained to detect rhetorical overconfidence
 **Training matters MORE as content gets more ambiguous:**
 - **Clear signal (Python code):** Base model's embeddings capture some structure (94% agreement), but ProBERT doubles confidence (0.74 vs 0.36) and eliminates confusion
 - **Mixed content (Dolly-15k):** Moderate disagreement (43%) shows training teaches pattern recognition beyond embeddings alone
- - **Ambiguous narratives (Yelp):** Massive disagreement (16%) proves training essential - base model predicts randomly, ProBERT learned scope_blur pattern
 
 Key Findings:
- 1. **ProBERT is demonstrably different from base DistilBERT** - This isn't a renamed model. 450 synthetic examples generalized perfectly to completely unseen real-world domains
 2. **Insane data efficiency** - From 450 training examples to 84% disagreement with base model on Yelp (16% agreement = base is clueless, ProBERT learned the pattern)
 3. **Self-calibrating confidence** - High confidence (0.74) on clear signals, low confidence (0.40) on ambiguous data, no retraining required
 4. **Training impact scales with ambiguity** - On content where base models fail (16% agreement), ProBERT's training made the difference
@@ -138,7 +143,34 @@ Key Findings:
 
 - **Type I Errors**: Predictions that are stable (low CI), confident (high probability), but **wrong**. These are dangerous because they look like correct predictions behaviorally. Most models have 5-15% Type I errors. **ProBERT: 0**.
 
- **What this means:** ProBERT doesn't just predict accurately, it predicts *consistently and coherently* across different wordings of the same input. When combined with perturbation testing, you get a complete picture of model reliability.
 
 **Evaluation Transparency:**
 
 
 
 Use it to flag risky language in LLM outputs, documentation, support tickets, or any text where confident assertions without reasoning could cause problems.
 
+ **Trained on just 450 examples (150 per class).** Achieves 95.6% accuracy and shows strong transfer signals to real-world domains it never saw during training. When tested on Yelp reviews, ProBERT and untrained base DistilBERT disagree 84% of the time—suggesting the training added real capability, not just noise.
 
+ **Why safety teams care:** When you evaluate ProBERT itself under perturbation testing (the Collapse Index protocol), it exhibits **zero Type I errors**—predictions that are stable, confident, and wrong. Most models have 5-15% Type I errors. ProBERT: 0. Additionally, ProBERT is **underconfident by design**: 98.4% accuracy but only 72.5% mean confidence (a 26-point confidence deficit). This conservative calibration means it doubts itself when right rather than being cocky when wrong—the safe direction for production deployments.
 
 ---
 
 
 | Collapse Index (CI) — Behavioral Stability | 0.003 |
 | Structural Retention (SRI) — Decision Coherence | 0.997 |
 | Type I Errors (Stable + Confident + Wrong) | 0 |
+ | **Calibration (ECE)** | **0.263** |
+ | **Confidence on Errors** | **0.673 max, 0.0% ≥ 0.8** |
+ | **Accuracy vs Mean Confidence** | **98.4% acc, 72.5% conf (+26% gap)** |
 
+ ### Training Impact Demonstration (Unlabeled Transfer)
 
 **The Question:** Is ProBERT just a renamed DistilBERT, or did training actually matter?
 
+ **The Test:** ProBERT trained on **450 synthetic examples** vs. vanilla DistilBERT with a **random 3-class classification head** (untrained baseline). Both tested on three real-world datasets they never saw during training (zero-shot, no fine-tuning).
 
+ **Important:** These datasets are unlabeled for this 3-class taxonomy (process_clarity, rhetorical_confidence, scope_blur). We report confidence and agreement as behavioral transfer signals, not accuracy. No ground-truth labels exist, so no accuracy/F1/ECE can be computed per domain. Calibration (ECE) and accuracy are measured on the labeled 450-sample test split.
+
+ | Dataset | Domain | Mean max-prob (ProBERT, unlabeled) | Mean max-prob (Base, unlabeled) | Top-1 agreement (unlabeled) | Training Impact |
+ |---------|--------|------------------------------------|---------------------------------|-----------------------------|-----------------|
 | **Python Code** | Clear technical | 0.744 | 0.359 | **94%** | 2x confidence boost - Base has weak signal, ProBERT makes it decisive |
 | **Dolly-15k** | Mixed instructions | 0.413 | 0.361 | **43%** | Pattern recognition - Training teaches structure on general content |
 | **Yelp Reviews** | Ambiguous narrative | 0.412 | 0.356 | **16%** | Essential learning - Base completely lost, ProBERT learned the pattern |
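Both unlabeled columns are simple statistics over the two models' output distributions. A minimal sketch, with dummy logits standing in for real model outputs (no labels required):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def transfer_signals(logits_a, logits_b):
    """Mean max-prob for each model, and top-1 agreement between them."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    mean_conf_a = float(pa.max(axis=1).mean())
    mean_conf_b = float(pb.max(axis=1).mean())
    agreement = float((pa.argmax(axis=1) == pb.argmax(axis=1)).mean())
    return mean_conf_a, mean_conf_b, agreement

# Dummy logits: 4 texts x 3 classes; "probert" is decisive, "base" near-uniform
probert_logits = np.array([[2.0, 0.1, 0.1], [0.2, 1.8, 0.0],
                           [1.5, 0.2, 0.1], [0.1, 0.1, 2.2]])
base_logits = np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3],
                        [0.3, 0.3, 0.4], [0.4, 0.3, 0.3]])
conf_a, conf_b, agree = transfer_signals(probert_logits, base_logits)
# agree == 0.5: the two models pick the same class on 2 of the 4 texts
```

A near-uniform max-prob (≈0.36 for 3 classes) combined with low agreement is exactly the "base is guessing" pattern the table reports.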
 
 **Training matters MORE as content gets more ambiguous:**
 - **Clear signal (Python code):** Base model's embeddings capture some structure (94% agreement), but ProBERT doubles confidence (0.74 vs 0.36) and eliminates confusion
 - **Mixed content (Dolly-15k):** Moderate disagreement (43%) shows training teaches pattern recognition beyond embeddings alone
+ - **Ambiguous narratives (Yelp):** Massive disagreement (only 16% agreement) suggests training is essential - base model predicts near-randomly, ProBERT learned the scope_blur pattern
 
 Key Findings:
+ 1. **ProBERT is demonstrably different from base DistilBERT** - This isn't a renamed model. 450 synthetic examples produced strong behavioral transfer to completely unseen real-world domains
 2. **Insane data efficiency** - From 450 training examples to 84% disagreement with base model on Yelp (16% agreement = base is clueless, ProBERT learned the pattern)
 3. **Self-calibrating confidence** - High confidence (0.74) on clear signals, low confidence (0.40) on ambiguous data, no retraining required
 4. **Training impact scales with ambiguity** - On content where base models fail (16% agreement), ProBERT's training made the difference
 
 
 - **Type I Errors**: Predictions that are stable (low CI), confident (high probability), but **wrong**. These are dangerous because they look like correct predictions behaviorally. Most models have 5-15% Type I errors. **ProBERT: 0**.
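The Type I check reduces to a boolean filter over per-example metrics. A minimal sketch (the 0.05 CI and 0.8 confidence cutoffs are illustrative assumptions, not the protocol's published thresholds):

```python
import numpy as np

def type_i_errors(ci, conf, correct, ci_max=0.05, conf_min=0.8):
    """Count predictions that are stable (low CI), confident, and wrong."""
    ci = np.asarray(ci, dtype=float)
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    mask = (ci < ci_max) & (conf >= conf_min) & ~correct
    return int(mask.sum())

# Illustrative per-example metrics: only the first prediction is
# stable AND confident AND wrong, so it is the lone Type I error
n = type_i_errors(ci=[0.01, 0.30, 0.02],
                  conf=[0.95, 0.90, 0.60],
                  correct=[False, True, True])
# n == 1
```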
 
+ **Calibration Metrics (Post-Training Validation):**
+
+ - **ECE (Expected Calibration Error)**: Measures alignment between confidence and accuracy. Range 0-1, lower is better.
+   - ECE ≤ 0.05 = well-calibrated ✅
+   - ECE > 0.15 = miscalibrated ⚠️
+   - **ProBERT: 0.263** (high, but from underconfidence—see below)
+
+ - **Confidence Gap**: Accuracy minus mean confidence. Positive gap = underconfident (too cautious), negative gap = overconfident (too bold).
+   - **ProBERT: +26%** (98.4% accuracy, 72.5% mean confidence)
+   - The model doubts itself when right rather than being cocky when wrong
+
+ - **High-Confidence Error Rate**: What % of high-confidence predictions (≥0.8) are wrong?
+   - Most models: 3-10% errors at high confidence
+   - **ProBERT: 0%** (perfect separation—all 7 errors below 0.7 confidence)
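All three metrics fall out of per-example (confidence, correctness) pairs. A minimal sketch of the standard definitions (equal-width 10-bin ECE is an assumption here; `eval_calibration.py` may bin differently):

```python
import numpy as np

def calibration_report(conf, correct, n_bins=10, hi_thresh=0.8):
    """ECE over equal-width bins, confidence gap, and high-confidence error rate."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Assign each example to a confidence bin (conf == 1.0 goes in the top bin)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin's |accuracy - confidence| by its share of examples
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    gap = correct.mean() - conf.mean()  # positive = underconfident
    hi = conf >= hi_thresh
    hi_error_rate = float((~correct[hi]).mean()) if hi.any() else 0.0
    return float(ece), float(gap), hi_error_rate

ece, gap, hi_err = calibration_report([0.9, 0.65, 0.6, 0.55],
                                      [True, True, True, False])
# gap > 0 (underconfident); hi_err == 0.0 (the only prediction >= 0.8 is correct)
```

Note how an accurate-but-cautious model can score a high ECE purely from the positive gap, which is the pattern the table above reports.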
+
+ **What this means:** ProBERT doesn't just predict accurately, it predicts *consistently and coherently* across different wordings of the same input. The high ECE comes from **conservative calibration**—ProBERT is less confident than it should be, which makes it safer for production (it won't confidently output wrong answers). When combined with perturbation testing, you get a complete picture of model reliability.
+
+ **Transparency Note:** Current calibration metrics are measured on the synthetic test set (450 samples). For production use cases requiring OOD calibration validation, we recommend evaluating on domain-specific held-out data to confirm that the conservative calibration holds.
+
+ **Reproducibility:**
+
+ All calibration metrics can be reproduced using the included evaluation script:
+
+ ```bash
+ python eval_calibration.py --probert
+ ```
+
+ This command auto-detects the ProBERT model and the latest predictions CSV (`probert_training_20260131_004706.csv`), then computes ECE, confidence gaps, and high-confidence error rates. The script is included in the model repository for full transparency.
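For readers wiring this into their own pipelines, the script's inputs reduce to per-example confidence and correctness. A sketch of loading them from a predictions CSV (the column names below are assumptions for illustration; the repo CSV's actual schema may differ):

```python
import csv
import io

def load_confidence_and_correctness(csv_file):
    """Extract per-example (confidence, correct?) pairs from a predictions CSV."""
    conf, correct = [], []
    for row in csv.DictReader(csv_file):
        conf.append(float(row["confidence"]))
        correct.append(row["predicted_label"] == row["true_label"])
    return conf, correct

# Usage with an in-memory CSV standing in for the real predictions file
demo = io.StringIO(
    "confidence,predicted_label,true_label\n"
    "0.9,scope_blur,scope_blur\n"
    "0.6,process_clarity,scope_blur\n"
)
conf, correct = load_confidence_and_correctness(demo)
# conf == [0.9, 0.6]; correct == [True, False]
```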
 
 **Evaluation Transparency:**