added repro
Browse files
README.md
CHANGED
|
@@ -36,9 +36,9 @@ ProBERT classifies text into three patterns:
|
|
| 36 |
|
| 37 |
Use it to flag risky language in LLM outputs, documentation, support tickets, or any text where confident assertions without reasoning could cause problems.
|
| 38 |
|
| 39 |
-
**Trained on just 450 examples (150 per class).** Achieves 95.6% accuracy and
|
| 40 |
|
| 41 |
-
**Why safety teams care:** When you evaluate ProBERT itself under perturbation testing (the Collapse Index protocol), it exhibits **zero Type I errors**—predictions that are stable, confident, and wrong. Most models have 5-15% Type I errors. ProBERT: 0. This
|
| 42 |
|
| 43 |
---
|
| 44 |
|
|
@@ -92,15 +92,20 @@ A 66M-parameter DistilBERT specialist trained to detect rhetorical overconfidenc
|
|
| 92 |
| Collapse Index (CI) — Behavioral Stability | 0.003 |
|
| 93 |
| Structural Retention (SRI) — Decision Coherence | 0.997 |
|
| 94 |
| Type I Errors (Stable + Confident + Wrong) | 0 |
|
|
|
|
|
|
|
|
|
|
| 95 |
|
| 96 |
-
###
|
| 97 |
|
| 98 |
**The Question:** Is ProBERT just a renamed DistilBERT, or did training actually matter?
|
| 99 |
|
| 100 |
-
**The Test:** ProBERT trained on **450 synthetic examples** vs. vanilla DistilBERT with a **random 3-class classification head** (untrained baseline). Both tested on three real-world datasets they never saw during training (zero-shot, no fine-tuning)
|
| 101 |
|
| 102 |
-
|
| 103 |
-
|
|
|
|
|
|
|
| 104 |
| **Python Code** | Clear technical | 0.744 | 0.359 | **94%** | 2x confidence boost - Base has weak signal, ProBERT makes it decisive |
|
| 105 |
| **Dolly-15k** | Mixed instructions | 0.413 | 0.361 | **43%** | Pattern recognition - Training teaches structure on general content |
|
| 106 |
| **Yelp Reviews** | Ambiguous narrative | 0.412 | 0.356 | **16%** | Essential learning - Base completely lost, ProBERT learned the pattern |
|
|
@@ -110,10 +115,10 @@ A 66M-parameter DistilBERT specialist trained to detect rhetorical overconfidenc
|
|
| 110 |
**Training matters MORE as content gets more ambiguous:**
|
| 111 |
- **Clear signal (Python code):** Base model's embeddings capture some structure (94% agreement), but ProBERT doubles confidence (0.74 vs 0.36) and eliminates confusion
|
| 112 |
- **Mixed content (Dolly-15k):** Moderate disagreement (43%) shows training teaches pattern recognition beyond embeddings alone
|
| 113 |
-
- **Ambiguous narratives (Yelp):** Massive disagreement (16%)
|
| 114 |
|
| 115 |
Key Findings:
|
| 116 |
-
1. **ProBERT is demonstrably different from base DistilBERT** - This isn't a renamed model. 450 synthetic examples
|
| 117 |
2. **Insane data efficiency** - From 450 training examples to 84% disagreement with base model on Yelp (16% agreement = base is clueless, ProBERT learned the pattern)
|
| 118 |
3. **Self-calibrating confidence** - High confidence (0.74) on clear signals, low confidence (0.40) on ambiguous data, no retraining required
|
| 119 |
4. **Training impact scales with ambiguity** - On content where base models fail (16% agreement), ProBERT's training made the difference
|
|
@@ -138,7 +143,34 @@ Key Findings:
|
|
| 138 |
|
| 139 |
- **Type I Errors**: Predictions that are stable (low CI), confident (high probability), but **wrong**. These are dangerous because they look like correct predictions behaviorally. Most models have 5-15% Type I errors. **ProBERT: 0**.
|
| 140 |
|
| 141 |
-
**
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 142 |
|
| 143 |
**Evaluation Transparency:**
|
| 144 |
|
|
|
|
| 36 |
|
| 37 |
Use it to flag risky language in LLM outputs, documentation, support tickets, or any text where confident assertions without reasoning could cause problems.
|
| 38 |
|
| 39 |
+
**Trained on just 450 examples (150 per class).** Achieves 95.6% accuracy and shows strong transfer signals to real-world domains it never saw during training. When tested on Yelp reviews, ProBERT and untrained base DistilBERT disagree 84% of the time—suggesting the training added real capability, not just noise.
|
| 40 |
|
| 41 |
+
**Why safety teams care:** When you evaluate ProBERT itself under perturbation testing (the Collapse Index protocol), it exhibits **zero Type I errors**—predictions that are stable, confident, and wrong. Most models have 5-15% Type I errors. ProBERT: 0. Additionally, ProBERT is **underconfident by design**: 98.4% accuracy but only 72.5% mean confidence (26% confidence deficit). This conservative calibration means it doubts itself when right rather than being cocky when wrong—the safe direction for production deployments.
|
| 42 |
|
| 43 |
---
|
| 44 |
|
|
|
|
| 92 |
| Collapse Index (CI) — Behavioral Stability | 0.003 |
|
| 93 |
| Structural Retention (SRI) — Decision Coherence | 0.997 |
|
| 94 |
| Type I Errors (Stable + Confident + Wrong) | 0 |
|
| 95 |
+
| **Calibration (ECE)** | **0.263** |
|
| 96 |
+
| **Confidence on Errors** | **0.673 max, 0.0% ≥ 0.8** |
|
| 97 |
+
| **Accuracy vs Mean Confidence** | **98.4% acc, 72.5% conf (-26% gap)** |
|
| 98 |
|
| 99 |
+
### Training Impact Demonstration (Unlabeled Transfer)
|
| 100 |
|
| 101 |
**The Question:** Is ProBERT just a renamed DistilBERT, or did training actually matter?
|
| 102 |
|
| 103 |
+
**The Test:** ProBERT trained on **450 synthetic examples** vs. vanilla DistilBERT with a **random 3-class classification head** (untrained baseline). Both tested on three real-world datasets they never saw during training (zero-shot, no fine-tuning).
|
| 104 |
|
| 105 |
+
**Important:** These datasets are unlabeled for this 3-class taxonomy (process_clarity, rhetorical_confidence, scope_blur). We report confidence and agreement as behavioral transfer signals, not accuracy. No ground-truth labels exist, so no accuracy/F1/ECE can be computed per domain. Calibration (ECE) and accuracy are measured on the labeled 450-sample test split.
|
| 106 |
+
|
| 107 |
+
| Dataset | Domain | Mean max-prob (ProBERT, unlabeled) | Mean max-prob (Base, unlabeled) | Top-1 agreement (unlabeled) | Training Impact |
|
| 108 |
+
|---------|--------|-------------------------------------|-------------------------------------|------------------------------|-----------------|
|
| 109 |
| **Python Code** | Clear technical | 0.744 | 0.359 | **94%** | 2x confidence boost - Base has weak signal, ProBERT makes it decisive |
|
| 110 |
| **Dolly-15k** | Mixed instructions | 0.413 | 0.361 | **43%** | Pattern recognition - Training teaches structure on general content |
|
| 111 |
| **Yelp Reviews** | Ambiguous narrative | 0.412 | 0.356 | **16%** | Essential learning - Base completely lost, ProBERT learned the pattern |
|
|
|
|
| 115 |
**Training matters MORE as content gets more ambiguous:**
|
| 116 |
- **Clear signal (Python code):** Base model's embeddings capture some structure (94% agreement), but ProBERT doubles confidence (0.74 vs 0.36) and eliminates confusion
|
| 117 |
- **Mixed content (Dolly-15k):** Moderate disagreement (43%) shows training teaches pattern recognition beyond embeddings alone
|
| 118 |
+
- **Ambiguous narratives (Yelp):** Massive disagreement (16%) suggests training essential - base model predicts randomly, ProBERT learned scope_blur pattern
|
| 119 |
|
| 120 |
Key Findings:
|
| 121 |
+
1. **ProBERT is demonstrably different from base DistilBERT** - This isn't a renamed model. 450 synthetic examples produced strong behavioral transfer to completely unseen real-world domains
|
| 122 |
2. **Insane data efficiency** - From 450 training examples to 84% disagreement with base model on Yelp (16% agreement = base is clueless, ProBERT learned the pattern)
|
| 123 |
3. **Self-calibrating confidence** - High confidence (0.74) on clear signals, low confidence (0.40) on ambiguous data, no retraining required
|
| 124 |
4. **Training impact scales with ambiguity** - On content where base models fail (16% agreement), ProBERT's training made the difference
|
|
|
|
| 143 |
|
| 144 |
- **Type I Errors**: Predictions that are stable (low CI), confident (high probability), but **wrong**. These are dangerous because they look like correct predictions behaviorally. Most models have 5-15% Type I errors. **ProBERT: 0**.
|
| 145 |
|
| 146 |
+
**Calibration Metrics (Post-Training Validation):**
|
| 147 |
+
|
| 148 |
+
- **ECE (Expected Calibration Error)**: Measures alignment between confidence and accuracy. Range 0-1, lower is better.
|
| 149 |
+
- ECE ≤ 0.05 = Well-calibrated ✅
|
| 150 |
+
- ECE > 0.15 = Miscalibrated ⚠️
|
| 151 |
+
- **ProBERT: 0.263** (high, but from underconfidence—see below)
|
| 152 |
+
|
| 153 |
+
- **Confidence Gap**: Accuracy minus mean confidence. Positive gap = underconfident (too cautious), negative gap = overconfident (too bold).
|
| 154 |
+
- **ProBERT: +26%** (98.4% accuracy, 72.5% mean confidence)
|
| 155 |
+
- Model doubts itself when RIGHT, not cocky when wrong
|
| 156 |
+
|
| 157 |
+
- **High-Confidence Error Rate**: What % of high-confidence predictions (≥0.8) are wrong?
|
| 158 |
+
- Most models: 3-10% errors at high confidence
|
| 159 |
+
- **ProBERT: 0%** (perfect separation—all 7 errors below 0.7 confidence)
|
| 160 |
+
|
| 161 |
+
**What this means:** ProBERT doesn't just predict accurately, it predicts *consistently and coherently* across different wordings of the same input. The high ECE comes from **conservative calibration**—ProBERT is less confident than it should be, which makes it safer for production (won't confidently output wrong answers). When combined with perturbation testing, you get a complete picture of model reliability.
|
| 162 |
+
|
| 163 |
+
**Transparency Note:** Current calibration metrics measured on synthetic test set (450 samples). For production use cases requiring OOD calibration validation, we recommend evaluating on domain-specific held-out data to confirm conservative calibration holds.
|
| 164 |
+
|
| 165 |
+
**Reproducibility:**
|
| 166 |
+
|
| 167 |
+
All calibration metrics can be reproduced using the included evaluation script:
|
| 168 |
+
|
| 169 |
+
```bash
|
| 170 |
+
python eval_calibration.py --probert
|
| 171 |
+
```
|
| 172 |
+
|
| 173 |
+
This command auto-detects the ProBERT model and latest predictions CSV (`probert_training_20260131_004706.csv`), then computes ECE, confidence gaps, and high-confidence error rates. The script is included in the model repository for full transparency.
|
| 174 |
|
| 175 |
**Evaluation Transparency:**
|
| 176 |
|