---
tags:
- deberta-v3
- cross-encoder
- osmosis
- response-sufficiency
- binary-classification
language: en
license: mit
base_model: MoritzLaurer/deberta-v3-base-zeroshot-v2.0
datasets:
- KingTechnician/yahoo-answers-osmosis-labeled
- KingTechnician/triage-synthetic-data-v1
---

# OSMoSIS Binary Cross-Encoder

DeBERTa-v3 cross-encoder for binary response-sufficiency classification.
Given an `(objective, response)` pair, it predicts ADDR (the response addresses
the objective) or NOADDR (the response does not).

## Intended use

First stage of a cascaded pipeline. Confident binary predictions are used
directly; low-confidence cases should be routed to an LLM judge for fine-grained
classification.

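
The routing rule described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the `route` helper and the 0.85 threshold are hypothetical, the example logits are made up, and the `["ADDR", "NOADDR"]` index order is taken from the usage snippet in this card. It uses plain Python so it runs without loading the model.

```python
import math

LABELS = ["ADDR", "NOADDR"]  # index order as used in this card's usage snippet


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, threshold=0.85):
    """Return (label, confidence, destination) for one pair of logits.

    `threshold` is a hypothetical value; it should be tuned on held-out data.
    """
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[idx]
    destination = "accept" if confidence >= threshold else "llm_judge"
    return LABELS[idx], confidence, destination


# Example logits (made up for illustration, not real model output)
print(route([3.0, -1.0]))  # clear margin: the binary prediction is accepted
print(route([0.2, 0.1]))   # near-tie: the case is sent to the LLM judge
```

In practice the threshold trades off how many cases reach the LLM judge against how many binary errors pass through unreviewed, so it is best chosen against the human-reviewed labels rather than fixed in advance.
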

Trained on Sonnet-4.6-generated labels (flat prompt, echo-stripped responses),
validated against 254 human-reviewed labels for deployment-grade evaluation.

## Performance

### Yahoo within-domain test

Accuracy: **0.806** | Macro F1: **0.669**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ADDR | 0.857 | 0.909 | 0.882 | 798 |
| NOADDR | 0.526 | 0.401 | 0.455 | 202 |

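
As a sanity check on how the summary figures relate to the table, the sketch below recomputes per-class F1, macro F1, and accuracy from the reported precision, recall, and support. It assumes the standard definitions (F1 as the harmonic mean of precision and recall, macro F1 as the unweighted mean over classes, accuracy as the support-weighted mean of recall); the variable names are illustrative.

```python
# Reported per-class metrics for the Yahoo within-domain test:
# class -> (precision, recall, support)
rows = {
    "ADDR": (0.857, 0.909, 798),
    "NOADDR": (0.526, 0.401, 202),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {name: f1(p, r) for name, (p, r, _) in rows.items()}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Recall times support approximates the number of correct predictions
# per class, so accuracy is the support-weighted mean of recall.
total = sum(n for _, _, n in rows.values())
accuracy = sum(r * n for _, r, n in rows.values()) / total

print({name: round(value, 3) for name, value in per_class_f1.items()})
print(round(macro_f1, 3), round(accuracy, 3))
```

The gap between accuracy and macro F1 comes from the weaker NOADDR class: macro F1 weights both classes equally, while accuracy is dominated by the larger ADDR class.
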

### Triage synthetic held-out (architecture validation)

Accuracy: **0.995** | Macro F1: **0.994**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ADDR | 1.000 | 0.987 | 0.993 | 150 |
| NOADDR | 0.991 | 1.000 | 0.996 | 223 |

### Human gold-standard held-out

Accuracy: **0.776** | Macro F1: **0.680**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ADDR | 0.824 | 0.889 | 0.855 | 189 |
| NOADDR | 0.580 | 0.446 | 0.504 | 65 |

## Training

- Base: `MoritzLaurer/deberta-v3-base-zeroshot-v2.0` (NLI-pretrained)
- Data: Yahoo Answers (echo-stripped) + Triage synthetic, joint training
- Best epoch: 3 (selected by val macro-F1)
- Batch size: 16, max length: 512, LR: 2e-05
- Class weights: [0.724, 1.615]
- Early stopping: patience 3 on val macro-F1

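
The card does not say how the class weights were derived. One common choice is the "balanced" heuristic `w_c = N / (K * n_c)`, where `N` is the number of training examples, `K` the number of classes, and `n_c` the count of class `c`. The sketch below inverts that formula to see whether the reported weights are consistent with it. Both the heuristic and the `[ADDR, NOADDR]` ordering are assumptions, not something this card states.

```python
# Reported class weights; the [ADDR, NOADDR] order is an assumption.
class_weights = [0.724, 1.615]
num_classes = len(class_weights)

# Under the balanced heuristic w_c = N / (K * n_c), each class share is
# n_c / N = 1 / (K * w_c). If the heuristic was used, the shares sum to ~1.
implied_shares = [1.0 / (num_classes * w) for w in class_weights]

print([round(share, 3) for share in implied_shares])
print(round(sum(implied_shares), 3))
```

The implied shares come out near 0.69 and 0.31 and sum to roughly 1, which is what the balanced heuristic would produce for a training set that is about 69% one class. That is only a consistency check; it does not confirm how the weights were actually computed.
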

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "KingTechnician/osmosis-crossencoder-binary"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Encode the (objective, response) pair as a single cross-encoder input
inputs = tokenizer(
    "What causes rain?",
    "Rain forms when water vapor condenses into droplets.",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():  # inference only, no gradients needed
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(["ADDR", "NOADDR"][pred])  # index 0 = ADDR, index 1 = NOADDR
```

## Limitations

- The NOADDR class is heterogeneous: on-topic-but-not-answering, tangential, and
  off-topic responses all map to the same target. Sub-classifying NOADDR requires
  a stronger model; see the cascade evaluation in the OSMoSIS repo.
- The near-ceiling synthetic Triage results validate the architecture but are not
  representative of open-domain difficulty. Use the human held-out numbers as the
  realistic deployment estimate.