---
language: en
license: apache-2.0
library_name: transformers
base_model: KISTI-AI/Scideberta-full
tags:
- text-classification
- span-classification
- discourse
- rhetorical-role
- academic-text
- scientific-text
- onnx
- int8
- quantized
pipeline_tag: text-classification
---
# Span Role Classifier v10 (ONNX INT8)
A 12-class text classifier that assigns a **discourse / rhetorical role** to a span of academic text. Fine-tuned from [`KISTI-AI/Scideberta-full`](https://huggingface.co/KISTI-AI/Scideberta-full) and dynamically quantized to INT8 via ONNX Runtime for fast CPU inference.
## Labels
| id | label | description |
|---|---|---|
| 0 | `background_context` | prior work, setting, motivation |
| 1 | `definition` | formal definition of a term/concept |
| 2 | `fact_property` | factual statement or inherent property |
| 3 | `classification` | taxonomy / type grouping |
| 4 | `cause_mechanism` | how/why something happens |
| 5 | `compare_contrast` | comparison between two things |
| 6 | `procedure_step` | step in a procedure or method |
| 7 | `worked_example` | worked calculation/derivation |
| 8 | `claim_conclusion` | claim or inference |
| 9 | `evidence_result` | empirical data or experimental result |
| 10 | `condition_exception` | precondition, hypothesis, or limit of validity |
| 11 | `counterexample_misconception` | refutation or debunked belief |
## Validation performance (macro F1 = 0.714)
Evaluated on a 10% stratified held-out split of 28,398 LLM-relabeled academic spans covering 24 domains.
| class | F1 |
|---|---|
| procedure_step | 0.812 |
| condition_exception | 0.788 |
| evidence_result | 0.776 |
| definition | 0.759 |
| classification | 0.755 |
| worked_example | 0.745 |
| cause_mechanism | 0.711 |
| background_context | 0.696 |
| compare_contrast | 0.676 |
| claim_conclusion | 0.642 |
| counterexample_misconception | 0.637 |
| fact_property | 0.577 |
| **macro F1** | **0.714** |
| val accuracy | 0.706 |
Training progression (no epoch-to-epoch regression, thanks to the anti-overfit config):
| epoch | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| macro F1 | 0.635 | 0.666 | 0.699 | 0.704 | 0.707 | **0.714** |
## Quantization
| | FP32 PyTorch | FP32 ONNX | **INT8 ONNX (this file)** |
|---|---|---|---|
| file size | 738 MB | 739 MB | **244 MB** |
| compression vs FP32 | 1.00x | 1.00x | **3.03x** |
| CPU latency (batch=1, max_len=128) | ~60 ms | ~60 ms | ~60 ms |
| macro F1 | 0.714 | 0.714 (identical) | **0.714 (identical)** |
| max logit diff vs FP32 | 0 | 0 | 0.20 |
| live-test agreement with FP32 | — | 100% | **100%** |
INT8 predictions match FP32 on every sample in the held-out live-test set. The quantization is lossless for classification purposes.
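The agreement figure is just the fraction of samples where the two models pick the same argmax label. A minimal sketch of that check, using synthetic logits for illustration (in a real comparison you would run the same batch through both the FP32 and INT8 sessions and compare their outputs):

```python
import numpy as np

def prediction_agreement(logits_fp32: np.ndarray, logits_int8: np.ndarray) -> float:
    """Fraction of samples where FP32 and INT8 logits yield the same argmax label."""
    preds_fp32 = logits_fp32.argmax(axis=-1)
    preds_int8 = logits_int8.argmax(axis=-1)
    return float((preds_fp32 == preds_int8).mean())

# Synthetic stand-in: INT8 logits perturbed by up to 0.20, the max diff reported above.
rng = np.random.default_rng(0)
fp32 = rng.normal(size=(1000, 12))
int8 = fp32 + rng.uniform(-0.20, 0.20, size=fp32.shape)
print(prediction_agreement(fp32, int8))
```

With randomly generated logits the top-2 margin is often small, so this synthetic run will not hit 100%; on the real validation data the margins are wide enough that every prediction survives the 0.20 perturbation.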
## Usage
### With ONNX Runtime (recommended for production)
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_DIR = "span-role-classifier-v10-int8-onnx"
LABELS = [
    "background_context", "definition", "fact_property", "classification",
    "cause_mechanism", "compare_contrast", "procedure_step", "worked_example",
    "claim_conclusion", "evidence_result", "condition_exception", "counterexample_misconception",
]

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
sess = ort.InferenceSession(f"{MODEL_DIR}/model.onnx", providers=["CPUExecutionProvider"])

def classify(text: str) -> dict:
    enc = tok(text, return_tensors="np", truncation=True, max_length=512, padding=True)
    inputs = {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    }
    logits = sess.run(None, inputs)[0][0]
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    idx = int(probs.argmax())
    return {"label": LABELS[idx], "confidence": float(probs[idx])}

print(classify("The central limit theorem applies only when observations are independent and the population variance is finite."))
# -> {'label': 'condition_exception', 'confidence': 0.99}
```
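If you want a full ranking over all 12 roles rather than only the top label, the same logits can be softmaxed and sorted. A small self-contained helper (shown with made-up logits, since the ranking step depends only on NumPy):

```python
import numpy as np

LABELS = [
    "background_context", "definition", "fact_property", "classification",
    "cause_mechanism", "compare_contrast", "procedure_step", "worked_example",
    "claim_conclusion", "evidence_result", "condition_exception", "counterexample_misconception",
]

def rank_roles(logits: np.ndarray) -> list[tuple[str, float]]:
    """Softmax the logits and return (label, probability) pairs, highest first."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = probs.argsort()[::-1]
    return [(LABELS[i], float(probs[i])) for i in order]

# Made-up logits favoring index 6 (procedure_step):
demo = np.zeros(12)
demo[6] = 4.0
print(rank_roles(demo)[0])  # top entry is ('procedure_step', ~0.83)
```

In the ONNX Runtime example above you would pass `logits` from `sess.run` straight into `rank_roles` instead of taking only the argmax.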
### With HuggingFace Optimum
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

tok = AutoTokenizer.from_pretrained("span-role-classifier-v10-int8-onnx")
model = ORTModelForSequenceClassification.from_pretrained(
    "span-role-classifier-v10-int8-onnx", file_name="model.onnx"
)
pipe = pipeline("text-classification", model=model, tokenizer=tok, top_k=None)

print(pipe("A common misconception holds that humans evolved from modern chimpanzees."))
```
## Training details
- **Base model:** `KISTI-AI/Scideberta-full` (DeBERTa-v3 pretrained on scientific text)
- **Dataset:** 28,398 academic spans across 24 academic domains (Biology, Physics, Mathematics, Medicine, Philosophy, Computer Science, Law, History, etc.), all labels LLM-relabeled for quality
- **Anti-overfit config:**
- LR 1.5e-5 with linear warmup (15%) + decay
- Weight decay 0.02
- Classifier + pooler dropout 0.2
- Label smoothing 0.05
- Inverse-frequency class weights on CrossEntropyLoss
- Early stopping patience 2 on macro F1
- Batch size 32, max 6 epochs (used all 6 — never regressed)
- **Hardware:** RTX 5090, ~4 hours wall time
- **Quantization:** ONNX Runtime dynamic quantization (INT8 weights for MatMul + embeddings; activations in FP32)
## Limitations
- Labels are LLM-relabeled (Claude Sonnet 4.6), not human-annotated; F1 against a human-gold standard is likely a few points lower (estimated ~0.65-0.68).
- Trained on academic English only; performance on other domains (news, fiction, social media) is untested and likely lower.
- `fact_property` is a semantic catch-all that overlaps with `background_context`, `definition`, and `cause_mechanism`; its F1 is the lowest, and its errors are often defensible rubric edge cases rather than outright mistakes.
- The model predicts per-span; it does not segment long documents into spans, so you must supply pre-segmented input (typically 1-3 sentence chunks).
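A naive way to produce those 1-3 sentence chunks is a regex split on sentence-ending punctuation. The sketch below is a hypothetical helper, not part of this repo, and it will mis-split on abbreviations; a real pipeline would use a proper sentence segmenter such as spaCy or NLTK:

```python
import re

def chunk_sentences(text: str, sentences_per_chunk: int = 2) -> list[str]:
    """Naively split on sentence-ending punctuation and group into small chunks."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

doc = (
    "Ohm's law states that V = IR. It holds for ohmic conductors. "
    "At very high fields, however, the relation breaks down. "
    "This limit motivates nonlinear device models."
)
for span in chunk_sentences(doc):
    print(span)  # each printed span is one classifier input
```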
## License
Apache 2.0 (follows base model `KISTI-AI/Scideberta-full`).
## Citation
If you use this model, please cite the base SciDeBERTa paper as well:
```
@article{Jeong2022SciDeBERTa,
  title={SciDeBERTa: Learning DeBERTa for Scientific Domain},
  author={Jeong, Yeon-Ju and Kim, Eunhui},
  journal={IEEE Access},
  year={2022}
}
```