---
language: en
license: apache-2.0
library_name: transformers
base_model: KISTI-AI/Scideberta-full
tags:
- text-classification
- span-classification
- discourse
- rhetorical-role
- academic-text
- scientific-text
- onnx
- int8
- quantized
pipeline_tag: text-classification
---

# Span Role Classifier v10 (ONNX INT8)

A 12-class text classifier that assigns a **discourse / rhetorical role** to a span of academic text. Fine-tuned from [`KISTI-AI/Scideberta-full`](https://huggingface.co/KISTI-AI/Scideberta-full) and dynamically quantized to INT8 with ONNX Runtime, yielding a ~3x smaller model whose predictions are identical to FP32 on CPU.

## Labels

| id | label | description |
|---|---|---|
| 0 | `background_context` | prior work, setting, motivation |
| 1 | `definition` | formal definition of a term/concept |
| 2 | `fact_property` | factual statement or inherent property |
| 3 | `classification` | taxonomy / type grouping |
| 4 | `cause_mechanism` | how/why something happens |
| 5 | `compare_contrast` | comparison between two things |
| 6 | `procedure_step` | step in a procedure or method |
| 7 | `worked_example` | worked calculation/derivation |
| 8 | `claim_conclusion` | claim or inference |
| 9 | `evidence_result` | empirical data or experimental result |
| 10 | `condition_exception` | precondition, hypothesis, or limit of validity |
| 11 | `counterexample_misconception` | refutation or debunked belief |

## Validation performance (macro F1 = 0.714)

Evaluated on a stratified 10% held-out split of the 28,398 LLM-relabeled academic spans (24 academic domains).

| class | F1 |
|---|---|
| procedure_step | 0.812 |
| condition_exception | 0.788 |
| evidence_result | 0.776 |
| definition | 0.759 |
| classification | 0.755 |
| worked_example | 0.745 |
| cause_mechanism | 0.711 |
| background_context | 0.696 |
| compare_contrast | 0.676 |
| claim_conclusion | 0.642 |
| counterexample_misconception | 0.637 |
| fact_property | 0.577 |
| **macro F1** | **0.714** |
| val accuracy | 0.706 |

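Macro F1 is the unweighted mean of the twelve per-class F1 scores, so every class counts equally regardless of support. A quick arithmetic check against the table above:

```python
# Per-class validation F1 scores, copied from the table above.
f1 = {
    "procedure_step": 0.812, "condition_exception": 0.788, "evidence_result": 0.776,
    "definition": 0.759, "classification": 0.755, "worked_example": 0.745,
    "cause_mechanism": 0.711, "background_context": 0.696, "compare_contrast": 0.676,
    "claim_conclusion": 0.642, "counterexample_misconception": 0.637, "fact_property": 0.577,
}

# Macro F1 = unweighted mean over classes.
macro_f1 = sum(f1.values()) / len(f1)
print(round(macro_f1, 4))  # 0.7145, reported above as 0.714
```
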
Training progression (macro F1 improved every epoch; no regression under the anti-overfit configuration described below):

| epoch | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| macro F1 | 0.635 | 0.666 | 0.699 | 0.704 | 0.707 | **0.714** |

## Quantization

| | FP32 PyTorch | FP32 ONNX | **INT8 ONNX (this file)** |
|---|---|---|---|
| file size | 738 MB | 739 MB | **244 MB** |
| compression vs FP32 | 1.00x | 1.00x | **3.03x** |
| CPU latency (batch=1, max_len=128) | ~60 ms | ~60 ms | ~60 ms |
| macro F1 | 0.714 | 0.714 (identical) | **0.714 (identical)** |
| max logit diff vs FP32 | 0 | 0 | 0.20 |
| live-test agreement with FP32 | — | 100% | **100%** |

INT8 predictions match FP32 on every sample in the held-out live-test set: despite a maximum logit difference of 0.20, the argmax never flips, so the quantization is lossless for classification purposes.

## Usage

### With ONNX Runtime (recommended for production)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

MODEL_DIR = "span-role-classifier-v10-int8-onnx"
LABELS = [
    "background_context", "definition", "fact_property", "classification",
    "cause_mechanism", "compare_contrast", "procedure_step", "worked_example",
    "claim_conclusion", "evidence_result", "condition_exception", "counterexample_misconception",
]

tok = AutoTokenizer.from_pretrained(MODEL_DIR)
sess = ort.InferenceSession(f"{MODEL_DIR}/model.onnx", providers=["CPUExecutionProvider"])

def classify(text: str) -> dict:
    enc = tok(text, return_tensors="np", truncation=True, max_length=512, padding=True)
    inputs = {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    }
    logits = sess.run(None, inputs)[0][0]
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    idx = int(probs.argmax())
    return {"label": LABELS[idx], "confidence": float(probs[idx])}

print(classify("The central limit theorem applies only when observations are independent and the population variance is finite."))
# -> {'label': 'condition_exception', 'confidence': 0.99}
```

### With Hugging Face Optimum

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

tok = AutoTokenizer.from_pretrained("span-role-classifier-v10-int8-onnx")
model = ORTModelForSequenceClassification.from_pretrained(
    "span-role-classifier-v10-int8-onnx", file_name="model.onnx"
)
pipe = pipeline("text-classification", model=model, tokenizer=tok, top_k=None)
print(pipe("A common misconception holds that humans evolved from modern chimpanzees."))
```

## Training details

- **Base model:** `KISTI-AI/Scideberta-full` (DeBERTa-v3 pretrained on scientific text)
- **Dataset:** 28,398 academic spans across 24 academic domains (Biology, Physics, Mathematics, Medicine, Philosophy, Computer Science, Law, History, etc.), all labels LLM-relabeled for quality
- **Anti-overfit config:**
  - LR 1.5e-5 with linear warmup (15%) + decay
  - Weight decay 0.02
  - Classifier + pooler dropout 0.2
  - Label smoothing 0.05
  - Inverse-frequency class weights on CrossEntropyLoss
  - Early stopping patience 2 on macro F1
  - Batch size 32, max 6 epochs (all 6 used; macro F1 never regressed)
- **Hardware:** RTX 5090, ~4 hours wall time
- **Quantization:** ONNX Runtime dynamic quantization (INT8 weights for MatMul + embeddings; activations in FP32)

## Limitations

- Labels are LLM-relabeled (Claude Sonnet 4.6), not human-annotated; F1 measured against human gold labels would likely be a few points lower (estimated ~0.65-0.68).
- Trained on academic English only; performance on other domains (news, fiction, social media) is untested and likely lower.
- The `fact_property` class acts as a semantic catch-all that overlaps with `background_context`, `definition`, and `cause_mechanism`; its F1 is the lowest, and many of its errors are defensible rubric edge cases rather than outright mistakes.
- The model classifies one span at a time; it does not segment long documents, so you must supply pre-segmented input (typically 1-3 sentence chunks).

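Because the model expects pre-segmented input, longer documents need a chunking step first. A minimal sketch of one (a naive regex splitter; the `segment` helper is hypothetical, and a production pipeline should use a proper sentence tokenizer such as nltk, spaCy, or pysbd):

```python
import re

def segment(text: str, max_sents: int = 3) -> list[str]:
    """Split text into spans of at most `max_sents` sentences.

    Naive split on . ! ? followed by whitespace and an uppercase letter;
    it will mis-split abbreviations like "e.g. The", so treat this as a
    placeholder for a real sentence tokenizer.
    """
    sents = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [" ".join(sents[i:i + max_sents]) for i in range(0, len(sents), max_sents)]
```

Each returned chunk can then be passed to the `classify` function from the usage example above.
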

## License

Apache 2.0 (same as the base model `KISTI-AI/Scideberta-full`).

## Citation

If you use this model, please also cite the base SciDeBERTa paper:

```
@article{Jeong2022SciDeBERTa,
  title={SciDeBERTa: Learning DeBERTa for Scientific Domain},
  author={Jeong, Yeon-Ju and Kim, Eunhui},
  journal={IEEE Access},
  year={2022}
}
```