	---
	language:
	- en
	library_name: transformers
	license: mit
	pipeline_tag: text-classification
	tags:
	- arxiv
	- scientific-text-classification
	- scibert
	- streamlit-demo
	datasets:
	- librarian-bots/arxiv-metadata-snapshot
	metrics:
	- accuracy
	- f1
	---

	# Article Topic Service SciBERT

	A SciBERT-based text classifier that predicts the topic of a scientific article from its title and abstract.

	## Labels

	- Artificial Intelligence
	- Natural Language Processing
	- Computer Vision
	- Machine Learning
	- Computer Science Theory and Algorithms
	- Mathematics
	- Statistics
	- Electrical Engineering
	- Astrophysics
	- Condensed Matter Physics
	- Quantum Physics
	- Quantitative Biology

	## Dataset

	Balanced 12-class subset built from `librarian-bots/arxiv-metadata-snapshot`.

	- Train: 30,000 examples
	- Validation: 3,600 examples
	- Test: 3,600 examples
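
A balanced split of this kind can be built by sampling a fixed number of examples per class. The sketch below is a minimal illustration, not the pipeline used for this model: the `balanced_split` helper, the `(text, label)` record format, and the seed are all assumptions. If each split is balanced on its own, the sizes above correspond to 2,500 train, 300 validation, and 300 test examples per class.

```python
import random
from collections import defaultdict


def balanced_split(records, per_class, seed=0):
    """Sample a fixed number of examples per label for each split.

    records:   iterable of (text, label) pairs (hypothetical format).
    per_class: dict mapping split name to examples per label,
               e.g. {"train": 2500, "validation": 300, "test": 300}.
    """
    rng = random.Random(seed)

    # Group records by label so each class can be sampled independently.
    by_label = defaultdict(list)
    for text, label in records:
        by_label[label].append((text, label))

    needed = sum(per_class.values())
    splits = {name: [] for name in per_class}
    for label, items in sorted(by_label.items()):
        if len(items) < needed:
            raise ValueError(
                f"label {label!r} has {len(items)} examples, need {needed}"
            )
        rng.shuffle(items)
        # Carve disjoint slices out of the shuffled class pool.
        start = 0
        for name, count in per_class.items():
            splits[name].extend(items[start:start + count])
            start += count

    # Shuffle each split so classes are interleaved.
    for items in splits.values():
        rng.shuffle(items)
    return splits


# Toy demonstration: 3 labels with 10 examples each.
toy = [(f"doc {i} about {label}", label) for label in ("a", "b", "c") for i in range(10)]
toy_splits = balanced_split(toy, {"train": 6, "validation": 2, "test": 2})
print({name: len(items) for name, items in toy_splits.items()})
```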

	## Metrics

	- Validation accuracy: 0.8350
	- Validation macro F1: 0.8351
	- Test accuracy: 0.8356
	- Test macro F1: 0.8351
	- Title-only test accuracy: 0.7522
	- Title-only test macro F1: 0.7495
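
For reference, accuracy is the fraction of correct predictions, and macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally. The sketch below computes both in plain Python. The function names and toy labels are illustrative, and this is not the evaluation script behind the numbers above.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels, not model output.
y_true = ["ML", "ML", "CV", "NLP"]
y_pred = ["ML", "CV", "CV", "NLP"]
print(f"accuracy={accuracy(y_true, y_pred):.4f} macro_f1={macro_f1(y_true, y_pred):.4f}")
```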

	## Usage

	```python
	from transformers import AutoModelForSequenceClassification, AutoTokenizer
	import torch

	# Load the tokenizer and the fine-tuned classifier from the Hub.
	model_id = "Ian-Khalzov/article-topic-service-scibert"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForSequenceClassification.from_pretrained(model_id)

	# Title and abstract are joined into a single input string.
	text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

	# Run the forward pass without gradient tracking and convert logits to probabilities.
	with torch.inference_mode():
	    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

	# Map the most probable class index to its label name.
	predicted_label = model.config.id2label[int(probs.argmax())]
	print(predicted_label)
	```
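
Because several labels overlap, it can help to look at the top few classes rather than only the argmax. The helpers below are a small sketch. `build_input` and `top_k_labels` are hypothetical names, and the title-only format (the `Title:` prefix alone) is an assumption, since this card does not state how title-only inputs were formatted for evaluation.

```python
def build_input(title, abstract=None):
    """Format a title and optional abstract like the usage example."""
    title = title.strip()
    if abstract is None or not abstract.strip():
        # Assumption: title-only inputs keep just the "Title:" prefix.
        return f"Title: {title}"
    return f"Title: {title}\n\nAbstract: {abstract.strip()}"


def top_k_labels(probs, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[index], prob) for index, prob in ranked[:k]]


# Toy values standing in for model output; these are not real predictions.
toy_id2label = {0: "Machine Learning", 1: "Statistics", 2: "Computer Vision"}
toy_probs = [0.55, 0.35, 0.10]
print(build_input("Graph neural networks", "We study message passing."))
print(top_k_labels(toy_probs, toy_id2label, k=2))
```

With the model from the snippet above, the same helper can be called as `top_k_labels(probs.tolist(), model.config.id2label)`.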

	## Notes

	The current baseline is strongest on physics-heavy classes and weakest on the broad `Machine Learning` category, where topical overlap with AI, NLP, CV, and Statistics remains high.
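
One way to inspect this kind of overlap is to count which (true, predicted) label pairs account for the most errors. The sketch below uses a hypothetical `top_confusions` helper on made-up labels. It illustrates the analysis and does not report results for this model.

```python
from collections import Counter


def top_confusions(y_true, y_pred, n=5):
    """Return the n most frequent (true, predicted) pairs among errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(n)


# Made-up labels for illustration only.
toy_true = ["Machine Learning", "Machine Learning", "Statistics",
            "Machine Learning", "Astrophysics"]
toy_pred = ["Artificial Intelligence", "Statistics", "Statistics",
            "Artificial Intelligence", "Astrophysics"]
for (true_label, pred_label), count in top_confusions(toy_true, toy_pred):
    print(f"{true_label} -> {pred_label}: {count}")
```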