---
language:
- en
library_name: transformers
license: mit
pipeline_tag: text-classification
tags:
- arxiv
- scientific-text-classification
- scibert
- streamlit-demo
datasets:
- librarian-bots/arxiv-metadata-snapshot
metrics:
- accuracy
- f1
---

# Article Topic Service SciBERT

A SciBERT text classifier that predicts the topic of a scientific article from its title and abstract.

## Labels

- Artificial Intelligence
- Natural Language Processing
- Computer Vision
- Machine Learning
- Computer Science Theory and Algorithms
- Mathematics
- Statistics
- Electrical Engineering
- Astrophysics
- Condensed Matter Physics
- Quantum Physics
- Quantitative Biology

## Dataset

Balanced 12-class subset built from `librarian-bots/arxiv-metadata-snapshot`.

- Train: 30,000 examples
- Validation: 3,600 examples
- Test: 3,600 examples

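
The split sizes above are consistent with an even per-class budget. The sketch below shows one way such a balanced subset could be drawn. It is not the script used to build this dataset: `balanced_subset` and the `"label"` field are hypothetical, and the per-class counts assume every split is perfectly balanced across the 12 labels.

```python
import random
from collections import defaultdict

NUM_CLASSES = 12
SPLIT_SIZES = {"train": 30_000, "validation": 3_600, "test": 3_600}

# Assumption: each split is perfectly balanced, so the per-class budget
# is simply the split size divided by the number of labels.
per_class_counts = {name: size // NUM_CLASSES for name, size in SPLIT_SIZES.items()}
print(per_class_counts)  # {'train': 2500, 'validation': 300, 'test': 300}


def balanced_subset(records, per_class, seed=0):
    """Sample `per_class` records for every label, then shuffle the result.

    `records` is an iterable of dicts with a "label" key (hypothetical schema).
    """
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)
    subset = []
    for label, items in sorted(by_label.items()):
        if len(items) < per_class:
            raise ValueError(f"not enough examples for label {label!r}")
        subset.extend(rng.sample(items, per_class))
    rng.shuffle(subset)
    return subset
```
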
## Metrics

- Validation accuracy: 0.8350
- Validation macro F1: 0.8351
- Test accuracy: 0.8356
- Test macro F1: 0.8351
- Title-only test accuracy: 0.7522
- Title-only test macro F1: 0.7495

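
For reference, accuracy and macro F1 can be computed from label lists as sketched below. This is a plain-Python illustration of the metric definitions, not the evaluation code behind the numbers above; the helper names and the toy labels are made up for the example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no
        # predictions contributes 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only, not model output.
y_true = ["Machine Learning", "Machine Learning", "Statistics", "Astrophysics"]
y_pred = ["Machine Learning", "Statistics", "Statistics", "Astrophysics"]
print(f"accuracy={accuracy(y_true, y_pred):.4f} macro_f1={macro_f1(y_true, y_pred):.4f}")
```

Because the classes are balanced, accuracy and macro F1 are expected to stay close to each other, which matches the reported figures.
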
## Usage

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Ian-Khalzov/article-topic-service-scibert"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Title and abstract are passed as a single input string.
text = "Title: Large language models for scientific document classification\n\nAbstract: We study..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

# Run a forward pass without gradient tracking and convert logits to probabilities.
with torch.inference_mode():
    probs = torch.softmax(model(**inputs).logits[0], dim=-1)

predicted_label = model.config.id2label[int(probs.argmax())]
print(predicted_label)
```

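
When classifying many articles, it can help to build the input string and rank the labels with small helpers. The sketch below is one possible approach under stated assumptions: `format_input` and `top_k` are hypothetical names, the `Title: ...\n\nAbstract: ...` template is copied from the example string above, and the title-only form (`Title: ...` with no abstract part) is a guess rather than the documented evaluation format.

```python
import math


def format_input(title, abstract=None):
    """Build the model input string from a title and an optional abstract."""
    title = " ".join(title.split())  # collapse newlines and repeated spaces
    if not abstract:
        # Assumed title-only form; the card does not document it.
        return f"Title: {title}"
    abstract = " ".join(abstract.split())
    return f"Title: {title}\n\nAbstract: {abstract}"


def top_k(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs from raw logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]
```

With the model loaded as in the snippet above, the helpers would be called as `top_k(model(**inputs).logits[0].tolist(), model.config.id2label)`, where `inputs` is the tokenized output of `format_input(title, abstract)`.
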
## Notes

The current baseline is strongest on physics-heavy classes and weakest on the broad `Machine Learning` category, where topical overlap with AI, NLP, CV, and Statistics remains high.
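
Overlap of this kind can be inspected by counting which label pairs are confused most often and by comparing per-class recall. The sketch below shows the idea on toy labels; the helper names are hypothetical and the data is illustrative, not the model's actual predictions.

```python
from collections import Counter


def most_confused(y_true, y_pred, top=3):
    """Return the most frequent (true_label, predicted_label) error pairs."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(top)


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


# Toy labels for illustration only.
y_true = ["Machine Learning"] * 4 + ["Astrophysics"] * 2
y_pred = ["Statistics", "Statistics", "Artificial Intelligence", "Machine Learning",
          "Astrophysics", "Astrophysics"]

print(most_confused(y_true, y_pred, top=1))
print(per_class_recall(y_true, y_pred))
```
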