|
|
--- |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
base_model: allenai/specter2_base |
|
|
tags: |
|
|
- generated_from_trainer |
|
|
metrics: |
|
|
- accuracy |
|
|
model-index: |
|
|
- name: results |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
|
|
|
|
|
# 📗 SPECTER2–MAG (Multiclass Classification on MAG Level-0 Fields of Study) |
|
|
|
|
|
This model is a fine-tuned version of [allenai/specter2_base](https://huggingface.co/allenai/specter2_base) for multiclass bibliometric classification using MAG Fields of Study – Level 0 (SciDocs). |
|
|
It achieves the following results on the evaluation set: |
|
|
- Loss: 1.0598 |
|
|
- Accuracy: 0.8310 |
|
|
- Precision Micro: 0.8310 |
|
|
- Precision Macro: 0.8290 |
|
|
- Recall Micro: 0.8310 |
|
|
- Recall Macro: 0.8276 |
|
|
- F1 Micro: 0.8310 |
|
|
- F1 Macro: 0.8263 |
|
|
|
|
|
## Model description |
|
|
|
|
|
This model is a fine-tuned version of SPECTER2 (`allenai/specter2_base`) adapted for multiclass classification across the 19 top-level Fields of Study (FoS) from the Microsoft Academic Graph (MAG). |
|
|
|
|
|
The model accepts the title, abstract, or title + abstract of a scientific publication and assigns it to exactly one of the MAG Level-0 domains (e.g., Biology, Chemistry, Computer Science, Engineering, Psychology). |
|
|
|
|
|
Key characteristics: |
|
|
* Base model: allenai/specter2_base |
|
|
* Task: multiclass document classification |
|
|
* Labels: 19 MAG Field of Study Level-0 categories |
|
|
* Activation: softmax |
|
|
* Loss: CrossEntropyLoss |
|
|
* Output: single best-matching FoS category |
|
|
|
|
|
MAG Level-0 represents broad disciplinary domains designed for high-level categorization of scientific documents. |
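A minimal inference sketch of the setup described above (encoder plus a 19-way classification head with softmax). The fine-tuned checkpoint's repo id is not stated in this card, so the base model with a freshly initialized 19-label head stands in below; swap in the actual checkpoint id to get meaningful predictions, and note that the `LABEL_k` names are placeholders until the checkpoint's `id2label` mapping is loaded.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder: the base encoder with a new 19-way head. Replace with the
# fine-tuned checkpoint's repo id for real MAG Level-0 predictions.
model_id = "allenai/specter2_base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=19)
model.eval()

text = "Graph neural networks for citation recommendation"  # title (or title + abstract)
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, 19)
probs = logits.softmax(dim=-1)           # softmax over the 19 FoS categories
pred = model.config.id2label[int(probs.argmax())]  # single best-matching label
```

With the fine-tuned checkpoint, `pred` is one of the 19 MAG Level-0 field names rather than a generic `LABEL_k` placeholder.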
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
### Intended uses |
|
|
This multiclass MAG model is suitable for: |
|
|
|
|
|
- Assigning publications to **top-level scientific disciplines** |
|
|
- Enriching metadata in: |
|
|
- repositories |
|
|
- research output systems |
|
|
- funding and project datasets |
|
|
- bibliometric dashboards |
|
|
- Supporting scientometric analyses such as: |
|
|
- broad-discipline portfolio mapping |
|
|
- domain-level clustering |
|
|
- modeling research diversification |
|
|
- Classifying documents when only **title/abstract** is available |
|
|
|
|
|
The model supports inputs such as: |
|
|
- **title only** |
|
|
- **abstract only** |
|
|
- **title + abstract** (recommended) |
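One way to build the recommended title + abstract input is the original SPECTER recipe, which joins the two fields with the tokenizer's SEP token. This card does not state the exact concatenation used during fine-tuning, so treat the SEP-joined form below as an assumption:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/specter2_base")

title = "A study of reinforcement learning"
abstract = "We investigate policy gradient methods for sequential decision making."

# SPECTER-style input: title and abstract joined by the SEP token.
text = title + tokenizer.sep_token + abstract
enc = tokenizer(text, truncation=True, max_length=512)
```

Title-only or abstract-only inputs are simply the single field passed to the tokenizer directly.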
|
|
|
|
|
### Limitations |
|
|
- MAG Level-0 categories are **very coarse** (e.g., *Biology*, *Medicine*, *Engineering*), and do not represent subfields. |
|
|
- Documents spanning multiple fields must be forced into **one** label—an inherent limitation of multiclass classification. |
|
|
- The training labels come from **MAG’s automatic field assignment pipeline**, not manual expert annotation. |
|
|
- Not suitable for: |
|
|
- fine-grained subdisciplines |
|
|
- downstream tasks requiring multilabel outputs |
|
|
- WoS Categories or ASJC Areas (use separate models) |
|
|
- clinical or regulatory decision-making |
|
|
|
|
|
Predictions should be treated as **high-level disciplinary metadata**, not detailed field classification. |
|
|
|
|
|
## Training and evaluation data |
|
|
|
|
|
### Source dataset: **SciDocs** |
|
|
|
|
|
Training data comes from the **SciDocs** dataset, introduced together with the original SPECTER paper: |
|
|
|
|
|
> **SciDocs** provides citation graphs, titles, abstracts, and **MAG Fields of Study** for scientific documents derived from MAG. |
|
|
> For this model, we use **MAG Level-0 FoS**, the 19 top-level scientific domains. |
|
|
|
|
|
Dataset characteristics: |
|
|
|
|
|
| Property | Value | |
|
|
|---------|-------| |
|
|
| Documents | ~40k scientific papers | |
|
|
| Labels | 19 FoS Level-0 categories | |
|
|
| Input fields | Abstract | |
|
|
| Task type | Multiclass | |
|
|
| Source | SciDocs (SPECTER paper) | |
|
|
| License | CC-BY | |
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Preprocessing |
|
|
- Input text constructed as: |
|
|
`abstract` |
|
|
- Tokenization using the SPECTER2 tokenizer |
|
|
- Maximum sequence length: **512 tokens** |
|
|
|
|
|
### Model |
|
|
- Base model: `allenai/specter2_base` |
|
|
- Classification head: linear layer → softmax |
|
|
- Loss: **CrossEntropyLoss** |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training: |
|
|
- learning_rate: 2e-05 |
|
|
- train_batch_size: 16 |
|
|
- eval_batch_size: 16 |
|
|
- seed: 42 |
|
|
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
|
|
- lr_scheduler_type: linear |
|
|
- num_epochs: 10 |
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Precision Micro | Precision Macro | Recall Micro | Recall Macro | F1 Micro | F1 Macro | |
|
|
|:-------------:|:-----:|:----:|:---------------:|:--------:|:---------------:|:---------------:|:------------:|:------------:|:--------:|:--------:| |
|
|
| 0.2603 | 1.0 | 1094 | 0.6733 | 0.8243 | 0.8243 | 0.8315 | 0.8243 | 0.8198 | 0.8243 | 0.8222 | |
|
|
| 0.1779 | 2.0 | 2188 | 0.6955 | 0.8240 | 0.8240 | 0.8198 | 0.8240 | 0.8203 | 0.8240 | 0.8176 | |
|
|
| 0.1628 | 3.0 | 3282 | 0.8130 | 0.8315 | 0.8315 | 0.8296 | 0.8315 | 0.8265 | 0.8315 | 0.8269 | |
|
|
| 0.1136 | 4.0 | 4376 | 0.9842 | 0.8227 | 0.8227 | 0.8254 | 0.8227 | 0.8192 | 0.8227 | 0.8205 | |
|
|
| 0.0666 | 5.0 | 5470 | 1.0598 | 0.8310 | 0.8310 | 0.8290 | 0.8310 | 0.8276 | 0.8310 | 0.8263 | |
|
|
|
|
|
### Evaluation results |
|
|
|
|
|
| | precision | recall | f1-score | support | |
|
|
|:----------------------|------------:|---------:|-----------:|------------:| |
|
|
| Art | 0.654867 | 0.845714 | 0.738155 | 175 | |
|
|
| Biology | 0.982222 | 0.973568 | 0.977876 | 227 | |
|
|
| Business | 0.914894 | 0.877551 | 0.895833 | 196 | |
|
|
| Chemistry | 0.97449 | 0.969543 | 0.97201 | 197 | |
|
|
| Computer science | 0.960452 | 0.894737 | 0.926431 | 190 | |
|
|
| Economics | 0.816425 | 0.782407 | 0.799054 | 216 | |
|
|
| Engineering | 0.906103 | 0.927885 | 0.916865 | 208 | |
|
|
| Environmental science | 0.975369 | 0.916667 | 0.945107 | 216 | |
|
|
| Geography | 0.758454 | 0.912791 | 0.828496 | 172 | |
|
|
| Geology | 0.96729 | 0.976415 | 0.971831 | 212 | |
|
|
| History | 0.62987 | 0.518717 | 0.568915 | 187 | |
|
|
| Materials science | 0.932432 | 0.958333 | 0.945205 | 216 | |
|
|
| Mathematics | 0.938776 | 0.94359 | 0.941176 | 195 | |
|
|
| Medicine | 0.982558 | 0.923497 | 0.952113 | 183 | |
|
|
| Philosophy | 0.752874 | 0.748571 | 0.750716 | 175 | |
|
|
| Physics | 0.964824 | 0.974619 | 0.969697 | 197 | |
|
|
| Political science | 0.642512 | 0.661692 | 0.651961 | 201 | |
|
|
| Psychology | 0.806283 | 0.758621 | 0.781726 | 203 | |
|
|
| Sociology | 0.438889 | 0.427027 | 0.432877 | 185 | |
|
|
| accuracy | | | 0.845641 | 3751 |
|
|
| macro avg | 0.842083 | 0.841681 | 0.840318 | 3751 | |
|
|
| weighted avg | 0.847843 | 0.845641 | 0.845311 | 3751 | |
|
|
|
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.57.1 |
|
|
- PyTorch 2.8.0+cu126
|
|
- Datasets 3.6.0 |
|
|
- Tokenizers 0.22.1 |
|
|
|