Update README.md
README.md
CHANGED
|
@@ -14,7 +14,7 @@ model-index:
|
|
| 14 |
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
|
| 15 |
should probably proofread and complete it, then remove this comment. -->
|
| 16 |
|
| 17 |
-
#
|
| 18 |
|
| 19 |
This model is a fine-tuned version of [allenai/specter2_base](https://huggingface.co/allenai/specter2_base) on the None dataset.
|
| 20 |
It achieves the following results on the evaluation set:
|
|
@@ -29,18 +29,90 @@ It achieves the following results on the evaluation set:
|
|
| 29 |
|
| 30 |
## Model description
|
| 31 |
|
| 32 |
-
|
| 33 |
|
| 34 |
## Intended uses & limitations
|
| 35 |
|
| 36 |
-
|
| 37 |
|
| 38 |
## Training and evaluation data
|
| 39 |
|
| 40 |
-
|
| 41 |
|
| 42 |
## Training procedure
|
| 43 |
|
| 44 |
### Training hyperparameters
|
| 45 |
|
| 46 |
The following hyperparameters were used during training:
|
|
@@ -63,6 +135,21 @@ The following hyperparameters were used during training:
|
|
| 63 |
| 0.1161 | 5.0 | 19035 | 1.0616 | 0.8229 | 0.8229 | 0.8207 | 0.8229 | 0.8214 | 0.8229 | 0.8205 |
|
| 64 |
| 0.0864 | 6.0 | 22842 | 1.2011 | 0.8212 | 0.8212 | 0.8176 | 0.8212 | 0.8198 | 0.8212 | 0.8178 |
|
| 65 |
|
| 66 |
|
| 67 |
### Framework versions
|
| 68 |
|
|
|
|
| 14 |
<!-- This model card has been generated automatically according to the information the Trainer had access to. You
|
| 15 |
should probably proofread and complete it, then remove this comment. -->
|
| 16 |
|
| 17 |
+
# 📗 SPECTER2–FAPESP Cluster (Multiclass Classification on FAPESP Grande Área do Conhecimento, Level 1)
|
| 18 |
|
| 19 |
This model is a fine-tuned version of [allenai/specter2_base](https://huggingface.co/allenai/specter2_base) on the None dataset.
|
| 20 |
It achieves the following results on the evaluation set:
|
|
|
|
| 29 |
|
| 30 |
## Model description
|
| 31 |
|
| 32 |
+
This model is a fine-tuned version of SPECTER2 (`allenai/specter2_base`) adapted for multiclass classification across the 8 [Grandes Áreas do Conhecimento (major knowledge areas) of FAPESP](https://bv.fapesp.br/pt/area_conhecimento/).
|
| 33 |
+
|
| 34 |
+
The model accepts the title, the abstract, or the title + abstract of a research project and assigns it to exactly one of these areas (e.g., Linguistics, Literature and Arts; Health Sciences; Biological Sciences).
|
| 35 |
+
|
| 36 |
+
Key characteristics:
|
| 37 |
+
* Base model: allenai/specter2_base
|
| 38 |
+
* Task: multiclass document classification
|
| 39 |
+
* Labels: 8 Cluster Areas
|
| 40 |
+
* Activation: softmax
|
| 41 |
+
* Loss: CrossEntropyLoss
|
| 42 |
+
* Output: the single best-matching FAPESP Cluster Area
|
| 43 |
+
|
| 44 |
+
FAPESP's Clusters represent broad disciplinary domains designed for high-level categorization of research and innovation (R&I) documents.
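As a rough illustration of the softmax output described above, the sketch below turns a vector of 8 raw scores into probabilities and picks the single best label. The label names come from the evaluation table in this card, but their index order is an assumption (the authoritative mapping is whatever the checkpoint stores), and the logits are made up.

```python
import math

# Label names taken from the evaluation table in this card; their ORDER is an
# assumption. The real index-to-label mapping lives in the checkpoint config.
LABELS = [
    "Agronomical Sciences",
    "Applied Social Sciences",
    "Biological Sciences",
    "Engineering",
    "Health Sciences",
    "Humanities",
    "Linguistics, Literature and Arts",
    "Physical Sciences and Mathematics",
]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the single best-matching label and its probability."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits, for illustration only.
example_logits = [0.1, -1.2, 0.4, 2.3, 0.0, -0.5, -2.0, 1.1]
label, prob = predict(example_logits)
```

Because the probabilities sum to 1 and only the top one is reported, a document that genuinely spans two areas still receives one label; inspecting the full probability vector can help flag such cases.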
|
| 45 |
|
| 46 |
## Intended uses & limitations
|
| 47 |
|
| 48 |
+
This multiclass model is suitable for:
|
| 49 |
+
|
| 50 |
+
- Assigning publications to **top-level scientific disciplines**
|
| 51 |
+
- Enriching metadata in:
|
| 52 |
+
  - repositories
|
| 53 |
+
  - research output systems
|
| 54 |
+
  - funding and project datasets
|
| 55 |
+
  - bibliometric dashboards
|
| 56 |
+
- Supporting scientometric analyses such as:
|
| 57 |
+
  - broad-discipline portfolio mapping
|
| 57 |
+
  - domain-level clustering
|
| 58 |
+
  - modeling research diversification
|
| 60 |
+
- Classifying documents when only **title/abstract** is available
|
| 61 |
+
|
| 62 |
+
The model supports inputs such as:
|
| 63 |
+
- **title only**
|
| 64 |
+
- **abstract only**
|
| 65 |
+
- **title + abstract** (recommended)
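A minimal sketch of how the three input variants might be assembled into a single string before tokenization. The helper `build_input` is hypothetical, and the `[SEP]` separator is an assumption (BERT-style tokenizers typically expose it as `tokenizer.sep_token`); this card does not state how title and abstract are joined.

```python
# Hypothetical helper. The separator is an assumption, not taken from this card.
SEP_TOKEN = "[SEP]"


def build_input(title=None, abstract=None, sep=SEP_TOKEN):
    """Join whichever of title/abstract is present into one input string."""
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    if not parts:
        raise ValueError("need at least a title or an abstract")
    return f" {sep} ".join(parts)


title_only = build_input(title="Soil carbon dynamics in sugarcane fields")
both = build_input(
    title="Soil carbon dynamics in sugarcane fields",
    abstract="We measure carbon stocks under different harvest regimes.",
)
```

With only one field present, the string is passed through unchanged, so the same helper covers all three cases.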
|
| 66 |
+
|
| 67 |
+
### Limitations
|
| 68 |
+
- Documents spanning multiple fields must be forced into **one** label, an inherent limitation of multiclass classification.
|
| 69 |
+
- The training labels come from **FAPESP-funded projects**, not manual expert annotation.
|
| 70 |
+
- Not suitable for:
|
| 71 |
+
  - downstream tasks requiring multilabel outputs
|
| 71 |
+
  - WoS Categories or ASJC Areas (use separate models)
|
| 72 |
+
  - clinical or regulatory decision-making
|
| 74 |
+
|
| 75 |
+
Predictions should be treated as **field-level disciplinary metadata**.
|
| 76 |
|
| 77 |
## Training and evaluation data
|
| 78 |
|
| 79 |
+
The training and evaluation dataset was constructed from publicly available [**FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo)**](https://bv.fapesp.br/pt/pesquisa/download_projetos/) research project records. These records cover research projects and scholarships funded by FAPESP, the research foundation of the state of São Paulo, Brazil, across all scientific domains.
|
| 80 |
+
|
| 81 |
+
The dataset was assembled using the following CSV downloads provided by FAPESP:
|
| 82 |
+
|
| 83 |
+
- **Auxílios em andamento** (ongoing research grants)
|
| 84 |
+
- **Auxílios concluídos** (completed research grants)
|
| 85 |
+
- **Bolsas no Brasil em andamento** (ongoing domestic scholarships)
|
| 86 |
+
- **Bolsas no Brasil concluídas** (completed domestic scholarships)
|
| 87 |
+
- **Bolsas no exterior em andamento** (ongoing international scholarships)
|
| 88 |
+
- **Bolsas no exterior concluídas** (completed international scholarships)
|
| 89 |
+
|
| 90 |
+
Each record contains metadata such as project titles, abstracts, funding type, and scientific classifications.
|
| 91 |
+
From these files, the following fields were extracted and standardized:
|
| 92 |
+
|
| 93 |
+
- **Title (English)**
|
| 94 |
+
- **Abstract (English)**
|
| 95 |
+
- **Grande Área do Conhecimento** (major scientific domain)
|
| 96 |
+
- **Área do Conhecimento** (field of study)
|
| 97 |
+
|
| 98 |
+
Only entries containing at least one English component (title or abstract) were retained.
|
| 99 |
+
Scientific areas were normalized and mapped to a controlled English taxonomy to ensure consistency and comparability across records.
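The retention rule above (keep a record if it has an English title or an English abstract) could be sketched as follows. The field names `title_en` and `abstract_en` are hypothetical stand-ins for whatever columns the FAPESP CSV exports actually use, and the sample records are invented.

```python
# Hypothetical column names; the actual CSV headers are not given in this card.
TITLE_KEY = "title_en"
ABSTRACT_KEY = "abstract_en"


def has_english_text(record):
    """True if the record has a non-blank English title or abstract."""
    return any((record.get(key) or "").strip() for key in (TITLE_KEY, ABSTRACT_KEY))


def keep_english(records):
    """Drop records that have neither an English title nor an English abstract."""
    return [r for r in records if has_english_text(r)]


# Invented sample records, for illustration only.
sample = [
    {"title_en": "Soil microbiome dynamics", "abstract_en": ""},
    {"title_en": None, "abstract_en": "We study protein folding."},
    {"title_en": "   ", "abstract_en": None},
    {},
]
kept = keep_english(sample)
```

Whitespace-only values are treated as missing, which avoids keeping rows whose English fields are present but empty.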
|
| 100 |
+
|
| 101 |
+
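The final dataset consists of labeled scientific text samples distributed across the eight domains, providing a corpus for supervised classification. Class sizes are not strictly balanced: in the evaluation set, support ranges from 160 to 686 samples per class.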
|
| 102 |
|
| 103 |
## Training procedure
|
| 104 |
|
| 105 |
+
### Preprocessing
|
| 106 |
+
- Input text constructed as:
|
| 107 |
+
`abstract`
|
| 108 |
+
- Tokenization using the SPECTER2 tokenizer
|
| 109 |
+
- Maximum sequence length: **512 tokens**
|
| 110 |
+
|
| 111 |
+
### Model
|
| 112 |
+
- Base model: `allenai/specter2_base`
|
| 113 |
+
- Classification head: linear layer → softmax
|
| 114 |
+
- Loss: **CrossEntropyLoss**
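For readers unfamiliar with the loss named above, here is a small pure-Python sketch of cross-entropy computed directly from logits: the negative log of the softmax probability assigned to the true class. It illustrates the formula only and is not the training code; the function name and the example logits are made up.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of class `target` under softmax(logits)."""
    peak = max(logits)
    # log-sum-exp, shifted by the max so large logits do not overflow
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


# With 8 equally scored classes the loss is ln(8), the "no information" baseline.
uniform_loss = cross_entropy([0.0] * 8, target=3)

# A confident correct prediction gives a small loss, a confident wrong one a large loss.
confident = [0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0]
loss_correct = cross_entropy(confident, target=3)
loss_wrong = cross_entropy(confident, target=0)
```

Note that PyTorch's `CrossEntropyLoss` applies the log-softmax internally, so a linear head trained with it emits raw logits and softmax is only needed when probabilities are wanted at inference time.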
|
| 115 |
+
|
| 116 |
### Training hyperparameters
|
| 117 |
|
| 118 |
The following hyperparameters were used during training:
|
|
|
|
| 135 |
| 0.1161 | 5.0 | 19035 | 1.0616 | 0.8229 | 0.8229 | 0.8207 | 0.8229 | 0.8214 | 0.8229 | 0.8205 |
|
| 136 |
| 0.0864 | 6.0 | 22842 | 1.2011 | 0.8212 | 0.8212 | 0.8176 | 0.8212 | 0.8198 | 0.8212 | 0.8178 |
|
| 137 |
|
| 138 |
+
### Evaluation results
|
| 139 |
+
|
| 140 |
+
| | precision | recall | f1-score | support |
|
| 141 |
+
|:----------------------------------|------------:|---------:|-----------:|------------:|
|
| 142 |
+
| Agronomical Sciences | 0.848943 | 0.805158 | 0.826471 | 349 |
|
| 143 |
+
| Applied Social Sciences | 0.745152 | 0.890728 | 0.811463 | 302 |
|
| 144 |
+
| Biological Sciences | 0.835052 | 0.826531 | 0.830769 | 686 |
|
| 145 |
+
| Engineering | 0.836036 | 0.890595 | 0.862454 | 521 |
|
| 146 |
+
| Health Sciences | 0.828283 | 0.833333 | 0.830800 | 492 |
|
| 147 |
+
| Humanities | 0.891648 | 0.816116 | 0.852211 | 484 |
|
| 148 |
+
| Linguistics, Literature and Arts | 0.855346 | 0.850000 | 0.852665 | 160 |
|
| 149 |
+
| Physical Sciences and Mathematics | 0.872576 | 0.807692 | 0.838881 | 390 |
|
| 150 |
+
| accuracy | | | 0.838357 | 3384 |
|
| 151 |
+
| macro avg | 0.839129 | 0.840019 | 0.838214 | 3384 |
|
| 152 |
+
| weighted avg | 0.841008 | 0.838357 | 0.838523 | 3384 |
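As a quick consistency check on the table above, the macro and weighted F1 averages can be recomputed from the per-class rows. The sketch below copies the f1-score and support columns from the table; the function names are illustrative.

```python
# (f1-score, support) per class, copied from the evaluation table above.
PER_CLASS = {
    "Agronomical Sciences": (0.826471, 349),
    "Applied Social Sciences": (0.811463, 302),
    "Biological Sciences": (0.830769, 686),
    "Engineering": (0.862454, 521),
    "Health Sciences": (0.8308, 492),
    "Humanities": (0.852211, 484),
    "Linguistics, Literature and Arts": (0.852665, 160),
    "Physical Sciences and Mathematics": (0.838881, 390),
}


def macro_avg(rows):
    """Unweighted mean: every class counts equally."""
    return sum(f1 for f1, _ in rows) / len(rows)


def weighted_avg(rows):
    """Mean weighted by support: larger classes count more."""
    total = sum(n for _, n in rows)
    return sum(f1 * n for f1, n in rows) / total


rows = list(PER_CLASS.values())
total_support = sum(n for _, n in rows)
macro_f1 = macro_avg(rows)
weighted_f1 = weighted_avg(rows)
```

The two averages are close here, which suggests performance is fairly even across classes despite the differences in support.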
|
| 153 |
|
| 154 |
### Framework versions
|
| 155 |
|