SIRIS-Lab/specter2-fapesp-cluster-multiclass

@@ -14,7 +14,7 @@ model-index:
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
-# results
 This model is a fine-tuned version of [allenai/specter2_base](https://huggingface.co/allenai/specter2_base) on the None dataset.
 It achieves the following results on the evaluation set:
@@ -29,18 +29,90 @@ It achieves the following results on the evaluation set:
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
 ## Training procedure
 ### Training hyperparameters
 The following hyperparameters were used during training:
@@ -63,6 +135,21 @@ The following hyperparameters were used during training:
 | 0.1161        | 5.0   | 19035 | 1.0616          | 0.8229   | 0.8229          | 0.8207          | 0.8229       | 0.8214       | 0.8229   | 0.8205   |
 | 0.0864        | 6.0   | 22842 | 1.2011          | 0.8212   | 0.8212          | 0.8176          | 0.8212       | 0.8198       | 0.8212   | 0.8178   |
 ### Framework versions

 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
+# 📗 SPECTER2–FAPESP Cluster: Multiclass Classification on FAPESP Grande Área do Conhecimento (Level 1)
 This model is a fine-tuned version of [allenai/specter2_base](https://huggingface.co/allenai/specter2_base) on the None dataset.
 It achieves the following results on the evaluation set:
 ## Model description
+This model is a fine-tuned version of SPECTER2 (`allenai/specter2_base`) adapted for multiclass classification across the 8 [Grande Áreas do Conhecimento of FAPESP](https://bv.fapesp.br/pt/area_conhecimento/).
+The model accepts the title, abstract, or title + abstract of a research project and assigns it to exactly one of the 8 areas (e.g., Linguistics, Literature and Arts; Health Sciences; Biological Sciences).
+Key characteristics:
+* Base model: allenai/specter2_base
+* Task: multiclass document classification
+* Labels: 8 Cluster Areas
+* Activation: softmax
+* Loss: CrossEntropyLoss
+* Output: the single best-matching FAPESP Cluster Area
+FAPESP's Clusters represent broad disciplinary domains designed for high-level categorization of research and innovation (R&I) documents.
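The softmax activation and single-label output listed above can be illustrated with a minimal sketch in plain Python. The label names are taken from the evaluation table in this card, but their index order here is an assumption (the real mapping lives in the model's `id2label` config), and the logits are made up for illustration.

```python
import math

# Label names from the evaluation table; the index order is an assumption.
LABELS = [
    "Agronomical Sciences",
    "Applied Social Sciences",
    "Biological Sciences",
    "Engineering",
    "Health Sciences",
    "Humanities",
    "Linguistics, Literature and Arts",
    "Physical Sciences and Mathematics",
]


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, labels=LABELS):
    """Return the single best-matching label and its probability."""
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up logits for illustration only, not real model output.
example_logits = [0.1, -1.2, 0.3, 2.5, 0.0, -0.7, -2.0, 1.1]
label, prob = predict_label(example_logits)
print(label, round(prob, 3))
```

Because softmax is monotonic, taking the argmax of the raw logits would select the same label; the probabilities are only needed when a confidence score is wanted.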
 ## Intended uses & limitations
+This multiclass model is suitable for:
+- Assigning publications to **top-level scientific disciplines**
+- Enriching metadata in:
+  - repositories
+  - research output systems
+  - funding and project datasets
+  - bibliometric dashboards
+- Supporting scientometric analyses such as:
+  - broad-discipline portfolio mapping
+  - domain-level clustering
+  - modeling research diversification
+- Classifying documents when only **title/abstract** is available
+The model supports inputs such as:
+- **title only**
+- **abstract only**
+- **title + abstract** (recommended)
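A small helper can cover the three input variants listed above. This is only a sketch: the function name `build_input` and the `" [SEP] "` separator are assumptions, not something this card specifies (the Preprocessing section lists `abstract` as the training input), so check the tokenizer's own separator token before relying on it.

```python
def build_input(title=None, abstract=None, sep=" [SEP] "):
    """Join whichever of title/abstract is present into one input string.

    `sep` is a hypothetical default; the separator actually expected by the
    tokenizer is not stated in this model card.
    """
    parts = [p.strip() for p in (title, abstract) if p and p.strip()]
    if not parts:
        raise ValueError("need at least a title or an abstract")
    return sep.join(parts)


# Illustrative strings, not taken from the FAPESP data.
print(build_input(title="Soil microbiomes in sugarcane fields"))
print(build_input(abstract="We study nitrogen fixation in tropical soils."))
print(build_input("Soil microbiomes in sugarcane fields",
                  "We study nitrogen fixation in tropical soils."))
```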
+### Limitations
+- Documents spanning multiple fields must be forced into **one** label, an inherent limitation of multiclass classification.
+- The training labels come from **FAPESP-funded projects**, not manual expert annotation.
+- Not suitable for:
+  - downstream tasks requiring multilabel outputs
+  - WoS Categories or ASJC Areas (use separate models)
+  - clinical or regulatory decision-making
+Predictions should be treated as **field-level disciplinary metadata**.
 ## Training and evaluation data
+The training and evaluation dataset was constructed from publicly available [**FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo)**](https://bv.fapesp.br/pt/pesquisa/download_projetos/) research project records. These records cover funded research projects and scholarships across all scientific domains in Brazil.
+The dataset was assembled using the following CSV downloads provided by FAPESP:
+- **Auxílios em andamento** (ongoing research grants)
+- **Auxílios concluídos** (completed research grants)
+- **Bolsas no Brasil em andamento** (ongoing domestic scholarships)
+- **Bolsas no Brasil concluídas** (completed domestic scholarships)
+- **Bolsas no exterior em andamento** (ongoing international scholarships)
+- **Bolsas no exterior concluídas** (completed international scholarships)
+Each record contains metadata such as project titles, abstracts, funding type, and scientific classifications.
+From these files, the following fields were extracted and standardized:
+- **Title (English)**
+- **Abstract (English)**
+- **Grande Área do Conhecimento** (major scientific domain)
+- **Área do Conhecimento** (field of study)
+Only entries containing at least one English component (title or abstract) were retained.
+Scientific areas were normalized and mapped to a controlled English taxonomy to ensure consistency and comparability across records.
+The final dataset consists of labeled scientific text samples distributed across the 8 Grande Áreas, providing a balanced corpus for supervised classification.
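The filtering and label-normalization steps described above might look roughly like the following sketch. The record keys (`title_en`, `abstract_en`, `grande_area`) are hypothetical, the Portuguese-to-English pairing is an assumption based on the English labels in the evaluation table, and dropping unmapped areas is likewise an assumption rather than documented behavior.

```python
# Portuguese -> English area names. The English side comes from the
# evaluation table; the pairing itself is an assumption.
AREA_MAP = {
    "Ciências Agrárias": "Agronomical Sciences",
    "Ciências Sociais Aplicadas": "Applied Social Sciences",
    "Ciências Biológicas": "Biological Sciences",
    "Engenharias": "Engineering",
    "Ciências da Saúde": "Health Sciences",
    "Ciências Humanas": "Humanities",
    "Linguística, Letras e Artes": "Linguistics, Literature and Arts",
    "Ciências Exatas e da Terra": "Physical Sciences and Mathematics",
}


def prepare(records):
    """Keep records with an English title or abstract and a mappable area."""
    prepared = []
    for rec in records:
        title = (rec.get("title_en") or "").strip()
        abstract = (rec.get("abstract_en") or "").strip()
        if not title and not abstract:
            continue  # no English component: not retained
        label = AREA_MAP.get((rec.get("grande_area") or "").strip())
        if label is None:
            continue  # assumption: areas outside the 8 labels are dropped
        prepared.append({"title": title, "abstract": abstract, "label": label})
    return prepared


# Illustrative records with hypothetical field names and made-up text.
sample = [
    {"title_en": "Bridge fatigue models", "abstract_en": "", "grande_area": "Engenharias"},
    {"title_en": "", "abstract_en": None, "grande_area": "Ciências Humanas"},
    {"title_en": "Mixed topics", "abstract_en": "Text.", "grande_area": "Interdisciplinar"},
]
print(prepare(sample))
```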
 ## Training procedure
+### Preprocessing
+- Input text constructed as:
+  `abstract`
+- Tokenization using the SPECTER2 tokenizer
+- Maximum sequence length: **512 tokens**
+### Model
+- Base model: `allenai/specter2_base`
+- Classification head: linear layer → softmax
+- Loss: **CrossEntropyLoss**
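For intuition about the loss named above, cross-entropy over logits can be written out in plain Python. This is a sketch of the standard formula (negative log-softmax of the target class, averaged over the batch), not the training code used for this model; the logits and targets are made up.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def cross_entropy(logits, target):
    """Loss for one example: negative log-probability of the target class."""
    return -log_softmax(logits)[target]


def mean_cross_entropy(batch_logits, targets):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)


# Made-up 8-class logits: a confident correct prediction and an uninformed one.
confident = [0.0, 0.0, 0.0, 6.0, 0.0, 0.0, 0.0, 0.0]
uniform = [0.0] * 8
print(cross_entropy(confident, 3))
print(cross_entropy(uniform, 3))
print(mean_cross_entropy([confident, uniform], [3, 3]))
```

With uniform logits over 8 classes the loss equals ln 8, which is a useful reference point when reading the validation-loss column of the training table.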
 ### Training hyperparameters
 The following hyperparameters were used during training:
 | 0.1161        | 5.0   | 19035 | 1.0616          | 0.8229   | 0.8229          | 0.8207          | 0.8229       | 0.8214       | 0.8229   | 0.8205   |
 | 0.0864        | 6.0   | 22842 | 1.2011          | 0.8212   | 0.8212          | 0.8176          | 0.8212       | 0.8198       | 0.8212   | 0.8178   |
+### Evaluation results
+|                                   |   precision |   recall |   f1-score |     support |
+|:----------------------------------|------------:|---------:|-----------:|------------:|
+| Agronomical Sciences              |    0.848943 | 0.805158 |   0.826471 |  349        |
+| Applied Social Sciences           |    0.745152 | 0.890728 |   0.811463 |  302        |
+| Biological Sciences               |    0.835052 | 0.826531 |   0.830769 |  686        |
+| Engineering                       |    0.836036 | 0.890595 |   0.862454 |  521        |
+| Health Sciences                   |    0.828283 | 0.833333 |   0.8308   |  492        |
+| Humanities                        |    0.891648 | 0.816116 |   0.852211 |  484        |
+| Linguistics, Literature and Arts  |    0.855346 | 0.85     |   0.852665 |  160        |
+| Physical Sciences and Mathematics |    0.872576 | 0.807692 |   0.838881 |  390        |
+| accuracy                          |             |          |   0.838357 | 3384        |
+| macro avg                         |    0.839129 | 0.840019 |   0.838214 | 3384        |
+| weighted avg                      |    0.841008 | 0.838357 |   0.838523 | 3384        |
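The aggregate rows can be reproduced from the per-class rows above, which is a quick sanity check on the report. The sketch below copies the table values and recomputes F1, the macro average, and the support-weighted average; the function names are illustrative.

```python
# (precision, recall, f1, support) copied from the evaluation table.
REPORT = {
    "Agronomical Sciences": (0.848943, 0.805158, 0.826471, 349),
    "Applied Social Sciences": (0.745152, 0.890728, 0.811463, 302),
    "Biological Sciences": (0.835052, 0.826531, 0.830769, 686),
    "Engineering": (0.836036, 0.890595, 0.862454, 521),
    "Health Sciences": (0.828283, 0.833333, 0.8308, 492),
    "Humanities": (0.891648, 0.816116, 0.852211, 484),
    "Linguistics, Literature and Arts": (0.855346, 0.85, 0.852665, 160),
    "Physical Sciences and Mathematics": (0.872576, 0.807692, 0.838881, 390),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_avg(values):
    """Unweighted mean: every class counts equally."""
    return sum(values) / len(values)


def weighted_avg(values, weights):
    """Mean weighted by class support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


f1s = [row[2] for row in REPORT.values()]
supports = [row[3] for row in REPORT.values()]
total_support = sum(supports)
macro_f1 = macro_avg(f1s)
weighted_f1 = weighted_avg(f1s, supports)
print(total_support, round(macro_f1, 6), round(weighted_f1, 6))
```

The macro average treats the smallest class (160 samples) the same as the largest (686), while the weighted average follows the class distribution of the evaluation set.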
 ### Framework versions