Commit 482ebf6 (verified) by nicolauduran45 · 1 Parent(s): 3f5bb07

Update README.md

  1. README.md +91 -4
README.md CHANGED
@@ -14,7 +14,7 @@ model-index:
14
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
  should probably proofread and complete it, then remove this comment. -->
16
 
17
- # results
18
 
19
  This model is a fine-tuned version of [allenai/specter2_base](https://huggingface.co/allenai/specter2_base) on the None dataset.
20
  It achieves the following results on the evaluation set:
@@ -29,18 +29,90 @@ It achieves the following results on the evaluation set:
29
 
30
  ## Model description
31
 
32
- More information needed
33
 
34
  ## Intended uses & limitations
35
 
36
- More information needed
37
 
38
  ## Training and evaluation data
39
 
40
- More information needed
41
 
42
  ## Training procedure
43
 
44
  ### Training hyperparameters
45
 
46
  The following hyperparameters were used during training:
@@ -63,6 +135,21 @@ The following hyperparameters were used during training:
63
  | 0.1161 | 5.0 | 19035 | 1.0616 | 0.8229 | 0.8229 | 0.8207 | 0.8229 | 0.8214 | 0.8229 | 0.8205 |
64
  | 0.0864 | 6.0 | 22842 | 1.2011 | 0.8212 | 0.8212 | 0.8176 | 0.8212 | 0.8198 | 0.8212 | 0.8178 |
65
 
66
 
67
  ### Framework versions
68
 
 
14
  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
15
  should probably proofread and complete it, then remove this comment. -->
16
 
17
+ # 📗 SPECTER2–FAPESP Cluster (Multiclass Classification on FAPESP Grande Área do Conhecimento, Level 1)
18
 
19
  This model is a fine-tuned version of [allenai/specter2_base](https://huggingface.co/allenai/specter2_base) on the None dataset.
20
  It achieves the following results on the evaluation set:
 
29
 
30
  ## Model description
31
 
32
+ This model is a fine-tuned version of SPECTER2 (`allenai/specter2_base`) adapted for multiclass classification across the 8 [Grandes Áreas do Conhecimento of FAPESP](https://bv.fapesp.br/pt/area_conhecimento/).
33
+
34
+ The model accepts the title, the abstract, or the title + abstract of a research project and assigns it to exactly one of these areas (e.g., Linguistics, Literature and Arts; Health Sciences; Biological Sciences).
35
+
36
+ Key characteristics:
37
+ * Base model: allenai/specter2_base
38
+ * Task: multiclass document classification
39
+ * Labels: 8 Cluster Areas
40
+ * Activation: softmax
41
+ * Loss: CrossEntropyLoss
42
+ * Output: the single best-matching FAPESP Cluster Area
43
+
44
+ FAPESP's Clusters represent broad disciplinary domains designed for high-level categorization of research and innovation (R&I) documents.
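The softmax output described above can be turned into a single label as in the following minimal sketch. It uses plain Python and made-up logits rather than the model itself. The label names come from the evaluation table in this card, but their index order here is an assumption; the real mapping should be read from the model's `id2label` configuration.

```python
import math

# Label names from the evaluation table. The index order is an assumption,
# not the model's confirmed id2label mapping.
LABELS = [
    "Agronomical Sciences",
    "Applied Social Sciences",
    "Biological Sciences",
    "Engineering",
    "Health Sciences",
    "Humanities",
    "Linguistics, Literature and Arts",
    "Physical Sciences and Mathematics",
]


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the best-matching label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for one document (illustrative values only).
example_logits = [0.1, -1.2, 0.4, 2.3, 0.0, -0.5, -2.0, 1.1]
label, confidence = predict(example_logits)
print(label, round(confidence, 3))
```

Because only the top class is returned, a document that genuinely spans two areas still receives one label; inspecting the full probability list is one way to spot such cases.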
45
 
46
  ## Intended uses & limitations
47
 
48
+ This multiclass model is suitable for:
49
+
50
+ - Assigning publications to **top-level scientific disciplines**
51
+ - Enriching metadata in:
52
+ - repositories
53
+ - research output systems
54
+ - funding and project datasets
55
+ - bibliometric dashboards
56
+ - Supporting scientometric analyses such as:
57
+ - broad-discipline portfolio mapping
58
+ - domain-level clustering
59
+ - modeling research diversification
60
+ - Classifying documents when only **title/abstract** is available
61
+
62
+ The model supports inputs such as:
63
+ - **title only**
64
+ - **abstract only**
65
+ - **title + abstract** (recommended)
66
+
67
+ ### Limitations
68
+ - Documents spanning multiple fields are forced into **one** label, an inherent limitation of multiclass classification.
69
+ - The training labels come from **FAPESP-funded projects**, not manual expert annotation.
70
+ - Not suitable for:
71
+ - downstream tasks requiring multilabel outputs
72
+ - WoS Categories or ASJC Areas (use separate models)
73
+ - clinical or regulatory decision-making
74
+
75
+ Predictions should be treated as **field-level disciplinary metadata**.
76
 
77
  ## Training and evaluation data
78
 
79
+ The training and evaluation dataset was constructed from publicly available [**FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo)**](https://bv.fapesp.br/pt/pesquisa/download_projetos/) research project records. These records cover funded research projects and scholarships across all scientific domains in Brazil.
80
+
81
+ The dataset was assembled using the following CSV downloads provided by FAPESP:
82
+
83
+ - **Auxílios em andamento** (ongoing research grants)
84
+ - **Auxílios concluídos** (completed research grants)
85
+ - **Bolsas no Brasil em andamento** (ongoing domestic scholarships)
86
+ - **Bolsas no Brasil concluídas** (completed domestic scholarships)
87
+ - **Bolsas no exterior em andamento** (ongoing international scholarships)
88
+ - **Bolsas no exterior concluídas** (completed international scholarships)
89
+
90
+ Each record contains metadata such as project titles, abstracts, funding type, and scientific classifications.
91
+ From these files, the following fields were extracted and standardized:
92
+
93
+ - **Title (English)**
94
+ - **Abstract (English)**
95
+ - **Grande Área do Conhecimento** (major scientific domain)
96
+ - **Área do Conhecimento** (field of study)
97
+
98
+ Only entries containing at least one English component (title or abstract) were retained.
99
+ Scientific areas were normalized and mapped to a controlled English taxonomy to ensure consistency and comparability across records.
100
+
101
+ The final dataset consists of labeled scientific text samples covering all eight domains, used as a corpus for supervised classification. Class sizes are uneven: per-class support in the evaluation results ranges from 160 to 686.
102
 
103
  ## Training procedure
104
 
105
+ ### Preprocessing
106
+ - Input text constructed as:
107
+ `abstract`
108
+ - Tokenization using the SPECTER2 tokenizer
109
+ - Maximum sequence length: **512 tokens**
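The preprocessing steps above could be sketched as follows. This is an illustration, not the author's training code: `select_text` and `encode` are hypothetical helper names, and falling back to the title when the abstract is missing is an assumption based on the card's statement that title-only inputs are supported. The tokenizer is loaded from the base checkpoint, which requires the `transformers` package and network access.

```python
from typing import Optional

MODEL_NAME = "allenai/specter2_base"  # base checkpoint named in this card
MAX_LENGTH = 512                      # maximum sequence length from this card


def select_text(title: Optional[str], abstract: Optional[str]) -> str:
    """Prefer the abstract; fall back to the title (the fallback is an assumption)."""
    for candidate in (abstract, title):
        if candidate and candidate.strip():
            return candidate.strip()
    raise ValueError("record has neither an abstract nor a title")


def encode(text: str):
    """Tokenize one document, truncating it to MAX_LENGTH tokens."""
    # Imported lazily so select_text can be used without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    return tokenizer(text, truncation=True, max_length=MAX_LENGTH)
```

Calling `encode(select_text(title, abstract))` yields the `input_ids` and `attention_mask` that a sequence classification head would consume.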
110
+
111
+ ### Model
112
+ - Base model: `allenai/specter2_base`
113
+ - Classification head: linear layer → softmax
114
+ - Loss: **CrossEntropyLoss**
115
+
116
  ### Training hyperparameters
117
 
118
  The following hyperparameters were used during training:
 
135
  | 0.1161 | 5.0 | 19035 | 1.0616 | 0.8229 | 0.8229 | 0.8207 | 0.8229 | 0.8214 | 0.8229 | 0.8205 |
136
  | 0.0864 | 6.0 | 22842 | 1.2011 | 0.8212 | 0.8212 | 0.8176 | 0.8212 | 0.8198 | 0.8212 | 0.8178 |
137
 
138
+ ### Evaluation results
139
+
140
+ | | precision | recall | f1-score | support |
141
+ |:----------------------------------|------------:|---------:|-----------:|------------:|
142
+ | Agronomical Sciences | 0.848943 | 0.805158 | 0.826471 | 349 |
143
+ | Applied Social Sciences | 0.745152 | 0.890728 | 0.811463 | 302 |
144
+ | Biological Sciences | 0.835052 | 0.826531 | 0.830769 | 686 |
145
+ | Engineering | 0.836036 | 0.890595 | 0.862454 | 521 |
146
+ | Health Sciences | 0.828283 | 0.833333 | 0.8308 | 492 |
147
+ | Humanities | 0.891648 | 0.816116 | 0.852211 | 484 |
148
+ | Linguistics, Literature and Arts | 0.855346 | 0.85 | 0.852665 | 160 |
149
+ | Physical Sciences and Mathematics | 0.872576 | 0.807692 | 0.838881 | 390 |
150
+ | accuracy | | | 0.838357 | 3384 |
151
+ | macro avg | 0.839129 | 0.840019 | 0.838214 | 3384 |
152
+ | weighted avg | 0.841008 | 0.838357 | 0.838523 | 3384 |
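As a consistency check on the table above, the summary rows can be recomputed from the per-class rows. The sketch below copies the per-class values verbatim and assumes nothing beyond the standard definitions of macro and support-weighted averages.

```python
# Per-class rows copied from the table: (precision, recall, f1-score, support).
REPORT = {
    "Agronomical Sciences": (0.848943, 0.805158, 0.826471, 349),
    "Applied Social Sciences": (0.745152, 0.890728, 0.811463, 302),
    "Biological Sciences": (0.835052, 0.826531, 0.830769, 686),
    "Engineering": (0.836036, 0.890595, 0.862454, 521),
    "Health Sciences": (0.828283, 0.833333, 0.8308, 492),
    "Humanities": (0.891648, 0.816116, 0.852211, 484),
    "Linguistics, Literature and Arts": (0.855346, 0.85, 0.852665, 160),
    "Physical Sciences and Mathematics": (0.872576, 0.807692, 0.838881, 390),
}

PRECISION, RECALL, F1, SUPPORT = range(4)


def macro_avg(column):
    """Unweighted mean of one metric over all classes."""
    return sum(row[column] for row in REPORT.values()) / len(REPORT)


def weighted_avg(column):
    """Mean of one metric weighted by each class's support."""
    total = sum(row[SUPPORT] for row in REPORT.values())
    return sum(row[column] * row[SUPPORT] for row in REPORT.values()) / total


total_support = sum(row[SUPPORT] for row in REPORT.values())
macro_f1 = macro_avg(F1)
weighted_f1 = weighted_avg(F1)
# For single-label classification, support-weighted recall equals accuracy.
accuracy = weighted_avg(RECALL)

print(total_support, round(macro_f1, 6), round(weighted_f1, 6), round(accuracy, 6))
```

The recomputed values agree with the `macro avg`, `weighted avg`, and `accuracy` rows to within rounding of the six-digit inputs.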
153
 
154
  ### Framework versions
155