Commit 80f8d4f0

feat: initial model release

Files changed:
- .gitattributes (+36)
- LICENSE (+180)
- README.md (+119)
- config.json (+28)
- data/nemenji.png (binary)
- model.safetensors (+3, LFS pointer)
- sentencepiece.bpe.model (+3, LFS pointer)
- special_tokens_map.json (+51)
- tokenizer.json (+3, LFS pointer)
- tokenizer_config.json (+57)
.gitattributes
ADDED
@@ -0,0 +1,36 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
LICENSE
ADDED
@@ -0,0 +1,180 @@
+RigoBERTa Clinical NON-COMMERCIAL LICENSE AGREEMENT
+
+Release Date: 15-01-2025
+
+By using or distributing any portion or element of the RigoBERTa Clinical
+Language Model, you agree to be bound by this Agreement, acknowledging the
+following terms and conditions.
+
+1. DEFINITIONS
+
+1.1. Model: Refers to the RigoBERTa Clinical Language Model, owned by ADIC,
+made available under this Agreement.
+
+1.2. Licensor: ADIC.
+
+1.3. Agreement: Means this Non-Commercial License Agreement.
+
+1.4. Acceptable Use Policy: Use of the Model is subject to the
+Acceptable Use Policy defined below. The Licensee agrees to comply with
+it and ensure that users and/or third parties do not use the Model to:
+
+i) violate or encourage the violation of fundamental rights of third
+parties;
+
+ii) harm or injure any persons, particularly minors or groups of
+vulnerable, minority, or protected individuals;
+
+iii) engage in any illegal, invasive, infringing, defamatory, or
+fraudulent activities;
+
+iv) generate or disseminate false information or content intended to
+harm others;
+
+v) generate or disseminate personally identifiable information that
+could be used to harm an individual or group of individuals;
+
+vi) carry out fully automated decision-making that negatively impacts
+individuals' rights;
+
+vii) engage in, promote, or encourage illegal activities;
+
+viii) deliberately distribute viruses, worms, trojans, corrupted files,
+hoaxes, or other destructive or deceptive elements;
+
+ix) interfere with the use of the Model to cause malfunction;
+
+x) disable, interfere with, or circumvent any aspect of the Model;
+
+xi) use the Model to access any other ADIC product or service in a way
+that violates the terms of service of such other ADIC product or
+service.
+
+1.5. Derivative Work(s): Means (a) any derivative work from the Model as
+recognized by Spanish and European intellectual property laws, and (b)
+any modifications of the Model and any other model created based on or
+derived from the Model. For clarity, Derivative Works do not include the
+output of any Model.
+
+1.6. Documentation: Means any specifications, manuals, documentation, or
+other written information provided by ADIC related to the Model.
+
+1.7. Licensee: Means you, your employer, or any other person or entity
+(if you are entering into this Agreement on such person or entity's
+behalf), of the age required under applicable laws, rules or regulations
+to provide legal consent and that has legal authority to bind your
+employer or such other person or entity if you are entering into this
+Agreement on their behalf.
+
+1.8. Non-commercial Use: Means exercising any of the rights granted
+under this Agreement for research and/or non-commercial purposes.
+Non-commercial use does not include any production and/or commercial use
+of the Model or any Derivative Work.
+
+2. LICENSE RIGHTS
+
+a. Subject to your compliance with this Agreement, the Acceptable Use
+Policy, and the Documentation, ADIC grants you a non-exclusive,
+worldwide, non-transferable, non-sublicensable, revocable, royalty-free,
+and limited license under ADIC's intellectual property rights or other
+proprietary rights embodied in the Model to use, reproduce, distribute,
+and create Derivative Works from the Model, in each case solely for
+research and/or non-commercial uses.
+
+b. You may not use the Model or Derivative Works to enable third
+parties to use the Model or Derivative Works as part of your hosted
+service or via your APIs, regardless of whether you are adding
+substantial additional functionality thereto or not. Merely distributing
+the Model or Derivative Works for online download without offering any
+related service is not a violation of this section. If you wish to use
+the Model or any Derivative Work for commercial and/or production use or
+make the Model or any Derivative Work available to third parties via
+your hosted service or APIs, please contact ADIC. In case of using the
+Model or any Derivative Work for commercial and/or production purposes,
+the terms of this license will not apply, and different license terms
+and conditions must be accepted and acknowledged.
+
+c. If you distribute or make the Model or any Derivative Work available
+to a third party, such distribution or availability will remain subject
+to this Agreement, and you must (i) provide a copy of this Agreement to
+such third party and (ii) retain the following attribution notice in a
+"Notice" text file distributed as part of such copies: "RigoBERTa Clinical
+is a language model owned by ADIC and is distributed under a
+non-commercial research license granted by ADIC." If you create a
+Derivative Work from the Model, you must add your own attribution
+notices to the Notice file included with the Model, clearly indicating
+which attributions apply to the Model, and you must state in the Notice
+file that you changed the Model and describe how it was modified.
+
+3. WARRANTY DISCLAIMER
+
+Unless required by applicable law, the Model and any output or results
+therefrom are provided "as is" without warranties of any kind, either
+express or implied, including, without limitation, any warranties of
+title, non-infringement, merchantability, or fitness for a particular
+purpose. You are solely responsible for determining the appropriateness
+of using or redistributing the Model, Derivative Works, or any output or
+results, and assume any risks associated with your use of the Model,
+Derivative Works, and any output or results.
+
+4. LIMITATION OF LIABILITY
+
+In no event will ADIC be liable under any theory of liability, whether
+in contract, tort, negligence, products liability or otherwise, arising
+out of this Agreement, for any lost profits or any direct, indirect,
+special, consequential, incidental, or punitive damages, even if ADIC
+has been advised of the possibility of any of the foregoing. ADIC shall
+not be liable to the Licensee for any use of the Model in violation of
+the terms and conditions of this Agreement.
+
+5. INTELLECTUAL PROPERTY
+
+a. No trademark licenses are granted under this Agreement, and neither
+ADIC nor the Licensee may use any name or mark owned by or associated
+with the other except as required for reasonable and customary use in
+describing and redistributing the Model or Derivative Works.
+
+b. All rights, title, and interest in and to the Model, including all
+intellectual property rights, are and will remain the exclusive property
+of ADIC.
+
+c. Subject to ADIC's ownership of the Model and Derivative Works
+created by or for ADIC, regarding any Derivative Work created by you, as
+between you and ADIC, you shall own such Derivative Works. Similarly,
+any Derivative Work not commissioned by ADIC shall be owned by you.
+
+d. If you initiate litigation or other proceedings against ADIC
+(including a counterclaim in a lawsuit) alleging that the Model,
+Derivative Works, outputs, or associated products, or any part thereof,
+infringe intellectual property rights or other rights you own or are
+licensable by you, any license granted to you under this Agreement will
+terminate as of the date such litigation or claim is filed or initiated.
+You agree to indemnify and hold ADIC harmless from any third-party
+claims arising out of or related to your use or distribution of the
+Model or Derivative Works in violation of this Agreement.
+
+6. TERM AND TERMINATION
+
+The term of this Agreement will commence upon your acceptance of this
+Agreement or your access to the Model and will continue in full force
+and effect until terminated in accordance with the terms and conditions
+herein.
+
+This Agreement will automatically terminate if the Licensee breaches any
+of its terms and conditions. Upon termination, the Licensee must cease
+using the Model and delete all copies of it from any device on which it
+may reside. Clauses 3 to 5 shall survive the termination of this
+Agreement.
+
+7. GENERAL PROVISIONS
+
+7.1. Governing Law: This Agreement shall be governed by and construed in
+accordance with the laws of Spain.
+
+7.2. Entire Agreement: This Agreement constitutes the entire agreement
+between ADIC and the Licensee regarding the use of the Model and
+supersedes all prior agreements and understandings, whether oral or
+written, relating to the Model.
+
+7.3. Modifications: No modification to this Agreement shall be valid
+unless in writing and signed by both ADIC and the Licensee.
README.md
ADDED
@@ -0,0 +1,119 @@
+---
+library_name: transformers
+license: other
+license_name: rigoclinical-nc
+license_link: https://huggingface.co/IIC/RigoBERTa-Clinical/blob/main/LICENSE
+datasets:
+- IIC/ClinText-SP
+language:
+- es
+pipeline_tag: fill-mask
+---
+
+# RigoBERTa Clinical
+
+**RigoBERTa Clinical** is a state-of-the-art clinical encoder language model for Spanish, developed through domain-adaptive pretraining on the largest publicly available Spanish clinical corpus, **ClinText-SP**. It significantly improves performance on multiple clinical NLP benchmarks while offering robust language understanding in the clinical domain.
+
+## Model Details
+
+### Model Description
+
+**RigoBERTa Clinical** was built by further pretraining the general-purpose RigoBERTa 2 on a meticulously curated clinical corpus. The pretraining uses masked language modeling (MLM) to adapt the model's linguistic knowledge to the Spanish clinical domain.
+
+- **Developed by:** IIC
+- **Model type:** Encoder
+- **Language(s) (NLP):** Spanish
+- **License:** rigoclinical-nc (non-commercial)
+- **Finetuned from model:** RigoBERTa 2
+
+### Model Sources
+
+- **Paper:** [ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP](https://arxiv.org/abs/2503.18594)
+
+## Intended Use & Limitations
+
+### Intended Use
+
+**RigoBERTa Clinical** is designed for:
+
+- Clinical text understanding in Spanish.
+- Healthcare NLP tasks such as clinical note classification, entity recognition in clinical texts, and related downstream tasks.
+- Research and development purposes, including benchmarking and further model adaptation.
+
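Since the model card's `pipeline_tag` is `fill-mask`, the most direct way to exercise the model is the `transformers` fill-mask pipeline. A minimal sketch, assuming `transformers` is installed; the repo id `IIC/RigoBERTa-Clinical` is taken from the license link above, and the example sentence and helper function are hypothetical illustrations, not part of the official API:

```python
def build_masked_example(text: str, term: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of a clinical term with the mask token
    (`<mask>`, per this repo's special_tokens_map.json)."""
    return text.replace(term, mask_token, 1)


if __name__ == "__main__":
    # Heavy part: downloads the ~2.2 GB checkpoint on first run.
    from transformers import pipeline

    fill = pipeline("fill-mask", model="IIC/RigoBERTa-Clinical")
    example = build_masked_example("El paciente presenta fiebre y disnea.", "disnea")
    for pred in fill(example, top_k=3):
        print(pred["token_str"], round(pred["score"], 3))
```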
+### Limitations & Caveats
+
+- **Domain Specificity:** Although highly effective for Spanish clinical texts, the model may not generalize to other domains or languages.
+- **Data Biases:** ClinText-SP, while the largest such corpus available, may contain biases stemming from source selection and the inherent limitations of public clinical data.
+- **Operational Cost:** Although encoder-based models are computationally much cheaper than generative LLMs, deployment in resource-constrained settings should still be evaluated carefully.
+
+## Training Details
+
+### Training Data: ClinText-SP
+
+ClinText-SP is the largest open Spanish clinical corpus and aggregates data from various open sources:
+
+- **Volume:** ~26 million tokens across 35,996 samples
+- **Sample Details:** ~700 tokens per sample on average; contains both long-form clinical cases and shorter, schematic texts.
+- **Sources:** Medical journals, clinical shared tasks, radiological reports, and Wikipedia extracts.
+- **Availability:** [ClinText-SP](https://huggingface.co/datasets/IIC/ClinText-SP) on Hugging Face Datasets
+
+### Training Procedure
+
+#### Preprocessing
+
+- **Tokenizer:** Reuses the RigoBERTa 2 tokenizer to ensure consistency with the base model.
+- **Handling Long Sequences:** Clinical texts exceeding 512 tokens are segmented into overlapping windows with a stride of 128 tokens; shorter sequences are padded as necessary.
+- **OOV Handling:** Out-of-vocabulary words are handled via subword tokenization, maintaining robust coverage of clinical terminology.
+
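The sliding-window segmentation described above can be sketched in plain Python. This is an illustrative sketch, not the actual training code; `stride` is taken to mean the overlap between consecutive windows, as in the Hugging Face tokenizers convention:

```python
def segment_ids(token_ids, max_len=512, stride=128):
    """Split a long token-id sequence into windows of at most `max_len`
    tokens, with consecutive windows overlapping by `stride` tokens."""
    if len(token_ids) <= max_len:
        return [token_ids]
    step = max_len - stride  # 384 new tokens per window
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window reached the end of the sequence
        start += step
    return windows
```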
+#### Training Details
+
+- **Objective:** Masked language modeling (MLM)
+- **Epochs:** 2 full epochs (with the best checkpoint selected after ~1.8 epochs, based on downstream performance)
+- **Hyperparameter Grid:**
+  - **Batch Sizes:** 32, 64, 128
+  - **Learning Rates:** {5e-6, 1e-5, 2e-5} for batch size 32, {1e-5, 2e-5, 4e-5} for 64, and {1e-5, 4e-5, 8e-5} for 128
+  - **Best Settings:** Batch size = 32, learning rate = 2e-5, ~2,800 training steps (~1.8 epochs)
+- **Optimizer:** AdamW with weight decay 0.1
+- **Hardware:** A single NVIDIA A100 GPU (80 GB memory)
+
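The grid above pairs each batch size with its own learning-rate range, giving nine configurations in total. A small, purely illustrative sketch of how that search space could be enumerated:

```python
# Hyperparameter grid from the model card: each batch size is paired
# with its own set of learning rates (9 combinations in total).
GRID = {
    32: [5e-6, 1e-5, 2e-5],
    64: [1e-5, 2e-5, 4e-5],
    128: [1e-5, 4e-5, 8e-5],
}

def grid_combinations(grid):
    """Enumerate (batch_size, learning_rate) pairs."""
    return [(bs, lr) for bs, lrs in grid.items() for lr in lrs]

BEST = (32, 2e-5)  # the configuration reported as best (~2,800 steps)
```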
+## Evaluation
+
+RigoBERTa Clinical was evaluated on several Spanish clinical NLP tasks, including named entity recognition (NER) and multilabel classification. Evaluation metrics (F1 score and micro-averaged F1) indicate that the model outperforms previous clinical and general-domain Spanish language models.
+
+**Key Results:**
+
+- Achieves top performance on datasets such as cantemist, meddocan, and livingner1, among others.
+- Consistently surpasses models trained solely on clinical data, demonstrating the advantage of leveraging general-domain knowledge during domain adaptation.
+- Detailed benchmarking results and comparisons are provided in the associated publication.
+
+For a full breakdown of results (including performance of multilingual baselines and other clinical-specific models), please refer to Table 1 and the Nemenyi plot in the original paper.
+
+![Nemenyi plot comparing RigoBERTa Clinical with baseline models](data/nemenji.png)
+
+## Citation
+
+If you use RigoBERTa Clinical in your research, please cite the associated paper:
+
+**BibTeX:**
+
+```bibtex
+@misc{subies2025clintextsprigobertaclinicalnew,
+      title={ClinText-SP and RigoBERTa Clinical: a new set of open resources for Spanish Clinical NLP},
+      author={Guillem García Subies and Álvaro Barbero Jiménez and Paloma Martínez Fernández},
+      year={2025},
+      eprint={2503.18594},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2503.18594},
+}
+```
+
+**APA:**
+
+```
+Subies, G. G., Barbero Jiménez, Á., & Martínez Fernández, P. (2025). ClinText-SP and RigoBERTa Clinical: A new set of open resources for Spanish Clinical NLP. arXiv. https://arxiv.org/abs/2503.18594
+```
+
+## Model Card Authors and Contact
+
+Guillem García Subies: guillem.garcia@iic.uam.es, 100500844@alumnos.uc3m.es
config.json
ADDED
@@ -0,0 +1,28 @@
+{
+  "_name_or_path": "/disco/guillem.garcia/models/bs32lr2e5/train_save/checkpoint-2800",
+  "architectures": [
+    "XLMRobertaModel"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "classifier_dropout": null,
+  "eos_token_id": 2,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 4096,
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "xlm-roberta",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 24,
+  "output_past": true,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.46.1",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 250002
+}
data/nemenji.png
ADDED
(binary file)
model.safetensors
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ae961544b4a9a3f53a0c417cf2492d1e07efc5b6532c2afd424bab5208de4912
+size 2239607176
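The pointer's `size` is consistent with an XLM-R-Large-sized encoder stored in float32 (`"torch_dtype": "float32"` in config.json, i.e. 4 bytes per parameter). A quick sanity check of that arithmetic; the estimate ignores the small safetensors header, so it is slightly above the true parameter count:

```python
CHECKPOINT_BYTES = 2239607176  # "size" from the LFS pointer above
BYTES_PER_PARAM = 4            # float32, per config.json

def approx_param_count(num_bytes: int, bytes_per_param: int = BYTES_PER_PARAM) -> int:
    """Rough parameter count implied by the checkpoint file size."""
    return num_bytes // bytes_per_param

print(approx_param_count(CHECKPOINT_BYTES))  # 559901794, i.e. ~560M parameters
```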
sentencepiece.bpe.model
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+size 5069051
special_tokens_map.json
ADDED
@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3a56def25aa40facc030ea8b0b87f3688e4b3c39eb8b45d5702b3a1300fe2a20
+size 17082734
tokenizer_config.json
ADDED
@@ -0,0 +1,57 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "do_lower_case": false,
+  "eos_token": "</s>",
+  "keep_accents": true,
+  "mask_token": "<mask>",
+  "model_max_len": 512,
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "XLMRobertaTokenizer",
+  "unk_token": "<unk>"
+}
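The ids in `added_tokens_decoder` determine how `XLMRobertaTokenizer` frames inputs: `<s>` (id 0) and `</s>` (id 2) wrap a single sequence, and sequence pairs are joined with a doubled `</s></s>` separator, as in XLM-R. A small sketch of that layout (illustrative only; the real logic lives in the tokenizer class):

```python
BOS, EOS = 0, 2  # <s> and </s>, per added_tokens_decoder above

def build_inputs(ids_a, ids_b=None):
    """Mimic XLM-R input formatting: <s> A </s> for a single sequence,
    <s> A </s></s> B </s> for a sequence pair."""
    if ids_b is None:
        return [BOS] + list(ids_a) + [EOS]
    return [BOS] + list(ids_a) + [EOS, EOS] + list(ids_b) + [EOS]
```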