Update README.md
README.md
CHANGED
|
@@ -43,7 +43,7 @@ model-index:
|
|
| 43 |
CarD-T (Carcinogen Detection via Transformers) is a novel text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. This model is designed to address the challenges faced by current systems in managing the burgeoning biomedical literature related to carcinogen identification and classification.
|
| 44 |
|
| 45 |
## Model Details
|
| 46 |
-
* **Architecture**: Based on Bio-ELECTRA, a 335 million parameter language model
|
| 47 |
* **Training Data**: [CarD-T-NER dataset](https://huggingface.co/datasets/jimnoneill/CarD-T-NER) containing 19,975 annotated examples from PubMed abstracts (2000-2024)
|
| 48 |
* Training set: 11,985 examples
|
| 49 |
* Test set: 7,990 examples
|
|
@@ -255,7 +255,7 @@ training_args = TrainingArguments(
|
|
| 255 |
learning_rate=2e-5,
|
| 256 |
per_device_train_batch_size=16,
|
| 257 |
per_device_eval_batch_size=16,
|
| 258 |
-
num_train_epochs=
|
| 259 |
weight_decay=0.01,
|
| 260 |
evaluation_strategy="epoch",
|
| 261 |
save_strategy="epoch",
|
|
@@ -265,19 +265,6 @@ training_args = TrainingArguments(
|
|
| 265 |
)
|
| 266 |
```
|
| 267 |
|
| 268 |
-
## Evaluation Metrics
|
| 269 |
-
|
| 270 |
-
Detailed performance metrics on the test set (7,990 examples):
|
| 271 |
-
|
| 272 |
-
| Entity Type | Precision | Recall | F1-Score | Support |
|
| 273 |
-
|-------------|-----------|---------|----------|---------|
|
| 274 |
-
| carcinogen | 0.912 | 0.878 | 0.895 | 2,341 |
|
| 275 |
-
| negative | 0.867 | 0.823 | 0.844 | 987 |
|
| 276 |
-
| cancertype | 0.889 | 0.856 | 0.872 | 3,124 |
|
| 277 |
-
| antineoplastic | 0.908 | 0.871 | 0.889 | 1,456 |
|
| 278 |
-
| **Overall** | **0.894** | **0.857** | **0.875** | **7,908** |
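
As a consistency check on the table above, each F1 score can be recomputed from its precision and recall as their harmonic mean, and the per-entity supports can be summed against the overall row. This is a minimal sketch: the `f1` helper and the `rows` dictionary are illustrative names, not part of the README, and the numbers are copied from the table rather than produced by running the model.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1, support), copied from the table
rows = {
    "carcinogen": (0.912, 0.878, 0.895, 2341),
    "negative": (0.867, 0.823, 0.844, 987),
    "cancertype": (0.889, 0.856, 0.872, 3124),
    "antineoplastic": (0.908, 0.871, 0.889, 1456),
}
overall = (0.894, 0.857, 0.875, 7908)

for name, (p, r, reported, _support) in rows.items():
    print(f"{name}: recomputed F1 {f1(p, r):.3f}, reported {reported:.3f}")

# Sum of per-entity supports, to compare against the overall row
total_support = sum(support for *_, support in rows.values())
print(f"summed support: {total_support}, reported overall: {overall[3]}")
```

The reported F1 values agree with the recomputed ones to three decimals, and the supports sum to the overall figure of 7,908.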
|
| 279 |
-
|
| 280 |
-
## Citation
|
| 281 |
|
| 282 |
If you use this model in your research, please cite:
|
| 283 |
|
|
|
|
| 43 |
CarD-T (Carcinogen Detection via Transformers) is a novel text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. This model is designed to address the challenges faced by current systems in managing the burgeoning biomedical literature related to carcinogen identification and classification.
|
| 44 |
|
| 45 |
## Model Details
|
| 46 |
+
* **Architecture**: Based on Bio-ELECTRA, a 335 million parameter language model (sultan/BioM-ELECTRA-Large-SQuAD2)
|
| 47 |
* **Training Data**: [CarD-T-NER dataset](https://huggingface.co/datasets/jimnoneill/CarD-T-NER) containing 19,975 annotated examples from PubMed abstracts (2000-2024)
|
| 48 |
* Training set: 11,985 examples
|
| 49 |
* Test set: 7,990 examples
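
The split sizes listed above can be checked with a few lines of arithmetic. This is only a sketch over the counts quoted in the README; the variable names are illustrative.

```python
total_examples = 19_975
train_examples = 11_985
test_examples = 7_990

# The two splits should account for every annotated example
assert train_examples + test_examples == total_examples

train_share = train_examples / total_examples
test_share = test_examples / total_examples
print(f"train share: {train_share:.1%}, test share: {test_share:.1%}")
```

The counts correspond to a 60/40 train/test split.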
|
|
|
|
| 255 |
learning_rate=2e-5,
|
| 256 |
per_device_train_batch_size=16,
|
| 257 |
per_device_eval_batch_size=16,
|
| 258 |
+
num_train_epochs=5,
|
| 259 |
weight_decay=0.01,
|
| 260 |
evaluation_strategy="epoch",
|
| 261 |
save_strategy="epoch",
|
|
|
|
| 265 |
)
|
| 266 |
```
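
To get a feel for what `num_train_epochs=5` implies, the number of optimizer steps can be estimated from the batch size and the 11,985-example training set listed under Model Details. This is a rough sketch under stated assumptions: a single device, no gradient accumulation, the full training set used for fine-tuning, and a final partial batch that is kept rather than dropped. None of these are confirmed by the README.

```python
import math

train_examples = 11_985            # training set size quoted in the README
per_device_train_batch_size = 16   # from the TrainingArguments in the README
num_train_epochs = 5               # value set by this change

num_devices = 1                    # assumption: single GPU
gradient_accumulation_steps = 1    # assumption: no accumulation

effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

# The last, partial batch still counts as a step
steps_per_epoch = math.ceil(train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

Under these assumptions, training runs for 750 steps per epoch and 3,750 steps in total, with evaluation and a checkpoint at the end of each epoch.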
|
| 267 |
|
| 268 |
|
| 269 |
If you use this model in your research, please cite:
|
| 270 |
|