Commit 1d44aa7 by jimnoneill · verified · 1 parent: 0adb5e2

Update README.md

  1. README.md +2 -15
README.md CHANGED
@@ -43,7 +43,7 @@ model-index:
43
  CarD-T (Carcinogen Detection via Transformers) is a novel text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. This model is designed to address the challenges faced by current systems in managing the burgeoning biomedical literature related to carcinogen identification and classification.
44
 
45
  ## Model Details
46
- * **Architecture**: Based on Bio-ELECTRA, a 335 million parameter language model
47
  * **Training Data**: [CarD-T-NER dataset](https://huggingface.co/datasets/jimnoneill/CarD-T-NER) containing 19,975 annotated examples from PubMed abstracts (2000-2024)
48
  * Training set: 11,985 examples
49
  * Test set: 7,990 examples
@@ -255,7 +255,7 @@ training_args = TrainingArguments(
255
  learning_rate=2e-5,
256
  per_device_train_batch_size=16,
257
  per_device_eval_batch_size=16,
258
- num_train_epochs=3,
259
  weight_decay=0.01,
260
  evaluation_strategy="epoch",
261
  save_strategy="epoch",
@@ -265,19 +265,6 @@ training_args = TrainingArguments(
265
  )
266
  ```
267
 
268
- ## Evaluation Metrics
269
-
270
- Detailed performance metrics on the test set (7,990 examples):
271
-
272
- | Entity Type | Precision | Recall | F1-Score | Support |
273
- |-------------|-----------|---------|----------|---------|
274
- | carcinogen | 0.912 | 0.878 | 0.895 | 2,341 |
275
- | negative | 0.867 | 0.823 | 0.844 | 987 |
276
- | cancertype | 0.889 | 0.856 | 0.872 | 3,124 |
277
- | antineoplastic | 0.908 | 0.871 | 0.889 | 1,456 |
278
- | **Overall** | **0.894** | **0.857** | **0.875** | **7,908** |
279
-
280
- ## Citation
281
 
282
  If you use this model in your research, please cite:
283
 
 
43
  CarD-T (Carcinogen Detection via Transformers) is a novel text analytics approach that combines transformer-based machine learning with probabilistic statistical analysis to efficiently nominate carcinogens from scientific texts. This model is designed to address the challenges faced by current systems in managing the burgeoning biomedical literature related to carcinogen identification and classification.
44
 
45
  ## Model Details
46
+ * **Architecture**: Based on Bio-ELECTRA, a 335 million parameter language model (sultan/BioM-ELECTRA-Large-SQuAD2)
47
  * **Training Data**: [CarD-T-NER dataset](https://huggingface.co/datasets/jimnoneill/CarD-T-NER) containing 19,975 annotated examples from PubMed abstracts (2000-2024)
48
  * Training set: 11,985 examples
49
  * Test set: 7,990 examples
 
255
  learning_rate=2e-5,
256
  per_device_train_batch_size=16,
257
  per_device_eval_batch_size=16,
258
+ num_train_epochs=5,
259
  weight_decay=0.01,
260
  evaluation_strategy="epoch",
261
  save_strategy="epoch",
 
265
  )
266
  ```
267
 
268
 
269
  If you use this model in your research, please cite:
270