|
|
--- |
|
|
language: |
|
|
- en |
|
|
- pt |
|
|
license: mit |
|
|
library_name: transformers |
|
|
tags: |
|
|
- biology |
|
|
- science |
|
|
- text-classification |
|
|
- nlp |
|
|
- biomedical |
|
|
- filter |
|
|
- roberta |
|
|
- medical |
|
|
metrics: |
|
|
- f1 |
|
|
- accuracy |
|
|
- recall |
|
|
datasets: |
|
|
- Madras1/BioClass80k |
|
|
base_model: roberta-base |
|
|
widget: |
|
|
- text: The mitochondria is the powerhouse of the cell and generates ATP. |
|
|
example_title: Biology Example 🧬 |
|
|
- text: The stock market crashed today due to high inflation rates. |
|
|
example_title: Finance Example 💰 |
|
|
- text: CRISPR-Cas9 technology allows for precise gene editing. |
|
|
example_title: Genetics Example 🔬 |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
[License: MIT](https://opensource.org/licenses/MIT) · [PyTorch](https://pytorch.org/) · [Task: Text Classification](https://huggingface.co/tasks/text-classification) · [Python](https://www.python.org/)
|
|
|
|
|
# RobertaBioClass 🧬 |
|
|
|
|
|
**RobertaBioClass** is a fine-tuned RoBERTa model designed to distinguish biological texts from other general topics. It was trained to filter large datasets, prioritizing high recall to ensure relevant biological content is captured. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Architecture:** RoBERTa Base |
|
|
- **Task:** Binary Text Classification |
|
|
- **Languages:** English and Portuguese (Portuguese coverage depends on the training data mix)
|
|
- **Author:** Madras1 |
|
|
|
|
|
## Performance Metrics 📊 |
|
|
|
|
|
The model was evaluated on a held-out validation set of ~16k samples. It is optimized for **High Recall**, making it excellent for filtering pipelines where missing a biological text is worse than including a false positive. |
|
|
|
|
|
| Metric | Score | Description | |
|
|
| :--- | :--- | :--- | |
|
|
| **Accuracy** | **86.8%** | Overall correctness | |
|
|
| **F1-Score** | **78.5%** | Harmonic mean of precision and recall | |
|
|
| **Recall (Bio)** | **83.1%** | Ability to find biological texts (Sensitivity) | |
|
|
| **Precision** | **74.4%** | Correctness when predicting "Bio" | |
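These scores can be reproduced from raw prediction arrays with scikit-learn. A minimal sketch, using hypothetical label arrays (the actual evaluation arrays are not published), where `1` = Biology and `0` = Non-Biology:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical ground-truth and predicted labels (1 = Bio, 0 = Non-Bio)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"F1 (Bio):  {f1_score(y_true, y_pred, pos_label=1):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred, pos_label=1):.3f}")
print(f"Precision: {precision_score(y_true, y_pred, pos_label=1):.3f}")
```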
|
|
|
|
|
## Label Mapping |
|
|
|
|
|
The model outputs the following labels: |
|
|
* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.) |
|
|
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.) |
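Since the model emits generic `LABEL_0`/`LABEL_1` names, a small post-processing map can make pipeline output readable. A sketch (the `readable` helper is hypothetical, not part of the model):

```python
LABEL_NAMES = {"LABEL_0": "Non-Biology", "LABEL_1": "Biology"}

def readable(predictions):
    """Replace generic pipeline labels with human-readable names."""
    return [
        {"label": LABEL_NAMES[p["label"]], "score": p["score"]}
        for p in predictions
    ]

# Example with a mocked pipeline output:
raw = [{"label": "LABEL_1", "score": 0.99}]
print(readable(raw))  # [{'label': 'Biology', 'score': 0.99}]
```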
|
|
|
|
|
## Training Data & Procedure |
|
|
|
|
|
### Data Overview |
|
|
The dataset consists of approximately **80,000 text samples** aggregated from multiple sources. |
|
|
* **Total Samples:** ~79,700 |
|
|
* **Class Balance:** The dataset was imbalanced, with ~71% belonging to the "Non-Bio" class and ~29% to the "Bio" class. |
|
|
* **Preprocessing:** Scripts were used to clean delimiter issues in CSVs, remove duplicates, and perform a stratified split for validation. |
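The deduplication and stratified-split steps can be sketched with pandas and scikit-learn. Column names and the toy data below are assumptions; the actual preprocessing scripts are not published:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the ~80k-row dataset ("label": 1 = Bio, 0 = Non-Bio)
df = pd.DataFrame({
    "text": [f"sample {i}" for i in range(10)],
    "label": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # ~70/30 imbalance, as in the card
})
df = df.drop_duplicates(subset="text")  # remove duplicate texts

# A stratified 80/20 split preserves the class ratio in both halves
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
print(len(train_df), len(val_df))
```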
|
|
|
|
|
### Training Procedure |
|
|
To address the class imbalance without discarding data through undersampling, we employed a custom **Weighted Cross-Entropy Loss**.
|
|
* **Class Weights:** Calculated using `sklearn.utils.class_weight`. The model was penalized significantly more for missing a Biology sample than for misclassifying a general text, which directly contributed to the high Recall score. |
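The weight computation can be sketched as follows, using a toy label array with the same ~71/29 split described above (the actual training labels are not published):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels reproducing the ~71% Non-Bio / ~29% Bio imbalance
labels = np.array([0] * 71 + [1] * 29)

# "balanced" gives n_samples / (n_classes * n_class_samples),
# so the minority Bio class receives the larger weight
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=labels)
print(weights)  # [0.70422535 1.72413793]
```

Inside a `Trainer` subclass, weights like these would typically be passed to `torch.nn.CrossEntropyLoss(weight=...)` in an overridden `compute_loss`; the exact override used for this model is not published.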
|
|
|
|
|
### Hyperparameters |
|
|
The model was fine-tuned using the Hugging Face `Trainer` with the following configuration: |
|
|
* **Optimizer:** AdamW |
|
|
* **Learning Rate:** 2e-5 |
|
|
* **Batch Size:** 16 |
|
|
* **Epochs:** 2 |
|
|
* **Weight Decay:** 0.01 |
|
|
* **Hardware:** Trained on an NVIDIA T4 GPU
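With the Hugging Face `Trainer`, the configuration above corresponds roughly to the following `TrainingArguments`. A sketch only: the output path and evaluation settings are assumptions, and AdamW is the `Trainer` default optimizer.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./robertabioclass",   # hypothetical path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,    # assumption; eval batch size not stated
    num_train_epochs=2,
    weight_decay=0.01,
)
```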
|
|
|
|
|
## How to Use |
|
|
|
|
|
You can use this model directly with the Hugging Face `pipeline`: |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Load the pipeline |
|
|
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass") |
|
|
|
|
|
# Test strings |
|
|
examples = [ |
|
|
"The mitochondria is the powerhouse of the cell.", |
|
|
"The stock market crashed yesterday due to inflation." |
|
|
] |
|
|
|
|
|
# Get predictions |
|
|
predictions = classifier(examples) |
|
|
print(predictions) |
|
|
# Output: |
|
|
# [{'label': 'LABEL_1', 'score': 0.99...}, <- Biology |
|
|
# {'label': 'LABEL_0', 'score': 0.98...}] <- Non-Biology |
|
|
|
|
|
``` |
|
|
|
|
|
 |
|
|
|
|
|
## Intended Use

This model is well suited for:

* Filtering biological text from Common Crawl or other web datasets.
* Categorizing academic papers.
* Tagging educational content.
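For large-scale filtering, processing texts in batches avoids per-example overhead. A sketch of such a filtering loop (the `filter_bio` helper is hypothetical; a mock classifier stands in for the real pipeline so the example runs without downloading the model):

```python
def filter_bio(texts, classifier, batch_size=32, threshold=0.5):
    """Keep only texts the classifier labels as Biology (LABEL_1)."""
    kept = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        for text, pred in zip(batch, classifier(batch)):
            if pred["label"] == "LABEL_1" and pred["score"] >= threshold:
                kept.append(text)
    return kept

# Mock standing in for pipeline("text-classification", model="Madras1/RobertaBioClass")
def mock_classifier(batch):
    return [
        {"label": "LABEL_1" if "cell" in t else "LABEL_0", "score": 0.9}
        for t in batch
    ]

texts = ["The cell divides by mitosis.", "Rates rose sharply today."]
print(filter_bio(texts, mock_classifier))  # ['The cell divides by mitosis.']
```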
|
|
|
|
|
## Limitations

Because the model prioritizes recall (83%), it produces some false positives (precision ~74%) and may occasionally classify adjacent scientific fields, such as chemistry or physics, as Biology depending on the context.
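If false positives matter for a downstream application, the decision threshold on the Biology score can be raised above the default 0.5. A sketch using per-class scores as returned by `pipeline(..., top_k=None)`; the `is_bio` helper and the 0.8 threshold are illustrative, not part of the model:

```python
def is_bio(scores, threshold=0.8):
    """scores: list of {'label', 'score'} dicts for one text,
    as returned by pipeline(..., top_k=None)."""
    bio_score = next(s["score"] for s in scores if s["label"] == "LABEL_1")
    return bio_score >= threshold

# Hypothetical per-class scores for one borderline text
scores = [
    {"label": "LABEL_0", "score": 0.35},
    {"label": "LABEL_1", "score": 0.65},
]
print(is_bio(scores))                 # False at the stricter 0.8 threshold
print(is_bio(scores, threshold=0.5))  # True at the default 0.5
```

Trading a little recall for precision this way is often worthwhile when the filtered corpus feeds an expensive downstream step.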