|
|
--- |
|
|
language: |
|
|
- en |
|
|
- pt |
|
|
license: mit |
|
|
library_name: transformers |
|
|
tags: |
|
|
- biology |
|
|
- science |
|
|
- text-classification |
|
|
- nlp |
|
|
- biomedical |
|
|
- filter |
|
|
- deberta |
|
|
metrics: |
|
|
- f1 |
|
|
- accuracy |
|
|
- recall |
|
|
datasets: |
|
|
- Madras1/BioClass80k |
|
|
base_model: microsoft/deberta-v3-base |
|
|
widget: |
|
|
- text: The mitochondria is the powerhouse of the cell and generates ATP. |
|
|
example_title: Biology Example 🧬
|
|
- text: The stock market crashed today due to high inflation rates. |
|
|
example_title: Finance Example 💰
|
|
- text: New studies regarding CRISPR technology show promise in gene editing. |
|
|
example_title: Genetics Example 🔬
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
|
|
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-ee4c2c)](https://pytorch.org/)
|
|
[![Base Model: DeBERTa-v3-base](https://img.shields.io/badge/Base_Model-DeBERTa--v3--base-blue)](https://huggingface.co/microsoft/deberta-v3-base)
|
|
|
|
|
# DebertaBioClass 🧬
|
|
|
|
|
**DebertaBioClass** is a fine-tuned DeBERTa-v3 model designed for **high-recall** filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even if that means admitting slightly more noise than a precision-focused model would.
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Architecture:** DeBERTa-v3-base |
|
|
- **Task:** Binary Text Classification |
|
|
- **Author:** Madras1 |
|
|
- **Dataset:** [Madras1/BioClass80k](https://huggingface.co/datasets/Madras1/BioClass80k), ~80k mixed samples (synthetic + real biomedical data)
|
|
|
|
|
## ⚖️ Model Comparison: DeBERTa vs. RoBERTa
|
|
|
|
|
I have released two models for this task. Choose the one that fits your pipeline needs: |
|
|
|
|
|
| Feature | **DebertaBioClass** (This Model) | [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass) | |
|
|
| :--- | :--- | :--- | |
|
|
| **Philosophy** | **"The Vacuum Cleaner"** (High Recall) | **"The Balanced Specialist"** (Precision focus) | |
|
|
| **Best Use Case** | Building raw datasets; when missing a bio-text is unacceptable. | Final classification; when you need cleaner data with less noise. | |
|
|
| **Recall (Bio)** | **86.2%** 🏆 | 83.1% |
|
|
| **Precision (Bio)** | 72.5% | **74.4%** 🏆 |
|
|
| **Architecture** | DeBERTa (Disentangled Attention) | RoBERTa (Optimized BERT) | |
|
|
|
|
|
## Performance Metrics 📊
|
|
|
|
|
This model was trained with **Weighted Cross-Entropy Loss** to heavily penalize missed biological samples (false negatives).
|
|
|
|
|
| Metric | Score | Description | |
|
|
| :--- | :--- | :--- | |
|
|
| **Accuracy** | **86.5%** | Overall correctness | |
|
|
| **F1-Score** | **78.7%** | Harmonic mean of precision and recall | |
|
|
| **Recall (Bio)** | **86.16%** | **Share of true biological texts the model retrieves; its headline strength.** |
|
|
| **Precision** | **72.51%** | Fraction of "Bio" predictions that are actually biological |
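
As a quick sanity check, the reported F1 follows directly from the precision and recall rows above:

```python
# Harmonic mean of the precision and recall reported in the table.
precision, recall = 0.7251, 0.8616
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.787, matching the 78.7% above
```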
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Load the pipeline |
|
|
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass") |
|
|
|
|
|
# Test strings |
|
|
examples = [ |
|
|
"The mitochondria is the powerhouse of the cell.", |
|
|
"Manchester United won the match against Chelsea." |
|
|
] |
|
|
|
|
|
# Get predictions |
|
|
predictions = classifier(examples) |
|
|
print(predictions) |
|
|
``` |
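
For bulk filtering you may prefer raw probabilities over hard labels, for example to lower the decision threshold and push recall even higher. A minimal sketch; `BIO_INDEX` and the threshold value are assumptions, so inspect `model.config.id2label` first to confirm which index is the biology class:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Madras1/DebertaBioClass"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
print(model.config.id2label)  # check which index maps to the biology label

texts = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

BIO_INDEX = 1     # assumption: adjust after checking id2label above
THRESHOLD = 0.35  # below 0.5 keeps borderline texts, favoring recall
kept = [t for t, p in zip(texts, probs[:, BIO_INDEX].tolist()) if p >= THRESHOLD]
print(kept)
```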
|
|
|
|
|
## Training Procedure

- **Class Weights:** Heavily weighted toward the minority class (Biology) to maximize recall.
|
|
|
|
|
- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
|
|
|
|
|
- **Hyperparameters:** Learning rate 2e-5, batch size 16, 2 epochs.
|
|
|
|
|
- **Loss Function:** Weighted Cross-Entropy (see the sketch below).
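
The class-weighting setup can be reproduced with a custom `Trainer`. A minimal sketch; the exact weight values are not published, so `[1.0, 2.5]` below is an illustrative assumption:

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer that swaps the default loss for weighted cross-entropy."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Assumed weights: penalize errors on the "Biology" class 2.5x harder.
        weights = torch.tensor([1.0, 2.5], device=outputs.logits.device)
        loss = nn.CrossEntropyLoss(weight=weights)(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```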
|
|
|
|
|
## Limitations

- **False Positives:** Due to its high sensitivity (86% recall), this model may classify related scientific fields (such as Chemistry or Medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering; one mitigation is sketched below.
False Positives: Due to the high sensitivity (86% Recall), this model may classify related scientific fields (like Chemistry or Medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering. |