---
language:
- en
- pt
license: mit
library_name: transformers
tags:
- biology
- science
- text-classification
- nlp
- biomedical
- filter
- deberta
metrics:
- f1
- accuracy
- recall
datasets:
- Madras1/BioClass80k
base_model: microsoft/deberta-v3-base
widget:
- text: The mitochondria is the powerhouse of the cell and generates ATP.
  example_title: Biology Example 🧬
- text: The stock market crashed today due to high inflation rates.
  example_title: Finance Example 💰
- text: New studies regarding CRISPR technology show promise in gene editing.
  example_title: Genetics Example 🔬
pipeline_tag: text-classification
---
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-red)](https://pytorch.org/)
[![Base: DeBERTa-v3-base](https://img.shields.io/badge/Base-DeBERTa--v3--base-blue)](https://huggingface.co/microsoft/deberta-v3-base)
# DebertaBioClass 🧬
**DebertaBioClass** is a fine-tuned DeBERTa-v3 model designed for **high-recall** filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even at the cost of capturing slightly more noise than precision-focused alternatives.
## Model Details
- **Model Architecture:** DeBERTa-v3-base
- **Task:** Binary Text Classification
- **Author:** Madras1
- **Dataset:** ~80k mixed samples (Synthetic + Real Biomedical Data)
## ⚖️ Model Comparison: DeBERTa vs. RoBERTa
I have released two models for this task. Choose the one that fits your pipeline needs:
| Feature | **DebertaBioClass** (This Model) | [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass) |
| :--- | :--- | :--- |
| **Philosophy** | **"The Vacuum Cleaner"** (High Recall) | **"The Balanced Specialist"** (Precision focus) |
| **Best Use Case** | Building raw datasets; when missing a bio-text is unacceptable. | Final classification; when you need cleaner data with less noise. |
| **Recall (Bio)** | **86.2%** 🏆 | 83.1% |
| **Precision (Bio)** | 72.5% | **74.4%** 🏆 |
| **Architecture** | DeBERTa (Disentangled Attention) | RoBERTa (Optimized BERT) |
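The two philosophies compose naturally: this model can serve as a broad first pass over raw data, with RobertaBioClass re-scoring the survivors for a cleaner final set. Below is a minimal sketch of that two-stage pattern; the positive label name (`LABEL_1`) is an assumption, so verify the `id2label` mapping in each model's config before relying on it:
```python
from transformers import pipeline

# Stage 1: high-recall filter (this model) keeps anything plausibly biological.
coarse = pipeline("text-classification", model="Madras1/DebertaBioClass")
# Stage 2: the precision-focused sibling model cleans up the survivors.
fine = pipeline("text-classification", model="Madras1/RobertaBioClass")

texts = [
    "CRISPR-Cas9 enables precise edits to genomic DNA.",
    "The stock market crashed today due to high inflation rates.",
]

BIO = "LABEL_1"  # assumed positive label; check the model config
candidates = [t for t, p in zip(texts, coarse(texts)) if p["label"] == BIO]
bio_texts = [t for t, p in zip(candidates, fine(candidates)) if p["label"] == BIO]
print(bio_texts)
```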
## Performance Metrics 📊
This model was trained with **Weighted Cross-Entropy Loss** to strictly penalize missing biological samples.
| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.5%** | Overall correctness |
| **F1-Score** | **78.7%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **86.16%** | **Highlights the model's ability to find hidden bio texts.** |
| **Precision (Bio)** | **72.51%** | Confidence when predicting "Bio" |
## How to Use
```python
from transformers import pipeline
# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")
# Test strings
examples = [
"The mitochondria is the powerhouse of the cell.",
"Manchester United won the match against Chelsea."
]
# Get predictions
predictions = classifier(examples)
print(predictions)
```
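When you need scores rather than hard labels, for example to rank a corpus by how biological each text looks, you can call the model directly. A minimal sketch; the index of the "Bio" class (1 below) is an assumption and should be verified against the model's `id2label` config:
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Madras1/DebertaBioClass")
model = AutoModelForSequenceClassification.from_pretrained("Madras1/DebertaBioClass")
model.eval()

texts = ["Ribosomes translate mRNA into proteins."]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the two classes; index 1 is assumed to be "Bio".
probs = torch.softmax(logits, dim=-1)
print(probs[:, 1])  # per-text probability of being biological
```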
## Training Procedure
- **Class Weights:** Heavily weighted towards the minority class (Biology) to maximize recall.
- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
- **Hyperparameters:** Learning rate 2e-5, batch size 16, 2 epochs.
- **Loss Function:** Weighted cross-entropy (sketched below).
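The exact training script is not published here, but a weighted loss of this kind is commonly reproduced by overriding `compute_loss` in a `Trainer` subclass. The sketch below follows that assumption; the `[1.0, 2.0]` weights are placeholders, not the values used in training:
```python
import torch
from torch import nn
from transformers import Trainer, TrainingArguments

class WeightedTrainer(Trainer):
    """Trainer variant that penalizes missed 'Bio' samples more heavily."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Placeholder weights: class 1 ("Bio") is up-weighted to push recall up.
        weights = torch.tensor([1.0, 2.0], device=outputs.logits.device)
        loss = nn.functional.cross_entropy(outputs.logits, labels, weight=weights)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="deberta-bioclass",
    learning_rate=2e-5,               # matches the hyperparameters above
    per_device_train_batch_size=16,
    num_train_epochs=2,
)
# WeightedTrainer(model=..., args=args, train_dataset=...).train()
```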
## Limitations
- **False Positives:** Because of its high sensitivity (86% recall), this model may classify texts from adjacent scientific fields (such as chemistry or medicine) as "Biology". This is intentional behavior, ensuring no relevant data is lost during filtering; a possible mitigation is sketched below.
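If that noise is costly for your downstream task, one post-hoc mitigation (not part of the released model, just a sketch) is to threshold the "Bio" probability more strictly instead of taking the argmax, trading some recall for precision:
```python
import torch

# probs: (N, 2) tensor from the probability example above; a placeholder here.
probs = torch.tensor([[0.20, 0.80], [0.04, 0.96]])

THRESHOLD = 0.90  # tune on a labeled held-out sample of your own data
keep = probs[:, 1] > THRESHOLD  # index 1 assumed to be the "Bio" class
print(keep)  # tensor([False,  True])
```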