---
language:
- en
- pt
license: mit
library_name: transformers
tags:
- biology
- science
- text-classification
- nlp
- biomedical
- filter
- deberta
metrics:
- f1
- accuracy
- recall
datasets:
- Madras1/BioClass80k
base_model: microsoft/deberta-v3-base
widget:
- text: The mitochondria is the powerhouse of the cell and generates ATP.
  example_title: Biology Example 🧬
- text: The stock market crashed today due to high inflation rates.
  example_title: Finance Example πŸ’°
- text: New studies regarding CRISPR technology show promise in gene editing.
  example_title: Genetics Example πŸ”¬
pipeline_tag: text-classification
---
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-orange.svg)](https://pytorch.org/)
[![Base Model: DeBERTa-v3](https://img.shields.io/badge/Base%20Model-DeBERTa%20v3-blue.svg)](https://huggingface.co/microsoft/deberta-v3-base)

# DebertaBioClass 🧬

**DebertaBioClass** is a fine-tuned DeBERTa-v3 model designed for **high-recall** filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even at the cost of slightly more noise than its precision-focused sibling, [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass).

## Model Details

- **Model Architecture:** DeBERTa-v3-base
- **Task:** Binary Text Classification
- **Author:** Madras1
- **Dataset:** [Madras1/BioClass80k](https://huggingface.co/datasets/Madras1/BioClass80k), ~80k mixed samples (synthetic + real biomedical data)

## βš”οΈ Model Comparison: DeBERTa vs. RoBERTa

I have released two models for this task. Choose the one that fits your pipeline needs:

| Feature | **DebertaBioClass** (This Model) | [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass) |
| :--- | :--- | :--- |
| **Philosophy** | **"The Vacuum Cleaner"** (High Recall) | **"The Balanced Specialist"** (Precision focus) |
| **Best Use Case** | Building raw datasets; when missing a bio-text is unacceptable. | Final classification; when you need cleaner data with less noise. |
| **Recall (Bio)** | **86.2%** πŸ† | 83.1% |
| **Precision (Bio)** | 72.5% | **74.4%** πŸ† |
| **Architecture** | DeBERTa (Disentangled Attention) | RoBERTa (Optimized BERT) |

## Performance Metrics πŸ“Š

This model was trained with a **Weighted Cross-Entropy Loss**, so missing a biological sample is penalized more heavily than raising a false alarm.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.5%** | Overall correctness |
| **F1-Score** | **78.7%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **86.16%** | **Highlights the model's ability to find hidden bio texts.** |
| **Precision (Bio)** | **72.51%** | Confidence when predicting "Bio" |
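
As a quick consistency check, the F1 follows from the Bio-class precision and recall above: 2 × 0.7251 × 0.8616 / (0.7251 + 0.8616) ≈ 0.787, matching the reported 78.7%.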

## How to Use 

```python
from transformers import pipeline

# Load the classification pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")

# Test strings: one biological, one not
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea."
]

# Returns one {'label': ..., 'score': ...} dict per input
predictions = classifier(examples)
print(predictions)
```
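
For bulk filtering you may want the raw class probabilities rather than just the top label. Below is a minimal sketch using the underlying model directly; the label names come from the model's `id2label` config, so verify them before relying on a specific string:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Madras1/DebertaBioClass")
model = AutoModelForSequenceClassification.from_pretrained("Madras1/DebertaBioClass")
model.eval()

texts = [
    "CRISPR-Cas9 enables precise genome editing in eukaryotic cells.",
    "The central bank raised interest rates by 50 basis points.",
]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Softmax turns logits into per-class probabilities
probs = torch.softmax(logits, dim=-1)
for text, p in zip(texts, probs):
    pred = p.argmax().item()
    print(f"{model.config.id2label[pred]} ({p[pred]:.3f}): {text}")
```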

## Training Procedure

- **Class Weights:** heavily weighted towards the minority class (Biology) to maximize recall.
- **Infrastructure:** trained on NVIDIA T4 GPUs (Kaggle).
- **Hyperparameters:** learning rate 2e-5, batch size 16, 2 epochs.
- **Loss Function:** Weighted Cross-Entropy (see the sketch below).
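
The exact class weights are not published here, so the following is a minimal sketch of the weighted-loss setup rather than the exact training code; the `[1.0, 2.0]` weights are illustrative placeholders:

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedTrainer(Trainer):
    """Trainer whose loss penalizes misclassified Biology samples more heavily."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Illustrative weights; index 1 is assumed to be the Biology class
        weights = torch.tensor([1.0, 2.0], device=outputs.logits.device)
        loss_fct = nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```

Overriding `compute_loss` is the standard `transformers.Trainer` hook for custom losses; the remaining settings (learning rate 2e-5, batch size 16, 2 epochs) go into the usual `TrainingArguments`.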

## Limitations

- **False Positives:** due to its high sensitivity (≈86% recall), this model may classify adjacent scientific fields (such as chemistry or medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering. If cleaner output matters more than coverage, raise the decision threshold (see the sketch below) or use [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass).
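
If the extra noise is a problem downstream, one pragmatic mitigation (a sketch, not part of the released pipeline) is to accept "Bio" predictions only above a probability threshold, trading some recall for precision. The `LABEL_1` name is an assumption; check the model's `id2label` config for the real label strings:

```python
from transformers import pipeline

# top_k=None returns the score for every class, not just the argmax
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass", top_k=None)

def is_bio(text: str, threshold: float = 0.9, bio_label: str = "LABEL_1") -> bool:
    # bio_label is assumed; verify via classifier.model.config.id2label
    results = classifier([text])  # list input -> list of per-text score lists
    scores = {item["label"]: item["score"] for item in results[0]}
    return scores.get(bio_label, 0.0) >= threshold

print(is_bio("Ribosomes translate mRNA into protein."))
```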