+---
+language:
+- en
+- pt
+license: mit
+library_name: transformers
+tags:
+- biology
+- science
+- text-classification
+- nlp
+- biomedical
+- filter
+- deberta
+metrics:
+- f1
+- accuracy
+- recall
+base_model: microsoft/deberta-v3-base
+widget:
+- text: "The mitochondria is the powerhouse of the cell and generates ATP."
+  example_title: "Biology Example 🧬"
+- text: "The stock market crashed today due to high inflation rates."
+  example_title: "Finance Example 💰"
+- text: "New studies regarding CRISPR technology show promise in gene editing."
+  example_title: "Genetics Example 🔬"
+---
+# DebertaBioClass 🧬🔍
+[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-orange.svg)](https://pytorch.org/)
+[![Base Model: DeBERTa-v3](https://img.shields.io/badge/Base%20Model-DeBERTa%20v3-blue.svg)](https://huggingface.co/microsoft/deberta-v3-base)
+**DebertaBioClass** is a fine-tuned DeBERTa-v3 model designed for **high-recall** filtering of biological texts. It is tuned to identify biological content in large, noisy datasets and prioritizes "finding everything", even at the cost of letting through somewhat more noise than the precision-oriented [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass).
+## Model Details
+- **Model Architecture:** DeBERTa-v3-base
+- **Task:** Binary Text Classification
+- **Author:** Madras1
+- **Dataset:** ~80k mixed samples (Synthetic + Real Biomedical Data)
+## ⚔️ Model Comparison: DeBERTa vs. RoBERTa
+I have released two models for this task. Choose the one that fits your pipeline needs:
+| Feature | **DebertaBioClass** (This Model) | [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass) |
+| :--- | :--- | :--- |
+| **Philosophy** | **"The Vacuum Cleaner"** (High Recall) | **"The Balanced Specialist"** (Precision focus) |
+| **Best Use Case** | Building raw datasets; when missing a bio-text is unacceptable. | Final classification; when you need cleaner data with less noise. |
+| **Recall (Bio)** | **86.2%** 🏆 | 83.1% |
+| **Precision (Bio)** | 72.5% | **74.4%** 🏆 |
+| **Architecture** | DeBERTa (Disentangled Attention) | RoBERTa (Optimized BERT) |
+## Performance Metrics 📊
+This model was trained with **Weighted Cross-Entropy Loss**, which penalizes missed biological samples (false negatives) more heavily than false positives.
+| Metric | Score | Description |
+| :--- | :--- | :--- |
+| **Accuracy** | **86.5%** | Overall correctness |
+| **F1-Score** | **78.7%** | Harmonic mean of precision and recall |
+| **Recall (Bio)** | **86.16%** | **Highlights the model's ability to find hidden bio texts.** |
+| **Precision (Bio)** | **72.51%** | Share of "Bio" predictions that are actually biological |
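
As a quick sanity check, the F1-Score in the table follows from the reported precision and recall for the "Bio" class. A minimal sketch using only those two figures (the helper name `f1_score` is just for illustration):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values for the "Bio" class, taken from the metrics table
precision_bio = 0.7251
recall_bio = 0.8616

f1_bio = f1_score(precision_bio, recall_bio)
print(f"F1 (Bio): {f1_bio:.1%}")  # prints "F1 (Bio): 78.7%"
```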
+## How to Use
+```python
+from transformers import pipeline
+# Load the pipeline
+classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")
+# Test strings
+examples = [
+    "The mitochondria is the powerhouse of the cell.",
+    "Manchester United won the match against Chelsea."
+]
+# Get predictions
+predictions = classifier(examples)
+print(predictions)
+```
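
For dataset filtering, the pipeline output can be turned into a keep/drop decision. The sketch below relies on the `text-classification` pipeline's output format (one dict with `label` and `score` per input). The label name `LABEL_1` for the biology class, the 0.5 threshold, and the helper `keep_bio` are assumptions for illustration, so check the model's `config.id2label` before relying on them. The predictions are hard-coded stand-ins, not real model outputs.

```python
from typing import Dict, List, Union

Prediction = Dict[str, Union[str, float]]


def keep_bio(
    texts: List[str],
    predictions: List[Prediction],
    bio_label: str = "LABEL_1",  # assumed label name; verify via config.id2label
    threshold: float = 0.5,      # assumed cut-off, tune for your pipeline
) -> List[str]:
    """Return texts whose predicted label is the bio label with score >= threshold."""
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == bio_label and pred["score"] >= threshold
    ]


examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

# Stand-in predictions in the pipeline's output format (scores are made up)
fake_predictions = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_0", "score": 0.97},
]

bio_texts = keep_bio(examples, fake_predictions)
print(bio_texts)
```

With the real model, pass `classifier(examples)` in place of `fake_predictions`. Raising `threshold` typically trades some recall for higher precision.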
+## Training Procedure
+- **Class Weights:** Heavily weighted towards the minority class (Biology) to maximize Recall.
+- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
+- **Hyperparameters:** Learning Rate 2e-5, Batch Size 16, 2 Epochs.
+- **Loss Function:** Weighted Cross-Entropy.
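
To illustrate how class weighting shifts the loss, here is a minimal pure-Python sketch of weighted cross-entropy for two classes. The weights `[1.0, 3.0]` and the logits are hypothetical, since the class weights actually used in training are not stated here. The weighted-mean normalization mirrors the convention of `torch.nn.CrossEntropyLoss(weight=...)`; it is not taken from the author's training code.

```python
import math
from typing import List, Sequence


def weighted_cross_entropy(
    logits: Sequence[Sequence[float]],
    targets: Sequence[int],
    class_weights: Sequence[float],
) -> float:
    """Weighted mean of per-sample cross-entropy losses."""
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        # Numerically stable log-softmax for the target class
        max_logit = max(row)
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in row))
        log_prob = row[target] - log_sum_exp
        weight = class_weights[target]
        total_loss += -weight * log_prob
        total_weight += weight
    return total_loss / total_weight


# Hypothetical setup: class 0 = non-bio, class 1 = bio (minority, up-weighted)
class_weights: List[float] = [1.0, 3.0]

# The model confidently predicts "non-bio" for both samples
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]  # first sample is non-bio (correct), second is bio (missed)

unweighted = weighted_cross_entropy(logits, targets, [1.0, 1.0])
weighted = weighted_cross_entropy(logits, targets, class_weights)
print(f"unweighted: {unweighted:.3f}, weighted: {weighted:.3f}")
```

The missed bio sample dominates the weighted loss, which is what pushes the model towards higher recall on the minority class.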
+## Limitations
+- **False Positives:** Due to the high sensitivity (86% Recall), this model may classify related scientific fields (like Chemistry or Medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering.