|
|
--- |
|
|
language: |
|
|
- en |
|
|
- pt |
|
|
license: mit |
|
|
library_name: transformers |
|
|
tags: |
|
|
- biology |
|
|
- science |
|
|
- text-classification |
|
|
- nlp |
|
|
- biomedical |
|
|
- filter |
|
|
- deberta |
|
|
metrics: |
|
|
- f1 |
|
|
- accuracy |
|
|
- recall |
|
|
datasets: |
|
|
- Madras1/BioClass80k |
|
|
base_model: microsoft/deberta-v3-base |
|
|
widget: |
|
|
- text: The mitochondria is the powerhouse of the cell and generates ATP. |
|
|
example_title: Biology Example 🧬
|
|
- text: The stock market crashed today due to high inflation rates. |
|
|
example_title: Finance Example 💰
|
|
- text: New studies regarding CRISPR technology show promise in gene editing. |
|
|
example_title: Genetics Example 🔬
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
|
|
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-ee4c2c)](https://pytorch.org/)
|
|
[![Base Model: DeBERTa-v3-base](https://img.shields.io/badge/Base_Model-DeBERTa--v3--base-blue)](https://huggingface.co/microsoft/deberta-v3-base)
|
|
|
|
|
# DebertaBioClass 🧬
|
|
|
|
|
**DebertaBioClass** is a fine-tuned DeBERTa-v3 model designed for **high-recall** filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even if that means admitting slightly more noise than a precision-focused model would.
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Architecture:** DeBERTa-v3-base |
|
|
- **Task:** Binary Text Classification |
|
|
- **Author:** Madras1 |
|
|
- **Dataset:** [Madras1/BioClass80k](https://huggingface.co/datasets/Madras1/BioClass80k), ~80k mixed samples (synthetic + real biomedical data)
|
|
|
|
|
## ⚖️ Model Comparison: DeBERTa vs. RoBERTa
|
|
|
|
|
I have released two models for this task. Choose the one that fits your pipeline needs: |
|
|
|
|
|
| Feature | **DebertaBioClass** (This Model) | [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass) | |
|
|
| :--- | :--- | :--- | |
|
|
| **Philosophy** | **"The Vacuum Cleaner"** (High Recall) | **"The Balanced Specialist"** (Precision focus) | |
|
|
| **Best Use Case** | Building raw datasets; when missing a bio-text is unacceptable. | Final classification; when you need cleaner data with less noise. | |
|
|
| **Recall (Bio)** | **86.2%** 🏆 | 83.1% |
|
|
| **Precision (Bio)** | 72.5% | **74.4%** 🏆 |
|
|
| **Architecture** | DeBERTa (Disentangled Attention) | RoBERTa (Optimized BERT) | |
|
|
|
|
|
## Performance Metrics 📊
|
|
|
|
|
This model was trained with **Weighted Cross-Entropy Loss** to heavily penalize missed biological samples (false negatives).
|
|
|
|
|
| Metric | Score | Description | |
|
|
| :--- | :--- | :--- | |
|
|
| **Accuracy** | **86.5%** | Overall correctness | |
|
|
| **F1-Score** | **78.7%** | Harmonic mean of precision and recall | |
|
|
| **Recall (Bio)** | **86.16%** | **Share of true biological texts the model retrieves; its headline strength.** |
|
|
| **Precision** | **72.51%** | Fraction of "Bio" predictions that are actually biological |
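
As a quick sanity check, the reported F1 follows directly from the precision and recall rows above:

```python
# Harmonic mean of the precision and recall reported in the table.
precision, recall = 0.7251, 0.8616
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.3f}")  # F1 = 0.787, matching the 78.7% above
```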
|
|
|
|
|
## How to Use |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Load the pipeline |
|
|
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass") |
|
|
|
|
|
# Test strings |
|
|
examples = [ |
|
|
"The mitochondria is the powerhouse of the cell.", |
|
|
"Manchester United won the match against Chelsea." |
|
|
] |
|
|
|
|
|
# Get predictions |
|
|
predictions = classifier(examples) |
|
|
print(predictions) |
|
|
``` |
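
For bulk filtering you may prefer raw probabilities over hard labels, for example to lower the decision threshold and push recall even higher. A minimal sketch; `BIO_INDEX` and the threshold value are assumptions, so inspect `model.config.id2label` first to confirm which index is the biology class:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Madras1/DebertaBioClass"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
print(model.config.id2label)  # check which index maps to the biology label

texts = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)

BIO_INDEX = 1     # assumption: adjust after checking id2label above
THRESHOLD = 0.35  # below 0.5 keeps borderline texts, favoring recall
kept = [t for t, p in zip(texts, probs[:, BIO_INDEX].tolist()) if p >= THRESHOLD]
print(kept)
```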
|
|
|
|
|
## Training Procedure

- **Class Weights:** Heavily weighted toward the minority class (Biology) to maximize recall.
|
|
|
|
|
- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
|
|
|
|
|
- **Hyperparameters:** Learning rate 2e-5, batch size 16, 2 epochs.
|
|
|
|
|
- **Loss Function:** Weighted Cross-Entropy (see the sketch below).
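
The class-weighting setup can be reproduced with a custom `Trainer`. A minimal sketch; the exact weight values are not published, so `[1.0, 2.5]` below is an illustrative assumption:

```python
import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    """Trainer that swaps the default loss for weighted cross-entropy."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # Assumed weights: penalize errors on the "Biology" class 2.5x harder.
        weights = torch.tensor([1.0, 2.5], device=outputs.logits.device)
        loss = nn.CrossEntropyLoss(weight=weights)(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```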
|
|
|
|
|
## Limitations

- **False Positives:** Due to its high sensitivity (86% recall), this model may classify related scientific fields (such as Chemistry or Medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering; one mitigation is sketched below.
False Positives: Due to the high sensitivity (86% Recall), this model may classify related scientific fields (like Chemistry or Medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering. |