Madras1 committed (verified)
Commit: b936aa4 · 1 Parent(s): ea98372

Update README.md

Files changed (1): README.md (+96, -3)
README.md CHANGED
---
language:
- en
- pt
license: mit
library_name: transformers
tags:
- biology
- science
- text-classification
- nlp
- biomedical
- filter
- deberta
metrics:
- f1
- accuracy
- recall
base_model: microsoft/deberta-v3-base
widget:
- text: "The mitochondria is the powerhouse of the cell and generates ATP."
  example_title: "Biology Example 🧬"
- text: "The stock market crashed today due to high inflation rates."
  example_title: "Finance Example 💰"
- text: "New studies regarding CRISPR technology show promise in gene editing."
  example_title: "Genetics Example 🔬"
---

# DebertaBioClass 🧬🔍

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-orange.svg)](https://pytorch.org/)
[![Base Model: DeBERTa-v3](https://img.shields.io/badge/Base%20Model-DeBERTa%20v3-blue.svg)](https://huggingface.co/microsoft/deberta-v3-base)

**DebertaBioClass** is a fine-tuned DeBERTa-v3 model designed for **high-recall** filtering of biological texts. It excels at identifying biological content in large, noisy datasets, prioritizing "finding everything" even if that means capturing slightly more noise than a precision-focused model would.

## Model Details

- **Model Architecture:** DeBERTa-v3-base
- **Task:** Binary Text Classification (the label mapping can be inspected as shown below)
- **Author:** Madras1
- **Dataset:** ~80k mixed samples (Synthetic + Real Biomedical Data)
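
To confirm the base architecture and the binary label mapping locally, the configuration can be read straight from the Hub. A minimal sketch; the printed values are expectations, not guarantees:

```python
from transformers import AutoConfig

# Read the classifier's configuration from the Hub.
config = AutoConfig.from_pretrained("Madras1/DebertaBioClass")

print(config.model_type)  # DeBERTa-v3 checkpoints report "deberta-v2"
print(config.id2label)    # the binary label mapping used by the classifier head
```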

## ⚔️ Model Comparison: DeBERTa vs. RoBERTa

I have released two models for this task. Choose the one that fits your pipeline needs; a two-stage combination is sketched after the table.

| Feature | **DebertaBioClass** (This Model) | [RobertaBioClass](https://huggingface.co/Madras1/RobertaBioClass) |
| :--- | :--- | :--- |
| **Philosophy** | **"The Vacuum Cleaner"** (High Recall) | **"The Balanced Specialist"** (Precision focus) |
| **Best Use Case** | Building raw datasets; when missing a bio-text is unacceptable. | Final classification; when you need cleaner data with less noise. |
| **Recall (Bio)** | **86.2%** 🏆 | 83.1% |
| **Precision (Bio)** | 72.5% | **74.4%** 🏆 |
| **Architecture** | DeBERTa (Disentangled Attention) | RoBERTa (Optimized BERT) |
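
The two models can also be chained: this model casts a wide net, and RobertaBioClass trims the noise in a second pass. A minimal sketch; the positive label name `"Bio"` is an assumption, so check each model's `config.id2label` first:

```python
from transformers import pipeline

# Stage 1: high-recall filter (this model) casts a wide net.
recall_filter = pipeline("text-classification", model="Madras1/DebertaBioClass")
# Stage 2: precision-oriented second pass with the companion model.
precision_filter = pipeline("text-classification", model="Madras1/RobertaBioClass")

corpus = [
    "CRISPR-Cas9 enables targeted gene editing in living cells.",
    "The central bank raised interest rates again this quarter.",
]

# ASSUMPTION: the positive label is named "Bio"; verify via config.id2label.
candidates = [t for t, p in zip(corpus, recall_filter(corpus)) if p["label"] == "Bio"]

kept = []
if candidates:  # avoid passing an empty batch to the second pipeline
    kept = [t for t, p in zip(candidates, precision_filter(candidates)) if p["label"] == "Bio"]
print(kept)
```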

## Performance Metrics 📊

This model was trained with **Weighted Cross-Entropy Loss** to strictly penalize missing biological samples. A sketch of how such metrics are computed follows the table.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.5%** | Overall correctness |
| **F1-Score** | **78.7%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **86.16%** | The model's ability to find hidden bio texts |
| **Precision (Bio)** | **72.51%** | Confidence when predicting "Bio" |
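
For reference, these are the standard scikit-learn metrics on a held-out set; the sketch below uses tiny hypothetical label arrays purely to illustrate the calls:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical held-out labels (1 = Bio, 0 = Non-Bio), for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))    # sensitivity on the Bio class
print("Precision:", precision_score(y_true, y_pred))
print("F1-Score :", f1_score(y_true, y_pred))
```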

## How to Use

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/DebertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "Manchester United won the match against Chelsea.",
]

# Get predictions
predictions = classifier(examples)
print(predictions)
```
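
Since the model is tuned for recall, it can be useful to work with raw class probabilities and set your own decision threshold. A minimal sketch with the lower-level API; the index of the "Bio" class is an assumption, so verify it against `model.config.id2label`:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Madras1/DebertaBioClass")
model = AutoModelForSequenceClassification.from_pretrained("Madras1/DebertaBioClass")
model.eval()

texts = ["Ribosomes translate mRNA into proteins."]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# ASSUMPTION: index 1 is the "Bio" class; verify with model.config.id2label.
bio_scores = probs[:, 1]
threshold = 0.3  # a lower threshold raises recall further, at the cost of noise
for text, score in zip(texts, bio_scores):
    print(text, float(score), bool(score >= threshold))
```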

## Training Procedure

- **Class Weights:** Heavily weighted towards the minority class (Biology) to maximize recall (see the weighted-loss sketch below).
- **Infrastructure:** Trained on NVIDIA T4 GPUs (Kaggle).
- **Hyperparameters:** Learning rate 2e-5, batch size 16, 2 epochs.
- **Loss Function:** Weighted Cross-Entropy.
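
The exact weight values are not published in this card. The sketch below shows the general pattern: overriding `Trainer.compute_loss` with a class-weighted `torch.nn.CrossEntropyLoss`, using the hyperparameters listed above and an illustrative weight vector:

```python
import torch
from transformers import Trainer, TrainingArguments

class WeightedLossTrainer(Trainer):
    """Trainer variant that applies class-weighted cross-entropy."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        # ASSUMPTION: weights [1.0, 2.0] for (Non-Bio, Bio) are illustrative;
        # the values actually used for this model were not published.
        weights = torch.tensor([1.0, 2.0], device=outputs.logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=weights)
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="deberta-bioclass",   # hypothetical output path
    learning_rate=2e-5,              # from the card
    per_device_train_batch_size=16,  # from the card
    num_train_epochs=2,              # from the card
)
# trainer = WeightedLossTrainer(model=model, args=args, train_dataset=train_ds)
# trainer.train()
```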

## Limitations

- **False Positives:** Due to the high sensitivity (86% recall), this model may classify related scientific fields (like Chemistry or Medicine) as "Biology". This is intentional behavior to ensure no relevant data is lost during filtering.