---
language:
- en
- pt
tags:
- biology
- classification
- text-classification
- roberta
metrics:
- f1
- accuracy
- recall
base_model: roberta-base
license: mit
pipeline_tag: text-classification
---

# RobertaBioClass 🧬

**RobertaBioClass** is a fine-tuned RoBERTa model designed to distinguish biological texts from other general topics. It was trained to filter large datasets, prioritizing high recall to ensure relevant biological content is captured.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Text Classification
- **Languages:** English and Portuguese (depending on the training data mix)
- **Author:** Madras1

## Performance Metrics

The model was evaluated on a held-out validation set of ~16k samples. It is optimized for **high recall**, making it well suited to filtering pipelines where missing a biological text is worse than including a false positive.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.8%** | Overall correctness |
| **F1-Score** | **78.5%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **83.1%** | Ability to find biological texts (sensitivity) |
| **Precision** | **74.4%** | Correctness when predicting "Bio" |
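As a sanity check, the reported F1 follows directly from the precision and recall above, since F1 is their harmonic mean:

```python
precision = 0.744  # precision for the "Bio" class, from the table above
recall = 0.831     # recall for the "Bio" class

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.785, matching the reported F1-score
```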

## Label Mapping

The model outputs the following labels:

* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)
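The pipeline returns the raw `LABEL_0`/`LABEL_1` identifiers, so downstream code typically maps them to readable names. A minimal sketch (the `readable` helper is illustrative, not part of the model):

```python
# Map the raw pipeline labels to human-readable names (illustrative helper)
LABEL_NAMES = {"LABEL_0": "Non-Biology", "LABEL_1": "Biology"}

def readable(prediction):
    """Return a copy of a pipeline prediction with a readable label."""
    return {**prediction, "label": LABEL_NAMES[prediction["label"]]}

print(readable({"label": "LABEL_1", "score": 0.99}))
# {'label': 'Biology', 'score': 0.99}
```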

## How to Use

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "The stock market crashed yesterday due to inflation.",
]

# Get predictions
predictions = classifier(examples)
print(predictions)
# Output:
# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology
```

## Intended Use

This model is ideal for:

* Filtering biological data from Common Crawl or other web datasets.
* Categorizing academic papers.
* Tagging educational content.
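For the filtering use case, the selection logic can be kept separate from the classifier so it is easy to test. A minimal sketch over predictions in the pipeline output format shown above (`filter_biology` and `min_score` are illustrative names, not part of the model):

```python
def filter_biology(texts, predictions, min_score=0.5):
    """Keep texts predicted as Biology (LABEL_1) with at least min_score confidence."""
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "LABEL_1" and pred["score"] >= min_score
    ]

# Example with mock predictions in the pipeline's output format
texts = ["CRISPR edits genes.", "Rates rose again."]
preds = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_0", "score": 0.95},
]
print(filter_biology(texts, preds))  # ['CRISPR edits genes.']
```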

## Limitations

Since the model prioritizes recall (83%), it may generate some false positives (precision ~74%). It might occasionally classify related scientific fields (such as Chemistry or Physics) as Biology, depending on the context.
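If false positives are costly for your application, some recall can be traded back for precision by accepting a Biology prediction only above a confidence threshold. The threshold below is illustrative and should be tuned on your own validation data:

```python
def is_biology(prediction, threshold=0.9):
    """Accept a prediction as Biology only above a confidence threshold (illustrative value)."""
    return prediction["label"] == "LABEL_1" and prediction["score"] >= threshold

# A confident Biology prediction passes; a borderline one is rejected
print(is_biology({"label": "LABEL_1", "score": 0.95}))  # True
print(is_biology({"label": "LABEL_1", "score": 0.60}))  # False
```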