---
language:
- en
- pt
tags:
- biology
- classification
- text-classification
- roberta
metrics:
- f1
- accuracy
- recall
base_model: roberta-base
license: mit
pipeline_tag: text-classification
---

# RobertaBioClass 🧬

**RobertaBioClass** is a fine-tuned RoBERTa model designed to distinguish biological texts from other general topics. It was trained to filter large datasets, prioritizing high recall to ensure relevant biological content is captured.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Text Classification
- **Languages:** English and Portuguese (depending on the training data mix)
- **Author:** Madras1

## Performance Metrics

The model was evaluated on a held-out validation set of ~16k samples. It is optimized for **high recall**, making it well suited to filtering pipelines where missing a biological text is worse than including a false positive.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.8%** | Overall correctness |
| **F1-Score** | **78.5%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **83.1%** | Ability to find biological texts (sensitivity) |
| **Precision** | **74.4%** | Correctness when predicting "Bio" |
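As a sanity check, the reported F1 follows directly from the precision and recall above, since F1 is their harmonic mean:

```python
precision = 0.744  # precision for the "Bio" class, from the table above
recall = 0.831     # recall for the "Bio" class

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.785, matching the reported F1-score
```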

## Label Mapping

The model outputs the following labels:

* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)
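The pipeline returns the raw `LABEL_0`/`LABEL_1` identifiers, so downstream code typically maps them to readable names. A minimal sketch (the `readable` helper is illustrative, not part of the model):

```python
# Map the raw pipeline labels to human-readable names (illustrative helper)
LABEL_NAMES = {"LABEL_0": "Non-Biology", "LABEL_1": "Biology"}

def readable(prediction):
    """Return a copy of a pipeline prediction with a readable label."""
    return {**prediction, "label": LABEL_NAMES[prediction["label"]]}

print(readable({"label": "LABEL_1", "score": 0.99}))
# {'label': 'Biology', 'score': 0.99}
```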

## How to Use

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "The stock market crashed yesterday due to inflation.",
]

# Get predictions
predictions = classifier(examples)
print(predictions)
# Output:
# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology
```

## Intended Use

This model is ideal for:

* Filtering biological data from Common Crawl or other web datasets.
* Categorizing academic papers.
* Tagging educational content.
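For the filtering use case, the selection logic can be kept separate from the classifier so it is easy to test. A minimal sketch over predictions in the pipeline output format shown above (`filter_biology` and `min_score` are illustrative names, not part of the model):

```python
def filter_biology(texts, predictions, min_score=0.5):
    """Keep texts predicted as Biology (LABEL_1) with at least min_score confidence."""
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "LABEL_1" and pred["score"] >= min_score
    ]

# Example with mock predictions in the pipeline's output format
texts = ["CRISPR edits genes.", "Rates rose again."]
preds = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_0", "score": 0.95},
]
print(filter_biology(texts, preds))  # ['CRISPR edits genes.']
```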

## Limitations

Since the model prioritizes recall (83%), it may generate some false positives (precision ~74%). It might occasionally classify related scientific fields (such as Chemistry or Physics) as Biology, depending on the context.
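If false positives are costly for your application, some recall can be traded back for precision by accepting a Biology prediction only above a confidence threshold. The threshold below is illustrative and should be tuned on your own validation data:

```python
def is_biology(prediction, threshold=0.9):
    """Accept a prediction as Biology only above a confidence threshold (illustrative value)."""
    return prediction["label"] == "LABEL_1" and prediction["score"] >= threshold

# A confident Biology prediction passes; a borderline one is rejected
print(is_biology({"label": "LABEL_1", "score": 0.95}))  # True
print(is_biology({"label": "LABEL_1", "score": 0.60}))  # False
```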