---
language:
- en
- pt
tags:
- biology
- classification
- text-classification
- roberta
metrics:
- f1
- accuracy
- recall
base_model: roberta-base
license: mit
pipeline_tag: text-classification
---

# RobertaBioClass 🧬

**RobertaBioClass** is a fine-tuned RoBERTa model designed to distinguish biological texts from other general topics. It was trained to filter large datasets, prioritizing high recall to ensure relevant biological content is captured.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Text Classification
- **Languages:** English and Portuguese (Portuguese coverage depends on the training data mix)
- **Author:** Madras1

## Performance Metrics 📊

The model was evaluated on a held-out validation set of ~16k samples. It is optimized for **high recall**, making it excellent for filtering pipelines where missing a biological text is worse than including a false positive.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.8%** | Overall correctness |
| **F1-Score** | **78.5%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **83.1%** | Ability to find biological texts (sensitivity) |
| **Precision** | **74.4%** | Correctness when predicting "Bio" |


## Label Mapping

The model outputs the following labels:
* `LABEL_0`: **Non-Biology** (general text, news, finance, sports, etc.)
* `LABEL_1`: **Biology** (genetics, medicine, anatomy, ecology, etc.)
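
If you prefer human-readable class names, the generic labels can be remapped after prediction. The helper below is an illustrative sketch, not part of the model itself; the `LABEL_NAMES` dict and `readable` function are assumptions introduced here.

```python
# Illustrative helper (not part of the model): map the pipeline's
# generic labels to readable class names.
LABEL_NAMES = {"LABEL_0": "non-biology", "LABEL_1": "biology"}

def readable(prediction):
    """Convert one pipeline output dict to a (name, score) pair."""
    return LABEL_NAMES[prediction["label"]], prediction["score"]

# Works on dicts of the shape the pipeline returns
print(readable({"label": "LABEL_1", "score": 0.99}))  # ('biology', 0.99)
```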

## How to Use 🚀

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "The stock market crashed yesterday due to inflation."
]

# Get predictions
predictions = classifier(examples)
print(predictions)
# Output:
# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology
```

## Intended Use

This model is ideal for:

* Filtering biological data from Common Crawl or other web datasets.
* Categorizing academic papers.
* Tagging educational content.
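
For the filtering use case, a pass over a corpus might look like the sketch below. Here `classify` is a self-contained stub standing in for the loaded pipeline, and `keep_biology` and its `min_score` cutoff are illustrative assumptions, not part of the model card.

```python
# Sketch of a corpus-filtering pass. `classify` stubs the real
# text-classification pipeline so this example runs stand-alone.
def classify(texts):
    # A real call would be: classifier(texts) with the loaded pipeline.
    return [{"label": "LABEL_1" if "cell" in t.lower() else "LABEL_0",
             "score": 0.9} for t in texts]

def keep_biology(texts, min_score=0.5):
    """Keep only texts tagged LABEL_1 (biology) with enough confidence."""
    return [t for t, p in zip(texts, classify(texts))
            if p["label"] == "LABEL_1" and p["score"] >= min_score]

docs = ["The cell membrane regulates transport.",
        "Interest rates rose again this quarter."]
print(keep_biology(docs))  # ['The cell membrane regulates transport.']
```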

## Limitations

Since the model prioritizes recall (83.1%), it may produce some false positives (precision ~74.4%). It might occasionally classify related scientific fields (such as Chemistry or Physics) as Biology, depending on the context.
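
If false positives are costly for your application, one common mitigation is to require a higher confidence before accepting a Biology prediction. The function and the 0.8 cutoff below are illustrative assumptions, not tuned values from the model's evaluation.

```python
# Trade recall for precision by raising the acceptance threshold
# for LABEL_1. The 0.8 cutoff is illustrative, not tuned.
def is_biology(prediction, threshold=0.8):
    """Accept a Biology prediction only above the given confidence."""
    return prediction["label"] == "LABEL_1" and prediction["score"] >= threshold

print(is_biology({"label": "LABEL_1", "score": 0.95}))  # True
print(is_biology({"label": "LABEL_1", "score": 0.60}))  # False (below cutoff)
```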