luisMfelipe committed
Commit e6e9202 · verified · 1 Parent(s): 9bc8d3c

Update README.md

  1. README.md +140 -3
README.md CHANGED
@@ -1,3 +1,140 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - TheBlueScrubs/TheBlueScrubs-v1
+ language:
+ - en
+ metrics:
+ - accuracy
+ - r_squared
+ base_model:
+ - answerdotai/ModernBERT-base
+ pipeline_tag: text-classification
+ tags:
+ - medical
+ - biology
+ ---
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66eb0a4e55940cd564ad8e0a/usrRuzoZO6Rb79DYphdZG.png)
+
+ # ModernBERT Medical Precision & Factual Detail Regressor
+
+ The **ModernBERT Medical Precision & Factual Detail Regressor** is a transformer-based language model that predicts the *level of precision and factual detail* in medical or biological texts. Built on the **ModernBERT** architecture, it outputs a continuous score (1–5) indicating how precise and factually detailed a given text is. These scores can help users filter, categorize, or prioritize documents by informational quality.
+
+ ## Model Details
+
+ - **Model Name**: [TheBlueScrubs/ModernBERT-base-TBS-MedicalPrecision](https://huggingface.co/TheBlueScrubs/ModernBERT-base-TBS-MedicalPrecision)
+ - **Developed by**: TheBlueScrubs
+ - **Model Type**: Transformer-based language model (regression output)
+ - **Language**: English
+ - **License**: Apache-2.0
+ - **Base Model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
+
+ ModernBERT’s **Rotary Positional Embeddings**, **local–global alternating attention**, and **Flash Attention** give the base architecture a context window of up to 8,192 tokens and fast, memory-efficient inference. Note that this model was fine-tuned with a maximum sequence length of 4,096 tokens.
+
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+ - **Medical/Scientific Document Filtering**: Identify texts with high precision and factual detail to focus on reliable information sources.
+ - **Data Curation**: Aid in building high-quality corpora by prioritizing documents that exhibit strong factual rigor.
+
+ ### Limitations
+ - **Domain Shift**: Trained primarily on medical and biological data; may not generalize well to non-medical text or to highly specialized domains not covered in the dataset.
+ - **Score Interpretation**: The raw regression output (1–5) needs explicit thresholds or a binning strategy chosen for the downstream application.
+
+ ## How to Use
+
+ You can run inference with the Hugging Face Transformers library as follows:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_id = "TheBlueScrubs/ModernBERT-base-TBS-MedicalPrecision"
+
+ # Load the tokenizer and model, then switch the model to inference mode
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model.eval()
+
+ # Example text
+ text = "A recent randomized trial found that combining targeted therapy with immunotherapy improved survival rates for melanoma patients."
+
+ # Tokenize input (4096 matches the maximum sequence length used in training)
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
+
+ # Run the model without tracking gradients
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ # The model outputs a single continuous score indicating the level of precision & factual detail
+ precision_score = outputs.logits.squeeze().item()
+ print(f"Precision & Factual Detail Score: {precision_score:.2f}")
+ ```
+
+ ## Training Data
+
+ The model was trained on a **balanced subset** of The Blue Scrubs dataset, in which each text carries a *“Precision and Factual Detail”* label (1–5). Preparation steps:
+
+ - **Data Cleaning**: Removed rows with parsing errors, NaNs, or out-of-range values.
+ - **Balancing**: Ensured a roughly even distribution of documents across lower and higher precision scores.
+
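The cleaning and balancing steps above could look roughly like the following sketch. It is not the authors' pipeline: the `text` and `label` field names, the 2.0 split between "lower" and "higher" scores, and the downsampling strategy are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def clean_rows(rows, lo=1.0, hi=5.0):
    """Keep rows whose label parses as a finite number within [lo, hi]."""
    cleaned = []
    for row in rows:
        try:
            text = row["text"]            # hypothetical field name
            label = float(row["label"])   # hypothetical field name
        except (KeyError, TypeError, ValueError):
            continue  # missing field or parsing error
        if math.isnan(label) or not (lo <= label <= hi):
            continue  # NaN or out-of-range value
        cleaned.append({"text": text, "label": label})
    return cleaned


def balance_rows(rows, threshold=2.0, seed=0):
    """Downsample so the low (<= threshold) and high (> threshold) groups match in size."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["label"] <= threshold].append(row)
    if len(groups) < 2:
        return list(rows)  # only one group present, nothing to balance
    n = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced
```

Calling `balance_rows(clean_rows(raw_rows))` on a list of dictionaries yields a subset with equal numbers of low and high examples; other balancing schemes, such as equalizing every integer score, would fit the description equally well.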
+ ## Training Procedure
+
+ ### Preprocessing
+ - **Tokenizer**: ModernBERT tokenizer with a maximum sequence length of 4,096 tokens.
+ - No additional filtering beyond standard data cleaning.
+
+ ### Training Hyperparameters
+
+ - **Learning Rate**: `2e-5`
+ - **Number of Epochs**: `5`
+ - **Batch Size**: `16` (per device)
+ - **Gradient Accumulation Steps**: `1`
+ - **Optimizer**: AdamW
+ - **Weight Decay**: `0.01`
+ - **FP16 Training**: Enabled
+ - **Training Duration**: ~5 epochs on the balanced set
+
+ Training used multi-GPU distributed data parallelism, with evaluation every 1/5 epoch using mean squared error (`mse`) as the metric.
+
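To make the schedule concrete, the sketch below turns the listed hyperparameters into an effective batch size and an evaluation interval in optimizer steps. The `TrainConfig` and `schedule` names are hypothetical, and the dataset size and GPU count in the usage note are placeholders, since neither is stated in this card.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed above; the class itself is illustrative."""
    learning_rate: float = 2e-5
    num_epochs: int = 5
    per_device_batch_size: int = 16
    gradient_accumulation_steps: int = 1
    weight_decay: float = 0.01
    fp16: bool = True
    evals_per_epoch: int = 5  # "evaluation every 1/5 epoch"


def schedule(cfg, num_examples, num_gpus):
    """Derive step counts for data-parallel training on num_gpus devices."""
    effective_batch = (
        cfg.per_device_batch_size * num_gpus * cfg.gradient_accumulation_steps
    )
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "eval_every_steps": max(1, steps_per_epoch // cfg.evals_per_epoch),
        "total_steps": steps_per_epoch * cfg.num_epochs,
    }
```

For example, `schedule(TrainConfig(), num_examples=100_000, num_gpus=4)` gives an effective batch of 64 and an evaluation roughly every 312 steps; both inputs are made-up values chosen only to show the arithmetic.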
+ ## Evaluation
+
+ ### Testing Data
+
+ The final model was evaluated on an out-of-sample test set of medical documents not included in the training or validation splits.
+
+ ### Metrics
+
+ - **Mean Squared Error (MSE)**: ~0.5671
+ - **Accuracy**: 0.9630, with scores binarized at 2.0 (≤ 2.0 is “low precision”, > 2.0 is not)
+ - **ROC Analysis**: The ROC curve below shows high true positive rates at low false positive rates.
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/66eb0a4e55940cd564ad8e0a/e0WXu-Oa0w78GS25NqOL1.png)
+
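The two numeric metrics above can be computed along these lines. This is a minimal sketch, assuming the accuracy figure binarizes both the reference labels and the predicted scores at the same 2.0 cutoff; the card does not spell out that detail, and the function names are illustrative.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between reference labels and predicted scores."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def thresholded_accuracy(y_true, y_pred, threshold=2.0):
    """Share of examples where label and prediction fall on the same side of the cutoff."""
    hits = sum(
        (t <= threshold) == (p <= threshold) for t, p in zip(y_true, y_pred)
    )
    return hits / len(y_true)
```

Because accuracy only checks which side of the cutoff a score lands on, a model can reach high accuracy while its MSE remains noticeably above zero, so the two numbers are best read together.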
+ ## Bias, Risks, and Limitations
+
+ - **Data Bias**: Underrepresented subfields or rare document types may impact performance.
+ - **Misinterpretation**: A single numeric score is not a guarantee of clinical accuracy or evidence-based correctness.
+ - **Domain Evolution**: Medical knowledge evolves quickly; periodic retraining or re-validation is recommended.
+
+ ## Recommendations
+
+ - **Domain-Specific Adjustments**: Consider further fine-tuning if applying to highly specialized medical subdomains.
+ - **Score Thresholding**: Set context-appropriate cutoffs or categories (e.g., “low,” “moderate,” “high” precision) based on your downstream needs.
+ - **Continuous Monitoring**: Maintain up-to-date evaluations as new data or medical findings emerge.
+
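One possible way to apply the score-thresholding recommendation is sketched below. The 2.0 cutoff mirrors the one used in the evaluation section, while the 4.0 cutoff and the function names are purely hypothetical and should be tuned per application. Scores are clamped first because a regression head can, in principle, produce values outside the 1–5 label range.

```python
def clamp_score(score, lo=1.0, hi=5.0):
    """Clip a raw regression output to the labelled 1-5 range."""
    return max(lo, min(hi, score))


def bin_score(score, low_max=2.0, high_min=4.0):
    """Map a raw score to a coarse category; both cutoffs are illustrative."""
    s = clamp_score(score)
    if s <= low_max:
        return "low"
    if s >= high_min:
        return "high"
    return "moderate"
```

For instance, a document scored 3.1 would land in the "moderate" bin under these cutoffs; a curation pipeline might keep only the "high" bin, whereas a triage tool might merely flag "low".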
+ ## Citation
+
+ If you use this model in your research or applications, please cite it as follows:
+
+ ```bibtex
+ @misc{thebluescrubs2025modernbert,
+ author = {TheBlueScrubs},
+ title = {ModernBERT Medical Precision \& Factual Detail Regressor},
+ year = {2025},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/TheBlueScrubs/ModernBERT-base-TBS-MedicalPrecision}
+ }
+ ```
+
+ ## Model Card Authors
+
+ - TheBlueScrubs Team