---
language:
- en
tags:
- text-classification
- physics
- science
- roberta
- data-cleaning
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model: roberta-base
pipeline_tag: text-classification
---

# RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) designed to distinguish between **Physics-related content** and **General/Non-Physics text**.

It was developed specifically for **data cleaning pipelines**, aiming to filter and curate high-quality scientific datasets by removing irrelevant noise from raw text collections.

## 📊 Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| **Accuracy** | **94.44%** | Overall correct classification rate. |
| **Precision** | **70.00%** | Reliability of the model's "Physics" predictions. |
| **Recall** | **62.30%** | Share of actual Physics content the model detects. |
| **F1-Score** | **65.93%** | Harmonic mean of precision and recall. |
| **Validation Loss** | **0.1574** | Low validation error, indicating stable convergence. |
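The reported F1 score can be sanity-checked from the precision and recall above, since F1 is their harmonic mean:

```python
# Sanity check: F1 is the harmonic mean of precision and recall.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Using the table values P = 70.00% and R = 62.30%:
print(round(100 * f1_score(0.7000, 0.6230), 2))  # → 65.93
```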

## 🏷️ Label Mapping

The model uses the following mapping for inference:

* **LABEL_0 (0):** `General` (non-Physics content, noise, or other topics)
* **LABEL_1 (1):** `Physics` (scientific or educational content related to physics)
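If the checkpoint's config does not carry human-readable label names and the pipeline returns raw `LABEL_0`/`LABEL_1` strings, they can be remapped with a small lookup — a sketch that simply restates the mapping above:

```python
# Translate raw classifier labels into the names documented above.
ID2LABEL = {"LABEL_0": "General", "LABEL_1": "Physics"}

def readable(prediction):
    """Return a copy of a pipeline prediction with a human-readable label."""
    return {**prediction, "label": ID2LABEL.get(prediction["label"], prediction["label"])}

print(readable({"label": "LABEL_1", "score": 0.93}))
# → {'label': 'Physics', 'score': 0.93}
```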

## ⚙️ Training Details

* **Dataset:** approximately 11,000 processed text samples (8,762 training / 2,191 validation).
* **Architecture:** RoBERTa Base (sequence classification).
* **Batch Size:** 16 (train) / 64 (eval).
* **Optimizer:** AdamW (weight decay 0.01).
* **Loss Function:** CrossEntropyLoss.
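These settings map onto a standard `transformers` `Trainer` configuration. A minimal sketch — the `output_dir` and the commented-out dataset variables are illustrative placeholders, not taken from the card:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base model with a 2-way classification head (General vs. Physics).
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

args = TrainingArguments(
    output_dir="roberta-physics",    # placeholder path
    num_train_epochs=3,              # epochs reported above
    per_device_train_batch_size=16,  # train batch size
    per_device_eval_batch_size=64,   # eval batch size
    weight_decay=0.01,               # AdamW weight decay
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```

AdamW is the `Trainer` default optimizer, and `AutoModelForSequenceClassification` applies CrossEntropyLoss by default, so neither needs to be configured explicitly.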

## 🚀 Quick Start

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: Physics content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Expected output: [{'label': 'Physics', 'score': 0.93}]

# Example 2: General content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Expected output: [{'label': 'General', 'score': 0.86}]
```

## ⚠️ Intended Use

**Primary Use:** filtering datasets to retain physics-domain text.

**Limitations:** the model prioritizes precision over recall (precision: 70% vs. recall: 62%). This means it is "conservative": it minimizes false positives (junk labeled as physics) but may miss some valid physics texts. This trade-off is intentional for high-quality dataset curation.