---
language:
- en
tags:
- text-classification
- physics
- science
- roberta
- data-cleaning
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model: roberta-base
---

# RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) designed to distinguish between **Physics-related content** and **General/Non-Physics text**.

It was developed specifically for **data cleaning pipelines**, aiming to filter and curate high-quality scientific datasets by removing irrelevant noise from raw text collections.

## 📊 Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| **Accuracy** | **94.44%** | Overall correct classification rate. |
| **Precision** | **70.00%** | Reliability when predicting the "Physics" class. |
| **Recall** | **62.30%** | Ability to detect Physics content within the dataset. |
| **F1-Score** | **65.93%** | Harmonic mean of precision and recall. |
| **Validation Loss** | **0.1574** | Low validation error indicating stable convergence. |
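
The exact evaluation script is not part of this card, but as a rough sketch, these metrics could be reproduced with `scikit-learn`, assuming `y_true` and `y_pred` hold the validation labels and model predictions (both names are illustrative):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Accuracy plus precision/recall/F1 for the positive ('Physics') class."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```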

## 🏷️ Label Mapping

The model uses the following mapping for inference:

* **LABEL_0 (0):** `General` (Non-Physics content, noise, or other topics)
* **LABEL_1 (1):** `Physics` (Scientific or educational content related to physics)
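
Depending on the saved config, the pipeline may surface these as raw `LABEL_0`/`LABEL_1` identifiers. A minimal sketch for inspecting (or overriding) the mapping on the model config:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("Madras1/RobertaPhysics")

# Inspect the id-to-label mapping stored in the config
print(model.config.id2label)  # e.g. {0: 'General', 1: 'Physics'}

# Override it explicitly if the checkpoint only stores generic LABEL_* names
model.config.id2label = {0: "General", 1: "Physics"}
model.config.label2id = {"General": 0, "Physics": 1}
```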

## ⚙️ Training Details

* **Dataset:** Approximately 11,000 processed text samples (8,762 training / 2,191 validation).
* **Architecture:** RoBERTa Base (Sequence Classification).
* **Batch Size:** 16 (Train) / 64 (Eval).
* **Optimizer:** AdamW (weight decay 0.01).
* **Loss Function:** CrossEntropyLoss.
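
The original training script is not included here; the following is a minimal sketch of how a comparable run could be set up with the Hugging Face `Trainer`, using the hyperparameters above. Dataset loading and tokenization are assumed to happen elsewhere, so `train_dataset` and `eval_dataset` are placeholders:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2,
    id2label={0: "General", 1: "Physics"},
    label2id={"General": 0, "Physics": 1},
)

# Hyperparameters from the list above; everything else uses Trainer defaults
# (AdamW optimizer, CrossEntropyLoss for sequence classification).
training_args = TrainingArguments(
    output_dir="roberta-physics",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized training split (8,762 samples)
    eval_dataset=eval_dataset,    # placeholder: tokenized validation split (2,191 samples)
    tokenizer=tokenizer,
)
trainer.train()
```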

## 🚀 Quick Start

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: Physics Content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Expected Output: [{'label': 'Physics', 'score': 0.93}]

# Example 2: General Content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Expected Output: [{'label': 'General', 'score': 0.86}]
```

## ⚠️ Intended Use

* **Primary Use:** Filtering datasets to retain physics-domain text.
* **Limitations:** The model prioritizes precision over recall (Precision: 70% vs Recall: 62%). This means it is "conservative": it minimizes false positives (junk labeled as physics) but may miss some valid physics texts. This is intentional for high-quality dataset curation.
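
As an illustration of that data-cleaning use, here is a minimal sketch that keeps only texts classified as Physics above a confidence threshold (the threshold and variable names are illustrative, not part of the released model):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

raw_texts = [
    "Newton's second law relates force, mass, and acceleration.",
    "Our new smoothie menu launches at all locations this weekend.",
]

CONFIDENCE_THRESHOLD = 0.80  # illustrative value; tune for your own pipeline
results = classifier(raw_texts, truncation=True)

# Retain only confident Physics predictions for the curated dataset
physics_texts = [
    text
    for text, result in zip(raw_texts, results)
    if result["label"] == "Physics" and result["score"] >= CONFIDENCE_THRESHOLD
]
print(physics_texts)
```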