---
language:
- en
tags:
- text-classification
- physics
- science
- roberta
- data-cleaning
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model: roberta-base
pipeline_tag: text-classification
---

# RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) designed to distinguish between **Physics-related content** and **General/Non-Physics text**.

It was developed specifically for **data cleaning pipelines**, aiming to filter and curate high-quality scientific datasets by removing irrelevant noise from raw text collections.

## 📊 Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| **Accuracy** | **94.44%** | Overall correct classification rate. |
| **Precision** | **70.00%** | Reliability of the model's "Physics" predictions. |
| **Recall** | **62.30%** | Share of actual Physics content the model detects. |
| **F1-Score** | **65.93%** | Harmonic mean of precision and recall. |
| **Validation Loss** | **0.1574** | Low validation error, indicating stable convergence. |
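The reported F1 score can be sanity-checked from the precision and recall above, since F1 is their harmonic mean:

```python
# Sanity check: F1 is the harmonic mean of precision and recall.
def f1_score(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Using the table values P = 70.00% and R = 62.30%:
print(round(100 * f1_score(0.7000, 0.6230), 2))  # → 65.93
```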

## 🏷️ Label Mapping

The model uses the following mapping for inference:

* **LABEL_0 (0):** `General` (non-Physics content, noise, or other topics)
* **LABEL_1 (1):** `Physics` (scientific or educational content related to physics)
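If the checkpoint's config does not carry human-readable label names and the pipeline returns raw `LABEL_0`/`LABEL_1` strings, they can be remapped with a small lookup — a sketch that simply restates the mapping above:

```python
# Translate raw classifier labels into the names documented above.
ID2LABEL = {"LABEL_0": "General", "LABEL_1": "Physics"}

def readable(prediction):
    """Return a copy of a pipeline prediction with a human-readable label."""
    return {**prediction, "label": ID2LABEL.get(prediction["label"], prediction["label"])}

print(readable({"label": "LABEL_1", "score": 0.93}))
# → {'label': 'Physics', 'score': 0.93}
```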

## ⚙️ Training Details

* **Dataset:** approximately 11,000 processed text samples (8,762 training / 2,191 validation).
* **Architecture:** RoBERTa Base (sequence classification).
* **Batch Size:** 16 (train) / 64 (eval).
* **Optimizer:** AdamW (weight decay 0.01).
* **Loss Function:** CrossEntropyLoss.
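These settings map onto a standard `transformers` `Trainer` configuration. A minimal sketch — the `output_dir` and the commented-out dataset variables are illustrative placeholders, not taken from the card:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Base model with a 2-way classification head (General vs. Physics).
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

args = TrainingArguments(
    output_dir="roberta-physics",    # placeholder path
    num_train_epochs=3,              # epochs reported above
    per_device_train_batch_size=16,  # train batch size
    per_device_eval_batch_size=64,   # eval batch size
    weight_decay=0.01,               # AdamW weight decay
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```

AdamW is the `Trainer` default optimizer, and `AutoModelForSequenceClassification` applies CrossEntropyLoss by default, so neither needs to be configured explicitly.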

## 🚀 Quick Start

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: Physics content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Expected output: [{'label': 'Physics', 'score': 0.93}]

# Example 2: General content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Expected output: [{'label': 'General', 'score': 0.86}]
```

## ⚠️ Intended Use

**Primary Use:** filtering datasets to retain physics-domain text.

**Limitations:** the model prioritizes precision over recall (precision: 70% vs. recall: 62%). This means it is "conservative": it minimizes false positives (junk labeled as physics) but may miss some valid physics texts. This trade-off is intentional for high-quality dataset curation.