---
language:
- en
tags:
- text-classification
- physics
- science
- roberta
- data-cleaning
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model: roberta-base
---


![image](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/eeksAj0wC_vlwzCITr3Oo.png)

# RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) designed to distinguish between **Physics-related content** and **General/Non-Physics text**. 

It was developed specifically for **data cleaning pipelines**, aiming to filter and curate high-quality scientific datasets by removing irrelevant noise from raw text collections.

## 📊 Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| **Accuracy** | **94.44%** | Overall correct classification rate. |
| **Precision** | **70.00%** | Reliability when predicting "Physics" class. |
| **Recall** | **62.30%** | Ability to detect Physics content within the dataset. |
| **F1-Score** | **65.93%** | Harmonic mean of precision and recall. |
| **Validation Loss** | **0.1574** | Low validation error indicating stable convergence. |
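As a quick sanity check, the F1 score in the table is exactly the harmonic mean of the reported precision and recall:

```python
# Recomputing F1 from the precision and recall reported above.
precision = 0.7000
recall = 0.6230
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.2%}")  # → 65.93%
```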


![image](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/THTeXtbx41e9jxs_UGeJI.png)

## 🏷️ Label Mapping

The model uses the following mapping for inference:

* **LABEL_0 (0):** `General` (Non-Physics content, noise, or other topics)
* **LABEL_1 (1):** `Physics` (Scientific or educational content related to physics)
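For manual inference without `pipeline`, this mapping amounts to a softmax plus argmax over the two logits. A minimal pure-Python sketch (the logits below are dummy values for illustration, not real model outputs):

```python
import math

# Label mapping from the card.
id2label = {0: "General", 1: "Physics"}

def logits_to_label(logits):
    """Softmax the two logits, then map the argmax index to a class name."""
    exps = [math.exp(x) for x in logits]
    probs = [e / sum(exps) for e in exps]
    idx = max(range(len(probs)), key=probs.__getitem__)
    return id2label[idx], probs[idx]

# Dummy logits standing in for the model's raw output:
label, score = logits_to_label([-1.2, 2.3])
print(label, round(score, 2))  # → Physics 0.97
```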

## ⚙️ Training Details

* **Dataset:** Approximately 11,000 processed text samples (8,762 training / 2,191 validation).
* **Architecture:** RoBERTa Base (Sequence Classification).
* **Batch Size:** 16 (Train) / 64 (Eval).
* **Optimizer:** AdamW (weight decay 0.01).
* **Loss Function:** CrossEntropyLoss.
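The hyperparameters above could be reproduced with a `Trainer` setup along these lines. This is a hedged sketch: only the values listed in the card are taken as given; dataset loading, tokenization, and the output directory name are assumptions.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# num_labels and the label mapping come from the card; everything else
# not listed under Training Details is an assumption.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2,
    id2label={0: "General", 1: "Physics"},
    label2id={"General": 0, "Physics": 1},
)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

args = TrainingArguments(
    output_dir="roberta-physics",      # hypothetical path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,                 # AdamW is the Trainer default
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds)
# trainer.train()
```

`Trainer` uses AdamW and cross-entropy loss for sequence classification by default, matching the settings listed above.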

## 🚀 Quick Start

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: Physics Content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Expected Output: [{'label': 'Physics', 'score': 0.93}]

# Example 2: General Content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Expected Output: [{'label': 'General', 'score': 0.86}]
```

![image](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/ba2PpICAPZaZmZAAOdZKu.png)

## ⚠️ Intended Use

* **Primary Use:** Filtering datasets to retain physics-domain text.
* **Limitations:** The model prioritizes precision over recall (70.00% vs. 62.30%). This means it is "conservative": it minimizes false positives (junk labeled as physics) but may miss some valid physics texts. This trade-off is intentional for high-quality dataset curation.
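A dataset-cleaning step built on this model might look like the following sketch: keep only texts classified as "Physics" above a confidence threshold. `fake_classify` is a hypothetical stub standing in for the Hugging Face pipeline from Quick Start, so the example runs without downloading the model.

```python
def filter_physics(texts, classify, threshold=0.8):
    """Keep texts the classifier labels 'Physics' with score >= threshold."""
    kept = []
    for text in texts:
        result = classify(text)[0]  # pipeline-style: [{'label': ..., 'score': ...}]
        if result["label"] == "Physics" and result["score"] >= threshold:
            kept.append(text)
    return kept

def fake_classify(text):
    # Hypothetical stub: flags anything mentioning "quantum" as Physics.
    is_physics = "quantum" in text.lower()
    return [{"label": "Physics" if is_physics else "General", "score": 0.9}]

corpus = [
    "Quantum tunneling lets particles cross classically forbidden barriers.",
    "The quarterly earnings report will be released next Tuesday.",
]
print(filter_physics(corpus, fake_classify))
```

Raising `threshold` makes the filter even more conservative, trading additional recall for precision.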