Madras1 committed · Commit 669c3a2 · verified · 1 Parent(s): 946c19a

Update README.md

Files changed (1): README.md (+73 -7)
---
language:
- en
tags:
- text-classification
- physics
- science
- roberta
- data-cleaning
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model: roberta-base
---

# RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) designed to distinguish between **physics-related content** and **general/non-physics text**.

It was developed specifically for **data-cleaning pipelines**, aiming to filter and curate high-quality scientific datasets by removing irrelevant noise from raw text collections.
## 📊 Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| **Accuracy** | **94.44%** | Overall correct classification rate. |
| **Precision** | **70.00%** | Reliability when predicting the "Physics" class. |
| **Recall** | **62.30%** | Ability to detect physics content within the dataset. |
| **F1-Score** | **65.93%** | Harmonic mean of precision and recall. |
| **Validation Loss** | **0.1574** | Low validation error, indicating stable convergence. |
## 🏷️ Label Mapping

The model uses the following mapping for inference:

* **LABEL_0 (0):** `General` (non-physics content, noise, or other topics)
* **LABEL_1 (1):** `Physics` (scientific or educational content related to physics)
## ⚙️ Training Details

* **Dataset:** Approximately 11,000 processed text samples (8,762 training / 2,191 validation).
* **Architecture:** RoBERTa Base (sequence classification).
* **Batch Size:** 16 (train) / 64 (eval).
* **Optimizer:** AdamW (weight decay 0.01).
* **Loss Function:** CrossEntropyLoss.
## 🚀 Quick Start

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: physics content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Expected output: [{'label': 'Physics', 'score': 0.93}]

# Example 2: general content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Expected output: [{'label': 'General', 'score': 0.86}]
```
## ⚠️ Intended Use

* **Primary Use:** Filtering datasets to retain physics-domain text.
* **Limitations:** The model prioritizes precision over recall (precision 70% vs. recall 62%). This means it is "conservative": it minimizes false positives (junk labeled as physics) but may miss some valid physics texts. This trade-off is intentional for high-quality dataset curation.
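The curation use case can be sketched as a simple confidence-thresholded filter. In this sketch, `classify` is a local stand-in for the real `pipeline("text-classification", model="Madras1/RobertaPhysics")` call, and the 0.9 threshold is an illustrative choice, not a value documented by the card:

```python
def classify(text: str) -> dict:
    # Stand-in for the real classifier: pretend texts mentioning
    # "quantum" are physics. Replace with the HF pipeline in practice.
    if "quantum" in text.lower():
        return {"label": "Physics", "score": 0.95}
    return {"label": "General", "score": 0.88}

def keep_physics(texts, threshold=0.9):
    """Keep only texts confidently classified as Physics."""
    kept = []
    for text in texts:
        pred = classify(text)
        if pred["label"] == "Physics" and pred["score"] >= threshold:
            kept.append(text)
    return kept

docs = [
    "Quantum entanglement links the states of distant particles.",
    "The quarterly earnings report will be released next Tuesday.",
]
print(keep_physics(docs))  # keeps only the physics sentence
```

Raising the threshold trades recall for precision, matching the conservative behavior described above.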