---
language:
- en
tags:
- text-classification
- physics
- science
- roberta
- data-cleaning
license: mit
metrics:
- accuracy
- f1
- precision
- recall
base_model: roberta-base
---

# RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of [roberta-base](https://huggingface.co/roberta-base) designed to distinguish between **Physics-related content** and **General/Non-Physics text**.

It was developed specifically for **data cleaning pipelines**, aiming to filter and curate high-quality scientific datasets by removing irrelevant noise from raw text collections.

## 📊 Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric | Value | Interpretation |
| :--- | :--- | :--- |
| **Accuracy** | **94.44%** | Overall correct classification rate. |
| **Precision** | **70.00%** | Reliability when predicting the "Physics" class. |
| **Recall** | **62.30%** | Ability to detect Physics content within the dataset. |
| **F1-Score** | **65.93%** | Harmonic mean of precision and recall. |
| **Validation Loss** | **0.1574** | Low validation error indicating stable convergence. |
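
The exact evaluation script is not part of this card, but as a rough sketch, these metrics could be reproduced with `scikit-learn`, assuming `y_true` and `y_pred` hold the validation labels and model predictions (both names are illustrative):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    """Accuracy plus precision/recall/F1 for the positive ('Physics') class."""
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```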

## 🏷️ Label Mapping

The model uses the following mapping for inference:

* **LABEL_0 (0):** `General` (Non-Physics content, noise, or other topics)
* **LABEL_1 (1):** `Physics` (Scientific or educational content related to physics)
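
Depending on the saved config, the pipeline may surface these as raw `LABEL_0`/`LABEL_1` identifiers. A minimal sketch for inspecting (or overriding) the mapping on the model config:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("Madras1/RobertaPhysics")

# Inspect the id-to-label mapping stored in the config
print(model.config.id2label)  # e.g. {0: 'General', 1: 'Physics'}

# Override it explicitly if the checkpoint only stores generic LABEL_* names
model.config.id2label = {0: "General", 1: "Physics"}
model.config.label2id = {"General": 0, "Physics": 1}
```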

## ⚙️ Training Details

* **Dataset:** Approximately 11,000 processed text samples (8,762 training / 2,191 validation).
* **Architecture:** RoBERTa Base (Sequence Classification).
* **Batch Size:** 16 (Train) / 64 (Eval).
* **Optimizer:** AdamW (weight decay 0.01).
* **Loss Function:** CrossEntropyLoss.
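
The original training script is not included here; the following is a minimal sketch of how a comparable run could be set up with the Hugging Face `Trainer`, using the hyperparameters above. Dataset loading and tokenization are assumed to happen elsewhere, so `train_dataset` and `eval_dataset` are placeholders:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=2,
    id2label={0: "General", 1: "Physics"},
    label2id={"General": 0, "Physics": 1},
)

# Hyperparameters from the list above; everything else uses Trainer defaults
# (AdamW optimizer, CrossEntropyLoss for sequence classification).
training_args = TrainingArguments(
    output_dir="roberta-physics",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,
    eval_strategy="epoch",  # called `evaluation_strategy` in older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized training split (8,762 samples)
    eval_dataset=eval_dataset,    # placeholder: tokenized validation split (2,191 samples)
    tokenizer=tokenizer,
)
trainer.train()
```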

## 🚀 Quick Start

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: Physics Content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Expected Output: [{'label': 'Physics', 'score': 0.93}]

# Example 2: General Content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Expected Output: [{'label': 'General', 'score': 0.86}]
```

## ⚠️ Intended Use

* **Primary Use:** Filtering datasets to retain physics-domain text.
* **Limitations:** The model prioritizes precision over recall (Precision: 70% vs Recall: 62%). This means it is "conservative": it minimizes false positives (junk labeled as physics) but may miss some valid physics texts. This is intentional for high-quality dataset curation.
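
As an illustration of that data-cleaning use, here is a minimal sketch that keeps only texts classified as Physics above a confidence threshold (the threshold and variable names are illustrative, not part of the released model):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

raw_texts = [
    "Newton's second law relates force, mass, and acceleration.",
    "Our new smoothie menu launches at all locations this weekend.",
]

CONFIDENCE_THRESHOLD = 0.80  # illustrative value; tune for your own pipeline
results = classifier(raw_texts, truncation=True)

# Retain only confident Physics predictions for the curated dataset
physics_texts = [
    text
    for text, result in zip(raw_texts, results)
    if result["label"] == "Physics" and result["score"] >= CONFIDENCE_THRESHOLD
]
print(physics_texts)
```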