AventIQ-AI/distilbert-research-paper-area-classification

developerPushkal committed on Mar 18, 2025

Commit d8c7540 (verified) · 1 Parent(s): edccd37

Create README.md

README.md +106 -0

	@@ -0,0 +1,106 @@

+# DistilBERT-Base-Uncased Quantized Model for Scientific Paper Classification
+This repository hosts a quantized version of the **DistilBERT** model, fine-tuned for **scientific paper classification** into three categories: **Biology, Mathematics, and Physics**. The weights are stored in half precision (FP16) to reduce model size and inference cost, which makes the model suitable for tasks such as automated categorization of scientific literature in academic research.
+## Model Details
+- **Model Architecture:** DistilBERT Base Uncased
+- **Task:** Scientific Paper Classification
+- **Dataset:** Custom dataset labeled with three categories: Biology, Mathematics, and Physics
+- **Quantization:** Float16 (FP16)
+- **Fine-tuning Framework:** Hugging Face Transformers
+## Usage
+### Installation
+```sh
+pip install transformers torch
+```
+### Loading the Model
+```python
+from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
+import torch
+# Load quantized model (replace this path with the location of your own copy)
+quantized_model_path = "/kaggle/working/distilbert_finetuned_fp16"
+# Use the GPU when available; FP16 inference is not supported by every CPU build of PyTorch
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+quantized_model = DistilBertForSequenceClassification.from_pretrained(quantized_model_path)
+quantized_model.eval()  # Set to evaluation mode
+if device.type == "cuda":
+    quantized_model.half()  # Run in FP16 on GPU; stay in FP32 on CPU
+quantized_model.to(device)
+# Load tokenizer
+tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
+# Define a test input
+test_paper = "The quantum mechanics of atomic structures are governed by Schrödinger's equation."
+# Tokenize input (the tokenizer already returns int64 tensors, so no dtype cast is needed)
+inputs = tokenizer(test_paper, return_tensors="pt", padding=True, truncation=True, max_length=512)
+# Move input tensors to the same device as the model
+inputs = inputs.to(device)
+# Make prediction
+with torch.no_grad():
+    outputs = quantized_model(**inputs)
+# Get predicted class
+predicted_class = torch.argmax(outputs.logits, dim=1).item()
+# Class labels
+label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}
+predicted_label = label_mapping[predicted_class]
+print(f"Predicted Label: {predicted_label}")
+```
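The `argmax` in the example above reports only the top class. As a minimal, dependency-free sketch, the snippet below turns raw logits into per-class probabilities so a confidence value can be reported alongside the label. The logit values are hypothetical. With a real model output, `torch.softmax(outputs.logits, dim=1)` performs the same computation.

```python
import math

# Label order taken from the label_mapping in the usage example.
LABELS = ["Biology", "Mathematics", "Physics"]

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one paper; real values would come from
# outputs.logits[0].tolist() in the usage example.
example_logits = [-1.2, 0.3, 2.9]
probs = softmax(example_logits)

# Rank the classes from most to least likely.
ranked = sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(f"{label}: {p:.3f}")
```

A low top probability can be used as a signal to route a paper to manual review instead of accepting the predicted label.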
+## Performance Metrics
+- **Accuracy:** 0.95 (after fine-tuning)
+- **F1-Score:** 0.91 (weighted)
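The README does not say how these metrics were computed. As an illustration only, the sketch below implements the standard definitions of accuracy and weighted F1 in plain Python. The label lists are hypothetical and are not the evaluation data behind the numbers above. scikit-learn's `f1_score(y_true, y_pred, average="weighted")` uses the same weighting by class support.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)

# Hypothetical labels (0 = Biology, 1 = Mathematics, 2 = Physics).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2]

print(f"Accuracy:    {accuracy(y_true, y_pred):.3f}")
print(f"Weighted F1: {weighted_f1(y_true, y_pred):.3f}")
```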
+## Fine-Tuning Details
+### Dataset
+The dataset consists of **scientific papers** categorized into three domains:
+- **Biology**
+- **Mathematics**
+- **Physics**
+The dataset was preprocessed and tokenized using the **DistilBERT tokenizer**.
+### Training
+- Number of epochs: 3
+- Batch size: 8
+- Learning rate: 2e-5
+- Optimizer: AdamW
+- Evaluation strategy: epoch
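To make the schedule implied by these hyperparameters concrete, here is a small sketch that computes the number of optimizer steps and evaluation passes. The dataset size is hypothetical, since the README does not state it. The calculation also assumes a single device and no gradient accumulation.

```python
import math

# Hyperparameters listed above.
EPOCHS = 3
BATCH_SIZE = 8
LEARNING_RATE = 2e-5

def total_optimizer_steps(num_examples, epochs=EPOCHS, batch_size=BATCH_SIZE):
    """One optimizer update per batch, keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Hypothetical training-set size, for illustration only.
num_examples = 3000
steps = total_optimizer_steps(num_examples)

# An "epoch" evaluation strategy runs one evaluation pass per epoch.
evaluations = EPOCHS

print(f"{steps} AdamW steps at lr={LEARNING_RATE}, {evaluations} evaluation passes")
```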
+### Quantization
+Post-training quantization to Float16 (FP16) was applied using PyTorch to reduce the model size and improve inference efficiency.
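The effect of FP16 storage can be illustrated without loading the model. The sketch below uses Python's standard `struct` module to compare the size of one FP32 value with one FP16 value and to measure the rounding error of a round trip through half precision. The weight value is hypothetical, and this is not the conversion procedure used for this repository.

```python
import struct

def to_fp16_and_back(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

# Storage cost per parameter in each format.
bytes_fp32 = struct.calcsize("<f")
bytes_fp16 = struct.calcsize("<e")

# Hypothetical weight value; 0.1 is not exactly representable in FP16.
weight = 0.1
restored = to_fp16_and_back(weight)
error = abs(restored - weight)

print(f"FP32: {bytes_fp32} bytes per value, FP16: {bytes_fp16} bytes per value")
print(f"{weight} stored as FP16 reads back as {restored} (error {error:.2e})")
```

Each parameter takes half the space, at the cost of a small rounding error per value. Those per-value errors are the source of the slight accuracy degradation that quantization can cause.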
+## Repository Structure
+```
+.
+├── model/               # Contains the quantized model files
+├── tokenizer_config/    # Tokenizer configuration and vocabulary files
+├── model.safetensors    # Fine-tuned model weights
+└── README.md            # Model documentation
+```
+## Limitations
+- The model is trained on a limited dataset and may not generalize well to niche scientific subdomains.
+- Quantization may result in slight accuracy degradation compared to full-precision models.
+## Contributing
+Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.