# DistilBERT-Base-Uncased Quantized Model for Scientific Paper Classification

This repository hosts a quantized version of the **DistilBERT** model, fine-tuned for **scientific paper classification** into three categories: **Biology, Mathematics, and Physics**. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for real-world applications, including academic research and automated categorization of scientific literature.

## Model Details

- **Model Architecture:** DistilBERT Base Uncased  
- **Task:** Scientific Paper Classification  
- **Dataset:** Custom dataset labeled with three categories: Biology, Mathematics, and Physics  
- **Quantization:** Float16 (FP16)  
- **Fine-tuning Framework:** Hugging Face Transformers  

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load quantized model
quantized_model_path = "/kaggle/working/distilbert_finetuned_fp16"
quantized_model = DistilBertForSequenceClassification.from_pretrained(quantized_model_path)
quantized_model.eval()  # Set to evaluation mode
quantized_model.half()  # Convert model to FP16

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Define a test input
test_paper = "The quantum mechanics of atomic structures are governed by Schrödinger's equation."

# Tokenize input
inputs = tokenizer(test_paper, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Ensure input tensors are in correct dtype
inputs["input_ids"] = inputs["input_ids"].long()  # Convert to long type
inputs["attention_mask"] = inputs["attention_mask"].long()  # Convert to long type

# Make prediction
with torch.no_grad():
    outputs = quantized_model(**inputs)

# Get predicted class
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Class labels
label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}

predicted_label = label_mapping[predicted_class]
print(f"Predicted Label: {predicted_label}")
```

## Performance Metrics

- **Accuracy:** 0.95 (after fine-tuning)  
- **F1-Score:** 0.91 (weighted)  
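
The evaluation script is not included in this repository; the following is a minimal sketch, assuming predictions and gold labels for a held-out test split are available, of how accuracy and weighted F1 can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score

# Placeholder gold labels and model predictions for a held-out split
y_true = [0, 1, 2, 2, 1, 0]  # 0: Biology, 1: Mathematics, 2: Physics
y_pred = [0, 1, 2, 2, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # F1 averaged by class support
print(f"Accuracy: {accuracy:.2f} | Weighted F1: {weighted_f1:.2f}")
```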

## Fine-Tuning Details

### Dataset

The dataset consists of **scientific papers** categorized into three domains:
- **Biology**
- **Mathematics**
- **Physics**

The dataset was preprocessed and tokenized using the **DistilBERT tokenizer**.
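
A minimal sketch of this preprocessing step, assuming the papers are available in memory as raw texts with integer labels (the example texts below are placeholders, not items from the actual dataset):

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Placeholder examples; the real corpus is not distributed with this repository
texts = [
    "CRISPR screens reveal essential genes in human cell lines.",
    "We establish an upper bound on the chromatic number of planar graphs.",
    "Gravitational lensing constrains the dark matter distribution in clusters.",
]
labels = [0, 1, 2]  # 0: Biology, 1: Mathematics, 2: Physics

# Batch-tokenize with the same padding/truncation settings used at inference time
encodings = tokenizer(texts, padding=True, truncation=True, max_length=512)
```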

### Training

- Number of epochs: 3  
- Batch size: 8  
- Learning rate: 2e-5  
- Optimizer: AdamW  
- Evaluation strategy: epoch  
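
A sketch of this configuration with the Hugging Face `Trainer` API, continuing from the preprocessing sketch above (reusing `encodings` and `labels`); the output directory and evaluation split are placeholders, and `Trainer` uses AdamW by default:

```python
import torch
from transformers import DistilBertForSequenceClassification, TrainingArguments, Trainer

class PaperDataset(torch.utils.data.Dataset):
    """Minimal wrapper pairing tokenized encodings with integer labels."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = PaperDataset(encodings, labels)  # from the preprocessing sketch
eval_dataset = train_dataset                     # placeholder; use a held-out split in practice

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

training_args = TrainingArguments(
    output_dir="./distilbert_finetuned",  # placeholder output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```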

### Quantization

Post-training quantization was applied by converting the fine-tuned model's weights to half precision (FP16) with PyTorch, reducing the model size and improving inference efficiency.
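
A minimal sketch of this conversion step, assuming the full-precision fine-tuned checkpoint was saved locally (both paths below are placeholders):

```python
from transformers import DistilBertForSequenceClassification

# Load the full-precision fine-tuned checkpoint (placeholder path)
model = DistilBertForSequenceClassification.from_pretrained("./distilbert_finetuned")

# Cast all floating-point weights to FP16 and save the smaller checkpoint
model = model.half()
model.save_pretrained("./distilbert_finetuned_fp16")  # placeholder output path
```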

## Repository Structure

```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Fine-tuned model weights
├── README.md            # Model documentation
```

## Limitations

- The model is trained on a limited dataset and may not generalize well to niche scientific subdomains.
- Quantization may result in slight accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.