# DistilBERT-Base-Uncased Quantized Model for Scientific Paper Classification

This repository hosts a quantized version of the **DistilBERT** model, fine-tuned for **scientific paper classification** into three categories: **Biology, Mathematics, and Physics**. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for real-world applications such as academic research and automated categorization of scientific literature.

## Model Details

- **Model Architecture:** DistilBERT Base Uncased
- **Task:** Scientific Paper Classification
- **Dataset:** Custom dataset labeled with three categories: Biology, Mathematics, and Physics
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load the quantized model
quantized_model_path = "/kaggle/working/distilbert_finetuned_fp16"
quantized_model = DistilBertForSequenceClassification.from_pretrained(quantized_model_path)
quantized_model.eval()  # Set to evaluation mode
quantized_model.half()  # Convert model weights to FP16

# Load the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Define a test input
test_paper = "The quantum mechanics of atomic structures are governed by Schrödinger's equation."

# Tokenize the input
inputs = tokenizer(test_paper, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Token IDs and attention masks must remain integer (long) tensors even when the model runs in FP16
inputs["input_ids"] = inputs["input_ids"].long()
inputs["attention_mask"] = inputs["attention_mask"].long()

# Make a prediction
with torch.no_grad():
    outputs = quantized_model(**inputs)

# Get the predicted class
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map the class index to a label
label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}
predicted_label = label_mapping[predicted_class]

print(f"Predicted Label: {predicted_label}")
```

## Performance Metrics

- **Accuracy:** 0.95 (after fine-tuning)
- **F1-Score:** 0.91 (weighted)

## Fine-Tuning Details

### Dataset

The dataset consists of **scientific papers** categorized into three domains:

- **Biology**
- **Mathematics**
- **Physics**

The dataset was preprocessed and tokenized using the **DistilBERT tokenizer**.

### Training

- Number of epochs: 3
- Batch size: 8
- Learning rate: 2e-5
- Optimizer: AdamW
- Evaluation strategy: epoch

An illustrative sketch of this training setup appears in the "Fine-Tuning Sketch" section at the end of this README.

### Quantization

Post-training quantization was applied by converting the fine-tuned model's weights to FP16 with PyTorch, reducing the model size and improving inference efficiency. See the "FP16 Conversion Sketch" section at the end of this README.

## Repository Structure

```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Fine-tuned model weights
├── README.md            # Model documentation
```

## Limitations

- The model is trained on a limited dataset and may not generalize well to niche scientific subdomains.
- Quantization may result in slight accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.
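
## Fine-Tuning Sketch (illustrative)

The following is a minimal sketch of the fine-tuning setup described in the "Training" section, using the Hugging Face `Trainer` with the listed hyperparameters. The dataset file (`papers.csv`) and its column names (`text`, `label`) are assumptions for illustration and are not part of this repository.

```python
# Illustrative only: the dataset file and column names below are assumptions.
from datasets import load_dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical CSV with "text" and "label" columns (0 = Biology, 1 = Mathematics, 2 = Physics)
dataset = load_dataset("csv", data_files="papers.csv")["train"].train_test_split(test_size=0.1)

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Use the same 512-token limit applied at inference time
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

# Hyperparameters from the "Training" section above; the optimizer defaults to AdamW
training_args = TrainingArguments(
    output_dir="distilbert_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()
trainer.save_model("distilbert_finetuned")
tokenizer.save_pretrained("distilbert_finetuned")
```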
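
## FP16 Conversion Sketch (illustrative)

The FP16 conversion described in the "Quantization" section can be reproduced with a few lines of PyTorch. This is a minimal sketch, assuming the full-precision fine-tuned checkpoint lives in `distilbert_finetuned` and the FP16 copy is written to `distilbert_finetuned_fp16`; both directory names are illustrative.

```python
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

# Load the full-precision fine-tuned checkpoint (path is illustrative)
model = DistilBertForSequenceClassification.from_pretrained("distilbert_finetuned")

# Convert all floating-point parameters to FP16, roughly halving the on-disk size
model = model.half()

# Save the FP16 model and the tokenizer alongside it
model.save_pretrained("distilbert_finetuned_fp16")
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
tokenizer.save_pretrained("distilbert_finetuned_fp16")
```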