# DistilBERT-Base-Uncased Quantized Model for Scientific Paper Classification

This repository hosts a quantized version of the **DistilBERT** model, fine-tuned for **scientific paper classification** into three categories: **Biology, Mathematics, and Physics**. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for real-world applications such as academic research and automated categorization of scientific literature.
## Model Details

- **Model Architecture:** DistilBERT Base Uncased
- **Task:** Scientific Paper Classification
- **Dataset:** Custom dataset labeled with three categories: Biology, Mathematics, and Physics
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage

### Installation

```sh
pip install transformers torch
```
### Loading the Model

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load the quantized model
quantized_model_path = "/kaggle/working/distilbert_finetuned_fp16"
quantized_model = DistilBertForSequenceClassification.from_pretrained(quantized_model_path)
quantized_model.eval()   # set to evaluation mode
quantized_model.half()   # ensure weights are in FP16

# Load the tokenizer (the vocabulary is unchanged from the base model)
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Define a test input
test_paper = "The quantum mechanics of atomic structures are governed by Schrödinger's equation."

# Tokenize the input
inputs = tokenizer(test_paper, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Token indices must stay integer-typed even though the model weights are FP16
inputs["input_ids"] = inputs["input_ids"].long()
inputs["attention_mask"] = inputs["attention_mask"].long()

# Make a prediction
with torch.no_grad():
    outputs = quantized_model(**inputs)

# Map the predicted class index to its label
predicted_class = torch.argmax(outputs.logits, dim=1).item()
label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}
predicted_label = label_mapping[predicted_class]
print(f"Predicted Label: {predicted_label}")
```
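For classifying several papers at once, the single-input code above generalizes naturally to batches. The sketch below is illustrative rather than part of the repository: the `classify_papers` helper and the example abstracts are made up, and the snippet reuses the `quantized_model` and `tokenizer` objects loaded above.

```python
def classify_papers(texts, model, tokenizer, max_length=512):
    """Classify a batch of paper texts and return predicted label strings."""
    label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}
    inputs = tokenizer(texts, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_length)
    with torch.no_grad():
        logits = model(**inputs).logits
    return [label_mapping[i] for i in logits.argmax(dim=-1).tolist()]

# Illustrative example abstracts (not from the training data)
papers = [
    "CRISPR-Cas9 enables targeted editing of genomic sequences in vivo.",
    "We prove a new upper bound on the chromatic number of sparse graphs.",
]
print(classify_papers(papers, quantized_model, tokenizer))
```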
## Performance Metrics

- **Accuracy:** 0.95 (after fine-tuning)
- **F1-Score:** 0.91 (weighted)
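As a rough guide to reproducing these numbers, a held-out test set could be scored as follows. This is a sketch, not the actual evaluation script: it assumes lists `test_texts` and `test_labels` (integer labels matching the mapping above) and requires `scikit-learn` in addition to the packages installed earlier.

```python
from sklearn.metrics import accuracy_score, f1_score

# Assumed held-out split: test_texts (list[str]) and test_labels (list[int])
predictions = []
for text in test_texts:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = quantized_model(**inputs).logits
    predictions.append(logits.argmax(dim=-1).item())

print("Accuracy:", accuracy_score(test_labels, predictions))
print("Weighted F1:", f1_score(test_labels, predictions, average="weighted"))
```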
## Fine-Tuning Details

### Dataset

The dataset consists of **scientific papers** categorized into three domains:

- **Biology**
- **Mathematics**
- **Physics**

The dataset was preprocessed and tokenized using the **DistilBERT tokenizer**.
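In outline, that preprocessing step might look like the following. The file name and column names here are assumptions for illustration; the original pipeline may have differed.

```python
from datasets import load_dataset
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_batch(batch):
    # Truncate to DistilBERT's 512-token limit; padding is applied per batch later
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Hypothetical CSV with "text" and "label" columns
dataset = load_dataset("csv", data_files="papers.csv")
tokenized = dataset.map(tokenize_batch, batched=True)
```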
### Training

- Number of epochs: 3
- Batch size: 8
- Learning rate: 2e-5
- Optimizer: AdamW
- Evaluation strategy: epoch
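A minimal fine-tuning sketch using these settings is shown below. The `tokenized` splits are assumed to come from the preprocessing sketch above, and the script is illustrative rather than the exact one used; note that `Trainer` uses AdamW by default.

```python
from transformers import (DistilBertForSequenceClassification, Trainer,
                          TrainingArguments)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)

training_args = TrainingArguments(
    output_dir="distilbert_finetuned",   # assumed output directory
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    eval_strategy="epoch",               # "evaluation_strategy" in older transformers
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],        # assumed tokenized splits
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,                     # enables dynamic padding per batch
)
trainer.train()
```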
### Quantization

The fine-tuned model was quantized post-training by converting its weights to half precision (FP16) with PyTorch's built-in `half()` method, reducing the model size and improving inference efficiency.
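In outline, that conversion is a small step on top of the fine-tuned checkpoint. The input path below is an assumption; the output path mirrors the one used in the loading example above.

```python
from transformers import DistilBertForSequenceClassification

# Load the full-precision fine-tuned model, cast its weights to FP16, and save
model = DistilBertForSequenceClassification.from_pretrained("distilbert_finetuned")
model = model.half()  # casts floating-point parameters to torch.float16
model.save_pretrained("/kaggle/working/distilbert_finetuned_fp16")
```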
## Repository Structure

```
.
├── model/               # Quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Fine-tuned model weights
└── README.md            # Model documentation
```
## Limitations

- The model is trained on a limited dataset and may not generalize well to niche scientific subdomains.
- Quantization may result in slight accuracy degradation compared to full-precision models.
## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.