# DistilBERT-Base-Uncased Quantized Model for Scientific Paper Classification

This repository hosts a quantized version of the **DistilBERT** model, fine-tuned for **scientific paper classification** into three categories: **Biology, Mathematics, and Physics**. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for real-world applications, including academic research and automated categorization of scientific literature.

## Model Details

- **Model Architecture:** DistilBERT Base Uncased
- **Task:** Scientific Paper Classification
- **Dataset:** Custom dataset labeled with three categories: Biology, Mathematics, and Physics
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers

## Usage

### Installation

```sh
pip install transformers torch
```

### Loading the Model

```python
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
import torch

# Load the quantized model (path points at the saved FP16 checkpoint)
quantized_model_path = "/kaggle/working/distilbert_finetuned_fp16"
quantized_model = DistilBertForSequenceClassification.from_pretrained(quantized_model_path)
quantized_model.eval()   # Set to evaluation mode
quantized_model.half()   # Ensure weights are FP16

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Define a test input
test_paper = "The quantum mechanics of atomic structures are governed by Schrödinger's equation."

# Tokenize input
inputs = tokenizer(test_paper, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Token ids and attention masks must stay integer (long) tensors even for an FP16 model
inputs["input_ids"] = inputs["input_ids"].long()
inputs["attention_mask"] = inputs["attention_mask"].long()

# Make prediction
with torch.no_grad():
    outputs = quantized_model(**inputs)

# Get predicted class
predicted_class = torch.argmax(outputs.logits, dim=1).item()

# Map class indices to labels
label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}

predicted_label = label_mapping[predicted_class]
print(f"Predicted Label: {predicted_label}")
```
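
FP16 inference is fastest on a GPU; some CPU kernels have limited half-precision support, so falling back to FP32 on CPU is a safe default. A minimal, hedged sketch of such device handling (this logic is illustrative, not part of the original script), reusing `quantized_model` and `inputs` from above:

```python
import torch

# Prefer GPU for FP16; fall back to FP32 on CPU, where half-precision
# kernels may be unavailable or slow.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
quantized_model = quantized_model.half() if device.type == "cuda" else quantized_model.float()
quantized_model.to(device)

inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    outputs = quantized_model(**inputs)
```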

## Performance Metrics

- **Accuracy:** 0.95 (after fine-tuning)
- **F1-Score:** 0.91 (weighted)
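
The scores above can be recomputed along the following lines; this is a minimal sketch in which `texts` (paper abstracts) and `labels` (gold class ids) are placeholders you supply, and `scikit-learn` is assumed to be installed:

```python
import torch
from sklearn.metrics import accuracy_score, f1_score

# texts: list[str] of paper abstracts; labels: list[int] of gold classes (0-2).
# Both are placeholders; this repository does not ship an evaluation split.
preds = []
for text in texts:
    enc = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        logits = quantized_model(**enc).logits
    preds.append(int(torch.argmax(logits, dim=1)))

print("Accuracy:", accuracy_score(labels, preds))
print("Weighted F1:", f1_score(labels, preds, average="weighted"))
```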

## Fine-Tuning Details

### Dataset

The dataset consists of **scientific papers** categorized into three domains:
- **Biology**
- **Mathematics**
- **Physics**

The dataset was preprocessed and tokenized using the **DistilBERT tokenizer**.
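
A hedged sketch of that tokenization step, assuming the raw examples live in a Hugging Face `datasets` object with `text` and `label` columns (the column names and the `dataset` variable are illustrative):

```python
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Pad/truncate each paper to DistilBERT's 512-token limit
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

# `dataset` is an assumed datasets.Dataset with "text" and "label" columns
encoded_dataset = dataset.map(tokenize, batched=True)
```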

### Training

- Number of epochs: 3
- Batch size: 8
- Learning rate: 2e-5
- Optimizer: AdamW
- Evaluation strategy: epoch
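
A minimal sketch of a `Trainer` setup matching these hyperparameters (the output path and the dataset splits are placeholders; AdamW is the `Trainer` default optimizer):

```python
from transformers import (DistilBertForSequenceClassification, Trainer,
                          TrainingArguments)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)

args = TrainingArguments(
    output_dir="distilbert_finetuned",  # placeholder output path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded_train,  # assumed tokenized splits
    eval_dataset=encoded_eval,
)
trainer.train()
```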

### Quantization

Post-training quantization was applied by casting the fine-tuned model's weights to half precision (FP16) with PyTorch, roughly halving the model size and improving inference efficiency.
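
A minimal sketch of that conversion (both checkpoint paths are placeholders):

```python
from transformers import DistilBertForSequenceClassification

# Load the full-precision fine-tuned checkpoint, cast to FP16, and save
model = DistilBertForSequenceClassification.from_pretrained("distilbert_finetuned")
model.half()
model.save_pretrained("distilbert_finetuned_fp16")
```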

## Repository Structure

```
.
├── model/               # Contains the quantized model files
├── tokenizer_config/    # Tokenizer configuration and vocabulary files
├── model.safetensors    # Fine-tuned model weights
├── README.md            # Model documentation
```

## Limitations

- The model is trained on a limited dataset and may not generalize well to niche scientific subdomains.
- Quantization may result in slight accuracy degradation compared to full-precision models.

## Contributing

Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.