# DistilBERT-Base-Uncased Quantized Model for Scientific Paper Classification
This repository hosts a quantized version of the **DistilBERT** model, fine-tuned for **scientific paper classification** into three categories: **Biology, Mathematics, and Physics**. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for real-world applications, including academic research and automated categorization of scientific literature.
## Model Details
- **Model Architecture:** DistilBERT Base Uncased
- **Task:** Scientific Paper Classification
- **Dataset:** Custom dataset labeled with three categories: Biology, Mathematics, and Physics
- **Quantization:** Float16 (FP16)
- **Fine-tuning Framework:** Hugging Face Transformers
## Usage
### Installation
```sh
pip install transformers torch
```
### Loading the Model
```python
import torch
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

# Load the quantized model
quantized_model_path = "/kaggle/working/distilbert_finetuned_fp16"
quantized_model = DistilBertForSequenceClassification.from_pretrained(quantized_model_path)
quantized_model.eval()  # Set to evaluation mode

# FP16 matrix ops are not fully supported on CPU, so run in half
# precision only when a GPU is available and fall back to FP32 otherwise
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    quantized_model.half()  # Convert model weights to FP16
quantized_model.to(device)

# Load tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Define a test input
test_paper = "The quantum mechanics of atomic structures are governed by Schrödinger's equation."

# Tokenize input; input_ids and attention_mask are integer tensors
# and stay integer regardless of the model's floating-point dtype
inputs = tokenizer(test_paper, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Make prediction
with torch.no_grad():
    outputs = quantized_model(**inputs)

# Map the highest-scoring logit to its class label
predicted_class = torch.argmax(outputs.logits, dim=1).item()
label_mapping = {0: "Biology", 1: "Mathematics", 2: "Physics"}
predicted_label = label_mapping[predicted_class]
print(f"Predicted Label: {predicted_label}")
```
## Performance Metrics
- **Accuracy:** 0.95 (after fine-tuning)
- **F1-Score:** 0.91 (weighted)
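These are standard classification metrics. As a minimal sketch of how they can be reproduced with scikit-learn (`pip install scikit-learn`), where `y_true` and `y_pred` are hypothetical placeholders for held-out gold labels and model predictions:
```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and predictions for a held-out test split
y_true = [0, 2, 1, 2, 0, 1]
y_pred = [0, 2, 1, 1, 0, 1]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(f"Weighted F1: {f1_score(y_true, y_pred, average='weighted'):.2f}")
```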
## Fine-Tuning Details
### Dataset
The dataset consists of **scientific papers** categorized into three domains:
- **Biology**
- **Mathematics**
- **Physics**
The dataset was preprocessed and tokenized using the **DistilBERT tokenizer**.
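As a non-authoritative sketch of that preprocessing step, assuming the raw data lives in a CSV file with `text` and `label` columns (the file name and column names are assumptions, not the original pipeline):
```python
from datasets import load_dataset
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# "papers.csv" with "text" and "label" columns is an assumed layout
dataset = load_dataset("csv", data_files="papers.csv")["train"]

def tokenize(batch):
    # Pad/truncate to DistilBERT's 512-token limit
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

tokenized_dataset = dataset.map(tokenize, batched=True)
```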
### Training
- Number of epochs: 3
- Batch size: 8
- Learning rate: 2e-5
- Optimizer: AdamW
- Evaluation strategy: epoch
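A minimal fine-tuning sketch with the Hugging Face `Trainer` that mirrors the hyperparameters above; `train_ds` and `eval_ds` are assumed to be tokenized splits such as the one produced in the Dataset section:
```python
from transformers import (
    DistilBertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

training_args = TrainingArguments(
    output_dir="distilbert_finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    evaluation_strategy="epoch",  # AdamW is the Trainer's default optimizer
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # assumed tokenized training split
    eval_dataset=eval_ds,    # assumed tokenized evaluation split
)
trainer.train()
```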
### Quantization
Post-training quantization was applied by casting the fine-tuned weights to half precision (FP16) with PyTorch, roughly halving the model size and improving inference efficiency on FP16-capable hardware.
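A sketch of that conversion step, assuming the full-precision checkpoint was saved to a local directory (the input path is a placeholder; the output path matches the one loaded in the Usage section):
```python
from transformers import DistilBertForSequenceClassification

# Load the full-precision fine-tuned checkpoint (placeholder path)
model = DistilBertForSequenceClassification.from_pretrained("distilbert_finetuned")

# Cast all floating-point parameters to FP16 and save the smaller checkpoint
model = model.half()
model.save_pretrained("/kaggle/working/distilbert_finetuned_fp16")
```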
## Repository Structure
```
.
├── model/              # Quantized model files
├── tokenizer_config/   # Tokenizer configuration and vocabulary files
├── model.safetensors   # Fine-tuned model weights
└── README.md           # Model documentation
```
## Limitations
- The model is trained on a limited dataset and may not generalize well to niche scientific subdomains.
- Quantization may result in slight accuracy degradation compared to full-precision models.
## Contributing
Contributions are welcome! Feel free to open an issue or submit a pull request if you have suggestions or improvements.