|
|
--- |
|
|
language: en |
|
|
license: mit |
|
|
tags: |
|
|
- research |
|
|
- classification |
|
|
- scientific-papers |
|
|
- bert |
|
|
- academic |
|
|
- nlp |
|
|
datasets: |
|
|
- mendeley-research |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
|
|
|
# BERT Research Paper Classifier |
|
|
|
|
|
## Model Description |
|
|
|
|
|
`bert_text_classifier` is a BERT model fine-tuned to classify research papers into scientific disciplines. It achieves **95.39% accuracy** on a held-out evaluation set drawn from a corpus of 140,000+ research papers spanning 9 major scientific categories.
|
|
|
|
|
- **Model type:** BERT for sequence classification |
|
|
- **Language(s):** English |
|
|
- **License:** MIT |
|
|
- **Finetuned from:** [bert-base-uncased](https://huggingface.co/bert-base-uncased) |
|
|
|
|
|
## Intended Uses & Limitations |
|
|
|
|
|
### Primary Use |
|
|
This model is intended for: |
|
|
- Automatic categorization of research papers and academic publications |
|
|
- Building academic recommendation systems |
|
|
- Organizing digital libraries and research databases |
|
|
- Educational applications in scientific literature analysis |
|
|
|
|
|
### Limitations |
|
|
- Trained primarily on Mendeley Research Catalog data
- Performance may vary on papers outside the 9 trained categories
- Performs best on formal academic writing
|
|
|
|
|
## Categories |
|
|
|
|
|
The model classifies research papers into 9 scientific disciplines: |
|
|
|
|
|
| Category | Key Subfields | |
|
|
|----------|---------------| |
|
|
| **Biology** | Genetics, Ecology, Biochemistry, Physiology | |
|
|
| **Business** | Marketing, Finance, Management, Entrepreneurship | |
|
|
| **Chemistry** | Organic Chemistry, Analytical Chemistry, Biochemistry | |
|
|
| **Computer Science** | AI, Cloud Computing, Cybersecurity, Software Engineering | |
|
|
| **Environmental Science** | Climate Change, Conservation, Sustainability | |
|
|
| **Mathematics** | Algebra, Calculus, Statistics, Optimization | |
|
|
| **Medicine** | Cardiology, Surgery, Neurology, Pediatrics | |
|
|
| **Physics** | Quantum Mechanics, Astrophysics, Particle Physics | |
|
|
| **Psychology** | Clinical, Cognitive, Social, Neuropsychology | |
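
The usage examples below hard-code these labels; they can also be read from the model's configuration. A minimal sketch, assuming the repository's `config.json` stores human-readable names in `id2label`:

```python
from transformers import AutoConfig

# Load only the configuration (no weights) to inspect the label set.
config = AutoConfig.from_pretrained("Emran025/bert_text_classifier")

# id2label maps class indices to label strings; whether this config uses
# human-readable names (rather than LABEL_0 ... LABEL_8) is an assumption.
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```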
|
|
|
|
|
## Training Data |
|
|
|
|
|
### Dataset Statistics |
|
|
- **Source:** Mendeley Research Catalog |
|
|
- **Total Papers:** 140,004 (after cleaning) |
|
|
- **Evaluation Samples:** 27,953 (held-out split used for the metrics below)
|
|
- **Cleaning Retention:** 89.81% (140,004 of the original 155,882 records kept)
|
|
|
|
|
### Data Distribution |
|
|
- Psychology: 16,821 papers (12.0%) |
|
|
- Chemistry: 16,675 papers (11.9%) |
|
|
- Physics: 15,941 papers (11.4%) |
|
|
- Business: 15,929 papers (11.4%) |
|
|
- Mathematics: 15,464 papers (11.0%) |
|
|
- Medicine: 15,361 papers (11.0%) |
|
|
- Computer Science: 14,776 papers (10.6%) |
|
|
- Biology: 14,729 papers (10.5%) |
|
|
- Environmental Science: 14,308 papers (10.2%) |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Evaluation Results |
|
|
``` |
|
|
|
|
|
{ |
|
|
'eval_loss': 0.184, |
|
|
'eval_accuracy': 0.9539, |
|
|
'eval_runtime': 428.03, |
|
|
'eval_samples_per_second': 65.306 |
|
|
} |
|
|
|
|
|
``` |
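
As a sanity check, the runtime and throughput agree with the 27,953-example evaluation set described above: 428.03 s × 65.306 samples/s ≈ 27,953 samples.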
|
|
|
|
|
### Detailed Metrics |
|
|
|
|
|
| Category | Precision | Recall | F1-Score | Support | |
|
|
|----------|-----------|--------|----------|---------| |
|
|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 | |
|
|
| Business | 0.96 | 0.97 | 0.97 | 3,179 | |
|
|
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 | |
|
|
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 | |
|
|
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 | |
|
|
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 | |
|
|
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 | |
|
|
| Physics | 0.97 | 0.95 | 0.96 | 3,181 | |
|
|
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 | |
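
A table like the one above can be reproduced with scikit-learn's `classification_report`, given predictions on the held-out split. A minimal sketch; `eval_texts` and `eval_labels` are hypothetical placeholders, since the evaluation data is not bundled with the model:

```python
import torch
from sklearn.metrics import classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")
model.eval()

# Hypothetical placeholders: supply the held-out texts and integer label ids.
eval_texts = ["Protein folding dynamics under thermal stress"]
eval_labels = [0]

preds = []
with torch.no_grad():
    for i in range(0, len(eval_texts), 32):  # simple mini-batching
        batch = tokenizer(eval_texts[i:i + 32], return_tensors="pt",
                          truncation=True, padding=True, max_length=512)
        logits = model(**batch).logits
        preds.extend(logits.argmax(dim=-1).tolist())

print(classification_report(eval_labels, preds))
```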
|
|
|
|
|
## Usage |
|
|
|
|
|
### Direct Inference |
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification |
|
|
import torch |
|
|
|
|
|
# Load model and tokenizer |
|
|
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier") |
|
|
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier") |
|
|
|
|
|
# Example research paper abstract |
|
|
text = """ |
|
|
This study explores novel deep learning architectures for protein structure |
|
|
prediction using transformer-based models and attention mechanisms. |
|
|
""" |
|
|
|
|
|
# Preprocess and predict |
|
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512) |
|
|
with torch.no_grad(): |
|
|
outputs = model(**inputs) |
|
|
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) |
|
|
predicted_class = torch.argmax(predictions, dim=1).item() |
|
|
|
|
|
# Map class index to category name (order assumed to match the model's label encoding)
|
|
categories = ['biology', 'business', 'chemistry', 'computerscience', |
|
|
'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology'] |
|
|
print(f"Predicted category: {categories[predicted_class]}") |
|
|
``` |
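
To see the model's confidence across disciplines rather than only the top label, the softmax output can be ranked; this short continuation reuses `predictions` and `categories` from the snippet above:

```python
# Rank the top 3 categories by predicted probability
# (continues the previous snippet).
top_probs, top_ids = torch.topk(predictions, k=3, dim=-1)
for prob, idx in zip(top_probs[0].tolist(), top_ids[0].tolist()):
    print(f"{categories[idx]}: {prob:.3f}")
```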
|
|
|
|
|
### Using Pipeline
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
classifier = pipeline("text-classification", |
|
|
model="Emran025/bert_text_classifier", |
|
|
tokenizer="Emran025/bert_text_classifier") |
|
|
|
|
|
result = classifier("Advanced quantum computing algorithms for molecular simulation") |
|
|
print(result) |
|
|
``` |
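
The pipeline also accepts a list of texts, and `top_k=None` requests scores for every category (supported on recent `transformers` releases; older versions used `return_all_scores=True` instead). A short sketch:

```python
# Score several abstracts at once; top_k=None returns all 9 category scores.
results = classifier(
    ["Gene expression analysis in marine ecosystems",
     "Stochastic optimization methods for convex problems"],
    top_k=None,
)
print(results)
```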
|
|
|
|
|
## Training Details
|
|
|
|
|
### Hyperparameters
|
|
|
|
|
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Max Sequence Length:** 512 tokens
- **Optimizer:** AdamW
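
The training script is not reproduced here, but the hyperparameters above map directly onto Hugging Face `TrainingArguments`; a minimal sketch of the equivalent configuration (dataset preparation and `Trainer` wiring omitted; anything not listed above is an assumption):

```python
from transformers import TrainingArguments

# Sketch of the reported hyperparameters; output_dir is a placeholder.
# max_length=512 is applied at tokenization time rather than here.
args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    optim="adamw_torch",  # AdamW, as reported above
)
```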
|
|
|
|
|
### Training Environment
|
|
|
|
|
- **Framework:** PyTorch with Hugging Face Transformers
- **Hardware:** Google Colab GPU
- **Training Time:** ~6 hours
|
|
|
|
|
## Citation
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{bert_research_classifier_2024, |
|
|
title = {BERT Research Paper Classification Model}, |
|
|
author = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi}, |
|
|
year = {2024}, |
|
|
publisher = {Hugging Face}, |
|
|
howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contributors
|
|
|
|
|
- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi
|
|
|
|
|
## License
|
|
|
|
|
MIT License - see LICENSE file for details. |
|
|
|
|
|
## Repository
|
|
|
|
|
https://github.com/Emran025/Research_Paper_Classification_model |
|
|
|