---
language: en
license: mit
tags:
- research
- classification
- scientific-papers
- bert
- academic
- nlp
datasets:
- mendeley-research
pipeline_tag: text-classification
---
# BERT Research Paper Classifier
## Model Description
`bert_text_classifier` is a fine-tuned BERT model for classifying research papers into scientific disciplines. It achieves **95.39% accuracy** on a held-out evaluation split drawn from a dataset of 140,000+ research papers spanning 9 major scientific categories.
- **Model type:** BERT for sequence classification
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
## Intended Uses & Limitations
### Primary Use
This model is intended for:
- Automatic categorization of research papers and academic publications
- Building academic recommendation systems
- Organizing digital libraries and research databases
- Educational applications in scientific literature analysis
### Limitations
- Trained primarily on Mendeley research catalog data
- Performance may vary on papers outside the 9 trained categories; a confidence-threshold check (sketched below) can help flag such inputs
- Best performance on formal academic writing style
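Because the classifier always returns one of the nine labels, out-of-scope inputs are silently forced into the nearest category. A minimal mitigation is to threshold the softmax confidence before accepting a prediction. The sketch below is illustrative only; the 0.5 cutoff is an assumed value, not one tuned for this model:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

def classify_with_threshold(text, threshold=0.5):
    """Return (label_id, confidence), or (None, confidence) below the threshold."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    confidence, label_id = probs.max(dim=-1)
    if confidence.item() < threshold:  # assumed cutoff; tune on held-out data
        return None, confidence.item()
    return label_id.item(), confidence.item()
```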
## Categories
The model classifies research papers into 9 scientific disciplines:
| Category | Key Subfields |
|----------|---------------|
| **Biology** | Genetics, Ecology, Biochemistry, Physiology |
| **Business** | Marketing, Finance, Management, Entrepreneurship |
| **Chemistry** | Organic Chemistry, Analytical Chemistry, Biochemistry |
| **Computer Science** | AI, Cloud Computing, Cybersecurity, Software Engineering |
| **Environmental Science** | Climate Change, Conservation, Sustainability |
| **Mathematics** | Algebra, Calculus, Statistics, Optimization |
| **Medicine** | Cardiology, Surgery, Neurology, Pediatrics |
| **Physics** | Quantum Mechanics, Astrophysics, Particle Physics |
| **Psychology** | Clinical, Cognitive, Social, Neuropsychology |
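At inference time the index-to-label mapping is best read from the model config rather than hardcoded. A minimal check, assuming the uploaded config defines `id2label` (if it only contains the default `LABEL_0`–`LABEL_8` placeholders, fall back to the ordering shown in the Usage section below):
```python
from transformers import AutoConfig

# Inspect the label mapping shipped with the checkpoint.
config = AutoConfig.from_pretrained("Emran025/bert_text_classifier")
print(config.id2label)  # may show default LABEL_0..LABEL_8 if not set at export time
```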
## Training Data
### Dataset Statistics
- **Source:** Mendeley Research Catalog
- **Total Papers:** 140,004 (after cleaning)
- **Evaluation Samples:** 27,953 (held-out split; per-category support is listed under Detailed Metrics)
- **Retention After Cleaning:** 89.81% (140,004 of the original 155,882 records)
### Data Distribution
- Psychology: 16,821 papers (12.0%)
- Chemistry: 16,675 papers (11.9%)
- Physics: 15,941 papers (11.4%)
- Business: 15,929 papers (11.4%)
- Mathematics: 15,464 papers (11.0%)
- Medicine: 15,361 papers (11.0%)
- Computer Science: 14,776 papers (10.6%)
- Biology: 14,729 papers (10.5%)
- Environmental Science: 14,308 papers (10.2%)
## Performance
### Evaluation Results
```python
{
    'eval_loss': 0.184,
    'eval_accuracy': 0.9539,
    'eval_runtime': 428.03,
    'eval_samples_per_second': 65.306
}
```
### Detailed Metrics
| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |
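A per-category table like the one above can be regenerated with scikit-learn once the model's predictions over the evaluation split are collected. A minimal sketch; the `y_true`/`y_pred` arrays here are tiny placeholders standing in for the real 27,953-sample outputs:
```python
import numpy as np
from sklearn.metrics import classification_report

categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine',
              'physics', 'psychology']

# Placeholder arrays; in practice these come from running the model
# over the full evaluation split.
y_true = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
y_pred = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])

print(classification_report(y_true, y_pred, target_names=categories, digits=2))
```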
## Usage
### Direct Inference
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=1).item()

# Map the predicted index to a category name
categories = ['biology', 'business', 'chemistry', 'computerscience',
              'environmentalscience', 'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
```
### Using Pipeline
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Emran025/bert_text_classifier",
    tokenizer="Emran025/bert_text_classifier",
)

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
```
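By default the pipeline returns only the highest-scoring label. To inspect scores for all nine categories (useful for the confidence-threshold idea under Limitations), recent transformers versions accept `top_k=None`; continuing the example above:
```python
# Return a score for every category instead of just the top one.
all_scores = classifier("Advanced quantum computing algorithms for molecular simulation",
                        top_k=None)
print(all_scores)  # list of {'label': ..., 'score': ...} entries, one per category
```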
## Training Details
### Hyperparameters
- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Max Sequence Length:** 512 tokens
- **Optimizer:** AdamW
### Training Environment
- **Framework:** PyTorch with Transformers
- **Hardware:** Google Colab GPU
- **Training Time:** ~6 hours
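The full training script is not reproduced here; the sketch below shows one plausible mapping of the listed hyperparameters onto the Hugging Face `Trainer` API. Dataset preparation and the `Trainer` call itself are omitted, and the output directory name is an assumption:
```python
from transformers import TrainingArguments

# Illustrative configuration matching the hyperparameters above;
# dataset loading, tokenization, and the Trainer call are omitted.
args = TrainingArguments(
    output_dir="bert_text_classifier",  # assumed output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
    # AdamW is the Trainer's default optimizer, matching the card.
)
```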
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{bert_research_classifier_2024,
  title        = {BERT Research Paper Classification Model},
  author       = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}
```
## Contributors
- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi
## License
MIT License - see the LICENSE file for details.
## Repository
https://github.com/Emran025/Research_Paper_Classification_model