---
language: en
license: mit
tags:
- research
- classification
- scientific-papers
- bert
- academic
- nlp
datasets:
- mendeley-research
pipeline_tag: text-classification
---

# BERT Research Paper Classifier

## Model Description

`bert_text_classifier` is a fine-tuned BERT model for classifying research papers into scientific disciplines. It achieves **95.39% accuracy** on a dataset of 140,000+ research papers spanning 9 major scientific categories.

- **Model type:** BERT for sequence classification
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)

## Intended Uses & Limitations

### Primary Use

This model is intended for:

- Automatic categorization of research papers and academic publications
- Building academic recommendation systems
- Organizing digital libraries and research databases
- Educational applications in scientific literature analysis

### Limitations

- Trained primarily on Mendeley research catalog data
- Performance may degrade on papers outside the 9 trained categories
- Performs best on formal academic writing

## Categories

The model classifies research papers into 9 scientific disciplines:

| Category | Key Subfields |
|----------|---------------|
| **Biology** | Genetics, Ecology, Biochemistry, Physiology |
| **Business** | Marketing, Finance, Management, Entrepreneurship |
| **Chemistry** | Organic Chemistry, Analytical Chemistry, Biochemistry |
| **Computer Science** | AI, Cloud Computing, Cybersecurity, Software Engineering |
| **Environmental Science** | Climate Change, Conservation, Sustainability |
| **Mathematics** | Algebra, Calculus, Statistics, Optimization |
| **Medicine** | Cardiology, Surgery, Neurology, Pediatrics |
| **Physics** | Quantum Mechanics, Astrophysics, Particle Physics |
| **Psychology** | Clinical, Cognitive, Social, Neuropsychology |

## Training Data

### Dataset Statistics

- **Source:** Mendeley Research Catalog
- **Total Papers:** 140,004 (after cleaning)
- **Evaluation Samples:** 27,953
- **Cleaning Ratio:** 89.81% of the original 155,882 records retained

### Data Distribution

- Psychology: 16,821 papers (12.0%)
- Chemistry: 16,675 papers (11.9%)
- Physics: 15,941 papers (11.4%)
- Business: 15,929 papers (11.4%)
- Mathematics: 15,464 papers (11.0%)
- Medicine: 15,361 papers (11.0%)
- Computer Science: 14,776 papers (10.6%)
- Biology: 14,729 papers (10.5%)
- Environmental Science: 14,308 papers (10.2%)

## Performance

### Evaluation Results

```
{
  'eval_loss': 0.184,
  'eval_accuracy': 0.9539,
  'eval_runtime': 428.03,
  'eval_samples_per_second': 65.306
}
```

### Detailed Metrics

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |

## Usage

### Direct Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=1).item()

# Map to category
categories = ['biology', 'business', 'chemistry', 'computerscience', 'environmentalscience',
              'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
```
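The snippet above classifies one abstract at a time and maps class indices through a hard-coded label list, so the list order must match the label ids the model was trained with. For classifying many abstracts at once, a batched variant along the following lines should work; this is a minimal sketch, and `abstracts` is a hypothetical list of input strings:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")
model.eval()

# Hypothetical batch of abstracts to classify
abstracts = [
    "A randomized trial of beta-blockers in patients with chronic heart failure.",
    "Spectral methods for solving nonlinear partial differential equations.",
]

categories = ['biology', 'business', 'chemistry', 'computerscience', 'environmentalscience',
              'mathematics', 'medicine', 'physics', 'psychology']

# Tokenize the whole batch at once; padding gives the tensors a shared shape
inputs = tokenizer(abstracts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# One argmax per row yields one predicted label per abstract
for text, idx in zip(abstracts, logits.argmax(dim=-1).tolist()):
    print(f"{categories[idx]}: {text[:60]}")
```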
### Using the Pipeline API

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Emran025/bert_text_classifier",
    tokenizer="Emran025/bert_text_classifier",
)

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
```

## Training Details

### Hyperparameters

- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Max Sequence Length:** 512 tokens
- **Optimizer:** AdamW

A sketch of a matching `TrainingArguments` configuration appears in the appendix at the end of this card.

### Training Environment

- **Framework:** PyTorch with Transformers
- **Hardware:** Google Colab GPU
- **Training Time:** ~6 hours

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bert_research_classifier_2024,
  title = {BERT Research Paper Classification Model},
  author = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}
```

## Contributors

- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi

## License

MIT License - see the LICENSE file for details.

## Repository

https://github.com/Emran025/Research_Paper_Classification_model
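## Appendix: Training Configuration Sketch

For reference, the hyperparameters listed under Training Details map onto a fine-tuning setup roughly like the one below. This is a minimal sketch assuming the standard Hugging Face `Trainer` API; `train_dataset` and `eval_dataset` are hypothetical tokenized datasets, and the actual training script may differ:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=9,  # the 9 scientific disciplines
)

# Hyperparameters as listed under Training Details;
# AdamW is the Trainer's default optimizer
args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# train_dataset / eval_dataset are hypothetical datasets whose
# inputs were tokenized with truncation/padding to 512 tokens
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
print(trainer.evaluate())
```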