---
language: en
license: mit
tags:
- research
- classification
- scientific-papers
- bert
- academic
- nlp
datasets:
- mendeley-research
pipeline_tag: text-classification
---

# BERT Research Paper Classifier

## Model Description

`bert_text_classifier` is a fine-tuned BERT model for classifying research papers into scientific disciplines. It achieves **95.39% accuracy** on a dataset of 140,000+ research papers spanning 9 major scientific categories.

- **Model type:** BERT for sequence classification
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)

## Intended Uses & Limitations

### Primary Use

This model is intended for:

- Automatic categorization of research papers and academic publications
- Building academic recommendation systems
- Organizing digital libraries and research databases
- Educational applications in scientific literature analysis

### Limitations

- Trained primarily on Mendeley research catalog data
- Performance may degrade on papers outside the 9 trained categories
- Performs best on formal academic writing

## Categories

The model classifies research papers into 9 scientific disciplines:

| Category | Key Subfields |
|----------|---------------|
| **Biology** | Genetics, Ecology, Biochemistry, Physiology |
| **Business** | Marketing, Finance, Management, Entrepreneurship |
| **Chemistry** | Organic Chemistry, Analytical Chemistry, Biochemistry |
| **Computer Science** | AI, Cloud Computing, Cybersecurity, Software Engineering |
| **Environmental Science** | Climate Change, Conservation, Sustainability |
| **Mathematics** | Algebra, Calculus, Statistics, Optimization |
| **Medicine** | Cardiology, Surgery, Neurology, Pediatrics |
| **Physics** | Quantum Mechanics, Astrophysics, Particle Physics |
| **Psychology** | Clinical, Cognitive, Social, Neuropsychology |

## Training Data

### Dataset Statistics

- **Source:** Mendeley Research Catalog
- **Total Papers:** 140,004 (after cleaning)
- **Evaluation Samples:** 27,953
- **Cleaning Ratio:** 89.81% of the original 155,882 records retained

### Data Distribution

- Psychology: 16,821 papers (12.0%)
- Chemistry: 16,675 papers (11.9%)
- Physics: 15,941 papers (11.4%)
- Business: 15,929 papers (11.4%)
- Mathematics: 15,464 papers (11.0%)
- Medicine: 15,361 papers (11.0%)
- Computer Science: 14,776 papers (10.6%)
- Biology: 14,729 papers (10.5%)
- Environmental Science: 14,308 papers (10.2%)

## Performance

### Evaluation Results

```
{
  'eval_loss': 0.184,
  'eval_accuracy': 0.9539,
  'eval_runtime': 428.03,
  'eval_samples_per_second': 65.306
}
```

### Detailed Metrics

| Category | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |

## Usage

### Direct Inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

# Example research paper abstract
text = """
This study explores novel deep learning architectures for protein structure
prediction using transformer-based models and attention mechanisms.
"""

# Preprocess and predict
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=1).item()

# Map to category
categories = ['biology', 'business', 'chemistry', 'computerscience', 'environmentalscience',
              'mathematics', 'medicine', 'physics', 'psychology']
print(f"Predicted category: {categories[predicted_class]}")
```
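The snippet above classifies one abstract at a time and maps class indices through a hard-coded label list, so the list order must match the label ids the model was trained with. For classifying many abstracts at once, a batched variant along the following lines should work; this is a minimal sketch, and `abstracts` is a hypothetical list of input strings:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")
model.eval()

# Hypothetical batch of abstracts to classify
abstracts = [
    "A randomized trial of beta-blockers in patients with chronic heart failure.",
    "Spectral methods for solving nonlinear partial differential equations.",
]

categories = ['biology', 'business', 'chemistry', 'computerscience', 'environmentalscience',
              'mathematics', 'medicine', 'physics', 'psychology']

# Tokenize the whole batch at once; padding gives the tensors a shared shape
inputs = tokenizer(abstracts, return_tensors="pt", truncation=True, padding=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# One argmax per row yields one predicted label per abstract
for text, idx in zip(abstracts, logits.argmax(dim=-1).tolist()):
    print(f"{categories[idx]}: {text[:60]}")
```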
### Using the Pipeline API

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Emran025/bert_text_classifier",
    tokenizer="Emran025/bert_text_classifier",
)

result = classifier("Advanced quantum computing algorithms for molecular simulation")
print(result)
```

## Training Details

### Hyperparameters

- **Learning Rate:** 2e-5
- **Batch Size:** 16
- **Epochs:** 3
- **Max Sequence Length:** 512 tokens
- **Optimizer:** AdamW

A sketch of a matching `TrainingArguments` configuration appears in the appendix at the end of this card.

### Training Environment

- **Framework:** PyTorch with Transformers
- **Hardware:** Google Colab GPU
- **Training Time:** ~6 hours

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{bert_research_classifier_2024,
  title = {BERT Research Paper Classification Model},
  author = {Emran Nasser and Mohammed Alyafrosy and Ryadh Alizi},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Emran025/bert_text_classifier}}
}
```

## Contributors

- Emran Nasser (Emran025)
- Mohammed Alyafrosy
- Ryadh Alizi

## License

MIT License - see the LICENSE file for details.

## Repository

https://github.com/Emran025/Research_Paper_Classification_model
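## Appendix: Training Configuration Sketch

For reference, the hyperparameters listed under Training Details map onto a fine-tuning setup roughly like the one below. This is a minimal sketch assuming the standard Hugging Face `Trainer` API; `train_dataset` and `eval_dataset` are hypothetical tokenized datasets, and the actual training script may differ:

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=9,  # the 9 scientific disciplines
)

# Hyperparameters as listed under Training Details;
# AdamW is the Trainer's default optimizer
args = TrainingArguments(
    output_dir="bert_text_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

# train_dataset / eval_dataset are hypothetical datasets whose
# inputs were tokenized with truncation/padding to 512 tokens
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
print(trainer.evaluate())
```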