---
tags:
- text-classification
- roberta
- scientific-abstracts
- multi-class
- research-field-classification
datasets:
- ScientificArticleAbstract_Classification
license: apache-2.0
model-index:
- name: ScientificTextClassification_ResearchField
  results:
  - task:
      name: Text Classification
      type: text-classification
    metrics:
    - type: accuracy
      value: 0.941
      name: Accuracy (Top-1)
    - type: macro_f1
      value: 0.935
      name: Macro F1 Score
---

# ScientificTextClassification_ResearchField

## 📚 Overview

This is a **RoBERTa-base** model fine-tuned for multi-class classification of scientific article abstracts. The model predicts the **primary research field** (e.g., Physics, Biology, Computer Science) from the abstract text alone, making it a practical tool for automated journal indexing and literature review organization.

## 🧠 Model Architecture

RoBERTa was chosen for its robustness and strong handling of the long-range dependencies common in technical and scientific prose.

* **Base Model:** `roberta-base` (a robustly optimized BERT pretraining approach that drops the next-sentence prediction objective).
* **Classification Head:** Outputs 8 distinct categories (`num_labels: 8`).
* **Input Data:** Detailed scientific abstracts from diverse journals.
* **Output:** A probability distribution over the 8 classes: Physics, Chemistry, Medicine, Computer Science, Biology, Geoscience, Materials Science, and Engineering.
* **Training Dataset:** **ScientificArticleAbstract_Classification**, which links abstracts to their high-level research disciplines.

## 🎯 Intended Use

The model offers utility in several scientific and information retrieval contexts:

1. **Automated Library and Repository Indexing:** Rapidly and accurately tagging new publications with their correct discipline.
2. **Literature Review Automation:** Filtering large article databases to focus on specific fields.
3.
   **Grant Proposal Routing:** Assisting research institutions in routing incoming proposals to the appropriate review panel or expert based on the summary.
4. **Trend Analysis:** Tracking the volume and convergence of research across different fields.

## ⚠️ Limitations

1. **Interdisciplinary Papers:** The model performs single-label classification, so it may struggle with highly interdisciplinary abstracts that bridge two or more distinct fields (e.g., computational chemistry or bio-engineering).
2. **Vocabulary Drift:** Scientific terminology evolves quickly; new sub-disciplines or extremely novel concepts may be misclassified until the model is retrained.
3. **Class Imbalance:** If the real-world distribution of the eight fields shifts significantly from the training set, performance may vary.

### MODEL 3: **EcommerceAspectSentiment_BART**

This model is a BART-large sequence-to-sequence model fine-tuned for abstractive multi-aspect sentiment summarization based on Dataset 3 (EcommerceCustomerReview_MultiAspectRating).

#### config.json

```json
{
  "_name_or_path": "facebook/bart-large",
  "architectures": ["BartForConditionalGeneration"],
  "model_type": "bart",
  "vocab_size": 50265,
  "d_model": 1024,
  "encoder_layers": 12,
  "decoder_layers": 12,
  "encoder_attention_heads": 16,
  "decoder_attention_heads": 16,
  "encoder_ffn_dim": 4096,
  "decoder_ffn_dim": 4096,
  "dropout": 0.1,
  "activation_function": "gelu",
  "init_std": 0.02,
  "num_labels": 3,
  "max_position_embeddings": 1024,
  "eos_token_id": 2,
  "bos_token_id": 0,
  "pad_token_id": 1,
  "is_encoder_decoder": true,
  "scale_embedding": false,
  "forced_eos_token_id": 2,
  "transformers_version": "4.35.2"
}
```
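Before loading a checkpoint against a config like the one above, a quick structural sanity check can catch truncated or mismatched files. The sketch below parses a fragment of that JSON and asserts a few invariants (symmetric encoder/decoder depth, `d_model` divisible by the head count, matching EOS ids); the field names come from the config itself, and the check logic is a minimal illustration, not the `transformers` loading path.

```python
import json

# Fragment of the config.json shown above; in practice this would be
# read from disk, e.g. json.load(open("config.json")).
CONFIG_JSON = """
{
  "model_type": "bart",
  "is_encoder_decoder": true,
  "d_model": 1024,
  "encoder_layers": 12,
  "decoder_layers": 12,
  "encoder_attention_heads": 16,
  "decoder_attention_heads": 16,
  "eos_token_id": 2,
  "forced_eos_token_id": 2
}
"""

def check_bart_config(raw: str) -> dict:
    """Parse a BART config string and verify basic encoder-decoder symmetry."""
    cfg = json.loads(raw)
    assert cfg["model_type"] == "bart"
    assert cfg["is_encoder_decoder"], "BART is a seq2seq model"
    # bart-large mirrors the encoder and decoder stacks.
    assert cfg["encoder_layers"] == cfg["decoder_layers"]
    assert cfg["encoder_attention_heads"] == cfg["decoder_attention_heads"]
    # The hidden size must divide evenly across attention heads.
    assert cfg["d_model"] % cfg["encoder_attention_heads"] == 0
    # The forced EOS id should match the declared EOS token id.
    assert cfg["forced_eos_token_id"] == cfg["eos_token_id"]
    return cfg

cfg = check_bart_config(CONFIG_JSON)
print(cfg["d_model"] // cfg["encoder_attention_heads"])  # per-head dimension: 64
```

Running the same check on a corrupted or hand-edited config fails fast, before any weights are downloaded.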
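For the RoBERTa classifier described at the top of this card, the step from the head's 8 raw logits to the reported probability distribution is a plain softmax. A minimal sketch, with made-up logit values for illustration (the label order is the one listed in the Output bullet):

```python
import math

LABELS = ["Physics", "Chemistry", "Medicine", "Computer Science",
          "Biology", "Geoscience", "Materials Science", "Engineering"]

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Return (label, probability) pairs sorted by descending confidence."""
    probs = softmax(logits)
    return sorted(zip(LABELS, probs), key=lambda pair: pair[1], reverse=True)

# Hypothetical head output for an astrophysics abstract.
logits = [4.1, 0.3, -1.2, 1.0, -0.5, 2.2, 0.1, -0.8]
top_label, top_prob = predict(logits)[0]
print(top_label)  # "Physics" carries the largest logit, so it wins
```

Single-label prediction then simply takes the argmax, which is exactly why interdisciplinary abstracts (see Limitations) can only ever receive one field.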