---
tags:
- bert
- scientific-abstract
- multi-label-classification
- natural-language-processing
datasets:
- custom-scientific-abstracts
license: apache-2.0
---

# SciAbstract-MultiLabel-BERT-Base

## Overview

**SciAbstract-MultiLabel-BERT-Base** is a specialized multi-label text classification model fine-tuned on scientific paper abstracts. It simultaneously classifies an abstract into its **Primary Topic** and determines the underlying **Sentiment/Impact** of the research findings (e.g., highly positive breakthrough, negative result/concern).

The model is built on the `bert-base-uncased` architecture and is intended for automating the categorization and high-level assessment of large volumes of academic literature.

## Model Architecture

The model uses the `BertForSequenceClassification` head, configured for a multi-label setup.

* **Base Model:** `bert-base-uncased`
* **Input:** Scientific abstract text.
* **Output:** A 17-dimensional vector of logits, where each dimension corresponds to one of the 17 potential labels (12 Topics + 5 Sentiments). A sigmoid activation is applied to the final layer so that each label is scored independently, allowing the model to predict multiple positive labels (e.g., one Topic and one Sentiment) for a single input.
* **Loss Function:** Binary cross-entropy with logits (`BCEWithLogitsLoss`); see the configuration sketch after this list.
* **Labels:**
    * **Topics (12):** Materials Science, Neuroscience, Computer Science, Ecology, Astrophysics, Medicine, etc.
    * **Sentiments (5):** Highly Positive, Positive, Moderately Negative, Negative, Highly Negative.
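
For reference, a minimal sketch of how this multi-label setup maps onto the `transformers` API is shown below; the exact training configuration of this model is an assumption:

```python
from transformers import BertConfig, BertForSequenceClassification

# Assumed configuration sketch, not the model's actual training script.
# With problem_type="multi_label_classification", the classification head
# automatically uses BCEWithLogitsLoss when labels are passed during training.
config = BertConfig.from_pretrained(
    "bert-base-uncased",
    num_labels=17,  # 12 topics + 5 sentiments
    problem_type="multi_label_classification",
)
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", config=config
)
```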

## Intended Use

* **Automated Document Triage:** Rapidly categorize new research papers for subject-matter experts (a batch-scoring sketch follows this list).
* **Literature Review:** Filter and prioritize papers based on topic and the detected impact (sentiment) of the findings.
* **Trend Analysis:** Track the volume of positive vs. negative research outcomes within specific scientific fields over time.
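
As an illustration of the triage use case, the sketch below scores a batch of abstracts with the `pipeline` API. The sample abstracts and the 0.5 cutoff are purely illustrative:

```python
from transformers import pipeline

# top_k=None returns a score for every label, and function_to_apply="sigmoid"
# matches the model's independent per-label (multi-label) head.
classifier = pipeline(
    "text-classification",
    model="Your-HF-Username/SciAbstract-MultiLabel-BERT-Base",
    top_k=None,
    function_to_apply="sigmoid",
)

# Illustrative abstracts, not taken from the training data
abstracts = [
    "A novel perovskite coating substantially improves solar cell stability.",
    "The trial failed to demonstrate any benefit over the standard treatment.",
]

for abstract, scores in zip(abstracts, classifier(abstracts)):
    kept = [(s["label"], round(s["score"], 2)) for s in scores if s["score"] > 0.5]
    print(abstract[:40], "->", kept)
```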

## Limitations

* **Multi-Label Complexity:** The model may struggle with abstracts spanning ambiguous or highly interdisciplinary topics that are not well represented in the training data.
* **Sentiment Scope:** The sentiment classification is tailored to the tone of scientific findings (e.g., success of an experiment, critical failure, potential concern) and may not generalize to general public sentiment.
* **Maximum Length:** Input text is truncated to 512 tokens (the BERT standard), so extremely long abstracts may lose critical information; see the check below.
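
If you need to verify whether an abstract fits within the limit, a minimal check with the base tokenizer looks like this:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# BERT's positional embeddings cap inputs at 512 tokens; with
# truncation=True, everything past that point is silently dropped.
long_abstract = "lorem ipsum " * 600
encoded = tokenizer(long_abstract, truncation=True, max_length=512)
print(len(encoded["input_ids"]))  # 512 -> the tail of the text was cut off
```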

## Example Code

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Your-HF-Username/SciAbstract-MultiLabel-BERT-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample abstract
abstract = (
    "Development of a quantum entanglement system achieving coherence for "
    "over 10 seconds at room temperature, a significant breakthrough for "
    "quantum computing."
)

# Tokenize input
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, padding=True)

# Make prediction
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid to get an independent probability for each label
probabilities = torch.sigmoid(logits).squeeze()

# Map label IDs to names and keep every label above the 0.5 threshold
id2label = model.config.id2label
predicted_labels = [
    id2label[i] for i, prob in enumerate(probabilities) if prob > 0.5
]

print(f"Abstract: {abstract}")
print("-" * 30)
print(f"Predicted Labels: {predicted_labels}")
# Expected Output Example: ['Topic: Physics', 'Sentiment: Highly Positive']
```
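
The fixed 0.5 cutoff in the example is a common default for multi-label classification; depending on the precision/recall trade-off you need, per-label thresholds tuned on a held-out validation set may work better.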