---
tags:
- bert
- scientific-abstract
- multi-label-classification
- natural-language-processing
datasets:
- custom-scientific-abstracts
license: apache-2.0
---

# SciAbstract-MultiLabel-BERT-Base

## 📝 Overview

**SciAbstract-MultiLabel-BERT-Base** is a specialized multi-label text classification model fine-tuned on scientific paper abstracts. It simultaneously classifies an abstract into its **Primary Topic** and determines the underlying **Sentiment/Impact** of the research findings (e.g., highly positive breakthrough, negative result/concern).

The model is based on the `bert-base-uncased` architecture and is well suited to automating the categorization and high-level assessment of large volumes of academic literature.

## 🧠 Model Architecture

The model uses the `BertForSequenceClassification` head, configured for a multi-label setup (a configuration sketch follows the list below).

* **Base Model:** `bert-base-uncased`
* **Input:** Scientific abstract text.
* **Output:** A 17-dimensional vector of logits, where each dimension corresponds to one of the 17 labels (12 Topics + 5 Sentiments). At inference, a sigmoid is applied to each logit independently, so the model can predict multiple positive labels (e.g., one Topic and one Sentiment) for a single input.
* **Loss Function:** Binary Cross-Entropy with Logits (`BCEWithLogitsLoss`).
* **Labels:**
    * **Topics:** Materials Science, Neuroscience, Computer Science, Ecology, Astrophysics, Medicine, etc.
    * **Sentiments:** Highly Positive, Positive, Moderately Negative, Negative, Highly Negative.
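
A minimal sketch of how this multi-label configuration is typically expressed in 🤗 Transformers (the label names below are illustrative placeholders, not the model's actual config):

```python
from transformers import AutoModelForSequenceClassification

# Illustrative labels only; the released model defines all 17 in its config.
labels = ["Topic: Materials Science", "Topic: Neuroscience", "Sentiment: Highly Positive"]

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),                     # 17 in the released model
    problem_type="multi_label_classification",  # selects BCEWithLogitsLoss internally
    id2label=dict(enumerate(labels)),
    label2id={l: i for i, l in enumerate(labels)},
)

# During fine-tuning, passing float multi-hot targets as `labels` makes the
# model compute BCEWithLogitsLoss over the per-label logits:
# outputs = model(**batch, labels=multi_hot_targets.float())
```

Setting `problem_type="multi_label_classification"` is what switches the head's loss from softmax cross-entropy to per-label binary cross-entropy.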

## 🚀 Intended Use

* **Automated Document Triage:** Rapidly categorize new research papers for subject-matter experts.
* **Literature Review:** Filter and prioritize papers based on topic and the detected impact (sentiment) of the findings, as sketched after this list.
* **Trend Analysis:** Track the volume of positive vs. negative research outcomes within specific scientific fields over time.
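
As a rough sketch of the filtering use case, here is a batch-scoring pass over a hypothetical list `abstracts` (the `classify` helper and the exact label strings are assumptions for illustration; `model` and `tokenizer` are loaded as in the Example Code section below):

```python
import torch

def classify(texts, model, tokenizer, threshold=0.5):
    """Return the set of predicted labels for each text in `texts`."""
    inputs = tokenizer(texts, return_tensors="pt", truncation=True,
                       max_length=512, padding=True)
    with torch.no_grad():
        probs = torch.sigmoid(model(**inputs).logits)
    return [{model.config.id2label[i] for i, p in enumerate(row) if p > threshold}
            for row in probs]

abstracts = ["...", "..."]  # your corpus of abstracts
# Keep papers flagged as promising neuroscience results (label names hypothetical).
keep = [text for text, labels in zip(abstracts, classify(abstracts, model, tokenizer))
        if "Topic: Neuroscience" in labels and "Sentiment: Highly Positive" in labels]
```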

## ⚠️ Limitations

* **Multi-Label Complexity:** The model may struggle with abstracts that span highly ambiguous or highly interdisciplinary topics not well represented in the training data.
* **Sentiment Scope:** The sentiment classification is specifically tailored to the tone of scientific findings (e.g., success of an experiment, critical failure, potential concern) and may not generalize well to general public sentiment.
* **Maximum Length:** Input text is truncated to 512 tokens (the BERT standard), so extremely long abstracts may lose critical information; the snippet after this list shows how to check for truncation.
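
To check whether an abstract exceeds that limit before classification, a small sketch (reusing `tokenizer`, `model`, and `abstract` from the Example Code section below):

```python
# Count tokens without truncation to see how much would be cut off.
n_tokens = len(tokenizer(abstract, truncation=False)["input_ids"])
if n_tokens > model.config.max_position_embeddings:  # 512 for bert-base-uncased
    print(f"{n_tokens} tokens: everything past position 512 will be discarded.")
```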

## 💻 Example Code

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Your-HF-Username/SciAbstract-MultiLabel-BERT-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Sample abstract
abstract = "Development of a quantum entanglement system achieving coherence for over 10 seconds at room temperature, a significant breakthrough for quantum computing."

# Tokenize input (truncated to BERT's 512-token limit)
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=512, padding=True)

# Make prediction
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid to get an independent probability for each label
probabilities = torch.sigmoid(logits).squeeze()

# Keep every label whose probability clears the 0.5 threshold
id2label = model.config.id2label
predicted_labels = [id2label[i] for i, prob in enumerate(probabilities) if prob > 0.5]

print(f"Abstract: {abstract}")
print("-" * 30)
print(f"Predicted Labels: {predicted_labels}")
# Expected Output Example: ['Topic: Physics', 'Sentiment: Highly Positive']
```