---
tags:
- text-classification
- scientific-abstract
- multi-label
- sentiment-analysis
- distilbert
datasets:
- SciTopicSentimentDataset
license: apache-2.0
---

# SciTopicSentimentClassifier

## Overview

SciTopicSentimentClassifier is a **multi-label classification** model fine-tuned to predict both the **primary scientific topic** and the **underlying sentiment** (high-positive or low-negative) of a research paper from its abstract text. It is intended for automated paper categorization, literature review triage, and scientific trend analysis.

The model was trained on the SciTopicSentimentDataset (a proprietary dataset similar to the generated Dataset 1), which pairs each abstract with predefined scientific topics and a binarized sentiment label derived from an original continuous sentiment score.
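
The binarization of a continuous sentiment score can be sketched roughly as follows. The card does not state the score range or the cut-off, so the `[0, 1]` range, the `0.5` threshold, the function name `binarize_sentiment`, and the exact label strings are assumptions made purely for illustration.

```python
def binarize_sentiment(score: float, threshold: float = 0.5) -> str:
    """Map a continuous sentiment score to one of two coarse classes.

    Assumption: scores lie in [0, 1] and 0.5 is the cut-off; the actual
    dataset may use a different range or threshold.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= threshold:
        return "High-Positive-Sentiment"
    return "Low-Negative-Sentiment"


# Hypothetical scores, not taken from the real dataset
scores = [0.91, 0.12, 0.50]
labels = [binarize_sentiment(s) for s in scores]
print(labels)
```

A score exactly at the threshold falls into the high-positive class here; that tie-breaking rule is likewise an arbitrary choice for the sketch.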

## Model Architecture

This model is an adaptation of **DistilBERT**, a smaller, faster, and lighter version of BERT.

* **Base Model:** `distilbert-base-uncased`
* **Modification:** A custom classification head is added on top of the DistilBERT pooled output.
* **Output Layer:** The final layer is a dense layer with **12 outputs** (10 for scientific topics + 2 for sentiment classes). A sigmoid activation is applied to each output independently, which allows multi-label prediction (an abstract can belong to multiple topics/sentiments).
* **Input:** Tokenized abstract text (up to 512 tokens).
* **Task:** Multi-label text classification.
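
As a rough illustration of the output layer described above, the sketch below applies an element-wise sigmoid to 12 logits and splits the resulting scores into the 10 topic slots and the 2 sentiment slots. It uses plain Python rather than PyTorch to stay self-contained. The slot ordering (topics first, sentiment last), the `0.5` threshold, and the helper names are assumptions, not details confirmed by this card.

```python
import math

NUM_TOPICS = 10      # from the model description
NUM_SENTIMENTS = 2   # from the model description


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_outputs(logits, threshold=0.5):
    """Return the indices of active topic and sentiment slots.

    Assumes topics occupy positions 0-9 and sentiment positions 10-11.
    """
    if len(logits) != NUM_TOPICS + NUM_SENTIMENTS:
        raise ValueError("expected 12 logits")
    probs = [sigmoid(z) for z in logits]
    topics = [i for i, p in enumerate(probs[:NUM_TOPICS]) if p > threshold]
    sentiments = [i for i, p in enumerate(probs[NUM_TOPICS:]) if p > threshold]
    return topics, sentiments


# Made-up logits for illustration only
example_logits = [-3.0, 2.2, -1.5, 0.4, -2.0, -4.1, -0.7, -3.3, -2.8, -1.9, 1.7, -2.5]
topics, sentiments = decode_outputs(example_logits)
print(topics, sentiments)  # [1, 3] [0]
```

Because each sigmoid is evaluated independently, nothing forces exactly one topic or exactly one sentiment class to be active; a softmax head would behave differently.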

## Intended Use

* **Automated Labeling:** Automatically assign relevant topic tags to new scientific publication abstracts.
* **Research Triage:** Quickly filter papers based on subject matter and the perceived 'success' or 'novelty' indicated by the abstract's sentiment.
* **Scientific Landscape Mapping:** Analyze large corpora of papers to track emerging positive/negative trends in specific research areas.
* **Indexing Systems:** Integration into library or repository indexing services.

## Limitations

* **Topic Granularity:** The model is limited to the 10 predefined topics in its training set. It may perform poorly on highly niche or interdisciplinary topics outside this scope.
* **Sentiment Scope:** The sentiment is coarse-grained (high vs. low) based on a metric derived from the abstract's language (e.g., using words like "novel," "significant," "limitations," "challenges"). It does not capture nuanced human-level emotional sentiment.
* **Language:** Trained exclusively on English abstracts.
* **Max Length:** Input texts longer than 512 tokens are truncated.
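
To make the truncation limit concrete, the sketch below mimics what happens to an over-long input. It works on a plain list of token IDs; real tokenizers also reserve positions for special tokens such as `[CLS]` and `[SEP]`, which this simplified, hypothetical helper `truncate_ids` ignores.

```python
MAX_TOKENS = 512  # limit stated for this model


def truncate_ids(token_ids, max_tokens=MAX_TOKENS):
    """Keep the first `max_tokens` IDs and report how many were dropped."""
    kept = token_ids[:max_tokens]
    dropped = len(token_ids) - len(kept)
    return kept, dropped


# Stand-in for a tokenized abstract of 600 tokens (not real token IDs)
fake_ids = list(range(600))
kept, dropped = truncate_ids(fake_ids)
print(len(kept), dropped)  # 512 88
```

A non-zero `dropped` count signals that the tail of the abstract never reaches the model, which may matter when conclusions or limitations are stated at the end.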

## Example Code

To use the model for prediction:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
model_name = "your-username/SciTopicSentimentClassifier"  # Replace with actual HuggingFace path
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Sample abstract
abstract = "We propose a novel architecture combining convolutional and recurrent neural networks for multi-modal data fusion, demonstrating significant performance gains in complex classification tasks, overcoming prior limitations."

# Preprocess the input (over-long inputs are truncated to the model's maximum length)
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Apply sigmoid to get an independent score per label
probs = torch.sigmoid(logits)

# Get predicted labels (e.g., probability > 0.5)
labels = model.config.id2label
predictions = []
for i, prob in enumerate(probs[0]):
    if prob > 0.5:
        predictions.append(labels[i])

print(f"Abstract: {abstract[:80]}...")
print(f"Predicted Labels: {predictions}")
# Expected Output: ['Deep Learning/AI', 'High-Positive-Sentiment']