# SDG SciBERT Classifier (`sdg-scibert-zo_up`)
This repository contains a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) for classifying scientific text into Sustainable Development Goal (SDG) categories.
- Fine-tuned using the 🤗 `transformers` Trainer API
- Uses standard `AutoModelForSequenceClassification`
- Published with full label mappings, inference scripts, and a CLI tool
---
## 🧪 Quick Inference (Python)
You can use the model directly with the Hugging Face `pipeline`:
```python
from transformers import pipeline
classifier = pipeline(
    "text-classification",
    model="simon-clmtd/sdg-scibert-zo_up",
    tokenizer="simon-clmtd/sdg-scibert-zo_up",
    truncation=True,
    padding=True,
    max_length=512,
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=0,    # or -1 for CPU
)
text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
print(classifier(text))
```
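When all label scores are requested, the pipeline yields a list of `{label, score}` dicts per input. A minimal sketch of picking the top-1 label from such output (the sample scores below are hypothetical, not real model output):

```python
# Hypothetical pipeline output for one input: one dict per SDG label
scores = [
    {"label": "1", "score": 0.0021},
    {"label": "7", "score": 0.9124},
    {"label": "13", "score": 0.0310},
]

# The top-1 prediction is simply the entry with the highest score
top = max(scores, key=lambda d: d["score"])
print(top["label"], top["score"])  # → 7 0.9124
```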
---
## 🖥️ CLI Tool: `sdg-predict`
### 🔧 Installation (local)
Clone the repo and install as a Python package:
```bash
git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
cd sdg-scibert-zo_up
pip install -e .
```
This will install a command-line tool called `sdg-predict`.
### 📥 Input format
The CLI tool accepts a `.jsonl` file (one JSON object per line). You must specify the key containing the text to classify:
Example input file (`input.jsonl`):
```json
{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
```
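If you build the input file programmatically, the standard-library `json` module produces valid JSONL, one object per line. A minimal sketch using the example records above:

```python
import json

# Records to classify; "text" is the key later passed via --key
records = [
    {"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"},
    {"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"},
]

# Write one compact JSON object per line (JSONL)
with open("input.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```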
### ▶️ Example usage
#### Top-1 prediction:
```bash
sdg-predict input.jsonl --key text --top1 --output preds.jsonl
```
#### Full label distribution:
```bash
sdg-predict input.jsonl --key text --output preds_all.jsonl
```
#### Custom batch size:
```bash
sdg-predict input.jsonl --key text --batch_size 16
```
### 📤 Output format
Each output line is the original input with an added `prediction` key:
**With `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": {
    "label": "7",
    "score": 0.9124
  }
}
```
**Without `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    ...
    {"label": "7", "score": 0.9124}
  ]
}
```
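Downstream scripts can consume the output with plain `json` line parsing. A minimal sketch for the `--top1` format (the sample lines below are hypothetical output, not real predictions):

```python
import json

# Hypothetical --top1 output lines from preds.jsonl
lines = [
    '{"id": 1, "text": "...", "prediction": {"label": "7", "score": 0.9124}}',
    '{"id": 2, "text": "...", "prediction": {"label": "13", "score": 0.8810}}',
]

# Collect id -> predicted SDG label
labels = {rec["id"]: rec["prediction"]["label"]
          for rec in map(json.loads, lines)}
print(labels)  # → {1: '7', 2: '13'}
```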
---
## 📦 Repository Contents
- `modeling.py`: Optional class wrapper if extending the base model.
- `inference.py`: Reusable batch inference logic for Python scripts.
- `cli_predict.py`: CLI tool using the inference logic.
- `requirements.txt`: Runtime dependencies.
- `setup.py`: Installation and entry point for the CLI.
---
## 🔍 Citation
Please cite the original [SciBERT paper](https://arxiv.org/abs/1903.10676) if using this model, and attribute this fine-tuning setup if relevant.
---
## 👤 Author
Simon Clematide
Computational Linguistics, UZH
[simon-clematide.net](https://simon-clematide.net)