# SDG SciBERT Classifier (`sdg-scibert-zo_up`)

This repository contains a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) for classifying scientific text into Sustainable Development Goal (SDG) categories.

- Fine-tuned with the 🤗 `transformers` Trainer API
- Uses the standard `AutoModelForSequenceClassification`
- Ships with full label mappings, inference scripts, and a CLI tool

---

## 🧪 Quick Inference (Python)

You can use the model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="simon-clmtd/sdg-scibert-zo_up",
    tokenizer="simon-clmtd/sdg-scibert-zo_up",
    truncation=True,
    padding=True,
    max_length=512,
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=0  # or -1 for CPU
)

text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
print(classifier(text))
```

---

## 🖥️ CLI Tool: `sdg-predict`

### 🔧 Installation (local)

Clone the repo and install as a Python package:

```bash
git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
cd sdg-scibert-zo_up
pip install -e .
```

This will install a command-line tool called `sdg-predict`.

### 📥 Input format

The CLI tool accepts a `.jsonl` file (one JSON object per line). You must specify the key containing the text to classify:

Example input file (`input.jsonl`):
```json
{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
```
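If your data starts out as a plain list of strings, a file in this format can be produced with a few lines of Python. A minimal sketch; the filename and sequential `id` scheme are just examples:

```python
import json

# Example texts to classify (illustrative only)
texts = [
    "Ensure access to affordable, reliable, sustainable and modern energy for all",
    "Atmospheric warming is profoundly affecting high-mountain regions",
]

# Write one JSON object per line, matching the expected input format
with open("input.jsonl", "w", encoding="utf-8") as f:
    for i, text in enumerate(texts, start=1):
        f.write(json.dumps({"id": i, "text": text}) + "\n")
```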

### ▶️ Example usage

#### Top-1 prediction:
```bash
sdg-predict input.jsonl --key text --top1 --output preds.jsonl
```

#### Full label distribution:
```bash
sdg-predict input.jsonl --key text --output preds_all.jsonl
```

#### Custom batch size:
```bash
sdg-predict input.jsonl --key text --batch_size 16
```

### 📤 Output format

Each output line is the original input with an added `prediction` key:

**With `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": {
    "label": "7", 
    "score": 0.9124
  }
}
```

**Without `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    ...
    {"label": "7", "score": 0.9124}
  ]
}
```
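Downstream code can recover the top label from the full-distribution output by sorting on `score`. A minimal sketch, using an inline record in place of a real `preds_all.jsonl` line (the scores here are illustrative):

```python
import json

# One output line in the full-distribution format (scores are illustrative)
line = '''{"id": 1, "text": "...", "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    {"label": "7", "score": 0.9124}]}'''

record = json.loads(line)

# Pick the highest-scoring label from the distribution
top = max(record["prediction"], key=lambda p: p["score"])
print(top["label"], top["score"])  # → 7 0.9124
```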

---

## 📦 Repository Contents

- `modeling.py`: Optional class wrapper if extending the base model.
- `inference.py`: Reusable batch inference logic for Python scripts.
- `cli_predict.py`: CLI tool using the inference logic.
- `requirements.txt`: Runtime dependencies.
- `setup.py`: Installation and entry point for the CLI.

---

## 🔍 Citation

Please cite the original [SciBERT paper](https://arxiv.org/abs/1903.10676) if you use this model, and credit this fine-tuning setup where relevant.

---

## 👤 Author

Simon Clematide  
Computational Linguistics, UZH  
[simon-clematide.net](https://simon-clematide.net)