# SDG SciBERT Classifier (`sdg-scibert-zo_up`)

This repository contains a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) for classifying scientific text into Sustainable Development Goal (SDG) categories.

- Fine-tuned using the 🤗 `transformers` Trainer API
- Uses the standard `AutoModelForSequenceClassification` class
- Published with full label mappings, inference scripts, and a CLI tool

---

## 🧪 Quick Inference (Python)

You can use the model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="simon-clmtd/sdg-scibert-zo_up",
    tokenizer="simon-clmtd/sdg-scibert-zo_up",
    truncation=True,
    padding=True,
    max_length=512,
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=0,  # first GPU; use -1 for CPU
)

text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
print(classifier(text))
```
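
When all scores are requested, the pipeline returns one `{"label", "score"}` entry per class. Depending on the `transformers` version and on whether the input is a single string or a list, the result may be a flat list or nested one level deeper. A minimal sketch of picking the highest-scoring label follows; `top_prediction` is a hypothetical helper, not part of this repository, and the scores are illustrative rather than real model output:

```python
def top_prediction(result):
    """Return the highest-scoring {"label", "score"} dict from a pipeline result."""
    # Unwrap one level if the pipeline returned a list of lists.
    if result and isinstance(result[0], list):
        result = result[0]
    return max(result, key=lambda entry: entry["score"])


# Illustrative scores only, not real model output.
example = [[{"label": "1", "score": 0.02}, {"label": "7", "score": 0.91}]]
print(top_prediction(example)["label"])
```

With the real pipeline, this would be called as `top_prediction(classifier(text))`.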

---

## 🖥️ CLI Tool: `sdg-predict`

### 🔧 Installation (local)

Clone the repo and install as a Python package:

```bash
git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
cd sdg-scibert-zo_up
pip install -e .
```

This installs a command-line tool called `sdg-predict`.

### 📥 Input format

The CLI tool accepts a `.jsonl` file (one JSON object per line). Use `--key` to name the field that holds the text to classify.

Example input file (`input.jsonl`):
```json
{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
```
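
If your texts live in Python objects rather than in a file, a file in this format can be produced with the standard library. The sketch below assumes the same `id` and `text` fields as the example above; `write_jsonl` is a hypothetical helper name, not part of this repository:

```python
import json

records = [
    {"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"},
    {"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, as expected for .jsonl input."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII characters readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl("input.jsonl", records)
```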

### ▶️ Example usage

#### Top-1 prediction:
```bash
sdg-predict input.jsonl --key text --top1 --output preds.jsonl
```

#### Full label distribution:
```bash
sdg-predict input.jsonl --key text --output preds_all.jsonl
```

#### Custom batch size:
```bash
sdg-predict input.jsonl --key text --batch_size 16
```
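
The `--batch_size` flag presumably controls how many texts are sent to the model at once. How `inference.py` implements this is not shown in this README; the following is only a generic sketch of splitting inputs into batches, with `batched` as a hypothetical helper:

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = [f"text {i}" for i in range(5)]
print([len(batch) for batch in batched(texts, 2)])
```

Smaller batches reduce memory use at the cost of more forward passes, which is the usual reason to lower this value on a small GPU or on CPU.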

### 📤 Output format

Each output line is the original input object with an added `prediction` key (the examples below are pretty-printed for readability):

**With `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": {
    "label": "7",
    "score": 0.9124
  }
}
```

**Without `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    ...
    {"label": "7", "score": 0.9124}
  ]
}
```
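
Since `prediction` is a single object with `--top1` and a list without it, downstream code may need to handle both shapes. A minimal sketch, assuming the output layout shown above; `top_label` and `read_top_labels` are hypothetical helpers, not part of this repository:

```python
import json


def top_label(prediction):
    """Return the top label from either output shape (dict with --top1, list without)."""
    if isinstance(prediction, dict):
        return prediction["label"]
    return max(prediction, key=lambda entry: entry["score"])["label"]


def read_top_labels(path):
    """Yield (id, top label) pairs from a predictions JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                yield record.get("id"), top_label(record["prediction"])
```

For example, `for rec_id, label in read_top_labels("preds.jsonl"): print(rec_id, label)` lists the top label for each input record.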

---

## 📦 Repository Contents

- `modeling.py`: Optional class wrapper if extending the base model.
- `inference.py`: Reusable batch inference logic for Python scripts.
- `cli_predict.py`: CLI tool using the inference logic.
- `requirements.txt`: Runtime dependencies.
- `setup.py`: Installation and entry point for the CLI.

---

## 🔍 Citation

If you use this model, please cite the original [SciBERT paper](https://arxiv.org/abs/1903.10676) and, where relevant, credit this fine-tuned model.

---

## 👤 Author

Simon Clematide  
Computational Linguistics, UZH  
[simon-clematide.net](https://simon-clematide.net)