# SDG SciBERT Classifier (`sdg-scibert-zo_up`)
This repository contains a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) for classifying scientific text into Sustainable Development Goal (SDG) categories.
- Fine-tuned using the 🤗 `transformers` Trainer API
- Uses standard `AutoModelForSequenceClassification`
- Published with full label mappings, inference scripts, and a CLI tool
---
## 🧪 Quick Inference (Python)
You can use the model directly with the Hugging Face `pipeline`:
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="simon-clmtd/sdg-scibert-zo_up",
    tokenizer="simon-clmtd/sdg-scibert-zo_up",
    truncation=True,
    padding=True,
    max_length=512,
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=0,  # GPU 0; use -1 for CPU
)

text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
print(classifier(text))
```
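
When all scores are requested, the pipeline returns a list of `{"label", "score"}` dictionaries for each input. The following is a minimal sketch for picking the highest-scoring labels from such a list. The helper name `top_labels` and the sample scores are hypothetical and only illustrate the shape of the data:

```python
def top_labels(scores, k=1):
    """Return the k highest-scoring entries from one input's label scores."""
    # Some transformers versions wrap a single input's scores in an extra list.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    return sorted(scores, key=lambda s: s["score"], reverse=True)[:k]


# Hypothetical scores, shaped like the pipeline output for one text.
example = [
    {"label": "1", "score": 0.0021},
    {"label": "7", "score": 0.9124},
    {"label": "13", "score": 0.0310},
]
print(top_labels(example, k=2))
```

Sorting explicitly avoids relying on the order in which the pipeline happens to return the labels.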
---
## 🖥️ CLI Tool: `sdg-predict`
### 🔧 Installation (local)
Clone the repo and install as a Python package:
```bash
git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
cd sdg-scibert-zo_up
pip install -e .
```
This will install a command-line tool called `sdg-predict`.
### 📥 Input format
The CLI tool accepts a `.jsonl` file (one JSON object per line). You must specify the key containing the text to classify with `--key`.
Example input file (`input.jsonl`):
```json
{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
```
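
If your texts live in Python objects rather than in a file, a small sketch like the following could produce a compatible input file. The helper name `write_jsonl` is hypothetical and not part of this repository:

```python
import json

records = [
    {"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"},
    {"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line, as expected for a .jsonl file."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            # ensure_ascii=False keeps non-ASCII characters readable in the file.
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


write_jsonl("input.jsonl", records)
```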
### ▶️ Example usage
#### Top-1 prediction:
```bash
sdg-predict input.jsonl --key text --top1 --output preds.jsonl
```
#### Full label distribution:
```bash
sdg-predict input.jsonl --key text --output preds_all.jsonl
```
#### Custom batch size:
```bash
sdg-predict input.jsonl --key text --batch_size 16
```
### 📤 Output format
Each output line is the original input with an added `prediction` key:
**With `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": {
    "label": "7",
    "score": 0.9124
  }
}
```
**Without `--top1`:**
```json
{
  "id": 1,
  "text": "...",
  "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    ...
    {"label": "7", "score": 0.9124}
  ]
}
```
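
For downstream processing, a rough sketch of reading either output shape back into Python might look like this. The helper name `predicted_label` and the sample lines are hypothetical; they assume only the two `prediction` shapes shown above:

```python
import json


def predicted_label(record):
    """Return the best label from a record's `prediction` field."""
    pred = record["prediction"]
    if isinstance(pred, dict):
        # Shape produced with --top1: a single {"label", "score"} object.
        return pred["label"]
    # Shape produced without --top1: a list of scores, so take the maximum.
    return max(pred, key=lambda p: p["score"])["label"]


# Hypothetical output lines, one JSON object per line as in preds.jsonl.
lines = [
    '{"id": 1, "text": "...", "prediction": {"label": "7", "score": 0.9124}}',
    '{"id": 2, "text": "...", "prediction": [{"label": "1", "score": 0.0021}, {"label": "13", "score": 0.95}]}',
]
labels = {rec["id"]: predicted_label(rec) for rec in map(json.loads, lines)}
print(labels)
```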
---
## 📦 Repository Contents
- `modeling.py`: Optional class wrapper if extending the base model.
- `inference.py`: Reusable batch inference logic for Python scripts.
- `cli_predict.py`: CLI tool using the inference logic.
- `requirements.txt`: Runtime dependencies.
- `setup.py`: Installation and entry point for the CLI.
---
## πŸ” Citation
Please cite the original [SciBERT paper](https://arxiv.org/abs/1903.10676) if using this model, and attribute this fine-tuning setup if relevant.
---
## 👤 Author
Simon Clematide
Computational Linguistics, UZH
[simon-clematide.net](https://simon-clematide.net) (if applicable)