# SDG SciBERT Classifier (`sdg-scibert-zo_up`)

This repository contains a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) for classifying scientific text into Sustainable Development Goal (SDG) categories.

- Fine-tuned using the 🤗 `transformers` Trainer API
- Uses the standard `AutoModelForSequenceClassification` class
- Published with full label mappings, inference scripts, and a CLI tool

---
## 🧪 Quick Inference (Python)

You can use the model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="simon-clmtd/sdg-scibert-zo_up",
    tokenizer="simon-clmtd/sdg-scibert-zo_up",
    truncation=True,  # cut inputs longer than max_length
    padding=True,
    max_length=512,
    top_k=None,  # scores for all labels (replaces the deprecated return_all_scores=True)
    device=0,  # GPU 0; use -1 for CPU
)

text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
print(classifier(text))
```
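When all scores are requested, the pipeline returns one `{"label", "score"}` dict per class. Depending on the `transformers` version and how the pipeline is called, that list may be wrapped in an extra outer list. The sketch below handles both shapes and keeps the best-scoring entries; the helper name `top_k_labels` is hypothetical and the scores are illustrative values, not real model output.

```python
from typing import Dict, List, Union

Score = Dict[str, Union[str, float]]


def top_k_labels(result: Union[List[Score], List[List[Score]]], k: int = 3) -> List[Score]:
    """Return the k highest-scoring entries for a single input text."""
    # Unwrap one level of nesting if the pipeline returned [[{...}, ...]].
    if result and isinstance(result[0], list):
        result = result[0]
    return sorted(result, key=lambda d: d["score"], reverse=True)[:k]


# Illustrative scores only (hypothetical), shaped like a nested pipeline result.
example = [[
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    {"label": "7", "score": 0.9124},
]]
print(top_k_labels(example, k=2))
```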

---

## 🖥️ CLI Tool: `sdg-predict`

### 🔧 Installation (local)

Clone the repo and install it as a Python package:

```bash
git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
cd sdg-scibert-zo_up
pip install -e .
```

This installs a command-line tool called `sdg-predict`.

### 📥 Input format

The CLI tool reads a `.jsonl` file (one JSON object per line). Use `--key` to name the field that contains the text to classify.

Example input file (`input.jsonl`):

```json
{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
```
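To check an input file before running the CLI, a reader along the following lines may help. This is a minimal sketch, not the repository's own loader: the function name `read_jsonl` and its error handling are assumptions made for illustration.

```python
import json
from typing import Dict, Iterator


def read_jsonl(path: str, key: str = "text") -> Iterator[Dict]:
    """Yield one record per non-empty line, checking that `key` is present."""
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            if key not in record:
                raise KeyError(f"line {line_no}: missing key {key!r}")
            yield record
```

Calling `list(read_jsonl("input.jsonl", key="text"))` on the example file above would yield two records, each with an `id` and a `text` field.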
### ▶️ Example usage

#### Top-1 prediction:

```bash
sdg-predict input.jsonl --key text --top1 --output preds.jsonl
```

#### Full label distribution:

```bash
sdg-predict input.jsonl --key text --output preds_all.jsonl
```

#### Custom batch size:

```bash
sdg-predict input.jsonl --key text --batch_size 16
```
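How the tool batches inputs internally is not documented here. As a rough illustration of what a `--batch_size` setting usually controls, the sketch below splits a list of texts into fixed-size chunks; the helper name `batched` is hypothetical.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder texts for illustration only.
texts = [f"abstract {i}" for i in range(5)]
for batch in batched(texts, batch_size=2):
    print(len(batch), batch)
```

Larger batches generally improve throughput on a GPU at the cost of more memory, so a smaller value is a reasonable first thing to try if inference runs out of memory.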
### 📤 Output format

Each output line is the original input with an added `prediction` key:

**With `--top1`:**

```json
{
  "id": 1,
  "text": "...",
  "prediction": {
    "label": "7",
    "score": 0.9124
  }
}
```

**Without `--top1`:**

```json
{
  "id": 1,
  "text": "...",
  "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    ...
    {"label": "7", "score": 0.9124}
  ]
}
```
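The two formats appear to be related: the `--top1` prediction looks like the highest-scoring entry of the full distribution. Assuming that is the case (the source does not state it explicitly), a full-distribution record can be reduced afterwards with a helper like the hypothetical `to_top1` below.

```python
import json
from typing import Dict


def to_top1(record: Dict) -> Dict:
    """Return a copy of `record` whose prediction is only the best-scoring label."""
    best = max(record["prediction"], key=lambda p: p["score"])
    return {**record, "prediction": best}


# A shortened full-distribution line, using values from the example above.
line = (
    '{"id": 1, "text": "...", "prediction": ['
    '{"label": "1", "score": 0.0021}, {"label": "7", "score": 0.9124}]}'
)
print(to_top1(json.loads(line)))
```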

---

## 📦 Repository Contents

- `modeling.py`: Optional class wrapper if extending the base model.
- `inference.py`: Reusable batch inference logic for Python scripts.
- `cli_predict.py`: CLI tool using the inference logic.
- `requirements.txt`: Runtime dependencies.
- `setup.py`: Installation and entry point for the CLI.
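
For orientation, the flags used in the usage examples could be declared with `argparse` roughly as follows. This is a sketch and not the contents of `cli_predict.py`: the function name `build_parser`, the help strings, and the default batch size are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented sdg-predict flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="sdg-predict")
    parser.add_argument("input", help="path to a .jsonl file")
    parser.add_argument("--key", required=True, help="JSON key holding the text")
    parser.add_argument("--top1", action="store_true", help="keep only the best label")
    parser.add_argument("--output", default=None, help="output .jsonl path")
    # The real default batch size is not documented; 8 is an assumed placeholder.
    parser.add_argument("--batch_size", type=int, default=8)
    return parser


args = build_parser().parse_args(
    ["input.jsonl", "--key", "text", "--top1", "--output", "preds.jsonl"]
)
print(args)
```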

---

## 📚 Citation

If you use this model, please cite the original [SciBERT paper](https://arxiv.org/abs/1903.10676) and credit this fine-tuning setup where relevant.

---

## 👤 Author

Simon Clematide
Computational Linguistics, UZH
[simon-clematide.net](https://simon-clematide.net) (if applicable)