# SDG SciBERT Classifier (`sdg-scibert-zo_up`)

This repository contains a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) for classifying scientific text into Sustainable Development Goal (SDG) categories.

- Fine-tuned using the 🤗 `transformers` Trainer API
- Uses standard `AutoModelForSequenceClassification`
- Published with full label mappings, inference scripts, and a CLI tool

---

## 🧪 Quick Inference (Python)

You can use the model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="simon-clmtd/sdg-scibert-zo_up",
    tokenizer="simon-clmtd/sdg-scibert-zo_up",
    truncation=True,
    padding=True,
    max_length=512,
    return_all_scores=True,  # deprecated in recent transformers releases; top_k=None is the replacement
    device=0,  # or -1 for CPU
)

text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
print(classifier(text))
```

---

## 🖥️ CLI Tool: `sdg-predict`

### 🔧 Installation (local)

Clone the repo and install it as a Python package:

```bash
git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
cd sdg-scibert-zo_up
pip install -e .
```

This will install a command-line tool called `sdg-predict`.

### 📥 Input format

The CLI tool accepts a `.jsonl` file (one JSON object per line).
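As a minimal sketch, a file in this format could be produced from Python with the standard `json` module. The helper names `write_jsonl` and `read_jsonl` and the field names `id` and `text` are illustrative assumptions, not part of this repository's API.

```python
import json


def write_jsonl(records, path):
    """Write an iterable of dicts to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative records; the field names `id` and `text` are assumptions.
records = [
    {"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"},
    {"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"},
]

if __name__ == "__main__":
    write_jsonl(records, "input.jsonl")
    print(read_jsonl("input.jsonl"))
```

Writing one `json.dumps` result per line keeps each record on a single line, since `json.dumps` escapes any newlines inside string values.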
You must specify the key containing the text to classify with `--key`.

Example input file (`input.jsonl`):

```json
{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
```

### ▶️ Example usage

#### Top-1 prediction:

```bash
sdg-predict input.jsonl --key text --top1 --output preds.jsonl
```

#### Full label distribution:

```bash
sdg-predict input.jsonl --key text --output preds_all.jsonl
```

#### Custom batch size:

```bash
sdg-predict input.jsonl --key text --batch_size 16
```

### 📤 Output format

Each output line is the original input with an added `prediction` key:

**With `--top1`:**

```json
{
  "id": 1,
  "text": "...",
  "prediction": {
    "label": "7",
    "score": 0.9124
  }
}
```

**Without `--top1`:**

```json
{
  "id": 1,
  "text": "...",
  "prediction": [
    {"label": "1", "score": 0.0021},
    {"label": "2", "score": 0.0005},
    ...
    {"label": "7", "score": 0.9124}
  ]
}
```

---

## 📦 Repository Contents

- `modeling.py`: Optional class wrapper for extending the base model.
- `inference.py`: Reusable batch inference logic for Python scripts.
- `cli_predict.py`: CLI tool built on the inference logic.
- `requirements.txt`: Runtime dependencies.
- `setup.py`: Installation and entry point for the CLI.

---

## Citation

If you use this model, please cite the original [SciBERT paper](https://arxiv.org/abs/1903.10676) and attribute this fine-tuning setup where relevant.

---

## 👤 Author

Simon Clematide  
Computational Linguistics, UZH  
[simon-clematide.net](https://simon-clematide.net)