
	# SDG SciBERT Classifier (`sdg-scibert-zo_up`)

	This repository contains a fine-tuned version of [allenai/scibert_scivocab_cased](https://huggingface.co/allenai/scibert_scivocab_cased) for classifying scientific text into Sustainable Development Goal (SDG) categories.

	- Fine-tuned with the 🤗 `transformers` Trainer API
	- Loads with the standard `AutoModelForSequenceClassification` class
	- Published with full label mappings, inference scripts, and a CLI tool
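
Since the model uses the standard `AutoModelForSequenceClassification` class, it should also be usable without the `pipeline` helper. The following is a minimal sketch, not the repository's own inference code. It assumes PyTorch is installed, that the published label mapping is exposed as `model.config.id2label`, and that the model is single-label so a softmax over the logits is appropriate. The function names `scores_from_logits` and `classify` are illustrative.

```python
import math


def scores_from_logits(logits, id2label):
    """Turn one row of raw logits into a {label: probability} dict via softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}


def classify(text, model_id="simon-clmtd/sdg-scibert-zo_up"):
    """Load the checkpoint and score a single text (downloads the model on first use)."""
    # Imported lazily so the helper above works without torch/transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    # Assumption: the label mapping is stored in the model config as id2label.
    return scores_from_logits(logits, model.config.id2label)
```

Calling `classify("some abstract")` would then return one probability per label, under the assumptions stated above.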

	---

	## 🧪 Quick Inference (Python)

	You can use the model directly with the Hugging Face `pipeline`:

	```python
	from transformers import pipeline

	classifier = pipeline(
	    "text-classification",
	    model="simon-clmtd/sdg-scibert-zo_up",
	    tokenizer="simon-clmtd/sdg-scibert-zo_up",
	    truncation=True,
	    padding=True,
	    max_length=512,
	    top_k=None,  # scores for all labels (replaces the deprecated return_all_scores=True)
	    device=0,  # first GPU; use -1 for CPU
	)

	text = "Ensure access to affordable, reliable, sustainable and modern energy for all"
	print(classifier(text))
	```
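
When the classifier returns one `{label, score}` entry per class, you often only need the few best ones. The helper below is a small sketch for that; `top_k_labels` is an illustrative name, and the scores in `example` are made up for demonstration rather than real model output.

```python
def top_k_labels(result, k=3):
    """Return the k highest-scoring {label, score} dicts for a single input text."""
    # Depending on the transformers version and arguments, the scores for one text
    # may arrive as a flat list of dicts or wrapped in an outer list; unwrap if needed.
    if result and isinstance(result[0], list):
        result = result[0]
    return sorted(result, key=lambda item: item["score"], reverse=True)[:k]


# Illustrative scores only, not real model output.
example = [[
    {"label": "1", "score": 0.01},
    {"label": "7", "score": 0.90},
    {"label": "13", "score": 0.09},
]]
print(top_k_labels(example, k=2))
```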

	---

	## 🖥️ CLI Tool: `sdg-predict`

	### 🔧 Installation (local)

	Clone the repo and install as a Python package:

	```bash
	git clone https://huggingface.co/simon-clmtd/sdg-scibert-zo_up
	cd sdg-scibert-zo_up
	pip install -e .
	```

	This will install a command-line tool called `sdg-predict`.

	### 📥 Input format

	The CLI tool accepts a `.jsonl` file (one JSON object per line). Use `--key` to name the field that contains the text to classify.

	Example input file (`input.jsonl`):
	```json
	{"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"}
	{"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"}
	```
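
If your texts live in Python objects rather than in a file, an input file in this format can be produced with the standard library. This is a minimal sketch; `write_jsonl` is an illustrative helper, not part of the repository.

```python
import json


def write_jsonl(records, path, key="text"):
    """Write one JSON object per line, checking that each record has the text key."""
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            if key not in record:
                raise KeyError(f"record is missing the {key!r} field: {record}")
            # ensure_ascii=False keeps non-ASCII characters readable in the file.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {"id": 1, "text": "Ensure access to affordable, reliable, sustainable and modern energy for all"},
    {"id": 2, "text": "Atmospheric warming is profoundly affecting high-mountain regions"},
]
write_jsonl(records, "input.jsonl")
```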

	### ▶️ Example usage

	#### Top-1 prediction:
	```bash
	sdg-predict input.jsonl --key text --top1 --output preds.jsonl
	```

	#### Full label distribution:
	```bash
	sdg-predict input.jsonl --key text --output preds_all.jsonl
	```

	#### Custom batch size:
	```bash
	sdg-predict input.jsonl --key text --batch_size 16
	```
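
To make the option set explicit, here is a hypothetical `argparse` sketch that mirrors only the flags used in the commands above. It is not the actual content of `cli_predict.py`; in particular, the default batch size and the help texts are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser reproducing only the options shown in the examples above."""
    parser = argparse.ArgumentParser(prog="sdg-predict")
    parser.add_argument("input", help="path to the .jsonl input file")
    parser.add_argument("--key", required=True, help="JSON key holding the text")
    parser.add_argument("--top1", action="store_true", help="keep only the best label")
    parser.add_argument("--output", default=None, help="where to write predictions")
    # The default of 8 is a placeholder; the tool's real default is not documented here.
    parser.add_argument("--batch_size", type=int, default=8)
    return parser


args = build_parser().parse_args(
    ["input.jsonl", "--key", "text", "--top1", "--batch_size", "16"]
)
print(args.key, args.top1, args.batch_size)
```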

	### 📤 Output format

	Each output line is the original input with an added `prediction` key:

	With `--top1`:
	```json
	{
	  "id": 1,
	  "text": "...",
	  "prediction": {
	    "label": "7",
	    "score": 0.9124
	  }
	}
	```

	Without `--top1`:
	```json
	{
	  "id": 1,
	  "text": "...",
	  "prediction": [
	    {"label": "1", "score": 0.0021},
	    {"label": "2", "score": 0.0005},
	    ...
	    {"label": "7", "score": 0.9124}
	  ]
	}
	```
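
Because `prediction` is a single object with `--top1` and a list otherwise, downstream code may want to handle both shapes uniformly. The sketch below assumes exactly the two formats shown above and that the output file holds one JSON object per line; `best_prediction` and `read_predictions` are illustrative names.

```python
import json


def best_prediction(record):
    """Return the (label, score) pair with the highest score for one output record."""
    prediction = record["prediction"]
    if isinstance(prediction, dict):  # written with --top1
        return prediction["label"], prediction["score"]
    # Otherwise a list with one entry per label (full distribution).
    best = max(prediction, key=lambda item: item["score"])
    return best["label"], best["score"]


def read_predictions(path):
    """Yield (record, label, score) for every non-empty line of a predictions file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                yield (record, *best_prediction(record))
```

For example, `for record, label, score in read_predictions("preds.jsonl"): ...` iterates over the predictions regardless of whether `--top1` was used.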

	---

	## 📦 Repository Contents

	- `modeling.py`: Optional class wrapper if extending the base model.
	- `inference.py`: Reusable batch inference logic for Python scripts.
	- `cli_predict.py`: CLI tool using the inference logic.
	- `requirements.txt`: Runtime dependencies.
	- `setup.py`: Installation and entry point for the CLI.
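
As a rough illustration of what batch inference involves, the sketch below splits a list of texts into fixed-size chunks and applies a prediction function to each chunk. It is a generic pattern, not the code in `inference.py`; `iter_batches`, `predict_in_batches`, and `predict_fn` are hypothetical names.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(texts, predict_fn, batch_size=16):
    """Apply `predict_fn` (list of texts -> list of predictions) batch by batch."""
    predictions = []
    for batch in iter_batches(texts, batch_size):
        predictions.extend(predict_fn(batch))
    return predictions
```

A `transformers` pipeline object accepts a list of strings and returns one result per string, so it could be passed as `predict_fn`.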

	---

	## 🔍 Citation

	If you use this model, please cite the original [SciBERT paper](https://arxiv.org/abs/1903.10676) and, where relevant, credit this fine-tuning setup.

	---

	## 👤 Author

	Simon Clematide
	Computational Linguistics, UZH
	[simon-clematide.net](https://simon-clematide.net)