|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- knowledgator/gliclass-v2.0 |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
# ⭐ GLiClass: Generalist and Lightweight Model for Sequence Classification |
|
|
|
|
|
This is an efficient zero-shot classifier inspired by the [GLiNER](https://github.com/urchade/GLiNER/tree/main) work. It achieves performance comparable to cross-encoders while being more compute-efficient, because classification is done in a single forward pass.
|
|
|
|
|
It can be used for `topic classification`, `sentiment analysis`, and as a reranker in `RAG` pipelines; a reranking sketch follows the usage examples below.
|
|
|
|
|
The model was trained on synthetic and licensed data that allow commercial use, so it can be used in commercial applications.
|
|
|
|
|
The backbone model is [mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base). It supports multilingual understanding, making it well-suited for tasks involving texts in different languages.
|
|
|
|
|
### How to use: |
|
|
First, install the GLiClass library:
|
|
```bash |
|
|
pip install gliclass |
|
|
pip install -U "transformers>=4.48.0"
|
|
``` |
|
|
|
|
|
Then initialize a model and a pipeline:
|
|
|
|
|
<details> |
|
|
<summary>English</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "One day I will see the world!" |
|
|
labels = ["travel", "dreams", "sport", "science", "politics"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0]  # take the first element because we passed a single text
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
</details> |
|
|
<details> |
|
|
<summary>Spanish</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "¡Un día veré el mundo!" |
|
|
labels = ["viajes", "sueños", "deportes", "ciencia", "política"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
</details> |
|
|
<details> |
|
|
<summary>Italian</summary>
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "Un giorno vedrò il mondo!" |
|
|
labels = ["viaggi", "sogni", "sport", "scienza", "politica"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
<details> |
|
|
<summary>French</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "Un jour, je verrai le monde!" |
|
|
labels = ["voyage", "rêves", "sport", "science", "politique"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
<details> |
|
|
<summary>German</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "Eines Tages werde ich die Welt sehen!" |
|
|
labels = ["Reisen", "Träume", "Sport", "Wissenschaft", "Politik"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
|
|
|
</details> |
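
### Using as a reranker in RAG:

The same pipeline can score retrieved passages against a query. Below is a minimal sketch, assuming the `ZeroShotClassificationPipeline` API shown above; the query and documents are illustrative placeholders.

```python
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True)
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

# Illustrative query and retrieved passages; in a real RAG pipeline these come from your retriever.
query = "What tools can be used for zero-shot text classification?"
documents = [
    "GLiClass is a generalist and lightweight model for sequence classification.",
    "The weather in Kyiv is mild in autumn.",
    "Cross-encoders rescore query-document pairs with one forward pass per pair.",
]

# Score each document against the query (passed as the single label) and sort by relevance.
scores = [pipeline(doc, [query], threshold=0.0)[0][0]["score"] for doc in documents]
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```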
|
|
|
|
|
### Benchmarks: |
|
|
Below are F1 scores on several text classification datasets. None of the tested models were fine-tuned on these datasets; all were evaluated in a zero-shot setting.
|
|
#### Multilingual benchmarks |
|
|
| Dataset | gliclass-x-base | gliclass-base-v3.0 | gliclass-large-v3.0 | |
|
|
| ------------------------ | --------------- | ------------------ | ------------------- | |
|
|
| FredZhang7/toxi-text-3M | 0.5972 | 0.5072 | 0.6118 | |
|
|
| SetFit/xglue\_nc | 0.5014 | 0.5348 | 0.5378 | |
|
|
| Davlan/sib200\_14classes | 0.4663 | 0.2867 | 0.3173 | |
|
|
| uhhlt/GermEval2017 | 0.3999 | 0.4010 | 0.4299 | |
|
|
| dolfsai/toxic\_es | 0.1250 | 0.1399 | 0.1412 | |
|
|
| **Average** | **0.41796** | **0.37392** | **0.4076** | |
|
|
#### General benchmarks |
|
|
| Dataset | gliclass-x-base | gliclass-base-v3.0 | gliclass-large-v3.0 | |
|
|
| ---------------------------- | --------------- | ------------------ | ------------------- | |
|
|
| SetFit/CR | 0.8630 | 0.9127 | 0.9398 | |
|
|
| SetFit/sst2 | 0.8554 | 0.8959 | 0.9192 | |
|
|
| SetFit/sst5 | 0.3287 | 0.3376 | 0.4606 | |
|
|
| AmazonScience/massive | 0.2611 | 0.5040 | 0.5649 | |
|
|
| stanfordnlp/imdb | 0.8840 | 0.9251 | 0.9366 | |
|
|
| SetFit/20\_newsgroups | 0.4116 | 0.4759 | 0.5958 | |
|
|
| SetFit/enron\_spam | 0.5929 | 0.6760 | 0.7584 | |
|
|
| PolyAI/banking77 | 0.3098 | 0.4698 | 0.5574 | |
|
|
| takala/financial\_phrasebank | 0.7851 | 0.8971 | 0.9000 | |
|
|
| ag\_news | 0.6815 | 0.7279 | 0.7181 | |
|
|
| dair-ai/emotion | 0.3667 | 0.4447 | 0.4506 | |
|
|
| MoritzLaurer/cap\_sotu | 0.3935 | 0.4614 | 0.4589 | |
|
|
| cornell/rotten\_tomatoes | 0.7252 | 0.7943 | 0.8411 | |
|
|
| snips | 0.6307 | 0.9474 | 0.9692 | |
|
|
| **Average** | **0.5778** | **0.6764** | **0.7193** | |
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{stepanov2025gliclassgeneralistlightweightmodel, |
|
|
title={GLiClass: Generalist Lightweight Model for Sequence Classification Tasks}, |
|
|
author={Ihor Stepanov and Mykhailo Shtopko and Dmytro Vodianytskyi and Oleksandr Lukashov and Alexander Yavorskyi and Mykyta Yaroshenko}, |
|
|
year={2025}, |
|
|
eprint={2508.07662}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2508.07662}, |
|
|
} |
|
|
``` |