|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- knowledgator/gliclass-v2.0 |
|
|
pipeline_tag: text-classification |
|
|
--- |
|
|
# ⭐ GLiClass: Generalist and Lightweight Model for Sequence Classification |
|
|
|
|
|
This is an efficient zero-shot classifier inspired by the [GLiNER](https://github.com/urchade/GLiNER/tree/main) work. It achieves performance comparable to cross-encoders while being more compute-efficient, because classification is done in a single forward pass.
|
|
|
|
|
It can be used for `topic classification`, `sentiment analysis`, and as a reranker in `RAG` pipelines; a reranking sketch follows the usage examples below.
|
|
|
|
|
The model was trained on synthetic and licensed data that allow commercial use, so it can be used in commercial applications.
|
|
|
|
|
The backbone model is [mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base). It supports multilingual understanding, making it well-suited for tasks involving texts in different languages.
|
|
|
|
|
### How to use: |
|
|
First, install the GLiClass library:
|
|
```bash |
|
|
pip install gliclass |
|
|
pip install -U "transformers>=4.48.0"
|
|
``` |
|
|
|
|
|
Then initialize a model and a pipeline:
|
|
|
|
|
<details> |
|
|
<summary>English</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "One day I will see the world!" |
|
|
labels = ["travel", "dreams", "sport", "science", "politics"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0]  # take the first element because we passed a single text
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
</details> |
|
|
<details> |
|
|
<summary>Spanish</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "¡Un día veré el mundo!" |
|
|
labels = ["viajes", "sueños", "deportes", "ciencia", "política"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
</details> |
|
|
<details> |
|
|
<summary>Italian</summary>
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "Un giorno vedrò il mondo!" |
|
|
labels = ["viaggi", "sogni", "sport", "scienza", "politica"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
<details> |
|
|
<summary>French</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "Un jour, je verrai le monde!" |
|
|
labels = ["voyage", "rêves", "sport", "science", "politique"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
|
|
|
</details> |
|
|
<details> |
|
|
<summary>German</summary> |
|
|
|
|
|
```python |
|
|
from gliclass import GLiClassModel, ZeroShotClassificationPipeline |
|
|
from transformers import AutoTokenizer |
|
|
|
|
|
model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base") |
|
|
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True) |
|
|
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0') |
|
|
|
|
|
text = "Eines Tages werde ich die Welt sehen!" |
|
|
labels = ["Reisen", "Träume", "Sport", "Wissenschaft", "Politik"] |
|
|
results = pipeline(text, labels, threshold=0.5)[0] |
|
|
for result in results: |
|
|
print(result["label"], "=>", result["score"]) |
|
|
``` |
|
|
|
|
|
</details> |
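
### Using as a reranker in RAG:

The same pipeline can score retrieved passages against a query. Below is a minimal sketch, assuming the `ZeroShotClassificationPipeline` API shown above; the query and documents are illustrative placeholders.

```python
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
from transformers import AutoTokenizer

model = GLiClassModel.from_pretrained("knowledgator/gliclass-x-base")
tokenizer = AutoTokenizer.from_pretrained("knowledgator/gliclass-x-base", add_prefix_space=True)
pipeline = ZeroShotClassificationPipeline(model, tokenizer, classification_type='multi-label', device='cuda:0')

# Illustrative query and retrieved passages; in a real RAG pipeline these come from your retriever.
query = "What tools can be used for zero-shot text classification?"
documents = [
    "GLiClass is a generalist and lightweight model for sequence classification.",
    "The weather in Kyiv is mild in autumn.",
    "Cross-encoders rescore query-document pairs with one forward pass per pair.",
]

# Score each document against the query (passed as the single label) and sort by relevance.
scores = [pipeline(doc, [query], threshold=0.0)[0][0]["score"] for doc in documents]
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```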
|
|
|
|
|
### Benchmarks: |
|
|
Below are F1 scores on several text classification datasets. None of the tested models were fine-tuned on these datasets; all were evaluated in a zero-shot setting.
|
|
#### Multilingual benchmarks |
|
|
| Dataset | gliclass-x-base | gliclass-base-v3.0 | gliclass-large-v3.0 | |
|
|
| ------------------------ | --------------- | ------------------ | ------------------- | |
|
|
| FredZhang7/toxi-text-3M | 0.5972 | 0.5072 | 0.6118 | |
|
|
| SetFit/xglue\_nc | 0.5014 | 0.5348 | 0.5378 | |
|
|
| Davlan/sib200\_14classes | 0.4663 | 0.2867 | 0.3173 | |
|
|
| uhhlt/GermEval2017 | 0.3999 | 0.4010 | 0.4299 | |
|
|
| dolfsai/toxic\_es | 0.1250 | 0.1399 | 0.1412 | |
|
|
| **Average** | **0.41796** | **0.37392** | **0.4076** | |
|
|
#### General benchmarks |
|
|
| Dataset | gliclass-x-base | gliclass-base-v3.0 | gliclass-large-v3.0 | |
|
|
| ---------------------------- | --------------- | ------------------ | ------------------- | |
|
|
| SetFit/CR | 0.8630 | 0.9127 | 0.9398 | |
|
|
| SetFit/sst2 | 0.8554 | 0.8959 | 0.9192 | |
|
|
| SetFit/sst5 | 0.3287 | 0.3376 | 0.4606 | |
|
|
| AmazonScience/massive | 0.2611 | 0.5040 | 0.5649 | |
|
|
| stanfordnlp/imdb | 0.8840 | 0.9251 | 0.9366 | |
|
|
| SetFit/20\_newsgroups | 0.4116 | 0.4759 | 0.5958 | |
|
|
| SetFit/enron\_spam | 0.5929 | 0.6760 | 0.7584 | |
|
|
| PolyAI/banking77 | 0.3098 | 0.4698 | 0.5574 | |
|
|
| takala/financial\_phrasebank | 0.7851 | 0.8971 | 0.9000 | |
|
|
| ag\_news | 0.6815 | 0.7279 | 0.7181 | |
|
|
| dair-ai/emotion | 0.3667 | 0.4447 | 0.4506 | |
|
|
| MoritzLaurer/cap\_sotu | 0.3935 | 0.4614 | 0.4589 | |
|
|
| cornell/rotten\_tomatoes | 0.7252 | 0.7943 | 0.8411 | |
|
|
| snips | 0.6307 | 0.9474 | 0.9692 | |
|
|
| **Average** | **0.5778** | **0.6764** | **0.7193** | |
|
|
|
|
|
## Citation |
|
|
```bibtex |
|
|
@misc{stepanov2025gliclassgeneralistlightweightmodel, |
|
|
title={GLiClass: Generalist Lightweight Model for Sequence Classification Tasks}, |
|
|
author={Ihor Stepanov and Mykhailo Shtopko and Dmytro Vodianytskyi and Oleksandr Lukashov and Alexander Yavorskyi and Mykyta Yaroshenko}, |
|
|
year={2025}, |
|
|
eprint={2508.07662}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2508.07662}, |
|
|
} |
|
|
``` |