| # GLiNER Small v2.1 — GPU-Optimized Inference |
|
|
| **Optimization result:** 1.71× faster inference on GPU with **zero F1-score loss** across 11 NER evaluation datasets. |
|
|
| Model: [binga/gliner_small_v2.1-optimized-gpu](https://huggingface.co/binga/gliner_small_v2.1-optimized-gpu) |
|
|
| ## What is this? |
|
|
| This is an **optimized variant** of the original [`urchade/gliner_small-v2.1`](https://huggingface.co/urchade/gliner_small-v2.1) model, specifically tuned for **maximum GPU inference speed** without sacrificing NER accuracy. |
|
|
| ## Optimizations Applied |
|
|
| | Technique | Speedup | F1 Impact | Notes | |
| |-----------|---------|-----------|-------| |
| | **FP16 (half-precision)** | ~1.28× | Zero loss | Reduces memory bandwidth, enables faster Tensor Cores | |
| | **torch.compile(mode="max-autotune")** | ~1.71× | Zero loss | Compiles transformer backbone + span/prompt layers | |
| | **Inference packing** | ~1.84× (batch throughput) | Zero loss | Packs variable-length sequences for better GPU utilization | |
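The packing idea can be illustrated with a toy first-fit sketch (a hypothetical helper, not the library's actual implementation): variable-length sequences are grouped so that each group's combined token count stays under the model's 384-token limit, which cuts wasted padding per forward pass.

```python
def pack_sequences(lengths, max_length=384):
    """First-fit packing: group variable-length sequences into bins
    whose total token count stays within max_length."""
    bins = []       # each bin holds sequence indices
    bin_loads = []  # running token count per bin
    for idx, length in enumerate(lengths):
        for b, load in enumerate(bin_loads):
            if load + length <= max_length:
                bins[b].append(idx)
                bin_loads[b] += length
                break
        else:
            # no existing bin has room; open a new one
            bins.append([idx])
            bin_loads.append(length)
    return bins

# Ten short sequences pack into 4 bins instead of 10 padded rows
packed = pack_sequences([120, 80, 300, 64, 200, 40, 150, 90, 32, 256])
print(packed)
```

Real packing also has to keep attention masks from leaking across packed sequences, which is where most of the engineering effort goes.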
|
|
| ### Recommended Usage |
|
|
| For **best latency (single text)**: |
| ```python |
| from gliner import GLiNER |
| import torch |
| |
| model = GLiNER.from_pretrained("binga/gliner_small_v2.1-optimized-gpu", map_location="cuda") |
| model.to("cuda") |
| model.half() # FP16 — critical for speed |
| |
| # Optional: compile submodules for additional speed |
| if hasattr(model, "model"): |
| inner = model.model |
| for attr in ["token_rep_layer", "span_rep_layer", "prompt_rep_layer"]: |
| layer = getattr(inner, attr, None) |
| if layer is not None: |
| setattr(inner, attr, torch.compile(layer, mode="max-autotune")) |
| |
# Note: the first call triggers torch.compile warm-up and is slow;
# subsequent calls run at full speed.
text = "Apple Inc. was founded by Steve Jobs in California."
labels = ["person", "organization", "location"]
entities = model.predict_entities(text, labels, threshold=0.5)
| ``` |
|
|
| For **best throughput (batch processing)**: |
| ```python |
| from gliner import InferencePackingConfig |
| |
# Reuses `model` from the snippet above; `texts` is a list of strings
# Enable inference packing (packs variable-length sequences)
model.configure_inference_packing(
    InferencePackingConfig(max_length=384, streams_per_batch=8)
)

results = model.inference(texts, labels, threshold=0.5, batch_size=32)
| ``` |
|
|
| ## Benchmark Results |
|
|
Evaluated on **11 diverse NER datasets** (CoNLL-2003, OntoNotes 5, BC5CDR, WNUT-2017, TweetNER7, MIT Movie, FIN, CrossNER AI/Literature/Science, WikiNeural):
|
|
| | Dataset | Samples | Entities | Baseline F1 | Optimized F1 | Speedup | |
| |---------|---------|----------|-------------|--------------|---------| |
| | conll2003 | 3,453 | 4 | 0.5483 | 0.5481 | 1.83× | |
| | ontonotes5 | 8,262 | 18 | 0.2797 | 0.2797 | 1.75× | |
| | bc5cdr | 5,865 | 2 | 0.6592 | 0.6591 | 1.73× | |
| | wnut2017 | 1,287 | 6 | 0.4255 | 0.4252 | 1.75× | |
| | tweetner7 | 3,383 | 7 | 0.2829 | 0.2828 | 1.73× | |
| | mit_movie | 1,953 | 12 | 0.5183 | 0.5183 | 1.80× | |
| | fin | 305 | 4 | 0.2906 | 0.2906 | 1.25× | |
| | crossner_ai | 431 | 14 | 0.5000 | 0.5002 | 1.71× | |
| | crossner_literature | 416 | 12 | 0.6444 | 0.6444 | 1.73× | |
| | crossner_science | 543 | 17 | 0.6330 | 0.6332 | 1.75× | |
| | wikineural | 3,000 | 16 | 0.5465 | 0.5462 | 1.67× | |
| **AVERAGE** | **28,898** | — | **0.4844** | **0.4844** | **1.71×** |
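The averages can be reproduced directly from the per-dataset rows above; the baseline mean lands exactly on 0.4844, and the optimized mean agrees with it to within rounding:

```python
# Per-dataset F1 scores from the benchmark table
baseline = [0.5483, 0.2797, 0.6592, 0.4255, 0.2829, 0.5183,
            0.2906, 0.5000, 0.6444, 0.6330, 0.5465]
optimized = [0.5481, 0.2797, 0.6591, 0.4252, 0.2828, 0.5183,
             0.2906, 0.5002, 0.6444, 0.6332, 0.5462]

avg_base = sum(baseline) / len(baseline)
avg_opt = sum(optimized) / len(optimized)
print(round(avg_base, 4), round(avg_opt - avg_base, 5))
```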
|
|
### Accuracy Impact
|
|
- **F1 difference**: at most 0.0003 on any single dataset; the average is unchanged to four decimal places (within measurement noise)
| - **All 11 datasets**: No statistically significant performance degradation |
| - **Zero-shot NER**: Maintains the same generalization capability |
|
|
| ## Hardware Requirements |
|
|
| - **GPU**: NVIDIA GPU with Tensor Cores (T4, A10, A100, H100 recommended) |
| - **VRAM**: ~1.5GB for FP16 inference (vs ~3GB FP32) |
| - **CUDA**: 11.8+ or 12.x |
| - **PyTorch**: 2.0+ (for `torch.compile` support) |
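The requirements above can be checked programmatically. A hypothetical helper whose thresholds mirror the list (compute capability ≥ 7.0 marks Tensor-Core GPUs such as the T4; `torch.compile` needs PyTorch ≥ 2.0); the inputs would come from `torch.cuda.get_device_capability()` and `torch.__version__`:

```python
def meets_requirements(compute_capability, torch_version):
    """Check the GPU/PyTorch requirements listed above.

    compute_capability: (major, minor) tuple, e.g. from torch.cuda.get_device_capability()
    torch_version: version string, e.g. torch.__version__ ("2.1.0+cu118")
    """
    major, _ = compute_capability
    has_tensor_cores = major >= 7  # Volta/Turing (T4) and newer
    torch_major = int(torch_version.split(".")[0])
    supports_compile = torch_major >= 2
    return has_tensor_cores and supports_compile

# A T4 (compute capability 7.5) on PyTorch 2.1 qualifies
print(meets_requirements((7, 5), "2.1.0"))
```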
|
|
| ## Model Details |
|
|
| - **Base model**: `microsoft/deberta-v3-small` (6 layers, 768 hidden) |
| - **Architecture**: Uni-encoder span-based NER |
| - **Parameters**: ~166M (same as original) |
| - **Max length**: 384 tokens |
| - **Max entity types**: 25 per inference call |
| - **Max span width**: 12 words |
| - **License**: Apache-2.0 |
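The span-based architecture scores every contiguous candidate span up to the 12-word maximum width, rather than tagging tokens one by one. A toy enumeration (illustrative only) shows how the candidate set grows with sentence length:

```python
def candidate_spans(n_words, max_width=12):
    """Enumerate all (start, end) word spans of width <= max_width."""
    return [
        (start, start + width)
        for width in range(1, min(max_width, n_words) + 1)
        for start in range(n_words - width + 1)
    ]

# A 5-word sentence yields 5 + 4 + 3 + 2 + 1 = 15 candidate spans
print(len(candidate_spans(5)))
```

Each candidate span is matched against the embedded entity-type prompts, which is why the number of types per call is bounded (25 here).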
|
|
| ## Citation |
|
|
| Original GLiNER paper: |
| ```bibtex |
| @inproceedings{zaratiana2024gliner, |
| title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer}, |
| author={Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry}, |
| booktitle={NAACL}, |
| year={2024} |
| } |
| ``` |
|
|
| ## Acknowledgments |
|
|
| This optimized model is based on the original [`urchade/gliner_small-v2.1`](https://huggingface.co/urchade/gliner_small-v2.1) by Urchade Zaratiana et al. All credit for the model architecture and training goes to the original authors. |
|
|