| # GLiNER Small v2.1 — GPU-Optimized Inference |
|
|
| **Optimization result:** 1.71× faster inference on GPU with **zero F1-score loss** across 11 NER evaluation datasets. |
|
|
| Model: [binga/gliner_small_v2.1-optimized-gpu](https://huggingface.co/binga/gliner_small_v2.1-optimized-gpu) |
|
|
| ## What is this? |
|
|
| This is an **optimized variant** of the original [`urchade/gliner_small-v2.1`](https://huggingface.co/urchade/gliner_small-v2.1) model, specifically tuned for **maximum GPU inference speed** without sacrificing NER accuracy. |
|
|
| ## Optimizations Applied |
|
|
| | Technique | Speedup | F1 Impact | Notes | |
| |-----------|---------|-----------|-------| |
| | **FP16 (half-precision)** | ~1.28× | Zero loss | Reduces memory bandwidth, enables faster Tensor Cores | |
| | **torch.compile(mode="max-autotune")** | ~1.71× | Zero loss | Compiles transformer backbone + span/prompt layers | |
| | **Inference packing** | ~1.84× (batch throughput) | Zero loss | Packs variable-length sequences for better GPU utilization | |
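The packing idea can be illustrated with a toy first-fit sketch (a hypothetical helper, not the library's actual implementation): variable-length sequences are grouped so that each group's combined token count stays under the model's 384-token limit, which cuts wasted padding per forward pass.

```python
def pack_sequences(lengths, max_length=384):
    """First-fit packing: group variable-length sequences into bins
    whose total token count stays within max_length."""
    bins = []       # each bin holds sequence indices
    bin_loads = []  # running token count per bin
    for idx, length in enumerate(lengths):
        for b, load in enumerate(bin_loads):
            if load + length <= max_length:
                bins[b].append(idx)
                bin_loads[b] += length
                break
        else:
            # no existing bin has room; open a new one
            bins.append([idx])
            bin_loads.append(length)
    return bins

# Ten short sequences pack into 4 bins instead of 10 padded rows
packed = pack_sequences([120, 80, 300, 64, 200, 40, 150, 90, 32, 256])
print(packed)
```

Real packing also has to keep attention masks from leaking across packed sequences, which is where most of the engineering effort goes.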
|
|
| ### Recommended Usage |
|
|
| For **best latency (single text)**: |
| ```python |
| from gliner import GLiNER |
| import torch |
| |
| model = GLiNER.from_pretrained("binga/gliner_small_v2.1-optimized-gpu", map_location="cuda") |
| model.to("cuda") |
| model.half() # FP16 — critical for speed |
| |
| # Optional: compile submodules for additional speed |
| if hasattr(model, "model"): |
| inner = model.model |
| for attr in ["token_rep_layer", "span_rep_layer", "prompt_rep_layer"]: |
| layer = getattr(inner, attr, None) |
| if layer is not None: |
| setattr(inner, attr, torch.compile(layer, mode="max-autotune")) |
| |
# Note: the first call triggers torch.compile warm-up and is slow;
# subsequent calls run at full speed.
text = "Apple Inc. was founded by Steve Jobs in California."
labels = ["person", "organization", "location"]
entities = model.predict_entities(text, labels, threshold=0.5)
| ``` |
|
|
| For **best throughput (batch processing)**: |
| ```python |
| from gliner import InferencePackingConfig |
| |
# Reuses `model` from the snippet above; `texts` is a list of strings
# Enable inference packing (packs variable-length sequences)
model.configure_inference_packing(
    InferencePackingConfig(max_length=384, streams_per_batch=8)
)

results = model.inference(texts, labels, threshold=0.5, batch_size=32)
| ``` |
|
|
| ## Benchmark Results |
|
|
Evaluated on **11 diverse NER datasets** (CoNLL-2003, OntoNotes 5, BC5CDR, WNUT-2017, TweetNER7, MIT Movie, FIN, CrossNER AI/Literature/Science, WikiNeural):
|
|
| | Dataset | Samples | Entities | Baseline F1 | Optimized F1 | Speedup | |
| |---------|---------|----------|-------------|--------------|---------| |
| | conll2003 | 3,453 | 4 | 0.5483 | 0.5481 | 1.83× | |
| | ontonotes5 | 8,262 | 18 | 0.2797 | 0.2797 | 1.75× | |
| | bc5cdr | 5,865 | 2 | 0.6592 | 0.6591 | 1.73× | |
| | wnut2017 | 1,287 | 6 | 0.4255 | 0.4252 | 1.75× | |
| | tweetner7 | 3,383 | 7 | 0.2829 | 0.2828 | 1.73× | |
| | mit_movie | 1,953 | 12 | 0.5183 | 0.5183 | 1.80× | |
| | fin | 305 | 4 | 0.2906 | 0.2906 | 1.25× | |
| | crossner_ai | 431 | 14 | 0.5000 | 0.5002 | 1.71× | |
| | crossner_literature | 416 | 12 | 0.6444 | 0.6444 | 1.73× | |
| | crossner_science | 543 | 17 | 0.6330 | 0.6332 | 1.75× | |
| | wikineural | 3,000 | 16 | 0.5465 | 0.5462 | 1.67× | |
| **AVERAGE** | **28,898** | — | **0.4844** | **0.4844** | **1.71×** |
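The averages can be reproduced directly from the per-dataset rows above; the baseline mean lands exactly on 0.4844, and the optimized mean agrees with it to within rounding:

```python
# Per-dataset F1 scores from the benchmark table
baseline = [0.5483, 0.2797, 0.6592, 0.4255, 0.2829, 0.5183,
            0.2906, 0.5000, 0.6444, 0.6330, 0.5465]
optimized = [0.5481, 0.2797, 0.6591, 0.4252, 0.2828, 0.5183,
             0.2906, 0.5002, 0.6444, 0.6332, 0.5462]

avg_base = sum(baseline) / len(baseline)
avg_opt = sum(optimized) / len(optimized)
print(round(avg_base, 4), round(avg_opt - avg_base, 5))
```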
|
|
### Accuracy Impact
|
|
- **F1 difference**: at most 0.0003 on any single dataset; the average is unchanged to four decimal places (within measurement noise)
| - **All 11 datasets**: No statistically significant performance degradation |
| - **Zero-shot NER**: Maintains the same generalization capability |
|
|
| ## Hardware Requirements |
|
|
| - **GPU**: NVIDIA GPU with Tensor Cores (T4, A10, A100, H100 recommended) |
| - **VRAM**: ~1.5GB for FP16 inference (vs ~3GB FP32) |
| - **CUDA**: 11.8+ or 12.x |
| - **PyTorch**: 2.0+ (for `torch.compile` support) |
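The requirements above can be checked programmatically. A hypothetical helper whose thresholds mirror the list (compute capability ≥ 7.0 marks Tensor-Core GPUs such as the T4; `torch.compile` needs PyTorch ≥ 2.0); the inputs would come from `torch.cuda.get_device_capability()` and `torch.__version__`:

```python
def meets_requirements(compute_capability, torch_version):
    """Check the GPU/PyTorch requirements listed above.

    compute_capability: (major, minor) tuple, e.g. from torch.cuda.get_device_capability()
    torch_version: version string, e.g. torch.__version__ ("2.1.0+cu118")
    """
    major, _ = compute_capability
    has_tensor_cores = major >= 7  # Volta/Turing (T4) and newer
    torch_major = int(torch_version.split(".")[0])
    supports_compile = torch_major >= 2
    return has_tensor_cores and supports_compile

# A T4 (compute capability 7.5) on PyTorch 2.1 qualifies
print(meets_requirements((7, 5), "2.1.0"))
```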
|
|
| ## Model Details |
|
|
| - **Base model**: `microsoft/deberta-v3-small` (6 layers, 768 hidden) |
| - **Architecture**: Uni-encoder span-based NER |
| - **Parameters**: ~166M (same as original) |
| - **Max length**: 384 tokens |
| - **Max entity types**: 25 per inference call |
| - **Max span width**: 12 words |
| - **License**: Apache-2.0 |
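The span-based architecture scores every contiguous candidate span up to the 12-word maximum width, rather than tagging tokens one by one. A toy enumeration (illustrative only) shows how the candidate set grows with sentence length:

```python
def candidate_spans(n_words, max_width=12):
    """Enumerate all (start, end) word spans of width <= max_width."""
    return [
        (start, start + width)
        for width in range(1, min(max_width, n_words) + 1)
        for start in range(n_words - width + 1)
    ]

# A 5-word sentence yields 5 + 4 + 3 + 2 + 1 = 15 candidate spans
print(len(candidate_spans(5)))
```

Each candidate span is matched against the embedded entity-type prompts, which is why the number of types per call is bounded (25 here).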
|
|
| ## Citation |
|
|
| Original GLiNER paper: |
| ```bibtex |
| @inproceedings{zaratiana2024gliner, |
| title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer}, |
| author={Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry}, |
| booktitle={NAACL}, |
| year={2024} |
| } |
| ``` |
|
|
| ## Acknowledgments |
|
|
| This optimized model is based on the original [`urchade/gliner_small-v2.1`](https://huggingface.co/urchade/gliner_small-v2.1) by Urchade Zaratiana et al. All credit for the model architecture and training goes to the original authors. |
|
|