---
license: apache-2.0
language:
- en
- ru
base_model:
- jhu-clsp/mmBERT-small
pipeline_tag: zero-shot-classification
tags:
- gliner2
- safety
- pii
- ai-security
- zero-shot
- text-classification
- zero-shot-classification
- span-categorization
- token-classification
- guardrails
---
	# GLiNER Guard — Unified Multitask Guardrail
	One encoder model that replaces your entire guardrail stack: safety classification, PII detection, adversarial attack detection, intent and tone analysis — all in a single forward pass.
	![GLiNER Guard architecture](biencoder.png)

145M params · GLiNER2 · biencoder · multilingual ModernBERT (mmBERT-small) · zero-shot classification, NER, and more · no LLM required

	## Installation
Install the dependencies.\
(For now this installs from our fork; we'll update this section after the PR to the GLiNER2 repo.)
	```bash
	pip install "gliner2 @ git+https://github.com/bogdanminko/GLiNER2.git@feature/bi-encoder"
	```
	## Usage
Classify harmful messages and detect PII in a single forward pass:
```python
from gliner2 import GLiNER2

# Load the guard model and enable label caching.
model = GLiNER2.from_pretrained("hivetrace/gliner-guard-biencoder")
model.config.cache_labels = True

PII_LABELS = ["person", "location", "email", "phone"]
SAFETY_LABELS = ["safe", "unsafe"]

# One schema combines span extraction (PII) with single-label classification (safety).
schema = (
    model.create_schema()
    .entities(entity_types=PII_LABELS, threshold=0.4)
    .classification(task="safety", labels=SAFETY_LABELS)
)

result = model.extract(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    schema=schema,
)
print(result)
```
Output:
```
{'entities': {'person': ['John Smith'],
              'location': [],
              'email': ['john.smith@gmail.com'],
              'phone': []},
 'safety': 'unsafe'}
```
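
The `entities` mapping in this output can drive a simple redaction step. The snippet below is a minimal sketch and not part of GLiNER2: `redact_pii` is a hypothetical helper, and it assumes the result has the shape shown above (each label maps to a list of matched strings).

```python
def redact_pii(text: str, entities: dict[str, list[str]]) -> str:
    """Replace every extracted span with a [LABEL] placeholder.

    Hypothetical helper: assumes `entities` maps each label to a list of
    matched strings, as in the example output above.
    """
    # Replace longer spans first, so a short span that is a substring of a
    # longer one does not break the longer match.
    pairs = [(span, label) for label, spans in entities.items() for span in spans]
    for span, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        text = text.replace(span, f"[{label.upper()}]")
    return text


# Result copied from the example output above.
result = {
    "entities": {
        "person": ["John Smith"],
        "location": [],
        "email": ["john.smith@gmail.com"],
        "phone": [],
    },
    "safety": "unsafe",
}
message = "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos"
redacted = redact_pii(message, result["entities"])
print(redacted)  # Send $500 to [PERSON] at [EMAIL] or I'll leak your photos
```

String replacement is case-sensitive and replaces every occurrence of a span, which is usually what a redaction step wants; if character offsets are available, they would be a more precise basis.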

	## Supported Tasks

	GLiNER Guard is purpose-built for 6 guardrail tasks via a shared encoder — no LLM required.\
	Thanks to zero-shot generalization, it can also handle custom labels outside the training taxonomy.

| Task | Type | # Labels | Example labels |
|------|------|----------|----------------|
| Safety | single-label | 2 | `safe` `unsafe` |
| PII / NER | span extraction | 32 | `person` `email` `phone` `card_number` `address` |
| Adversarial Detection | multi-label | 15 | `jailbreak_persona` `prompt_injection` `instruction_override` `data_exfiltration` |
| Harmful Content | multi-label | 30 | `hate_speech` `violence` `child_exploitation` `fraud` `pii_exposure` |
| Intent | single-label | 13 | `informational` `adversarial` `threatening` `solicitation` |
| Tone of Voice | single-label | 10 | `neutral` `aggressive` `manipulative` `deceptive` |
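
Several of these tasks can plausibly be requested in one schema. The sketch below reuses only the builder calls from the Usage section and rests on two assumptions: that `.classification(...)` may be chained more than once on the same schema, and that the task names `"intent"` and `"tone_of_voice"` are acceptable (both names are illustrative). `build_guard_schema` is a hypothetical helper, and the label lists are shortened to the example labels from the table. The multi-label tasks are left out because this README does not show how multi-label output is requested.

```python
PII_LABELS = ["person", "email", "phone", "card_number", "address"]
SAFETY_LABELS = ["safe", "unsafe"]
INTENT_LABELS = ["informational", "adversarial", "threatening", "solicitation"]
TOV_LABELS = ["neutral", "aggressive", "manipulative", "deceptive"]


def build_guard_schema(model):
    """Build one schema covering PII extraction plus three single-label tasks.

    Assumption: `.classification(...)` can be chained several times; the
    Usage example only shows it called once. Task names other than
    "safety" are illustrative.
    """
    return (
        model.create_schema()
        .entities(entity_types=PII_LABELS, threshold=0.4)
        .classification(task="safety", labels=SAFETY_LABELS)
        .classification(task="intent", labels=INTENT_LABELS)
        .classification(task="tone_of_voice", labels=TOV_LABELS)
    )
```

With a model loaded as in the Usage section, `model.extract(text, schema=build_guard_schema(model))` would then run all four tasks together.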

	<details>
	<summary><b>Safety</b> — all 2 labels</summary>

	Classifies whether a message is safe or unsafe. Single-label.
	```python
	SAFETY_LABELS = ["safe", "unsafe"]
	```

| Label | Description |
|-------|-------------|
| `safe` | Message does not contain harmful or policy-violating content |
| `unsafe` | Message contains harmful, dangerous, or policy-violating content |

	</details>

	<details>
	<summary><b>NER / PII</b> — all 32 entity types</summary>

	Span extraction across 7 groups. Use labels from this list for best results — out-of-taxonomy labels may work via zero-shot generalization but are not benchmarked.

| Group | Labels |
|-------|--------|
| Person | `person` `first_name` `last_name` `alias` `title` |
| Location | `country` `region` `city` `district` `street` `building` `unit` `postal_code` `landmark` `address` |
| Organization | `company` `government` `education` `media` `product` |
| Contact | `email` `phone` `social_account` `messenger` |
| Identity | `passport` `national_id` `document_id` |
| Temporal | `date_of_birth` `event_date` |
| Financial | `card_number` `bank_account` `crypto_wallet` |

```python
PII_LABELS = [
    "person", "first_name", "last_name", "alias", "title",
    "country", "region", "city", "district", "street",
    "building", "unit", "postal_code", "landmark", "address",
    "company", "government", "education", "media", "product",
    "email", "phone", "social_account", "messenger",
    "passport", "national_id", "document_id",
    "date_of_birth", "event_date",
    "card_number", "bank_account", "crypto_wallet",
]
```
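
When fine-grained labels are more detail than a downstream policy needs, matches can be rolled up into the seven groups from the table. This is a minimal sketch: `PII_GROUPS`, `LABEL_TO_GROUP`, and `group_entities` are hypothetical names, the `"Other"` bucket for out-of-taxonomy labels is an assumption, and the input is assumed to have the label-to-list-of-strings shape shown in the Usage output.

```python
PII_GROUPS = {
    "Person": ["person", "first_name", "last_name", "alias", "title"],
    "Location": ["country", "region", "city", "district", "street",
                 "building", "unit", "postal_code", "landmark", "address"],
    "Organization": ["company", "government", "education", "media", "product"],
    "Contact": ["email", "phone", "social_account", "messenger"],
    "Identity": ["passport", "national_id", "document_id"],
    "Temporal": ["date_of_birth", "event_date"],
    "Financial": ["card_number", "bank_account", "crypto_wallet"],
}

# Reverse index: label -> group name.
LABEL_TO_GROUP = {
    label: group for group, labels in PII_GROUPS.items() for label in labels
}


def group_entities(entities: dict[str, list[str]]) -> dict[str, list[str]]:
    """Merge per-label matches into per-group lists, skipping empty labels.

    Labels outside the taxonomy are collected under "Other" (an assumption).
    """
    grouped: dict[str, list[str]] = {}
    for label, spans in entities.items():
        if spans:
            grouped.setdefault(LABEL_TO_GROUP.get(label, "Other"), []).extend(spans)
    return grouped


print(group_entities({
    "person": ["John Smith"],
    "email": ["john.smith@gmail.com"],
    "phone": [],
}))
# {'Person': ['John Smith'], 'Contact': ['john.smith@gmail.com']}
```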

	</details>

	<details>
	<summary><b>Adversarial Detection</b> — all 15 labels</summary>

	Detects attacks against LLM-based systems. Multi-label: a single message can combine multiple attack vectors.

| Subgroup | Labels |
|----------|--------|
| Jailbreak | `jailbreak_persona` `jailbreak_hypothetical` `jailbreak_roleplay` |
| Injection | `prompt_injection` `indirect_prompt_injection` `instruction_override` |
| Extraction | `data_exfiltration` `system_prompt_extraction` `context_manipulation` `token_manipulation` |
| Advanced | `tool_abuse` `social_engineering` `multi_turn_escalation` `schema_poisoning` |
| Clean | `none` |

```python
ADVERSARIAL_LABELS = [
    "jailbreak_persona", "jailbreak_hypothetical", "jailbreak_roleplay",
    "prompt_injection", "indirect_prompt_injection", "instruction_override",
    "data_exfiltration", "system_prompt_extraction", "context_manipulation", "token_manipulation",
    "tool_abuse", "social_engineering", "multi_turn_escalation", "schema_poisoning",
    "none",
]
```
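
Because the task is multi-label and `none` marks a clean message, downstream code usually has to filter `none` out before deciding whether anything was detected. The sketch below assumes a multi-label prediction arrives as a plain list of label strings (the output format for multi-label tasks is not shown in this README); `ATTACK_SUBGROUPS` and `summarize_attacks` are hypothetical names.

```python
ATTACK_SUBGROUPS = {
    "Jailbreak": ["jailbreak_persona", "jailbreak_hypothetical", "jailbreak_roleplay"],
    "Injection": ["prompt_injection", "indirect_prompt_injection", "instruction_override"],
    "Extraction": ["data_exfiltration", "system_prompt_extraction",
                   "context_manipulation", "token_manipulation"],
    "Advanced": ["tool_abuse", "social_engineering",
                 "multi_turn_escalation", "schema_poisoning"],
}

# Reverse index: label -> subgroup name.
LABEL_TO_SUBGROUP = {
    label: group for group, labels in ATTACK_SUBGROUPS.items() for label in labels
}


def summarize_attacks(predicted: list[str]) -> dict[str, list[str]]:
    """Group predicted attack labels by subgroup, ignoring `none`.

    Assumption: `predicted` is a list of label strings. An empty result
    means the message is treated as clean.
    """
    summary: dict[str, list[str]] = {}
    for label in predicted:
        if label == "none":
            continue
        summary.setdefault(LABEL_TO_SUBGROUP.get(label, "Other"), []).append(label)
    return summary


print(summarize_attacks(["prompt_injection", "system_prompt_extraction"]))
# {'Injection': ['prompt_injection'], 'Extraction': ['system_prompt_extraction']}
print(bool(summarize_attacks(["none"])))
# False
```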

	</details>

	<details>
	<summary><b>Harmful Content</b> — all 30 labels</summary>

	Detects harmful content categories. Multi-label: a message can belong to multiple categories simultaneously.

| Subgroup | Labels |
|----------|--------|
| Interpersonal | `harassment` `hate_speech` `discrimination` `doxxing` `bullying` |
| Violence & Danger | `violence` `dangerous_instructions` `weapons` `drugs` `self_harm` |
| Sexual & Exploitation | `sexual_content` `child_exploitation` `grooming` `sextortion` |
| Deception | `fraud` `scam` `social_engineering` `impersonation` |
| Sensitive Topics | `profanity` `extremism` `political` `war` `espionage` `cybersecurity` `religious` `lgbt` |
| Information | `misinformation` `copyright_violation` `pii_exposure` |
| Clean | `none` |

```python
HARMFUL_LABELS = [
    "harassment", "hate_speech", "discrimination", "doxxing", "bullying",
    "violence", "dangerous_instructions", "weapons", "drugs", "self_harm",
    "sexual_content", "child_exploitation", "grooming", "sextortion",
    "fraud", "scam", "social_engineering", "impersonation",
    "profanity", "extremism", "political", "war", "espionage", "cybersecurity", "religious", "lgbt",
    "misinformation", "copyright_violation", "pii_exposure",
    "none",
]
```

	</details>

	<details>
	<summary><b>Intent</b> — all 13 labels</summary>

	Classifies the intent behind a message. Single-label.

| Category | Labels |
|----------|--------|
| Benign | `informational` `instructional` `conversational` `persuasive` `creative` `transactional` `emotional_support` `testing` |
| Ambiguous | `ambiguous` `extractive` |
| Malicious | `adversarial` `threatening` `solicitation` |

```python
INTENT_LABELS = [
    "informational", "instructional", "conversational", "persuasive",
    "creative", "transactional", "emotional_support", "testing",
    "ambiguous", "extractive",
    "adversarial", "threatening", "solicitation",
]
```
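
The three categories in the table give a natural coarse risk tier for each predicted intent. A minimal sketch, with hypothetical names (`INTENT_TIERS`, `intent_tier`) and an assumed `"unknown"` fallback for labels outside the taxonomy:

```python
INTENT_TIERS = {
    "benign": ["informational", "instructional", "conversational", "persuasive",
               "creative", "transactional", "emotional_support", "testing"],
    "ambiguous": ["ambiguous", "extractive"],
    "malicious": ["adversarial", "threatening", "solicitation"],
}

# Reverse index: intent label -> tier.
INTENT_TO_TIER = {
    label: tier for tier, labels in INTENT_TIERS.items() for label in labels
}


def intent_tier(label: str) -> str:
    """Map a predicted intent label to benign / ambiguous / malicious.

    Custom (out-of-taxonomy) labels fall back to "unknown" (an assumption).
    """
    return INTENT_TO_TIER.get(label, "unknown")


print(intent_tier("threatening"))  # malicious
print(intent_tier("extractive"))   # ambiguous
```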

	</details>

	<details>
	<summary><b>Tone of Voice</b> — all 10 labels</summary>

	Classifies the tone of a message. Single-label.

| Label | Description |
|-------|-------------|
| `neutral` | Matter-of-fact, no strong emotional coloring |
| `formal` | Professional or official register |
| `humorous` | Playful, joking, or light-hearted |
| `sarcastic` | Ironic or mocking tone |
| `distressed` | Anxious, upset, or overwhelmed |
| `confused` | Unclear intent, disoriented phrasing |
| `pleading` | Urgent requests, begging for help or compliance |
| `aggressive` | Hostile, confrontational, or threatening |
| `manipulative` | Attempts to exploit, deceive, or coerce |
| `deceptive` | Deliberately misleading or false framing |

```python
TOV_LABELS = [
    "neutral", "formal", "humorous", "sarcastic",
    "distressed", "confused", "pleading",
    "aggressive", "manipulative", "deceptive",
]
```

	</details>