---
license: apache-2.0
language:
- en
- ru
base_model:
- jhu-clsp/mmBERT-small
pipeline_tag: zero-shot-classification
tags:
- gliner2
- safety
- pii
- ai-security
- zero-shot
- text-classification
- zero-shot-classification
- span-categorization
- token-classification
- guardrails
---
# GLiNER Guard — Unified Multitask Guardrail
One encoder model that replaces your entire guardrail stack: safety classification, PII detection, adversarial attack detection, intent and tone analysis — all in a single forward pass.
![GLiNER Guard architecture](biencoder.png)

**145M params · GLiNER2 · bi-encoder · multilingual ModernBERT (mmBERT-small) · zero-shot classification, NER and more · no LLM required**

## Installation
Install the dependencies (for now from our fork; we'll update this section after the PR to the GLiNER2 repo):
```bash
pip install "gliner2 @ git+https://github.com/bogdanminko/GLiNER2.git@feature/bi-encoder"
```
## Usage
Classify harmful messages and detect PII in a single forward pass:
```python
from gliner2 import GLiNER2

model = GLiNER2.from_pretrained("hivetrace/gliner-guard-biencoder")
model.config.cache_labels = True

PII_LABELS = ["person", "location", "email", "phone"]
SAFETY_LABELS = ["safe", "unsafe"]
# One schema, two tasks: PII span extraction and safety classification
schema = (
    model.create_schema()
    .entities(entity_types=PII_LABELS, threshold=0.4)
    .classification(task="safety", labels=SAFETY_LABELS)
)

result = model.extract(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    schema=schema,
)
```
Output:
```
{'entities': {'person': ['John Smith'],
  'location': [],
  'email': ['john.smith@gmail.com'],
  'phone': []},
 'safety': 'unsafe'}
```
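
A guardrail usually acts on this dictionary rather than just printing it. The sketch below shows one possible post-processing step, assuming the result has exactly the shape shown above. `GuardDecision` and `apply_guardrail` are hypothetical helpers for illustration, not part of GLiNER2.

```python
from dataclasses import dataclass


@dataclass
class GuardDecision:
    blocked: bool        # True when the safety task returned "unsafe"
    redacted_text: str   # input text with detected PII spans masked


def apply_guardrail(text: str, result: dict) -> GuardDecision:
    """Mask every extracted entity and flag unsafe messages.

    ASSUMPTION: `result` looks like the output above:
    {"entities": {label: [span, ...]}, "safety": "safe" | "unsafe"}.
    """
    pairs = [
        (span, label)
        for label, spans in result.get("entities", {}).items()
        for span in spans
    ]
    redacted = text
    # Replace longer spans first so a short span nested inside a longer one
    # cannot break the longer match.
    for span, label in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
        redacted = redacted.replace(span, f"[{label.upper()}]")
    return GuardDecision(
        blocked=result.get("safety") == "unsafe",
        redacted_text=redacted,
    )


result = {
    "entities": {
        "person": ["John Smith"],
        "location": [],
        "email": ["john.smith@gmail.com"],
        "phone": [],
    },
    "safety": "unsafe",
}
decision = apply_guardrail(
    "Send $500 to John Smith at john.smith@gmail.com or I'll leak your photos",
    result,
)
print(decision.blocked)
print(decision.redacted_text)
```

String replacement is the simplest option here because the output above contains span text only; if character offsets are available in your setup, offset-based masking is more robust against repeated substrings.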

## Supported Tasks

GLiNER Guard is purpose-built for 6 guardrail tasks via a shared encoder — no LLM required.\
Thanks to zero-shot generalization, it can also handle custom labels outside the training taxonomy.

| Task | Type | Labels | Key Labels |
|------|------|--------|------------|
| **Safety** | single-label | 2 | `safe` `unsafe` |
| **PII / NER** | span extraction | 32 | `person` `email` `phone` `card_number` `address` |
| **Adversarial Detection** | multi-label | 15 | `jailbreak_persona` `prompt_injection` `instruction_override` `data_exfiltration` |
| **Harmful Content** | multi-label | 30 | `hate_speech` `violence` `child_exploitation` `fraud` `pii_exposure` |
| **Intent** | single-label | 13 | `informational` `adversarial` `threatening` `solicitation` |
| **Tone of Voice** | single-label | 10 | `neutral` `aggressive` `manipulative` `deceptive` |

<details>
<summary><b>Safety</b> — all 2 labels</summary>

Classifies whether a message is safe or unsafe. Single-label.
```python
SAFETY_LABELS = ["safe", "unsafe"]
```

| Label | Description |
|-------|-------------|
| `safe` | Message does not contain harmful or policy-violating content |
| `unsafe` | Message contains harmful, dangerous, or policy-violating content |

</details>

<details>
<summary><b>NER / PII</b> — all 32 entity types</summary>

Span extraction across 7 groups. Use labels from this list for best results — out-of-taxonomy labels may work via zero-shot generalization but are not benchmarked.

| Group | Labels |
|-------|--------|
| **Person** | `person` `first_name` `last_name` `alias` `title` |
| **Location** | `country` `region` `city` `district` `street` `building` `unit` `postal_code` `landmark` `address` |
| **Organization** | `company` `government` `education` `media` `product` |
| **Contact** | `email` `phone` `social_account` `messenger` |
| **Identity** | `passport` `national_id` `document_id` |
| **Temporal** | `date_of_birth` `event_date` |
| **Financial** | `card_number` `bank_account` `crypto_wallet` |
```python
PII_LABELS = [
    "person", "first_name", "last_name", "alias", "title",
    "country", "region", "city", "district", "street",
    "building", "unit", "postal_code", "landmark", "address",
    "company", "government", "education", "media", "product",
    "email", "phone", "social_account", "messenger",
    "passport", "national_id", "document_id",
    "date_of_birth", "event_date",
    "card_number", "bank_account", "crypto_wallet",
]
```
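
Downstream policies are often written per group rather than per label (for example, "always mask anything financial"). The sketch below mirrors the group table above as a dictionary and collapses an `entities` result into groups. `PII_GROUPS`, `LABEL_TO_GROUP`, and `group_entities` are hypothetical names introduced for illustration, and the `"other"` fallback for out-of-taxonomy labels is an assumption.

```python
# Groups and labels copied from the table above.
PII_GROUPS = {
    "person": ["person", "first_name", "last_name", "alias", "title"],
    "location": [
        "country", "region", "city", "district", "street",
        "building", "unit", "postal_code", "landmark", "address",
    ],
    "organization": ["company", "government", "education", "media", "product"],
    "contact": ["email", "phone", "social_account", "messenger"],
    "identity": ["passport", "national_id", "document_id"],
    "temporal": ["date_of_birth", "event_date"],
    "financial": ["card_number", "bank_account", "crypto_wallet"],
}

# Reverse lookup: fine-grained label -> group name.
LABEL_TO_GROUP = {
    label: group for group, labels in PII_GROUPS.items() for label in labels
}


def group_entities(entities: dict) -> dict:
    """Collapse {label: [spans]} into {group: [spans]}, skipping empty labels.

    Labels outside the taxonomy are collected under "other" (an assumption).
    """
    grouped = {}
    for label, spans in entities.items():
        if spans:
            group = LABEL_TO_GROUP.get(label, "other")
            grouped.setdefault(group, []).extend(spans)
    return grouped


grouped = group_entities(
    {"person": ["John Smith"], "email": ["john.smith@gmail.com"], "phone": []}
)
print(grouped)
```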

</details>

<details>
<summary><b>Adversarial Detection</b> — all 15 labels</summary>

Detects attacks against LLM-based systems. Multi-label: a single message can combine multiple attack vectors.

| Subgroup | Labels |
|----------|--------|
| **Jailbreak** | `jailbreak_persona` `jailbreak_hypothetical` `jailbreak_roleplay` |
| **Injection** | `prompt_injection` `indirect_prompt_injection` `instruction_override` |
| **Extraction** | `data_exfiltration` `system_prompt_extraction` `context_manipulation` `token_manipulation` |
| **Advanced** | `tool_abuse` `social_engineering` `multi_turn_escalation` `schema_poisoning` |
| **Clean** | `none` |
```python
ADVERSARIAL_LABELS = [
    "jailbreak_persona", "jailbreak_hypothetical", "jailbreak_roleplay",
    "prompt_injection", "indirect_prompt_injection", "instruction_override",
    "data_exfiltration", "system_prompt_extraction", "context_manipulation", "token_manipulation",
    "tool_abuse", "social_engineering", "multi_turn_escalation", "schema_poisoning",
    "none",
]
```
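
Because this task is multi-label, the schema has to allow more than one label per message. The sketch below is one way this might be wired up. It assumes the fork keeps the `multi_label` and `cls_threshold` parameters of upstream GLiNER2's `classification()`, and that a multi-label task returns a list of label strings; neither is confirmed for the bi-encoder branch. `build_attack_schema` and `detected_attacks` are hypothetical helpers.

```python
def build_attack_schema(model, labels, threshold=0.5):
    """Return a schema with one multi-label task named "adversarial".

    ASSUMPTION: `classification()` accepts `multi_label` and `cls_threshold`
    as in upstream GLiNER2; the bi-encoder fork may differ.
    """
    return model.create_schema().classification(
        task="adversarial",
        labels=labels,
        multi_label=True,
        cls_threshold=threshold,
    )


def detected_attacks(predicted):
    """Drop the "none" placeholder so an empty list means a clean message.

    ASSUMPTION: `predicted` is a list of label strings.
    """
    return [label for label in predicted if label != "none"]
```

With a loaded model, this would be called as `build_attack_schema(model, ADVERSARIAL_LABELS)` and the resulting schema passed to `model.extract(text, schema=schema)`, as in the Usage section.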

</details>

<details>
<summary><b>Harmful Content</b> — all 30 labels</summary>

Detects harmful content categories. Multi-label: a message can belong to multiple categories simultaneously.

| Subgroup | Labels |
|----------|--------|
| **Interpersonal** | `harassment` `hate_speech` `discrimination` `doxxing` `bullying` |
| **Violence & Danger** | `violence` `dangerous_instructions` `weapons` `drugs` `self_harm` |
| **Sexual & Exploitation** | `sexual_content` `child_exploitation` `grooming` `sextortion` |
| **Deception** | `fraud` `scam` `social_engineering` `impersonation` |
| **Sensitive Topics** | `profanity` `extremism` `political` `war` `espionage` `cybersecurity` `religious` `lgbt` |
| **Information** | `misinformation` `copyright_violation` `pii_exposure` |
| **Clean** | `none` |
```python
HARMFUL_LABELS = [
    "harassment", "hate_speech", "discrimination", "doxxing", "bullying",
    "violence", "dangerous_instructions", "weapons", "drugs", "self_harm",
    "sexual_content", "child_exploitation", "grooming", "sextortion",
    "fraud", "scam", "social_engineering", "impersonation",
    "profanity", "extremism", "political", "war", "espionage", "cybersecurity", "religious", "lgbt",
    "misinformation", "copyright_violation", "pii_exposure",
    "none",
]
```

</details>

<details>
<summary><b>Intent</b> — all 13 labels</summary>

Classifies the intent behind a message. Single-label.

| Group | Labels |
|-------|--------|
| **Benign** | `informational` `instructional` `conversational` `persuasive` `creative` `transactional` `emotional_support` `testing` |
| **Ambiguous** | `ambiguous` `extractive` |
| **Malicious** | `adversarial` `threatening` `solicitation` |
```python
INTENT_LABELS = [
    "informational", "instructional", "conversational", "persuasive",
    "creative", "transactional", "emotional_support", "testing",
    "ambiguous", "extractive",
    "adversarial", "threatening", "solicitation",
]
```
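
The three groups in the table map naturally onto routing decisions. The sketch below is an illustrative policy only: the `allow` / `review` / `block` actions and the names `INTENT_BUCKETS`, `BUCKET_ACTION`, and `route_intent` are assumptions, not something the model provides.

```python
# Buckets and labels copied from the table above.
INTENT_BUCKETS = {
    "benign": [
        "informational", "instructional", "conversational", "persuasive",
        "creative", "transactional", "emotional_support", "testing",
    ],
    "ambiguous": ["ambiguous", "extractive"],
    "malicious": ["adversarial", "threatening", "solicitation"],
}

# Hypothetical policy: what the application does for each bucket.
BUCKET_ACTION = {"benign": "allow", "ambiguous": "review", "malicious": "block"}

INTENT_TO_BUCKET = {
    label: bucket for bucket, labels in INTENT_BUCKETS.items() for label in labels
}


def route_intent(label: str) -> str:
    """Return the action for a predicted intent label.

    Labels outside the taxonomy fall back to "review".
    """
    bucket = INTENT_TO_BUCKET.get(label)
    return BUCKET_ACTION.get(bucket, "review")


print(route_intent("informational"))
print(route_intent("threatening"))
```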

</details>

<details>
<summary><b>Tone of Voice</b> — all 10 labels</summary>

Classifies the tone of a message. Single-label.

| Label | Description |
|-------|-------------|
| `neutral` | Matter-of-fact, no strong emotional coloring |
| `formal` | Professional or official register |
| `humorous` | Playful, joking, or light-hearted |
| `sarcastic` | Ironic or mocking tone |
| `distressed` | Anxious, upset, or overwhelmed |
| `confused` | Unclear intent, disoriented phrasing |
| `pleading` | Urgent requests, begging for help or compliance |
| `aggressive` | Hostile, confrontational, or threatening |
| `manipulative` | Attempts to exploit, deceive, or coerce |
| `deceptive` | Deliberately misleading or false framing |
```python
TOV_LABELS = [
    "neutral", "formal", "humorous", "sarcastic",
    "distressed", "confused", "pleading",
    "aggressive", "manipulative", "deceptive",
]
```

</details>