File size: 2,996 Bytes
2129c29
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
# NLProxy Firewall Module Reference

This document describes the firewall and prompt policy enforcement module in `firewall/firewall.py`.

## Purpose

The firewall module defends NLProxy against malicious prompt patterns and injection attacks. It combines deterministic regex detection with optional semantic similarity checks.

## Key Components

### `FirewallAction`

Enumerated actions:

- `ALLOW` — allow the prompt unchanged.
- `ALERT` — allow but log a security warning.
- `REWRITE` — sanitize or rewrite the prompt.
- `BLOCK` — reject the request.

### `SeverityLevel`

Severity classifications:

- `LOW`
- `MEDIUM`
- `HIGH`
- `CRITICAL`

### `FirewallRule`

Immutable dataclass with:

- `name: str`
- `pattern: str`
- `action: FirewallAction`
- `severity: SeverityLevel`
- `description: Optional[str]`

### `compile_rule_patterns(rules)`

- Compiles regex rules ahead of runtime.
- Complexity: O(|R| · m), where |R| is rule count.
- Returns structured objects for fast matching.

### `resolve_conflicting_actions(actions)`

- Resolves multiple matching rules.
- Uses priority order: `BLOCK > REWRITE > ALERT > ALLOW`.

### `PromptFirewall`

#### Responsibilities

- Validates incoming prompt text before compression.
- Applies regex matches against a curated rule set.
- Optionally uses semantic attack corpus embeddings.
- Provides `check_prompt()` and `rewrite_prompt()` to callers.

#### Initialization

```python
PromptFirewall(
    regex_rules=[...],
    semantic_config=SEMANTIC_FIREWALL_CONFIG,
    default_mode="block",
    models_dir=Path("nlproxy") / "models",
)
```

#### Features

- Defaults to blocking high-risk jailbreak patterns.
- Supports semantic corpus-based detection of prompt injection.
- Uses a shared singleton cache for corpus embeddings.
- Maintains a clear audit trail via rule names and severities.

## Default Rules

Included default rules cover:

- Classic jailbreak phrases like "ignore all previous instructions"
- System prompt exfiltration attempts
- Privilege escalation requests
- Data exfiltration requests
- SQL injection-like patterns
- Potential token leakage patterns

## Semantic Firewall Configuration

`SEMANTIC_FIREWALL_CONFIG` defines:

- `enabled`: false by default
- `model_name`: `all-MiniLM-L6-v2`
- `similarity_threshold`: `0.85`
- `attack_corpus`: curated attacker phrase samples
- `device_preference`: `cpu`

## Dependencies

- `numpy`
- Optional: `sentence_transformers`

## Performance and Scalability

- Regex-only evaluation is fast and deterministic.
- Semantic checking adds vector embed cost, so it should be reserved for high-security deployments.
- The module is safe to call per-request and is optimized for prompt lengths typical of LLM inputs.

## Edge Cases

- If semantic model loading fails, the module should not block standard regex detection.
- When multiple rules match, the most restrictive action is enforced.
- Rewrite behavior should preserve prompt structure and avoid introducing new attack surface.