# Threat Model: Explainable IDS

## 1. System Description

An ML-based Intrusion Detection System (IDS) monitors network traffic and classifies connections as Normal or one of four attack categories (DoS, Probe, R2L, U2R). The system uses post-hoc explainability methods (SHAP, LIME) to provide security analysts with interpretable justifications for each alert.

## 2. Assets Under Protection

| Asset | Value | Sensitivity |
|-------|-------|-------------|
| Network integrity | High | Disruption → service outage |
| IDS model parameters | Medium | Leak → evasion knowledge |
| SHAP/LIME explanations | Medium | Leak → feature manipulation strategy |
| Training data statistics | Low-Medium | Leak → distribution knowledge for crafting attacks |

## 3. Adversary Profiles

### Adversary A: Network Attacker (External)
- **Goal**: Bypass IDS detection by sending malicious traffic that is classified as "Normal"
- **Capabilities**: Can craft and modify network packets (control over src_bytes, dst_bytes, duration, protocol, count, etc.)
- **Knowledge**: Black-box (no model access) or Grey-box (knows model type + feature set)
- **Constraints**: Cannot modify all features; some are protocol-determined (e.g., protocol_type, flag) or network-infrastructure-bound (e.g., dst_host_count depends on actual connections)

### Adversary B: Explanation Exploiter (Internal/External)
- **Goal**: Use SHAP/LIME output to learn which features the model relies on, then craft evasion attacks
- **Capabilities**: Can query the model and observe explanations (e.g., deployed as analyst dashboard)
- **Knowledge**: White-box on explanations, grey-box on model
- **Attack**: Query with diverse inputs → aggregate SHAP values → identify top features → manipulate those features in attack traffic
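
The aggregation step in Adversary B's attack can be sketched as follows, which is also how a defender might estimate what a given query budget leaks. This is a minimal illustration using only the Python standard library: the function name `rank_features_by_mean_abs_attribution` and the attribution values are hypothetical, and each explanation is assumed to arrive as a feature-to-attribution mapping rather than in any particular SHAP library format.

```python
from collections import defaultdict


def rank_features_by_mean_abs_attribution(explanations, top_k=3):
    """Rank features by mean absolute attribution across observed explanations.

    `explanations` is a list of dicts mapping feature name -> attribution
    value (assumed format; one dict per queried sample).
    """
    if not explanations:
        return []
    totals = defaultdict(float)
    for explanation in explanations:
        for feature, value in explanation.items():
            # Use the magnitude: a strongly negative attribution is as
            # informative to an attacker as a strongly positive one.
            totals[feature] += abs(value)
    n = len(explanations)
    means = {feature: total / n for feature, total in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical attributions observed for three queried DoS samples.
observed = [
    {"serror_rate": 0.50, "count": 0.30, "src_bytes": -0.05},
    {"serror_rate": 0.40, "count": 0.35, "src_bytes": 0.10},
    {"serror_rate": 0.45, "count": 0.25, "src_bytes": -0.02},
]
top = rank_features_by_mean_abs_attribution(observed, top_k=2)
print(top)
```

Running the same aggregation over logged explanation queries gives a rough view of how quickly the model's top features become visible to a persistent querier.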

### Adversary C: Training Data Poisoner (Supply Chain)
- **Goal**: Insert backdoor so model shows clean explanations but misclassifies triggered inputs
- **Capabilities**: Can inject samples into training set
- **Relevance**: Even explanations can be fooled if the model itself is compromised (Baniecki et al., 2022)

## 4. Feature Manipulability Analysis

This analysis is critical for realistic adversarial evaluation: not all 41 NSL-KDD features can be freely modified by an attacker.

| Feature Category | Manipulable? | Examples | Justification |
|-----------------|-------------|----------|---------------|
| **Packet content** | ✅ Yes | `src_bytes`, `dst_bytes`, `hot`, `num_failed_logins` | Attacker controls payload |
| **Connection behavior** | ⚠️ Partially | `duration`, `count`, `srv_count` | Attacker can slow/speed connections but within limits |
| **Protocol fields** | ⚠️ Constrained | `protocol_type`, `flag` | Must be valid TCP/UDP/ICMP; flag must match connection state |
| **Network statistics** | ❌ No | `dst_host_count`, `dst_host_srv_count` | Aggregated by IDS sensor, not attacker-controlled |
| **Error rates** | ⚠️ Partially | `serror_rate`, `rerror_rate` | Attacker can trigger errors but rates depend on overall traffic |

**Implication for SHAP/LIME**: If the model relies heavily on non-manipulable features (dst_host_count, dst_host_same_srv_rate), it is more robust against evasion. If it relies on manipulable features (src_bytes, duration), evasion is easier.
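
One way to make this implication measurable is to compute how much of the model's total attribution mass falls into each manipulability category. The sketch below is an assumption-laden illustration: the `MANIPULABILITY` mapping transcribes the table above (plus `dst_host_same_srv_rate`, treated as non-manipulable per the preceding paragraph), while the function name and the importance values are hypothetical.

```python
# Manipulability labels transcribed from the table above. Features not
# listed here are reported under "unknown" rather than guessed.
MANIPULABILITY = {
    "src_bytes": "yes",
    "dst_bytes": "yes",
    "hot": "yes",
    "num_failed_logins": "yes",
    "duration": "partial",
    "count": "partial",
    "srv_count": "partial",
    "serror_rate": "partial",
    "rerror_rate": "partial",
    "protocol_type": "constrained",
    "flag": "constrained",
    "dst_host_count": "no",
    "dst_host_srv_count": "no",
    "dst_host_same_srv_rate": "no",
}


def attribution_share_by_manipulability(importances):
    """Return the fraction of total |importance| per manipulability category."""
    total = sum(abs(value) for value in importances.values())
    if total == 0:
        return {}
    shares = {}
    for feature, value in importances.items():
        category = MANIPULABILITY.get(feature, "unknown")
        shares[category] = shares.get(category, 0.0) + abs(value) / total
    return shares


# Hypothetical global importances (e.g., mean |SHAP| per feature).
importances = {
    "src_bytes": 0.4,
    "duration": 0.2,
    "dst_host_count": 0.3,
    "dst_host_same_srv_rate": 0.1,
}
shares = attribution_share_by_manipulability(importances)
print(shares)
```

A large share under `"yes"` would suggest the model is comparatively easy to evade; a large share under `"no"` would suggest the opposite. The categories are kept separate rather than collapsed into a single score, since any numeric weighting of "partial" or "constrained" would be an extra assumption.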

## 5. Attack Scenarios

### Scenario 1: Evasion via Explanation Leakage
1. Attacker queries IDS explanation API with known attack samples
2. SHAP reveals `serror_rate` (weight=0.45) and `count` (weight=0.32) are top features for DoS detection
3. Attacker crafts DoS traffic with low serror_rate (connection completion spoofing) and varied count
4. IDS misclassifies as Normal

### Scenario 2: LIME Instability Exploitation
1. LIME produces different top features for the same input across runs (stochastic)
2. Analyst sees Feature A as top in run 1, Feature B in run 2
3. Inconsistent investigation → missed detections or wasted resources
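
The instability described in this scenario can be quantified by explaining the same input several times and comparing the resulting top-k feature sets. The sketch below uses mean pairwise Jaccard similarity as the stability score; this metric choice, the function names, and the example attributions are assumptions for illustration, and the explanations are represented as plain feature-to-weight dicts rather than LIME's own output objects.

```python
from itertools import combinations


def top_k_features(explanation, k):
    """Return the set of the k features with the largest |weight|."""
    ranked = sorted(explanation.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return {feature for feature, _ in ranked[:k]}


def mean_pairwise_jaccard(runs, k=3):
    """Mean Jaccard similarity of top-k feature sets over all pairs of runs.

    1.0 means every run agrees on the top-k set; values near 0 mean the
    runs rarely agree.
    """
    tops = [top_k_features(run, k) for run in runs]
    pairs = list(combinations(tops, 2))
    if not pairs:
        return 1.0  # zero or one run: nothing to disagree with
    scores = []
    for a, b in pairs:
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical explanations of the SAME input from three separate runs.
runs = [
    {"serror_rate": 0.40, "count": 0.30, "src_bytes": 0.10},
    {"serror_rate": 0.35, "count": 0.10, "src_bytes": 0.30},
    {"serror_rate": 0.42, "count": 0.28, "src_bytes": 0.12},
]
stability = mean_pairwise_jaccard(runs, k=2)
print(round(stability, 3))
```

A deployment could flag alerts whose explanation stability falls below a chosen threshold, so analysts know when the top features should not be taken at face value.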

### Scenario 3: Backdoor with Clean Explanations
1. Poisoned training data contains trigger pattern (e.g., specific src_bytes + service combination)
2. Model correctly classifies and explains normal traffic
3. On triggered inputs: misclassifies as Normal AND SHAP shows plausible benign features
4. Analyst trusts explanation → attack goes undetected

## 6. Security Requirements

| Requirement | Priority | Mitigation |
|-------------|----------|------------|
| Explanation access control | High | Rate-limit explanation API, log queries |
| Explanation consistency | High | Prefer deterministic SHAP variants (e.g., TreeSHAP) over LIME for critical decisions; sampling-based SHAP (e.g., KernelSHAP) can also vary across runs |
| Model integrity verification | Medium | Track training data provenance, validate model fingerprints |
| Robust feature reliance | Medium | Verify model doesn't over-rely on manipulable features |
| Defense-in-depth | High | Explanations supplement (don't replace) rule-based IDS |
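
As an illustration of the "Explanation access control" row, the following is a minimal sketch of a per-client sliding-window rate limiter that logs every explanation query. The class name `ExplanationRateLimiter`, the logger name, and the limits used in the example are all hypothetical; a real deployment would likely enforce this at an API gateway and persist the logs.

```python
import logging
import time
from collections import defaultdict, deque

logger = logging.getLogger("explanation_api")  # hypothetical logger name


class ExplanationRateLimiter:
    """Allow at most `max_queries` explanation requests per client per window."""

    def __init__(self, max_queries, window_seconds, clock=time.monotonic):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._history = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id):
        now = self.clock()
        history = self._history[client_id]
        # Drop timestamps that have aged out of the sliding window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        allowed = len(history) < self.max_queries
        if allowed:
            history.append(now)
        # Log every attempt, including denied ones, for later auditing.
        logger.info(
            "explanation query client=%s allowed=%s recent=%d",
            client_id, allowed, len(history),
        )
        return allowed


# Example: two explanation queries per client per 10-second window.
limiter = ExplanationRateLimiter(max_queries=2, window_seconds=10.0)
decisions = [limiter.allow("analyst-1") for _ in range(3)]
print(decisions)
```

Logging denied requests as well as allowed ones matters here: a burst of rejected queries from one client is itself a signal of the aggregation attack described for Adversary B.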

## 7. Assumptions & Scope

- NSL-KDD is a benchmark dataset; real deployment would require domain-specific feature analysis
- We evaluate post-hoc explainability only (not inherently interpretable models)
- We focus on explanation reliability, not adversarial robustness of the classifier itself (that's Project 1)