Reward Design: Multi-Objective Optimization
The Adaptive AI Firewall environment uses a weighted, multi-objective reward function that steers the agent toward a balance of security efficacy, network availability, resource efficiency, and timeliness.
The Reward Equation
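The total scalar reward $R$ for any action is calculated as the weighted sum of the four component rewards:

$$R = \alpha R_{\text{security}} + \beta R_{\text{availability}} + \gamma R_{\text{efficiency}} + \delta R_{\text{timeliness}}$$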
Default Weights
| Component | Weight | Responsibility |
|---|---|---|
| $\alpha$ | 0.35 | Security Efficacy (Catching threats) |
| $\beta$ | 0.30 | Network Availability (Avoiding False Positives) |
| $\gamma$ | 0.20 | Resource Efficiency (Budget management) |
| $\delta$ | 0.15 | Timeliness (Stopping attacks early) |
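The weighting can be sketched in a few lines of Python. This is a minimal illustration that assumes the total is a plain weighted sum of the four components, which is consistent with the worked examples further down. The names `RewardComponents`, `WEIGHTS`, and `total_reward` are hypothetical and are not taken from the environment's code.

```python
from dataclasses import dataclass

# Default weights from the table above (alpha, beta, gamma, delta).
WEIGHTS = {
    "security": 0.35,
    "availability": 0.30,
    "efficiency": 0.20,
    "timeliness": 0.15,
}


@dataclass(frozen=True)
class RewardComponents:
    """Hypothetical container for the four per-action component rewards."""
    security: float = 0.0
    availability: float = 0.0
    efficiency: float = 0.0
    timeliness: float = 0.0


def total_reward(c: RewardComponents, weights: dict = WEIGHTS) -> float:
    """Combine the components into one scalar (assumed weighted sum)."""
    return (
        weights["security"] * c.security
        + weights["availability"] * c.availability
        + weights["efficiency"] * c.efficiency
        + weights["timeliness"] * c.timeliness
    )


# Blocking an attack at phase 0, using the component values from the worked examples.
early_block = RewardComponents(security=1.0, efficiency=-0.005, timeliness=1.0)
print(round(total_reward(early_block), 3))  # 0.499
```

Because the default weights sum to 1.0, the total stays on the same scale as the individual components.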
Reward Components
1. Security ($R_{\text{security}}$)
- Block Malicious: $+1.0$ (Successfully stopped a threat).
- Miss Malicious: $-2.0$ (Failed to block an attack; high penalty).
- Inspect Malicious: $+0.15$ (Correct identification, though not yet stopped).
- Inspect Benign: $-0.5$ (Unnecessary inspection).
2. Availability ($R_{\text{availability}}$)
- Allow Benign: $+0.25$ (Maintaining network flow).
- Block Benign (FP): $-1.2$ (Significant penalty for disrupting legitimate users).
- Rate Limit Benign: $-0.4$ (Milder penalty for "gray" actions).
- Inspect Benign: $-0.15$ (Unnecessary latency added).
3. Efficiency ($R_{\text{efficiency}}$)
- Cost: Calculated as $\text{latency} + \text{compute}$ for each action.
- Scaling: Penalized relative to remaining budget: $R_{\text{efficiency}} = -\frac{\text{cost}}{\max(\text{budget\_remaining},\ 0.1)}$.
- This creates Strategic Pressure: actions become "more expensive" as the budget depletes.
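The budget scaling is easiest to see numerically. The sketch below implements the formula as stated, with the `0.1` floor on the denominator. The function name and the sample latency, compute, and budget values are illustrative assumptions, since the source does not give units or typical magnitudes.

```python
def efficiency_reward(latency: float, compute: float, budget_remaining: float) -> float:
    """Efficiency penalty: -(latency + compute) / max(budget_remaining, 0.1)."""
    cost = latency + compute
    # The 0.1 floor keeps the penalty bounded when the budget is nearly exhausted.
    return -cost / max(budget_remaining, 0.1)


# Same action cost (hypothetical values) evaluated at shrinking budgets.
for budget in (10.0, 1.0, 0.5, 0.05):
    print(budget, efficiency_reward(latency=0.03, compute=0.02, budget_remaining=budget))
```

With these sample numbers the same action costs 0.005 at a budget of 10 but 0.5 once the budget drops below the floor, a hundredfold increase.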
4. Timeliness ($R_{\text{timeliness}}$)
- Early Detection: $+e^{-\text{phase}}$, where `phase` is the attacker's progress in the kill chain (0 to 4).
- Incentive: Stopping an attack at Phase 0 is significantly more rewarding than at Phase 3.
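The exponential decay can be tabulated directly. This is a small sketch: the function name `timeliness_reward` is hypothetical, and the range check simply encodes the 0 to 4 kill-chain range stated above.

```python
import math


def timeliness_reward(phase: int) -> float:
    """Timeliness bonus e^(-phase) for stopping an attack at the given kill-chain phase."""
    if not 0 <= phase <= 4:
        raise ValueError("phase must be between 0 and 4")
    return math.exp(-phase)


for phase in range(5):
    print(phase, round(timeliness_reward(phase), 3))
# 0 1.0
# 1 0.368
# 2 0.135
# 3 0.05
# 4 0.018
```

A stop at Phase 0 earns roughly 20 times the timeliness bonus of a stop at Phase 3, since $e^{3} \approx 20.1$.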
Worked Examples
| Scenario | Action | Security | Availability | Efficiency | Timeliness | Total Reward |
|---|---|---|---|---|---|---|
| Legitimate User | ALLOW | $0.0$ | $+0.25$ | $0.0$ | $0.0$ | $+0.075$ |
| Early Attack (Phase 0) | BLOCK | $+1.0$ | $0.0$ | $-0.005$ | $+1.0$ | $+0.499$ |
| Late Attack (Phase 3) | BLOCK | $+1.0$ | $0.0$ | $-0.005$ | $+0.05$ | $+0.357$ |
| False Positive | BLOCK | $0.0$ | $-1.2$ | $-0.005$ | $0.0$ | $-0.361$ |
| Missed Attack | ALLOW | $-2.0$ | $0.0$ | $0.0$ | $0.0$ | $-0.700$ |
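As a check on one row, and assuming the total is the weighted sum of the four components with the default weights, the Late Attack (Phase 3) total works out as follows:

$$
\begin{aligned}
R &= \alpha R_{\text{security}} + \beta R_{\text{availability}} + \gamma R_{\text{efficiency}} + \delta R_{\text{timeliness}} \\
  &= 0.35(1.0) + 0.30(0.0) + 0.20(-0.005) + 0.15(0.05) \\
  &= 0.35 + 0 - 0.001 + 0.0075 \\
  &= 0.3565 \approx 0.357
\end{aligned}
$$

The table rounds the timeliness term $e^{-3} \approx 0.0498$ to $0.05$ before weighting. With the unrounded value the total is about $0.3565$ rounded down, that is $0.356$, so the last digit depends on where the rounding happens.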
Anti-Degeneracy Controls
To prevent agents from learning "lazy" policies (like blocking everything or allowing everything), the environment implements:
- Reward Balancing: The ratio of Miss Penalty to FP Penalty is tuned (~2.3:1) so that on a typical 80/20 traffic mix, a `block_all` policy yields a negative total reward.
- Pass/Fail Constraints: Graders in `graders.py` require a minimum detection rate AND a minimum availability rate to pass a task, regardless of the scalar reward.
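The reward-balancing claim can be checked with a back-of-the-envelope calculation. The sketch below rests on several assumptions that the source does not state: "80/20" is read as 80% benign and 20% malicious traffic, every BLOCK has the fixed efficiency term of $-0.005$ used in the worked examples, ALLOW has zero cost, and every attack is caught at Phase 0, which is the most favorable case for blocking. All function names are hypothetical.

```python
import math

# Default weights and component values taken from this document.
W_SEC, W_AVAIL, W_EFF, W_TIME = 0.35, 0.30, 0.20, 0.15
BLOCK_EFFICIENCY = -0.005  # assumed fixed, as in the worked examples


def block_reward(malicious: bool, phase: int = 0) -> float:
    """Total reward for BLOCK on one flow."""
    eff = W_EFF * BLOCK_EFFICIENCY
    if malicious:
        return W_SEC * 1.0 + eff + W_TIME * math.exp(-phase)
    return W_AVAIL * -1.2 + eff  # false positive


def allow_reward(malicious: bool) -> float:
    """Total reward for ALLOW on one flow (zero cost assumed)."""
    return W_SEC * -2.0 if malicious else W_AVAIL * 0.25


def expected_reward(step_reward, malicious_fraction: float = 0.2) -> float:
    """Average per-step reward of a constant policy on the assumed traffic mix."""
    return (malicious_fraction * step_reward(True)
            + (1.0 - malicious_fraction) * step_reward(False))


print("block_all:", round(expected_reward(block_reward), 3))
print("allow_all:", round(expected_reward(allow_reward), 3))
```

Under these assumptions both constant policies average a negative per-step reward (about $-0.189$ for blocking everything and about $-0.080$ for allowing everything), so neither is attractive compared with a policy that discriminates between flows.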