---
language: en
license: apache-2.0
tags:
  - waf
  - web-security
  - onnx
  - multi-label-classification
  - low-latency
library_name: onnxruntime
pipeline_tag: text-classification
---

# Argus Sentinel — WAF ML Classifier (V3)

Production-grade Web Application Firewall classifier. Detects 6 attack types in HTTP requests with sub-millisecond latency on CPU.

**Key metrics (test_realistic — production-like distribution, 94% clean):**
- Macro F1: **0.866** | FPR: **0.83%** | Mean attack recall: **0.889** | Latency: **0.24ms**

---

## Model Overview

| Property | Value |
|---|---|
| Architecture | CNN text encoder + numeric features fusion |
| Parameters | 1.17M |
| Vocab size | 8,192 (BPE ByteLevel) |
| Max sequence length | 128 tokens |
| ONNX model size | 4.5 MB (FP32) / 1.2 MB (INT8) |
| Inference latency | **0.24 ms** avg (CPU, single thread) |
| Training loss | Focal BCEWithLogitsLoss (gamma=2.0) |
| Best epoch | 3 / 8 (early stopping, selected on Macro F1) |

---

## Architecture

```
HTTP Request Text
     |
     v
[BPE Tokenizer (vocab=8192, max_len=128)]
     |
     +---> [Embedding (128-dim)]
     |           |
     |     [Conv1D (128 ch, k=3) + BatchNorm + ReLU] x2
     |           |
     |     [AdaptiveMaxPool1d → 128-dim]
     |
     +---> [6 Numeric Features]
               |
         [Linear 6→32 + ReLU]
               |
         [Concatenate (128 + 32 = 160)]
               |
         [Linear 160→128→64 + ReLU + Dropout(0.1)]
               |
     +---------+---------+
     |                   |
[Label Head → 7]   [Risk Head → 1]
     |                   |
[Sigmoid]            [Sigmoid]
     |                   |
label_probs [7]    risk_score [1]
```

---

## Tokenizer Specification

| Property | Value |
|---|---|
| **Type** | BPE (Byte-Pair Encoding) via HuggingFace `tokenizers` library |
| **Algorithm** | `ByteLevel` BPE — operates on UTF-8 bytes, not characters |
| **Pre-tokenizer** | `ByteLevel` (add_prefix_space=false, trim_offsets=true, use_regex=true) |
| **Normalizer** | None (raw bytes, no lowercasing or unicode normalization) |
| **Post-processor** | `TemplateProcessing` — prepends `[CLS]` token automatically |
| **Vocab size** | 8,192 tokens (7,933 merges + 3 special tokens + 256 byte tokens) |
| **Special tokens** | `[PAD]` (id=0), `[UNK]` (id=1), `[CLS]` (id=2) |
| **Max length** | 128 tokens (truncation=Right, padding=Right to fixed 128) |
| **Byte fallback** | false — unknown bytes map to `[UNK]` |
| **File** | `tokenizer.json` (HuggingFace tokenizers JSON format) |

**Input text construction**: `"{method} {path}?{query} {body[:200]}"` — capped at 500 chars before tokenization.

```python
from tokenizers import Tokenizer
tok = Tokenizer.from_file("tokenizer.json")
result = tok.encode("GET /search?q=test HTTP/1.1")
# result.ids → [2, 546, 287, ...]  (starts with [CLS]=2)
# result.attention_mask → [1, 1, 1, ...]
```

---

## Label ID Mapping (CRITICAL)

Output `label_probs` tensor shape: `[batch, 7]`. Each index maps to:

| Index | Label | Description |
|---|---|---|
| **0** | `clean` | Benign / legitimate request |
| **1** | `xss` | Cross-Site Scripting |
| **2** | `sqli` | SQL Injection |
| **3** | `path_traversal` | Directory / Path Traversal |
| **4** | `command_injection` | OS Command Injection |
| **5** | `scanner` | Vulnerability scanner / probe |
| **6** | `spam_bot` | Spam bot / automated abuse |

> **Multi-label**: Labels are NOT mutually exclusive. Multiple labels can be active simultaneously (e.g., index 2 + 5 = scanner performing SQLi). Exception: `clean` (index 0) is exclusive — if clean=1, all others must be 0.
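
The exclusivity rule can be enforced when decoding a row of `label_probs`. The following is a minimal sketch, not part of the released code: `decode_labels` is a hypothetical helper, the thresholds are copied from the Per-Label Thresholds table, and it assumes that any attack label crossing its threshold overrides `clean` (the same convention as `is_clean` in the Usage example).

```python
LABELS = ["clean", "xss", "sqli", "path_traversal",
          "command_injection", "scanner", "spam_bot"]

# Thresholds copied from the Per-Label Thresholds table in this card.
THRESHOLDS = {"clean": 0.20, "xss": 0.50, "sqli": 0.74, "path_traversal": 0.68,
              "command_injection": 0.66, "scanner": 0.70, "spam_bot": 0.72}

def decode_labels(probs):
    """Map one row of label_probs (7 floats) to a list of active label names."""
    if len(probs) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} probabilities, got {len(probs)}")
    # Index 0 is `clean`; only indices 1..6 are attack labels.
    attacks = [name for name, p in zip(LABELS[1:], probs[1:])
               if p >= THRESHOLDS[name]]
    # Assumption: any attack label overrides `clean`, which keeps `clean`
    # exclusive. With no attack label active the request is reported as clean.
    return attacks if attacks else ["clean"]

print(decode_labels([0.03, 0.10, 0.91, 0.05, 0.02, 0.88, 0.01]))
# ['sqli', 'scanner']
```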

---

## ONNX Inputs

| Name | Shape | Dtype | Description |
|---|---|---|---|
| `input_ids` | `[batch, 128]` | `int32` | BPE token IDs from `tokenizer.json` |
| `attention_mask` | `[batch, 128]` | `int32` | 1 for real tokens, 0 for `[PAD]` |
| `numeric_features` | `[batch, 6]` | `float32` | Request-level features (**RAW values**, see below) |

## Numeric Features — Normalization Parameters (CRITICAL)

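**Features are passed as RAW values.** The model was trained on unnormalized features, so pass the same raw scale at inference. The Mean and Std columns below describe the training distribution and are listed for reference only; do not standardize inputs with them.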

| Index | Feature | Computation | Training Range | Mean | Std |
|---|---|---|---|---|---|
| **0** | `content_length` | `len(body)` if body else `0` | 0 – 662 | 35.2 | 63.5 |
| **1** | `num_headers` | `len(headers_dict)` | 3 – 13 | 7.7 | 1.3 |
| **2** | `has_body` | `1.0` if body present, else `0.0` | 0 – 1 | 0.42 | 0.49 |
| **3** | `session_request_count` | Total requests in session, or `0` | 0 – 20 | 3.0 | 6.0 |
| **4** | `session_duration` | Session time span in seconds, or `0` | 0 – 4,965,381 | 619,322 | 1,198,151 |
| **5** | `session_pattern_score` | Behavioral pattern score, or `0` | 0 – 0.5 | 0.09 | 0.18 |

**Python**:
```python
def extract_numeric_features(request: dict) -> list[float]:
    body = request.get("body") or ""
    headers = request.get("headers") or {}
    return [
        float(len(body)),                                    # content_length
        float(len(headers)),                                 # num_headers
        1.0 if body else 0.0,                                # has_body
        float(request.get("session_request_count") or 0),    # session_request_count
        float(request.get("session_duration") or 0),         # session_duration
        float(request.get("session_pattern_score") or 0),    # session_pattern_score
    ]
```

**Rust**:
```rust
fn extract_numeric_features(request: &HttpRequest) -> [f32; 6] {
    let body_len = request.body.as_ref().map_or(0, |b| b.len());
    [
        body_len as f32,
        request.headers.len() as f32,
        if body_len > 0 { 1.0 } else { 0.0 },
        request.session_request_count.unwrap_or(0) as f32,
        request.session_duration.unwrap_or(0.0),
        request.session_pattern_score.unwrap_or(0.0),
    ]
}
```

> If you don't have session data, pass `[content_length, num_headers, has_body, 0.0, 0.0, 0.0]` — ~79% of training examples had null session features.
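
Because the features are fed in raw, values far outside the training ranges are out of distribution for the model. The sketch below flags such values so they can be logged; it is an illustration only. `out_of_range_features` is a hypothetical helper, the ranges are copied from the table above, and what to do with a flagged request is left to the deployer. It deliberately does not clip the values, since the card says to pass raw values.

```python
# (low, high) training ranges copied from the numeric features table.
TRAINING_RANGES = {
    "content_length": (0.0, 662.0),
    "num_headers": (3.0, 13.0),
    "has_body": (0.0, 1.0),
    "session_request_count": (0.0, 20.0),
    "session_duration": (0.0, 4_965_381.0),
    "session_pattern_score": (0.0, 0.5),
}

def out_of_range_features(features):
    """Return the names of features whose raw value is outside the training range."""
    names = list(TRAINING_RANGES)  # dict order matches the feature index order
    if len(features) != len(names):
        raise ValueError(f"expected {len(names)} features, got {len(features)}")
    flagged = []
    for name, value in zip(names, features):
        low, high = TRAINING_RANGES[name]
        if not (low <= value <= high):
            flagged.append(name)
    return flagged

# A 2 KB body exceeds the largest content_length seen in training (662).
print(out_of_range_features([2048.0, 7.0, 1.0, 0.0, 0.0, 0.0]))
# ['content_length']
```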

## ONNX Outputs

| Name | Shape | Dtype | Description |
|---|---|---|---|
| `label_probs` | `[batch, 7]` | `float32` | Per-label probabilities after sigmoid |
| `risk_score` | `[batch, 1]` | `float32` | Aggregate risk score in the range [0, 1] |

---

## Per-Label Thresholds (CRITICAL for deployment)

Do NOT use a default 0.5 threshold for all labels. Use these optimized thresholds from `thresholds.json`:

| Label | Threshold | Recall | Precision | F1 |
|-------|-----------|--------|-----------|-----|
| clean | **0.20** | 0.998 | 0.992 | 0.995 |
| xss | **0.50** | 0.951 | 0.585 | 0.724 |
| sqli | **0.74** | 0.732 | 0.940 | 0.823 |
| path_traversal | **0.68** | 0.896 | 0.794 | 0.842 |
| command_injection | **0.66** | 0.826 | 0.626 | 0.712 |
| scanner | **0.70** | 0.980 | 0.945 | 0.962 |
| spam_bot | **0.72** | 1.000 | 1.000 | 1.000 |

---

## Performance

### Production-Like (test_realistic — 25,000 examples, 94% clean)

| Metric | Value |
|---|---|
| **Macro F1** | **0.866** |
| **FPR on clean** | **0.83%** |
| **Mean attack recall** | **0.889** |

| Label | Recall | Precision | F1 |
|-------|--------|-----------|-----|
| clean | 0.998 | 0.992 | 0.995 |
| xss | 0.951 | 0.585 | 0.724 |
| sqli | 0.732 | 0.940 | 0.823 |
| path_traversal | 0.896 | 0.794 | 0.842 |
| command_injection | 0.826 | 0.626 | 0.712 |
| scanner | 0.980 | 0.945 | 0.962 |
| spam_bot | 1.000 | 1.000 | 1.000 |
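
The per-label rows can be tied back to the headline Macro F1 with a short calculation. This sketch assumes Macro F1 is the unweighted mean of the seven per-label F1 scores, `clean` included; that is an inference from the numbers, not something the card states. It recomputes each F1 from the rounded recall and precision values in the table.

```python
# (recall, precision) pairs copied from the per-label table above.
PER_LABEL = {
    "clean": (0.998, 0.992),
    "xss": (0.951, 0.585),
    "sqli": (0.732, 0.940),
    "path_traversal": (0.896, 0.794),
    "command_injection": (0.826, 0.626),
    "scanner": (0.980, 0.945),
    "spam_bot": (1.000, 1.000),
}

def f1_score(recall, precision):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0

f1_by_label = {name: f1_score(r, p) for name, (r, p) in PER_LABEL.items()}

# Assumption: macro average = plain mean over all seven labels.
macro_f1 = sum(f1_by_label.values()) / len(f1_by_label)

print(f"{macro_f1:.3f}")  # 0.866
```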

### Stratified Stress Test (test — 49,830 examples)

| Metric | Value |
|---|---|
| Macro F1 | 0.787 |
| FPR on clean | 7.9% |

### Adversarial Robustness (test_mixed_adversarial — 22,250 examples)

| Metric | Value |
|---|---|
| Macro F1 | 0.499 |
| FPR on clean | 11.8% |
| XSS recall | 0.833 |

### Latency (ONNX Runtime, CPU, 1 thread, batch=1)

| Metric | FP32 | INT8 |
|--------|------|------|
| Average | **0.24 ms** | 1.20 ms |
| Throughput | ~4,100 req/s | ~830 req/s |

> On CPU without VNNI, FP32 is faster than dynamic INT8. Use `model.onnx` on standard CPUs.
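
To reproduce numbers like these on your own hardware, a simple timing harness is usually enough. The card does not describe how latency was measured, so the sketch below is one plausible approach, not the author's method. `benchmark` is a hypothetical helper that takes any zero-argument callable; in practice you would pass a closure that calls `session.run(...)` on one pre-tokenized request.

```python
import time

def benchmark(fn, warmup=50, iters=1000):
    """Return (avg_latency_ms, throughput_per_s) for a zero-argument callable."""
    for _ in range(warmup):          # warm caches before timing
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    avg_ms = elapsed / iters * 1000.0
    # Single-threaded, batch=1 throughput is the inverse of the mean latency.
    return avg_ms, 1000.0 / avg_ms

# Stand-in workload so the sketch runs without the model files.
avg_ms, rps = benchmark(lambda: sum(range(100)))
print(f"{avg_ms:.4f} ms avg, {rps:,.0f} calls/s")
```

The same inverse relation links the two rows of the table: 1000 / 0.24 ms is about 4,167 requests per second, in line with the reported ~4,100.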

---

## Training

| Hyperparameter | V1 | V3 (current) |
|---|---|---|
| Loss | BCEWithLogitsLoss | **Focal BCE (gamma=2.0)** |
| Learning rate | 1e-3 | **1e-4** |
| Batch size | 256 | **128** |
| Epochs | 5 | **8 (early stop at 6, best=3)** |
| Patience | 2 | **3** |
| Checkpoint selection | Best val_loss | **Best Macro F1** |
| Calibration | None | **Per-label threshold tuning** |
| Data augmentation | None | **+20k augmented (encoding, headers, noise, context swap)** |

### Dataset

| Property | Value |
|---|---|
| Total examples | 498,345 (+20k augmented) |
| Training split | 418,685 |
| Real traffic | 62.6% |
| Synthetic | 37.4% |
| Multi-label | 17.0% |
| Hard negatives | 15.0% |
| Unique sources | 12 |
| Sources | CIC-IDS-2017, CSE-CIC-IDS-2018, HIKARI-2021, WebAttackPayloads, PayloadsAllTheThings, + synthetic |

---

## Usage

### Python (ONNX Runtime)

```python
import onnxruntime as ort
import numpy as np
import json
from tokenizers import Tokenizer

# Load model and tokenizer
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file("tokenizer.json")
with open("thresholds.json") as f:  # close the file handle after reading
    thresholds = json.load(f)["thresholds"]
label_names = ["clean", "xss", "sqli", "path_traversal",
               "command_injection", "scanner", "spam_bot"]

def classify_request(method, path, query, headers, body):
    # 1. Build text
    text = f"{method} {path}"
    if query: text += f"?{query}"
    if body: text += f" {body[:200]}"
    text = text[:500]

    # 2. Tokenize
    enc = tokenizer.encode(text)
    input_ids = np.array([enc.ids], dtype=np.int32)
    attention_mask = np.array([enc.attention_mask], dtype=np.int32)

    # 3. Numeric features (RAW values)
    numeric = np.array([[
        float(len(body or "")),
        float(len(headers)),
        1.0 if body else 0.0,
        0.0, 0.0, 0.0,  # session features (0 if unavailable)
    ]], dtype=np.float32)

    # 4. Inference
    probs, risk = session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "numeric_features": numeric,
    })

    # 5. Apply per-label thresholds
    detections = {
        name: float(probs[0][i])
        for i, name in enumerate(label_names)
        if name != "clean" and probs[0][i] >= thresholds[name]
    }

    return {
        "risk_score": float(risk[0][0]),
        "detections": detections,
        "is_clean": len(detections) == 0,
    }

# Example
result = classify_request("GET", "/search", "q=' OR 1=1--", {"Host": "example.com"}, None)
print(result)
# {'risk_score': 0.87, 'detections': {'sqli': 0.94}, 'is_clean': False}
```

### Rust (ort crate)

```rust
use ort::Session;
use ndarray::Array2;

fn main() -> anyhow::Result<()> {
    let session = Session::builder()?
        .with_model_from_file("model.onnx")?;

    let input_ids = Array2::<i32>::zeros((1, 128));       // from tokenizer
    let attention_mask = Array2::<i32>::zeros((1, 128));   // from tokenizer
    let numeric_features = Array2::<f32>::zeros((1, 6));   // extract_numeric_features()

    let outputs = session.run(ort::inputs![
        "input_ids" => &input_ids,
        "attention_mask" => &attention_mask,
        "numeric_features" => &numeric_features,
    ]?)?;

    let label_probs: Vec<f32> = outputs[0].extract_tensor::<f32>()?.view().iter().copied().collect();
    let risk_score: f32 = *outputs[1].extract_tensor::<f32>()?.view().first().unwrap();

    // Apply thresholds from thresholds.json
    // Explicit f32 so the comparison against the f32 probabilities is unambiguous.
    let thresholds: [f32; 7] = [0.20, 0.50, 0.74, 0.68, 0.66, 0.70, 0.72];
    let labels = ["clean", "xss", "sqli", "path_traversal",
                  "command_injection", "scanner", "spam_bot"];

    for (i, (prob, thr)) in label_probs.iter().zip(thresholds.iter()).enumerate() {
        if i > 0 && prob >= thr {
            println!("DETECTED: {} ({:.3})", labels[i], prob);
        }
    }
    println!("Risk score: {:.4}", risk_score);

    Ok(())
}
```

### Decision Logic

```python
with open("thresholds.json") as f:  # close the file handle after reading
    thresholds = json.load(f)["thresholds"]

# Per-label detection
triggered = [name for i, name in enumerate(label_names)
             if name != "clean" and probs[0][i] >= thresholds[name]]

# Risk-score action
score = float(risk[0][0])
if score >= 0.8:   action = "BLOCK"
elif score >= 0.5:  action = "CHALLENGE"
elif score >= 0.2:  action = "LOG"
else:               action = "ALLOW"
```

---

## Version History

### V3 (current) — Production-Hardened

Fixed V2 recall collapse. Multi-checkpoint selection on Macro F1. Per-label threshold optimization replaces Platt scaling.

### V2 — Focal Loss + Calibration (superseded)

Introduced Focal Loss and Platt calibration. FPR dropped to 0.18% but XSS recall collapsed to 0.016 and CMDi to 0.222 due to aggressive calibration.

### V1 — Baseline

BCE loss, fixed 0.5 thresholds. High recall (~0.98) but lower Macro F1 (0.828) than V3 at the same FPR (0.83%).

| Metric | V1 | V2 | **V3** |
|--------|-----|-----|--------|
| **Macro F1** | 0.828 | 0.669 | **0.866** |
| **FPR** | 0.83% | 0.18% | **0.83%** |
| **XSS recall** | 0.980 | 0.016 | **0.951** |
| **CMDi recall** | 0.985 | 0.222 | **0.826** |
| **Latency** | 0.77ms | 0.99ms | **0.24ms** |

---

## Deployment Strategy

**Phase 1 — Shadow Mode**: Deploy alongside existing WAF rules, log predictions, compare decisions, tune thresholds.

**Phase 2 — Safe Blocking**: Enable blocking for high-confidence classes (scanner 0.98 recall, spam_bot 1.00, xss 0.95). Monitor FPR.
Note that xss precision is 0.585 at its 0.50 threshold, so review its Phase 1 false positives before blocking on it.

**Phase 3 — Full Deployment**: Activate all labels with `thresholds.json`. Use risk-score actions (BLOCK/CHALLENGE/LOG/ALLOW).
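
For Phase 1, the comparison between the existing WAF and the model can start as a simple tally of agreements and disagreements. The sketch below is illustrative only: `compare_shadow_decisions` and the record layout (a pair of booleans per request) are hypothetical, and a real deployment would read these from its own logs.

```python
from collections import Counter

def compare_shadow_decisions(records):
    """Tally agreement between the existing WAF and the model.

    Each record is a (waf_blocked, model_flagged) pair of booleans.
    """
    counts = Counter()
    for waf_blocked, model_flagged in records:
        if waf_blocked and model_flagged:
            counts["both"] += 1
        elif model_flagged:
            counts["model_only"] += 1   # candidate new detections or false positives
        elif waf_blocked:
            counts["waf_only"] += 1     # candidate misses by the model
        else:
            counts["neither"] += 1
    return counts

records = [(True, True), (False, True), (False, False),
           (False, False), (True, False)]
counts = compare_shadow_decisions(records)
print(counts["model_only"], counts["waf_only"])  # 1 1
```

The `model_only` and `waf_only` buckets are the ones worth reviewing by hand before moving to Phase 2.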

---

## Artifacts

| File | Size | Description |
|---|---|---|
| `model.onnx` | 4.5 MB | **Production model** (FP32, fastest on CPU) |
| `model_int8.onnx` | 1.2 MB | INT8 quantized (for VNNI hardware) |
| `model_optimized.onnx` | 4.5 MB | Graph-optimized FP32 |
| `tokenizer.json` | 510 KB | BPE tokenizer |
| `config.json` | 1.5 KB | Architecture + training config |
| `thresholds.json` | 1.3 KB | **Per-label thresholds** (must use at inference) |
| `metrics.json` | 12 KB | Full 3-set evaluation results |
| `training_history.json` | 7.3 KB | Per-epoch training history |

## Known Limitations

- **SQLi recall at 0.73**: High threshold (0.74) trades recall for precision. Lower to 0.60 if SQLi detection is critical.
- **Adversarial robustness**: Fuzzed/encoded payloads have lower recall (test_mixed_adversarial Macro F1 = 0.499).
- **No session-level model**: Classifies individual requests. Session features help but don't replace session analysis.
- **Sequence truncation**: Requests truncated to 128 tokens. Place attack-relevant fields early in the text.
- **FP32 > INT8 on CPU**: Without VNNI, FP32 is faster. Use `model.onnx` on standard CPUs.
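
If you lower the SQLi threshold as suggested above, it helps to do so without editing `thresholds.json` in place. A minimal sketch, assuming the thresholds are held in a label-to-float dict as in the Usage example; `override_thresholds` is a hypothetical helper. The card does not report recall or precision at 0.60, so re-evaluate on your own traffic after changing it.

```python
def override_thresholds(thresholds, **overrides):
    """Return a copy of `thresholds` with selected labels replaced."""
    unknown = set(overrides) - set(thresholds)
    if unknown:
        raise KeyError(f"unknown labels: {sorted(unknown)}")
    for name, value in overrides.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"threshold for {name} must be in [0, 1]")
    # Build a new dict so the shipped defaults stay untouched.
    return {**thresholds, **overrides}

# Defaults copied from the Per-Label Thresholds table.
base = {"clean": 0.20, "xss": 0.50, "sqli": 0.74, "path_traversal": 0.68,
        "command_injection": 0.66, "scanner": 0.70, "spam_bot": 0.72}

tuned = override_thresholds(base, sqli=0.60)
print(tuned["sqli"], base["sqli"])  # 0.6 0.74
```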

## Citation

```bibtex
@misc{argus_sentinel_2026,
  title        = {Argus Sentinel: A Low-Latency CNN-Based WAF Classifier},
  author       = {Fizcko},
  year         = {2026},
  howpublished = {Hugging Face Model Hub},
  note         = {V3, 1.17M params, 0.24ms latency, Macro F1 0.866, FPR 0.83\%}
}
```