---
language: en
license: apache-2.0
tags:
- waf
- web-security
- onnx
- multi-label-classification
- low-latency
library_name: onnxruntime
pipeline_tag: text-classification
---
| |
| # Argus Sentinel — WAF ML Classifier (V3) |
|
|
| Production-grade Web Application Firewall classifier. Detects 6 attack types in HTTP requests with sub-millisecond latency on CPU. |
|
|
| **Key metrics (test_realistic — production-like distribution, 94% clean):** |
| - Macro F1: **0.866** | FPR: **0.83%** | Mean attack recall: **0.889** | Latency: **0.24ms** |
| |
| --- |
| |
| ## Model Overview |
| |
| | Property | Value | |
| |---|---| |
| | Architecture | CNN text encoder + numeric features fusion | |
| | Parameters | 1.17M | |
| | Vocab size | 8,192 (BPE ByteLevel) | |
| | Max sequence length | 128 tokens | |
| | ONNX model size | 4.5 MB (FP32) / 1.2 MB (INT8) | |
| | Inference latency | **0.24 ms** avg (CPU, single thread) | |
| | Training loss | Focal BCEWithLogitsLoss (gamma=2.0) | |
| | Best epoch | 3 / 8 (early stopping, selected on Macro F1) | |
| |
| --- |
| |
| ## Architecture |
| |
```
HTTP Request Text
       |
       v
[BPE Tokenizer (vocab=8192, max_len=128)]
       |
       +---> [Embedding (128-dim)]
       |          |
       |     [Conv1D (128 ch, k=3) + BatchNorm + ReLU] x2
       |          |
       |     [AdaptiveMaxPool1d → 128-dim]
       |
       +---> [6 Numeric Features]
                  |
             [Linear 6→32 + ReLU]
                  |
         [Concatenate (128 + 32 = 160)]
                  |
     [Linear 160→128→64 + ReLU + Dropout(0.1)]
                  |
        +---------+---------+
        |                   |
 [Label Head → 7]    [Risk Head → 1]
        |                   |
    [Sigmoid]           [Sigmoid]
        |                   |
 label_probs [7]      risk_score [1]
```
| |
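As a rough consistency check on the diagram, the parameter count can be tallied layer by layer. This is a back-of-the-envelope sketch, not the training code: it assumes the embedding width equals the 128 conv channels, that every Conv1D and Linear layer carries a bias, and that BatchNorm contributes only its learnable scale and shift. The helper names are illustrative.

```python
def linear(n_in, n_out):
    return n_in * n_out + n_out            # weights + bias (bias is an assumption)

def conv1d(c_in, c_out, kernel):
    return c_in * c_out * kernel + c_out   # weights + bias (bias is an assumption)

def batchnorm(channels):
    return 2 * channels                    # learnable scale and shift only

layers = {
    "embedding": 8192 * 128,                                    # vocab x embedding dim
    "conv_blocks": 2 * (conv1d(128, 128, 3) + batchnorm(128)),  # two Conv1D + BatchNorm blocks
    "numeric_proj": linear(6, 32),
    "fusion_mlp": linear(160, 128) + linear(128, 64),
    "label_head": linear(64, 7),
    "risk_head": linear(64, 1),
}
total = sum(layers.values())
print(total)  # 1177256
```

Under these assumptions the total is about 1.18M, close to the 1.17M reported in the Model Overview, with the embedding table accounting for roughly 89% of it.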
| --- |
| |
| ## Tokenizer Specification |
| |
| | Property | Value | |
| |---|---| |
| | **Type** | BPE (Byte-Pair Encoding) via HuggingFace `tokenizers` library | |
| | **Algorithm** | `ByteLevel` BPE — operates on UTF-8 bytes, not characters | |
| | **Pre-tokenizer** | `ByteLevel` (add_prefix_space=false, trim_offsets=true, use_regex=true) | |
| | **Normalizer** | None (raw bytes, no lowercasing or unicode normalization) | |
| | **Post-processor** | `TemplateProcessing` — prepends `[CLS]` token automatically | |
| | **Vocab size** | 8,192 tokens (7,933 merges + 3 special tokens + 256 byte tokens) | |
| | **Special tokens** | `[PAD]` (id=0), `[UNK]` (id=1), `[CLS]` (id=2) | |
| | **Max length** | 128 tokens (truncation=Right, padding=Right to fixed 128) | |
| | **Byte fallback** | false — unknown bytes map to `[UNK]` | |
| | **File** | `tokenizer.json` (HuggingFace tokenizers JSON format) | |
|
|
| **Input text construction**: `"{method} {path}?{query} {body[:200]}"` — capped at 500 chars before tokenization. |
|
|
| ```python |
| from tokenizers import Tokenizer |
| tok = Tokenizer.from_file("tokenizer.json") |
| result = tok.encode("GET /search?q=test HTTP/1.1") |
| # result.ids → [2, 546, 287, ...] (starts with [CLS]=2) |
| # result.attention_mask → [1, 1, 1, ...] |
| ``` |
|
|
| --- |
|
|
| ## Label ID Mapping (CRITICAL) |
|
|
| Output `label_probs` tensor shape: `[batch, 7]`. Each index maps to: |
|
|
| | Index | Label | Description | |
| |---|---|---| |
| | **0** | `clean` | Benign / legitimate request | |
| | **1** | `xss` | Cross-Site Scripting | |
| | **2** | `sqli` | SQL Injection | |
| | **3** | `path_traversal` | Directory / Path Traversal | |
| | **4** | `command_injection` | OS Command Injection | |
| | **5** | `scanner` | Vulnerability scanner / probe | |
| | **6** | `spam_bot` | Spam bot / automated abuse | |
|
|
| > **Multi-label**: Labels are NOT mutually exclusive. Multiple labels can be active simultaneously (e.g., index 2 + 5 = scanner performing SQLi). Exception: `clean` (index 0) is exclusive — if clean=1, all others must be 0. |
|
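The exclusivity rule can be checked mechanically, for example when preparing training targets. The sketch below is not part of the released artifacts: `is_valid_label_vector` is a hypothetical helper that takes a 7-element 0/1 vector in the index order of the table above. Treating an all-zero vector as invalid is an assumption, since the card does not say how unlabeled requests are handled.

```python
LABELS = ["clean", "xss", "sqli", "path_traversal",
          "command_injection", "scanner", "spam_bot"]

def is_valid_label_vector(y: list[int]) -> bool:
    """Return True if a 0/1 label vector respects the clean-exclusivity rule."""
    if len(y) != len(LABELS) or any(v not in (0, 1) for v in y):
        return False
    clean, attacks = y[0], y[1:]
    if clean == 1:
        return sum(attacks) == 0   # clean excludes every attack label
    # Assumption: a request that is not clean carries at least one attack label.
    return sum(attacks) >= 1
```

For instance, `[0, 0, 1, 0, 0, 1, 0]` (sqli + scanner) passes, while `[1, 0, 1, 0, 0, 0, 0]` (clean + sqli) does not.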
|
| --- |
|
|
| ## ONNX Inputs |
|
|
| | Name | Shape | Dtype | Description | |
| |---|---|---|---| |
| | `input_ids` | `[batch, 128]` | `int32` | BPE token IDs from `tokenizer.json` | |
| | `attention_mask` | `[batch, 128]` | `int32` | 1 for real tokens, 0 for `[PAD]` | |
| | `numeric_features` | `[batch, 6]` | `float32` | Request-level features (**RAW values**, see below) | |
|
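Both token inputs must have exactly 128 columns. The tokenizer specification says `tokenizer.json` already truncates and pads to that length, so the following is only a sketch for callers that obtain token IDs some other way. `pad_to_fixed_length` is a hypothetical helper; it mirrors the documented right-truncation and right-padding with `[PAD]` (id=0).

```python
PAD_ID = 0      # [PAD] per the tokenizer specification
MAX_LEN = 128   # fixed sequence length expected by the ONNX graph

def pad_to_fixed_length(ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Right-truncate, then right-pad token IDs; return (input_ids, attention_mask)."""
    kept = list(ids[:max_len])
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask
```

The two returned lists can be wrapped in `int32` arrays of shape `[1, 128]` before being passed to the session.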
|
| ## Numeric Features — Normalization Parameters (CRITICAL) |
|
|
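**Features are passed as RAW values.** The model was trained on unnormalized features, so pass the same raw scale at inference. The Mean and Std columns below describe the training distribution for reference only; do not standardize inputs with them.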
|
|
| | Index | Feature | Computation | Training Range | Mean | Std | |
| |---|---|---|---|---|---| |
| | **0** | `content_length` | `len(body)` if body else `0` | 0 – 662 | 35.2 | 63.5 | |
| | **1** | `num_headers` | `len(headers_dict)` | 3 – 13 | 7.7 | 1.3 | |
| | **2** | `has_body` | `1.0` if body present, else `0.0` | 0 – 1 | 0.42 | 0.49 | |
| | **3** | `session_request_count` | Total requests in session, or `0` | 0 – 20 | 3.0 | 6.0 | |
| | **4** | `session_duration` | Session time span in seconds, or `0` | 0 – 4,965,381 | 619,322 | 1,198,151 | |
| | **5** | `session_pattern_score` | Behavioral pattern score, or `0` | 0 – 0.5 | 0.09 | 0.18 | |
|
|
| **Python**: |
```python
def extract_numeric_features(request: dict) -> list[float]:
    body = request.get("body") or ""
    headers = request.get("headers") or {}
    return [
        float(len(body)),                                   # content_length
        float(len(headers)),                                # num_headers
        1.0 if body else 0.0,                               # has_body
        float(request.get("session_request_count") or 0),   # session_request_count
        float(request.get("session_duration") or 0),        # session_duration
        float(request.get("session_pattern_score") or 0),   # session_pattern_score
    ]
```
|
|
| **Rust**: |
```rust
fn extract_numeric_features(request: &HttpRequest) -> [f32; 6] {
    let body_len = request.body.as_ref().map_or(0, |b| b.len());
    [
        body_len as f32,
        request.headers.len() as f32,
        if body_len > 0 { 1.0 } else { 0.0 },
        request.session_request_count.unwrap_or(0) as f32,
        request.session_duration.unwrap_or(0.0),
        request.session_pattern_score.unwrap_or(0.0),
    ]
}
```
|
|
| > If you don't have session data, pass `[content_length, num_headers, has_body, 0.0, 0.0, 0.0]` — ~79% of training examples had null session features. |
| |
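Because the features go in unscaled, a value far outside the training range (for example a 5,000-byte body against a training maximum of 662) is something the model never saw. One possible monitoring aid, sketched here under the assumption that flagging is enough and that the ranges in the table above are the right reference, is a hypothetical `out_of_range_features` helper. What to do with a flagged request (log it, clip the value) is left open.

```python
# (min, max) observed in training, copied from the table above
TRAINING_RANGES = {
    "content_length": (0.0, 662.0),
    "num_headers": (3.0, 13.0),
    "has_body": (0.0, 1.0),
    "session_request_count": (0.0, 20.0),
    "session_duration": (0.0, 4_965_381.0),
    "session_pattern_score": (0.0, 0.5),
}

def out_of_range_features(features):
    """Return the names of features that fall outside the training range."""
    names = list(TRAINING_RANGES)
    if len(features) != len(names):
        raise ValueError(f"expected {len(names)} features, got {len(features)}")
    flagged = []
    for name, value in zip(names, features):
        low, high = TRAINING_RANGES[name]
        if not (low <= value <= high):
            flagged.append(name)
    return flagged
```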
| ## ONNX Outputs |
| |
| | Name | Shape | Dtype | Description | |
| |---|---|---|---| |
| | `label_probs` | `[batch, 7]` | `float32` | Per-label probabilities after sigmoid | |
| | `risk_score` | `[batch, 1]` | `float32` | Aggregate risk score [0, 1] | |
|
|
| --- |
|
|
| ## Per-Label Thresholds (CRITICAL for deployment) |
|
|
| Do NOT use a default 0.5 threshold for all labels. Use these optimized thresholds from `thresholds.json`: |
|
|
| | Label | Threshold | Recall | Precision | F1 | |
| |-------|-----------|--------|-----------|-----| |
| | clean | **0.20** | 0.998 | 0.992 | 0.995 | |
| | xss | **0.50** | 0.951 | 0.585 | 0.724 | |
| | sqli | **0.74** | 0.732 | 0.940 | 0.823 | |
| | path_traversal | **0.68** | 0.896 | 0.794 | 0.842 | |
| | command_injection | **0.66** | 0.826 | 0.626 | 0.712 | |
| | scanner | **0.70** | 0.980 | 0.945 | 0.962 | |
| | spam_bot | **0.72** | 1.000 | 1.000 | 1.000 | |
| |
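The card does not state how these thresholds were selected. One common approach, shown here purely as an assumption, is a grid search that picks the threshold with the best F1 for each label on held-out data. `f1_at_threshold` and `tune_threshold` are illustrative names, not functions shipped with the model.

```python
def f1_at_threshold(probs, targets, threshold):
    """F1 for one label given predicted probabilities and 0/1 targets."""
    tp = sum(1 for p, t in zip(probs, targets) if p >= threshold and t == 1)
    fp = sum(1 for p, t in zip(probs, targets) if p >= threshold and t == 0)
    fn = sum(1 for p, t in zip(probs, targets) if p < threshold and t == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(probs, targets, grid=None):
    """Pick the grid threshold with the highest F1 (ties go to the lowest threshold)."""
    grid = grid or [round(0.02 * k, 2) for k in range(1, 50)]   # 0.02 .. 0.98
    return max(grid, key=lambda thr: f1_at_threshold(probs, targets, thr))
```

Running `tune_threshold` once per label column yields a dictionary in the same shape as the table above. A deployment that cares more about precision than F1 would swap in a different objective.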
| --- |
| |
| ## Performance |
| |
| ### Production-Like (test_realistic — 25,000 examples, 94% clean) |
|
|
| | Metric | Value | |
| |---|---| |
| | **Macro F1** | **0.866** | |
| | **FPR on clean** | **0.83%** | |
| | **Mean attack recall** | **0.889** | |
|
|
| | Label | Recall | Precision | F1 | |
| |-------|--------|-----------|-----| |
| | clean | 0.998 | 0.992 | 0.995 | |
| | xss | 0.951 | 0.585 | 0.724 | |
| | sqli | 0.732 | 0.940 | 0.823 | |
| | path_traversal | 0.896 | 0.794 | 0.842 | |
| | command_injection | 0.826 | 0.626 | 0.712 | |
| | scanner | 0.980 | 0.945 | 0.962 | |
| | spam_bot | 1.000 | 1.000 | 1.000 | |
| |
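The headline Macro F1 can be reproduced from the per-label rows: F1 is the harmonic mean of precision and recall, and Macro F1 is the unweighted mean over the seven labels. The snippet below is a small check using the rounded values from the table above, so the result is only accurate to about three decimals.

```python
# (recall, precision) per label on test_realistic, copied from the table above
PER_LABEL = {
    "clean": (0.998, 0.992),
    "xss": (0.951, 0.585),
    "sqli": (0.732, 0.940),
    "path_traversal": (0.896, 0.794),
    "command_injection": (0.826, 0.626),
    "scanner": (0.980, 0.945),
    "spam_bot": (1.000, 1.000),
}

def f1(recall, precision):
    """Harmonic mean of precision and recall."""
    total = recall + precision
    return 2 * recall * precision / total if total else 0.0

def macro_f1(per_label):
    """Unweighted mean of the per-label F1 scores."""
    scores = [f1(r, p) for r, p in per_label.values()]
    return sum(scores) / len(scores)

print(f"{macro_f1(PER_LABEL):.3f}")  # 0.866
```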
| ### Stratified Stress Test (test — 49,830 examples) |
| |
| | Metric | Value | |
| |---|---| |
| | Macro F1 | 0.787 | |
| | FPR on clean | 7.9% | |
| |
| ### Adversarial Robustness (test_mixed_adversarial — 22,250 examples) |
| |
| | Metric | Value | |
| |---|---| |
| | Macro F1 | 0.499 | |
| | FPR on clean | 11.8% | |
| | XSS recall | 0.833 | |
| |
| ### Latency (ONNX Runtime, CPU, 1 thread, batch=1) |
| |
| | Metric | FP32 | INT8 | |
| |--------|------|------| |
| | Average | **0.24 ms** | 1.20 ms | |
| | Throughput | ~4,100 req/s | ~830 req/s | |
| |
| > On CPU without VNNI, FP32 is faster than dynamic INT8. Use `model.onnx` on standard CPUs. |
| |
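The throughput row follows directly from the average latency: 1000 / 0.24 ms is about 4,167 requests per second, and 1000 / 1.20 ms is about 833. A minimal timing harness along these lines could be used to reproduce the measurement on other hardware. `benchmark` is a hypothetical helper, and the warm-up and iteration counts are arbitrary defaults rather than the settings behind the table.

```python
import time

def benchmark(fn, warmup=100, iters=1000):
    """Time repeated calls to fn; return average latency and implied throughput."""
    for _ in range(warmup):            # let caches and lazy initialisation settle
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    elapsed = time.perf_counter() - start
    avg_ms = elapsed / iters * 1000.0
    req_per_s = 1000.0 / avg_ms if avg_ms > 0 else float("inf")
    return {"avg_ms": avg_ms, "req_per_s": req_per_s}
```

To time the model, pass a closure such as `lambda: session.run(None, feed)`, where `session` and `feed` are built as in the Usage section. The table was measured with a single thread, which in ONNX Runtime is typically configured through `SessionOptions.intra_op_num_threads`.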
| --- |
| |
| ## Training |
| |
| | Hyperparameter | V1 | V3 (current) | |
| |---|---|---| |
| | Loss | BCEWithLogitsLoss | **Focal BCE (gamma=2.0)** | |
| | Learning rate | 1e-3 | **1e-4** | |
| | Batch size | 256 | **128** | |
| | Epochs | 5 | **8 (early stop at 6, best=3)** | |
| | Patience | 2 | **3** | |
| | Checkpoint selection | Best val_loss | **Best Macro F1** | |
| | Calibration | None | **Per-label threshold tuning** | |
| | Data augmentation | None | **+20k augmented (encoding, headers, noise, context swap)** | |
|
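For readers unfamiliar with the loss, focal BCE scales the usual binary cross-entropy by `(1 - p_t) ** gamma`, where `p_t` is the probability the model assigns to the true class, so well-classified examples contribute less. The scalar sketch below illustrates that formula for a single logit. It assumes no class-balancing `alpha` term, since the card only mentions `gamma=2.0`, and it is not the project's training code.

```python
import math

def focal_bce_with_logits(logit: float, target: int, gamma: float = 2.0) -> float:
    """Focal binary cross-entropy for a single logit/target pair."""
    # Numerically stable BCE-with-logits: max(z, 0) - z*y + log(1 + exp(-|z|))
    bce = max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
    p_t = math.exp(-bce)                # probability assigned to the true class
    return (1.0 - p_t) ** gamma * bce   # down-weight easy examples
```

With `gamma=0` the function reduces to plain BCE, which is the V1 setting in the table above.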
|
| ### Dataset |
|
|
| | Property | Value | |
| |---|---| |
| | Total examples | 498,345 (+20k augmented) | |
| | Training split | 418,685 | |
| | Real traffic | 62.6% | |
| | Synthetic | 37.4% | |
| | Multi-label | 17.0% | |
| | Hard negatives | 15.0% | |
| | Unique sources | 12 | |
| | Sources | CIC-IDS-2017, CSE-CIC-IDS-2018, HIKARI-2021, WebAttackPayloads, PayloadsAllTheThings, + synthetic | |
|
|
| --- |
|
|
| ## Usage |
|
|
| ### Python (ONNX Runtime) |
|
|
```python
import json

import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

# Load model, tokenizer, and per-label thresholds
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
tokenizer = Tokenizer.from_file("tokenizer.json")
with open("thresholds.json") as f:
    thresholds = json.load(f)["thresholds"]
label_names = ["clean", "xss", "sqli", "path_traversal",
               "command_injection", "scanner", "spam_bot"]

def classify_request(method, path, query, headers, body):
    # 1. Build text
    text = f"{method} {path}"
    if query:
        text += f"?{query}"
    if body:
        text += f" {body[:200]}"
    text = text[:500]

    # 2. Tokenize (tokenizer.json truncates and pads to 128 tokens)
    enc = tokenizer.encode(text)
    input_ids = np.array([enc.ids], dtype=np.int32)
    attention_mask = np.array([enc.attention_mask], dtype=np.int32)

    # 3. Numeric features (RAW values)
    numeric = np.array([[
        float(len(body or "")),
        float(len(headers or {})),   # tolerate headers=None
        1.0 if body else 0.0,
        0.0, 0.0, 0.0,               # session features (0 if unavailable)
    ]], dtype=np.float32)

    # 4. Inference
    probs, risk = session.run(None, {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "numeric_features": numeric,
    })

    # 5. Apply per-label thresholds (clean is reported via is_clean instead)
    detections = {
        name: float(probs[0][i])
        for i, name in enumerate(label_names)
        if name != "clean" and probs[0][i] >= thresholds[name]
    }

    return {
        "risk_score": float(risk[0][0]),
        "detections": detections,
        "is_clean": len(detections) == 0,
    }

# Example
result = classify_request("GET", "/search", "q=' OR 1=1--", {"Host": "example.com"}, None)
print(result)
# Example output (illustrative values):
# {'risk_score': 0.87, 'detections': {'sqli': 0.94}, 'is_clean': False}
```
|
|
| ### Rust (ort crate) |
|
|
```rust
use ort::{Session, Value};
use ndarray::Array2;

fn main() -> anyhow::Result<()> {
    let session = Session::builder()?
        .with_model_from_file("model.onnx")?;

    let input_ids = Array2::<i32>::zeros((1, 128));      // from tokenizer
    let attention_mask = Array2::<i32>::zeros((1, 128)); // from tokenizer
    let numeric_features = Array2::<f32>::zeros((1, 6)); // extract_numeric_features()

    let outputs = session.run(ort::inputs![
        "input_ids" => &input_ids,
        "attention_mask" => &attention_mask,
        "numeric_features" => &numeric_features,
    ]?)?;

    let label_probs: Vec<f32> = outputs[0].extract_tensor::<f32>()?.view().iter().copied().collect();
    let risk_score: f32 = *outputs[1].extract_tensor::<f32>()?.view().first().unwrap();

    // Apply thresholds from thresholds.json
    let thresholds = [0.20, 0.50, 0.74, 0.68, 0.66, 0.70, 0.72];
    let labels = ["clean", "xss", "sqli", "path_traversal",
                  "command_injection", "scanner", "spam_bot"];

    for (i, (prob, thr)) in label_probs.iter().zip(thresholds.iter()).enumerate() {
        if i > 0 && prob >= thr {
            println!("DETECTED: {} ({:.3})", labels[i], prob);
        }
    }
    println!("Risk score: {:.4}", risk_score);

    Ok(())
}
```
|
|
| ### Decision Logic |
|
|
```python
import json

# Continues from the Python example above: `probs` and `risk` are the outputs
# of session.run(), and `label_names` is the 7-entry label list.
with open("thresholds.json") as f:
    thresholds = json.load(f)["thresholds"]

# Per-label detection
triggered = [name for i, name in enumerate(label_names)
             if name != "clean" and probs[0][i] >= thresholds[name]]

# Risk-score action
score = float(risk[0][0])
if score >= 0.8:
    action = "BLOCK"
elif score >= 0.5:
    action = "CHALLENGE"
elif score >= 0.2:
    action = "LOG"
else:
    action = "ALLOW"
```
|
|
| --- |
|
|
| ## Version History |
|
|
| ### V3 (current) — Production-Hardened |
|
|
| Fixed V2 recall collapse. Multi-checkpoint selection on Macro F1. Per-label threshold optimization replaces Platt scaling. |
|
|
| ### V2 — Focal Loss + Calibration (superseded) |
|
|
| Introduced Focal Loss and Platt calibration. FPR dropped to 0.18% but XSS recall collapsed to 0.016 and CMDi to 0.222 due to aggressive calibration. |
|
|
| ### V1 — Baseline |
|
|
BCE loss, fixed 0.5 thresholds. High recall (~0.98) but lower Macro F1 than V3 (0.828 vs. 0.866) and higher FPR than V2 (0.83% vs. 0.18%).
|
|
| | Metric | V1 | V2 | **V3** | |
| |--------|-----|-----|--------| |
| | **Macro F1** | 0.828 | 0.669 | **0.866** | |
| | **FPR** | 0.83% | 0.18% | **0.83%** | |
| | **XSS recall** | 0.980 | 0.016 | **0.951** | |
| | **CMDi recall** | 0.985 | 0.222 | **0.826** | |
| | **Latency** | 0.77ms | 0.99ms | **0.24ms** | |
|
|
| --- |
|
|
| ## Deployment Strategy |
|
|
**Phase 1: Shadow Mode.** Deploy alongside existing WAF rules, log predictions, compare decisions, and tune thresholds.

**Phase 2: Safe Blocking.** Enable blocking for the classes with the highest recall (scanner 0.98, spam_bot 1.00, xss 0.95). XSS precision is only 0.585 on test_realistic, so monitor FPR closely before blocking on that label.

**Phase 3: Full Deployment.** Activate all labels with `thresholds.json`. Use risk-score actions (BLOCK/CHALLENGE/LOG/ALLOW).
| |
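For the shadow-mode phase, one simple way to compare decisions is to tally how often the existing WAF and the model's risk-score action agree, without enforcing anything. The sketch below reuses the cut-offs from the Decision Logic section. `risk_action`, `shadow_tally`, and the `(waf_decision, risk_score)` record format are all hypothetical.

```python
from collections import Counter

def risk_action(score: float) -> str:
    """Map a risk score to an action using the cut-offs from Decision Logic."""
    if score >= 0.8:
        return "BLOCK"
    if score >= 0.5:
        return "CHALLENGE"
    if score >= 0.2:
        return "LOG"
    return "ALLOW"

def shadow_tally(records):
    """Count (existing WAF decision, model action) pairs; nothing is enforced."""
    return Counter((waf_decision, risk_action(score))
                   for waf_decision, score in records)
```

Cells such as `("ALLOW", "BLOCK")` are the ones worth reviewing by hand before moving to Phase 2, since they are either new detections or false positives.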
| --- |
| |
| ## Artifacts |
| |
| | File | Size | Description | |
| |---|---|---| |
| | `model.onnx` | 4.5 MB | **Production model** (FP32, fastest on CPU) | |
| | `model_int8.onnx` | 1.2 MB | INT8 quantized (for VNNI hardware) | |
| | `model_optimized.onnx` | 4.5 MB | Graph-optimized FP32 | |
| | `tokenizer.json` | 510 KB | BPE tokenizer | |
| | `config.json` | 1.5 KB | Architecture + training config | |
| | `thresholds.json` | 1.3 KB | **Per-label thresholds** (must use at inference) | |
| | `metrics.json` | 12 KB | Full 3-set evaluation results | |
| | `training_history.json` | 7.3 KB | Per-epoch training history | |
|
|
| ## Known Limitations |
|
|
- **SQLi recall at 0.73**: High threshold (0.74) trades recall for precision. Lower to 0.60 if SQLi detection is critical.
- **Adversarial robustness**: Fuzzed/encoded payloads have lower recall (`test_mixed_adversarial` Macro F1 = 0.499).
- **No session-level model**: Classifies individual requests. Session features help but don't replace session analysis.
- **Sequence truncation**: Requests truncated to 128 tokens. Place attack-relevant fields early in the text.
- **FP32 > INT8 on CPU**: Without VNNI, FP32 is faster. Use `model.onnx` on standard CPUs.
| |
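Lowering the SQLi threshold does not require editing `thresholds.json`; an override can be layered on top at inference time. The following is a sketch with the defaults hard-coded from the Per-Label Thresholds table. `detect` and its `overrides` argument are illustrative, and the effect of 0.60 on precision is not reported in this card, so it should be validated in shadow mode first.

```python
LABELS = ["clean", "xss", "sqli", "path_traversal",
          "command_injection", "scanner", "spam_bot"]

# Defaults copied from the Per-Label Thresholds table
DEFAULT_THRESHOLDS = {
    "clean": 0.20, "xss": 0.50, "sqli": 0.74, "path_traversal": 0.68,
    "command_injection": 0.66, "scanner": 0.70, "spam_bot": 0.72,
}

def detect(probs, overrides=None):
    """Return attack labels whose probability meets the (possibly overridden) threshold."""
    thresholds = {**DEFAULT_THRESHOLDS, **(overrides or {})}
    unknown = set(thresholds) - set(LABELS)
    if unknown:
        raise KeyError(f"unknown labels in overrides: {sorted(unknown)}")
    return [name for name, p in zip(LABELS, probs)
            if name != "clean" and p >= thresholds[name]]
```

With a SQLi probability of 0.65, `detect(probs)` reports nothing, while `detect(probs, {"sqli": 0.60})` flags `sqli`.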
| ## Citation |
| |
| ```bibtex |
| @misc{argus_sentinel_2026, |
| title = {Argus Sentinel: A Low-Latency CNN-Based WAF Classifier}, |
| author = {Fizcko}, |
| year = {2026}, |
| howpublished = {Hugging Face Model Hub}, |
| note = {V3, 1.17M params, 0.24ms latency, Macro F1 0.866, FPR 0.83\%} |
| } |
| ``` |
| |