	---
	language: en
	license: apache-2.0
	tags:
	- waf
	- web-security
	- onnx
	- multi-label-classification
	- low-latency
	library_name: onnxruntime
	pipeline_tag: text-classification
	---

	# Argus Sentinel — WAF ML Classifier (V3)

	Production-grade Web Application Firewall classifier. Detects 6 attack types in HTTP requests with sub-millisecond latency on CPU.

	Key metrics (test_realistic — production-like distribution, 94% clean):
	- Macro F1: 0.866 | FPR: 0.83% | Mean attack recall: 0.889 | Latency: 0.24 ms

	---

	## Model Overview

	| Property | Value |
	|---|---|
	| Architecture | CNN text encoder + numeric features fusion |
	| Parameters | 1.17M |
	| Vocab size | 8,192 (BPE ByteLevel) |
	| Max sequence length | 128 tokens |
	| ONNX model size | 4.5 MB (FP32) / 1.2 MB (INT8) |
	| Inference latency | 0.24 ms avg (CPU, single thread) |
	| Training loss | Focal BCEWithLogitsLoss (gamma=2.0) |
	| Best epoch | 3 / 8 (early stopping, selected on Macro F1) |
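The training loss named in the table can be illustrated with a minimal pure-Python sketch. It assumes the common focal formulation, `(1 - p_t)^gamma * BCE`, with hard 0/1 targets, no alpha class weighting, and mean reduction; the card only states `gamma=2.0`, so the function name and those other choices are assumptions rather than the training code.

```python
import math

def focal_bce_with_logits(logits, targets, gamma=2.0):
    """Mean focal binary cross-entropy over flat lists of logits and 0/1 targets.

    Assumed formulation: (1 - p_t)^gamma * BCE, without alpha weighting.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Numerically stable BCE-with-logits: max(z, 0) - z*y + log(1 + exp(-|z|))
        bce = max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        # For hard targets BCE = -log(p_t), so p_t can be recovered from it.
        p_t = math.exp(-bce)
        # Easy examples (p_t close to 1) are down-weighted by (1 - p_t)^gamma.
        total += (1.0 - p_t) ** gamma * bce
    return total / len(logits)
```

With `gamma=0` the expression reduces to plain BCE, which is a quick way to sanity-check an implementation.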

	---

	## Architecture

	```
	HTTP Request Text
	       |
	       v
	[BPE Tokenizer (vocab=8192, max_len=128)]
	       |
	       +---> [Embedding (128-dim)]
	       |          |
	       |     [Conv1D (128 ch, k=3) + BatchNorm + ReLU] x2
	       |          |
	       |     [AdaptiveMaxPool1d → 128-dim]
	       |
	       +---> [6 Numeric Features]
	                  |
	             [Linear 6→32 + ReLU]
	                  |
	             [Concatenate (128 + 32 = 160)]
	                  |
	             [Linear 160→128→64 + ReLU + Dropout(0.1)]
	                  |
	        +---------+---------+
	        |                   |
	 [Label Head → 7]    [Risk Head → 1]
	        |                   |
	    [Sigmoid]           [Sigmoid]
	        |                   |
	 label_probs [7]     risk_score [1]
	```

	---

	## Tokenizer Specification

	| Property | Value |
	|---|---|
	| Type | BPE (Byte-Pair Encoding) via HuggingFace `tokenizers` library |
	| Algorithm | `ByteLevel` BPE (operates on UTF-8 bytes, not characters) |
	| Pre-tokenizer | `ByteLevel` (add_prefix_space=false, trim_offsets=true, use_regex=true) |
	| Normalizer | None (raw bytes, no lowercasing or unicode normalization) |
	| Post-processor | `TemplateProcessing` (prepends the `[CLS]` token automatically) |
	| Vocab size | 8,192 tokens (7,933 merges + 3 special tokens + 256 byte tokens) |
	| Special tokens | `[PAD]` (id=0), `[UNK]` (id=1), `[CLS]` (id=2) |
	| Max length | 128 tokens (truncation=Right, padding=Right to fixed 128) |
	| Byte fallback | false (unknown bytes map to `[UNK]`) |
	| File | `tokenizer.json` (HuggingFace tokenizers JSON format) |

	Input text construction: `"{method} {path}?{query} {body[:200]}"` — capped at 500 chars before tokenization.

	```python
	from tokenizers import Tokenizer
	tok = Tokenizer.from_file("tokenizer.json")
	result = tok.encode("GET /search?q=test HTTP/1.1")
	# result.ids → [2, 546, 287, ...] (starts with [CLS]=2)
	# result.attention_mask → [1, 1, 1, ...]
	```

	---

	## Label ID Mapping (CRITICAL)

	Output `label_probs` tensor shape: `[batch, 7]`. Each index maps to:

	| Index | Label | Description |
	|---|---|---|
	| 0 | `clean` | Benign / legitimate request |
	| 1 | `xss` | Cross-Site Scripting |
	| 2 | `sqli` | SQL Injection |
	| 3 | `path_traversal` | Directory / Path Traversal |
	| 4 | `command_injection` | OS Command Injection |
	| 5 | `scanner` | Vulnerability scanner / probe |
	| 6 | `spam_bot` | Spam bot / automated abuse |

	> Multi-label: Labels are NOT mutually exclusive. Multiple labels can be active simultaneously (e.g., index 2 + 5 = scanner performing SQLi). Exception: `clean` (index 0) is exclusive — if clean=1, all others must be 0.
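The exclusivity rule can be enforced after thresholding with a small helper. This is a sketch only: `resolve_labels` is a hypothetical name, and the tie-break (any active attack label overrides `clean`) is an assumption that mirrors the `is_clean` logic in the usage example, not a rule stated by the card.

```python
ATTACK_LABELS = ("xss", "sqli", "path_traversal",
                 "command_injection", "scanner", "spam_bot")

def resolve_labels(active):
    """Return a label list that respects the clean-exclusivity rule.

    `active` is any collection of label names that passed their thresholds.
    """
    attacks = [name for name in ATTACK_LABELS if name in active]
    # Assumed tie-break: an attack label always wins over `clean`.
    return attacks if attacks else ["clean"]
```

For example, `resolve_labels({"clean", "sqli", "scanner"})` keeps the two attack labels and drops `clean`, while an empty set resolves to `["clean"]`.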

	---

	## ONNX Inputs

	| Name | Shape | Dtype | Description |
	|---|---|---|---|
	| `input_ids` | `[batch, 128]` | `int32` | BPE token IDs from `tokenizer.json` |
	| `attention_mask` | `[batch, 128]` | `int32` | 1 for real tokens, 0 for `[PAD]` |
	| `numeric_features` | `[batch, 6]` | `float32` | Request-level features (RAW values, see below) |
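The fixed `[batch, 128]` shape means every sequence must be right-truncated and right-padded before inference. `tokenizer.json` is described as doing this already, so the helper below is only a sketch for cases where token IDs come from another tokenizer runtime; `pad_or_truncate` is a hypothetical name.

```python
MAX_LEN = 128  # fixed sequence length expected by the ONNX graph
PAD_ID = 0     # id of [PAD] in tokenizer.json

def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask) as lists of exactly `max_len` ints."""
    ids = list(token_ids)[:max_len]      # right truncation
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_len - len(ids)
    # Right padding: [PAD] ids with a 0 in the attention mask.
    return ids + [pad_id] * padding, mask + [0] * padding
```

Wrap each returned list in an outer list and cast to `int32` (for example with `numpy.array(..., dtype=numpy.int32)`) before passing it to the session.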

	## Numeric Features — Normalization Parameters (CRITICAL)

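	Features are passed as RAW values. The model was trained on unnormalized features, so pass the same raw scale at inference and do not standardize the inputs. The Mean and Std columns below describe the training data and are listed for reference.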

	| Index | Feature | Computation | Training Range | Mean | Std |
	|---|---|---|---|---|---|
	| 0 | `content_length` | `len(body)` if body else `0` | 0 – 662 | 35.2 | 63.5 |
	| 1 | `num_headers` | `len(headers_dict)` | 3 – 13 | 7.7 | 1.3 |
	| 2 | `has_body` | `1.0` if body present, else `0.0` | 0 – 1 | 0.42 | 0.49 |
	| 3 | `session_request_count` | Total requests in session, or `0` | 0 – 20 | 3.0 | 6.0 |
	| 4 | `session_duration` | Session time span in seconds, or `0` | 0 – 4,965,381 | 619,322 | 1,198,151 |
	| 5 | `session_pattern_score` | Behavioral pattern score, or `0` | 0 – 0.5 | 0.09 | 0.18 |

	Python:
	```python
	def extract_numeric_features(request: dict) -> list[float]:
	    """Build the 6 raw numeric features in the index order of the table above."""
	    body = request.get("body") or ""
	    headers = request.get("headers") or {}
	    return [
	        float(len(body)),                                  # content_length
	        float(len(headers)),                               # num_headers
	        1.0 if body else 0.0,                              # has_body
	        float(request.get("session_request_count") or 0),  # session_request_count
	        float(request.get("session_duration") or 0),       # session_duration
	        float(request.get("session_pattern_score") or 0),  # session_pattern_score
	    ]
	```

	Rust:
	```rust
	fn extract_numeric_features(request: &HttpRequest) -> [f32; 6] {
	    // A missing body counts as length 0.
	    let body_len = request.body.as_ref().map_or(0, |b| b.len());
	    [
	        body_len as f32,                                   // content_length
	        request.headers.len() as f32,                      // num_headers
	        if body_len > 0 { 1.0 } else { 0.0 },              // has_body
	        request.session_request_count.unwrap_or(0) as f32, // session_request_count
	        request.session_duration.unwrap_or(0.0),           // session_duration
	        request.session_pattern_score.unwrap_or(0.0),      // session_pattern_score
	    ]
	}
	```

	> If you don't have session data, pass `[content_length, num_headers, has_body, 0.0, 0.0, 0.0]` — ~79% of training examples had null session features.
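Because the features are fed in raw, values far outside the training range are inputs the model never saw. The card does not say how predictions behave there, so the following is only a monitoring sketch: it flags features that fall outside the Training Range column of the table above. `out_of_range_features` is a hypothetical helper name.

```python
# (name, min, max) copied from the Training Range column of the feature table
TRAINING_RANGES = [
    ("content_length", 0.0, 662.0),
    ("num_headers", 3.0, 13.0),
    ("has_body", 0.0, 1.0),
    ("session_request_count", 0.0, 20.0),
    ("session_duration", 0.0, 4_965_381.0),
    ("session_pattern_score", 0.0, 0.5),
]

def out_of_range_features(features):
    """Return the names of features whose raw value lies outside the training range."""
    return [
        name
        for (name, low, high), value in zip(TRAINING_RANGES, features)
        if not low <= value <= high
    ]
```

Logging the returned names during shadow mode gives a rough picture of how often production traffic leaves the distribution the model was trained on.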

	## ONNX Outputs

	| Name | Shape | Dtype | Description |
	|---|---|---|---|
	| `label_probs` | `[batch, 7]` | `float32` | Per-label probabilities after sigmoid |
	| `risk_score` | `[batch, 1]` | `float32` | Aggregate risk score [0, 1] |

	---

	## Per-Label Thresholds (CRITICAL for deployment)

	Do NOT use a default 0.5 threshold for all labels. Use these optimized thresholds from `thresholds.json`:

	| Label | Threshold | Recall | Precision | F1 |
	|-------|-----------|--------|-----------|-----|
	| clean | 0.20 | 0.998 | 0.992 | 0.995 |
	| xss | 0.50 | 0.951 | 0.585 | 0.724 |
	| sqli | 0.74 | 0.732 | 0.940 | 0.823 |
	| path_traversal | 0.68 | 0.896 | 0.794 | 0.842 |
	| command_injection | 0.66 | 0.826 | 0.626 | 0.712 |
	| scanner | 0.70 | 0.980 | 0.945 | 0.962 |
	| spam_bot | 0.72 | 1.000 | 1.000 | 1.000 |
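To illustrate why a uniform 0.5 cut-off gives different decisions, here is a minimal sketch that applies the per-label thresholds to one row of `label_probs`. The threshold values are copied from the table above (in deployment they would be loaded from `thresholds.json`); `triggered_labels` and the sample probabilities are made up for illustration.

```python
LABELS = ["clean", "xss", "sqli", "path_traversal",
          "command_injection", "scanner", "spam_bot"]

# Copied from the table above; load from thresholds.json in deployment.
THRESHOLDS = {
    "clean": 0.20, "xss": 0.50, "sqli": 0.74, "path_traversal": 0.68,
    "command_injection": 0.66, "scanner": 0.70, "spam_bot": 0.72,
}

def triggered_labels(probs, thresholds=THRESHOLDS):
    """Return attack labels whose probability meets that label's own threshold."""
    return [
        name
        for name, p in zip(LABELS, probs)
        if name != "clean" and p >= thresholds[name]
    ]

# Hypothetical model output for a single request.
sample = [0.05, 0.10, 0.60, 0.02, 0.01, 0.71, 0.00]
per_label = triggered_labels(sample)
uniform = triggered_labels(sample, {name: 0.5 for name in LABELS})
```

Here `per_label` contains only `scanner`, because the `sqli` probability of 0.60 is below its 0.74 threshold, whereas the uniform 0.5 cut-off would also flag `sqli`.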

	---

	## Performance

	### Production-Like (test_realistic — 25,000 examples, 94% clean)

	| Metric | Value |
	|---|---|
	| Macro F1 | 0.866 |
	| FPR on clean | 0.83% |
	| Mean attack recall | 0.889 |

	| Label | Recall | Precision | F1 |
	|-------|--------|-----------|-----|
	| clean | 0.998 | 0.992 | 0.995 |
	| xss | 0.951 | 0.585 | 0.724 |
	| sqli | 0.732 | 0.940 | 0.823 |
	| path_traversal | 0.896 | 0.794 | 0.842 |
	| command_injection | 0.826 | 0.626 | 0.712 |
	| scanner | 0.980 | 0.945 | 0.962 |
	| spam_bot | 1.000 | 1.000 | 1.000 |
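The F1 column and the Macro F1 figure follow from the precision and recall columns. The sketch below recomputes them from the rounded table values, assuming Macro F1 is the unweighted mean of the seven per-label F1 scores (the usual definition; the card does not spell it out).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the per-label table above
PER_LABEL = {
    "clean": (0.992, 0.998),
    "xss": (0.585, 0.951),
    "sqli": (0.940, 0.732),
    "path_traversal": (0.794, 0.896),
    "command_injection": (0.626, 0.826),
    "scanner": (0.945, 0.980),
    "spam_bot": (1.000, 1.000),
}

per_label_f1 = {name: f1_score(p, r) for name, (p, r) in PER_LABEL.items()}
# Macro F1: unweighted mean over all seven labels, including `clean`.
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)
```

Since the inputs are already rounded to three decimals, the recomputed values agree with the table only to about three decimals.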

	### Stratified Stress Test (test — 49,830 examples)

	| Metric | Value |
	|---|---|
	| Macro F1 | 0.787 |
	| FPR on clean | 7.9% |

	### Adversarial Robustness (test_mixed_adversarial — 22,250 examples)

	| Metric | Value |
	|---|---|
	| Macro F1 | 0.499 |
	| FPR on clean | 11.8% |
	| XSS recall | 0.833 |

	### Latency (ONNX Runtime, CPU, 1 thread, batch=1)

	| Metric | FP32 | INT8 |
	|--------|------|------|
	| Average | 0.24 ms | 1.20 ms |
	| Throughput | ~4,100 req/s | ~830 req/s |

	> On CPU without VNNI, FP32 is faster than dynamic INT8. Use `model.onnx` on standard CPUs.
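To compare the two model files on your own hardware, a timing loop with warm-up runs is usually enough. The card does not describe its measurement protocol, so the sketch below is a generic approach rather than the author's benchmark; `mean_latency_ms` and `throughput_rps` are hypothetical helpers, and `fn` stands for any zero-argument callable that performs one inference.

```python
import time

def mean_latency_ms(fn, warmup=50, iters=1000):
    """Average wall-clock time of `fn()` in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()                      # warm caches and lazy initialization
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) * 1000.0 / iters

def throughput_rps(latency_ms):
    """Single-thread, batch=1 throughput implied by a mean latency."""
    return 1000.0 / latency_ms
```

As a consistency check on the table, `throughput_rps(1.20)` is roughly 833 requests per second, in line with the quoted ~830 req/s for INT8.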

	---

	## Training

	| Hyperparameter | V1 | V3 (current) |
	|---|---|---|
	| Loss | BCEWithLogitsLoss | Focal BCE (gamma=2.0) |
	| Learning rate | 1e-3 | 1e-4 |
	| Batch size | 256 | 128 |
	| Epochs | 5 | 8 (early stop at 6, best=3) |
	| Patience | 2 | 3 |
	| Checkpoint selection | Best val_loss | Best Macro F1 |
	| Calibration | None | Per-label threshold tuning |
	| Data augmentation | None | +20k augmented (encoding, headers, noise, context swap) |
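The "early stop at 6, best=3" entry is what a patience of 3 produces when validation Macro F1 peaks at epoch 3. A minimal sketch of that rule follows; the function name is hypothetical and the scores are invented for illustration, not taken from `training_history.json`.

```python
def early_stopping(val_scores, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to beat the best score.
    """
    best_epoch, best_score = 0, float("-inf")
    for epoch, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch          # patience exhausted
    return best_epoch, len(val_scores)        # ran through every epoch

# Illustrative validation Macro F1 per epoch (made-up numbers, 8 epochs planned).
scores = [0.80, 0.84, 0.86, 0.85, 0.855, 0.85, 0.84, 0.83]
best, stopped = early_stopping(scores, patience=3)
```

With these scores the best checkpoint is epoch 3 and training halts at epoch 6, so epochs 7 and 8 never run.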

	### Dataset

	| Property | Value |
	|---|---|
	| Total examples | 498,345 (+20k augmented) |
	| Training split | 418,685 |
	| Real traffic | 62.6% |
	| Synthetic | 37.4% |
	| Multi-label | 17.0% |
	| Hard negatives | 15.0% |
	| Unique sources | 12 |
	| Sources | CIC-IDS-2017, CSE-CIC-IDS-2018, HIKARI-2021, WebAttackPayloads, PayloadsAllTheThings, + synthetic |

	---

	## Usage

	### Python (ONNX Runtime)

	```python
	import json

	import numpy as np
	import onnxruntime as ort
	from tokenizers import Tokenizer

	# Load model, tokenizer, and per-label thresholds
	session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
	tokenizer = Tokenizer.from_file("tokenizer.json")
	with open("thresholds.json") as f:
	    thresholds = json.load(f)["thresholds"]
	label_names = ["clean", "xss", "sqli", "path_traversal",
	               "command_injection", "scanner", "spam_bot"]

	def classify_request(method, path, query, headers, body):
	    # 1. Build text: "{method} {path}?{query} {body[:200]}", capped at 500 chars
	    text = f"{method} {path}"
	    if query:
	        text += f"?{query}"
	    if body:
	        text += f" {body[:200]}"
	    text = text[:500]

	    # 2. Tokenize (tokenizer.json truncates and pads to 128 tokens)
	    enc = tokenizer.encode(text)
	    input_ids = np.array([enc.ids], dtype=np.int32)
	    attention_mask = np.array([enc.attention_mask], dtype=np.int32)

	    # 3. Numeric features (RAW values)
	    numeric = np.array([[
	        float(len(body or "")),
	        float(len(headers or {})),
	        1.0 if body else 0.0,
	        0.0, 0.0, 0.0,  # session features (0 if unavailable)
	    ]], dtype=np.float32)

	    # 4. Inference
	    probs, risk = session.run(None, {
	        "input_ids": input_ids,
	        "attention_mask": attention_mask,
	        "numeric_features": numeric,
	    })

	    # 5. Apply per-label thresholds (attack labels only)
	    detections = {
	        name: float(probs[0][i])
	        for i, name in enumerate(label_names)
	        if name != "clean" and probs[0][i] >= thresholds[name]
	    }

	    return {
	        "risk_score": float(risk[0][0]),
	        "detections": detections,
	        "is_clean": len(detections) == 0,
	    }

	# Example
	result = classify_request("GET", "/search", "q=' OR 1=1--", {"Host": "example.com"}, None)
	print(result)
	# {'risk_score': 0.87, 'detections': {'sqli': 0.94}, 'is_clean': False}
	```

	### Rust (ort crate)

	```rust
	use ort::Session;
	use ndarray::Array2;

	// Note: ort method names such as `with_model_from_file` and `extract_tensor`
	// differ between releases; adjust them to the version you pin.
	fn main() -> anyhow::Result<()> {
	    let session = Session::builder()?
	        .with_model_from_file("model.onnx")?;

	    let input_ids = Array2::<i32>::zeros((1, 128)); // from tokenizer
	    let attention_mask = Array2::<i32>::zeros((1, 128)); // from tokenizer
	    let numeric_features = Array2::<f32>::zeros((1, 6)); // extract_numeric_features()

	    let outputs = session.run(ort::inputs![
	        "input_ids" => &input_ids,
	        "attention_mask" => &attention_mask,
	        "numeric_features" => &numeric_features,
	    ]?)?;

	    let label_probs: Vec<f32> = outputs[0].extract_tensor::<f32>()?.view().iter().copied().collect();
	    let risk_score: f32 = *outputs[1].extract_tensor::<f32>()?.view().first().unwrap();

	    // Apply thresholds from thresholds.json (same index order as `labels`)
	    let thresholds = [0.20, 0.50, 0.74, 0.68, 0.66, 0.70, 0.72];
	    let labels = ["clean", "xss", "sqli", "path_traversal",
	                  "command_injection", "scanner", "spam_bot"];

	    for (i, (prob, thr)) in label_probs.iter().zip(thresholds.iter()).enumerate() {
	        // Index 0 is `clean`; only attack labels are reported.
	        if i > 0 && prob >= thr {
	            println!("DETECTED: {} ({:.3})", labels[i], prob);
	        }
	    }
	    println!("Risk score: {:.4}", risk_score);

	    Ok(())
	}
	```

	### Decision Logic

	```python
	thresholds = json.load(open("thresholds.json"))["thresholds"]

	# Per-label detection
	triggered = [name for i, name in enumerate(label_names)
	if name != "clean" and probs[0][i] >= thresholds[name]]

	# Risk-score action
	score = float(risk[0][0])
	if score >= 0.8: action = "BLOCK"
	elif score >= 0.5: action = "CHALLENGE"
	elif score >= 0.2: action = "LOG"
	else: action = "ALLOW"
	```

	---

	## Version History

	### V3 (current) — Production-Hardened

	Fixed V2 recall collapse. Multi-checkpoint selection on Macro F1. Per-label threshold optimization replaces Platt scaling.

	### V2 — Focal Loss + Calibration (superseded)

	Introduced Focal Loss and Platt calibration. FPR dropped to 0.18% but XSS recall collapsed to 0.016 and CMDi to 0.222 due to aggressive calibration.

	### V1 — Baseline

	BCE loss, fixed 0.5 thresholds. High recall (~0.98) but lower Macro F1 (0.828) than V3, at the same FPR (0.83%).

	| Metric | V1 | V2 | V3 |
	|--------|-----|-----|--------|
	| Macro F1 | 0.828 | 0.669 | 0.866 |
	| FPR | 0.83% | 0.18% | 0.83% |
	| XSS recall | 0.980 | 0.016 | 0.951 |
	| CMDi recall | 0.985 | 0.222 | 0.826 |
	| Latency | 0.77 ms | 0.99 ms | 0.24 ms |

	---

	## Deployment Strategy

	Phase 1 (Shadow Mode): Deploy alongside existing WAF rules, log predictions, compare decisions against the rules, and tune thresholds.

	Phase 2 (Safe Blocking): Enable blocking for the high-recall classes (scanner 0.98, spam_bot 1.00, xss 0.95) and monitor FPR. XSS precision is only 0.585 on test_realistic, so watch its false positives in particular.

	Phase 3 (Full Deployment): Activate all labels with `thresholds.json` and use the risk-score actions (BLOCK/CHALLENGE/LOG/ALLOW).
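One way to wire the three phases into a single enforcement function is sketched below. The phase numbering, `enforce`, and `BLOCKING_LABELS_PHASE_2` are hypothetical names; the risk-score cut-offs are the ones from the Decision Logic snippet, and the rest is an assumption about how a deployment might gate actions.

```python
# Classes named for Phase 2 blocking in the rollout plan
BLOCKING_LABELS_PHASE_2 = {"scanner", "spam_bot", "xss"}

def risk_action(score):
    """Map the aggregate risk score to an action (cut-offs from Decision Logic)."""
    if score >= 0.8:
        return "BLOCK"
    if score >= 0.5:
        return "CHALLENGE"
    if score >= 0.2:
        return "LOG"
    return "ALLOW"

def enforce(phase, detections, risk_score):
    """Return the action to apply; `detections` maps triggered labels to probabilities."""
    if phase == 1:
        return "ALLOW"  # shadow mode: predictions are logged elsewhere, never enforced
    if phase == 2:
        # Block only when one of the Phase 2 classes fired.
        return "BLOCK" if BLOCKING_LABELS_PHASE_2 & set(detections) else "ALLOW"
    return risk_action(risk_score)  # phase 3: full risk-score policy
```

Keeping the phase as an explicit parameter makes it easy to roll back from blocking to shadow mode without touching the model or thresholds.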

	---

	## Artifacts

	| File | Size | Description |
	|---|---|---|
	| `model.onnx` | 4.5 MB | Production model (FP32, fastest on CPU) |
	| `model_int8.onnx` | 1.2 MB | INT8 quantized (for VNNI hardware) |
	| `model_optimized.onnx` | 4.5 MB | Graph-optimized FP32 |
	| `tokenizer.json` | 510 KB | BPE tokenizer |
	| `config.json` | 1.5 KB | Architecture + training config |
	| `thresholds.json` | 1.3 KB | Per-label thresholds (must use at inference) |
	| `metrics.json` | 12 KB | Full 3-set evaluation results |
	| `training_history.json` | 7.3 KB | Per-epoch training history |

	## Known Limitations

	- SQLi recall at 0.73: High threshold (0.74) trades recall for precision. Lower to 0.60 if SQLi detection is critical.
	- Adversarial robustness: Fuzzed/encoded payloads have lower recall (test_mixed_adversarial Macro F1 = 0.499).
	- No session-level model: Classifies individual requests. Session features help but don't replace session analysis.
	- Sequence truncation: Requests truncated to 128 tokens. Place attack-relevant fields early in the text.
	- FP32 > INT8 on CPU: Without VNNI, FP32 is faster. Use `model.onnx` on standard CPUs.

	## Citation

	```bibtex
	@misc{argus_sentinel_2026,
	title = {Argus Sentinel: A Low-Latency CNN-Based WAF Classifier},
	author = {Fizcko},
	year = {2026},
	howpublished = {Hugging Face Model Hub},
	note = {V3, 1.17M params, 0.24ms latency, Macro F1 0.866, FPR 0.83\%}
	}
	```