# Fastly Compute@Edge Deployment
This directory contains artifacts for deploying `bert-tiny-injection-detector` on
[Fastly Compute@Edge](https://www.fastly.com/products/edge-compute) using
[`tract-onnx`](https://github.com/sonos/tract) in a Rust WASM service.
## Files
| File | Description |
|---|---|
| `calibrated_thresholds.json` | Calibrated block and review thresholds for the injection model |
## calibrated_thresholds.json
```json
{
  "injection": {
    "T_block_at_1pct_FPR": 0.9403,
    "T_review_lower_at_2pct_FPR": 0.8692
  }
}
```
| Threshold | Score range | Decision |
|---|---|---|
| Below `T_review` | score < 0.8692 | Allow |
| Review band | 0.8692 ≤ score < 0.9403 | Review |
| At or above `T_block` | score ≥ 0.9403 | Block |
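The three-way decision above can be sketched in Rust. The threshold values are hard-coded here for illustration; in the actual service they would be loaded from `calibrated_thresholds.json` at build or init time:

```rust
// Calibrated thresholds (values from calibrated_thresholds.json).
const T_BLOCK: f32 = 0.9403; // block threshold at 1% FPR
const T_REVIEW: f32 = 0.8692; // review lower bound at 2% FPR

#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Review,
    Block,
}

/// Map a model score to Allow / Review / Block per the table above.
fn decide(score: f32) -> Decision {
    if score >= T_BLOCK {
        Decision::Block
    } else if score >= T_REVIEW {
        Decision::Review
    } else {
        Decision::Allow
    }
}

fn main() {
    assert_eq!(decide(0.50), Decision::Allow);
    assert_eq!(decide(0.90), Decision::Review);
    assert_eq!(decide(0.95), Decision::Block);
    println!("decision logic ok");
}
```

Note that both comparisons are inclusive at the lower bound, matching the score ranges in the table.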
## ONNX requirements for tract-onnx
- Use `onnx/opset11/model.int8.onnx` (or `model.fp32.onnx` for debugging)
- **Opset 11 is required.** Opset ≥ 13 uses dynamic `Unsqueeze` axes that `tract` cannot
resolve statically. The opset-11 graph has only 2 static `Unsqueeze` nodes.
- Input tensors must be `int64` of shape `[1, 128]`
- Apply `head_tail` truncation before inference for inputs longer than 128 tokens
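A minimal sketch of `head_tail` truncation in Rust: when the token sequence exceeds the 128-token limit, keep the first `head` tokens and the last `max_len - head` tokens, dropping the middle. The 64/64 split used here is an illustrative assumption, not necessarily the split the service uses:

```rust
/// head_tail truncation: if `ids` exceeds `max_len`, keep the first
/// `head` tokens and the last `max_len - head` tokens, dropping the
/// middle. The 64/64 split in main() is an assumed example split.
fn head_tail_truncate(ids: &[i64], max_len: usize, head: usize) -> Vec<i64> {
    if ids.len() <= max_len {
        return ids.to_vec();
    }
    let tail = max_len - head;
    let mut out = Vec::with_capacity(max_len);
    out.extend_from_slice(&ids[..head]);                // head segment
    out.extend_from_slice(&ids[ids.len() - tail..]);    // tail segment
    out
}

fn main() {
    // 200 synthetic token IDs, truncated to the model's 128-token input.
    let ids: Vec<i64> = (0..200).collect();
    let truncated = head_tail_truncate(&ids, 128, 64);
    assert_eq!(truncated.len(), 128);
    assert_eq!(truncated[0], 0);     // start of sequence preserved
    assert_eq!(truncated[127], 199); // end of sequence preserved
    println!("head_tail ok");
}
```

In the real pipeline, special tokens such as `[CLS]` and `[SEP]` would also need to survive truncation, so the effective head/tail budgets are slightly smaller than 128.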
## Memory and latency
Measured on Fastly Compute@Edge (production, service v11: opt-level=3, Wizer pre-init, simd128):
| Metric | Value |
|---|---|
| Median inference | ~69 ms |
| Median total service elapsed | ~70 ms |
| p95 total service elapsed | ~85 ms |
| Memory footprint | under the 128 MB budget |
Inference time exceeds Fastly's nominal 50 ms CPU budget by roughly 1.4×. This is WASM
overhead: INT8 SIMD paths are not accelerated inside the sandbox. The service remains
functional at this latency. Wizer pre-initialization eliminates the lazy-static
initialization cost (~163 ms in earlier service versions); the remaining time is pure
BERT inference.