# Fastly Compute@Edge Deployment
This directory contains artifacts for deploying `bert-tiny-injection-detector` on
[Fastly Compute@Edge](https://www.fastly.com/products/edge-compute) using
[`tract-onnx`](https://github.com/sonos/tract) in a Rust WASM service.
## Files
| File | Description |
|---|---|
| `calibrated_thresholds.json` | Calibrated block and review thresholds for the injection model |
## calibrated_thresholds.json
```json
{
  "injection": {
    "T_block_at_1pct_FPR": 0.9403,
    "T_review_lower_at_2pct_FPR": 0.8692
  }
}
```
| Threshold | Score range | Decision |
|---|---|---|
| Below `T_review` | score < 0.8692 | Allow |
| Review band | 0.8692 ≤ score < 0.9403 | Review |
| At or above `T_block` | score ≥ 0.9403 | Block |
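The three-way decision above can be sketched in Rust. The threshold values are hard-coded here for illustration; in the actual service they would be loaded from `calibrated_thresholds.json` at build or init time:

```rust
// Calibrated thresholds (values from calibrated_thresholds.json).
const T_BLOCK: f32 = 0.9403; // block threshold at 1% FPR
const T_REVIEW: f32 = 0.8692; // review lower bound at 2% FPR

#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Review,
    Block,
}

/// Map a model score to Allow / Review / Block per the table above.
fn decide(score: f32) -> Decision {
    if score >= T_BLOCK {
        Decision::Block
    } else if score >= T_REVIEW {
        Decision::Review
    } else {
        Decision::Allow
    }
}

fn main() {
    assert_eq!(decide(0.50), Decision::Allow);
    assert_eq!(decide(0.90), Decision::Review);
    assert_eq!(decide(0.95), Decision::Block);
    println!("decision logic ok");
}
```

Note that both comparisons are inclusive at the lower bound, matching the score ranges in the table.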
## ONNX requirements for tract-onnx
- Use `onnx/opset11/model.int8.onnx` (or `model.fp32.onnx` for debugging)
- **Opset 11 is required.** Opset ≥ 13 uses dynamic `Unsqueeze` axes that `tract` cannot
resolve statically. The opset-11 graph has only 2 static `Unsqueeze` nodes.
- Input tensors must be `int64` of shape `[1, 128]`
- Apply `head_tail` truncation before inference for inputs longer than 128 tokens
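A minimal sketch of `head_tail` truncation in Rust: when the token sequence exceeds the 128-token limit, keep the first `head` tokens and the last `max_len - head` tokens, dropping the middle. The 64/64 split used here is an illustrative assumption, not necessarily the split the service uses:

```rust
/// head_tail truncation: if `ids` exceeds `max_len`, keep the first
/// `head` tokens and the last `max_len - head` tokens, dropping the
/// middle. The 64/64 split in main() is an assumed example split.
fn head_tail_truncate(ids: &[i64], max_len: usize, head: usize) -> Vec<i64> {
    if ids.len() <= max_len {
        return ids.to_vec();
    }
    let tail = max_len - head;
    let mut out = Vec::with_capacity(max_len);
    out.extend_from_slice(&ids[..head]);                // head segment
    out.extend_from_slice(&ids[ids.len() - tail..]);    // tail segment
    out
}

fn main() {
    // 200 synthetic token IDs, truncated to the model's 128-token input.
    let ids: Vec<i64> = (0..200).collect();
    let truncated = head_tail_truncate(&ids, 128, 64);
    assert_eq!(truncated.len(), 128);
    assert_eq!(truncated[0], 0);     // start of sequence preserved
    assert_eq!(truncated[127], 199); // end of sequence preserved
    println!("head_tail ok");
}
```

In the real pipeline, special tokens such as `[CLS]` and `[SEP]` would also need to survive truncation, so the effective head/tail budgets are slightly smaller than 128.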
## Memory and latency
Measured on Fastly Compute@Edge (production, service v11: opt-level=3, Wizer pre-init, simd128):
| Metric | Value |
|---|---|
| Median inference | ~69 ms |
| Median total service elapsed | ~70 ms |
| p95 total service elapsed | ~85 ms |
| Memory footprint | under the 128 MB budget |
Inference time exceeds Fastly's nominal 50 ms CPU budget by roughly 1.4×. This is WASM
overhead: INT8 SIMD paths are not accelerated inside the sandbox. The service remains
functional at this latency. Wizer pre-initialization eliminates the lazy-static
initialization cost (~163 ms in earlier service versions); the remaining time is pure
BERT inference.