+---
+language: en
+license: mit
+tags:
+  - privacy
+  - web-tracking
+  - tracker-detection
+  - tabular-classification
+  - browser-fingerprinting
+  - safetensors
+  - wasm
+datasets:
+  - olafuraron/tracker-radar-ml
+metrics:
+  - f1
+  - roc_auc
+  - precision
+  - recall
+---
+# Tracker Classifier
+A lightweight feedforward neural network for classifying third-party web
+domains as tracking or non-tracking, designed for on-device inference via
+WebAssembly.
+## Model Description
+- **Architecture**: Feedforward NN (input -> 128 -> 64 -> 2) with ReLU and dropout
+- **Size**: 181 KB (safetensors)
+- **Input**: 295 behavioral and metadata features from DuckDuckGo Tracker Radar
+- **Output**: Binary classification (0 = non-tracking, 1 = tracking)
+- **Training data**: 12,932 domains (80% of labeled set)
+- **Deployment target**: Kjarni inference engine compiled to WASM with SIMD128
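The reported file size is consistent with the layer widths above if the weights are stored as 32-bit floats. That storage type is an assumption (the card does not state it), and the sketch below only does the arithmetic; the constant and function names are illustrative.

```python
# Layer widths from the architecture line: input -> 128 -> 64 -> 2.
LAYER_DIMS = [295, 128, 64, 2]

# ASSUMPTION: weights are stored as float32 (4 bytes each).
BYTES_PER_PARAM = 4


def count_parameters(dims):
    """Count weights plus biases for a stack of fully connected layers."""
    total = 0
    for fan_in, fan_out in zip(dims, dims[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix + bias vector
    return total


n_params = count_parameters(LAYER_DIMS)
size_kib = n_params * BYTES_PER_PARAM / 1024

print(f"parameters: {n_params}")            # 46274
print(f"approx. size: {size_kib:.1f} KiB")  # 180.8
```

Under that assumption the raw tensors come to about 180.8 KiB, and the safetensors JSON header would account for the small remainder up to the listed 181 KB.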
+## Performance (5-fold CV)
+| Model | F1 | Precision | Recall | ROC-AUC |
+|-------|-----|-----------|--------|---------|
+| **This model (Feedforward NN)** | 0.848 +/- 0.017 | 0.804 +/- 0.037 | 0.899 +/- 0.006 | 0.928 +/- 0.008 |
+| Random Forest | 0.895 +/- 0.003 | 0.895 +/- 0.006 | 0.895 +/- 0.006 | 0.958 +/- 0.002 |
+| XGBoost | 0.893 +/- 0.004 | 0.887 +/- 0.006 | 0.899 +/- 0.004 | 0.959 +/- 0.002 |
+| FP Heuristic (score >= 2)* | 0.355 | 0.579 | 0.257 | n/a |
+*The fingerprinting heuristic targets browser API fingerprinting specifically,
+not general tracking. The comparison demonstrates the gap between single-vector
+and multi-vector detection.*
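F1 is the harmonic mean of precision and recall. As a quick sanity check on the table, the sketch below recomputes it for the heuristic row, which reports single values rather than fold statistics. For the cross-validated rows, the F1 column presumably averages per-fold F1 scores (an assumption), so it need not equal the F1 of the averaged precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the FP heuristic row in the table above.
heuristic_f1 = f1_score(0.579, 0.257)
print(round(heuristic_f1, 3))  # 0.356, within rounding of the table's 0.355
```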
+## Files
+- `tracker_classifier.safetensors`: Model weights (181 KB)
+- `config.json`: Architecture config, feature names, scaler parameters
+- `scaler.joblib`: Sklearn StandardScaler for feature normalization
+- `results.json`: Full evaluation metrics
+## Usage
+```python
+import json
+
+import numpy as np
+import torch
+from safetensors.torch import load_file
+
+weights = load_file("tracker_classifier.safetensors")
+with open("config.json") as f:
+    config = json.load(f)
+
+
+class TrackerClassifier(torch.nn.Module):
+    # Dropout is omitted here: it has no parameters and is inactive at inference time.
+    def __init__(self, input_dim, hidden_dim=128):
+        super().__init__()
+        self.layer1 = torch.nn.Linear(input_dim, hidden_dim)
+        self.layer2 = torch.nn.Linear(hidden_dim, hidden_dim // 2)
+        self.layer3 = torch.nn.Linear(hidden_dim // 2, 2)
+        self.relu = torch.nn.ReLU()
+
+    def forward(self, x):
+        x = self.relu(self.layer1(x))
+        x = self.relu(self.layer2(x))
+        return self.layer3(x)
+
+
+model = TrackerClassifier(input_dim=config["input_dim"])
+model.load_state_dict(weights)
+model.eval()
+
+# Classify (standardize features first)
+features = np.array([...])  # replace [...] with the 295 feature values
+mean = np.array(config["scaler_mean"])
+scale = np.array(config["scaler_scale"])
+features_scaled = (features - mean) / scale
+
+with torch.no_grad():
+    # Cast explicitly to float32 to match the model weights.
+    x = torch.tensor(features_scaled, dtype=torch.float32).unsqueeze(0)
+    logits = model(x)
+    prediction = logits.argmax(dim=1).item()
+    # 0 = non-tracking, 1 = tracking
+```
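Taking the `argmax` of two logits is equivalent (apart from exact ties) to thresholding the softmax probability of the tracking class at 0.5. Because the performance table reports higher recall than precision for this model, a deployment that wants fewer false positives could raise that threshold. The following is a minimal, dependency-free sketch; the function names, the example logits, and the 0.8 threshold are illustrative and not part of the released code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, threshold=0.5):
    """Return (label, p_tracking); label is 1 when p_tracking >= threshold."""
    p_tracking = softmax(logits)[1]
    label = 1 if p_tracking >= threshold else 0
    return label, p_tracking


# Hypothetical logits for one domain, ordered [non-tracking, tracking].
example_logits = [0.2, 1.1]
label_default, p = classify(example_logits)                 # threshold 0.5
label_strict, _ = classify(example_logits, threshold=0.8)   # stricter cut-off
print(label_default, label_strict, round(p, 3))
```

Any threshold other than 0.5 should be chosen on held-out data, since it trades recall for precision.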
+## On-Device Inference
+This model is designed for deployment via
+[Kjarni](https://github.com/olafurjohannsson/kjarni), compiled to
+WebAssembly with SIMD128 acceleration. The 181 KB safetensors file and
+three matrix multiplications make it suitable for real-time in-browser
+classification with no data leaving the device.
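To make the "three matrix multiplications" concrete, here is a dependency-free sketch of the forward pass an inference engine has to perform: three dense layers with ReLU after the first two. It does not reflect Kjarni's actual implementation or API; the helper names and the toy weights are made up for illustration.

```python
def dense(x, weights, bias):
    """Compute W @ x + b, with W given as a list of rows (out_dim x in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def forward(x, layers):
    """Apply each (weights, bias) pair; ReLU follows every layer but the last."""
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b)
        if i < len(layers) - 1:
            x = relu(x)
    return x


# Toy 2 -> 2 -> 2 -> 2 network standing in for 295 -> 128 -> 64 -> 2.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, -1.0]),
    ([[1.0, 1.0], [-1.0, 2.0]], [0.5, 0.0]),
]
logits = forward([2.0, 1.0], toy_layers)
print(logits)  # [2.0, 0.0]

# Multiply-accumulate operations per classification at the card's layer sizes.
macs = sum(i * o for i, o in zip([295, 128, 64], [128, 64, 2]))
print(macs)  # 46080
```

At roughly 46k multiply-adds per domain, the compute cost per classification is small, which is in line with the real-time claim above.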
+## Limitations
+- Trained on a point-in-time snapshot of Tracker Radar (US region)
+- Metadata features (entity ownership) can cause false positives for CDN domains owned by large companies
+- Requires periodic retraining as tracking techniques evolve
+- Tree-based models (RF, XGBoost) outperform this model on F1, precision, and ROC-AUC (recall is comparable), but cannot run in WASM
+## Source
+Code and methodology: [github.com/olafurjohannsson/tracker-ml](https://github.com/olafurjohannsson/tracker-ml)