olafuraron committed on
Commit b594f1b · verified · 1 Parent(s): 53a815f

Update README.md

Files changed (1)
  1. README.md +112 -3
README.md CHANGED
@@ -1,3 +1,112 @@
- ---
- license: cc-by-sa-4.0
- ---
+ ---
+ language: en
+ license: mit
+ tags:
+ - privacy
+ - web-tracking
+ - tracker-detection
+ - tabular-classification
+ - browser-fingerprinting
+ - safetensors
+ - wasm
+ datasets:
+ - olafuraron/tracker-radar-ml
+ metrics:
+ - f1
+ - roc_auc
+ - precision
+ - recall
+ ---
+
+ # Tracker Classifier
+
+ A lightweight feedforward neural network for classifying third-party web
+ domains as tracking or non-tracking, designed for on-device inference via
+ WebAssembly.
+
+ ## Model Description
+
+ - **Architecture**: Feedforward NN (input -> 128 -> 64 -> 2) with ReLU and dropout
+ - **Size**: 181 KB (safetensors)
+ - **Input**: 295 behavioral and metadata features from DuckDuckGo Tracker Radar
+ - **Output**: Binary classification (0 = non-tracking, 1 = tracking)
+ - **Training data**: 12,932 domains (80% of labeled set)
+ - **Deployment target**: Kjarni inference engine compiled to WASM with SIMD128
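
The stated file size can be sanity-checked from the layer widths listed above. The sketch below counts weights and biases for a 295 -> 128 -> 64 -> 2 network and converts the count to kibibytes. It assumes the tensors are stored as 4-byte float32 values, which the model card does not state explicitly; the helper name `count_parameters` is illustrative only.

```python
# Layer widths taken from the model card: 295 inputs -> 128 -> 64 -> 2 outputs.
layer_dims = [295, 128, 64, 2]


def count_parameters(dims):
    """Count weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(dims, dims[1:]))


n_params = count_parameters(layer_dims)
size_kib = n_params * 4 / 1024  # assumption: float32, 4 bytes per parameter

print(n_params)            # 46274
print(round(size_kib, 1))  # 180.8
```

Under that assumption the raw tensors come to roughly 180.8 KiB, which is in line with the 181 KB figure once the small safetensors header is added.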
+
+ ## Performance (5-fold CV)
+
+ | Model | F1 | Precision | Recall | ROC-AUC |
+ |-------|-----|-----------|--------|---------|
+ | **This model (Feedforward NN)** | 0.848 +/- 0.017 | 0.804 +/- 0.037 | 0.899 +/- 0.006 | 0.928 +/- 0.008 |
+ | Random Forest | 0.895 +/- 0.003 | 0.895 +/- 0.006 | 0.895 +/- 0.006 | 0.958 +/- 0.002 |
+ | XGBoost | 0.893 +/- 0.004 | 0.887 +/- 0.006 | 0.899 +/- 0.004 | 0.959 +/- 0.002 |
+ | FP Heuristic (score >= 2)* | 0.355 | 0.579 | 0.257 | n/a |
+
+ *The fingerprinting heuristic targets browser API fingerprinting specifically,
+ not general tracking. The comparison demonstrates the gap between single-vector
+ and multi-vector detection.*
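
For readers relating the columns to each other, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values in the table; the function name `f1_from_pr` is illustrative. Small differences from the reported F1 are expected, because the table entries are rounded and the cross-validated rows average F1 per fold rather than computing it from averaged precision and recall.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the table above.
heuristic_f1 = f1_from_pr(0.579, 0.257)
nn_f1 = f1_from_pr(0.804, 0.899)

print(round(heuristic_f1, 3))
print(round(nn_f1, 3))
```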
+
+ ## Files
+
+ - `tracker_classifier.safetensors`: Model weights (181 KB)
+ - `config.json`: Architecture config, feature names, scaler parameters
+ - `scaler.joblib`: Sklearn StandardScaler for feature normalization
+ - `results.json`: Full evaluation metrics
+
+ ## Usage
+ ```python
+ import json
+
+ import numpy as np
+ import torch
+ from safetensors.torch import load_file
+
+ weights = load_file("tracker_classifier.safetensors")
+ with open("config.json") as f:
+     config = json.load(f)
+
+ class TrackerClassifier(torch.nn.Module):
+     # Dropout is omitted: it has no weights and is inactive at inference time.
+     def __init__(self, input_dim, hidden_dim=128):
+         super().__init__()
+         self.layer1 = torch.nn.Linear(input_dim, hidden_dim)
+         self.layer2 = torch.nn.Linear(hidden_dim, hidden_dim // 2)
+         self.layer3 = torch.nn.Linear(hidden_dim // 2, 2)
+         self.relu = torch.nn.ReLU()
+
+     def forward(self, x):
+         x = self.relu(self.layer1(x))
+         x = self.relu(self.layer2(x))
+         return self.layer3(x)
+
+ model = TrackerClassifier(input_dim=config["input_dim"])
+ model.load_state_dict(weights)
+ model.eval()
+
+ # Classify (standardize features first)
+ features = np.array([...])  # placeholder: supply the 295 feature values
+ mean = np.array(config["scaler_mean"])
+ scale = np.array(config["scaler_scale"])
+ features_scaled = (features - mean) / scale
+
+ with torch.no_grad():
+     inputs = torch.tensor(features_scaled, dtype=torch.float32).unsqueeze(0)
+     logits = model(inputs)
+     prediction = logits.argmax(dim=1).item()
+     # 0 = non-tracking, 1 = tracking
+ ```
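
The snippet above returns only a hard 0/1 label. If a confidence score is useful, the two logits can be turned into a probability with a softmax. The following is a minimal, dependency-free sketch of that step; the function names and the example logits are hypothetical and are not taken from the repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tracking_probability(logits):
    """Probability of class 1 (tracking) from the two output logits."""
    return softmax(logits)[1]


# Hypothetical logits for one domain, in (non-tracking, tracking) order.
example_logits = [-0.4, 1.3]
p_tracking = tracking_probability(example_logits)

# A 0.5 threshold matches the argmax decision except on an exact tie.
is_tracker = p_tracking >= 0.5
```

Raising the threshold above 0.5 would be one way to trade some recall for precision, although the metrics reported in this card apply only to the argmax decision.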
+
+ ## On-Device Inference
+
+ This model is designed for deployment via
+ [Kjarni](https://github.com/olafurjohannsson/kjarni), compiled to
+ WebAssembly with SIMD128 acceleration. The 181 KB safetensors file and
+ three matrix multiplications make it suitable for real-time in-browser
+ classification with no data leaving the device.
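
To make the "three matrix multiplications" concrete, here is a dependency-free sketch of the forward pass: three matrix-vector products with a ReLU after the first two. It is not Kjarni's implementation. The parameter names follow the `layer1`/`layer2`/`layer3` attributes in the Usage snippet, weights are laid out one row per output unit as in `torch.nn.Linear`, and the toy parameters are made up so the arithmetic can be followed by hand.

```python
def dense(x, weight, bias):
    """Compute weight @ x + bias; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def forward(x, params):
    """Three matrix-vector products with ReLU after the first two."""
    h1 = relu(dense(x, params["layer1.weight"], params["layer1.bias"]))
    h2 = relu(dense(h1, params["layer2.weight"], params["layer2.bias"]))
    return dense(h2, params["layer3.weight"], params["layer3.bias"])


# Toy parameters (2 inputs -> 2 -> 2 -> 2 outputs), invented for illustration.
toy_params = {
    "layer1.weight": [[1.0, 0.0], [0.0, -1.0]], "layer1.bias": [0.0, 0.0],
    "layer2.weight": [[1.0, 1.0], [1.0, -1.0]], "layer2.bias": [0.0, 0.5],
    "layer3.weight": [[1.0, 0.0], [0.0, 2.0]], "layer3.bias": [0.1, -0.1],
}

logits = forward([2.0, 3.0], toy_params)
# Index of the largest logit: 0 = non-tracking, 1 = tracking.
prediction = max(range(len(logits)), key=logits.__getitem__)
```

With the real weights, the first product dominates the cost at 295 x 128 multiply-adds, which is the kind of inner loop that SIMD128 can speed up.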
+
+ ## Limitations
+
+ - Trained on a point-in-time snapshot of Tracker Radar (US region)
+ - Metadata features (entity ownership) can cause false positives for CDN domains owned by large companies
+ - Requires periodic retraining as tracking techniques evolve
+ - Tree-based models (RF, XGBoost) outperform this model on F1, precision, and ROC-AUC, but cannot run in WASM
+
+ ## Source
+
+ Code and methodology: [github.com/olafurjohannsson/tracker-ml](https://github.com/olafurjohannsson/tracker-ml)