---
language: en
license: mit
tags:
  - privacy
  - web-tracking
  - tracker-detection
  - tabular-classification
  - browser-fingerprinting
  - safetensors
  - wasm
datasets:
  - olafuraron/tracker-radar-ml
metrics:
  - f1
  - roc_auc
  - precision
  - recall
---

# Tracker Classifier

A lightweight feedforward neural network for classifying third-party web
domains as tracking or non-tracking, designed for on-device inference via
WebAssembly.

## Live Preview

[Live preview](https://olafurjohannsson.github.io/tracker-ml/)

## Model Description

- **Architecture**: Feedforward NN (input -> 128 -> 64 -> 2) with ReLU and dropout
- **Size**: 181 KB (safetensors)
- **Input**: 295 behavioral and metadata features from DuckDuckGo Tracker Radar
- **Output**: Binary classification (0 = non-tracking, 1 = tracking)
- **Training data**: 12,932 domains (80% of labeled set)
- **Deployment target**: Kjarni inference engine compiled to WASM with SIMD128
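
As a sanity check on the stated size, the parameter count can be derived from the layer widths above. This is a minimal sketch that assumes the weights are stored as 32-bit floats (an assumption, not something the card states) and ignores the small safetensors header.

```python
# Layer widths from the architecture line: input -> 128 -> 64 -> 2.
LAYER_SIZES = [295, 128, 64, 2]
BYTES_PER_PARAM = 4  # assumption: weights stored as float32


def count_parameters(sizes):
    """Sum weights (fan_in * fan_out) and biases (fan_out) for each dense layer."""
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


n_params = count_parameters(LAYER_SIZES)
size_kib = n_params * BYTES_PER_PARAM / 1024

print(f"{n_params:,} parameters, about {size_kib:.0f} KiB of raw weights")
```

Under that assumption the raw weights come to roughly 181 KiB, which is consistent with the file size listed above.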

## Performance (5-fold CV)

| Model | F1 | Precision | Recall | ROC-AUC |
|-------|-----|-----------|--------|---------|
| **This model (Feedforward NN)** | 0.848 +/- 0.017 | 0.804 +/- 0.037 | 0.899 +/- 0.006 | 0.928 +/- 0.008 |
| Random Forest | 0.895 +/- 0.003 | 0.895 +/- 0.006 | 0.895 +/- 0.006 | 0.958 +/- 0.002 |
| XGBoost | 0.893 +/- 0.004 | 0.887 +/- 0.006 | 0.899 +/- 0.004 | 0.959 +/- 0.002 |
| FP Heuristic (score >= 2)* | 0.355 | 0.579 | 0.257 | n/a |

*The fingerprinting heuristic targets browser API fingerprinting specifically,
not general tracking. The comparison demonstrates the gap between single-vector
and multi-vector detection.*
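
For readers comparing rows, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the heuristic row; the helper name `f1_score` is illustrative. The cross-validated rows report per-fold averages, so the identity only holds approximately for them.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall from the FP Heuristic row of the table.
heuristic_f1 = f1_score(0.579, 0.257)
print(f"{heuristic_f1:.3f}")
```

The result agrees with the reported 0.355 to within the rounding of the inputs.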

## Files

- `tracker_classifier.safetensors`: Model weights (181 KB)
- `config.json`: Architecture config, feature names, scaler parameters
- `scaler.joblib`: Sklearn StandardScaler for feature normalization
- `results.json`: Full evaluation metrics
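
The scaler parameters in `config.json` are enough to normalize features without loading `scaler.joblib`. The following is a dependency-free sketch of that step, using a toy three-feature config in place of the real 295-feature one. The keys `scaler_mean` and `scaler_scale` match the usage example in this card; the `standardize` helper and its length check are illustrative assumptions.

```python
def standardize(features, mean, scale):
    """Apply z = (x - mean) / scale element-wise, as StandardScaler.transform does."""
    if not (len(features) == len(mean) == len(scale)):
        raise ValueError(f"expected {len(mean)} features, got {len(features)}")
    return [(x - m) / s for x, m, s in zip(features, mean, scale)]


# Toy stand-in for config.json with three features instead of 295.
config = {"scaler_mean": [1.0, 10.0, 0.5], "scaler_scale": [2.0, 5.0, 0.25]}
scaled = standardize([3.0, 0.0, 0.5], config["scaler_mean"], config["scaler_scale"])
print(scaled)  # [1.0, -2.0, 0.0]
```

Checking the vector length up front catches a mismatched feature set before it reaches the model.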

## Usage
```python
import torch
import json
import numpy as np
from safetensors.torch import load_file

weights = load_file("tracker_classifier.safetensors")
with open("config.json") as f:
    config = json.load(f)

class TrackerClassifier(torch.nn.Module):
    """Feedforward classifier: input_dim -> 128 -> 64 -> 2 logits."""

    def __init__(self, input_dim, hidden_dim=128):
        super().__init__()
        self.layer1 = torch.nn.Linear(input_dim, hidden_dim)
        self.layer2 = torch.nn.Linear(hidden_dim, hidden_dim // 2)
        self.layer3 = torch.nn.Linear(hidden_dim // 2, 2)
        self.relu = torch.nn.ReLU()
        # Dropout has no weights and is a no-op at inference, so it is omitted here.

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.relu(self.layer2(x))
        return self.layer3(x)  # raw logits, no softmax

model = TrackerClassifier(input_dim=config["input_dim"])
model.load_state_dict(weights)
model.eval()

# Classify (standardize features first)
features = np.array([...])  # 295 features
mean = np.array(config["scaler_mean"])
scale = np.array(config["scaler_scale"])
features_scaled = (features - mean) / scale

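with torch.no_grad():
    # Cast to float32 to match the model weights and add a batch dimension.
    inputs = torch.tensor(features_scaled, dtype=torch.float32).unsqueeze(0)
    logits = model(inputs)
    prediction = logits.argmax(dim=1).item()
    # 0 = non-tracking, 1 = tracking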
```
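
The example above returns a hard label. When a confidence score is more useful, the two logits can be turned into a probability with softmax. This is a dependency-free sketch; the logits and the `THRESHOLD` value are made up for illustration and are not taken from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def tracking_probability(logits):
    """Probability of class 1 (tracking) from the two output logits."""
    return softmax(logits)[1]


THRESHOLD = 0.5  # hypothetical default; equivalent to argmax for two classes

p = tracking_probability([-0.3, 1.2])  # made-up logits for illustration
print(f"p(tracking) = {p:.3f}, flagged = {p >= THRESHOLD}")
```

Raising the threshold trades recall for precision, which may help with the false positives noted under Limitations.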

## On-Device Inference

This model is designed for deployment via
[Kjarni](https://github.com/olafurjohannsson/kjarni), compiled to
WebAssembly with SIMD128 acceleration. The 181 KB safetensors file and
three matrix multiplications make it suitable for real-time in-browser
classification with no data leaving the device.
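
To illustrate why the forward pass is cheap, here is a dependency-free sketch of the three matrix multiplications with ReLU in between. It does not reflect Kjarni's actual implementation. The tiny 2 -> 2 -> 2 -> 2 network and its weights are invented stand-ins for the real 295 -> 128 -> 64 -> 2 model.

```python
def linear(x, weight, bias):
    """y = W x + b, with weight stored as rows of shape (out_features, in_features)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]


def forward(x, layers):
    """Run x through dense layers, applying ReLU after all but the last."""
    for i, (weight, bias) in enumerate(layers):
        x = linear(x, weight, bias)
        if i < len(layers) - 1:
            x = relu(x)
    return x


# Tiny made-up network standing in for 295 -> 128 -> 64 -> 2.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, -1.0]),
    ([[1.0, 1.0], [-1.0, 2.0]], [0.0, 0.5]),
]
logits = forward([2.0, 1.0], layers)
prediction = logits.index(max(logits))  # 0 = non-tracking, 1 = tracking
print(logits, prediction)
```

The row-major `(out_features, in_features)` layout matches how `torch.nn.Linear` stores its weights.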

## Limitations

- Trained on a point-in-time snapshot of Tracker Radar (US region)
- Metadata features (entity ownership) can cause false positives for CDN domains owned by large companies
- Requires periodic retraining as tracking techniques evolve
- Tree-based models (RF, XGBoost) score higher than this model on F1, precision, and ROC-AUC (recall is comparable), but cannot run in WASM

## Links

[Kjarni](https://kjarni.ai)

## Source

Code and methodology: [github.com/olafurjohannsson/tracker-ml](https://github.com/olafurjohannsson/tracker-ml)