	---
	language: en
	license: mit
	tags:
	- privacy
	- web-tracking
	- tracker-detection
	- tabular-classification
	- browser-fingerprinting
	- safetensors
	- wasm
	datasets:
	- olafuraron/tracker-radar-ml
	metrics:
	- f1
	- roc_auc
	- precision
	- recall
	---

	# Tracker Classifier

	A lightweight feedforward neural network for classifying third-party web
	domains as tracking or non-tracking, designed for on-device inference via
	WebAssembly.

	## Live Preview

	[Live preview](https://olafurjohannsson.github.io/tracker-ml/)

	## Model Description

	- Architecture: Feedforward NN (input -> 128 -> 64 -> 2) with ReLU and dropout
	- Size: 181 KB (safetensors)
	- Input: 295 behavioral and metadata features from DuckDuckGo Tracker Radar
	- Output: Binary classification (0 = non-tracking, 1 = tracking)
	- Training data: 12,932 domains (80% of labeled set)
	- Deployment target: Kjarni inference engine compiled to WASM with SIMD128
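
The stated file size is consistent with the architecture above. As a rough check, assuming float32 weights and a bias term on every layer (neither is stated explicitly here), the parameter count can be tallied as follows:

```python
# Rough size check for a 295 -> 128 -> 64 -> 2 feedforward network.
# Assumptions: every layer has a bias, and weights are stored as float32.
LAYER_SIZES = [295, 128, 64, 2]
BYTES_PER_PARAM = 4  # float32

def count_parameters(layer_sizes):
    """Sum weights (fan_in * fan_out) and biases (fan_out) over all layers."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out
    return total

n_params = count_parameters(LAYER_SIZES)
size_kib = n_params * BYTES_PER_PARAM / 1024
print(f"{n_params} parameters, about {size_kib:.1f} KiB of raw weights")
```

Under those assumptions this gives 46,274 parameters and roughly 181 KiB of raw weights; the safetensors header adds a small amount on top.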

	## Performance (5-fold CV)

	| Model | F1 | Precision | Recall | ROC-AUC |
	|-------|-----|-----------|--------|---------|
	| This model (Feedforward NN) | 0.848 +/- 0.017 | 0.804 +/- 0.037 | 0.899 +/- 0.006 | 0.928 +/- 0.008 |
	| Random Forest | 0.895 +/- 0.003 | 0.895 +/- 0.006 | 0.895 +/- 0.006 | 0.958 +/- 0.002 |
	| XGBoost | 0.893 +/- 0.004 | 0.887 +/- 0.006 | 0.899 +/- 0.004 | 0.959 +/- 0.002 |
	| FP Heuristic (score >= 2)* | 0.355 | 0.579 | 0.257 | n/a |

	*The fingerprinting heuristic targets browser API fingerprinting specifically,
	not general tracking. The comparison demonstrates the gap between single-vector
	and multi-vector detection.*
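
For readers comparing the rows, F1 is the harmonic mean of precision and recall, so a low recall pulls it down sharply even when precision is moderate. A minimal sketch using the heuristic's table values as inputs (the function name is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall of the fingerprinting heuristic, taken from the table.
heuristic_f1 = f1_score(0.579, 0.257)
print(round(heuristic_f1, 3))
```

The result lands within about 0.001 of the reported 0.355, presumably because the inputs here are already rounded. For the cross-validated rows the same recomputation is only approximate, since a mean of per-fold F1 scores is not the F1 of the mean precision and recall.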

	## Files

	- `tracker_classifier.safetensors`: Model weights (181 KB)
	- `config.json`: Architecture config, feature names, scaler parameters
	- `scaler.joblib`: Sklearn StandardScaler for feature normalization
	- `results.json`: Full evaluation metrics
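
Because the model expects its 295 inputs in a fixed order, a caller usually has to map named features onto that order before scaling. The sketch below assumes `config.json` exposes the names under a `feature_names` key and that missing features default to 0.0; both are assumptions for illustration, so check the actual file. The toy feature names and scaler values are made up.

```python
def build_scaled_vector(raw, feature_names, mean, scale, default=0.0):
    """Order named features like the training columns, then standardize them."""
    if not (len(feature_names) == len(mean) == len(scale)):
        raise ValueError("feature_names, mean and scale must have equal length")
    ordered = [float(raw.get(name, default)) for name in feature_names]
    return [(x - m) / s for x, m, s in zip(ordered, mean, scale)]

# Toy stand-in for config.json with three made-up features.
# "feature_names" is a hypothetical key; the real file may name it differently.
toy_config = {
    "feature_names": ["cookies", "fingerprinting", "prevalence"],
    "scaler_mean": [0.5, 1.0, 0.25],
    "scaler_scale": [0.5, 2.0, 0.25],
}
scaled = build_scaled_vector(
    {"cookies": 1.0, "prevalence": 0.75},  # "fingerprinting" is missing on purpose
    toy_config["feature_names"],
    toy_config["scaler_mean"],
    toy_config["scaler_scale"],
)
print(scaled)
```

Raising an error on a length mismatch is a deliberate choice: a silently misaligned feature vector would still produce a prediction, just a meaningless one.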

	## Usage
	```python
	import json

	import numpy as np
	import torch
	from safetensors.torch import load_file

	weights = load_file("tracker_classifier.safetensors")
	with open("config.json") as f:
	    config = json.load(f)

	class TrackerClassifier(torch.nn.Module):
	    # Dropout is left out here: it is inactive in eval mode and has no weights.
	    def __init__(self, input_dim, hidden_dim=128):
	        super().__init__()
	        self.layer1 = torch.nn.Linear(input_dim, hidden_dim)
	        self.layer2 = torch.nn.Linear(hidden_dim, hidden_dim // 2)
	        self.layer3 = torch.nn.Linear(hidden_dim // 2, 2)
	        self.relu = torch.nn.ReLU()

	    def forward(self, x):
	        x = self.relu(self.layer1(x))
	        x = self.relu(self.layer2(x))
	        return self.layer3(x)  # raw logits, one per class

	model = TrackerClassifier(input_dim=config["input_dim"])
	model.load_state_dict(weights)
	model.eval()

	# Classify (standardize features first)
	features = np.array([...])  # placeholder: the 295 feature values, in training order
	mean = np.array(config["scaler_mean"])
	scale = np.array(config["scaler_scale"])
	features_scaled = (features - mean) / scale

	with torch.no_grad():
	    # Add a batch dimension and cast to float32 to match the model weights
	    inputs = torch.tensor(features_scaled, dtype=torch.float32).unsqueeze(0)
	    logits = model(inputs)
	    prediction = logits.argmax(dim=1).item()
	# 0 = non-tracking, 1 = tracking
	```
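
`argmax` yields only a hard label. When a confidence score or a custom decision threshold is more useful, the two logits can be turned into probabilities with a softmax. A minimal sketch in plain Python, where the 0.5 threshold and the example logits are illustrative and not values from this model:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, threshold=0.5):
    """Return (label, p_tracking); label is 1 when p_tracking >= threshold."""
    p_tracking = softmax(logits)[1]
    return (1 if p_tracking >= threshold else 0), p_tracking

label, p = classify([-1.2, 2.3])  # made-up logits, index 1 = tracking
print(label, round(p, 3))
```

With two classes and a threshold of 0.5 this matches `argmax`. Raising the threshold trades recall for precision, which may be worth considering given that this model's precision is lower than its recall.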

	## On-Device Inference

	This model is designed for deployment via
	[Kjarni](https://github.com/olafurjohannsson/kjarni), compiled to
	WebAssembly with SIMD128 acceleration. The 181 KB safetensors file and
	three matrix multiplications make it suitable for real-time in-browser
	classification with no data leaving the device.
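
To make the "three matrix multiplications" concrete, the forward pass can be written without any framework. This is an illustrative plain-Python sketch with tiny made-up weights, not Kjarni's implementation; dropout is omitted because it is inactive at inference time.

```python
def linear(x, weight, bias):
    """y = W x + b, with weight stored as rows of shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in x]

def forward(x, layers):
    """Apply linear layers with ReLU between them; the last layer emits raw logits."""
    for weight, bias in layers[:-1]:
        x = relu(linear(x, weight, bias))
    weight, bias = layers[-1]
    return linear(x, weight, bias)

# Tiny 2 -> 2 -> 2 -> 2 network with made-up weights
# (the real model is 295 -> 128 -> 64 -> 2).
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.5, -0.5]),
]
logits = forward([2.0, 1.0], toy_layers)
print(logits)
```

The `(out_dim, in_dim)` weight layout mirrors how `torch.nn.Linear` stores its weights, so tensors loaded from the safetensors file could in principle be fed through the same structure.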

	## Limitations

	- Trained on a point-in-time snapshot of Tracker Radar (US region)
	- Metadata features (entity ownership) can cause false positives for CDN domains owned by large companies
	- Requires periodic retraining as tracking techniques evolve
	- Tree-based models (RF, XGBoost) outperform this model on F1, precision, and ROC-AUC (recall is comparable), but cannot run in the WASM deployment target used here

	## Links

	[Kjarni](https://kjarni.ai)

	## Source

	Code and methodology: [github.com/olafurjohannsson/tracker-ml](https://github.com/olafurjohannsson/tracker-ml)