Net-JEPA: Context-Aware Flow Embeddings for Encrypted Network Traffic
Net-JEPA is a Joint-Embedding Predictive Architecture (JEPA) for encrypted network flows. It classifies traffic into 8 common traffic types from the shape of the traffic (packet sizes, inter-arrival timing, and direction) without decrypting any payload. It is trained from scratch; no foundation, pre-trained, or closed-weight model is used anywhere.
- Code / full docs: https://github.com/Stinson-83/Net-JEPA
- Team: FlowState (Archisman Dhar, Kritik Gupta), IIT Kanpur; Samsung EnnovateX 2026, PS #2
- License: Apache-2.0
What's in this repository
| File | What it is |
|---|---|
| net_jepa_phase3.pt | Trained Net-JEPA checkpoint (Phase 3, epoch 50): encoder + fusion + downstream embedding head, with α-centering baked in (~1.76 M params) |
| knn.joblib | Fitted cosine k-NN (k=5) traffic-type classifier over the labelled embeddings |
| labels.json | The 8 traffic-type names + type2id mapping |
| config.yaml | The exact traffic.yaml hyper-parameters used to build the model |
| umap.joblib | (optional) Fitted UMAP reducer (128-D → 2-D) for the Signal Atlas visualization; not needed for classification |
The checkpoint produces a 128-D L2-normalised embedding per flow; the k-NN reads that embedding to predict one of 8 traffic types.
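To make the embedding-to-label step concrete, here is a minimal pure-Python sketch of a cosine k-NN vote. It is not the fitted classifier stored in knn.joblib; the function names and the 3-D toy vectors are illustrative assumptions (real embeddings are 128-D).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, bank, bank_labels, k=5):
    """Majority vote over the k labelled embeddings most similar to the query."""
    ranked = sorted(range(len(bank)),
                    key=lambda i: cosine(query, bank[i]),
                    reverse=True)
    votes = Counter(bank_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy 3-D "embeddings" with made-up values, standing in for the 128-D vectors.
bank = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.0, 0.2],
        [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]
bank_labels = ["web_browsing"] * 3 + ["video_on_demand"] * 3

prediction = knn_predict([0.95, 0.05, 0.0], bank, bank_labels, k=3)
```

Because the embeddings are L2-normalised, cosine similarity reduces to a dot product, so the norm computation above is only a safeguard for unnormalised inputs.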
Model details
- Input: a flow's first ≤64 packets as a (64 × 9) tensor [size/1500, log1p(IAT), signed direction, protocol one-hot×4, rtt_norm, rtt_flag] + a (15,) flow-context vector (durations, flag ratios, per-capture host stats, RTT) + a (64,) padding mask.
- Encoder: 4-layer Temporal Transformer (d=128, 4 heads) + cross-attention fusion with the flow-context vector.
- Self-supervised pretraining: masked-latent prediction against an EMA target branch with a VICReg loss (no labels, no negatives).
- Supervised sharpening: traffic-type-level SupCon on the kept embedding, then α-centering (common-mode removal, α≈0.65) to isotropise the space.
- Classifier: cosine k-NN (k=5).
- Size / speed: 1.76 M parameters; **3.5 ms / flow on CPU** (no GPU needed to serve).
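The α-centering step in the list above can be illustrated with a small sketch. This assumes that "common-mode removal" means subtracting α times the mean embedding from every vector and then re-normalising to unit length; the repository's exact formulation may differ, and the function names here are hypothetical.

```python
import math

def l2_normalise(v):
    """Scale a vector to unit L2 norm (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def alpha_centre(embeddings, alpha=0.65):
    """Assumed form of alpha-centering: e <- normalise(e - alpha * mean(e))."""
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / len(embeddings) for j in range(dim)]
    return [l2_normalise([e[j] - alpha * mean[j] for j in range(dim)])
            for e in embeddings]

def dot(a, b):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Two toy vectors that share a large common component along the first axis.
raw = [l2_normalise([1.0, 0.3, 0.0]), l2_normalise([1.0, 0.0, 0.3])]
centred = alpha_centre(raw, alpha=0.65)

before = dot(raw[0], raw[1])          # high: dominated by the shared direction
after = dot(centred[0], centred[1])   # lower: the shared direction is damped
```

In this toy case the two vectors start out nearly parallel because of their shared component; damping that component spreads them apart, which is the intuition behind using the step to isotropise the space. In practice the mean would be estimated once on training embeddings rather than per batch.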
The eight traffic types
audio_streaming · cloud_gaming · live_streaming · metaverse · online_gaming ·
video_conferencing · video_on_demand · web_browsing.
Evaluation (held-out test split)
| Benchmark KPI | Target | Achieved |
|---|---|---|
| Intra-class cosine | > 0.7 | 0.98 |
| Inter-class cosine | < 0.3 | −0.04 |
| Classification accuracy | ≥ 90% | 99.7% |
| Generalization (few-shot, η=7) | ≥ 85% | 99.6% |
| Real-time latency / flow | < 100 ms | 3.5 ms (CPU) |
macro-F1 0.992 · silhouette 0.87. Per-class F1 ranges from 0.971 (cloud gaming, the rarest and
hardest class) to 1.000 (metaverse). Full numbers: see the repo's docs/results.md.
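For readers who want to reproduce the intra-class and inter-class cosine KPIs on their own embeddings, one plausible way to compute them is a mean over all pairs, sketched below. Whether the reported figures use pairwise means or class centroids is not stated here, so treat the definition, the function names, and the toy data as assumptions.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def class_cosine_stats(embeddings, labels):
    """Return (mean same-class cosine, mean different-class cosine) over all pairs."""
    intra, inter = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (intra if labels[i] == labels[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)

# Toy 2-D embeddings for two classes with made-up values.
embs = [[1.0, 0.1], [0.9, 0.0], [-0.1, 1.0], [0.0, 0.8]]
labs = ["web_browsing", "web_browsing", "cloud_gaming", "cloud_gaming"]

intra, inter = class_cosine_stats(embs, labs)
```

The all-pairs loop is quadratic in the number of flows, so on a test split of several thousand flows it visits tens of millions of pairs; subsampling or a vectorised matrix product would be the practical choice there.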
How to use
The checkpoint needs the netjepa code to instantiate the architecture.
git clone https://github.com/Stinson-83/Net-JEPA && cd Net-JEPA
pip install -r requirements.txt && pip install -e .
pip install huggingface_hub
import joblib, json, torch
from huggingface_hub import hf_hub_download
from netjepa.model.netjepa import NetJEPA
from netjepa.utils.io import load_checkpoint
REPO = "<your-username>/net-jepa" # this repo id
ckpt = hf_hub_download(REPO, "net_jepa_phase3.pt")
knn = joblib.load(hf_hub_download(REPO, "knn.joblib"))
labels = json.load(open(hf_hub_download(REPO, "labels.json")))["traffic_types"]
model = NetJEPA()
load_checkpoint(model, None, ckpt, torch.device("cpu"))
model.eval()
# pkt: (1,64,9) float, ctx: (1,15) float, mask: (1,64) bool
emb = model.forward_downstream(pkt, ctx, mask).detach().cpu().numpy() # (1,128)
traffic_type = labels[int(knn.predict(emb)[0])]
For end-to-end .pcap inference (parse → flow → embed → classify → 2-D projection), use the
terminal tool python -m netjepa.scripts.infer_pcap your.pcap --checkpoint <ckpt> --labels labels.json
or the FastAPI server (src/server/app.py); see docs/usage.md. The flow-building and feature
code is identical to training (including per-capture host stats), so real captures classify
correctly (e.g. a browser YouTube capture → video_on_demand).
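The pkt and mask inputs in the Python snippet above follow the per-packet layout listed under Model details. The sketch below shows how that layout could be packed from raw packet records. The dict keys, the protocol ordering, RTT_SCALE, the meaning of rtt_flag, and the mask convention (True for padding) are all assumptions made for illustration; for real inference, use the repository's own flow-building code, which also produces the (15,) context vector.

```python
import math

MAX_PKTS, N_FEATS = 64, 9
MTU = 1500.0
RTT_SCALE = 1.0  # hypothetical normaliser in seconds; the repo's scale may differ

def pack_flow(packets):
    """Pack up to 64 packets into a (64, 9) feature grid plus a (64,) padding mask.

    Each packet is a dict with hypothetical keys:
      size (bytes), ts (seconds), direction (+1 or -1),
      proto (index 0..3 into an assumed 4-way protocol one-hot),
      rtt (seconds, or None when no RTT sample exists).
    """
    rows = [[0.0] * N_FEATS for _ in range(MAX_PKTS)]
    mask = [True] * MAX_PKTS  # assumed convention: True marks a padded slot
    prev_ts = None
    for i, p in enumerate(packets[:MAX_PKTS]):
        iat = 0.0 if prev_ts is None else max(p["ts"] - prev_ts, 0.0)
        prev_ts = p["ts"]
        row = rows[i]
        row[0] = p["size"] / MTU              # size/1500
        row[1] = math.log1p(iat)              # log1p(IAT)
        row[2] = float(p["direction"])        # signed direction
        row[3 + p["proto"]] = 1.0             # protocol one-hot (columns 3..6)
        if p["rtt"] is not None:
            row[7] = min(p["rtt"] / RTT_SCALE, 1.0)  # rtt_norm (assumed clipping)
            row[8] = 1.0                              # rtt_flag: RTT sample present
        mask[i] = False
    return rows, mask

flow = [
    {"size": 1500, "ts": 0.00, "direction": +1, "proto": 1, "rtt": 0.02},
    {"size": 60, "ts": 0.01, "direction": -1, "proto": 1, "rtt": None},
]
pkt, mask = pack_flow(flow)
```

Wrapping the results with torch.tensor(pkt).unsqueeze(0) and torch.tensor(mask).unsqueeze(0) would give the (1,64,9) float and (1,64) bool shapes that the snippet expects.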
Training data
All datasets are public; the current model trains on 8 traffic types, all fully supervised:
- Kaggle · 5G Traffic Datasets: primary train/test (license "Unknown"; used under Kaggle terms, not redistributed).
- Zenodo · VLC / Valencia (CC-BY-4.0), full VLC set: Spotify → audio_streaming, Web → web_browsing, Netflix/Prime/YouTube → video_on_demand, Roblox → metaverse, Teams → video_conferencing.
- Kaggle · Cloud Gaming Network Telemetry (BSD-3): Xbox Cloud over 5G → cloud_gaming.
28,892 flows: 20,224 train (pretrain = downstream) / 8,668 test (30%), using a leak-free stratified split with full supervision.
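The split sizes are consistent: 20,224 + 8,668 = 28,892, and 8,668 / 28,892 ≈ 30.0%. As a rough illustration of a label-stratified 70/30 split, the sketch below shuffles indices within each class and carves off the test fraction. It is not the project's split code; the function name and seed are hypothetical, and it stratifies by label only, so any grouping by capture that "leak-free" may imply would have to be done before this step.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.30, seed=0):
    """Split indices so each class contributes about test_frac of its items to test."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

# Toy label list with an imbalanced rare class.
labels = ["web_browsing"] * 100 + ["cloud_gaming"] * 20
train_idx, test_idx = stratified_split(labels)
```

Stratifying keeps rare classes such as cloud_gaming represented in the test set at the same rate as common ones, which matters when per-class F1 is reported.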
Intended use & limitations
- Intended: QoS / 5G-slice prioritisation, traffic analytics, and anomaly triage on encrypted flows where DPI is impossible. Operates on metadata only.
- Out of scope: identifying individual users or payload content (it cannot; only flow shape is modelled).
- Known limitation (reported honestly): classification relies on per-capture host-behaviour
features, so a pcap containing only a single flow yields degenerate host stats and may
misclassify. Real captures contain many flows and work; multi-flow captures of the 8 trained
types, including out-of-domain browser QUIC, classify correctly. See
docs/results.md.
Citation
@misc{netjepa2026,
title = {Net-JEPA: Context-Aware Flow Embeddings for Adaptive AI-based Network Traffic Classification},
author = {Dhar, Archisman and Gupta, Kritik},
year = {2026},
note = {Samsung EnnovateX 2026, Problem Statement 2},
url = {https://github.com/Stinson-83/Net-JEPA}
}
Conceptually inspired by I-JEPA (Assran et al., 2023), VICReg (Bardes et al., 2022), SupCon (Khosla et al., 2020), and DANN (Ganin & Lempitsky, 2015); adapted to encrypted network-flow classification with original contributions (packet-shape encoder + RTT/context fusion, α-centering, traffic-type SupCon, and per-capture train/inference-consistent host stats).