TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation
TrafficGPT is a deep-learning foundation model designed to tackle complex challenges in network traffic analysis and generation. By leveraging generative pre-training with a linear attention mechanism, it expands the effective token window from the traditional 512-token limit to 12,032 tokens.
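The linear-attention idea behind the longer window can be sketched in a few lines: with a positive feature map φ applied to queries and keys, attention becomes φ(Q)(φ(K)ᵀV), and matrix associativity lets the φ(K)ᵀV product be computed first, avoiding the N×N score matrix entirely. The sketch below is a minimal NumPy illustration; the feature map `elu(x) + 1` is an assumption borrowed from the linear-transformer literature, not necessarily TrafficGPT's exact kernel.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention in O(N * d^2) time, linear in sequence length N.

    Standard softmax attention builds an N x N weight matrix (O(N^2)).
    Here we instead compute phi(K).T @ V first, a small (d, d_v) matrix,
    then multiply by phi(Q) -- associativity removes the N x N term.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always positive
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                       # (d, d_v): aggregated key-value summary
    Z = Qf @ Kf.sum(axis=0) + eps       # (N,): per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

Because the weights are nonnegative and normalized, the result matches the explicit quadratic formulation `(phi(Q) @ phi(K).T) @ V` up to floating-point error, while never materializing the N×N matrix.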
Model Details
- Developed by: Jian Qu, Xiaobo Ma, and Jianfeng Li (Xi'an Jiaotong University).
- Model Type: Generative Pre-trained Transformer with Linear Attention.
- Architecture: 24 layers, 12 attention heads, hidden dimension of 512.
- Key Innovations:
  - Reversible Tokenization: bijective mapping between PCAP files and token lists, enabling direct traffic reconstruction.
  - Linear Complexity: reduces self-attention complexity from $O(N^2)$ to $O(N)$.
  - Reversible Network: optimized memory usage based on the Reformer architecture.
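The reversible-tokenization property above means every token list can be mapped back to the exact bytes it came from. A minimal sketch of such a bijective byte-level scheme is shown below; the byte-per-token vocabulary and the `PKT_SEP` delimiter are illustrative assumptions, not TrafficGPT's actual vocabulary or header-stripping rules.

```python
PKT_SEP = 256  # assumed special token marking a packet boundary

def packets_to_tokens(packets):
    """Map a list of raw packet byte strings to one flat token list."""
    tokens = []
    for pkt in packets:
        tokens.extend(pkt)       # each byte 0..255 is its own token
        tokens.append(PKT_SEP)   # delimiter keeps the mapping invertible
    return tokens

def tokens_to_packets(tokens):
    """Inverse mapping: reconstruct the exact packet byte strings."""
    packets, cur = [], bytearray()
    for t in tokens:
        if t == PKT_SEP:
            packets.append(bytes(cur))
            cur = bytearray()
        else:
            cur.append(t)
    return packets
```

Because the two functions are exact inverses, a generated token sequence can be written straight back out as packet bytes, which is what makes direct PCAP reconstruction possible.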
Intended Use
- Traffic Classification: High-accuracy identification of encrypted flows, VPN traffic, and IoT device communications.
- Traffic Generation: Creating realistic, protocol-compliant PCAP files for network simulation and security testing.
- Protocol Reverse Engineering: Learning robust representations of unknown or complex network protocols.
Training Data
The model was pre-trained on 189 GB of raw network traffic across five major datasets:
- ISCX-Tor2016: Tor network traffic characterization.
- USTCTFC2016: Malware and software identification traffic.
- ISCXVPN2016: Encrypted VPN vs. non-VPN flows.
- DoHBrw2020: DNS-over-HTTPS tunnel detection.
- CICIoT2022: Multidimensional IoT profiling data.
Training Details & Hyperparameters
While the original TrafficGPT research utilized a 99:1 train-test split (99% for pre-training, 1% for testing), this open-source version employs a standard 80:20 split.
Evaluation Results
Classification Performance (Macro F1-Score)
TrafficGPT (12k) consistently outperforms existing state-of-the-art models.
| Dataset | Metric | TrafficGPT (12k) |
|---|---|---|
| ISCX-VPN-App | Macro F1 | 1.0000 |
| USTC-TFC | Macro F1 | 0.9877 |
| Cross-Platform (iOS) | Macro F1 | 0.9863 |
| Cross-Platform (Android) | Macro F1 | 0.9498 |
Generation Quality
Measured using Jensen-Shannon Divergence (JSD), where lower values indicate closer similarity to real traffic.
- Packet Header JSD (Avg): 0.1605.
- Flow Feature JSD (Avg): 0.2396.
- Discriminator Realism: F1-Score of 0.6683.
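The JSD scores above compare feature histograms of generated and real traffic. A minimal sketch of the metric itself is below (base-2 logs so values fall in [0, 1]; the histogram inputs are assumed to be pre-extracted header or flow features):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two histograms.

    0 means identical distributions; 1 is the maximum. Inputs are
    normalized to sum to 1, so raw counts are accepted.
    """
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log2((a + eps) / (b + eps)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Lower values mean the generated traffic's feature distribution sits closer to the real one, which is why the packet-header average of 0.1605 indicates a tighter match than the flow-feature average of 0.2396.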
Limitations
- Networking Interpretation: IP address, port, and checksum fields are removed during tokenization, so the model does not learn IP/port associations. While this ensures the model learns protocol features rather than metadata, it limits the model's utility in scenarios where port-to-protocol mapping is vital for networking interpretation.
- Protocol Anomalies: May occasionally generate malformed packets in complex encrypted protocols (e.g., TLS Client Hello).
- Inter-flow Correlation: Currently focuses on individual TCP/UDP flows and does not yet capture complex correlations between multiple distinct flows.
- Computational Cost: Although attention scales linearly, training on sequences of 3k tokens still demands substantial memory and careful step-time optimization.
Citation
If you use TrafficGPT in your research, please cite:
@article{qu2024trafficgpt,
  title={TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation},
  author={Qu, Jian and Ma, Xiaobo and Li, Jianfeng},
  journal={arXiv preprint arXiv:2403.05822},
  year={2024}
}