TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation
TrafficGPT is a deep-learning foundation model designed to tackle complex challenges in network traffic analysis and generation. By leveraging generative pre-training with a linear attention mechanism, it expands the effective token window from the traditional 512-token limit to 12,032 tokens.
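The linear-attention idea behind the longer window can be sketched in a few lines: with a positive feature map φ applied to queries and keys, attention becomes φ(Q)(φ(K)ᵀV), and matrix associativity lets the φ(K)ᵀV product be computed first, avoiding the N×N score matrix entirely. The sketch below is a minimal NumPy illustration; the feature map `elu(x) + 1` is an assumption borrowed from the linear-transformer literature, not necessarily TrafficGPT's exact kernel.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention in O(N * d^2) time, linear in sequence length N.

    Standard softmax attention builds an N x N weight matrix (O(N^2)).
    Here we instead compute phi(K).T @ V first, a small (d, d_v) matrix,
    then multiply by phi(Q) -- associativity removes the N x N term.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always positive
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                       # (d, d_v): aggregated key-value summary
    Z = Qf @ Kf.sum(axis=0) + eps       # (N,): per-query normalizer
    return (Qf @ KV) / Z[:, None]
```

Because the weights are nonnegative and normalized, the result matches the explicit quadratic formulation `(phi(Q) @ phi(K).T) @ V` up to floating-point error, while never materializing the N×N matrix.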
Model Details
- Developed by: Jian Qu, Xiaobo Ma, and Jianfeng Li (Xi'an Jiaotong University).
- Model Type: Generative Pre-trained Transformer with Linear Attention.
- Architecture: 24 layers, 12 attention heads, hidden dimension of 512.
- Key Innovations:
  - Reversible Tokenization: bijective mapping between PCAP files and token lists, enabling direct traffic reconstruction.
  - Linear Complexity: reduces self-attention complexity from $O(N^2)$ to $O(N)$.
  - Reversible Network: optimized memory usage based on the Reformer architecture.
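The reversible-tokenization property above means every token list can be mapped back to the exact bytes it came from. A minimal sketch of such a bijective byte-level scheme is shown below; the byte-per-token vocabulary and the `PKT_SEP` delimiter are illustrative assumptions, not TrafficGPT's actual vocabulary or header-stripping rules.

```python
PKT_SEP = 256  # assumed special token marking a packet boundary

def packets_to_tokens(packets):
    """Map a list of raw packet byte strings to one flat token list."""
    tokens = []
    for pkt in packets:
        tokens.extend(pkt)       # each byte 0..255 is its own token
        tokens.append(PKT_SEP)   # delimiter keeps the mapping invertible
    return tokens

def tokens_to_packets(tokens):
    """Inverse mapping: reconstruct the exact packet byte strings."""
    packets, cur = [], bytearray()
    for t in tokens:
        if t == PKT_SEP:
            packets.append(bytes(cur))
            cur = bytearray()
        else:
            cur.append(t)
    return packets
```

Because the two functions are exact inverses, a generated token sequence can be written straight back out as packet bytes, which is what makes direct PCAP reconstruction possible.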
Intended Use
- Traffic Classification: High-accuracy identification of encrypted flows, VPN traffic, and IoT device communications.
- Traffic Generation: Creating realistic, protocol-compliant PCAP files for network simulation and security testing.
- Protocol Reverse Engineering: Learning robust representations of unknown or complex network protocols.
Training Data
The model was pre-trained on 189 GB of raw network traffic across five major datasets:
- ISCX-Tor2016: Tor network traffic characterization.
- USTCTFC2016: Malware and software identification traffic.
- ISCXVPN2016: Encrypted VPN vs. non-VPN flows.
- DoHBrw2020: DNS-over-HTTPS tunnel detection.
- CICIoT2022: Multidimensional IoT profiling data.
Training Details & Hyperparameters
While the original TrafficGPT research utilized a 99:1 train-test split (99% for pre-training, 1% for testing), this open-source version employs a standard 80:20 split.
Evaluation Results
Classification Performance (Macro F1-Score)
TrafficGPT (12k) consistently outperforms existing state-of-the-art models.
| Dataset | Metric | TrafficGPT (12k) |
|---|---|---|
| ISCX-VPN-App | Macro F1 | 1.0000 |
| USTC-TFC | Macro F1 | 0.9877 |
| Cross-Platform (iOS) | Macro F1 | 0.9863 |
| Cross-Platform (Android) | Macro F1 | 0.9498 |
Generation Quality
Measured using Jensen-Shannon Divergence (JSD), where lower values indicate closer similarity to real traffic.
- Packet Header JSD (Avg): 0.1605.
- Flow Feature JSD (Avg): 0.2396.
- Discriminator Realism: F1-Score of 0.6683.
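The JSD scores above compare feature histograms of generated and real traffic. A minimal sketch of the metric itself is below (base-2 logs so values fall in [0, 1]; the histogram inputs are assumed to be pre-extracted header or flow features):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two histograms.

    0 means identical distributions; 1 is the maximum. Inputs are
    normalized to sum to 1, so raw counts are accepted.
    """
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(np.where(a > 0, a * np.log2((a + eps) / (b + eps)), 0.0))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Lower values mean the generated traffic's feature distribution sits closer to the real one, which is why the packet-header average of 0.1605 indicates a tighter match than the flow-feature average of 0.2396.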
Limitations
- Networking Interpretation: IP address, port, and checksum fields are removed during tokenization, so the model does not learn IP/port associations. While this ensures the model learns protocol features rather than metadata, it limits the model's utility in scenarios where port-to-protocol mapping is vital for networking interpretation.
- Protocol Anomalies: May occasionally generate malformed packets in complex encrypted protocols (e.g., TLS Client Hello).
- Inter-flow Correlation: Currently focuses on individual TCP/UDP flows and does not yet capture complex correlations between multiple distinct flows.
- Computational Cost: Although attention scales linearly, training on sequences of 3k tokens still demands substantial memory and careful step-time optimization.
Citation
If you use TrafficGPT in your research, please cite:
@article{qu2024trafficgpt,
  title={TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation},
  author={Qu, Jian and Ma, Xiaobo and Li, Jianfeng},
  journal={arXiv preprint arXiv:2403.05822},
  year={2024}
}