TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation

TrafficGPT is a deep-learning foundation model designed to tackle complex challenges in network traffic analysis and generation. By leveraging generative pre-training with a linear attention mechanism, it expands the effective token window from the traditional 512-token limit to 12,032 tokens.
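The "reversible tokenization" behind this design maps raw packet bytes to tokens bijectively, so generated token sequences can be written back to valid PCAP bytes. TrafficGPT's actual vocabulary and special tokens are defined in the paper; the sketch below is only a minimal illustration (identity byte-to-token mapping, an assumption) of why bijectivity makes reconstruction lossless.

```python
# Minimal sketch of a reversible (bijective) byte-level tokenizer.
# TrafficGPT's real vocabulary and special tokens differ; this only
# illustrates why a bijection lets generated tokens round-trip to PCAP bytes.

def tokenize(raw: bytes) -> list[int]:
    """Map each raw packet byte to one token id (identity mapping here)."""
    return list(raw)

def detokenize(tokens: list[int]) -> bytes:
    """Invert the mapping exactly, reconstructing the original bytes."""
    return bytes(tokens)

payload = bytes.fromhex("4500003c1c46")  # first bytes of an IPv4 header
assert detokenize(tokenize(payload)) == payload  # round-trip is lossless
```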

Model Details

  • Developed by: Jian Qu, Xiaobo Ma, and Jianfeng Li (Xi'an Jiaotong University).
  • Model Type: Generative Pre-trained Transformer with Linear Attention.
  • Architecture: 24 layers, 12 attention heads, hidden dimension of 512.
  • Key Innovations:
    • Reversible Tokenization: Bijective mapping between PCAP files and token lists for direct traffic reconstruction.
    • Linear Complexity: Reduces self-attention complexity from $O(N^2)$ to $O(N)$.
    • Reversible Network: Optimized memory usage based on the Reformer architecture.
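The $O(N)$ cost comes from kernelized attention: instead of materializing the $N \times N$ attention matrix, one computes $\phi(K)^\top V$ once (a $d \times d$ matrix) and applies it per query. TrafficGPT's exact linear-attention variant is not reproduced here; the sketch below uses the well-known elu+1 feature map from Katharopoulos et al. as an assumption, purely to show why the cost scales with sequence length rather than its square.

```python
import numpy as np

def linear_attention(Q, K, V):
    """O(N) kernelized attention: compute phi(K)^T V once (d x d),
    then apply it to each query. Feature map phi(x) = elu(x) + 1 keeps
    values positive; TrafficGPT's exact variant may differ."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                  # (d, d_v): independent of sequence length
    Z = Qf @ Kf.sum(axis=0)        # (N,): per-query normalization terms
    return (Qf @ KV) / Z[:, None]

N, d = 12032, 64                   # a 12k-token window, head dimension 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) * 0.1 for _ in range(3))
out = linear_attention(Q, K, V)
assert out.shape == (N, d)
```

Note that the full $12032 \times 12032$ score matrix is never formed, which is what makes the 12k-token window tractable in memory.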

Intended Use

  • Traffic Classification: High-accuracy identification of encrypted flows, VPN traffic, and IoT device communications.
  • Traffic Generation: Creating realistic, protocol-compliant PCAP files for network simulation and security testing.
  • Protocol Reverse Engineering: Learning robust representations of unknown or complex network protocols.

Training Data

The model was pre-trained on 189 GB of raw network traffic across five major datasets:

  • ISCX-Tor2016: Tor network traffic characterization.
  • USTCTFC2016: Malware and software identification traffic.
  • ISCXVPN2016: Encrypted VPN vs. non-VPN flows.
  • DoHBrw2020: DNS-over-HTTPS tunnel detection.
  • CICIoT2022: Multidimensional IoT profiling data.

Training Details & Hyperparameters

While the original TrafficGPT research utilized a 99:1 train-test split (99% for pre-training, 1% for testing), this open-source version employs a standard 80:20 split.
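For reproducibility, the 80:20 split used in this release can be sketched as a seeded shuffle over flow identifiers. The per-flow granularity here is an assumption; the paper's exact split unit may differ.

```python
import random

def split_flows(flows, test_ratio=0.20, seed=42):
    """Shuffle flow identifiers and hold out test_ratio for evaluation.
    (Splitting per flow is an assumption; the paper's unit may differ.)"""
    flows = list(flows)
    random.Random(seed).shuffle(flows)
    n_test = int(len(flows) * test_ratio)
    return flows[n_test:], flows[:n_test]

train, test = split_flows(range(1000))
assert (len(train), len(test)) == (800, 200)
```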

Evaluation Results

Classification Performance (Macro F1-Score)

TrafficGPT (12k) consistently outperforms existing state-of-the-art models.

| Dataset | Metric | TrafficGPT (12k) |
| --- | --- | --- |
| ISCX-VPN-App | Macro F1 | 1.0000 |
| USTC-TFC | Macro F1 | 0.9877 |
| Cross-Platform (iOS) | Macro F1 | 0.9863 |
| Cross-Platform (Android) | Macro F1 | 0.9498 |
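Macro F1 averages the per-class F1 scores with equal weight, so rare traffic classes count as much as common ones. A minimal reference implementation:

```python
def macro_f1(y_true, y_pred):
    """Average per-class F1 with equal weight per class, regardless of support."""
    classes = set(y_true) | set(y_pred)
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# e.g. one of two "vpn" flows misclassified as "tor" (labels are illustrative)
score = macro_f1(["vpn", "vpn", "tor", "tor"], ["vpn", "tor", "tor", "tor"])
assert abs(score - 11 / 15) < 1e-9
```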

Generation Quality

Measured using Jensen-Shannon Divergence (JSD), where lower values indicate closer similarity to real traffic.

  • Packet Header JSD (Avg): 0.1605.
  • Flow Feature JSD (Avg): 0.2396.
  • Discriminator Realism: F1-Score of 0.6683.
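JSD compares the distribution of a feature (e.g. a header-field histogram) in generated traffic against real traffic. The logarithm base used in the paper is not stated here; the sketch below assumes base 2, which bounds the divergence in [0, 1].

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two histograms (base 2, range [0, 1]).
    The paper's exact base/normalization is an assumption here."""
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()        # normalize to probability vectors
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# identical distributions -> 0; fully disjoint support -> 1
assert jsd([0.5, 0.5], [0.5, 0.5]) < 1e-6
assert abs(jsd([1, 0], [0, 1]) - 1.0) < 1e-6
```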

Limitations

  • Networking Interpretation: IP address and port fields, along with checksums, are removed during preprocessing, so the model does not learn IP/port associations. While this ensures the model learns protocol features rather than metadata, it limits the model's utility in scenarios where port-protocol mapping is vital for networking interpretation.
  • Protocol Anomalies: May occasionally generate malformed packets in complex encrypted protocols (e.g., TLS Client Hello).
  • Inter-flow Correlation: Currently focuses on individual TCP/UDP flows and does not yet capture complex correlations between multiple distinct flows.
  • Computational Cost: Although attention scales linearly, training on 3k-token sequences still requires significant memory and per-step time optimization.

Citation

If you use TrafficGPT in your research, please cite:

```bibtex
@article{qu2024trafficgpt,
  title={TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation},
  author={Qu, Jian and Ma, Xiaobo and Li, Jianfeng},
  journal={arXiv preprint arXiv:2403.05822},
  year={2024}
}
```