|
|
---
license: apache-2.0
library_name: transformers
tags:
- network-security
- traffic-analysis
- traffic-generation
- npre
- linear-attention
- arxiv:2403.05822
datasets:
- ISCX-Tor2016
- USTCTFC2016
- ISCXVPN2016
- DoHBrw2020
- CICIoT2022
metrics:
- f1
- jensen-shannon divergence (jsd)
pipeline_tag: text-generation
extra_gated_prompt: This model is released as part of an SoK experiment. Please cite the original TrafficGPT paper and the experimenter's repository.
model-index:
- name: TrafficGPT(3k)
  results:
  - task:
      type: text-classification
      name: Flow Classification
    dataset:
      name: ISCX-VPN-App
      type: ISCXVPN2016
    metrics:
    - name: Macro F1
      type: f1
      value: 1.0
  - task:
      type: text-classification
      name: Flow Classification
    dataset:
      name: USTC-TFC
      type: USTCTFC2016
    metrics:
    - name: Macro F1
      type: f1
      value: 0.9877
language:
- hex
base_model:
- jianqu/TrafficGPT
---
|
|
|
|
|
# TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation |
|
|
|
|
|
TrafficGPT is a deep-learning foundation model designed to tackle complex challenges in network traffic analysis and generation. By leveraging **generative pre-training** with a **linear attention mechanism**, it expands the effective token window from the traditional 512-token limit to **12,032 tokens**. |
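As a quick-start illustration, below is a minimal sketch of how this checkpoint could be loaded and sampled with the `transformers` library. The repository id `jianqu/TrafficGPT` comes from this card's metadata; the use of `trust_remote_code=True`, the tokenizer behavior, and the hex prompt are assumptions, so consult the repository for the exact loading procedure.

```python
# Hedged usage sketch: loading the checkpoint and sampling hex tokens.
# Assumes the repo exposes a causal-LM head and a tokenizer; the custom
# linear-attention architecture likely requires trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "jianqu/TrafficGPT"  # base model listed in this card's metadata
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Prompt with the first bytes of a flow (hex tokens) and let the model continue it.
prompt = "45 00 00 3c 1c 46 40 00"  # illustrative IPv4 header prefix, not a real flow
inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95)
print(tokenizer.decode(generated[0]))
```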
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Developed by:** Jian Qu, Xiaobo Ma, and Jianfeng Li (Xi'an Jiaotong University). |
|
|
- **Model Type:** Generative Pre-trained Transformer with Linear Attention. |
|
|
- **Architecture:** 24 layers, 12 attention heads, hidden dimension of 512. |
|
|
- **Key Innovations:**
  - **Reversible Tokenization:** Bijective mapping between PCAP files and token lists for direct traffic reconstruction (see the sketch after this list).
  - **Linear Complexity:** Reduces self-attention complexity from $O(N^2)$ to $O(N)$.
  - **Reversible Network:** Optimized memory usage based on the Reformer architecture.
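To make the reversible-tokenization idea concrete, here is a minimal Python sketch of a bijective mapping between raw packet bytes and a token list. The two-hex-digits-per-byte vocabulary and the packet-boundary marker are illustrative assumptions, not TrafficGPT's exact vocabulary.

```python
# Minimal sketch of reversible (bijective) tokenization: every packet byte maps to
# exactly one hex token, so the original bytes can be reconstructed losslessly.
# The "<pkt>" boundary token is an illustrative assumption, not TrafficGPT's vocabulary.

PKT_BOUNDARY = "<pkt>"

def packets_to_tokens(packets: list[bytes]) -> list[str]:
    """Encode a list of raw packets as a flat token list."""
    tokens: list[str] = []
    for pkt in packets:
        tokens.append(PKT_BOUNDARY)              # mark the start of each packet
        tokens.extend(f"{b:02x}" for b in pkt)   # one token per byte, e.g. 0x1F -> "1f"
    return tokens

def tokens_to_packets(tokens: list[str]) -> list[bytes]:
    """Invert the mapping exactly: tokens back to the original packet bytes."""
    packets: list[bytes] = []
    current: list[int] = []
    for tok in tokens:
        if tok == PKT_BOUNDARY:
            if current:
                packets.append(bytes(current))
            current = []
        else:
            current.append(int(tok, 16))
    if current:
        packets.append(bytes(current))
    return packets

if __name__ == "__main__":
    pkts = [bytes.fromhex("4500003c1c46"), bytes.fromhex("deadbeef")]
    assert tokens_to_packets(packets_to_tokens(pkts)) == pkts  # round-trip is lossless
```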
|
|
|
|
|
## Intended Use |
|
|
|
|
|
- **Traffic Classification:** High-accuracy identification of encrypted flows, VPN traffic, and IoT device communications. |
|
|
- **Traffic Generation:** Creating realistic, protocol-compliant PCAP files for network simulation and security testing. |
|
|
- **Protocol Reverse Engineering:** Learning robust representations of unknown or complex network protocols. |
|
|
|
|
|
## Training Data |
|
|
The model was pre-trained on **189 GB** of raw network traffic across five major datasets: |
|
|
- **ISCX-Tor2016:** Tor network traffic characterization. |
|
|
- **USTCTFC2016:** Malware and software identification traffic. |
|
|
- **ISCXVPN2016:** Encrypted VPN vs. non-VPN flows. |
|
|
- **DoHBrw2020:** DNS-over-HTTPS tunnel detection. |
|
|
- **CICIoT2022:** Multidimensional IoT profiling data. |
|
|
|
|
|
## Training Details & Hyperparameters |
|
|
While the original TrafficGPT research utilized a 99:1 train-test split (99% for pre-training, 1% for testing), this open-source version employs a standard 80:20 split. |
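As a rough illustration of that split, the snippet below partitions flow samples 80:20 with a fixed seed. The flow identifiers, labels, and stratification are placeholder assumptions, not the exact procedure used for this release.

```python
# Hedged sketch of an 80:20 flow-level split; placeholders only.
from sklearn.model_selection import train_test_split

flows = [f"flow_{i}" for i in range(1000)]   # placeholder flow identifiers
labels = [i % 5 for i in range(1000)]        # placeholder class labels

train_flows, test_flows, train_y, test_y = train_test_split(
    flows, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_flows), len(test_flows))     # 800 200
```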
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
### Classification Performance (Macro F1-Score) |
|
|
TrafficGPT(12k) consistently outperforms existing state-of-the-art models on flow-classification benchmarks (a reproduction sketch using scikit-learn follows the table).
|
|
|
|
|
| Dataset | Metric | TrafficGPT (12k) |
| :--- | :--- | :--- |
| **ISCX-VPN-App** | Macro F1 | **1.0000** |
| **USTC-TFC** | Macro F1 | **0.9877** |
| **Cross-Platform (iOS)** | Macro F1 | **0.9863** |
| **Cross-Platform (Android)** | Macro F1 | **0.9498** |
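For reproduction, the Macro F1 values above can be computed per dataset with scikit-learn; the label lists below are placeholders, not actual predictions.

```python
# Hedged sketch: computing the Macro F1 reported above with scikit-learn.
# y_true / y_pred are placeholders for per-flow ground-truth and predicted labels.
from sklearn.metrics import f1_score

y_true = ["vpn-chat", "vpn-email", "non-vpn-chat", "vpn-chat"]
y_pred = ["vpn-chat", "vpn-email", "non-vpn-chat", "vpn-email"]

macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1: {macro_f1:.4f}")
```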
|
|
|
|
|
### Generation Quality |
|
|
Generation quality is measured with the Jensen-Shannon Divergence (JSD) between real and generated traffic distributions; lower values indicate closer similarity to real traffic (a computation sketch follows the list below).
|
|
- **Packet Header JSD (Avg):** 0.1605. |
|
|
- **Flow Feature JSD (Avg):** 0.2396. |
|
|
- **Discriminator Realism:** F1-Score of 0.6683. |
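Below is a minimal sketch of the JSD computation between a real and a generated feature histogram (for example, packet-length bins). The histograms are placeholders; note that SciPy's `jensenshannon` returns the JS distance, so it is squared to obtain the divergence.

```python
# Hedged sketch: Jensen-Shannon Divergence between real and generated feature histograms.
# scipy's jensenshannon() returns the JS *distance* (square root of the divergence),
# so the result is squared. The histograms below are placeholders, not paper data.
import numpy as np
from scipy.spatial.distance import jensenshannon

real_hist = np.array([0.10, 0.25, 0.40, 0.20, 0.05])       # packet-length bins (real)
generated_hist = np.array([0.12, 0.22, 0.38, 0.21, 0.07])  # same bins (generated)

jsd = jensenshannon(real_hist, generated_hist, base=2) ** 2
print(f"JSD: {jsd:.4f}")  # 0.0 means identical distributions; lower is better
```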
|
|
|
|
|
## Limitations |
|
|
- **Networking Interpretation:** Because IP addresses, port numbers, and checksums are removed during tokenization, the model does not learn IP/port associations. While this ensures the model learns protocol features rather than metadata, it limits the model's utility in scenarios where port-protocol mapping is vital for networking interpretation.
|
|
- **Protocol Anomalies:** May occasionally generate malformed packets in complex encrypted protocols (e.g., TLS Client Hello). |
|
|
- **Inter-flow Correlation:** Currently focuses on individual TCP/UDP flows and does not yet capture complex correlations between multiple distinct flows. |
|
|
- **Computational Cost:** Although attention complexity is linear, training on 3k-token sequences still requires significant memory and careful optimization of per-step training time.
|
|
|
|
|
## Citation |
|
|
If you use TrafficGPT in your research, please cite: |
|
|
```bibtex
@article{qu2024trafficgpt,
  title={TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation},
  author={Qu, Jian and Ma, Xiaobo and Li, Jianfeng},
  journal={arXiv preprint arXiv:2403.05822},
  year={2024}
}
```