---
license: apache-2.0
library_name: transformers
tags:
- network-security
- traffic-analysis
- traffic-generation
- npre
- linear-attention
- arxiv:2403.05822
datasets:
- ISCX-Tor2016
- USTCTFC2016
- ISCXVPN2016
- DoHBrw2020
- CICIoT2022
metrics:
- f1
- jensen-shannon divergence (jsd)
pipeline_tag: text-generation
extra_gated_prompt: This model is released as part of an SoK experiment. Please cite the original TrafficGPT paper and the experimenter's repository.
model-index:
- name: TrafficGPT(3k)
  results:
  - task:
      type: text-classification
      name: Flow Classification
    dataset:
      name: ISCX-VPN-App
      type: ISCXVPN2016
    metrics:
    - name: Macro F1
      type: f1
      value: 1
  - task:
      type: text-classification
      name: Flow Classification
    dataset:
      name: USTC-TFC
      type: USTCTFC2016
    metrics:
    - name: Macro F1
      type: f1
      value: 0.9877
language:
- hex
base_model:
- jianqu/TrafficGPT

---

# TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation

TrafficGPT is a deep-learning foundation model designed to tackle complex challenges in network traffic analysis and generation. By leveraging **generative pre-training** with a **linear attention mechanism**, it expands the effective token window from the traditional 512-token limit to **12,032 tokens**.
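As a concrete illustration of the mechanism, here is a minimal, self-contained sketch of kernelized linear attention. This is not TrafficGPT's actual code; it assumes the common `elu(x) + 1` feature map and shows why the similarity factorizes, so the $N \times N$ attention matrix is never materialized.

```python
import math

def phi(x):
    # elu(x) + 1 feature map: keeps similarities positive.
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(Q, K, V):
    """O(N) attention: accumulate key/value summaries once, reuse per query."""
    d_k, d_v = len(K[0]), len(V[0])
    # S = sum_j phi(k_j) v_j^T  (d_k x d_v),  z = sum_j phi(k_j)
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / denom
                    for b in range(d_v)])
    return out
```

Because the summaries `S` and `z` are computed once and shared by every query, the cost grows linearly with sequence length instead of quadratically.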

## Model Details

- **Developed by:** Jian Qu, Xiaobo Ma, and Jianfeng Li (Xi'an Jiaotong University).
- **Model Type:** Generative Pre-trained Transformer with Linear Attention.
- **Architecture:** 24 layers, 12 attention heads, hidden dimension of 512.
- **Key Innovations:**
  - **Reversible Tokenization:** Bijective mapping between PCAP files and token lists for direct traffic reconstruction.
  - **Linear Complexity:** Reduces self-attention complexity from $O(N^2)$ to $O(N)$.
  - **Reversible Network:** Optimized memory usage based on the Reformer architecture.
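The reversible-tokenization property can be illustrated with a toy round trip. This sketch is not the paper's tokenizer (which operates on PCAP fields); it only demonstrates the bijectivity requirement on raw bytes: every byte maps to exactly one token and back, so serialized traffic can be reconstructed losslessly from a token list.

```python
def bytes_to_tokens(data: bytes) -> list[str]:
    # One token per byte, written as two lowercase hex digits.
    return [f"{b:02x}" for b in data]

def tokens_to_bytes(tokens: list[str]) -> bytes:
    # Exact inverse: parse each hex token back into its byte value.
    return bytes(int(t, 16) for t in tokens)
```

A bijective scheme like this is what allows generated token sequences to be written straight back out as PCAP bytes, with no separate decoding model.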

## Intended Use

- **Traffic Classification:** High-accuracy identification of encrypted flows, VPN traffic, and IoT device communications.
- **Traffic Generation:** Creating realistic, protocol-compliant PCAP files for network simulation and security testing.
- **Protocol Reverse Engineering:** Learning robust representations of unknown or complex network protocols.

## Training Data
The model was pre-trained on **189 GB** of raw network traffic across five major datasets:
- **ISCX-Tor2016:** Tor network traffic characterization.
- **USTCTFC2016:** Malware and software identification traffic.
- **ISCXVPN2016:** Encrypted VPN vs. non-VPN flows.
- **DoHBrw2020:** DNS-over-HTTPS tunnel detection.
- **CICIoT2022:** Multidimensional IoT profiling data.

## Training Details & Hyperparameters
While the original TrafficGPT research utilized a 99:1 train-test split (99% for pre-training, 1% for testing), this open-source version employs a standard 80:20 split.

## Evaluation Results

### Classification Performance (Macro F1-Score)
TrafficGPT(12k) consistently outperforms existing state-of-the-art models.

| Dataset | Metric | TrafficGPT (12k) |
| :--- | :--- | :--- |
| **ISCX-VPN-App** | Macro F1 | **1.0000** |
| **USTC-TFC** | Macro F1 | **0.9877** |
| **Cross-Platform (iOS)** | Macro F1 | **0.9863** |
| **Cross-Platform (Android)** | Macro F1 | **0.9498** |
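For reference, the Macro F1 reported above averages per-class F1 scores with equal weight, so rare classes count as much as common ones. A minimal pure-Python sketch of the metric (matching scikit-learn's `f1_score(..., average="macro")` in the common case):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```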

### Generation Quality
Measured using Jensen-Shannon Divergence (JSD), where lower values indicate closer similarity to real traffic.
- **Packet Header JSD (Avg):** 0.1605.
- **Flow Feature JSD (Avg):** 0.2396.
- **Discriminator Realism:** F1-Score of 0.6683.
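Jensen-Shannon Divergence is the mean KL divergence of each distribution to their midpoint. A minimal sketch, using the natural logarithm (the paper's choice of log base, which sets the upper bound at ln 2 or 1, is not stated here):

```python
import math

def jsd(p, q):
    """JSD between two discrete distributions given as probability lists."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions give 0, so lower values mean the generated traffic's histograms (e.g., of header-field values) are closer to those of real traffic.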

## Limitations
- **Networking Interpretation:** Identifying fields such as IP addresses and ports, along with checksums, are removed during tokenization, so the model does not learn IP/port associations. While this ensures the model learns protocol features rather than metadata, it limits the model's utility in scenarios where port-to-protocol mapping is vital for networking interpretation.
- **Protocol Anomalies:** May occasionally generate malformed packets in complex encrypted protocols (e.g., TLS Client Hello).
- **Inter-flow Correlation:** Currently focuses on individual TCP/UDP flows and does not yet capture complex correlations between multiple distinct flows.
- **Computational Cost:** While linear in complexity, training on 3k tokens still requires significant memory and step-time optimization.

## Citation
If you use TrafficGPT in your research, please cite:
```bibtex
@article{qu2024trafficgpt,
  title={TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation},
  author={Qu, Jian and Ma, Xiaobo and Li, Jianfeng},
  journal={arXiv preprint arXiv:2403.05822},
  year={2024}
}
```