---
license: apache-2.0
library_name: transformers
tags:
  - network-security
  - traffic-analysis
  - traffic-generation
  - npre
  - linear-attention
  - arxiv:2403.05822
datasets:
  - ISCX-Tor2016
  - USTCTFC2016
  - ISCXVPN2016
  - DoHBrw2020
  - CICIoT2022
metrics:
  - f1
  - jensen-shannon divergence (jsd)
pipeline_tag: text-generation
extra_gated_prompt: >-
  This model is released as part of an SoK experiment. Please cite the original
  TrafficGPT paper and the experimenter's repository.
model-index:
  - name: TrafficGPT(3k)
    results:
      - task:
          type: text-classification
          name: Flow Classification
        dataset:
          name: ISCX-VPN-App
          type: ISCXVPN2016
        metrics:
          - name: Macro F1
            type: f1
            value: 1
      - task:
          type: text-classification
          name: Flow Classification
        dataset:
          name: USTC-TFC
          type: USTCTFC2016
        metrics:
          - name: Macro F1
            type: f1
            value: 0.9877
language:
  - hex
base_model:
  - jianqu/TrafficGPT
---

# TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation

TrafficGPT is a deep-learning foundation model designed to tackle complex challenges in network traffic analysis and generation. By leveraging **generative pre-training** with a **linear attention mechanism**, it expands the effective token window from the traditional 512-token limit to **12,032 tokens**.

## Model Details

- **Developed by:** Jian Qu, Xiaobo Ma, and Jianfeng Li (Xi'an Jiaotong University).
- **Model Type:** Generative Pre-trained Transformer with linear attention.
- **Architecture:** 24 layers, 12 attention heads, hidden dimension of 512.
- **Key Innovations:**
  - **Reversible Tokenization:** Bijective mapping between PCAP files and token lists, enabling direct traffic reconstruction.
  - **Linear Complexity:** Reduces self-attention complexity from $O(N^2)$ to $O(N)$.
  - **Reversible Network:** Memory usage optimized via a reversible-layer design based on the Reformer architecture.

## Intended Use

- **Traffic Classification:** High-accuracy identification of encrypted flows, VPN traffic, and IoT device communications.
- **Traffic Generation:** Creating realistic, protocol-compliant PCAP files for network simulation and security testing.
- **Protocol Reverse Engineering:** Learning robust representations of unknown or complex network protocols.

## Training Data

The model was pre-trained on **189 GB** of raw network traffic across five major datasets:

- **ISCX-Tor2016:** Tor network traffic characterization.
- **USTCTFC2016:** Malware and software identification traffic.
- **ISCXVPN2016:** Encrypted VPN vs. non-VPN flows.
- **DoHBrw2020:** DNS-over-HTTPS tunnel detection.
- **CICIoT2022:** Multidimensional IoT profiling data.

## Training Details & Hyperparameters

While the original TrafficGPT research used a 99:1 train-test split (99% for pre-training, 1% for testing), this open-source version employs a standard 80:20 split.

## Evaluation Results

### Classification Performance (Macro F1-Score)

TrafficGPT (12k) consistently outperforms existing state-of-the-art models.

| Dataset | Metric | TrafficGPT (12k) |
| :--- | :--- | :--- |
| **ISCX-VPN-App** | Macro F1 | **1.0000** |
| **USTC-TFC** | Macro F1 | **0.9877** |
| **Cross-Platform (iOS)** | Macro F1 | **0.9863** |
| **Cross-Platform (Android)** | Macro F1 | **0.9498** |

### Generation Quality

Measured using Jensen-Shannon divergence (JSD), where lower values indicate closer similarity to real traffic:

- **Packet Header JSD (avg.):** 0.1605
- **Flow Feature JSD (avg.):** 0.2396
- **Discriminator Realism:** F1-score of 0.6683

## Limitations

- **Networking Interpretation:** Because IP/port fields and checksums are removed during tokenization, the model does not learn IP/port associations. While this ensures the model learns protocol features rather than metadata, it limits the model's utility in scenarios where port-protocol mapping is vital for networking interpretation.
- **Protocol Anomalies:** May occasionally generate malformed packets in complex encrypted protocols (e.g., TLS Client Hello).
- **Inter-flow Correlation:** Currently focuses on individual TCP/UDP flows and does not yet capture complex correlations between multiple distinct flows.
- **Computational Cost:** Although attention is linear in complexity, training on 3k-token sequences still requires significant memory and careful step-time optimization.

## Citation

If you use TrafficGPT in your research, please cite:

```bibtex
@article{qu2024trafficgpt,
  title={TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation},
  author={Qu, Jian and Ma, Xiaobo and Li, Jianfeng},
  journal={arXiv preprint arXiv:2403.05822},
  year={2024}
}
```
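## Appendix: Illustrative Sketches

To make the reversible-tokenization idea from Model Details concrete, here is a minimal sketch of a bijective mapping between raw packet bytes and hex tokens. This is an illustrative stand-in, not TrafficGPT's actual tokenizer; the function names and the one-byte-per-token granularity are assumptions.

```python
# Hedged sketch of reversible (bijective) byte-level hex tokenization.
# NOT TrafficGPT's real tokenizer; granularity and names are illustrative.

def tokenize(packet: bytes) -> list[str]:
    """Map each packet byte to a two-character hex token."""
    return [f"{b:02x}" for b in packet]

def detokenize(tokens: list[str]) -> bytes:
    """Invert tokenize(): rebuild the exact original byte sequence."""
    return bytes(int(t, 16) for t in tokens)

raw = bytes([0x45, 0x00, 0x00, 0x28])   # e.g. the start of an IPv4 header
toks = tokenize(raw)
print(toks)                              # ['45', '00', '00', '28']
assert detokenize(toks) == raw           # bijective: round-trip is lossless
```

Because the mapping is bijective, generated token sequences can be written straight back into packet bytes, which is what enables direct PCAP reconstruction.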