---
license: apache-2.0
library_name: transformers
tags:
- network-security
- traffic-analysis
- traffic-generation
- npre
- linear-attention
- arxiv:2403.05822
datasets:
- ISCX-Tor2016
- USTCTFC2016
- ISCXVPN2016
- DoHBrw2020
- CICIoT2022
metrics:
- f1
- jensen-shannon divergence (jsd)
pipeline_tag: text-generation
extra_gated_prompt: This model is released as part of an SoK experiment. Please cite the original TrafficGPT paper and the experimenter's repository.
model-index:
- name: TrafficGPT(3k)
results:
- task:
type: text-classification
name: Flow Classification
dataset:
name: ISCX-VPN-App
type: ISCXVPN2016
metrics:
- name: Macro F1
type: f1
value: 1
- task:
type: text-classification
name: Flow Classification
dataset:
name: USTC-TFC
type: USTCTFC2016
metrics:
- name: Macro F1
type: f1
value: 0.9877
language:
- hex
base_model:
- jianqu/TrafficGPT
---
# TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation
TrafficGPT is a deep-learning foundation model designed to tackle complex challenges in network traffic analysis and generation. By leveraging **generative pre-training** with a **linear attention mechanism**, it expands the effective token window from the traditional 512-token limit to **12,032 tokens**.
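A minimal loading sketch with the `transformers` library. The repo id below is the base model listed in this card's metadata, and `trust_remote_code=True` is an assumption (linear-attention architectures typically ship as custom code); adjust both to the actual checkpoint you are using:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id (taken from base_model in the metadata); adjust as needed.
repo_id = "jianqu/TrafficGPT"

# Custom architectures usually require trust_remote_code=True;
# inspect the repository code before enabling this.
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
```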
## Model Details
- **Developed by:** Jian Qu, Xiaobo Ma, and Jianfeng Li (Xi'an Jiaotong University).
- **Model Type:** Generative Pre-trained Transformer with Linear Attention.
- **Architecture:** 24 layers, 12 attention heads, hidden dimension of 512.
- **Key Innovations:**
  - **Reversible Tokenization:** Bijective mapping between PCAP files and token lists for direct traffic reconstruction.
  - **Linear Complexity:** Reduces self-attention complexity from $O(N^2)$ to $O(N)$ (see the sketch after this list).
  - **Reversible Network:** Optimized memory usage based on the Reformer architecture.
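The linear-complexity claim follows from reordering the attention product: with a positive feature map $\phi$, attention can be computed as $\phi(Q)\,(\phi(K)^\top V)$, so the $N \times N$ score matrix is never materialized. Below is a minimal PyTorch sketch of this idea using $\phi(x) = \mathrm{elu}(x) + 1$ from the linear-transformers literature; it illustrates the mechanism, not TrafficGPT's exact implementation:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention: phi(Q) @ (phi(K)^T @ V), never forming the N x N matrix.

    q, k, v: tensors of shape (batch, seq_len, dim).
    Illustrative feature map phi(x) = elu(x) + 1; TrafficGPT's may differ.
    """
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", phi_k, v)          # O(N) key/value summary
    norm = torch.einsum("bnd,bd->bn", phi_q, phi_k.sum(dim=1)) + eps
    return torch.einsum("bnd,bde->bne", phi_q, kv) / norm.unsqueeze(-1)

# A full 12k-token window fits without a 12032 x 12032 attention matrix.
q = k = v = torch.randn(1, 12032, 64)
out = linear_attention(q, k, v)                          # shape (1, 12032, 64)
```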
## Intended Use
- **Traffic Classification:** High-accuracy identification of encrypted flows, VPN traffic, and IoT device communications.
- **Traffic Generation:** Creating realistic, protocol-compliant PCAP files for network simulation and security testing (see the sketch after this list).
- **Protocol Reverse Engineering:** Learning robust representations of unknown or complex network protocols.
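As an example of the generation use case: because tokenization is a bijective byte-to-hex mapping, generated token sequences can be decoded straight back into raw packet bytes. The sketch below is hypothetical; the prompt format, sampling settings, and detokenization routine are assumptions to be replaced by the code shipped with the model repo:

```python
# Hypothetical generation sketch; `model` and `tokenizer` as loaded above.
prompt = "45 00"  # assumed prompt format: first bytes of an IPv4 header as hex tokens
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True)

# Reversible detokenization: each two-character hex token maps back to one byte.
hex_tokens = tokenizer.decode(output_ids[0], skip_special_tokens=True).split()
packet_bytes = bytes(int(tok, 16) for tok in hex_tokens if len(tok) == 2)
```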
## Training Data
The model was pre-trained on **189 GB** of raw network traffic across five major datasets:
- **ISCX-Tor2016:** Tor network traffic characterization.
- **USTCTFC2016:** Malware and software identification traffic.
- **ISCXVPN2016:** Encrypted VPN vs. non-VPN flows.
- **DoHBrw2020:** DNS-over-HTTPS tunnel detection.
- **CICIoT2022:** Multidimensional IoT profiling data.
## Training Details & Hyperparameters
While the original TrafficGPT research utilized a 99:1 train-test split (99% for pre-training, 1% for testing), this open-source version employs a standard 80:20 split.
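A minimal sketch of the 80:20 split, assuming flows have already been extracted (splitting at flow level keeps packets of the same flow out of both partitions; `flows` and the seed are placeholders):

```python
from sklearn.model_selection import train_test_split

flows = [f"flow_{i}" for i in range(1000)]  # placeholder for tokenized flows

# Split at flow granularity so no flow contributes to both partitions.
train_flows, test_flows = train_test_split(flows, test_size=0.2, random_state=42)
```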
## Evaluation Results
### Classification Performance (Macro F1-Score)
TrafficGPT(12k) consistently outperforms existing state-of-the-art models.
| Dataset | Metric | TrafficGPT (12k) |
| :--- | :--- | :--- |
| **ISCX-VPN-App** | Macro F1 | **1.0000** |
| **USTC-TFC** | Macro F1 | **0.9877** |
| **Cross-Platform (iOS)** | Macro F1 | **0.9863** |
| **Cross-Platform (Android)** | Macro F1 | **0.9498** |
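Macro F1 is the unweighted mean of the per-class F1 scores, so minority traffic classes count as much as dominant ones. A minimal scikit-learn sketch with placeholder labels:

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]  # placeholder ground-truth flow labels
y_pred = [0, 0, 1, 1, 2, 1]  # placeholder model predictions

macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted per-class mean
```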
### Generation Quality
Measured using Jensen-Shannon Divergence (JSD), where lower values indicate closer similarity to real traffic.
- **Packet Header JSD (Avg):** 0.1605.
- **Flow Feature JSD (Avg):** 0.2396.
- **Discriminator Realism:** F1-Score of 0.6683.
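For reference, JSD compares the distribution of a feature (e.g., a header-field histogram) in generated versus real traffic; 0 means identical distributions. A minimal SciPy sketch with placeholder histograms; note that `jensenshannon` returns the JS *distance* (the square root of the divergence), so it is squared here:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Placeholder histograms of one packet-header field, normalized to sum to 1.
real = np.array([0.5, 0.3, 0.2])
generated = np.array([0.4, 0.4, 0.2])

jsd = jensenshannon(real, generated, base=2) ** 2  # square the distance to get JSD
```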
## Limitations
- **Networking Interpretation:** IP address and port fields, along with checksums, are removed during tokenization, so the model does not learn IP/port associations. While this ensures the model learns protocol features rather than endpoint metadata, it limits the model's utility in scenarios where port-protocol mapping is vital for networking interpretation.
- **Protocol Anomalies:** May occasionally generate malformed packets in complex encrypted protocols (e.g., TLS Client Hello).
- **Inter-flow Correlation:** Currently focuses on individual TCP/UDP flows and does not yet capture complex correlations between multiple distinct flows.
- **Computational Cost:** While attention scales linearly, training on 3k-token contexts still requires significant memory and per-step training time.
## Citation
If you use TrafficGPT in your research, please cite:
```bibtex
@article{qu2024trafficgpt,
title={TrafficGPT: Breaking the Token Barrier for Efficient Long Traffic Analysis and Generation},
author={Qu, Jian and Ma, Xiaobo and Li, Jianfeng},
journal={arXiv preprint arXiv:2403.05822},
year={2024}
}
```