LightGBM Models for QUIC Encrypted Traffic Classification
This repository contains eight trained LightGBM model variants for classifying encrypted QUIC network traffic into 17 application categories. The models are part of a research project evaluating the impact of hyperparameter tuning (Optuna) and resampling techniques (SMOTE, RUS) on multiclass traffic classification performance.
Source code: GitHub repository
Dataset
All models are trained on the CESNET-QUIC22 dataset.
- Source: Zenodo -- CESNET-QUIC22
- Parent repository: CESNET Liberouter Datasets
- Capture period: October 31 -- November 6, 2022
- Protocol: QUIC encrypted traffic flows
- Split ratio: 60% train / 20% validation / 20% test
- Preprocessing: Feature engineering on PHIST histograms, StandardScaler normalization
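The 60/20/20 split and StandardScaler normalization can be sketched as follows with scikit-learn. This is a minimal illustration, not the project's actual preprocessing code; the random seed and variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in data: 100 flows with 24 engineered features each.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 24))
y = rng.integers(0, 17, 100)

# Split off 20% for test, then 25% of the remainder for validation
# (0.25 * 0.8 = 0.2 of the total), yielding a 60/20/20 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

# Fit the scaler on training data only, then apply it to all three splits
# to avoid leaking test statistics into training.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```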
Model Variants
Eight model variants are provided, combining different training data configurations with default or Optuna-optimized hyperparameters.
| Model | Directory | Data | Hyperparameters |
|---|---|---|---|
| Baseline | saved_model_baseline/ | Original | Default |
| Baseline + Optuna | saved_model_baseline_optuna/ | Original | Optuna-optimized |
| RUS | saved_model_0.rus/ | RUS-resampled | Default |
| RUS + Optuna | saved_model_rus_optuna/ | RUS-resampled | Optuna-optimized |
| SMOTE | saved_model_0.smote/ | SMOTE-resampled | Default |
| SMOTE + Optuna | saved_model_smote_optuna/ | SMOTE-resampled | Optuna-optimized |
| SMOTE-RUS | saved_model_0.smote.rus/ | SMOTE-RUS-resampled | Default |
| SMOTE-RUS + Optuna | saved_model_smote-rus_optuna/ | SMOTE-RUS-resampled | Optuna-optimized |
Each model directory contains:
- lightgbm_model.pkl -- serialized LightGBM Booster object
- lightgbm_model_info.json -- training parameters, feature names, and evaluation metrics
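A hypothetical sketch of inspecting a model directory's metadata file. The JSON keys used here (params, feature_names, metrics) are assumptions for illustration; the actual schema is whatever lightgbm_model_info.json in each directory contains. A temporary file stands in for the real one so the snippet is self-contained.

```python
import json
import os
import tempfile

# Mock metadata mirroring the description above: training parameters,
# feature names, and evaluation metrics. Keys are illustrative.
sample_info = {
    "params": {"objective": "multiclass", "num_class": 17},
    "feature_names": ["duration", "bytes", "packets"],  # truncated
    "metrics": {"accuracy": 0.8417, "f1_weighted": 0.8400},
}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "lightgbm_model_info.json")
    with open(path, "w") as f:
        json.dump(sample_info, f)

    # Reading the info file back is a plain json.load.
    with open(path) as f:
        info = json.load(f)

print(info["params"]["num_class"], "classes,", len(info["feature_names"]), "features listed")
```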
Performance (Test Set)
| Model | Accuracy | Precision | Recall | F1-Score | AUROC | AUPRC |
|---|---|---|---|---|---|---|
| Baseline | 0.6852 | 0.6943 | 0.6852 | 0.6759 | 0.9300 | 0.7630 |
| Baseline + Optuna | 0.8417 | 0.8414 | 0.8417 | 0.8400 | 0.9831 | 0.9214 |
| RUS | 0.6674 | 0.7014 | 0.6674 | 0.6737 | 0.9362 | 0.7788 |
| RUS + Optuna | 0.8115 | 0.8263 | 0.8115 | 0.8144 | 0.9800 | 0.9122 |
| SMOTE | 0.7073 | 0.7146 | 0.7073 | 0.7005 | 0.9413 | 0.7914 |
| SMOTE + Optuna | 0.8382 | 0.8379 | 0.8382 | 0.8369 | 0.9825 | 0.9185 |
| SMOTE-RUS | 0.6601 | 0.7024 | 0.6601 | 0.6696 | 0.9355 | 0.7764 |
| SMOTE-RUS + Optuna | 0.8077 | 0.8250 | 0.8077 | 0.8115 | 0.9796 | 0.9105 |
All metrics are weighted averages. AUROC and AUPRC are weighted one-vs-rest.
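The weighted metrics above can be reproduced with scikit-learn along the following lines; the arrays here are random toy data, not model outputs. Note that weighted recall coincides with accuracy by construction.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, average_precision_score,
)
from sklearn.preprocessing import label_binarize

n_classes = 17
rng = np.random.default_rng(0)
y_true = rng.integers(0, n_classes, 500)

# Fake probability matrix: rows must sum to 1 for multiclass AUROC.
y_proba = rng.random((500, n_classes))
y_proba /= y_proba.sum(axis=1, keepdims=True)
y_pred = y_proba.argmax(axis=1)

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="weighted", zero_division=0)
rec = recall_score(y_true, y_pred, average="weighted")
f1 = f1_score(y_true, y_pred, average="weighted")

# AUROC: weighted one-vs-rest, directly supported for multiclass.
auroc = roc_auc_score(y_true, y_proba, multi_class="ovr", average="weighted")

# AUPRC: binarize labels first, then use the multilabel form.
y_bin = label_binarize(y_true, classes=np.arange(n_classes))
auprc = average_precision_score(y_bin, y_proba, average="weighted")
```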
Class Labels
The models classify traffic into 17 application categories:
| Label | Category |
|---|---|
| 0 | Other services and APIs |
| 1 | Streaming media |
| 2 | Social |
| 3 | Advertising |
| 4 | Search |
| 5 | Music |
| 6 | Authentication services |
| 7 | Instant messaging |
| 8 | Antivirus |
| 9 | File sharing |
| 10 | Mail |
| 11 | E-commerce |
| 12 | Games |
| 13 | Analytics and Telemetry |
| 14 | Blogs and News |
| 15 | Information systems |
| 16 | Videoconferencing |
Input Features
Each model expects 24 numerical features (StandardScaler-normalized):
| Group | Features |
|---|---|
| Flow-level | duration, bytes, bytes_rev, packets, packets_rev |
| Per-Packet Information | ppi_len, ppi_duration, ppi_roundtrips |
| PHIST Source Sizes | src_sizes_mean, src_sizes_std, src_sizes_skew, src_sizes_kurt |
| PHIST Destination Sizes | dst_sizes_mean, dst_sizes_std, dst_sizes_skew, dst_sizes_kurt |
| PHIST Source IPT | src_ipt_mean, src_ipt_std, src_ipt_skew, src_ipt_kurt |
| PHIST Destination IPT | dst_ipt_mean, dst_ipt_std, dst_ipt_skew, dst_ipt_kurt |
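Each PHIST group contributes the same four summary statistics (mean, standard deviation, skewness, excess kurtosis). As a simplified illustration only: the dataset ships these as histogram-derived features, but the moments themselves can be computed from a raw sequence of packet sizes with plain NumPy. The sample values below are invented.

```python
import numpy as np

# Hypothetical source-direction packet sizes from one flow (bytes).
src_sizes = np.array([1200.0, 1350.0, 60.0, 1500.0, 1500.0, 90.0, 1440.0])

mu = src_sizes.mean()
sigma = src_sizes.std()
centered = src_sizes - mu

features = {
    "src_sizes_mean": mu,
    "src_sizes_std": sigma,
    # Third standardized moment: asymmetry of the size distribution.
    "src_sizes_skew": (centered**3).mean() / sigma**3,
    # Fourth standardized moment minus 3: excess kurtosis (tail weight).
    "src_sizes_kurt": (centered**4).mean() / sigma**4 - 3.0,
}
```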
Usage
```python
import pickle
import numpy as np

# Load model
with open("saved_model_baseline_optuna/lightgbm_model.pkl", "rb") as f:
    model = pickle.load(f)

# Prepare input (24 features, StandardScaler-normalized)
X = np.array([[...]])  # shape: (n_samples, 24)

# Predict class probabilities, then take the most likely class
predictions_proba = model.predict(X, num_iteration=model.best_iteration)
predictions = np.argmax(predictions_proba, axis=1)

# Map numeric predictions to class names
label_mapping = {
    0: "Other services and APIs",
    1: "Streaming media",
    2: "Social",
    3: "Advertising",
    4: "Search",
    5: "Music",
    6: "Authentication services",
    7: "Instant messaging",
    8: "Antivirus",
    9: "File sharing",
    10: "Mail",
    11: "E-commerce",
    12: "Games",
    13: "Analytics and Telemetry",
    14: "Blogs and News",
    15: "Information systems",
    16: "Videoconferencing",
}
predicted_labels = [label_mapping[p] for p in predictions]
```
Repository Structure
```
.
├── README.md
├── label_mapping.csv
│
├── lightgbm_0_baseline_.ipynb            # Training: baseline (default params)
├── lightgbm_0_baseline_rus.ipynb         # Training: baseline with RUS data
├── lightgbm_0_baseline_smote.ipynb       # Training: baseline with SMOTE data
├── lightgbm_0_baseline_smote.rus.ipynb   # Training: baseline with SMOTE-RUS data
├── lightgbm_1_baseline_optuna.ipynb      # Training: Optuna-tuned
├── lightgbm_1_rus_optuna.ipynb           # Training: Optuna-tuned with RUS
├── lightgbm_1_smote_optuna.ipynb         # Training: Optuna-tuned with SMOTE
├── lightgbm_1_smote-rus_optuna.ipynb     # Training: Optuna-tuned with SMOTE-RUS
│
├── optuna_2_optuna_.ipynb                # Optuna hyperparameter search (baseline)
├── optuna_2_optuna_rus.ipynb             # Optuna hyperparameter search (RUS)
├── optuna_2_optuna_smote.ipynb           # Optuna hyperparameter search (SMOTE)
├── optuna_2_optuna_smote-rus.ipynb       # Optuna hyperparameter search (SMOTE-RUS)
│
├── plot_hyperparams.ipynb                # Hyperparameter visualization
├── plot_testing.ipynb                    # Test result visualization
│
├── saved_hyperparams_baseline/           # Optuna trial results (baseline)
├── saved_hyperparams_rus/                # Optuna trial results (RUS)
├── saved_hyperparams_smote/              # Optuna trial results (SMOTE)
├── saved_hyperparams_smote-rus/          # Optuna trial results (SMOTE-RUS)
│
├── saved_model_baseline/                 # Model: baseline
│   ├── lightgbm_model.pkl
│   └── lightgbm_model_info.json
├── saved_model_baseline_optuna/          # Model: baseline + Optuna
│   ├── lightgbm_model.pkl
│   └── lightgbm_model_info.json
├── saved_model_0.rus/                    # Model: RUS
│   ├── lightgbm_model.pkl
│   └── lightgbm_model_info.json
├── saved_model_rus_optuna/               # Model: RUS + Optuna
│   ├── lightgbm_model.pkl
│   └── lightgbm_model_info.json
├── saved_model_0.smote/                  # Model: SMOTE
│   ├── lightgbm_model.pkl
│   └── lightgbm_model_info.json
├── saved_model_smote_optuna/             # Model: SMOTE + Optuna
│   ├── lightgbm_model.pkl
│   └── lightgbm_model_info.json
├── saved_model_0.smote.rus/              # Model: SMOTE-RUS
│   ├── lightgbm_model.pkl
│   └── lightgbm_model_info.json
└── saved_model_smote-rus_optuna/         # Model: SMOTE-RUS + Optuna
    ├── lightgbm_model.pkl
    └── lightgbm_model_info.json
```
Training Configuration
Best-performing model (Baseline + Optuna) hyperparameters:
| Parameter | Value |
|---|---|
| objective | multiclass |
| num_class | 17 |
| metric | multi_logloss |
| learning_rate | 0.0910 |
| num_leaves | 416 |
| max_depth | 14 |
| min_data_in_leaf | 1200 |
| lambda_l1 | 3 |
| lambda_l2 | 1 |
| feature_fraction | 0.9 |
| is_unbalance | true |
| best_iteration | 500 |
References
- Luxemburk, J., and Cejka, T. (2023). CESNET-QUIC22: A large-scale dataset for QUIC traffic classification. Zenodo. https://zenodo.org/records/10728760
- CESNET Liberouter Datasets. https://www.liberouter.org/technology-v2/tools-services/datasets/
- Ke, G., et al. (2017). LightGBM: A Highly Efficient Gradient Boosting Decision Tree. NeurIPS.
- Akiba, T., et al. (2019). Optuna: A Next-generation Hyperparameter Optimization Framework. KDD.