---
license: apache-2.0
datasets:
- joyce8/EMBER2024
language:
- en
tags:
- malware-detection
- cybersecurity
- onnxruntime
- lightgbm
- pytorch
- tabnet
- binary-classification
pipeline_tag: text-classification
library_name: onnxruntime
---

# EMBER2024 Malware Detection Models

A collection of four model architectures (DNN, TabNet, Hybrid GBDT2NN, LightGBM) trained and evaluated on all eight subsets of the [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024) dataset and converted into deployment-ready formats. The eight subsets are six file formats (Win32, Win64, .NET, APK, ELF, PDF) plus a combined `PE` group and an `all`-types set.

> **Training environment**: GPU server (CUDA 13)
>
> **Dataset paper**: [Joyce et al., KDD 2025 (arXiv:2506.05074)](https://arxiv.org/abs/2506.05074)

---

## Models

| Directory | Architecture | Deployment Format | Parameters |
|-----------|--------------|-------------------|------------|
| `dnn/` | Feed-Forward DNN (PReLU + Dropout) | ONNX (INT8 Static / FP32) | 13.2 M (PE) / 0.98 M (non-PE) |
| `tabnet/` | TabNet ([Arik & Pfister, 2021](https://arxiv.org/abs/1908.07442)) | ONNX FP32 | ~3 M |
| `hybrid/` | GBDT2NN ([DeepGBM, KDD 2019](https://www.microsoft.com/en-us/research/publication/deepgbm-a-deep-learning-framework-distilled-by-gbdt-for-online-prediction-tasks/)) | ONNX (nn_part) + LightGBM booster | ~1 M NN |
| `lightgbm/` | LightGBM (pretrained, [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models)) | Treelite `.tl` | n/a |

### Subset List

| Subset | Target File Type | Input Dim |
|--------|------------------|-----------|
| `PE` | All PE binaries (Win32 + Win64 + .NET) | 2,568 |
| `Win32` | Windows 32-bit PE | 2,568 |
| `Win64` | Windows 64-bit PE | 2,568 |
| `.NET` | .NET assemblies | 2,568 |
| `APK` | Android APK | 696 |
| `ELF` | Linux ELF | 696 |
| `PDF` | PDF documents | 696 |
| `all` | All file types combined | 2,568 |

---

## Directory Structure

Filename convention:
`{model}_{subset}[_suffix].{ext}`. The `.NET` subset is rendered as `dotnet` in filenames.

```
dnn/
├── dnn_PE.onnx             # INT8 Static (deployment; PE/Win32/Win64/dotnet/all)
├── dnn_PE_fp32.onnx        # FP32 ONNX (reference; bundled only for INT8 subsets)
├── dnn_PE.pt               # PyTorch checkpoint
├── dnn_PE_metrics.json     # Evaluation results (AUC, TPR@1%FPR)
├── dnn_PE_benchmark.json   # Size & latency
├── dnn_APK.onnx            # FP32 (non-PE; INT8 AUC loss too large)
├── dnn_APK.pt
└── ...
tabnet/
├── tabnet_PE.onnx          # FP32 ONNX (140 MB; sparsemax unfolding)
├── tabnet_PE.zip           # pytorch-tabnet native (7.4 MB, lightweight)
└── ...
hybrid/
├── hybrid_PE_nnpart.onnx   # GBDT2NN nn_part ONNX (5.1 MB)
├── hybrid_PE_lgbm.model    # LightGBM booster (3.6 MB)
├── hybrid_PE.pt            # PyTorch checkpoint
└── ...
lightgbm/
├── lightgbm_PE.tl          # Treelite serialization (platform-independent; compile on the target platform before use)
└── ...
```

---

## Performance Results (EMBER2024 test set)

> Metrics: ROC-AUC, TPR @ 1% FPR (paper §4.1), and challenge-set detection rate at the FPR=1% threshold.
>
> Challenge set: 6,315 evasive malware samples (positives only; Win32 3,225 / .NET 829 / Win64 814 / PDF 805 / ELF 386 / APK 256).

### DNN

| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9969 | 0.9472 | INT8 Static ONNX | 13.3 MB |
| Win32 | 0.9965 | 0.9479 | INT8 Static ONNX | 13.3 MB |
| Win64 | 0.9969 | 0.9617 | INT8 Static ONNX | 13.3 MB |
| .NET | 0.9920 | 0.8444 | INT8 Static ONNX | 13.3 MB |
| all | 0.9938 | 0.8870 | INT8 Static ONNX | 13.3 MB |
| APK | 0.9761 | 0.7682 | FP32 ONNX | 3.9 MB |
| ELF | 0.9840 | 0.8103 | FP32 ONNX | 3.9 MB |
| PDF | 0.9795 | 0.8902 | FP32 ONNX | 3.9 MB |

> Non-PE subsets (APK/ELF/PDF) use 696-dim inputs and have too few parameters, so INT8 quantization causes a large AUC drop; they are kept in FP32.
>
> Figures are for the INT8 models (fixed 100K-sample set). ΔAUC vs FP32 stays within 0.19 pp.
>
> For the `.NET` and `all` subsets, INT8 quantization causes a relatively larger drop in TPR@1%FPR (both still pass the AUC gate: |ΔAUC| < 0.5 pp).

### TabNet

| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9948 | 0.9195 | FP32 ONNX | 140 MB |
| Win32 | 0.9949 | 0.9317 | FP32 ONNX | 140 MB |
| Win64 | 0.9944 | 0.9318 | FP32 ONNX | 140 MB |
| .NET | 0.9923 | 0.8700 | FP32 ONNX | 140 MB |
| all | 0.9922 | 0.8912 | FP32 ONNX | 140 MB |
| APK | 0.9741 | 0.7028 | FP32 ONNX | 13.5 MB |
| ELF | 0.9793 | 0.5460 | FP32 ONNX | 13.5 MB |
| PDF | 0.9810 | 0.8597 | FP32 ONNX | 13.5 MB |

> The 140 MB ONNX size for the PE-family subsets is structural: the sparsemax attention loop is unfolded into the ONNX graph. If size matters, use `tabnet_PE.zip` (7.4 MB) directly.

### Hybrid (GBDT2NN)

| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9982 | 0.9736 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| Win32 | 0.9982 | 0.9734 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| Win64 | 0.9982 | 0.9811 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| .NET | 0.9961 | 0.9466 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| all | 0.9972 | 0.9513 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| APK | 0.9828 | 0.8003 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| ELF | 0.9899 | 0.8827 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| PDF | 0.9879 | 0.9283 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |

### LightGBM (Treelite-compiled)

| Subset | ROC-AUC | TPR@1%FPR | Size (.tl) | Size (original .model) |
|--------|---------|-----------|------------|------------------------|
| PE | 0.9983 | 0.9686 | 5.3 MB | 3.8 MB |
| Win32 | 0.9985 | 0.9722 | 5.3 MB | 3.7 MB |
| Win64 | 0.9988 | 0.9830 | 5.3 MB | 3.7 MB |
| .NET | 0.9980 | 0.9561 | 5.3 MB | 3.7 MB |
| all | 0.9970 | 0.9450 | 5.3 MB | 3.8 MB |
| APK | 0.9861 | 0.8157 | 5.3 MB | 3.7 MB |
| ELF | 0.9929 | 0.9140 | 5.3 MB | 3.8 MB |
| PDF | 0.9913 | 0.9275 | 5.3 MB | 3.7 MB |

> Original LightGBM models: [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models). The `.tl` files are serialized with Treelite 3.9.1 and are platform-independent, so they must be compiled into a native shared library on each target platform before inference.

### Challenge Set Detection Rate

> Challenge set: 6,315 evasive malware samples (all positive). The FPR=1% threshold from the test set is applied.

| Subset | DNN | TabNet | Hybrid | LightGBM |
|--------|-----|--------|--------|----------|
| `.NET` | 58.6% | 70.0% | 80.6% | 79.6% |
| `APK` | 27.3% | 29.3% | 34.4% | 33.6% |
| `ELF` | 11.7% | 4.4% | 23.8% | 30.3% |
| `PDF` | 41.5% | 40.1% | 56.9% | 57.1% |
| `PE` | 38.5% | 36.9% | 58.2% | 58.8% |
| `Win32` | 36.6% | 45.3% | 58.4% | 69.9% |
| `Win64` | 46.3% | 44.1% | 59.5% | 59.7% |
| `all` | 35.3% | 42.3% | 54.1% | 48.4% |

---

## Inference Performance (Apple M1, darwin-arm64)

> `warm_batch1` latency: batch size = 1, measured after cache warm-up. Results may differ in the deployment environment (x86_64 Linux).

### Latency (ms, warm batch=1)

| Subset | DNN | TabNet | Hybrid | LightGBM |
|--------|-----|--------|--------|----------|
| `.NET` | 0.248 | 5.465 | 0.151 | 0.050 |
| `APK` | 0.035 | 0.846 | 0.145 | 0.031 |
| `ELF` | 0.039 | 0.505 | 0.160 | 0.036 |
| `PDF` | 0.036 | 2.230 | 0.172 | 0.048 |
| `PE` | 0.290 | 4.402 | 0.138 | 0.028 |
| `Win32` | 0.288 | 4.693 | 0.141 | 0.044 |
| `Win64` | 0.220 | 5.621 | 0.422 | 0.039 |
| `all` | 0.254 | 4.788 | 0.147 | 0.068 |

> TabNet latency is high because the sparsemax attention is unfolded into the ONNX graph (structural).
>
> Hybrid = nn_part ONNX inference only (LightGBM leaf extraction excluded).
>
> LightGBM latency is for the compiled `.dylib`; the uploaded file is `.tl` and must be compiled on the target platform first.
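
A warm batch=1 measurement of the kind reported above could be reproduced roughly as follows. This is a minimal sketch, not the script used for the table: the helper name `warm_latency_ms`, the warm-up and repeat counts, and the use of the median are all assumptions, and the matrix product is only a stand-in for a real model call.

```python
import time

import numpy as np


def warm_latency_ms(predict, x, warmup=50, repeats=500):
    """Median per-call latency in milliseconds, measured after warm-up calls.

    `predict` is any callable taking one batch; warmup/repeats are assumed values.
    """
    for _ in range(warmup):  # warm caches and lazy initialization; not timed
        predict(x)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - start) * 1e3)  # seconds -> ms
    return float(np.median(samples))


# Stand-in predictor (hypothetical): one dense layer on a single 2,568-dim row.
rng = np.random.default_rng(42)
weights = rng.standard_normal((2568, 1)).astype(np.float32)
x = rng.standard_normal((1, 2568)).astype(np.float32)

latency = warm_latency_ms(lambda batch: batch @ weights, x)
print(f"warm batch=1 latency: {latency:.3f} ms")
```

For the ONNX models, `predict` would wrap the session call from the usage examples, for instance `lambda batch: sess.run(["logit"], {"features": batch})`.
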
### Model File Sizes (deployment format)

| Subset | DNN | TabNet `.onnx` | TabNet `.zip` | Hybrid (nn+lgbm) | LightGBM `.tl` |
|--------|-----|----------------|---------------|------------------|----------------|
| PE family | 13.3 MB (INT8) | 140.2 MB | 7.4 MB | 5.3 + 3.8 MB | 5.3 MB |
| non-PE | 3.9 MB (FP32) | 13.5 MB | 3.2 MB | 5.3 + 3.7 MB | 5.3 MB |

---

## Usage

### Install Dependencies

```bash
# Quote version specifiers so the shell does not treat ">" as a redirect
pip install "onnxruntime>=1.20" numpy huggingface_hub

# For LightGBM / Hybrid inference
pip install "treelite==3.9.1" "treelite_runtime==3.9.1" "lightgbm>=4.6"

# To use the TabNet checkpoint directly
pip install "pytorch-tabnet>=4.1"
```

### DNN Inference (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# PE subset: INT8 Static
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# X: np.ndarray shape (N, 2568), dtype float32
# (random placeholder here; pass real EMBER2024 feature vectors in practice)
X = np.random.randn(1, 2568).astype(np.float32)
logit = sess.run(["logit"], {"features": X})[0]   # shape (N, 1)
prob = 1 / (1 + np.exp(-logit.ravel()))           # sigmoid → [0, 1]
print(f"malware probability: {prob[0]:.4f}")
```

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# APK subset: FP32
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_APK.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 696).astype(np.float32)    # non-PE: dim=696
prob = 1 / (1 + np.exp(-sess.run(["logit"], {"features": X})[0].ravel()))
```

### TabNet Inference (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="tabnet/tabnet_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 2568).astype(np.float32)

# output: logit (pre-sigmoid)
logit = sess.run(["logit"], {"features": X})[0]
prob = 1 / (1 + np.exp(-logit.ravel()))
```

### Hybrid Inference (ONNX + LightGBM)

```python
import numpy as np
import lightgbm as lgb
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# 1. Extract leaf indices with the LightGBM booster
booster = lgb.Booster(model_file=hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_lgbm.model",
))
X_raw = np.random.randn(1, 2568).astype(np.float64)
leaf_indices = booster.predict(X_raw, pred_leaf=True).astype(np.int64)  # (N, n_trees)

# 2. Final classification with the GBDT2NN ONNX model
nn_sess = ort.InferenceSession(hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_nnpart.onnx",
), providers=["CPUExecutionProvider"])
logit = nn_sess.run(["logit"], {"leaf_indices": leaf_indices})[0]
prob = 1 / (1 + np.exp(-logit.ravel()))
print(f"malware probability: {prob[0]:.4f}")
```

### LightGBM Inference (Treelite-compiled, fast inference)

```python
import os
import sys

import numpy as np
import treelite
import treelite_runtime
from huggingface_hub import hf_hub_download

# 1. Compile Treelite .tl → platform-specific shared library (one-time)
tl_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="lightgbm/lightgbm_PE.tl",
)
tl_model = treelite.Model.deserialize(tl_path)
lib_ext = ".dylib" if sys.platform == "darwin" else ".so"
lib_path = os.path.splitext(tl_path)[0] + lib_ext  # swap only the file extension
tl_model.export_lib(
    toolchain="clang" if sys.platform == "darwin" else "gcc",
    libpath=lib_path,
    verbose=False,
)

# 2. Inference
predictor = treelite_runtime.Predictor(lib_path, verbose=False)
X = np.random.randn(1, 2568).astype(np.float32)
prob = predictor.predict(treelite_runtime.DMatrix(X))
print(f"malware probability: {prob[0]:.4f}")
```

> **Note**: Requires `treelite==3.9.1` + `treelite_runtime==3.9.1`. Version 4.x does not support `export_lib()`.
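
### Applying a 1% FPR Threshold

The usage examples return probabilities, while the reported TPR @ 1% FPR and challenge-set rates depend on a score threshold fixed on the test set. The sketch below shows one way such a threshold could be derived and applied. The helper names `threshold_at_fpr` and `detection_rate` are hypothetical, the scores are synthetic placeholders rather than EMBER2024 outputs, and the evaluation code behind the tables may differ in details such as tie handling.

```python
import numpy as np


def threshold_at_fpr(benign_scores, target_fpr=0.01):
    """Return a threshold that at most `target_fpr` of benign scores strictly exceed."""
    scores = np.sort(np.asarray(benign_scores, dtype=np.float64))[::-1]  # descending
    allowed = int(np.floor(target_fpr * scores.size))  # false positives we may accept
    # The (allowed + 1)-th largest benign score: at most `allowed` scores lie above it.
    return float(scores[min(allowed, scores.size - 1)])


def detection_rate(malware_scores, threshold):
    """Fraction of malware samples scored strictly above the threshold."""
    return float(np.mean(np.asarray(malware_scores) > threshold))


# Synthetic placeholder scores (assumption): benign low, malware high, challenge mixed.
rng = np.random.default_rng(42)
benign = rng.beta(1, 8, size=10_000)
malware = rng.beta(8, 1, size=10_000)
challenge = rng.beta(3, 3, size=1_000)

thr = threshold_at_fpr(benign, target_fpr=0.01)
fpr = float(np.mean(benign > thr))
tpr = detection_rate(malware, thr)
challenge_rate = detection_rate(challenge, thr)
print(f"threshold={thr:.4f}  FPR={fpr:.4f}  TPR={tpr:.4f}  challenge={challenge_rate:.4f}")
```

In practice, `benign` and `malware` would be the model probabilities for the labeled test samples of one subset, and `challenge` the probabilities for that subset's challenge samples, scored against the same threshold.
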

---

## Training & Evaluation Environment

| Item | Details |
|------|---------|
| Dataset | [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024): train 52 weeks (2.6 M), test 12 weeks (606 K), challenge 6,315 |
| Feature dim | PE 2,568 (v3) / non-PE 696 (valid prefix) |
| Split policy | Fixed temporal order (temporal split), no random shuffling |
| Training environment | GPU server (CUDA 13) |
| Frameworks | PyTorch 2.11.0, pytorch-tabnet 4.1, LightGBM 4.6 |
| Random seed | 42 |
| DNN architecture | 2 × [Linear(d→d) + BatchNorm + PReLU(α=0.25) + Dropout(0.5)] → Linear(d→1), where d = 2,568 (PE) / 696 (non-PE) |
| Hybrid | LightGBM leaf extraction → shared leaf Embedding (dim 8) → concat → MLP[256, 128] (BatchNorm + PReLU) → Linear(→1) |
| Evaluation metrics | ROC-AUC, PR-AUC, **TPR @ 1% FPR** (paper §4.1) |

---

## Known Limitations

- **TabNet ONNX size**: unfolding the sparsemax attention loop inflates the PE-family ONNX to 140 MB. The original `tabnet_PE.zip` (7.4 MB) is lighter.
- **Treelite `.tl`**: the uploaded LightGBM artifact is a platform-independent serialization. You must compile it into a shared library (`.dylib`/`.so`) on each target platform before inference; see the LightGBM usage example. (The reported LightGBM latency is for a `.dylib` compiled on Mac ARM64.)
- **DNN non-PE INT8**: the 696-dim models suffer large AUC loss from quantization, so they are kept in FP32.
- **Hybrid inference**: not a single ONNX file. Inference has two stages: LightGBM leaf extraction, then the nn_part ONNX model.
- **Challenge detection rate**: measured using the FPR=1% threshold from the test set. Values may vary across subsets due to distribution differences.

---

## Citation

```bibtex
@inproceedings{joyce2025ember2024,
  title     = {EMBER2024 -- A Benchmark Dataset for Holistic Evaluation of Malware Classifiers},
  author    = {Joyce, Robert J. and Miller, Gideon and Roth, Phil and Zak, Richard and Zaresky-Williams, Elliott and Anderson, Hyrum and Raff, Edward and Holt, James},
  booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '25)},
  year      = {2025},
  doi       = {10.1145/3711896.3737431},
  url       = {https://arxiv.org/abs/2506.05074}
}
```

---

## License

Code and model weights: Apache 2.0

Original LightGBM models (`hybrid/hybrid_*_lgbm.model`): subject to the [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models) license.