---
license: apache-2.0
datasets:
- joyce8/EMBER2024
language:
- en
tags:
- malware-detection
- cybersecurity
- onnxruntime
- lightgbm
- pytorch
- tabnet
- binary-classification
pipeline_tag: tabular-classification
library_name: onnxruntime
---

# EMBER2024 Malware Detection Models

A collection of four model architectures (DNN, TabNet, Hybrid GBDT2NN, LightGBM) trained and evaluated on all eight subsets of the [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024) dataset: six file formats (Win32, Win64, .NET, APK, ELF, PDF) plus a combined `PE` group and an `all`-types set. All models are converted into deployment-ready formats.

> **Training environment**: GPU server (CUDA 13)
> **Dataset paper**: [Joyce et al., KDD 2025 (arXiv:2506.05074)](https://arxiv.org/abs/2506.05074)

---

## Models

| Directory | Architecture | Deployment Format | Parameters |
|-----------|--------------|-------------------|------------|
| `dnn/` | Feed-Forward DNN (PReLU + Dropout) | ONNX (INT8 Static / FP32) | 13.2 M (PE) / 0.98 M (non-PE) |
| `tabnet/` | TabNet ([Arik & Pfister, 2021](https://arxiv.org/abs/1908.07442)) | ONNX FP32 | ~3 M |
| `hybrid/` | GBDT2NN ([DeepGBM, KDD 2019](https://www.microsoft.com/en-us/research/publication/deepgbm-a-deep-learning-framework-distilled-by-gbdt-for-online-prediction-tasks/)) | ONNX (nn_part) + LightGBM booster | ~1 M NN |
| `lightgbm/` | LightGBM (pretrained, [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models)) | Treelite `.tl` | n/a |

### Subset List

| Subset | Target File Type | Input Dim |
|--------|------------------|-----------|
| `PE` | All PE binaries (Win32 + Win64 + .NET) | 2,568 |
| `Win32` | Windows 32-bit PE | 2,568 |
| `Win64` | Windows 64-bit PE | 2,568 |
| `.NET` | .NET assemblies | 2,568 |
| `APK` | Android APK | 696 |
| `ELF` | Linux ELF | 696 |
| `PDF` | PDF documents | 696 |
| `all` | All file types combined | 2,568 |

---

## Directory Structure

Filename convention: `{model}_{subset}[_suffix].{ext}`
The `.NET` subset is rendered as `dotnet` in filenames.

```
dnn/
├── dnn_PE.onnx              # INT8 Static (deployment; PE/Win32/Win64/dotnet/all)
├── dnn_PE_fp32.onnx         # FP32 ONNX (reference; bundled only for INT8 subsets)
├── dnn_PE.pt                # PyTorch checkpoint
├── dnn_PE_metrics.json      # Evaluation results (AUC, TPR@1%FPR)
├── dnn_PE_benchmark.json    # Size & latency
├── dnn_APK.onnx             # FP32 (non-PE: INT8 AUC loss too large)
├── dnn_APK.pt
└── ...

tabnet/
├── tabnet_PE.onnx           # FP32 ONNX (140 MB, due to sparsemax unfolding)
├── tabnet_PE.zip            # pytorch-tabnet native (7.4 MB, lightweight)
└── ...

hybrid/
├── hybrid_PE_nnpart.onnx    # GBDT2NN nn_part ONNX (5.1 MB)
├── hybrid_PE_lgbm.model     # LightGBM booster (3.6 MB)
├── hybrid_PE.pt             # PyTorch checkpoint
└── ...

lightgbm/
├── lightgbm_PE.tl           # Treelite serialization (platform-independent; must be compiled per platform)
└── ...
```

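
The naming convention above can be captured in a small helper. This is only a sketch: `SUBSETS`, `DEPLOY_FILES`, and `artifact_paths` are hypothetical names introduced here, and the suffix table is read off the directory listing rather than taken from any published tooling.

```python
SUBSETS = ("PE", "Win32", "Win64", ".NET", "APK", "ELF", "PDF", "all")

# (suffix, extension) of the deployment files in each model directory,
# as shown in the directory listing. Hypothetical table for illustration.
DEPLOY_FILES = {
    "dnn": [("", "onnx")],
    "tabnet": [("", "onnx")],
    "hybrid": [("_nnpart", "onnx"), ("_lgbm", "model")],
    "lightgbm": [("", "tl")],
}


def artifact_paths(model: str, subset: str) -> list[str]:
    """Return repository-relative paths following {model}_{subset}[_suffix].{ext}."""
    if model not in DEPLOY_FILES:
        raise ValueError(f"unknown model directory: {model!r}")
    if subset not in SUBSETS:
        raise ValueError(f"unknown subset: {subset!r}")
    # The .NET subset is spelled "dotnet" in filenames.
    name = "dotnet" if subset == ".NET" else subset
    return [
        f"{model}/{model}_{name}{suffix}.{ext}"
        for suffix, ext in DEPLOY_FILES[model]
    ]


print(artifact_paths("dnn", ".NET"))
print(artifact_paths("hybrid", "PE"))
```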
---

## Performance Results (EMBER2024 test set)

> Metrics: ROC-AUC, TPR @ 1% FPR (paper §4.1), and challenge-set detection rate at the FPR=1% threshold.
> Challenge set: 6,315 evasive malware samples (positives only; Win32 3,225 / .NET 829 / Win64 814 / PDF 805 / ELF 386 / APK 256).

### DNN

| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9969 | 0.9472 | INT8 Static ONNX | 13.3 MB |
| Win32 | 0.9965 | 0.9479 | INT8 Static ONNX | 13.3 MB |
| Win64 | 0.9969 | 0.9617 | INT8 Static ONNX | 13.3 MB |
| .NET | 0.9920 | 0.8444 | INT8 Static ONNX | 13.3 MB |
| all | 0.9938 | 0.8870 | INT8 Static ONNX | 13.3 MB |
| APK | 0.9761 | 0.7682 | FP32 ONNX | 3.9 MB |
| ELF | 0.9840 | 0.8103 | FP32 ONNX | 3.9 MB |
| PDF | 0.9795 | 0.8902 | FP32 ONNX | 3.9 MB |

> The non-PE subsets (APK/ELF/PDF) use 696-dim inputs and have too few parameters, so INT8 quantization causes a large AUC drop; they are kept in FP32.
> Rows marked INT8 report the quantized models (fixed 100K-sample set). ΔAUC vs FP32 stays within 0.19 pp.
> For the .NET and all subsets, INT8 quantization causes a relatively larger drop in TPR@1%FPR (still passes the AUC gate: |ΔAUC| < 0.5 pp).

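
The AUC gate mentioned in the notes above can be sketched as follows. This is an illustrative reconstruction, not the pipeline used for this card: `roc_auc` and `passes_auc_gate` are hypothetical helpers, the scores are synthetic, and the pairwise AUC is only practical for small samples.

```python
import numpy as np


def roc_auc(y_true, scores):
    """Pairwise (Mann-Whitney) ROC-AUC; O(n_pos * n_neg), so small samples only."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def passes_auc_gate(y_true, fp32_scores, int8_scores, max_delta_pp=0.5):
    """Check |AUC(INT8) - AUC(FP32)| < max_delta_pp, in percentage points."""
    delta_pp = 100.0 * (roc_auc(y_true, int8_scores) - roc_auc(y_true, fp32_scores))
    return abs(delta_pp) < max_delta_pp, delta_pp


# Synthetic toy scores standing in for FP32 and INT8 model outputs.
y = np.array([0, 0, 0, 1, 1, 1])
fp32 = np.array([-2.0, -1.0, 0.5, 0.0, 1.5, 3.0])
int8 = np.array([-2.1, -0.9, 0.4, 0.1, 1.4, 2.9])

ok, delta = passes_auc_gate(y, fp32, int8)
print(f"gate passed: {ok}, delta AUC: {delta:+.3f} pp")
```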
### TabNet

| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9948 | 0.9195 | FP32 ONNX | 140 MB |
| Win32 | 0.9949 | 0.9317 | FP32 ONNX | 140 MB |
| Win64 | 0.9944 | 0.9318 | FP32 ONNX | 140 MB |
| .NET | 0.9923 | 0.8700 | FP32 ONNX | 140 MB |
| all | 0.9922 | 0.8912 | FP32 ONNX | 140 MB |
| APK | 0.9741 | 0.7028 | FP32 ONNX | 13.5 MB |
| ELF | 0.9793 | 0.5460 | FP32 ONNX | 13.5 MB |
| PDF | 0.9810 | 0.8597 | FP32 ONNX | 13.5 MB |

> The 140 MB ONNX size for the PE-family subsets is structural: the sparsemax attention loop is unfolded into the ONNX graph. If size matters, use `tabnet_PE.zip` (7.4 MB) directly.

### Hybrid (GBDT2NN)

| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9982 | 0.9736 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| Win32 | 0.9982 | 0.9734 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| Win64 | 0.9982 | 0.9811 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| .NET | 0.9961 | 0.9466 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| all | 0.9972 | 0.9513 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| APK | 0.9828 | 0.8003 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| ELF | 0.9899 | 0.8827 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| PDF | 0.9879 | 0.9283 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |

### LightGBM (Treelite-compiled)

| Subset | ROC-AUC | TPR@1%FPR | Size (.tl) | Size (original .model) |
|--------|---------|-----------|------------|------------------------|
| PE | 0.9983 | 0.9686 | 5.3 MB | 3.8 MB |
| Win32 | 0.9985 | 0.9722 | 5.3 MB | 3.7 MB |
| Win64 | 0.9988 | 0.9830 | 5.3 MB | 3.7 MB |
| .NET | 0.9980 | 0.9561 | 5.3 MB | 3.7 MB |
| all | 0.9970 | 0.9450 | 5.3 MB | 3.8 MB |
| APK | 0.9861 | 0.8157 | 5.3 MB | 3.7 MB |
| ELF | 0.9929 | 0.9140 | 5.3 MB | 3.8 MB |
| PDF | 0.9913 | 0.9275 | 5.3 MB | 3.7 MB |

> Original LightGBM models: [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models). The `.tl` files are serialized with Treelite 3.9.1 and are platform-independent, so they must be compiled into a shared library on each target platform before inference.

### Challenge Set Detection Rate

> Challenge set: 6,315 evasive malware (all positive). The FPR=1% threshold from the test set is applied.

| Subset | DNN | TabNet | Hybrid | LightGBM |
|--------|-----|--------|--------|----------|
| `.NET` | 58.6% | 70.0% | 80.6% | 79.6% |
| `APK` | 27.3% | 29.3% | 34.4% | 33.6% |
| `ELF` | 11.7% | 4.4% | 23.8% | 30.3% |
| `PDF` | 41.5% | 40.1% | 56.9% | 57.1% |
| `PE` | 38.5% | 36.9% | 58.2% | 58.8% |
| `Win32`| 36.6% | 45.3% | 58.4% | 69.9% |
| `Win64`| 46.3% | 44.1% | 59.5% | 59.7% |
| `all` | 35.3% | 42.3% | 54.1% | 48.4% |

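
The procedure described above (fix a threshold at 1% FPR on the test set, then count how many challenge samples score above it) can be sketched in NumPy. This is a minimal illustration on synthetic scores; `threshold_at_fpr` and `detection_rate` are hypothetical helpers, and the evaluation code behind the table may handle ties or interpolation differently.

```python
import numpy as np


def threshold_at_fpr(benign_scores, target_fpr=0.01):
    """Threshold t such that the fraction of benign scores strictly above t is <= target_fpr."""
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    s = np.sort(np.asarray(benign_scores, dtype=np.float64))
    k = int(np.floor(target_fpr * s.size))  # benign samples allowed above the threshold
    return s[s.size - k - 1]


def detection_rate(malicious_scores, threshold):
    """Fraction of malicious samples flagged, i.e. scoring strictly above the threshold."""
    return float(np.mean(np.asarray(malicious_scores) > threshold))


# Synthetic stand-ins for model scores (not EMBER2024 data).
rng = np.random.default_rng(42)
benign = rng.normal(-2.0, 1.0, 10_000)
malware = rng.normal(2.0, 1.0, 10_000)
challenge = rng.normal(0.0, 1.0, 1_000)  # evasive samples sit closer to benign

thr = threshold_at_fpr(benign, 0.01)
print(f"threshold:          {thr:.3f}")
print(f"test FPR:           {np.mean(benign > thr):.4f}")
print(f"TPR @ 1% FPR:       {detection_rate(malware, thr):.4f}")
print(f"challenge detected: {detection_rate(challenge, thr):.4f}")
```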
---

## Inference Performance (Apple M1, darwin-arm64)

> `warm_batch1` latency: batch size = 1, measured after cache warm-up. May differ from the deployment environment (x86_64 Linux).

### Latency (ms, warm batch=1)

| Subset | DNN | TabNet | Hybrid | LightGBM |
|--------|-----|--------|--------|----------|
| `.NET` | 0.248 | 5.465 | 0.151 | 0.050 |
| `APK` | 0.035 | 0.846 | 0.145 | 0.031 |
| `ELF` | 0.039 | 0.505 | 0.160 | 0.036 |
| `PDF` | 0.036 | 2.230 | 0.172 | 0.048 |
| `PE` | 0.290 | 4.402 | 0.138 | 0.028 |
| `Win32`| 0.288 | 4.693 | 0.141 | 0.044 |
| `Win64`| 0.220 | 5.621 | 0.422 | 0.039 |
| `all` | 0.254 | 4.788 | 0.147 | 0.068 |

> TabNet latency is high because the sparsemax attention is unfolded into the ONNX graph (structural).
> Hybrid = nn_part ONNX inference only (LightGBM leaf extraction excluded).
> LightGBM latency is for the compiled `.dylib`; the uploaded file is `.tl` and must be compiled before use.

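
A warm batch=1 latency measurement of the kind reported above could look roughly like the sketch below. The card does not state the number of warm-up iterations, the number of timed runs, or the statistic used, so `warmup=20`, `runs=200`, and the median are assumptions, and `warm_latency_ms` is a hypothetical helper. A NumPy matrix product stands in for the model; with ONNX Runtime, `predict` could be `lambda a: sess.run(["logit"], {"features": a})`.

```python
import statistics
import time

import numpy as np


def warm_latency_ms(predict, x, warmup=20, runs=200):
    """Median wall-clock latency in ms of predict(x), measured after warm-up calls."""
    for _ in range(warmup):  # untimed calls to warm caches and lazy initialisation
        predict(x)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        predict(x)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


# Stand-in "model": a single dense layer over a 2,568-dim input, batch size 1.
rng = np.random.default_rng(0)
W = rng.standard_normal((2568, 1)).astype(np.float32)
x = rng.standard_normal((1, 2568)).astype(np.float32)

latency = warm_latency_ms(lambda a: a @ W, x)
print(f"median warm batch=1 latency: {latency:.4f} ms")
```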
### Model File Sizes (deployment format)

| Subset | DNN | TabNet `.onnx` | TabNet `.zip` | Hybrid (nn+lgbm) | LightGBM `.tl` |
|--------|-----|----------------|---------------|------------------|----------------|
| PE family | 13.3 MB (INT8) | 140.2 MB | 7.4 MB | 5.3 + 3.8 MB | 5.3 MB |
| non-PE | 3.9 MB (FP32) | 13.5 MB | 3.2 MB | 5.3 + 3.7 MB | 5.3 MB |

---

## Usage

### Install Dependencies

```bash
# Quote version specifiers so the shell does not treat ">" as a redirect
pip install "onnxruntime>=1.20" numpy huggingface_hub
# For LightGBM / Hybrid inference
pip install "treelite==3.9.1" "treelite_runtime==3.9.1" "lightgbm>=4.6"
# To use the TabNet checkpoint directly
pip install "pytorch-tabnet>=4.1"
```

### DNN Inference (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# PE subset: INT8 Static
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# X: np.ndarray shape (N, 2568), dtype float32
X = np.random.randn(1, 2568).astype(np.float32)
logit = sess.run(["logit"], {"features": X})[0]  # shape (N, 1)
prob = 1 / (1 + np.exp(-logit.ravel()))          # sigmoid -> [0, 1]
print(f"malware probability: {prob[0]:.4f}")
```

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# APK subset: FP32
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_APK.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 696).astype(np.float32)  # non-PE: dim=696
prob = 1 / (1 + np.exp(-sess.run(["logit"], {"features": X})[0].ravel()))
```

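
The examples above convert the logit with `1 / (1 + np.exp(-logit))`, which can emit overflow warnings for very negative logits. A numerically stable variant is sketched below; `stable_sigmoid` is a hypothetical helper, not part of this repository. Note that a probability cut-off of 0.5 is not necessarily the operating point behind the TPR@1%FPR figures, which use a threshold derived from the test set.

```python
import numpy as np


def stable_sigmoid(logit):
    """Sigmoid that avoids overflow by never exponentiating a large positive number."""
    z = np.atleast_1d(np.asarray(logit, dtype=np.float64))
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))  # exp of a non-positive value
    ez = np.exp(z[~pos])                      # exp of a negative value
    out[~pos] = ez / (1.0 + ez)
    return out


# Shape (N, 1) logits as returned by the ONNX models, flattened to (N,)
logits = np.array([[-1000.0], [0.0], [2.0], [1000.0]])
print(stable_sigmoid(logits.ravel()))
```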
### TabNet Inference (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="tabnet/tabnet_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 2568).astype(np.float32)
# output: logit (pre-sigmoid)
logit = sess.run(["logit"], {"features": X})[0]
prob = 1 / (1 + np.exp(-logit.ravel()))
```

### Hybrid Inference (ONNX + LightGBM)

```python
import numpy as np
import lightgbm as lgb
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# 1. Extract leaf indices with the LightGBM booster
booster = lgb.Booster(model_file=hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_lgbm.model",
))
X_raw = np.random.randn(1, 2568).astype(np.float64)
leaf_indices = booster.predict(X_raw, pred_leaf=True).astype(np.int64)  # (N, n_trees)

# 2. Final classification with the GBDT2NN ONNX model
nn_sess = ort.InferenceSession(hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_nnpart.onnx",
), providers=["CPUExecutionProvider"])
logit = nn_sess.run(["logit"], {"leaf_indices": leaf_indices})[0]
prob = 1 / (1 + np.exp(-logit.ravel()))
print(f"malware probability: {prob[0]:.4f}")
```

### LightGBM Inference (Treelite-compiled, fast inference)

```python
# 1. Compile the Treelite .tl file into a platform-specific shared library (one-time)
import os
import sys

import numpy as np
import treelite
import treelite_runtime
from huggingface_hub import hf_hub_download

tl_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="lightgbm/lightgbm_PE.tl",
)
tl_model = treelite.Model.deserialize(tl_path)
lib_ext = ".dylib" if sys.platform == "darwin" else ".so"
lib_path = os.path.splitext(tl_path)[0] + lib_ext  # swap only the file extension
tl_model.export_lib(
    toolchain="clang" if sys.platform == "darwin" else "gcc",
    libpath=lib_path,
    verbose=False,
)

# 2. Inference
predictor = treelite_runtime.Predictor(lib_path, verbose=False)
X = np.random.randn(1, 2568).astype(np.float32)
prob = predictor.predict(treelite_runtime.DMatrix(X))
print(f"malware probability: {prob[0]:.4f}")
```

> **Note**: Requires `treelite==3.9.1` + `treelite_runtime==3.9.1`. Version 4.x does not support `export_lib()`.

---

## Training & Evaluation Environment

| Item | Details |
|------|---------|
| Dataset | [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024): train 52 weeks (2.6 M), test 12 weeks (606 K), challenge 6,315 |
| Feature dim | PE 2,568 (v3) / non-PE 696 (valid prefix) |
| Split policy | Fixed temporal order (temporal split), no random shuffling |
| Training environment | GPU server (CUDA 13) |
| Frameworks | PyTorch 2.11.0, pytorch-tabnet 4.1, LightGBM 4.6 |
| Random seed | 42 |
| DNN architecture | 2 × [Linear(d→d) + BatchNorm + PReLU(α=0.25) + Dropout(0.5)] → Linear(d→1), where d = 2,568 (PE) / 696 (non-PE) |
| Hybrid | LightGBM leaf extraction → shared leaf Embedding (dim 8) → concat → MLP[256, 128] (BatchNorm + PReLU) → Linear(→1) |
| Evaluation metrics | ROC-AUC, PR-AUC, **TPR @ 1% FPR** (paper §4.1) |

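
As a sanity check on the DNN row above, the parameter count implied by that architecture can be worked out by hand. The sketch below assumes biases on every `Linear` layer, affine BatchNorm (running statistics excluded), and a single learnable slope per PReLU unless `prelu_per_channel=True`; none of these details are stated in the card, and `dnn_param_count` is a hypothetical helper. Under these assumptions the totals land within about 1% of the 13.2 M (PE) and 0.98 M (non-PE) figures in the Models table.

```python
def dnn_param_count(d: int, hidden_blocks: int = 2, prelu_per_channel: bool = False) -> int:
    """Trainable parameters of: hidden_blocks x [Linear(d->d) + BatchNorm + PReLU + Dropout] -> Linear(d->1)."""
    linear = d * d + d                     # weight matrix + bias
    batchnorm = 2 * d                      # affine scale and shift
    prelu = d if prelu_per_channel else 1  # learnable negative slope(s)
    head = d + 1                           # Linear(d -> 1): weights + bias
    return hidden_blocks * (linear + batchnorm + prelu) + head


for name, d in (("PE", 2568), ("non-PE", 696)):
    n = dnn_param_count(d)
    print(f"{name:>6}: d={d:>5}  {n:>12,} parameters  ({n / 1e6:.2f} M)")
```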
---

## Known Limitations

- **TabNet ONNX size**: unfolding the sparsemax attention loop inflates the PE-family ONNX to 140 MB. The original `tabnet_PE.zip` (7.4 MB) is lighter.
- **Treelite `.tl`**: the uploaded LightGBM artifact is a platform-independent serialization. You must compile it into a shared library (`.dylib`/`.so`) on each target platform before inference; see the LightGBM usage example. (The reported LightGBM latency is for a `.dylib` compiled on Mac ARM64.)
- **DNN non-PE INT8**: the 696-dim models suffer large AUC loss from quantization, so they are kept in FP32.
- **Hybrid inference**: not a single ONNX file. Inference takes two stages: LightGBM leaf extraction, then the nn_part ONNX model.
- **Challenge detection rate**: measured using the FPR=1% threshold from the test set. Values may vary across subsets due to distribution differences.

---

## Citation

```bibtex
@inproceedings{joyce2025ember2024,
  title = {EMBER2024 -- A Benchmark Dataset for Holistic Evaluation of Malware Classifiers},
  author = {Joyce, Robert J. and Miller, Gideon and Roth, Phil and Zak, Richard and Zaresky-Williams, Elliott and Anderson, Hyrum and Raff, Edward and Holt, James},
  booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '25)},
  year = {2025},
  doi = {10.1145/3711896.3737431},
  url = {https://arxiv.org/abs/2506.05074}
}
```

---

## License

Code and model weights: Apache 2.0
Original LightGBM models (`hybrid/hybrid_*_lgbm.model`): subject to the [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models) license.
