---
license: apache-2.0
datasets:
- joyce8/EMBER2024
language:
- en
tags:
- malware-detection
- cybersecurity
- onnxruntime
- lightgbm
- pytorch
- tabnet
- binary-classification
pipeline_tag: tabular-classification
library_name: onnxruntime
---
# EMBER2024 Malware Detection Models
A collection of four model architectures (DNN, TabNet, Hybrid GBDT2NN, LightGBM) trained and evaluated on all eight subsets of the [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024) dataset and converted into deployment-ready formats. The subsets are six file formats (Win32, Win64, .NET, APK, ELF, PDF) plus a combined `PE` group and an `all`-types set.
> **Training environment**: GPU server (CUDA 13)
> **Dataset paper**: [Joyce et al., KDD 2025 (arXiv:2506.05074)](https://arxiv.org/abs/2506.05074)
---
## Models
| Directory | Architecture | Deployment Format | Parameters |
|-----------|--------------|-------------------|------------|
| `dnn/` | Feed-Forward DNN (PReLU + Dropout) | ONNX (INT8 Static / FP32) | 13.2 M (PE) / 0.98 M (non-PE) |
| `tabnet/` | TabNet ([Arik & Pfister, 2021](https://arxiv.org/abs/1908.07442)) | ONNX FP32 | ~3 M |
| `hybrid/` | GBDT2NN ([DeepGBM, KDD 2019](https://www.microsoft.com/en-us/research/publication/deepgbm-a-deep-learning-framework-distilled-by-gbdt-for-online-prediction-tasks/)) | ONNX (nn_part) + LightGBM booster | ~1 M NN |
| `lightgbm/` | LightGBM (pretrained, [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models)) | Treelite `.tl` | n/a |
### Subset List
| Subset | Target File Type | Input Dim |
|--------|------------------|-----------|
| `PE` | All PE binaries (Win32 + Win64 + .NET) | 2,568 |
| `Win32` | Windows 32-bit PE | 2,568 |
| `Win64` | Windows 64-bit PE | 2,568 |
| `.NET` | .NET assemblies | 2,568 |
| `APK` | Android APK | 696 |
| `ELF` | Linux ELF | 696 |
| `PDF` | PDF documents | 696 |
| `all` | All file types combined | 2,568 |
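
Because the input width depends on the subset, it can help to check a feature matrix before passing it to a model. The following is a minimal sketch using the dimensions from the table above; the names `INPUT_DIM` and `check_features` are illustrative and not part of this repository.

```python
# Input dimensions copied from the subset table; names here are illustrative only.
INPUT_DIM = {
    "PE": 2568, "Win32": 2568, "Win64": 2568, ".NET": 2568, "all": 2568,
    "APK": 696, "ELF": 696, "PDF": 696,
}


def check_features(subset, rows):
    """Raise ValueError if any row does not match the subset's input width."""
    expected = INPUT_DIM[subset]
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"row {i}: expected {expected} features for {subset}, got {len(row)}"
            )
    return expected


# One all-zero APK row passes the check and returns the expected width.
print(check_features("APK", [[0.0] * 696]))
```
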
---
## Directory Structure
Filename convention: `{model}_{subset}[_suffix].{ext}`
The `.NET` subset is rendered as `dotnet` in filenames.
```
dnn/
├── dnn_PE.onnx            # INT8 Static (deployment; PE/Win32/Win64/dotnet/all)
├── dnn_PE_fp32.onnx       # FP32 ONNX (reference; bundled only for INT8 subsets)
├── dnn_PE.pt              # PyTorch checkpoint
├── dnn_PE_metrics.json    # Evaluation results (AUC, TPR@1%FPR)
├── dnn_PE_benchmark.json  # Size & latency
├── dnn_APK.onnx           # FP32 (non-PE: INT8 AUC loss too large)
├── dnn_APK.pt
└── ...
tabnet/
├── tabnet_PE.onnx         # FP32 ONNX (140 MB, due to sparsemax unfolding)
├── tabnet_PE.zip          # pytorch-tabnet native (7.4 MB, lightweight)
└── ...
hybrid/
├── hybrid_PE_nnpart.onnx  # GBDT2NN nn_part ONNX (5.1 MB)
├── hybrid_PE_lgbm.model   # LightGBM booster (3.6 MB)
├── hybrid_PE.pt           # PyTorch checkpoint
└── ...
lightgbm/
├── lightgbm_PE.tl         # Treelite serialization (platform-independent; recompilation required)
└── ...
```
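
The naming convention can be expressed as a small helper that builds a repository path for `hf_hub_download`. This is a sketch only: `artifact_filename` is a hypothetical name, and whether a given combination of model, subset, and suffix actually exists should be checked against the repository file list.

```python
def artifact_filename(model, subset, ext, suffix=""):
    """Build '{model}/{model}_{subset}[_suffix].{ext}' following the convention above."""
    # The .NET subset is written as "dotnet" in filenames.
    subset_token = "dotnet" if subset == ".NET" else subset
    stem = f"{model}_{subset_token}"
    if suffix:
        stem += f"_{suffix}"
    return f"{model}/{stem}.{ext}"


print(artifact_filename("dnn", "PE", "onnx"))
print(artifact_filename("dnn", "PE", "onnx", suffix="fp32"))
print(artifact_filename("hybrid", ".NET", "onnx", suffix="nnpart"))
```
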
---
## Performance Results (EMBER2024 test set)
> Metrics: ROC-AUC, TPR @ 1% FPR (paper §4.1), and challenge-set detection rate at the FPR=1% threshold.
> Challenge set: 6,315 evasive malware samples (positives only; Win32 3,225 / .NET 829 / Win64 814 / PDF 805 / ELF 386 / APK 256).
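
To make the two threshold-based metrics concrete, the sketch below picks a score threshold from benign test-set scores so that at most 1% of them are flagged, then reports the fraction of malicious scores above it. The same threshold can be applied to the challenge set, which contains positives only. The function names, the strict `score > threshold` rule, and the tie handling are assumptions for illustration and may differ from the evaluation code behind the tables.

```python
def threshold_at_fpr(benign_scores, target_fpr=0.01):
    """Return a threshold whose false-positive rate on benign scores is at most target_fpr.

    A sample counts as flagged when its score is strictly above the threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # benign samples we may misflag
    # Only scores above the (allowed + 1)-th highest benign score are flagged,
    # so at most `allowed` benign samples end up as false positives.
    return ranked[allowed]


def detection_rate(malicious_scores, threshold):
    """Fraction of malicious samples scoring strictly above the threshold."""
    flagged = sum(1 for s in malicious_scores if s > threshold)
    return flagged / len(malicious_scores)


# Synthetic scores for illustration only.
benign = [i / 1000 for i in range(1000)]
malicious = [0.2, 0.5, 0.99, 0.995]
thr = threshold_at_fpr(benign)
print(f"threshold={thr}, TPR={detection_rate(malicious, thr):.2f}")
```
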
### DNN
| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9969 | 0.9472 | INT8 Static ONNX | 13.3 MB |
| Win32 | 0.9965 | 0.9479 | INT8 Static ONNX | 13.3 MB |
| Win64 | 0.9969 | 0.9617 | INT8 Static ONNX | 13.3 MB |
| .NET | 0.9920 | 0.8444 | INT8 Static ONNX | 13.3 MB |
| all | 0.9938 | 0.8870 | INT8 Static ONNX | 13.3 MB |
| APK | 0.9761 | 0.7682 | FP32 ONNX | 3.9 MB |
| ELF | 0.9840 | 0.8103 | FP32 ONNX | 3.9 MB |
| PDF | 0.9795 | 0.8902 | FP32 ONNX | 3.9 MB |
> The non-PE subsets (APK/ELF/PDF) use 696-dim inputs, so their models have far fewer parameters and INT8 quantization causes a large AUC drop. They are kept in FP32.
> Figures for the INT8 subsets are for the INT8 models (fixed 100K-sample set). ΔAUC vs FP32 stays within 0.19 pp.
> For the .NET and all subsets, INT8 quantization causes a relatively larger drop in TPR@1%FPR, but both still pass the AUC gate (|ΔAUC| < 0.5 pp).
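
The AUC gate mentioned above can be written as a one-line check. This sketch assumes "pp" means percentage points of AUC (AUC × 100), and the function names are hypothetical. The example values are made up and are not measurements from this repository.

```python
def auc_delta_pp(auc_fp32, auc_int8):
    """AUC change from FP32 to INT8 in percentage points (1 pp = 0.01 AUC)."""
    return (auc_int8 - auc_fp32) * 100.0


def passes_auc_gate(auc_fp32, auc_int8, gate_pp=0.5):
    """True when the absolute AUC change is strictly below the gate."""
    return abs(auc_delta_pp(auc_fp32, auc_int8)) < gate_pp


# Illustrative numbers only.
print(passes_auc_gate(0.9975, 0.9969))  # small drop, about 0.06 pp
print(passes_auc_gate(0.9900, 0.9800))  # large drop, about 1 pp
```
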
### TabNet
| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9948 | 0.9195 | FP32 ONNX | 140 MB |
| Win32 | 0.9949 | 0.9317 | FP32 ONNX | 140 MB |
| Win64 | 0.9944 | 0.9318 | FP32 ONNX | 140 MB |
| .NET | 0.9923 | 0.8700 | FP32 ONNX | 140 MB |
| all | 0.9922 | 0.8912 | FP32 ONNX | 140 MB |
| APK | 0.9741 | 0.7028 | FP32 ONNX | 13.5 MB |
| ELF | 0.9793 | 0.5460 | FP32 ONNX | 13.5 MB |
| PDF | 0.9810 | 0.8597 | FP32 ONNX | 13.5 MB |
> The 140 MB ONNX size for the PE-family subsets is structural: the sparsemax attention loop is unfolded into the ONNX graph. If size matters, use `tabnet_PE.zip` (7.4 MB) directly.
### Hybrid (GBDT2NN)
| Subset | ROC-AUC | TPR@1%FPR | Deployment Format | Size |
|--------|---------|-----------|-------------------|------|
| PE | 0.9982 | 0.9736 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| Win32 | 0.9982 | 0.9734 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| Win64 | 0.9982 | 0.9811 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| .NET | 0.9961 | 0.9466 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| all | 0.9972 | 0.9513 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| APK | 0.9828 | 0.8003 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
| ELF | 0.9899 | 0.8827 | nn_part ONNX + LightGBM booster | 5.3 + 3.8 MB |
| PDF | 0.9879 | 0.9283 | nn_part ONNX + LightGBM booster | 5.3 + 3.7 MB |
### LightGBM (Treelite-compiled)
| Subset | ROC-AUC | TPR@1%FPR | Size (.tl) | Size (original .model) |
|--------|---------|-----------|------------|------------------------|
| PE | 0.9983 | 0.9686 | 5.3 MB | 3.8 MB |
| Win32 | 0.9985 | 0.9722 | 5.3 MB | 3.7 MB |
| Win64 | 0.9988 | 0.9830 | 5.3 MB | 3.7 MB |
| .NET | 0.9980 | 0.9561 | 5.3 MB | 3.7 MB |
| all | 0.9970 | 0.9450 | 5.3 MB | 3.8 MB |
| APK | 0.9861 | 0.8157 | 5.3 MB | 3.7 MB |
| ELF | 0.9929 | 0.9140 | 5.3 MB | 3.8 MB |
| PDF | 0.9913 | 0.9275 | 5.3 MB | 3.7 MB |
> Original LightGBM models: [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models). The `.tl` files are serialized with Treelite 3.9.1 and are platform-independent, so they must be compiled into a shared library on each target platform before inference.
### Challenge Set Detection Rate
> Challenge set: 6,315 evasive malware (all positive). The FPR=1% threshold from the test set is applied.
| Subset | DNN | TabNet | Hybrid | LightGBM |
|--------|-----|--------|--------|----------|
| `.NET` | 58.6% | 70.0% | 80.6% | 79.6% |
| `APK` | 27.3% | 29.3% | 34.4% | 33.6% |
| `ELF` | 11.7% | 4.4% | 23.8% | 30.3% |
| `PDF` | 41.5% | 40.1% | 56.9% | 57.1% |
| `PE` | 38.5% | 36.9% | 58.2% | 58.8% |
| `Win32`| 36.6% | 45.3% | 58.4% | 69.9% |
| `Win64`| 46.3% | 44.1% | 59.5% | 59.7% |
| `all` | 35.3% | 42.3% | 54.1% | 48.4% |
---
## Inference Performance (Apple M1, darwin-arm64)
> `warm_batch1` latency: batch size = 1, measured after cache warm-up. May differ from the deployment environment (x86_64 Linux).
### Latency (ms, warm batch=1)
| Subset | DNN | TabNet | Hybrid | LightGBM |
|--------|-----|--------|--------|----------|
| `.NET` | 0.248 | 5.465 | 0.151 | 0.050 |
| `APK` | 0.035 | 0.846 | 0.145 | 0.031 |
| `ELF` | 0.039 | 0.505 | 0.160 | 0.036 |
| `PDF` | 0.036 | 2.230 | 0.172 | 0.048 |
| `PE` | 0.290 | 4.402 | 0.138 | 0.028 |
| `Win32`| 0.288 | 4.693 | 0.141 | 0.044 |
| `Win64`| 0.220 | 5.621 | 0.422 | 0.039 |
| `all` | 0.254 | 4.788 | 0.147 | 0.068 |
> TabNet latency is high because the sparsemax attention is unfolded into the ONNX graph (structural).
> Hybrid = nn_part ONNX inference only (LightGBM leaf extraction excluded).
> LightGBM latency is for the compiled `.dylib`; the uploaded file is `.tl` (recompilation required).
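
For readers who want to reproduce a comparable number on their own hardware, here is a minimal timing sketch for warm, batch-size-1 latency. The warm-up count, the number of repeats, and the use of the median are assumptions; this card does not state how `warm_batch1` was aggregated, and `warm_latency_ms` is a hypothetical helper.

```python
import statistics
import time


def warm_latency_ms(predict_one, warmup=50, repeats=500):
    """Median per-call latency in milliseconds, measured after untimed warm-up calls."""
    for _ in range(warmup):
        predict_one()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict_one()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; in practice this would wrap a single-row inference call.
print(f"{warm_latency_ms(lambda: sum(range(1000))):.3f} ms")
```

With ONNX Runtime, `predict_one` could be a closure around `sess.run(...)` on a fixed `(1, dim)` input, so that only the inference call itself is timed.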
### Model File Sizes (deployment format)
| Subset | DNN | TabNet `.onnx` | TabNet `.zip` | Hybrid (nn+lgbm) | LightGBM `.tl` |
|--------|-----|----------------|---------------|------------------|----------------|
| PE family | 13.3 MB (INT8) | 140.2 MB | 7.4 MB | 5.3 + 3.8 MB | 5.3 MB |
| non-PE | 3.9 MB (FP32) | 13.5 MB | 3.2 MB | 5.3 + 3.7 MB | 5.3 MB |
---
## Usage
### Install Dependencies
```bash
# Quote version specifiers so the shell does not treat ">" as a redirect
pip install "onnxruntime>=1.20" numpy huggingface_hub
# For LightGBM / Hybrid inference
pip install "treelite==3.9.1" "treelite_runtime==3.9.1" "lightgbm>=4.6"
# To use the TabNet checkpoint directly
pip install "pytorch-tabnet>=4.1"
```
### DNN Inference (ONNX Runtime)
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# PE subset: INT8 Static
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# X: np.ndarray shape (N, 2568), dtype float32
X = np.random.randn(1, 2568).astype(np.float32)
logit = sess.run(["logit"], {"features": X})[0]  # shape (N, 1)
prob = 1 / (1 + np.exp(-logit.ravel()))          # sigmoid -> [0, 1]
print(f"malware probability: {prob[0]:.4f}")
```
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download

# APK subset: FP32
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="dnn/dnn_APK.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 696).astype(np.float32)  # non-PE: dim=696
logit = sess.run(["logit"], {"features": X})[0]
prob = 1 / (1 + np.exp(-logit.ravel()))
```
### TabNet Inference (ONNX Runtime)
```python
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
model_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="tabnet/tabnet_PE.onnx",
)
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
X = np.random.randn(1, 2568).astype(np.float32)
# output: logit (pre-sigmoid)
logit = sess.run(["logit"], {"features": X})[0]
prob = 1 / (1 + np.exp(-logit.ravel()))
```
### Hybrid Inference (ONNX + LightGBM)
```python
import numpy as np
import lightgbm as lgb
import onnxruntime as ort
from huggingface_hub import hf_hub_download
# 1. Extract leaf indices with the LightGBM booster
booster = lgb.Booster(model_file=hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_lgbm.model",
))
X_raw = np.random.randn(1, 2568).astype(np.float64)
leaf_indices = booster.predict(X_raw, pred_leaf=True).astype(np.int64)  # (N, n_trees)

# 2. Final classification with the GBDT2NN ONNX model
nn_sess = ort.InferenceSession(hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="hybrid/hybrid_PE_nnpart.onnx",
), providers=["CPUExecutionProvider"])
logit = nn_sess.run(["logit"], {"leaf_indices": leaf_indices})[0]
prob = 1 / (1 + np.exp(-logit.ravel()))
print(f"malware probability: {prob[0]:.4f}")
```
### LightGBM Inference (Treelite-compiled, fast inference)
```python
import os
import sys

import numpy as np
import treelite
import treelite_runtime
from huggingface_hub import hf_hub_download

# 1. Compile Treelite .tl -> platform-specific shared library (one-time)
tl_path = hf_hub_download(
    repo_id="cycloevan/ember-model",
    filename="lightgbm/lightgbm_PE.tl",
)
tl_model = treelite.Model.deserialize(tl_path)
lib_ext = ".dylib" if sys.platform == "darwin" else ".so"
# Swap only the trailing ".tl"; str.replace would also hit ".tl" elsewhere in the path
lib_path = os.path.splitext(tl_path)[0] + lib_ext
tl_model.export_lib(
    toolchain="clang" if sys.platform == "darwin" else "gcc",
    libpath=lib_path,
    verbose=False,
)

# 2. Inference
predictor = treelite_runtime.Predictor(lib_path, verbose=False)
X = np.random.randn(1, 2568).astype(np.float32)
prob = predictor.predict(treelite_runtime.DMatrix(X))
print(f"malware probability: {prob[0]:.4f}")
```
> **Note**: Requires `treelite==3.9.1` + `treelite_runtime==3.9.1`. Version 4.x does not support `export_lib()`.
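
Since the pinned versions matter here, a quick check before compiling can save a confusing failure later. The sketch below compares installed package versions against the pins from the note. `REQUIRED` and `version_mismatches` are hypothetical names, and the lookup function is injectable so the logic can be exercised without Treelite installed.

```python
from importlib import metadata

# Pins taken from the note above.
REQUIRED = {"treelite": "3.9.1", "treelite_runtime": "3.9.1"}


def version_mismatches(installed_version=metadata.version):
    """Return {package: found_version_or_None} for packages that differ from the pins."""
    problems = {}
    for package, wanted in REQUIRED.items():
        try:
            found = installed_version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            problems[package] = found
    return problems


print(version_mismatches() or "treelite pins satisfied")
```
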
---
## Training & Evaluation Environment
| Item | Details |
|------|---------|
| Dataset | [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024): train 52 weeks (2.6 M), test 12 weeks (606 K), challenge 6,315 |
| Feature dim | PE 2,568 (v3) / non-PE 696 (valid prefix) |
| Split policy | Fixed temporal order (temporal split), no random shuffling |
| Training environment | GPU server (CUDA 13) |
| Frameworks | PyTorch 2.11.0, pytorch-tabnet 4.1, LightGBM 4.6 |
| Random seed | 42 |
| DNN architecture | 2 × [Linear(d→d) + BatchNorm + PReLU(α=0.25) + Dropout(0.5)] → Linear(d→1), where d = 2,568 (PE) / 696 (non-PE) |
| Hybrid | LightGBM leaf extraction → shared leaf Embedding (dim 8) → concat → MLP[256, 128] (BatchNorm + PReLU) → Linear(→1) |
| Evaluation metrics | ROC-AUC, PR-AUC, **TPR @ 1% FPR** (paper §4.1) |
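
As a sanity check on the DNN row, the parameter count implied by that architecture can be worked out by hand. The sketch below assumes biased `Linear` layers, BatchNorm with a learnable scale and shift, and a single learnable slope per PReLU. The table does not specify these details, so the result should only be expected to land close to the 13.2 M / 0.98 M figures in the Models table, not to match them exactly. `dnn_param_count` is a hypothetical helper.

```python
def dnn_param_count(d, prelu_params=1):
    """Trainable parameters for 2 x [Linear(d, d) + BatchNorm + PReLU] + Linear(d, 1).

    Assumes biased Linear layers, BatchNorm with learnable scale and shift (2 * d),
    and `prelu_params` learnable slopes per PReLU. Dropout has no parameters.
    """
    block = (d * d + d) + 2 * d + prelu_params
    head = d + 1
    return 2 * block + head


for name, d in (("PE", 2568), ("non-PE", 696)):
    print(f"{name}: {dnn_param_count(d) / 1e6:.2f} M parameters")
```

Passing `prelu_params=d` models a per-channel PReLU instead, which shifts the totals only slightly.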
---
## Known Limitations
- **TabNet ONNX size**: unfolding the sparsemax attention loop inflates the PE-family ONNX to 140 MB. The original `tabnet_PE.zip` (7.4 MB) is lighter.
- **Treelite `.tl`**: the uploaded LightGBM artifact is a platform-independent serialization. You must compile it into a shared library (`.dylib`/`.so`) on each target platform before inference; see the LightGBM usage example. (The reported LightGBM latency is for a `.dylib` compiled on Mac ARM64.)
- **DNN non-PE INT8**: the 696-dim models suffer large AUC loss from quantization, so they are kept in FP32.
- **Hybrid inference**: not a single ONNX file. Inference takes two stages: LightGBM leaf extraction followed by the nn_part ONNX model.
- **Challenge detection rate**: measured using the FPR=1% threshold from the test set. Values may vary across subsets due to distribution differences.
---
## Citation
```bibtex
@inproceedings{joyce2025ember2024,
title = {EMBER2024 -- A Benchmark Dataset for Holistic Evaluation of Malware Classifiers},
author = {Joyce, Robert J. and Miller, Gideon and Roth, Phil and Zak, Richard and Zaresky-Williams, Elliott and Anderson, Hyrum and Raff, Edward and Holt, James},
booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '25)},
year = {2025},
doi = {10.1145/3711896.3737431},
url = {https://arxiv.org/abs/2506.05074}
}
```
---
## License
Code and model weights: Apache 2.0
Original LightGBM models (`hybrid/hybrid_*_lgbm.model`): subject to the [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models) license.