	---
	license: apache-2.0
	datasets:
	- joyce8/EMBER2024
	language:
	- en
	tags:
	- malware-detection
	- cybersecurity
	- onnxruntime
	- lightgbm
	- pytorch
	- tabnet
	- binary-classification
	pipeline_tag: tabular-classification
	library_name: onnxruntime
	---

	# EMBER2024 Malware Detection Models

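	A collection of four model architectures (DNN, TabNet, Hybrid GBDT2NN, LightGBM) evaluated on all eight subsets of the [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024) dataset and converted into deployment-ready formats. The subsets cover six file formats (Win32, Win64, .NET, APK, ELF, PDF) plus a combined `PE` group and an `all`-types set. The LightGBM models are the pretrained benchmark models from [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models); the other three architectures are trained on EMBER2024.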

	> Training environment: GPU server (CUDA 13)
	> Dataset paper: [Joyce et al., KDD 2025 (arXiv:2506.05074)](https://arxiv.org/abs/2506.05074)

	---

	## Models

	\| Directory \| Architecture \| Deployment Format \| Parameters \|
	\|-----------\|--------------\|-------------------\|------------\|
	\| `dnn/` \| Feed-Forward DNN (PReLU + Dropout) \| ONNX (INT8 Static / FP32) \| 13.2 M (PE) / 0.98 M (non-PE) \|
	\| `tabnet/` \| TabNet ([Arik & Pfister, 2021](https://arxiv.org/abs/1908.07442)) \| ONNX FP32 \| ~3 M \|
	\| `hybrid/` \| GBDT2NN ([DeepGBM, KDD 2019](https://www.microsoft.com/en-us/research/publication/deepgbm-a-deep-learning-framework-distilled-by-gbdt-for-online-prediction-tasks/)) \| ONNX (nn_part) + LightGBM booster \| ~1 M NN \|
	\| `lightgbm/` \| LightGBM (pretrained, [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models)) \| Treelite `.tl` \| — \|

	### Subset List

	\| Subset \| Target File Type \| Input Dim \|
	\|--------\|------------------\|-----------\|
	\| `PE` \| All PE binaries (Win32 + Win64 + .NET) \| 2,568 \|
	\| `Win32` \| Windows 32-bit PE \| 2,568 \|
	\| `Win64` \| Windows 64-bit PE \| 2,568 \|
	\| `.NET` \| .NET assemblies \| 2,568 \|
	\| `APK` \| Android APK \| 696 \|
	\| `ELF` \| Linux ELF \| 696 \|
	\| `PDF` \| PDF documents \| 696 \|
	\| `all` \| All file types combined \| 2,568 \|
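
The input dimension in the table above fixes the array shape each model accepts. As a minimal sketch, a lookup like the following could catch shape mismatches before a session is run. `INPUT_DIM` and `check_features` are hypothetical helper names for illustration and are not part of this repository.

```python
# Input dimensions copied from the Subset List table.
INPUT_DIM = {
    "PE": 2568, "Win32": 2568, "Win64": 2568, ".NET": 2568, "all": 2568,
    "APK": 696, "ELF": 696, "PDF": 696,
}


def check_features(subset, rows):
    """Return the expected width for `subset`, or raise ValueError on a mismatch.

    `rows` is any sequence of feature vectors (lists or array rows).
    """
    if subset not in INPUT_DIM:
        raise ValueError(f"unknown subset: {subset!r}")
    expected = INPUT_DIM[subset]
    for i, row in enumerate(rows):
        if len(row) != expected:
            raise ValueError(
                f"row {i}: expected {expected} features for {subset}, got {len(row)}"
            )
    return expected


print(check_features("APK", [[0.0] * 696]))  # 696
```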

	---

	## Directory Structure

	Filename convention: `{model}_{subset}[_suffix].{ext}`
	The `.NET` subset is rendered as `dotnet` in filenames.

	```
	dnn/
	├── dnn_PE.onnx # INT8 Static (deployment; PE/Win32/Win64/dotnet/all)
	├── dnn_PE_fp32.onnx # FP32 ONNX (reference; bundled only for INT8 subsets)
	├── dnn_PE.pt # PyTorch checkpoint
	├── dnn_PE_metrics.json # Evaluation results (AUC, TPR@1%FPR)
	├── dnn_PE_benchmark.json # Size & latency
	├── dnn_APK.onnx # FP32 (non-PE — INT8 AUC loss too large)
	├── dnn_APK.pt
	└── ...

	tabnet/
	├── tabnet_PE.onnx # FP32 ONNX (140 MB — sparsemax unfolding)
	├── tabnet_PE.zip # pytorch-tabnet native (7.4 MB, lightweight)
	└── ...

	hybrid/
	├── hybrid_PE_nnpart.onnx # GBDT2NN nn_part ONNX (5.1 MB)
	├── hybrid_PE_lgbm.model # LightGBM booster (3.6 MB)
	├── hybrid_PE.pt # PyTorch checkpoint
	└── ...

	lightgbm/
	├── lightgbm_PE.tl # Treelite serialization (platform-independent; must be compiled on each target platform)
	└── ...
	```
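
The naming convention above can be turned into a small path builder for use with `hf_hub_download`. This is a sketch that assumes every artifact follows the stated pattern; `artifact_filename` is a hypothetical helper, and the listing above remains the authority on which files actually exist.

```python
def artifact_filename(model, subset, ext, suffix=""):
    """Build `{model}/{model}_{subset}[_suffix].{ext}` for this repository layout."""
    # The .NET subset is written as "dotnet" in filenames.
    subset_token = "dotnet" if subset == ".NET" else subset
    stem = f"{model}_{subset_token}"
    if suffix:
        stem += f"_{suffix}"
    return f"{model}/{stem}.{ext}"


print(artifact_filename("dnn", "PE", "onnx"))          # dnn/dnn_PE.onnx
print(artifact_filename("dnn", "PE", "onnx", "fp32"))  # dnn/dnn_PE_fp32.onnx
print(artifact_filename("dnn", ".NET", "onnx"))        # dnn/dnn_dotnet.onnx
```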

	---

	## Performance Results (EMBER2024 test set)

	> Metrics: ROC-AUC, TPR @ 1% FPR (paper §4.1), and challenge-set detection rate at the FPR=1% threshold.
	> Challenge set: 6,315 evasive malware samples (positives only; Win32 3,225 / .NET 829 / Win64 814 / PDF 805 / ELF 386 / APK 256).
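
To make the two threshold-based metrics concrete, the sketch below derives a score threshold from benign test scores at a target FPR, then reports the fraction of malicious scores above it. The same function gives TPR@1%FPR on test-set malware and the detection rate on the challenge set. The tie handling (strictly greater than the threshold) and the absence of interpolation are assumptions; the card does not specify how the threshold was computed. The function names and the toy scores are illustrative only.

```python
import math


def threshold_at_fpr(benign_scores, target_fpr=0.01):
    """Lowest observed score t such that the share of benign scores > t is <= target_fpr."""
    ordered = sorted(benign_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ordered))  # benign samples allowed above t
    # At most `allowed` scores lie strictly above the (allowed + 1)-th highest score.
    return ordered[allowed]


def detection_rate(malicious_scores, threshold):
    """Fraction of malicious scores strictly above the threshold."""
    hits = sum(1 for s in malicious_scores if s > threshold)
    return hits / len(malicious_scores)


# Toy scores, not taken from the models in this repository.
benign = [i / 100 for i in range(100)]          # 0.00 .. 0.99
t = threshold_at_fpr(benign, target_fpr=0.01)   # 0.98: one benign score lies above it
print(detection_rate([0.5, 0.985, 0.99, 1.0], t))  # 0.75 (test-set style TPR)
print(detection_rate([0.2, 0.99], t))              # 0.5  (challenge-set style rate)
```

Scores can be probabilities or raw logits, as long as the threshold and the scored samples use the same scale.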

	### DNN

	\| Subset \| ROC-AUC \| TPR@1%FPR \| Deployment Format \| Size \|
	\|--------\|---------\|-----------\|-------------------\|------\|
	\| PE \| 0.9969 \| 0.9472 \| INT8 Static ONNX \| 13.3 MB \|
	\| Win32 \| 0.9965 \| 0.9479 \| INT8 Static ONNX \| 13.3 MB \|
	\| Win64 \| 0.9969 \| 0.9617 \| INT8 Static ONNX \| 13.3 MB \|
	\| .NET \| 0.9920 \| 0.8444 \| INT8 Static ONNX \| 13.3 MB \|
	\| all \| 0.9938 \| 0.8870 \| INT8 Static ONNX \| 13.3 MB \|
	\| APK \| 0.9761 \| 0.7682 \| FP32 ONNX \| 3.9 MB \|
	\| ELF \| 0.9840 \| 0.8103 \| FP32 ONNX \| 3.9 MB \|
	\| PDF \| 0.9795 \| 0.8902 \| FP32 ONNX \| 3.9 MB \|

	> The non-PE subsets (APK/ELF/PDF) use 696-dim inputs and have too few parameters, so INT8 quantization causes a large AUC drop; they are kept in FP32.
	> PE-family figures (PE, Win32, Win64, .NET, all) are for the INT8 models (fixed 100K-sample set). ΔAUC vs FP32 stays within 0.19 pp.
	> For the .NET and `all` subsets, INT8 quantization causes a relatively larger drop in TPR@1%FPR, although both still pass the AUC gate (\|ΔAUC\| < 0.5 pp).
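
The AUC gate is expressed in percentage points, where 1 pp corresponds to 0.01 of ROC-AUC. A minimal sketch of that check follows; `passes_auc_gate` is a hypothetical name and the AUC pairs in the example are made-up values, not measurements from this card.

```python
def passes_auc_gate(auc_fp32, auc_int8, max_delta_pp=0.5):
    """True if |ΔAUC| between FP32 and INT8, in percentage points, is below the gate."""
    delta_pp = abs(auc_int8 - auc_fp32) * 100.0  # 0.01 AUC == 1 pp
    return delta_pp < max_delta_pp


# Illustrative values only.
print(passes_auc_gate(0.9950, 0.9938))  # True  (about 0.12 pp)
print(passes_auc_gate(0.9900, 0.9800))  # False (about 1.0 pp)
```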

	### TabNet

	\| Subset \| ROC-AUC \| TPR@1%FPR \| Deployment Format \| Size \|
	\|--------\|---------\|-----------\|-------------------\|------\|
	\| PE \| 0.9948 \| 0.9195 \| FP32 ONNX \| 140 MB \|
	\| Win32 \| 0.9949 \| 0.9317 \| FP32 ONNX \| 140 MB \|
	\| Win64 \| 0.9944 \| 0.9318 \| FP32 ONNX \| 140 MB \|
	\| .NET \| 0.9923 \| 0.8700 \| FP32 ONNX \| 140 MB \|
	\| all \| 0.9922 \| 0.8912 \| FP32 ONNX \| 140 MB \|
	\| APK \| 0.9741 \| 0.7028 \| FP32 ONNX \| 13.5 MB \|
	\| ELF \| 0.9793 \| 0.5460 \| FP32 ONNX \| 13.5 MB \|
	\| PDF \| 0.9810 \| 0.8597 \| FP32 ONNX \| 13.5 MB \|

	> The 140 MB ONNX size for the PE-family subsets is structural: the sparsemax attention loop is unfolded into the ONNX graph. If size matters, use `tabnet_PE.zip` (7.4 MB) directly.
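
For the lighter `.zip` route, the sketch below loads the checkpoint with pytorch-tabnet instead of ONNX Runtime. It assumes the archive was written by `TabNetClassifier.save_model` and that class index 1 is the malicious label; neither is stated in this card. `load_tabnet`, `malware_probability`, and `demo` are hypothetical helper names.

```python
def load_tabnet(zip_path):
    """Load a pytorch-tabnet checkpoint (.zip) into a TabNetClassifier."""
    # Imported lazily so the helpers can be defined without pytorch-tabnet installed.
    from pytorch_tabnet.tab_model import TabNetClassifier

    clf = TabNetClassifier()
    clf.load_model(zip_path)
    return clf


def malware_probability(clf, features):
    """Return P(class 1) for each row; assumes label 1 means malicious."""
    return [float(row[1]) for row in clf.predict_proba(features)]


def demo():
    """Download the PE checkpoint and score one random vector (needs network access)."""
    import numpy as np
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id="cycloevan/ember-model",
        filename="tabnet/tabnet_PE.zip",
    )
    X = np.random.randn(1, 2568).astype(np.float32)
    print(malware_probability(load_tabnet(path), X))
```

Call `demo()` to run the full path. It is left uncalled so that defining the helpers has no side effects.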

	### Hybrid (GBDT2NN)

	\| Subset \| ROC-AUC \| TPR@1%FPR \| Deployment Format \| Size \|
	\|--------\|---------\|-----------\|-------------------\|------\|
	\| PE \| 0.9982 \| 0.9736 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.8 MB \|
	\| Win32 \| 0.9982 \| 0.9734 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.7 MB \|
	\| Win64 \| 0.9982 \| 0.9811 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.7 MB \|
	\| .NET \| 0.9961 \| 0.9466 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.7 MB \|
	\| all \| 0.9972 \| 0.9513 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.8 MB \|
	\| APK \| 0.9828 \| 0.8003 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.7 MB \|
	\| ELF \| 0.9899 \| 0.8827 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.8 MB \|
	\| PDF \| 0.9879 \| 0.9283 \| nn_part ONNX + LightGBM booster \| 5.3 + 3.7 MB \|

	### LightGBM (Treelite-compiled)

	\| Subset \| ROC-AUC \| TPR@1%FPR \| Size (.tl) \| Size (original .model) \|
	\|--------\|---------\|-----------\|------------\|------------------------\|
	\| PE \| 0.9983 \| 0.9686 \| 5.3 MB \| 3.8 MB \|
	\| Win32 \| 0.9985 \| 0.9722 \| 5.3 MB \| 3.7 MB \|
	\| Win64 \| 0.9988 \| 0.9830 \| 5.3 MB \| 3.7 MB \|
	\| .NET \| 0.9980 \| 0.9561 \| 5.3 MB \| 3.7 MB \|
	\| all \| 0.9970 \| 0.9450 \| 5.3 MB \| 3.8 MB \|
	\| APK \| 0.9861 \| 0.8157 \| 5.3 MB \| 3.7 MB \|
	\| ELF \| 0.9929 \| 0.9140 \| 5.3 MB \| 3.8 MB \|
	\| PDF \| 0.9913 \| 0.9275 \| 5.3 MB \| 3.7 MB \|

	> Original LightGBM models: [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models). The `.tl` files are serialized with Treelite 3.9.1. The serialization itself is platform-independent, so it must be compiled into a shared library on each target platform before inference.

	### Challenge Set Detection Rate

	> Challenge set: 6,315 evasive malware (all positive). The FPR=1% threshold from the test set is applied.

	\| Subset \| DNN \| TabNet \| Hybrid \| LightGBM \|
	\|--------\|-----\|--------\|--------\|----------\|
	\| `.NET` \| 58.6% \| 70.0% \| 80.6% \| 79.6% \|
	\| `APK` \| 27.3% \| 29.3% \| 34.4% \| 33.6% \|
	\| `ELF` \| 11.7% \| 4.4% \| 23.8% \| 30.3% \|
	\| `PDF` \| 41.5% \| 40.1% \| 56.9% \| 57.1% \|
	\| `PE` \| 38.5% \| 36.9% \| 58.2% \| 58.8% \|
	\| `Win32` \| 36.6% \| 45.3% \| 58.4% \| 69.9% \|
	\| `Win64` \| 46.3% \| 44.1% \| 59.5% \| 59.7% \|
	\| `all` \| 35.3% \| 42.3% \| 54.1% \| 48.4% \|

	---

	## Inference Performance (Apple M1, darwin-arm64)

	> `warm_batch1` latency: batch size = 1, measured after cache warm-up. Figures may differ in the deployment environment (x86_64 Linux).

	### Latency (ms, warm batch=1)

	\| Subset \| DNN \| TabNet \| Hybrid \| LightGBM \|
	\|--------\|-----\|--------\|--------\|----------\|
	\| `.NET` \| 0.248 \| 5.465 \| 0.151 \| 0.050 \|
	\| `APK` \| 0.035 \| 0.846 \| 0.145 \| 0.031 \|
	\| `ELF` \| 0.039 \| 0.505 \| 0.160 \| 0.036 \|
	\| `PDF` \| 0.036 \| 2.230 \| 0.172 \| 0.048 \|
	\| `PE` \| 0.290 \| 4.402 \| 0.138 \| 0.028 \|
	\| `Win32` \| 0.288 \| 4.693 \| 0.141 \| 0.044 \|
	\| `Win64` \| 0.220 \| 5.621 \| 0.422 \| 0.039 \|
	\| `all` \| 0.254 \| 4.788 \| 0.147 \| 0.068 \|

	> TabNet latency is high because the sparsemax attention is unfolded into the ONNX graph (structural).
	> Hybrid = nn_part ONNX inference only (LightGBM leaf extraction excluded).
	> LightGBM latency is for the compiled `.dylib`; the uploaded file is the `.tl` serialization, which must be compiled before use.
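
To reproduce a warm batch=1 measurement on other hardware, a timing harness along these lines may be enough. The warm-up count, the number of timed runs, and the use of the median are assumptions, since the card does not describe the benchmark procedure; `warm_latency_ms` is a hypothetical helper.

```python
import statistics
import time


def warm_latency_ms(predict_one, warmup=50, runs=500):
    """Median wall-clock latency in milliseconds of `predict_one()` after warm-up."""
    for _ in range(warmup):  # untimed calls to warm caches and lazy initialisation
        predict_one()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict_one()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; swap in e.g. lambda: sess.run(["logit"], {"features": X})
print(f"{warm_latency_ms(lambda: sum(range(1000))):.4f} ms")
```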

	### Model File Sizes (deployment format)

	\| Subset \| DNN \| TabNet `.onnx` \| TabNet `.zip` \| Hybrid (nn+lgbm) \| LightGBM `.tl` \|
	\|--------\|-----\|----------------\|---------------\|------------------\|----------------\|
	\| PE family \| 13.3 MB (INT8) \| 140.2 MB \| 7.4 MB \| 5.3 + 3.8 MB \| 5.3 MB \|
	\| non-PE \| 3.9 MB (FP32) \| 13.5 MB \| 3.2 MB \| 5.3 + 3.7 MB \| 5.3 MB \|

	---

	## Usage

	### Install Dependencies

	```bash
	# Quote version specifiers so the shell does not treat ">" as a redirect
	pip install "onnxruntime>=1.20" numpy huggingface_hub
	# For LightGBM / Hybrid inference
	pip install "treelite==3.9.1" "treelite_runtime==3.9.1" "lightgbm>=4.6"
	# To use the TabNet checkpoint directly
	pip install "pytorch-tabnet>=4.1"
	```

	### DNN Inference (ONNX Runtime)

	```python
	import numpy as np
	import onnxruntime as ort
	from huggingface_hub import hf_hub_download

	# PE subset — INT8 Static
	model_path = hf_hub_download(
	    repo_id="cycloevan/ember-model",
	    filename="dnn/dnn_PE.onnx",
	)
	sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

	# X: np.ndarray shape (N, 2568), dtype float32
	X = np.random.randn(1, 2568).astype(np.float32)
	logit = sess.run(["logit"], {"features": X})[0] # shape (N, 1)
	prob = 1 / (1 + np.exp(-logit.ravel())) # sigmoid → [0, 1]
	print(f"malware probability: {prob[0]:.4f}")
	```

	```python
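	# APK subset (FP32); imports repeated so the snippet runs on its own
	import numpy as np
	import onnxruntime as ort
	from huggingface_hub import hf_hub_download

	model_path = hf_hub_download(
	    repo_id="cycloevan/ember-model",
	    filename="dnn/dnn_APK.onnx",
	)
	sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
	X = np.random.randn(1, 696).astype(np.float32) # non-PE: dim=696
	logit = sess.run(["logit"], {"features": X})[0] # shape (N, 1)
	prob = 1 / (1 + np.exp(-logit.ravel())) # sigmoid → [0, 1]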
	```
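
The examples convert logits with `1 / (1 + exp(-logit))`, which can overflow in the exponential for strongly negative logits. A numerically stable variant is sketched below in plain Python so it runs without extra dependencies; `stable_sigmoid` is a hypothetical helper, and for arrays `scipy.special.expit` is a vectorised alternative if SciPy is available.

```python
import math


def stable_sigmoid(logit):
    """Sigmoid that never exponentiates a large positive number."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)  # logit < 0, so z is in (0, 1) and cannot overflow
    return z / (1.0 + z)


print(stable_sigmoid(0.0))      # 0.5
print(stable_sigmoid(-1000.0))  # 0.0
print(stable_sigmoid(1000.0))   # 1.0
```

Applied to a model output, this would be `[stable_sigmoid(v) for v in logit.ravel()]`.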

	### TabNet Inference (ONNX Runtime)

	```python
	import numpy as np
	import onnxruntime as ort
	from huggingface_hub import hf_hub_download

	model_path = hf_hub_download(
	    repo_id="cycloevan/ember-model",
	    filename="tabnet/tabnet_PE.onnx",
	)
	sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
	X = np.random.randn(1, 2568).astype(np.float32)
	# output: logit (pre-sigmoid)
	logit = sess.run(["logit"], {"features": X})[0]
	prob = 1 / (1 + np.exp(-logit.ravel()))
	```

	### Hybrid Inference (ONNX + LightGBM)

	```python
	import numpy as np
	import lightgbm as lgb
	import onnxruntime as ort
	from huggingface_hub import hf_hub_download

	# 1. Extract leaf indices with the LightGBM booster
	booster = lgb.Booster(model_file=hf_hub_download(
	    repo_id="cycloevan/ember-model",
	    filename="hybrid/hybrid_PE_lgbm.model",
	))
	X_raw = np.random.randn(1, 2568).astype(np.float64)
	leaf_indices = booster.predict(X_raw, pred_leaf=True).astype(np.int64) # (N, n_trees)

	# 2. Final classification with the GBDT2NN ONNX model
	nn_sess = ort.InferenceSession(hf_hub_download(
	    repo_id="cycloevan/ember-model",
	    filename="hybrid/hybrid_PE_nnpart.onnx",
	), providers=["CPUExecutionProvider"])
	logit = nn_sess.run(["logit"], {"leaf_indices": leaf_indices})[0]
	prob = 1 / (1 + np.exp(-logit.ravel()))
	print(f"malware probability: {prob[0]:.4f}")
	```

	### LightGBM Inference (Treelite-compiled — fast inference)

	```python
	# 1. Compile Treelite .tl → platform-specific shared library (one-time)
	import sys
	from pathlib import Path

	import numpy as np
	import treelite
	import treelite_runtime
	from huggingface_hub import hf_hub_download

	tl_path = hf_hub_download(
	    repo_id="cycloevan/ember-model",
	    filename="lightgbm/lightgbm_PE.tl",
	)
	tl_model = treelite.Model.deserialize(tl_path)
	lib_ext = ".dylib" if sys.platform == "darwin" else ".so"
	# Swap only the file extension; str.replace would also alter ".tl" elsewhere in the path
	lib_path = str(Path(tl_path).with_suffix(lib_ext))
	tl_model.export_lib(
	    toolchain="clang" if sys.platform == "darwin" else "gcc",
	    libpath=lib_path,
	    verbose=False,
	)

	# 2. Inference
	predictor = treelite_runtime.Predictor(lib_path, verbose=False)
	X = np.random.randn(1, 2568).astype(np.float32)
	prob = predictor.predict(treelite_runtime.DMatrix(X))
	print(f"malware probability: {prob[0]:.4f}")
	```

	> Note: Requires `treelite==3.9.1` + `treelite_runtime==3.9.1`. Treelite 4.x does not support `export_lib()`; its model compiler was moved to the separate TL2cgen project.

	---

	## Training & Evaluation Environment

	\| Item \| Details \|
	\|------\|---------\|
	\| Dataset \| [EMBER2024](https://huggingface.co/datasets/joyce8/EMBER2024) — train 52 weeks (2.6 M), test 12 weeks (606 K), challenge 6,315 \|
	\| Feature dim \| PE 2,568 (v3) / non-PE 696 (valid prefix) \|
	\| Split policy \| Fixed temporal order (temporal split), no random shuffling \|
	\| Training environment \| GPU server (CUDA 13) \|
	\| Frameworks \| PyTorch 2.11.0, pytorch-tabnet 4.1, LightGBM 4.6 \|
	\| Random seed \| 42 \|
	\| DNN architecture \| 2 × [Linear(d→d) + BatchNorm + PReLU(α=0.25) + Dropout(0.5)] → Linear(d→1), where d = 2,568 (PE) / 696 (non-PE) \|
	\| Hybrid \| LightGBM leaf extraction → shared leaf Embedding (dim 8) → concat → MLP[256, 128] (BatchNorm + PReLU) → Linear(→1) \|
	\| Evaluation metrics \| ROC-AUC, PR-AUC, TPR @ 1% FPR (paper §4.1) \|
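
As a sanity check on the DNN row above, the trainable parameter count can be estimated from the layer description alone. The sketch assumes each `Linear` has a bias, each BatchNorm contributes a scale and a shift per feature, and each PReLU has a single shared slope; those details are not stated in the table, and `dnn_param_count` is a hypothetical helper.

```python
def dnn_param_count(d, hidden_blocks=2):
    """Estimated trainable parameters of the DNN for input width d."""
    linear = d * d + d   # Linear(d, d) weights plus bias
    batchnorm = 2 * d    # per-feature scale and shift
    prelu = 1            # single shared slope (assumption)
    head = d + 1         # final Linear(d, 1) with bias
    return hidden_blocks * (linear + batchnorm + prelu) + head


for d in (2568, 696):
    n = dnn_param_count(d)
    print(f"d={d}: {n:,} parameters, about {n * 4 / 1e6:.1f} MB in FP32")
```

Under these assumptions the estimate is about 13.2 M parameters for d = 2,568 and about 0.97 M for d = 696, close to the figures in the Models table. At 4 bytes per FP32 weight the 696-dim model comes to roughly 3.9 MB, and at 1 byte per INT8 weight the 2,568-dim model comes to roughly 13.2 MB, both in line with the reported file sizes.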

	---

	## Known Limitations

	- TabNet ONNX size: unfolding the sparsemax attention loop inflates the PE-family ONNX to 140 MB. The original `tabnet_PE.zip` (7.4 MB) is lighter.
	- Treelite `.tl`: the uploaded LightGBM artifact is a platform-independent serialization. You must compile it into a shared library (`.dylib`/`.so`) on each target platform before inference — see the LightGBM usage example. (The reported LightGBM latency is for a `.dylib` compiled on Mac ARM64.)
	- DNN non-PE INT8: the 696-dim models suffer large AUC loss from quantization, so they are kept in FP32.
	- Hybrid inference: not a single ONNX file — two stages: LightGBM leaf extraction + nn_part ONNX.
	- Challenge detection rate: measured using the FPR=1% threshold from the test set. Values may vary across subsets due to distribution differences.

	---

	## Citation

	```bibtex
	@inproceedings{joyce2025ember2024,
	title = {EMBER2024 -- A Benchmark Dataset for Holistic Evaluation of Malware Classifiers},
	author = {Joyce, Robert J. and Miller, Gideon and Roth, Phil and Zak, Richard and Zaresky-Williams, Elliott and Anderson, Hyrum and Raff, Edward and Holt, James},
	booktitle = {Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '25)},
	year = {2025},
	doi = {10.1145/3711896.3737431},
	url = {https://arxiv.org/abs/2506.05074}
	}
	```

	---

	## License

	Code and model weights: Apache 2.0
	Original LightGBM models (`hybrid/hybrid_*_lgbm.model`): subject to the [joyce8/EMBER2024-benchmark-models](https://huggingface.co/joyce8/EMBER2024-benchmark-models) license.