Update README.md
README.md
CHANGED
|
@@ -25,13 +25,11 @@ malwi ./examples
|
|
| 25 |
|
| 26 |
3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
|
| 27 |
```
|
| 28 |
-
|
| 29 |
-
|
| 30 |
-
|
| 31 |
|
| 32 |
-
|
| 33 |
-
- Object: runcommand
|
| 34 |
-
- Maliciousness: 👹 0.9620079398155212
|
| 35 |
```
|
| 36 |
|
| 37 |
## Why malwi?
|
|
@@ -44,16 +42,9 @@ Typical malware behaviors include:
|
|
| 44 |
- _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
|
| 45 |
- _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
|
| 46 |
|
| 47 |
-
> ⚠️ **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
|
| 48 |
-
Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
|
| 49 |
-
|
| 50 |
## How does it work?
|
| 51 |
|
| 52 |
-
malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
|
| 53 |
-
|
| 54 |
-
The following datasets are used as a source for malicious samples:
|
| 55 |
-
- [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
|
| 56 |
-
- [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
|
| 57 |
|
| 58 |
### 1. Compile Python files to bytecode
|
| 59 |
|
|
@@ -109,7 +100,7 @@ SVM => Malicious
|
|
| 109 |
|
| 110 |
## Benchmarks?
|
| 111 |
|
| 112 |
-
DistilBert
|
| 113 |
|
| 114 |
| Metric | Value |
|
| 115 |
|----------------------------|-------------------------------|
|
|
@@ -120,12 +111,19 @@ DistilBert:
|
|
| 120 |
| Hardware | NVIDIA RTX 4090 |
|
| 121 |
| Epochs | 3 |
|
| 122 |
|
| 123 |
-
SVM
|
| 124 |
|
| 125 |
-
|
| 126 |
|
| 127 |
## Limitations
|
| 128 |
|
| 129 |
The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
|
| 130 |
|
| 131 |
## What's next?
|
|
@@ -140,18 +138,27 @@ Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
|
|
| 140 |
|
| 141 |
### Develop
|
| 142 |
|
| 143 |
-
Prerequisites:
|
| 144 |
-
|
| 145 |
# Download and process data
|
| 146 |
-
cmds/
|
| 147 |
|
| 148 |
-
# Only
|
| 149 |
-
cmds/
|
| 150 |
|
| 151 |
-
#
|
| 152 |
-
cmds/
|
| 153 |
|
| 154 |
-
# Start
|
| 155 |
cmds/train_distilbert.sh
|
| 156 |
|
| 157 |
# Start SVM Layer training
|
|
|
|
| 25 |
|
| 26 |
3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
|
| 27 |
```
|
| 28 |
+
- 2 files scanned
|
| 29 |
+
- 0 files skipped
|
| 30 |
+
- 3 malicious objects
|
| 31 |
|
| 32 |
+
=> 👹 malicious 1.0
|
| 33 |
```
|
| 34 |
|
| 35 |
## Why malwi?
|
|
|
|
| 42 |
- _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
|
| 43 |
- _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
|
| 44 |
|
| 45 |
## How does it work?
|
| 46 |
|
| 47 |
+
malwi applies [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), following the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1). The model is trained on the [malwi-samples](https://github.com/schirrmacher/malwi-samples) dataset.
|
| 48 |
|
| 49 |
### 1. Compile Python files to bytecode
|
| 50 |
|
|
|
|
| 100 |
|
| 101 |
## Benchmarks?
|
| 102 |
|
| 103 |
+
### DistilBERT
|
| 104 |
|
| 105 |
| Metric | Value |
|
| 106 |
|----------------------------|-------------------------------|
|
|
|
|
| 111 |
| Hardware | NVIDIA RTX 4090 |
|
| 112 |
| Epochs | 3 |
|
| 113 |
|
| 114 |
+
### SVM Layer
|
| 115 |
|
| 116 |
+
| Metric | Value |
|
| 117 |
+
|----------------------------|-------------------------------|
|
| 118 |
+
| F1 Score | 0.96 |
|
| 119 |
+
| Recall | 0.95 |
|
| 120 |
+
| Precision | 0.95 |
|
| 121 |
|
| 122 |
## Limitations
|
| 123 |
|
| 124 |
+
malwi compiles Python to bytecode, which is highly version-dependent. The AI models are trained on that bytecode.
|
| 125 |
+
This means detection performance might drop if the installed Python version emits different bytecode instructions than the version used for training. There is no data on this yet.
|
| 126 |
+
|
| 127 |
The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
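
The version dependence of bytecode can be checked locally with Python's standard `dis` module. The snippet below is only a minimal sketch and not part of malwi: the `opcode_names` helper and the `SAMPLE` source are hypothetical names chosen for illustration. Running it under two interpreters (for example 3.10 and 3.11, where the opcodes emitted for function calls changed) should print different opcode sequences for the same source.

```python
import dis
import sys


def opcode_names(source: str) -> list[str]:
    """Compile source text and return the opcode names of its top-level code object."""
    code = compile(source, "<example>", "exec")
    return [instruction.opname for instruction in dis.get_instructions(code)]


# Hypothetical sample source, not taken from malwi or its dataset.
SAMPLE = "import os\nos.system('id')\n"

names = opcode_names(SAMPLE)

# The printed sequence depends on the interpreter that runs this script.
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {names}")
```

Comparing the printed lists across interpreters gives a rough idea of how far the input seen at scan time can drift from the bytecode seen during training.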
|
| 128 |
|
| 129 |
## What's next?
|
|
|
|
| 138 |
|
| 139 |
### Develop
|
| 140 |
|
| 141 |
+
**Prerequisites:**
|
| 142 |
+
- [uv](https://docs.astral.sh/uv/)
|
| 143 |
+
- Download [malwi-samples](https://github.com/schirrmacher/malwi-samples) into the same parent folder as malwi
|
| 144 |
+
|
| 145 |
+
```bash
|
| 146 |
# Download and process data
|
| 147 |
+
cmds/download_and_preprocess_distilbert.sh
|
| 148 |
+
|
| 149 |
+
# Preprocess and train DistilBERT only
|
| 150 |
+
cmds/preprocess_and_train_distilbert.sh
|
| 151 |
+
|
| 152 |
+
# Preprocess and train SVM Layer only
|
| 153 |
+
cmds/preprocess_and_train_svm.sh
|
| 154 |
|
| 155 |
+
# Only preprocess data for DistilBERT
|
| 156 |
+
cmds/preprocess_distilbert.sh
|
| 157 |
|
| 158 |
+
# Only preprocess data for SVM Layer
|
| 159 |
+
cmds/preprocess_svm.sh
|
| 160 |
|
| 161 |
+
# Start DistilBERT training
|
| 162 |
cmds/train_distilbert.sh
|
| 163 |
|
| 164 |
# Start SVM Layer training
|