Update README.md
README.md
CHANGED
|
@@ -20,16 +20,30 @@ pip install --user malwi
|
|
| 20 |
|
| 21 |
2) **Run**
|
| 22 |
```
|
| 23 |
-
malwi
|
| 24 |
```
|
| 25 |
|
| 26 |
3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
|
| 27 |
```
|
| 28 |
-
-
|
| 29 |
-
|
| 30 |
-
|
|
|
|
| 31 |
|
| 32 |
-
|
| 33 |
```
|
| 34 |
|
| 35 |
## Why malwi?
|
|
@@ -73,29 +87,18 @@ TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_c
|
|
| 73 |
### 3. Feed tokens into pre-trained DistilBERT
|
| 74 |
|
| 75 |
```
|
| 76 |
-
=> Maliciousness
|
| 77 |
```
|
| 78 |
|
| 79 |
This creates a list of malicious code objects. However, malicious code might be split into chunks and spread across
|
| 80 |
a package. This is why the next layers are needed.
|
| 81 |
|
| 82 |
-
### 4.
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
| Object | DYNAMIC_CODE_EXECUTION | ENCODING_DECODING | FILESYSTEM_ACCESS | ... |
|
| 86 |
-
|----------|------------------------|-------------------|-------------------|-----|
|
| 87 |
-
| Object A | 0 | 1 | 0 | ... |
|
| 88 |
-
| Object B | 1 | 2 | 1 | ... |
|
| 89 |
-
| Object C | 0 | 0 | 2 | ... |
|
| 90 |
-
| **Package** | **1** | **3** | **3** | **...** |
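The **Package** row in the table above is the column-wise sum of the per-object counts. A minimal sketch of that aggregation follows, using only the three categories shown; the `objects` dictionary and the `package_vector` helper are hypothetical names for illustration, not malwi's actual code.

```python
from collections import Counter

# Per-object category counts, copied from the table above.
# The dictionary layout is an assumption made for this sketch.
objects = {
    "Object A": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 1, "FILESYSTEM_ACCESS": 0},
    "Object B": {"DYNAMIC_CODE_EXECUTION": 1, "ENCODING_DECODING": 2, "FILESYSTEM_ACCESS": 1},
    "Object C": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 0, "FILESYSTEM_ACCESS": 2},
}


def package_vector(per_object):
    """Sum the category counts of all objects into one package-level vector."""
    total = Counter()
    for counts in per_object.values():
        total.update(counts)  # adds counts key by key
    return dict(total)


print(package_vector(objects))
```

A vector like this could then serve as the input to a package-level classifier.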
|
| 91 |
-
|
| 92 |
-
|
| 93 |
-
### 5. Take final decision
|
| 94 |
|
| 95 |
-
|
| 96 |
|
| 97 |
```
|
| 98 |
-
|
| 99 |
```
|
| 100 |
|
| 101 |
## Benchmarks?
|
|
@@ -104,26 +107,16 @@ SVM => Malicious
|
|
| 104 |
|
| 105 |
| Metric | Value |
|
| 106 |
|----------------------------|-------------------------------|
|
| 107 |
-
| F1 Score | 0.
|
| 108 |
-
| Recall | 0.
|
| 109 |
-
| Precision | 0.
|
| 110 |
-
| Training time | ~
|
| 111 |
| Hardware | NVIDIA RTX 4090 |
|
| 112 |
| Epochs | 3 |
|
| 113 |
|
| 114 |
-
### SVM Layer
|
| 115 |
-
|
| 116 |
-
| Metric | Value |
|
| 117 |
-
|----------------------------|-------------------------------|
|
| 118 |
-
| F1 Score | 0.96 |
|
| 119 |
-
| Recall | 0.95 |
|
| 120 |
-
| Precision | 0.95 |
|
| 121 |
|
| 122 |
## Limitations
|
| 123 |
|
| 124 |
-
malwi compiles Python to bytecode, which is highly version-dependent. The AI models are trained on that bytecode.
|
| 125 |
-
This means performance might drop if the user's installed Python version produces different bytecode instructions. There is no data on this yet.
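To see the version dependence for yourself, the standard-library `dis` module lists the instructions the running interpreter emits. The sketch below is independent of malwi; `opcode_names` and `sample` are hypothetical names. Running it under different CPython releases can produce different opcode sequences. For example, CPython 3.11 replaced `CALL_FUNCTION` with `CALL`.

```python
import dis
import sys


def opcode_names(func):
    """Return the opcode names CPython emits for func on this interpreter."""
    return [ins.opname for ins in dis.get_instructions(func)]


def sample(value):
    # A tiny function with a call and a method call to compile.
    return str(value).upper()


# Print the interpreter version next to the opcodes so runs can be compared.
print(sys.version_info[:2], opcode_names(sample))
```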
|
| 126 |
-
|
| 127 |
The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
|
| 128 |
|
| 129 |
## What's next?
|
|
@@ -146,37 +139,13 @@ Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
|
|
| 146 |
# Download and process data
|
| 147 |
cmds/download_and_preprocess_distilbert.sh
|
| 148 |
|
| 149 |
-
#
|
| 150 |
-
cmds/preprocess_and_train_distilbert.sh
|
| 151 |
-
|
| 152 |
-
# Preprocess and train SVM Layer only
|
| 153 |
-
cmds/preprocess_and_train_svm.sh
|
| 154 |
-
|
| 155 |
-
# Only preprocess data for DistilBERT
|
| 156 |
-
cmds/preprocess_distilbert.sh
|
| 157 |
-
|
| 158 |
-
# Only preprocess data for SVM Layer
|
| 159 |
-
cmds/preprocess_svm.sh
|
| 160 |
-
|
| 161 |
-
# Start DistilBERT training
|
| 162 |
-
cmds/train_distilbert.sh
|
| 163 |
-
|
| 164 |
-
# Start SVM Layer training
|
| 165 |
-
cmds/train_svm_layer.sh
|
| 166 |
-
```
|
| 167 |
-
|
| 168 |
-
### Triage
|
| 169 |
-
|
| 170 |
-
malwi uses a pipeline that can be enhanced by triaging its results (see `src/research/triage.py`). For automated triaging, you can leverage open-source models in combination with [Ollama](https://ollama.com/).
|
| 171 |
-
|
| 172 |
-
#### Start LLM
|
| 173 |
-
|
| 174 |
-
```
|
| 175 |
-
ollama run gemma3
|
| 176 |
-
```
|
| 177 |
|
| 178 |
-
#
|
|
|
|
| 179 |
|
| 180 |
```
|
| 181 |
-
uv run python -m src.research.triage --triage-ollama --path <FOLDER_WITH_MALWI_YAML_RESULTS>
|
| 182 |
-
```
|
|
|
|
| 20 |
|
| 21 |
2) **Run**
|
| 22 |
```
|
| 23 |
+
malwi examples/malicious
|
| 24 |
```
|
| 25 |
|
| 26 |
3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
|
| 27 |
```
|
| 28 |
+
.--------.---.-| .--.--.--|__|
|
| 29 |
+
| | _ | | | | | |
|
| 30 |
+
|__|__|__|___._|__|________|__|
|
| 31 |
+
AI Python Malware Scanner
|
| 32 |
|
| 33 |
+
|
| 34 |
+
- target: examples/malicious
|
| 35 |
+
- files: 13
|
| 36 |
+
  ├── scanned: 3
|
| 37 |
+
  ├── skipped: 10
|
| 38 |
+
  └── suspicious:
|
| 39 |
+
      └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
|
| 40 |
+
          └── <module>
|
| 41 |
+
              ├── deserialization
|
| 42 |
+
              ├── user io
|
| 43 |
+
              ├── system interaction
|
| 44 |
+
              └── process management
|
| 45 |
+
|
| 46 |
+
=> 👹 malicious 1.00
|
| 47 |
```
|
| 48 |
|
| 49 |
## Why malwi?
|
|
|
|
| 87 |
### 3. Feed tokens into pre-trained DistilBERT
|
| 88 |
|
| 89 |
```
|
| 90 |
+
=> Maliciousness: 0.92
|
| 91 |
```
|
| 92 |
|
| 93 |
This creates a list of malicious code objects. However, malicious code might be split into chunks and spread across
|
| 94 |
a package. This is why the next layers are needed.
|
| 95 |
|
| 96 |
+
### 4. Take final decision
|
| 97 |
|
| 98 |
+
The DistilBERT model makes the final maliciousness decision based on the token patterns.
|
| 99 |
|
| 100 |
```
|
| 101 |
+
=> Maliciousness: 0.92
|
| 102 |
```
|
| 103 |
|
| 104 |
## Benchmarks?
|
|
|
|
| 107 |
|
| 108 |
| Metric | Value |
|
| 109 |
|----------------------------|-------------------------------|
|
| 110 |
+
| F1 Score | 0.944 |
|
| 111 |
+
| Recall | 0.906 |
|
| 112 |
+
| Precision | 0.984 |
|
| 113 |
+
| Training time | ~5 hours |
|
| 114 |
| Hardware | NVIDIA RTX 4090 |
|
| 115 |
| Epochs | 3 |
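The metrics in this table can be cross-checked, since F1 is the harmonic mean of precision and recall. The sketch below recomputes F1 from the rounded precision and recall shown above; `f1_score` here is a local helper written for this check, not a library call. The result comes out near 0.943, within rounding of the reported 0.944.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the table above.
precision, recall, reported_f1 = 0.984, 0.906, 0.944

computed = f1_score(precision, recall)
print(f"computed F1: {computed:.4f}, reported F1: {reported_f1}")

# The inputs are already rounded to three decimals, so allow a small tolerance.
assert abs(computed - reported_f1) < 0.002
```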
|
| 116 |
|
| 117 |
|
| 118 |
## Limitations
|
| 119 |
|
| 120 |
The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
|
| 121 |
|
| 122 |
## What's next?
|
|
|
|
| 139 |
# Download and process data
|
| 140 |
cmds/download_and_preprocess_distilbert.sh
|
| 141 |
|
| 142 |
+
# Complete pipelines
|
| 143 |
+
cmds/preprocess_and_train_distilbert.sh # Data → Tokenizer → DistilBERT
|
| 144 |
|
| 145 |
+
# Individual data preprocessing
|
| 146 |
+
cmds/preprocess_data.sh # Process data for ML training
|
| 147 |
|
| 148 |
+
# Individual model training
|
| 149 |
+
cmds/train_tokenizer.sh # Train custom tokenizer
|
| 150 |
+
cmds/train_distilbert.sh # Train DistilBERT model
|
| 151 |
```
|
|
|
|
|
|