Update README.md
README.md
CHANGED
|
@@ -47,15 +47,13 @@ Typical malware behaviors include:
|
|
| 47 |
> **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
|
| 48 |
Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
|
| 49 |
|
| 50 |
-
## What's next?
|
| 51 |
-
|
| 52 |
-
The first iteration focuses on **maliciousness of Python source code**.
|
| 53 |
-
|
| 54 |
-
Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
|
| 55 |
-
|
| 56 |
## How does it work?
|
| 57 |
|
| 58 |
-
malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
|
| 59 |
|
| 60 |
1. malwi compiles Python files to bytecode:
|
| 61 |
|
|
@@ -87,6 +85,25 @@ TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_c
|
|
| 87 |
Maliciousness: 0.9620079398155212
|
| 88 |
```
|
| 89 |
|
| 90 |
## Support
|
| 91 |
|
| 92 |
Do you have access to malicious Rust, Go, or other packages? **Contact me.**
|
|
@@ -106,4 +123,20 @@ cmds/preprocess_and_train.sh
|
|
| 106 |
|
| 107 |
# Only start training
|
| 108 |
cmds/train.sh
|
| 109 |
```
|
|
|
|
| 47 |
> **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
|
| 48 |
Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
|
| 49 |
|
| 50 |
## How does it work?
|
| 51 |
|
| 52 |
+
malwi applies [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM), based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
|
| 53 |
+
|
| 54 |
+
The following datasets are used as a source for malicious samples:
|
| 55 |
+
- [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
|
| 56 |
+
- [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
|
| 57 |
|
| 58 |
1. malwi compiles Python files to bytecode:
|
| 59 |
|
|
|
|
| 85 |
Maliciousness: 0.9620079398155212
|
| 86 |
```
|
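The compile-to-bytecode step can be illustrated with Python's standard `dis` module. This is only a minimal sketch, not malwi's actual tokenizer: the function name `bytecode_tokens` and the rule for which arguments become tokens are assumptions made for illustration. Opcode names also differ between Python versions, so the exact token sequence will vary.

```python
import dis
from types import CodeType


def bytecode_tokens(source: str, filename: str = "<sample>") -> list[str]:
    """Compile Python source (without running it) and flatten the bytecode into tokens."""
    pending = [compile(source, filename, "exec")]
    tokens: list[str] = []
    while pending:
        code = pending.pop()
        for instr in dis.get_instructions(code):
            tokens.append(instr.opname.lower())
            # Keep referenced names (modules, globals, attributes) as extra tokens.
            if instr.opcode in dis.hasname and isinstance(instr.argval, str):
                tokens.append(instr.argval)
        # Function and class bodies are nested code objects stored in co_consts.
        pending.extend(c for c in code.co_consts if isinstance(c, CodeType))
    return tokens


sample = "import subprocess\nsubprocess.run(['ls'])\n"
print(" ".join(bytecode_tokens(sample)))
```

Because `compile` only builds a code object, the sample is never executed, which matters when the input may be malicious.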
| 87 |
|
| 88 |
+
## Benchmarks
|
| 89 |
+
|
| 90 |
+
The current best model differentiates benign from malicious code with the following metrics:
|
| 91 |
+
|
| 92 |
+
- F1 Score: `0.91`
|
| 93 |
+
- Recall: `0.87`
|
| 94 |
+
- Precision: `0.94`
|
| 95 |
+
- Benign unique samples: 1070888
|
| 96 |
+
- Malicious unique samples: 152984
|
| 97 |
+
- Trained for ~4 hours on an NVIDIA RTX 4090 for 3 epochs
|
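As a quick sanity check on how these figures relate, the sketch below recomputes F1 as the harmonic mean of the listed precision and recall, and derives the class balance from the sample counts. The helper names are illustrative, and the README does not state how the metrics were evaluated.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def malicious_share(benign: int, malicious: int) -> float:
    """Fraction of all unique samples that are malicious."""
    return malicious / (benign + malicious)


print(f"F1 from rounded inputs: {f1_score(0.94, 0.87):.4f}")
print(f"Malicious share: {malicious_share(1_070_888, 152_984):.3f}")
```

With the rounded inputs this gives an F1 of about 0.90, slightly below the listed 0.91. A gap of that size is consistent with precision and recall having been rounded to two decimals. The sample counts correspond to a 7:1 benign-to-malicious ratio, that is, 12.5% malicious samples.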
| 98 |
+
|
| 99 |
+
|
| 100 |
+
## What's next?
|
| 101 |
+
|
| 102 |
+
The first iteration focuses on **maliciousness of Python source code**.
|
| 103 |
+
|
| 104 |
+
Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
|
| 105 |
+
|
| 106 |
+
|
| 107 |
## Support
|
| 108 |
|
| 109 |
Do you have access to malicious Rust, Go, or other packages? **Contact me.**
|
|
|
|
| 123 |
|
| 124 |
# Only start training
|
| 125 |
cmds/train.sh
|
| 126 |
+
```
|
| 127 |
+
|
| 128 |
+
### Triage
|
| 129 |
+
|
| 130 |
+
malwi uses a pipeline whose results can be improved by triaging them (see `src/research/triage.py`).
|
| 131 |
+
For automated triaging, you can use open-source models through Ollama (Gemma 3 by default).
|
| 132 |
+
|
| 133 |
+
Install and start Gemma 3 (`ollama run` downloads the model on first use):
|
| 134 |
+
|
| 135 |
+
```
|
| 136 |
+
ollama run gemma3
|
| 137 |
+
```
|
| 138 |
+
|
| 139 |
+
Start auto-triaging:
|
| 140 |
+
```
|
| 141 |
+
uv run python -m src.research.triage --triage-ollama --path <FOLDER_WITH_MALWI_YAML_RESULTS>
|
| 142 |
```
|
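To give a rough idea of what LLM-assisted triage through Ollama can look like, here is a minimal sketch that asks a local model for a verdict on a single code snippet. It is not the implementation in `src/research/triage.py`: the prompt wording, the function names, and the verdict labels are hypothetical, and it assumes an Ollama server listening on its default local port 11434 with the `gemma3` model available.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def build_triage_request(code_snippet: str, model: str = "gemma3") -> dict:
    """Build a non-streaming generate payload (the prompt wording is hypothetical)."""
    prompt = (
        "You are reviewing Python code flagged by a malware scanner.\n"
        "Answer with exactly one word, MALICIOUS or BENIGN.\n\n"
        f"{code_snippet}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def parse_verdict(response_text: str) -> str:
    """Map the model's free-text answer onto a verdict, defaulting to UNKNOWN."""
    answer = response_text.strip().upper()
    for verdict in ("MALICIOUS", "BENIGN"):
        if answer.startswith(verdict):
            return verdict
    return "UNKNOWN"


def triage(code_snippet: str) -> str:
    """Send one snippet to the local Ollama server and return the parsed verdict."""
    body = json.dumps(build_triage_request(code_snippet)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return parse_verdict(json.loads(reply.read())["response"])
```

Calling `triage("import os; os.system('curl ... | sh')")` requires the Ollama server to be running. Answers that do not start with one of the two labels fall back to `UNKNOWN` so they can be reviewed manually.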