schirrmacher committed on
Commit
dec2025
·
verified ·
1 Parent(s): 54abaae

Update README.md

Files changed (1)
  1. README.md +40 -7
README.md CHANGED
@@ -47,15 +47,13 @@ Typical malware behaviors include:
47
  > **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
48
  Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
49
 
50
- ## What's next?
51
-
52
- The first iteration focuses on **maliciousness of Python source code**.
53
-
54
- Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
55
-
56
  ## How does it work?
57
 
58
- malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1). [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry) is used as a source for malicious samples.
59
 
60
  1. malwi compiles Python files to bytecode:
61
 
@@ -87,6 +85,25 @@ TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_c
87
  Maliciousness: 0.9620079398155212
88
  ```
89
 
90
  ## Support
91
 
92
  Do you have access to malicious Rust, Go, or other packages? **Contact me.**
@@ -106,4 +123,20 @@ cmds/preprocess_and_train.sh
106
 
107
  # Only start training
108
  cmds/train.sh
109
  ```
 
47
  > **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
48
  Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
49
 
50
  ## How does it work?
51
 
52
+ malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
53
+
54
+ The following datasets are used as a source for malicious samples:
55
+ - [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
56
+ - [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
57
 
58
  1. malwi compiles Python files to bytecode:
59
 
 
85
  Maliciousness: 0.9620079398155212
86
  ```
87
 
88
+ ## Benchmarks
89
+
90
+ The current best model differentiates benign from malicious code with the following metrics:
91
+
92
+ - F1 Score: `0.91`
93
+ - Recall: `0.87`
94
+ - Precision: `0.94`
95
+ - Benign unique samples: 1070888
96
+ - Malicious unique samples: 152984
97
+ - Trained for 3 epochs in ~4 hours on an NVIDIA RTX 4090
98
+
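As a quick sanity check on the figures above, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and derives the class ratio from the sample counts. The function and variable names are illustrative and not taken from the malwi code base. The rounded inputs give a value slightly below the reported `0.91`, which would be consistent with the README rounding precision and recall before publishing them (an assumption, not something the commit states).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the README (already rounded to two decimals).
precision, recall = 0.94, 0.87
benign, malicious = 1_070_888, 152_984

f1 = f1_score(precision, recall)
ratio = benign / malicious

print(f"F1 from rounded precision/recall: {f1:.3f}")  # 0.904
print(f"Benign-to-malicious ratio: {ratio:.1f}:1")    # 7.0:1
```

The 7:1 class imbalance is worth keeping in mind when reading the scores: with this many more benign samples, precision and recall on the malicious class say more than plain accuracy would.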
99
+
100
+ ## What's next?
101
+
102
+ The first iteration focuses on **maliciousness of Python source code**.
103
+
104
+ Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
105
+
106
+
107
  ## Support
108
 
109
  Do you have access to malicious Rust, Go, or other packages? **Contact me.**
 
123
 
124
  # Only start training
125
  cmds/train.sh
126
+ ```
127
+
128
+ ### Triage
129
+
130
+ malwi uses a pipeline whose results can be improved by triaging them (see `src/research/triage.py`).
131
+ For automated triaging, you can use open-source models via Ollama (default: Gemma 3).
132
+
133
+ Download and run Gemma 3:
134
+
135
+ ```
136
+ ollama run gemma3
137
+ ```
138
+
139
+ Start auto-triaging:
140
+ ```
141
+ uv run python -m src.research.triage --triage-ollama --path <FOLDER_WITH_MALWI_YAML_RESULTS>
142
  ```
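To illustrate what LLM-assisted triage against a local Ollama server could look like, here is a minimal sketch. It is not the implementation in `src/research/triage.py`: the prompt wording, the function names (`build_payload`, `parse_verdict`, `triage`), and the three labels are assumptions made for this example. The only external interface it relies on is Ollama's local `/api/generate` endpoint with `stream` disabled.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(code_snippet: str, model: str = "gemma3") -> dict:
    """Build a non-streaming generate request asking for a one-word verdict."""
    # Hypothetical prompt, not the one used by malwi.
    prompt = (
        "You are reviewing a malware scanner finding. "
        "Answer with exactly one word, MALICIOUS or BENIGN.\n\n"
        f"Code:\n{code_snippet}"
    )
    return {"model": model, "prompt": prompt, "stream": False}


def parse_verdict(answer: str) -> str:
    """Map free-form model output to a label; unclear answers need a human."""
    text = answer.strip().upper()
    if text.startswith("MALICIOUS"):
        return "malicious"
    if text.startswith("BENIGN"):
        return "benign"
    return "unknown"


def triage(code_snippet: str) -> str:
    """Send one snippet to a locally running Ollama server and return a label."""
    data = json.dumps(build_payload(code_snippet)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return parse_verdict(body.get("response", ""))
```

With Ollama running and `gemma3` available, `triage("import subprocess; subprocess.run(value)")` would return one of the three labels. Keeping an explicit `unknown` label means ambiguous model output is routed to manual review instead of being silently counted as benign.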