schirrmacher committed
Commit a05ca61 · verified · 1 Parent(s): 2c1e82a

Update README.md

  1. README.md +34 -65
README.md CHANGED
@@ -20,16 +20,30 @@ pip install --user malwi
20
 
21
  2) **Run**
22
  ```
23
- malwi ./examples
24
  ```
25
 
26
  3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
27
  ```
28
- - 2 files scanned
29
- - 0 files skipped
30
- - 3 malicious objects
 
31
 
32
- => 👹 malicious 1.0
33
  ```
34
 
35
  ## Why malwi?
@@ -73,29 +87,18 @@ TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_c
73
  ### 3. Feed tokens into pre-trained DistilBERT
74
 
75
  ```
76
- => Maliciousness Score: 0.92
77
  ```
78
 
79
  This creates a list with malicious code objects. However, malicious code might be split into chunks and spread across
80
  a package. This is why the next layers are needed.
81
 
82
- ### 4. Create statistics about malicious activities
83
-
84
-
85
- | Object | DYNAMIC_CODE_EXECUTION | ENCODING_DECODING | FILESYSTEM_ACCESS | ... |
86
- |----------|------------------------|-------------------|-------------------|-----|
87
- | Object A | 0 | 1 | 0 | ... |
88
- | Object B | 1 | 2 | 1 | ... |
89
- | Object C | 0 | 0 | 2 | ... |
90
- | **Package** | **1** | **3** | **3** | **...** |
91
-
92
-
93
- ### 5. Take final decision
94
 
95
- An SVM layer takes statistics as input and decides if all findings combined are malicious.
96
 
97
  ```
98
- SVM => Malicious
99
  ```
100
 
101
  ## Benchmarks?
@@ -104,26 +107,16 @@ SVM => Malicious
104
 
105
  | Metric | Value |
106
  |----------------------------|-------------------------------|
107
- | F1 Score | 0.96 |
108
- | Recall | 0.95 |
109
- | Precision | 0.98 |
110
- | Training time | ~4 hours |
111
  | Hardware | NVIDIA RTX 4090 |
112
  | Epochs | 3 |
113
 
114
- ### SVM Layer
115
-
116
- | Metric | Value |
117
- |----------------------------|-------------------------------|
118
- | F1 Score | 0.96 |
119
- | Recall | 0.95 |
120
- | Precision | 0.95 |
121
 
122
  ## Limitations
123
 
124
- malwi compiles Python to bytecode, which is highly version dependent. The AI models are trained on that bytecode.
125
- Detection performance may therefore drop if a user's installed Python version produces different bytecode instructions. This effect has not been measured yet.
126
-
127
  The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
128
 
129
  ## What's next?
@@ -146,37 +139,13 @@ Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
146
  # Download and process data
147
  cmds/download_and_preprocess_distilbert.sh
148
 
149
- # Preprocess and train DistilBERT only
150
- cmds/preprocess_and_train_distilbert.sh
151
-
152
- # Preprocess and train SVM Layer only
153
- cmds/preprocess_and_train_svm.sh
154
-
155
- # Only preprocess data for DistilBERT
156
- cmds/preprocess_distilbert.sh
157
-
158
- # Only preprocess data for SVM Layer
159
- cmds/preprocess_svm.sh
160
-
161
- # Start DistilBERT training
162
- cmds/train_distilbert.sh
163
-
164
- # Start SVM Layer training
165
- cmds/train_svm_layer.sh
166
- ```
167
-
168
- ### Triage
169
-
170
- malwi uses a pipeline that can be enhanced by triaging its results (see `src/research/triage.py`). For automated triaging, you can leverage open-source models in combination with [Ollama](https://ollama.com/).
171
-
172
- #### Start LLM
173
-
174
- ```
175
- ollama run gemma3
176
- ```
177
 
178
- #### Start Triaging
 
179
 
 
 
 
180
  ```
181
- uv run python -m src.research.triage --triage-ollama --path <FOLDER_WITH_MALWI_YAML_RESULTS>
182
- ```
 
20
 
21
  2) **Run**
22
  ```
23
+ malwi examples/malicious
24
  ```
25
 
26
  3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
27
  ```
28
+ .--------.---.-| .--.--.--|__|
29
+ | | _ | | | | | |
30
+ |__|__|__|___._|__|________|__|
31
+ AI Python Malware Scanner
32
 
33
+
34
+ - target: examples/malicious
35
+ - files: 13
36
+   ├── scanned: 3
37
+   ├── skipped: 10
38
+   └── suspicious:
39
+       └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
40
+           └── <module>
41
+               ├── deserialization
42
+               ├── user io
43
+               ├── system interaction
44
+               └── process management
45
+
46
+ => 👹 malicious 1.00
47
  ```
48
 
49
  ## Why malwi?
 
87
  ### 3. Feed tokens into pre-trained DistilBERT
88
 
89
  ```
90
+ => Maliciousness: 0.92
91
  ```
92
 
93
  This creates a list with malicious code objects. However, malicious code might be split into chunks and spread across
94
  a package. This is why the next layers are needed.
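
The token line quoted in the hunk header earlier (`resume load_global subprocess load_attr run load_fast value ...`) reads like lowercased bytecode instruction names interleaved with the names they operate on. As a rough illustration only, the following Python sketch derives a similar token stream with the standard-library `dis` module. The helper `bytecode_tokens` and the sample function `run_command` are hypothetical and are not malwi's actual tokenizer; the exact instruction names depend on the Python version that compiles the code.

```python
import dis
import subprocess


def bytecode_tokens(func):
    """Return lowercased opcode names plus any string operands (hypothetical helper)."""
    tokens = []
    for instr in dis.get_instructions(func):
        tokens.append(instr.opname.lower())
        # Keep only string operands such as global, attribute, and local names.
        if isinstance(instr.argval, str):
            tokens.append(instr.argval)
    return tokens


def run_command(value):
    # Sample target; it is only disassembled here, never called.
    return subprocess.run(value)


tokens = bytecode_tokens(run_command)
print(" ".join(tokens))
```

Running this under two different interpreter versions is a quick way to see how the token stream shifts between Python releases.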
95
 
96
+ ### 4. Take final decision
97
 
98
+ The DistilBERT model makes the final maliciousness decision based on the token patterns.
99
 
100
  ```
101
+ => Maliciousness: 0.92
102
  ```
103
 
104
  ## Benchmarks?
 
107
 
108
  | Metric | Value |
109
  |----------------------------|-------------------------------|
110
+ | F1 Score | 0.944 |
111
+ | Recall | 0.906 |
112
+ | Precision | 0.984 |
113
+ | Training time | ~5 hours |
114
  | Hardware | NVIDIA RTX 4090 |
115
  | Epochs | 3 |
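
As a sanity check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall shown above; the function name `f1_score` is illustrative and not part of malwi. Because the inputs are already rounded to three decimals, the result (about 0.943) can differ from the reported 0.944 in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values taken from the benchmark table.
new_f1 = f1_score(precision=0.984, recall=0.906)
print(f"{new_f1:.3f}")  # prints 0.943
```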
116
117
 
118
  ## Limitations
119
 
 
 
 
120
  The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
121
 
122
  ## What's next?
 
139
  # Download and process data
140
  cmds/download_and_preprocess_distilbert.sh
141
 
142
+ # Complete pipelines
143
+ cmds/preprocess_and_train_distilbert.sh # Data → Tokenizer → DistilBERT
144
 
145
+ # Individual data preprocessing
146
+ cmds/preprocess_data.sh # Process data for ML training
147
 
148
+ # Individual model training
149
+ cmds/train_tokenizer.sh # Train custom tokenizer
150
+ cmds/train_distilbert.sh # Train DistilBERT model
151
  ```