schirrmacher committed on
Commit
7241c28
·
verified ·
1 Parent(s): a97893f

Update README.md

Files changed (1)
  1. README.md +39 -12
README.md CHANGED
@@ -55,7 +55,7 @@ The following datasets are used as a source for malicious samples:
55
  - [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
56
  - [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
57
 
58
- ### 1. malwi compiles Python files to bytecode
59
 
60
  ```
61
  def runcommand(value):
@@ -73,33 +73,57 @@ def runcommand(value):
73
  ...
74
  ```
75
 
76
- ### 2. Bytecode operators are mapped to tokens
77
 
78
  ```
79
  TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
80
  ```
81
 
82
- ### 3. Tokens are used as input for a pre-trained DistilBert
83
 
84
  ```
85
- Maliciousness: 0.9620079398155212
86
  ```
87
 
88
  ## Benchmarks?
89
 
90
- The current best model differentiates benign from malicious code with the following metrics:
91
 
92
  | Metric | Value |
93
  |----------------------------|-------------------------------|
94
- | F1 Score | 0.91 |
95
- | Recall | 0.87 |
96
- | Precision | 0.94 |
97
- | Unique benign samples | 1,070,888 |
98
- | Unique malicious samples | 152,984 |
99
  | Training time | ~4 hours |
100
  | Hardware | NVIDIA RTX 4090 |
101
  | Epochs | 3 |
102
 
103
  ## Limitations
104
 
105
  The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
@@ -127,8 +151,11 @@ cmds/preprocess.sh
127
  # Preprocess then start training
128
  cmds/preprocess_and_train.sh
129
 
130
- # Only start training
131
- cmds/train.sh
132
  ```
133
 
134
  ### Triage
 
55
  - [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
56
  - [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
57
 
58
+ ### 1. Compile Python files to bytecode
59
 
60
  ```
61
  def runcommand(value):
 
73
  ...
74
  ```
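As a rough illustration of the compile step, the sketch below uses Python's built-in `compile` to turn source text into a code object and then walks the nested code objects stored in `co_consts`. The helper `iter_code_objects` and the sample function `run_example` are hypothetical names chosen for this example; they are not taken from malwi, and the sample body is not the elided README snippet.

```python
import types


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


# Hypothetical sample source; not the elided README example.
source = """
import subprocess

def run_example(value):
    return subprocess.run(value, capture_output=True)
"""

# compile() only builds bytecode; the sample source is never executed.
module_code = compile(source, "<example>", "exec")
names = [code.co_name for code in iter_code_objects(module_code)]
print(names)
```

Each yielded code object (the module itself plus every function defined in it) could then be scored on its own, which is one plausible reading of the "code objects" mentioned in this README.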
75
 
76
+ ### 2. Map bytecode to tokens
77
 
78
  ```
79
  TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
80
  ```
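A minimal sketch of how bytecode could be mapped to tokens, assuming the standard-library `dis` module. The mapping rules here (lowercased opcode names, identifiers kept verbatim, integers collapsed to `INTEGER`, other strings to `STRING`, booleans to `BOOLEAN`) are assumptions made for illustration and may differ from malwi's actual mapping. Opcode names also vary between Python versions, so the exact token stream is not fixed.

```python
import dis

# Opcodes whose argument refers to a constant, a name, or a local variable.
_ARG_OPCODES = set(dis.hasconst) | set(dis.hasname) | set(dis.haslocal)


def describe(value):
    """Map an instruction argument to a token, or None to skip it (assumed rules)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, str):
        return value if value.isidentifier() else "STRING"
    return None


def tokenize(code):
    """Turn a code object into a flat list of tokens."""
    tokens = []
    for ins in dis.get_instructions(code):
        tokens.append(ins.opname.lower())
        if ins.opcode not in _ARG_OPCODES:
            continue
        # Some arguments are tuples (for example keyword names); flatten them.
        values = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
        for value in values:
            token = describe(value)
            if token is not None:
                tokens.append(token)
    return tokens


def sample(value):  # hypothetical function used only for this example
    count = 100000
    return len(value) + count


tokens = tokenize(sample.__code__)
print(" ".join(tokens))
```

Collapsing literals into placeholder tokens keeps the vocabulary small, which is presumably why the token stream above shows `INTEGER` rather than concrete values.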
81
 
82
+ ### 3. Feed tokens into pre-trained DistilBert
83
 
84
  ```
85
+ => Maliciousness Score: 0.92
86
+ ```
87
+
88
+ This produces a list of malicious code objects. However, malicious code might be split into chunks and spread across
89
+ a package. This is why the next layers are needed.
90
+
91
+ ### 4. Create statistics about malicious activities
92
+
93
+
94
+ | Object | DYNAMIC_CODE_EXECUTION | ENCODING_DECODING | FILESYSTEM_ACCESS | ... |
95
+ |----------|------------------------|-------------------|-------------------|-----|
96
+ | Object A | 0 | 1 | 0 | ... |
97
+ | Object B | 1 | 2 | 1 | ... |
98
+ | Object C | 0 | 0 | 2 | ... |
99
+ | **Package** | **1** | **3** | **3** | **...** |
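The package row in the table above is the column-wise sum of the per-object counts. A small sketch of that aggregation, using only the three categories and values shown in the table (the variable names are hypothetical, and the elided `...` columns are left out):

```python
from collections import Counter

CATEGORIES = ["DYNAMIC_CODE_EXECUTION", "ENCODING_DECODING", "FILESYSTEM_ACCESS"]

# Per-object counts copied from the table above.
object_stats = {
    "Object A": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 1, "FILESYSTEM_ACCESS": 0},
    "Object B": {"DYNAMIC_CODE_EXECUTION": 1, "ENCODING_DECODING": 2, "FILESYSTEM_ACCESS": 1},
    "Object C": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 0, "FILESYSTEM_ACCESS": 2},
}

# Sum each category across all objects to get the package-level row.
package_totals = Counter()
for counts in object_stats.values():
    package_totals.update(counts)

# Fixed-order vector that a downstream classifier could consume.
feature_vector = [package_totals[name] for name in CATEGORIES]
print(feature_vector)  # [1, 3, 3]
```

A fixed category order matters here: a classifier such as the SVM layer described in the next step would need every package vector to use the same column layout.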
100
+
101
+
102
+ ### 5. Make the final decision
103
+
104
+ An SVM layer takes these statistics as input and decides whether the combined findings indicate a malicious package.
105
+
106
+ ```
107
+ SVM => Malicious
108
  ```
109
 
110
  ## Benchmarks?
111
 
112
+ DistilBert:
113
 
114
  | Metric | Value |
115
  |----------------------------|-------------------------------|
116
+ | F1 Score | 0.96 |
117
+ | Recall | 0.95 |
118
+ | Precision | 0.98 |
119
  | Training time | ~4 hours |
120
  | Hardware | NVIDIA RTX 4090 |
121
  | Epochs | 3 |
122
 
123
+ SVM:
124
+
125
+ `Coming soon`
126
+
127
  ## Limitations
128
 
129
  The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
 
151
  # Preprocess then start training
152
  cmds/preprocess_and_train.sh
153
 
154
+ # Start DistilBert training
155
+ cmds/train_distilbert.sh
156
+
157
+ # Start SVM Layer training
158
+ cmds/train_svm_layer.sh
159
  ```
160
 
161
  ### Triage