Update README.md
README.md
CHANGED
|
@@ -55,7 +55,7 @@ The following datasets are used as a source for malicious samples:
|
|
| 55 |
- [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
|
| 56 |
- [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
|
| 57 |
|
| 58 |
-
### 1.
|
| 59 |
|
| 60 |
```
|
| 61 |
def runcommand(value):
|
|
@@ -73,33 +73,57 @@ def runcommand(value):
|
|
| 73 |
...
|
| 74 |
```
|
| 75 |
|
| 76 |
-
### 2.
|
| 77 |
|
| 78 |
```
|
| 79 |
TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
|
| 80 |
```
|
| 81 |
|
| 82 |
-
### 3.
|
| 83 |
|
| 84 |
```
|
| 85 |
-
Maliciousness: 0.
| 86 |
```
|
| 87 |
|
| 88 |
## Benchmarks?
|
| 89 |
|
| 90 |
-
|
| 91 |
|
| 92 |
| Metric | Value |
|
| 93 |
|----------------------------|-------------------------------|
|
| 94 |
-
| F1 Score | 0.
|
| 95 |
-
| Recall | 0.
|
| 96 |
-
| Precision | 0.
|
| 97 |
-
| Unique benign samples | 1,070,888 |
|
| 98 |
-
| Unique malicious samples | 152,984 |
|
| 99 |
| Training time | ~4 hours |
|
| 100 |
| Hardware | NVIDIA RTX 4090 |
|
| 101 |
| Epochs | 3 |
|
| 102 |
|
| 103 |
## Limitations
|
| 104 |
|
| 105 |
The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
|
|
@@ -127,8 +151,11 @@ cmds/preprocess.sh
|
|
| 127 |
# Preprocess then start training
|
| 128 |
cmds/preprocess_and_train.sh
|
| 129 |
|
| 130 |
-
#
|
| 131 |
-
cmds/
|
| 132 |
```
|
| 133 |
|
| 134 |
### Triage
|
|
|
|
| 55 |
- [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
|
| 56 |
- [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
|
| 57 |
|
| 58 |
+
### 1. Compile Python files to bytecode
|
| 59 |
|
| 60 |
```
|
| 61 |
def runcommand(value):
|
|
|
|
| 73 |
...
|
| 74 |
```
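A minimal sketch of this step using only the standard library, assuming each function or module body is treated as one code object. The sample source and the helper name `iter_code_objects` are hypothetical and not taken from malwi itself.

```python
import types

# Hypothetical sample; the function body is illustrative, not taken from the dataset.
SOURCE = """
import subprocess

def runcommand(value):
    output = subprocess.run(value, capture_output=True, shell=True)
    return [output.stdout, output.stderr]
"""

def iter_code_objects(source, filename="<sample>"):
    """Compile source text (without executing it) and yield every code object."""
    pending = [compile(source, filename, "exec")]
    while pending:
        code = pending.pop()
        yield code
        # Nested functions, classes and lambdas are stored as constants.
        pending.extend(c for c in code.co_consts if isinstance(c, types.CodeType))

names = [code.co_name for code in iter_code_objects(SOURCE)]
print(names)  # ['<module>', 'runcommand']
```

`compile` only parses and compiles the text, so the sample is never run.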
|
| 75 |
|
| 76 |
+
### 2. Map bytecode to tokens
|
| 77 |
|
| 78 |
```
|
| 79 |
TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
|
| 80 |
```
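The mapping could be sketched with the standard `dis` module as follows. The placeholder scheme (`INTEGER`, `FLOAT`, `STRING`) and the helper names are assumptions for illustration; the exact opcode names depend on the Python version, so the output will not match the sequence above token for token.

```python
import dis

def placeholder(const):
    """Replace a literal constant with a coarse type token (hypothetical scheme)."""
    if isinstance(const, (bool, int)):
        return "INTEGER"
    if isinstance(const, float):
        return "FLOAT"
    if isinstance(const, str):
        return "STRING"
    return None  # other constants (None, tuples, code objects) are skipped

def to_tokens(code):
    """Turn a code object into a flat list of lowercase opcode and name tokens."""
    tokens = []
    for ins in dis.get_instructions(code):
        tokens.append(ins.opname.lower())
        if ins.opname == "LOAD_CONST":
            token = placeholder(ins.argval)
            if token:
                tokens.append(token)
        elif isinstance(ins.argval, str):
            # Names of globals, attributes and local variables are kept as-is.
            tokens.append(ins.argval)
    return tokens

def sample(value):
    return len(value)

tokens = to_tokens(sample.__code__)
print(" ".join(tokens))
```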
|
| 81 |
|
| 82 |
+
### 3. Feed tokens into a pre-trained DistilBERT model
|
| 83 |
|
| 84 |
```
|
| 85 |
+
=> Maliciousness Score: 0.92
|
| 86 |
+
```
|
| 87 |
+
|
| 88 |
+
This produces a list of malicious code objects. However, malicious code might be split into chunks and spread across a package, which is why the next layers are needed.
|
| 90 |
+
|
| 91 |
+
### 4. Create statistics about malicious activities
|
| 92 |
+
|
| 93 |
+
|
| 94 |
+
| Object      | DYNAMIC_CODE_EXECUTION | ENCODING_DECODING | FILESYSTEM_ACCESS | ...     |
|-------------|------------------------|-------------------|-------------------|---------|
| Object A    | 0                      | 1                 | 0                 | ...     |
| Object B    | 1                      | 2                 | 1                 | ...     |
| Object C    | 0                      | 0                 | 2                 | ...     |
| **Package** | **1**                  | **3**             | **3**             | **...** |
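The package row is the column-wise sum of the object rows. A minimal sketch of that aggregation, using the three categories shown in the table; the data layout and the function name `package_totals` are assumptions, not malwi's actual internals.

```python
from collections import Counter

# Per-object counts copied from the table above (elided columns omitted).
object_stats = {
    "Object A": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 1, "FILESYSTEM_ACCESS": 0},
    "Object B": {"DYNAMIC_CODE_EXECUTION": 1, "ENCODING_DECODING": 2, "FILESYSTEM_ACCESS": 1},
    "Object C": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 0, "FILESYSTEM_ACCESS": 2},
}

def package_totals(stats):
    """Sum the per-object activity counts into one row for the whole package."""
    totals = Counter()
    for counts in stats.values():
        totals.update(counts)  # adds counts category by category
    return dict(totals)

print(package_totals(object_stats))
# {'DYNAMIC_CODE_EXECUTION': 1, 'ENCODING_DECODING': 3, 'FILESYSTEM_ACCESS': 3}
```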
|
| 100 |
+
|
| 101 |
+
|
| 102 |
+
### 5. Make the final decision
|
| 103 |
+
|
| 104 |
+
An SVM layer takes these statistics as input and decides whether the combined findings are malicious.
|
| 105 |
+
|
| 106 |
+
```
|
| 107 |
+
SVM => Malicious
|
| 108 |
```
|
| 109 |
|
| 110 |
## Benchmarks?
|
| 111 |
|
| 112 |
+
DistilBERT:
|
| 113 |
|
| 114 |
| Metric        | Value           |
|---------------|-----------------|
| F1 Score      | 0.96            |
| Recall        | 0.95            |
| Precision     | 0.98            |
| Training time | ~4 hours        |
| Hardware      | NVIDIA RTX 4090 |
| Epochs        | 3               |
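As a quick consistency check, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values in the table, so it can only confirm agreement to two decimal places.

```python
precision = 0.98
recall = 0.95

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.96
```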
|
| 122 |
|
| 123 |
+
SVM:
|
| 124 |
+
|
| 125 |
+
`Coming soon`
|
| 126 |
+
|
| 127 |
## Limitations
|
| 128 |
|
| 129 |
The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
|
|
|
|
| 151 |
# Preprocess then start training
|
| 152 |
cmds/preprocess_and_train.sh
|
| 153 |
|
| 154 |
+
# Start DistilBERT training
|
| 155 |
+
cmds/train_distilbert.sh
|
| 156 |
+
|
| 157 |
+
# Start SVM Layer training
|
| 158 |
+
cmds/train_svm_layer.sh
|
| 159 |
```
|
| 160 |
|
| 161 |
### Triage
|