schirrmacher/malwi


schirrmacher committed on Jun 12, 2025

Commit 7241c28 (verified) · 1 Parent(s): a97893f

Update README.md


README.md +39 -12


@@ -55,7 +55,7 @@ The following datasets are used as a source for malicious samples:
 - [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
 - [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
-### 1. malwi compiles Python files to bytecode
 ```
 def runcommand(value):
@@ -73,33 +73,57 @@ def runcommand(value):
   ...
 ```
-### 2. Bytecode operators are mapped to tokens
 ```
 TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
 ```
-### 3. Tokens are used as input for a pre-trained DistilBert
 ```
-Maliciousness: 0.9620079398155212
 ```
 ## Benchmarks?
-The current best model differentiates benign from malicious code with the following metrics:
 | Metric                     | Value                         |
 |----------------------------|-------------------------------|
-| F1 Score                   | 0.91                          |
-| Recall                     | 0.87                          |
-| Precision                  | 0.94                          |
-| Unique benign samples      | 1,070,888                     |
-| Unique malicious samples   | 152,984                       |
 | Training time              | ~4 hours                      |
 | Hardware                   | NVIDIA RTX 4090               |
 | Epochs                     | 3                             |
 ## Limitations
 The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
@@ -127,8 +151,11 @@ cmds/preprocess.sh
 # Preprocess then start training
 cmds/preprocess_and_train.sh
-# Only start training
-cmds/train.sh
 ```
 ### Triage

 - [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
 - [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
+### 1. Compile Python files to bytecode
 ```
 def runcommand(value):
   ...
 ```
+### 2. Map bytecode to tokens
 ```
 TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
 ```
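Steps 1 and 2 can be approximated with the standard library alone. The sketch below is not malwi's actual tokenizer: the helper names (`iter_code_objects`, `arg_tokens`, `tokenize_source`) and the `STRING` placeholder are assumptions made for illustration, and the exact opcode names depend on the Python version in use. It compiles a source string without executing it, walks nested code objects, and emits lowercase opcode names followed by simplified arguments.

```python
import dis
import types

# Opcodes whose argument is a literal constant rather than a name.
CONST_OPS = {"LOAD_CONST", "LOAD_SMALL_INT"}


def iter_code_objects(code):
    """Yield a code object and every code object nested in its constants."""
    yield code
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from iter_code_objects(const)


def arg_tokens(instr):
    """Map one instruction argument to zero or more tokens (hypothetical rules)."""
    arg = instr.argval
    is_const = instr.opname in CONST_OPS
    if isinstance(arg, str):
        # String literals are masked; identifiers are kept as-is.
        return ["STRING"] if is_const else [arg]
    if is_const and isinstance(arg, int):
        # bool is a subclass of int, so True/False also become INTEGER.
        return ["INTEGER"]
    if not is_const and isinstance(arg, tuple) and all(isinstance(a, str) for a in arg):
        # Some opcodes carry several names at once, e.g. keyword names.
        return list(arg)
    return []


def tokenize_source(source):
    """Compile source to bytecode (without running it) and flatten it to tokens."""
    module = compile(source, "<sample>", "exec")
    tokens = []
    for code in iter_code_objects(module):
        for instr in dis.get_instructions(code):
            tokens.append(instr.opname.lower())
            tokens.extend(arg_tokens(instr))
    return tokens


SAMPLE = (
    "import subprocess\n"
    "def runcommand(value):\n"
    "    return subprocess.run(value, shell=True)\n"
)
tokens = tokenize_source(SAMPLE)
print(" ".join(tokens))
```

Masking literals while keeping identifiers keeps the vocabulary small but preserves names such as `subprocess` and `run`, which is the kind of signal visible in the token line above.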
+### 3. Feed tokens into a pre-trained DistilBert
 ```
+=> Maliciousness Score: 0.92
+```
+This yields a list of malicious code objects. However, malicious code may be split into chunks and spread across
+a package, which is why the next layers are needed.
+### 4. Create statistics about malicious activities
+| Object   | DYNAMIC_CODE_EXECUTION | ENCODING_DECODING | FILESYSTEM_ACCESS | ... |
+|----------|------------------------|-------------------|-------------------|-----|
+| Object A | 0                      | 1                 | 0                 | ... |
+| Object B | 1                      | 2                 | 1                 | ... |
+| Object C | 0                      | 0                 | 2                 | ... |
+| **Package**  | **1**                      | **3**                 | **3**                 | **...** |
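The **Package** row in the table is the column-wise sum of the per-object counts. A minimal sketch of that aggregation follows; the function name `package_statistics` and the dictionary layout are assumptions for illustration, not malwi's internal data model, and only the three categories shown in the table are included.

```python
from collections import Counter

# Only the categories visible in the table; the real list is longer.
CATEGORIES = ("DYNAMIC_CODE_EXECUTION", "ENCODING_DECODING", "FILESYSTEM_ACCESS")

# Per-object counts copied from the table above.
OBJECTS = {
    "Object A": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 1, "FILESYSTEM_ACCESS": 0},
    "Object B": {"DYNAMIC_CODE_EXECUTION": 1, "ENCODING_DECODING": 2, "FILESYSTEM_ACCESS": 1},
    "Object C": {"DYNAMIC_CODE_EXECUTION": 0, "ENCODING_DECODING": 0, "FILESYSTEM_ACCESS": 2},
}


def package_statistics(objects):
    """Sum the per-object activity counts into one row per package."""
    totals = Counter()
    for counts in objects.values():
        totals.update(counts)  # adds counts key by key
    return {category: totals[category] for category in CATEGORIES}


package_row = package_statistics(OBJECTS)
print(package_row)
```

Returning the totals in a fixed category order makes the row usable as a feature vector for a downstream classifier.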
+### 5. Make the final decision
+An SVM layer takes these statistics as input and decides whether the combined findings are malicious.
+```
+SVM => Malicious
 ```
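To make the decision step concrete, the sketch below shows how a linear SVM classifies a feature vector once trained: it evaluates the sign of `w · x + b`. The README does not state which kernel malwi uses, so the linear form is an assumption, and the weights, bias, and the `Benign` label are invented for illustration rather than taken from a trained model.

```python
# Hypothetical parameters; a real model would learn these from labelled packages.
WEIGHTS = (1.5, 0.5, 0.25)  # one weight per activity category
BIAS = -2.0


def svm_decide(features, weights=WEIGHTS, bias=BIAS):
    """Linear SVM decision rule: classify by the sign of w . x + b."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "Malicious" if score > 0 else "Benign"


# Package row from the statistics table: (1, 3, 3)
print("SVM =>", svm_decide((1, 3, 3)))
```

With these made-up parameters the package row scores 1.75 and is flagged, while a package with a single filesystem access, `(0, 0, 1)`, scores -1.75 and is not.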
 ## Benchmarks?
+DistilBert:
 | Metric                     | Value                         |
 |----------------------------|-------------------------------|
+| F1 Score                   | 0.96                          |
+| Recall                     | 0.95                          |
+| Precision                  | 0.98                          |
 | Training time              | ~4 hours                      |
 | Hardware                   | NVIDIA RTX 4090               |
 | Epochs                     | 3                             |
+SVM:
+`Coming soon`
 ## Limitations
 The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
 # Preprocess then start training
 cmds/preprocess_and_train.sh
+# Start DistilBert training
+cmds/train_distilbert.sh
+# Start SVM Layer training
+cmds/train_svm_layer.sh
 ```
 ### Triage