schirrmacher committed on
Commit
e29f6b3
·
verified ·
1 Parent(s): 9c0ec86

Update README.md

Files changed (1)
  1. README.md +32 -25
README.md CHANGED
@@ -25,13 +25,11 @@ malwi ./examples
25
 
26
  3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
27
  ```
28
- def runcommand(value):
29
- output = subprocess.run(value, shell=True, capture_output=True)
30
- return [output.stdout, output.stderr]
31
 
32
- ## examples/__init__.py
33
- - Object: runcommand
34
- - Maliciousness: 👹 0.9620079398155212
35
  ```
36
 
37
  ## Why malwi?
@@ -44,16 +42,9 @@ Typical malware behaviors include:
44
  - _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
45
  - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
46
 
47
- > ⚠️ **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
48
- Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
49
-
50
  ## How does it work?
51
 
52
- malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
53
-
54
- The following datasets are used as a source for malicious samples:
55
- - [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
56
- - [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
57
 
58
  ### 1. Compile Python files to bytecode
59
 
@@ -109,7 +100,7 @@ SVM => Malicious
109
 
110
  ## Benchmarks?
111
 
112
- DistilBert:
113
 
114
  | Metric | Value |
115
  |----------------------------|-------------------------------|
@@ -120,12 +111,19 @@ DistilBert:
120
  | Hardware | NVIDIA RTX 4090 |
121
  | Epochs | 3 |
122
 
123
- SVM:
124
 
125
- `Coming soon`
 
 
 
 
126
 
127
  ## Limitations
128
 
 
 
 
129
  The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
130
 
131
  ## What's next?
@@ -140,18 +138,27 @@ Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
140
 
141
  ### Develop
142
 
143
- Prerequisites: [uv](https://docs.astral.sh/uv/)
144
- ```
 
 
 
145
  # Download and process data
146
- cmds/download_and_preprocess.sh
 
 
 
 
 
 
147
 
148
- # Only process data
149
- cmds/preprocess.sh
150
 
151
- # Preprocess then start training
152
- cmds/preprocess_and_train.sh
153
 
154
- # Start DistilBert training
155
  cmds/train_distilbert.sh
156
 
157
  # Start SVM Layer training
 
25
 
26
  3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
27
  ```
28
+ - 2 files scanned
29
+ - 0 files skipped
30
+ - 3 malicious objects
31
 
32
+ => 👹 malicious 1.0
 
 
33
  ```
34
 
35
  ## Why malwi?
 
42
  - _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
43
  - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
44
 
 
 
 
45
  ## How does it work?
46
 
47
+ malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1). The [malwi-samples](https://github.com/schirrmacher/malwi-samples) dataset is used for training.
 
 
 
 
48
 
49
  ### 1. Compile Python files to bytecode
50
 
 
100
 
101
  ## Benchmarks?
102
 
103
+ ### DistilBert
104
 
105
  | Metric | Value |
106
  |----------------------------|-------------------------------|
 
111
  | Hardware | NVIDIA RTX 4090 |
112
  | Epochs | 3 |
113
 
114
+ ### SVM Layer
115
 
116
+ | Metric | Value |
117
+ |----------------------------|-------------------------------|
118
+ | F1 Score | 0.96 |
119
+ | Recall | 0.95 |
120
+ | Precision | 0.95 |
121
 
122
  ## Limitations
123
 
124
+ malwi compiles Python to bytecode, which is highly version-dependent. The AI models are trained on that bytecode.
125
+ This means detection performance might drop if the user's installed Python version emits different bytecode instructions. No data on this is available yet.
126
+
127
  The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
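
The version dependence of bytecode mentioned in this section can be observed with the standard-library `dis` module. The following is a minimal sketch, not malwi's own compilation pipeline: the helper name `opcode_names` and the sample line are hypothetical and only serve to illustrate the point. Running it under different interpreters prints different opcode sequences; for instance, CPython 3.11 replaced `CALL_FUNCTION` with `CALL`.

```python
import dis
import sys


def opcode_names(source):
    """Compile source text and return the opcode names of its module-level code."""
    code = compile(source, "<sample>", "exec")
    return [instr.opname for instr in dis.get_instructions(code)]


# Hypothetical one-line sample: a call followed by an assignment.
sample = "output = run(value)\n"

# Label the result with the interpreter version, since the opcodes depend on it.
version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"Python {version}: {' '.join(opcode_names(sample))}")
```

Comparing the printed lines from two interpreter versions gives a rough idea of how far the input seen by a bytecode-trained model can shift between them.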
128
 
129
  ## What's next?
 
138
 
139
  ### Develop
140
 
141
+ **Prerequisites:**
142
+ - [uv](https://docs.astral.sh/uv/)
143
+ - Download [malwi-samples](https://github.com/schirrmacher/malwi-samples) in the same parent folder
144
+
145
+ ```bash
146
  # Download and process data
147
+ cmds/download_and_preprocess_distilbert.sh
148
+
149
+ # Preprocess and train DistilBERT only
150
+ cmds/preprocess_and_train_distilbert.sh
151
+
152
+ # Preprocess and train SVM Layer only
153
+ cmds/preprocess_and_train_svm.sh
154
 
155
+ # Only preprocess data for DistilBERT
156
+ cmds/preprocess_distilbert.sh
157
 
158
+ # Only preprocess data for SVM Layer
159
+ cmds/preprocess_svm.sh
160
 
161
+ # Start DistilBERT training
162
  cmds/train_distilbert.sh
163
 
164
  # Start SVM Layer training