schirrmacher committed
Commit ff0f686 · verified · 1 Parent(s): 59650b6

Update README.md

Files changed (1)
  1. README.md +72 -18
README.md CHANGED
@@ -3,15 +3,38 @@ license: mit
3
  ---
4
  # malwi - AI Python Malware Scanner
5
 
6
  Detect Python malware _fast_ - no internet, no expensive hardware, no fees.
7
 
8
- malwi is specialized in detecting **zero-day vulnerabilities**, achieving an impressive **97% accuracy** in classifying code as safe or harmful.
9
 
10
  Open-source software made in Europe.
11
  Based on open research, open code, open data.
12
  🇪🇺🤘🕊️
13
 
14
- ## Why malwi?
15
 
16
  [The number of _malicious open-source packages_ is growing](https://arxiv.org/pdf/2404.04991). This is not just a threat to your business but also to the open-source community.
17
 
@@ -22,16 +45,7 @@ Typical malware behaviors include:
22
  - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
23
 
24
  > **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
25
- Make sure to *NOT* download package artifacts with commands like:
26
- >```
27
- ># Do NOT RUN THE FOLLOWING COMMANDS on malicious packages!!!
28
- >
29
- >uv add <MALICIOUS_PACKAGE>
30
- >pip install <MALICIOUS_PACKAGE>
31
- >pipenv install <MALICIOUS_PACKAGE>
32
- >poetry add <MALICIOUS_PACKAGE>
33
- >conda install <MALICIOUS_PACKAGE>
34
- >```
35
 
36
  ## What's next?
37
 
@@ -41,15 +55,55 @@ Future iterations will cover malware scanning for more languages (JavaScript, Ru
41
 
42
  ## How does it work?
43
 
44
- malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
45
- Additionally, malwi applies [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) for creating Abstract Syntax Tree (ASTs) which are mapped to a unified and security sensitive syntax used as training input. The Python malware dataset can be found [here](https://github.com/lxyeternal/pypi_malregistry). After 3 epochs of training you will get: Loss: `0.0986`, Accuracy: `0.9669`, F1: `0.9666`.
46
 
47
- High-level training pipeline:
48
 
49
- - Create dataset from malicious/benign repositories and map code to malwi syntax
50
- - Remove code duplications based on hashes
51
- - Train DistilBert based on the malwi samples for categorizing malicious/benign
52
 
53
  ## Support
54
 
55
  Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
3
  ---
4
  # malwi - AI Python Malware Scanner
5
 
6
+ <img src="malwi-logo.png" alt="Logo">
7
+
8
  Detect Python malware _fast_ - no internet, no expensive hardware, no fees.
9
 
10
+ malwi is specialized in detecting **zero-day vulnerabilities**, for classifying code as safe or harmful.
11
 
12
  Open-source software made in Europe.
13
  Based on open research, open code, open data.
14
  🇪🇺🤘🕊️
15
 
16
+ 1) **Install**
17
+ ```
18
+ pip install --user malwi
19
+ ```
20
+
21
+ 2) **Run**
22
+ ```
23
+ malwi ./examples
24
+ ```
25
+
26
+ 3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
27
+ ```
28
+ def runcommand(value):
29
+     output = subprocess.run(value, shell=True, capture_output=True)
30
+     return [output.stdout, output.stderr]
31
+
32
+ ## examples/__init__.py
33
+ - Object: runcommand
34
+ - Maliciousness: 👹 0.9620079398155212
35
+ ```
36
+
37
+ ## Why malwi?
38
 
39
  [The number of _malicious open-source packages_ is growing](https://arxiv.org/pdf/2404.04991). This is not just a threat to your business but also to the open-source community.
40
 
 
45
  - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
46
 
47
  > **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
48
+ Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
49
 
50
  ## What's next?
51
 
 
55
 
56
  ## How does it work?
57
 
58
+ malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1). [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry) is used as a source for malicious samples.
59
+
60
+ 1. malwi compiles Python files to bytecode:
61
+
62
+ ```
63
+ def runcommand(value):
64
+     output = subprocess.run(value, shell=True, capture_output=True)
65
+     return [output.stdout, output.stderr]
66
+ ```
67
+
68
+ ```
69
+ 0 RESUME 0
70
 
71
+ 1 LOAD_CONST 0 (<code object runcommand at 0x5b4f60ae7540, file "example.py", line 1>)
72
+ MAKE_FUNCTION
73
+ STORE_NAME 0 (runcommand)
74
+ RETURN_CONST 1 (None)
75
+ ...
76
+ ```
77
 
78
+ 2. Bytecode operators are mapped to tokens:
79
+
80
+ ```
81
+ TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
82
+ ```
83
+
84
+ 3. Tokens are used as input for a pre-trained DistilBert:
85
+
86
+ ```
87
+ Maliciousness: 0.9620079398155212
88
+ ```
89
 
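Steps 1 and 2 of the new "How does it work?" section can be approximated with the standard library alone. The following is a minimal sketch, not malwi's implementation: the names `code_to_tokens` and `const_tokens` and the generalization rules (`INTEGER`, `FLOAT`, `STRING`, `CONST`) are assumptions made for illustration, and the exact opcodes emitted depend on the CPython version. It compiles the sample to bytecode with `compile()`, walks the instructions with `dis.get_instructions`, and emits lowercase opcode names followed by their symbolic arguments.

```python
import dis
import types

# Sample taken from the README; the import is added so the module compiles on its own.
SOURCE = '''
import subprocess

def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def const_tokens(value):
    """Generalize a constant so literal values do not leak into the token stream.

    These placeholder names are assumptions, not malwi's actual vocabulary.
    """
    if value is None:
        return []
    if isinstance(value, (bool, int)):
        return ["INTEGER"]
    if isinstance(value, float):
        return ["FLOAT"]
    if isinstance(value, str):
        return ["STRING"]
    if isinstance(value, tuple) and all(isinstance(v, str) for v in value):
        # Tuples of strings are typically keyword-argument names; keep them.
        return list(value)
    return ["CONST"]


def code_to_tokens(code):
    """Flatten a code object, including nested ones, into a list of string tokens."""
    tokens = []
    for instr in dis.get_instructions(code):
        tokens.append(instr.opname.lower())
        arg = instr.argval
        if isinstance(arg, types.CodeType):
            # Nested function or class body: descend into it.
            tokens.extend(code_to_tokens(arg))
        elif instr.opcode in dis.hasconst:
            tokens.extend(const_tokens(arg))
        elif isinstance(arg, str):
            # Names of globals, attributes, and locals are kept as-is.
            tokens.append(arg)
        elif isinstance(arg, tuple):
            # Some newer opcodes carry several names at once.
            tokens.extend(str(a) for a in arg)
    return tokens


tokens = code_to_tokens(compile(SOURCE, "example.py", "exec"))
print(" ".join(tokens))
```

The printed sequence has the same shape as the token line shown in step 2, although the details will differ from malwi's output. Numeric arguments that are not constants (jump targets, call argument counts) are dropped on purpose so that only opcode names and symbolic names remain.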
90
  ## Support
91
 
92
  Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
93
+
94
+ ### Develop
95
+
96
+ Prerequisites: [uv](https://docs.astral.sh/uv/)
97
+ ```
98
+ # Download and process data
99
+ cmds/download_and_preprocess.sh
100
+
101
+ # Only process data
102
+ cmds/preprocess.sh
103
+
104
+ # Preprocess then start training
105
+ cmds/preprocess_and_train.sh
106
+
107
+ # Only start training
108
+ cmds/train.sh
109
+ ```