schirrmacher committed on
Commit
b42a5f8
·
verified ·
1 Parent(s): 6e5a7d9

Update README.md

Files changed (1)
  1. README.md +54 -3
README.md CHANGED
@@ -1,3 +1,54 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ ---
4
+
5
+ Detect Python malware _fast_ - no internet, no expensive hardware, no fees.
6
+
7
+ malwi specializes in detecting **zero-day vulnerabilities**, classifying code as safe or harmful with about **97% accuracy**.
8
+
9
+ Open-source software made in Europe.
10
+ Based on open research, open code, open data.
11
+ 🇪🇺🤘🕊️
12
+
13
+ ## Why malwi?
14
+
15
+ [The number of _malicious open-source packages_ is growing](https://arxiv.org/pdf/2404.04991). This is not just a threat to your business but also to the open-source community.
16
+
17
+ Typical malware behaviors include:
18
+
19
+ - _Exfiltration_ of data: Stealing credentials, API keys, or sensitive user data.
20
+ - _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
21
+ - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
22
+
23
+ > **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
24
+ > Do *NOT* fetch or install suspicious packages with commands like:
25
+ >```
26
+ ># Do NOT run the following commands on malicious packages!
27
+ >
28
+ >uv add <MALICIOUS_PACKAGE>
29
+ >pip install <MALICIOUS_PACKAGE>
30
+ >pipenv install <MALICIOUS_PACKAGE>
31
+ >poetry add <MALICIOUS_PACKAGE>
32
+ >conda install <MALICIOUS_PACKAGE>
33
+ >```
34
+
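A suspicious package's source can be obtained for scanning without running an installer by fetching the archive directly and reading it without executing anything. The following is a minimal sketch under stated assumptions, not part of malwi: it assumes the metadata dictionary comes from the PyPI JSON API (`https://pypi.org/pypi/<name>/json`) and that the source distribution is a `.tar.gz` archive. The helper names `pick_sdist` and `list_members` are hypothetical.

```python
import io
import tarfile
from typing import Optional


def pick_sdist(metadata: dict) -> Optional[dict]:
    """Return the first source-distribution entry from PyPI JSON metadata."""
    for entry in metadata.get("urls", []):
        if entry.get("packagetype") == "sdist":
            return entry
    return None


def list_members(archive_bytes: bytes) -> list:
    """List file names inside a .tar.gz archive without extracting it."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return tar.getnames()
```

Downloading the bytes behind the entry's `url` field (for example with `urllib.request`) and passing them to `list_members` only reads the archive, so no `setup.py` is run.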
35
+ ## What's next?
36
+
37
+ The first iteration focuses on **maliciousness of Python source code**.
38
+
39
+ Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
40
+
41
+ ## How does it work?
42
+
43
+ malwi applies [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
44
+ Additionally, malwi uses [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) to create Abstract Syntax Trees (ASTs), which are mapped to a unified, security-sensitive syntax used as training input. The Python malware dataset is [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry). After 3 epochs of training, the model reaches: Loss: `0.0986`, Accuracy: `0.9669`, F1: `0.9666`.
45
+
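To illustrate the idea of mapping an AST to a compact, security-oriented token sequence, here is a rough sketch. It makes several assumptions: it uses Python's built-in `ast` module as a stand-in for Tree-sitter, and the token names and the `SENSITIVE_CALLS` set are invented for illustration. It does not reproduce the actual malwi syntax.

```python
import ast

# Hypothetical set of call names treated as security-sensitive.
SENSITIVE_CALLS = {"eval", "exec", "system", "urlopen"}


class TokenMapper(ast.NodeVisitor):
    """Walk the AST in source order and emit coarse tokens."""

    def __init__(self):
        self.tokens = []

    def visit_Import(self, node):
        self.tokens.append("IMPORT")
        self.generic_visit(node)

    # "from x import y" is mapped to the same token.
    visit_ImportFrom = visit_Import

    def visit_FunctionDef(self, node):
        self.tokens.append("FUNCTION_DEF")
        self.generic_visit(node)

    def visit_Call(self, node):
        # Plain names carry `id`, attribute access carries `attr`.
        func = node.func
        name = getattr(func, "id", None) or getattr(func, "attr", None)
        if name in SENSITIVE_CALLS:
            self.tokens.append("CALL_SENSITIVE_" + name.upper())
        else:
            self.tokens.append("CALL")
        self.generic_visit(node)


def to_tokens(source: str) -> list:
    """Parse source code (without executing it) and return its tokens."""
    mapper = TokenMapper()
    mapper.visit(ast.parse(source))
    return mapper.tokens
```

For example, `to_tokens("import os\nos.system('ls')")` yields `["IMPORT", "CALL_SENSITIVE_SYSTEM"]`. Collapsing code into a small vocabulary like this keeps sequences short and puts risky operations in front of the classifier.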
46
+ High-level training pipeline:
47
+
48
+ - Create dataset from malicious/benign repositories and map code to malwi syntax
49
+ - Remove duplicate code samples based on their hashes
50
+ - Train DistilBERT on the malwi samples to classify code as malicious or benign
51
+
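The hash-based deduplication step in the pipeline above could look roughly like the following sketch. SHA-256 and the function name `deduplicate` are assumptions; the source does not state which hash function or granularity malwi uses.

```python
import hashlib


def deduplicate(samples: list) -> list:
    """Keep the first occurrence of each sample, compared by SHA-256 digest."""
    seen = set()
    unique = []
    for sample in samples:
        digest = hashlib.sha256(sample.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique
```

Deduplicating before training helps keep identical samples from being counted multiple times or appearing in both the training and evaluation splits.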
52
+ ## Support
53
+
54
+ Do you have access to malicious packages for Rust, Go, or other ecosystems? **Contact me.**