Update README.md
README.md
CHANGED
|
@@ -1,3 +1,54 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: mit
|
| 3 |
-
---
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
---
|
| 4 |
+
|
| 5 |
+
Detect Python malware _fast_ - no internet, no expensive hardware, no fees.
|
| 6 |
+
|
| 7 |
+
malwi specializes in detecting **zero-day malware**, classifying code as safe or harmful with about **97% accuracy**.
|
| 8 |
+
|
| 9 |
+
Open-source software made in Europe.
|
| 10 |
+
Based on open research, open code, open data.
|
| 11 |
+
🇪🇺🤘🕊️
|
| 12 |
+
|
| 13 |
+
## Why malwi?
|
| 14 |
+
|
| 15 |
+
[The number of _malicious open-source packages_ is growing](https://arxiv.org/pdf/2404.04991). This is not just a threat to your business but also to the open-source community.
|
| 16 |
+
|
| 17 |
+
Typical malware behaviors include:
|
| 18 |
+
|
| 19 |
+
- _Exfiltration_ of data: Stealing credentials, API keys, or sensitive user data.
|
| 20 |
+
- _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
|
| 21 |
+
- _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
|
| 22 |
+
|
| 23 |
+
> **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
|
| 24 |
+
Do *NOT* fetch or install suspicious packages with commands like:
|
| 25 |
+
>```
|
| 26 |
+
># Do NOT run the following commands on malicious packages!
|
| 27 |
+
>
|
| 28 |
+
>uv add <MALICIOUS_PACKAGE>
|
| 29 |
+
>pip install <MALICIOUS_PACKAGE>
|
| 30 |
+
>pipenv install <MALICIOUS_PACKAGE>
|
| 31 |
+
>poetry add <MALICIOUS_PACKAGE>
|
| 32 |
+
>conda install <MALICIOUS_PACKAGE>
|
| 33 |
+
>```
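One way to look at a package without triggering its installation code is to read the archive as plain data. The sketch below is not part of malwi: it assumes you already obtained a `.tar.gz` source distribution by some means that does not run it, and the helper names `list_python_sources` and `build_demo_archive` are hypothetical. It only reads file contents into memory and never extracts to disk or imports anything from the package.

```python
import io
import tarfile


def list_python_sources(archive_bytes: bytes) -> dict[str, str]:
    """Read .py files out of a .tar.gz archive without extracting or running anything."""
    sources = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            # Skip directories, symlinks, and devices; only regular .py files are read.
            if not member.isfile() or not member.name.endswith(".py"):
                continue
            handle = archive.extractfile(member)
            if handle is not None:
                sources[member.name] = handle.read().decode("utf-8", errors="replace")
    return sources


def build_demo_archive() -> bytes:
    """Create a tiny in-memory .tar.gz that stands in for a downloaded package."""
    files = {
        "demo-0.1/setup.py": "import os\nos.system('echo pwned')\n",
        "demo-0.1/README.md": "# demo\n",
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


if __name__ == "__main__":
    for path, text in list_python_sources(build_demo_archive()).items():
        print(path, len(text), "characters")
```

The returned dictionary maps archive paths to source text, which can then be handed to a static scanner. The `setup.py` in the demo archive is only ever treated as a string.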
|
| 34 |
+
|
| 35 |
+
## What's next?
|
| 36 |
+
|
| 37 |
+
The first iteration focuses on detecting **malicious Python source code**.
|
| 38 |
+
|
| 39 |
+
Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
|
| 40 |
+
|
| 41 |
+
## How does it work?
|
| 42 |
+
|
| 43 |
+
malwi combines [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) with Support Vector Machines (SVMs), following the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
|
| 44 |
+
Additionally, malwi uses [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) to create Abstract Syntax Trees (ASTs), which are mapped to a unified, security-sensitive syntax that serves as training input. The Python malware dataset can be found [here](https://github.com/lxyeternal/pypi_malregistry). After 3 epochs of training you will get: Loss: `0.0986`, Accuracy: `0.9669`, F1: `0.9666`.
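To illustrate the idea of mapping an AST to a simplified, security-oriented token stream, here is a minimal sketch. It makes several assumptions: it uses Python's built-in `ast` module in place of Tree-sitter, and the token names (`IMPORT:`, `CALL:`, `SENSITIVE_CALL:`) and the `SENSITIVE_CALLS` set are invented for this example. It is not malwi's actual syntax.

```python
import ast

# Hypothetical set of call names treated as security-sensitive in this sketch.
SENSITIVE_CALLS = {"eval", "exec", "system", "popen", "urlopen", "b64decode"}


def call_name(node: ast.Call) -> str:
    """Return the last component of the called name, e.g. 'system' for os.system(...)."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return "<dynamic>"


def to_tokens(source: str) -> list[str]:
    """Map source code to coarse tokens for imports and calls (order is not guaranteed)."""
    tokens = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            tokens.extend(f"IMPORT:{alias.name}" for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports such as "from . import x" have no module name.
            tokens.append(f"IMPORT:{node.module or '.'}")
        elif isinstance(node, ast.Call):
            name = call_name(node)
            prefix = "SENSITIVE_CALL" if name in SENSITIVE_CALLS else "CALL"
            tokens.append(f"{prefix}:{name}")
    return tokens


if __name__ == "__main__":
    print(sorted(to_tokens("import os\nos.system('id')\nprint('hi')\n")))
```

Because `ast.parse` only parses and never executes the input, a mapping of this kind can be applied to untrusted code. A sequence model would then be trained on the resulting tokens rather than on raw source text.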
|
| 45 |
+
|
| 46 |
+
High-level training pipeline:
|
| 47 |
+
|
| 48 |
+
- Create a dataset from malicious and benign repositories and map the code to the malwi syntax
|
| 49 |
+
- Remove duplicate code samples based on their hashes
|
| 50 |
+
- Train DistilBERT on the malwi samples to classify code as malicious or benign
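The hash-based deduplication step in the pipeline above could look roughly like the following sketch. The choice of SHA-256 and the whitespace normalization are assumptions made for illustration, and `deduplicate` is a hypothetical helper, not a function from the malwi code base.

```python
import hashlib


def deduplicate(samples: list[str]) -> list[str]:
    """Keep the first occurrence of each sample, comparing by content hash."""
    seen = set()
    unique = []
    for sample in samples:
        # Assumption: trailing whitespace and surrounding blank lines do not matter.
        normalized = "\n".join(line.rstrip() for line in sample.strip().splitlines())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(sample)
    return unique


if __name__ == "__main__":
    samples = ["import os\n", "import os   \n\n", "import sys\n"]
    print(len(deduplicate(samples)), "unique samples")
```

Removing duplicates before training matters because identical files that appear in both the training and evaluation splits would inflate the reported accuracy.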
|
| 51 |
+
|
| 52 |
+
## Support
|
| 53 |
+
|
| 54 |
+
Do you have access to malicious packages written in Rust, Go, or other languages? **Contact me.**
|