schirrmacher/malwi


schirrmacher committed on Jun 15, 2025

Commit e29f6b3 (verified) · 1 Parent(s): 9c0ec86

Update README.md

Files changed (1)

README.md +32 -25

README.md CHANGED

@@ -25,13 +25,11 @@ malwi ./examples
 3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
 ```
-def runcommand(value):
-    output = subprocess.run(value, shell=True, capture_output=True)
-    return [output.stdout, output.stderr]
-## examples/__init__.py
-- Object: runcommand
-- Maliciousness: 👹 0.9620079398155212
 ```
 ## Why malwi?
@@ -44,16 +42,9 @@ Typical malware behaviors include:
 - _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
 - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
-> ⚠️ **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
-Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
 ## How does it work?
-malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
-The following datasets are used as a source for malicious samples:
-- [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
-- [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
 ### 1. Compile Python files to bytecode
@@ -109,7 +100,7 @@ SVM => Malicious
 ## Benchmarks?
-DistilBert:
 | Metric                     | Value                         |
 |----------------------------|-------------------------------|
@@ -120,12 +111,19 @@ DistilBert:
 | Hardware                   | NVIDIA RTX 4090               |
 | Epochs                     | 3                             |
-SVM:
-`Coming soon`
 ## Limitations
 The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
 ## What's next?
@@ -140,18 +138,27 @@ Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
 ### Develop
-Prerequisites: [uv](https://docs.astral.sh/uv/)
-```
 # Download and process data
-cmds/download_and_preprocess.sh
-# Only process data
-cmds/preprocess.sh
-# Preprocess then start training
-cmds/preprocess_and_train.sh
-# Start DistilBert training
 cmds/train_distilbert.sh
 # Start SVM Layer training

 3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
 ```
+- 2 files scanned
+- 0 files skipped
+- 3 malicious objects
+=> 👹 malicious 1.0
 ```
 ## Why malwi?
 - _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
 - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
 ## How does it work?
+malwi applies [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert), following the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1). The [malwi-samples](https://github.com/schirrmacher/malwi-samples) dataset is used for training.
 ### 1. Compile Python files to bytecode
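As a rough illustration of this step, the sketch below compiles a source string to a code object and lists the opcode names of each function. It reuses the `runcommand` sample from the removed README lines. The helper name `opnames_per_function` is hypothetical, and this is not malwi's actual preprocessing, only a minimal standard-library approximation of "Python file to bytecode".

```python
import dis

# Sample taken from the README diff; it is only compiled here, never executed.
SOURCE = '''
import subprocess

def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
'''


def opnames_per_function(source: str, filename: str = "<example>") -> dict:
    """Hypothetical helper: map each code object's name to its opcode names."""
    module_code = compile(source, filename, "exec")
    result = {}
    stack = [module_code]
    while stack:
        code = stack.pop()
        result[code.co_name] = [ins.opname for ins in dis.get_instructions(code)]
        # Nested functions and classes are stored as constants of their parent.
        stack.extend(c for c in code.co_consts if hasattr(c, "co_code"))
    return result


ops = opnames_per_function(SOURCE)
print(sorted(ops))            # names of the code objects that were found
print(ops["runcommand"][:5])  # first few opcode names; they vary by Python version
```

Working per code object mirrors the per-object scores shown in the README output (for example `Object: runcommand`), although how malwi actually splits and tokenizes bytecode is not stated in this diff.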
 ## Benchmarks?
+### DistilBert
 | Metric                     | Value                         |
 |----------------------------|-------------------------------|
 | Hardware                   | NVIDIA RTX 4090               |
 | Epochs                     | 3                             |
+### SVM Layer
+| Metric                     | Value                         |
+|----------------------------|-------------------------------|
+| F1 Score                   | 0.96                          |
+| Recall                     | 0.95                          |
+| Precision                  | 0.95                          |
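For reference, the binary F1 score is the harmonic mean of precision and recall. The short sketch below computes it from the two rounded values in the table. The function name `f1_score` is illustrative, and the evaluation code behind the table is not part of this diff.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (binary F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values from the SVM Layer table.
print(round(f1_score(0.95, 0.95), 2))  # 0.95
```

The harmonic mean of 0.95 and 0.95 is 0.95, so the reported F1 of 0.96 may reflect per-class averaging or another evaluation detail that the source does not state.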
 ## Limitations
+malwi compiles Python to bytecode, which is highly version-dependent. The AI models are trained on that bytecode.
+This means detection performance might drop if the user's Python version emits different bytecode instructions than the version used for training. There is no data on this yet.
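One way to see this version dependence locally is to record the interpreter version, the bytecode magic number, and a fingerprint of the opcode sequence for a fixed snippet, then compare the result across Python installations. The helper names `bytecode_fingerprint` and `interpreter_info` are hypothetical and not part of malwi. This sketch only shows whether two interpreters emit different opcode names for the same source, and says nothing about the resulting model accuracy.

```python
import dis
import hashlib
import importlib.util
import sys


def bytecode_fingerprint(source: str) -> str:
    """Hash the sequence of opcode names produced for `source`."""
    code = compile(source, "<fingerprint>", "exec")
    names = " ".join(ins.opname for ins in dis.get_instructions(code))
    return hashlib.sha256(names.encode()).hexdigest()[:12]


def interpreter_info() -> dict:
    """Collect values that change when the bytecode format changes."""
    return {
        "python": ".".join(map(str, sys.version_info[:3])),
        # The magic number identifies the .pyc bytecode format of this interpreter.
        "magic": importlib.util.MAGIC_NUMBER.hex(),
        "fingerprint": bytecode_fingerprint("x = [i * 2 for i in range(3)]"),
    }


print(interpreter_info())
```

If the `fingerprint` value differs between two interpreters, the same source file yields different instruction sequences on them, which is the situation the limitation describes.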
 The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
 ## What's next?
 ### Develop
+**Prerequisites:**
+- [uv](https://docs.astral.sh/uv/)
+- Download [malwi-samples](https://github.com/schirrmacher/malwi-samples) into the same parent folder as malwi
+```bash
 # Download and process data
+cmds/download_and_preprocess_distilbert.sh
+# Preprocess and train DistilBERT only
+cmds/preprocess_and_train_distilbert.sh
+# Preprocess and train SVM Layer only
+cmds/preprocess_and_train_svm.sh
+# Only preprocess data for DistilBERT
+cmds/preprocess_distilbert.sh
+# Only preprocess data for SVM Layer
+cmds/preprocess_svm.sh
+# Start DistilBERT training
 cmds/train_distilbert.sh
 # Start SVM Layer training