schirrmacher/malwi

schirrmacher committed on Jun 2, 2025

Commit dec2025 · verified · 1 Parent(s): 54abaae

Update README.md

Files changed (1)

README.md +40 -7

@@ -47,15 +47,13 @@ Typical malware behaviors include:
 > **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
 Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
-## What's next?
-The first iteration focuses on **maliciousness of Python source code**.
-Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
 ## How does it work?
-malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1). [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry) is used as a source for malicious samples.
 1. malwi compiles Python files to bytecode:
@@ -87,6 +85,25 @@ TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_c
 Maliciousness: 0.9620079398155212
 ```
 ## Support
 Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
@@ -106,4 +123,20 @@ cmds/preprocess_and_train.sh
 # Only start training
 cmds/train.sh
 ```

 > **Attention**: Malicious packages might execute code during installation (e.g. through `setup.py`).
 Make sure to *NOT* download or install malicious packages from the dataset with commands like `uv add`, `pip install`, `poetry add`.
 ## How does it work?
+malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) and Support Vector Machines (SVM) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
+The following datasets are used as sources for malicious samples:
+- [pypi_malregistry](https://github.com/lxyeternal/pypi_malregistry)
+- [DataDog malicious-software-packages-dataset](https://github.com/DataDog/malicious-software-packages-dataset)
 1. malwi compiles Python files to bytecode:
 Maliciousness: 0.9620079398155212
 ```
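
The first step described under "How does it work?" (compiling Python files to bytecode) can be illustrated with the standard library alone. The sketch below is not malwi's actual tokenizer: the function name `opcode_tokens` and the choice to emit lowercased opcode names followed by string arguments are assumptions, picked only because they resemble the token stream shown in the example output.

```python
import dis


def opcode_tokens(source: str, filename: str = "<sample>") -> list[str]:
    """Compile source to bytecode and flatten it into a list of tokens.

    Hypothetical helper for illustration; malwi's real preprocessing may differ.
    """
    # compile() only builds a code object, it does not execute the source.
    code = compile(source, filename, "exec")
    tokens = []
    for instr in dis.get_instructions(code):
        # Opcode names are upper case in dis; lower-case them for readability.
        tokens.append(instr.opname.lower())
        # Keep string arguments such as imported module or attribute names.
        if isinstance(instr.argval, str):
            tokens.append(instr.argval)
    return tokens


sample = "import subprocess\nsubprocess.run(['ls'])\n"
tokens = opcode_tokens(sample)
print(" ".join(tokens))
```

The exact opcodes depend on the Python version, so the printed sequence varies between interpreters. Only module-level instructions are listed here; nested functions would need their own code objects to be walked as well.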
+## Benchmarks?
+The current best model differentiates benign from malicious code with the following metrics:
+- F1 Score: `0.91`
+- Recall: `0.87`
+- Precision: `0.94`
+- Benign unique samples: 1070888
+- Malicious unique samples: 152984
+- Trained for ~4 hours on an NVIDIA RTX 4090 for 3 epochs
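
As a quick sanity check on the figures listed above, F1 is the harmonic mean of precision and recall, and the class balance follows from the two sample counts. The snippet below is only an arithmetic sketch using the rounded values from the README; the variable names are made up for illustration.

```python
# Rounded metrics as reported in the README.
precision = 0.94
recall = 0.87

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Unique sample counts as reported in the README.
benign = 1_070_888
malicious = 152_984
malicious_share = malicious / (benign + malicious)

print(f"F1 from rounded precision/recall: {f1:.3f}")   # 0.904
print(f"Malicious share of samples: {malicious_share:.3f}")  # 0.125
```

Recomputing from two-decimal inputs gives roughly 0.90 rather than the reported `0.91`; a gap of that size can be explained by rounding of precision and recall. The counts correspond to about seven benign samples per malicious one.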
+## What's next?
+The first iteration focuses on **maliciousness of Python source code**.
+Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
 ## Support
 Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
 # Only start training
 cmds/train.sh
+```
+### Triage
+malwi's pipeline can be improved by triaging its results (see `src/research/triage.py`).
+For automated triaging, you can use open-source models via Ollama (default: Gemma 3).
+Install Gemma 3:
+```
+ollama run gemma3
+```
+Start auto-triaging:
+```
+uv run python -m src.research.triage --triage-ollama --path <FOLDER_WITH_MALWI_YAML_RESULTS>
 ```