schirrmacher
/

malwi

schirrmacher committed on Aug 14, 2025

Commit a05ca61 · 1 Parent(s): 2c1e82a

Update README.md

Files changed (1)

README.md +34 -65

@@ -20,16 +20,30 @@ pip install --user malwi
 2) **Run**
 ```
-malwi ./examples
 ```
 3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
 ```
-- 2 files scanned
-- 0 files skipped
-- 3 malicious objects
-=> 👹 malicious 1.0
 ```
 ## Why malwi?
@@ -73,29 +87,18 @@ TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_c
 ### 3. Feed tokens into pre-trained DistilBert
 ```
-=> Maliciousness Score: 0.92
 ```
 This creates a list with malicious code objects. However malicious code might be split into chunks and spread across
 a package. This is why the next layers are needed.
-### 4. Create statistics about malicious activities
-| Object   | DYNAMIC_CODE_EXECUTION | ENCODING_DECODING | FILESYSTEM_ACCESS | ... |
-|----------|------------------------|-------------------|-------------------|-----|
-| Object A | 0                      | 1                 | 0                 | ... |
-| Object B | 1                      | 2                 | 1                 | ... |
-| Object C | 0                      | 0                 | 2                 | ... |
-| **Package**  | **1**                      | **3**                 | **3**                 | **...** |
-### 5. Take final decision
-An SVM layer takes statistics as input and decides if all findings combined are malicious.
 ```
-SVM => Malicious
 ```
 ## Benchmarks?
@@ -104,26 +107,16 @@ SVM => Malicious
 | Metric                     | Value                         |
 |----------------------------|-------------------------------|
-| F1 Score                   | 0.96                          |
-| Recall                     | 0.95                          |
-| Precision                  | 0.98                          |
-| Training time              | ~4 hours                      |
 | Hardware                   | NVIDIA RTX 4090               |
 | Epochs                     | 3                             |
-### SVM Layer
-| Metric                     | Value                         |
-|----------------------------|-------------------------------|
-| F1 Score                   | 0.96                          |
-| Recall                     | 0.95                          |
-| Precision                  | 0.95                          |
 ## Limitations
-malwi compiles Python to bytecode, which is highly version dependent. The AI models are trained on that bytecode.
-This means the performance might drop if a user installed a Python version which creates different bytecode instructions. There is no data yet about this.
 The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
 ## What's next?
@@ -146,37 +139,13 @@ Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
 # Download and process data
 cmds/download_and_preprocess_distilbert.sh
-# Preprocess and train DistilBERT only
-cmds/preprocess_and_train_distilbert.sh
-# Preprocess and train SVM Layer only
-cmds/preprocess_and_train_svm.sh
-# Only preprocess data for DistilBERT
-cmds/preprocess_distilbert.sh
-# Only preprocess data for SVM Layer
-cmds/preprocess_svm.sh
-# Start DistilBERT training
-cmds/train_distilbert.sh
-# Start SVM Layer training
-cmds/train_svm_layer.sh
-```
-### Triage
-malwi uses a pipeline that can be enhanced by triaging its results (see `src/research/triage.py`). For automated triaging, you can leverage open-source models in combination with [Ollama](https://ollama.com/).
-#### Start LLM
-```
-ollama run gemma3
-```
-#### Start Triaging
 ```
-uv run python -m src.research.triage --triage-ollama --path <FOLDER_WITH_MALWI_YAML_RESULTS>
-```

 2) **Run**
 ```
+malwi examples/malicious
 ```
 3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
 ```
+  .--------.---.-|  .--.--.--|__|
+  |        |  _  |  |  |  |  |  |
+  |__|__|__|___._|__|________|__|
+     AI Python Malware Scanner
+- target: examples/malicious
+- files: 13
+  ├── scanned: 3
+  ├── skipped: 10
+  └── suspicious:
+      └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
+          └── <module>
+              ├── deserialization
+              ├── user io
+              ├── system interaction
+              └── process management
+=> 👹 malicious 1.00
 ```
 ## Why malwi?
 ### 3. Feed tokens into pre-trained DistilBert
 ```
+=> Maliciousness: 0.92
 ```
 This creates a list with malicious code objects. However malicious code might be split into chunks and spread across
 a package. This is why the next layers are needed.
+### 4. Take final decision
+The DistilBERT model makes the final maliciousness decision based on the token patterns.
 ```
+=> Maliciousness: 0.92
 ```
 ## Benchmarks?
 | Metric                     | Value                         |
 |----------------------------|-------------------------------|
+| F1 Score                   | 0.944                         |
+| Recall                     | 0.906                         |
+| Precision                  | 0.984                         |
+| Training time              | ~5 hours                      |
 | Hardware                   | NVIDIA RTX 4090               |
 | Epochs                     | 3                             |
 ## Limitations
 The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
 ## What's next?
 # Download and process data
 cmds/download_and_preprocess_distilbert.sh
+# Complete pipelines
+cmds/preprocess_and_train_distilbert.sh  # Data → Tokenizer → DistilBERT
+# Individual data preprocessing
+cmds/preprocess_data.sh                  # Process data for ML training
+# Individual model training
+cmds/train_tokenizer.sh                  # Train custom tokenizer
+cmds/train_distilbert.sh                 # Train DistilBERT model
 ```
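
The updated benchmark table reports F1, recall, and precision together. Since F1 is the harmonic mean of precision and recall, the three values can be cross-checked against each other. A minimal sketch of that check follows; the helper name `f1_score` is an assumption made for illustration and is not part of malwi.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the added rows of the benchmark table in this commit.
new_f1 = f1_score(precision=0.984, recall=0.906)
print(f"F1 from reported precision and recall: {new_f1:.4f}")
```

With the rounded inputs this gives roughly 0.943, which agrees with the reported 0.944 to within the rounding of the three-decimal precision and recall values.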