schirrmacher committed on
Commit
2bd8857
·
verified ·
1 Parent(s): 09e6162

Update README.md

  1. README.md +199 -125
README.md CHANGED
@@ -5,16 +5,19 @@ license: mit
5
 
6
  <img src="malwi-logo.png" alt="Logo">
7
 
8
- ## **malwi** detects Python malware using AI.
9
-
10
- It specializes in finding **zero-day vulnerabilities** and can classify code as malicious or benign without requiring internet access.
11
 
12
  ### Key Features
13
- 🔍 Detects unknown malware patterns through AI analysis
14
- 🔒 Runs completely offline - no data leaves your machine
15
- ⚡ Fast scanning of entire codebases
16
- 🚫 No external dependencies or cloud services required
17
- 📖 Open-source project built on research and open data 🇪🇺
 
 
 
 
 
18
 
19
  ### 1) Install
20
  ```
@@ -22,8 +25,8 @@ pip install --user malwi
22
  ```
23
 
24
  ### 2) Run
25
- ```
26
- malwi examples/malicious
27
  ```
28
 
29
  ### 3) Evaluate: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
@@ -35,11 +38,11 @@ malwi examples/malicious
35
  AI Python Malware Scanner
36
 
37
 
38
- - target: examples/malicious
39
- - seconds: 0.42
40
- - files: 13
41
- ├── scanned: 3
42
- ├── skipped: 10
43
  └── suspicious:
44
  ├── examples/malicious/discordpydebug-0.0.4/setup.py
45
  │ └── <module>
@@ -48,8 +51,8 @@ malwi examples/malicious
48
  └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
49
  ├── <module>
50
  │ ├── process management
51
- │ ├── system interaction
52
  │ ├── deserialization
 
53
  │ └── user io
54
  ├── run
55
  │ └── fs linking
@@ -62,112 +65,211 @@ malwi examples/malicious
62
  => 👹 malicious 0.98
63
  ```
64
 
65
- ## Why malwi?
66
 
67
- [The number of _malicious open-source packages_ is growing](https://arxiv.org/pdf/2404.04991). This is not just a threat to your business but also to the open-source community.
68
 
69
- Typical malware behaviors include:
 
 
70
 
71
- - _Exfiltration_ of data: Stealing credentials, API keys, or sensitive user data.
72
- - _Backdoors_: Allowing remote attackers to gain unauthorized access to your system.
73
- - _Destructive_ actions: Deleting files, corrupting databases, or sabotaging applications.
 
 
 
74
 
75
- ## How does it work?
76
 
77
- malwi applies [DistilBert](https://huggingface.co/docs/transformers/model_doc/distilbert) based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1). The [malwi-samples](https://github.com/schirrmacher/malwi-samples) dataset is used for training.
 
 
 
 
78
 
79
- ### 1. Compile Python files to bytecode
 
 
 
 
 
 
 
 
 
 
80
 
 
 
 
 
81
  ```
82
- def runcommand(value):
83
- output = subprocess.run(value, shell=True, capture_output=True)
84
- return [output.stdout, output.stderr]
 
 
 
 
 
 
 
 
85
  ```
86
 
87
  ```
88
- 0 RESUME 0
89
 
90
- 1 LOAD_CONST 0 (<code object runcommand at 0x5b4f60ae7540, file "example.py", line 1>)
91
- MAKE_FUNCTION
92
- STORE_NAME 0 (runcommand)
93
- RETURN_CONST 1 (None)
94
- ...
 
95
  ```
96
 
97
- ### 2. Map bytecode to tokens
98
 
99
- ```
100
- TARGETED_FILE resume load_global subprocess load_attr run load_fast value load_const INTEGER load_const INTEGER kw_names capture_output shell call store_fast output load_fast output load_attr stdout load_fast output load_attr stderr build_list return_value
101
- ```
102
 
103
- ### 3. Feed tokens into pre-trained DistilBert
104
 
105
- ```
106
- => Maliciousness: 0.92
107
- ```
108
 
109
- This creates a list with malicious code objects. However malicious code might be split into chunks and spread across
110
- a package. This is why the next layers are needed.
111
 
112
- ### 4. Take final decision
113
 
114
- The DistilBERT model makes the final maliciousness decision based on the token patterns.
115
 
116
- ```
117
- => Maliciousness: 0.92
 
 
118
  ```
119
 
120
- ## Benchmarks?
121
 
122
- ### DistilBert
 
 
 
 
 
 
 
123
 
124
- | Metric | Value |
125
- |----------------------------|-------------------------------|
126
- | F1 Score | 0.944 |
127
- | Recall | 0.906 |
128
- | Precision | 0.984 |
129
- | Training time | ~1 hour |
130
- | Hardware | NVIDIA RTX 4090 |
131
- | Epochs | 3 |
132
 
 
133
 
134
- ## Limitations
 
 
135
 
136
- The malicious dataset includes some boilerplate functions, such as init functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
137
 
138
- ## What's next?
139
 
140
- The first iteration focuses on **maliciousness of Python source code**.
141
-
142
- Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
143
 
144
- ## Contributing & Support
145
 
146
- ### 🐛 Report Issues
147
- Found a bug or have a feature request? [Open an issue](https://github.com/schirrmacher/malwi/issues)
 
148
 
149
- ### 📊 Share Malware Samples
150
- Have access to malicious packages in Rust, Go, or other languages? Your contributions can help expand malwi's detection capabilities:
151
- - **Email**: [Contact via GitHub profile](https://github.com/schirrmacher)
152
- - **Submit samples**: Follow responsible disclosure practices
153
 
154
- ### 💬 Community
155
- - **Discussions**: Share ideas and ask questions in [GitHub Discussions](https://github.com/schirrmacher/malwi/discussions)
156
- - **Security**: Report security vulnerabilities privately via GitHub Security tab
157
 
158
- ## Development
159
 
160
- ### 🛠️ Prerequisites
161
 
162
  1. **Package Manager**: Install [uv](https://docs.astral.sh/uv/) for fast Python dependency management
163
- 2. **Training Data**: Clone [malwi-samples](https://github.com/schirrmacher/malwi-samples) in the parent directory:
164
- ```bash
165
- cd ..
166
- git clone https://github.com/schirrmacher/malwi-samples.git
167
- cd malwi
168
- ```
169
 
170
- ### 🚀 Quick Start
171
 
172
  ```bash
173
  # Install dependencies
@@ -176,57 +278,29 @@ uv sync
176
  # Run tests
177
  uv run pytest tests
178
 
179
- # Train a model from scratch (full pipeline)
180
- ./cmds/preprocess_and_train_distilbert.sh
181
  ```
182
 
183
- ### 📚 Training Pipeline
184
-
185
- The training pipeline consists of three stages that can be run together or independently:
186
-
187
- #### **Complete Pipeline** (Recommended)
188
- ```bash
189
- # Data preprocessing → Tokenizer training → Model training
190
- ./cmds/preprocess_and_train_distilbert.sh
191
- ```
192
-
193
- #### **Individual Stages**
194
  ```bash
195
- # 1. Data Preprocessing (parallel by default, ~5-7 min on 8 cores)
196
- ./cmds/preprocess_data.sh
197
 
198
- # 2. Tokenizer Training (~2 min)
199
- ./cmds/train_tokenizer.sh
200
 
201
- # 3. Model Training (~5 hours on NVIDIA RTX 4090)
202
- ./cmds/train_distilbert.sh
203
  ```
204
 
205
- ### ⚙️ Configuration
206
-
207
- ```bash
208
- # Customize parallel processing (preprocessing)
209
- NUM_PROCESSES=16 ./cmds/preprocess_data.sh
210
-
211
- # Train smaller/faster model
212
- HIDDEN_SIZE=256 ./cmds/train_distilbert.sh
213
-
214
- # Train larger/more accurate model
215
- HIDDEN_SIZE=512 EPOCHS=5 ./cmds/train_distilbert.sh
216
- ```
217
 
218
- ### 🧪 Testing & Quality
219
 
220
- ```bash
221
- # Run tests
222
- uv run pytest tests
223
 
224
- # Code formatting
225
- uv run ruff format .
226
 
227
- # Linting
228
- uv run ruff check .
229
 
230
- # Regenerate test data (after compiler changes)
231
- uv run python util/regenerate_test_data.py
232
- ```
 
5
 
6
  <img src="malwi-logo.png" alt="Logo">
7
 
8
+ ## malwi specializes in finding malware
 
 
9
 
10
  ### Key Features
11
+
12
+ - 🛡️ **AI-Powered Python Malware Detection**: Leverages advanced AI to identify malicious code in Python projects with high accuracy.
13
+
14
+ - ⚡ **Lightning-Fast Codebase Scanning**: Scans entire repositories in seconds, so you can focus on development, not security worries.
15
+
16
+ - 🔒 **100% Offline & Private**: Your code never leaves your machine. Full control, zero data exposure.
17
+
18
+ - 💰 **Free & Open-Source**: No hidden costs. Built on transparent research and openly available data.
19
+
20
+ - 🇪🇺 **Developed in the EU**: Committed to open-source principles and European data standards.
21
 
22
  ### 1) Install
23
  ```
 
25
  ```
26
 
27
  ### 2) Run
28
+ ```bash
29
+ malwi scan examples/malicious
30
  ```
31
 
32
  ### 3) Evaluate: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
 
38
  AI Python Malware Scanner
39
 
40
 
41
+ - target: examples
42
+ - seconds: 1.87
43
+ - files: 14
44
+ ├── scanned: 4 (.py)
45
+ ├── skipped: 10 (.cfg, .md, .toml, .txt)
46
  └── suspicious:
47
  ├── examples/malicious/discordpydebug-0.0.4/setup.py
48
  │ └── <module>
 
51
  └── examples/malicious/discordpydebug-0.0.4/src/discordpydebug/__init__.py
52
  ├── <module>
53
  │ ├── process management
 
54
  │ ├── deserialization
55
+ │ ├── system interaction
56
  │ └── user io
57
  ├── run
58
  │ └── fs linking
 
65
  => 👹 malicious 0.98
66
  ```
67
 
68
+ ## PyPI Package Scanning
69
 
70
+ malwi can directly scan PyPI packages without executing malicious logic, typically placed in `setup.py` or `__init__.py` files:
71
 
72
+ ```bash
73
+ malwi pypi requests
74
+ ```
75
 
76
+ ```
77
+ __ __
78
+ .--------.---.-| .--.--.--|__|
79
+ | | _ | | | | | |
80
+ |__|__|__|___._|__|________|__|
81
+ AI Python Malware Scanner
82
 
 
83
 
84
+ - target: downloads/requests-2.32.4.tar
85
+ - seconds: 3.10
86
+ - files: 84
87
+ ├── scanned: 34
87
+ └── skipped: 50
88
 
89
+ => 🟢 good
91
+ ```
92
+
93
+ ## Python API
94
+
95
+ malwi provides a comprehensive Python API for integrating malware detection into your applications.
96
+
97
+ ### Quick Start
98
+
99
+ ```python
100
+ import malwi
101
 
102
+ report = malwi.MalwiReport.create(input_path="suspicious_file.py")
103
+
104
+ for obj in report.malicious_objects:
105
+ print(f"File: {obj.file_path}")
106
  ```
107
+
108
+ ### `MalwiReport`
109
+
110
+ ```python
111
+ MalwiReport.create(
112
+ input_path, # str or Path - file/directory to scan
113
+ accepted_extensions=None, # List[str] - file extensions to scan (e.g., ['py', 'js'])
114
+ silent=False, # bool - suppress progress messages
115
+ malicious_threshold=0.7, # float - threshold for malicious classification (0.0-1.0)
116
+ on_finding=None # callable - callback when malicious objects found
117
+ ) -> MalwiReport # Returns: MalwiReport instance with scan results
118
  ```
119
 
120
+ ```python
121
+ import malwi
122
+
123
+ report = malwi.MalwiReport.create("suspicious_directory/")
124
+
125
+ # Properties
126
+ report.malicious # bool: True if malicious objects detected
127
+ report.confidence # float: Overall confidence score (0.0-1.0)
128
+ report.duration # float: Scan duration in seconds
129
+ report.all_objects # List[MalwiObject]: All analyzed code objects
130
+ report.malicious_objects # List[MalwiObject]: Objects exceeding threshold
131
+ report.threshold # float: Maliciousness threshold used (0.0-1.0)
132
+ report.all_files # List[Path]: All files found in input path
133
+ report.skipped_files # List[Path]: Files skipped (wrong extension)
134
+ report.processed_files # int: Number of files successfully processed
135
+ report.activities # List[str]: Suspicious activities detected
136
+ report.input_path # str: Original input path scanned
137
+ report.start_time # str: ISO 8601 timestamp when scan started
138
+ report.all_file_types # List[str]: All file extensions found
139
+ report.version # str: Malwi version with model hash
140
+
141
+ # Methods
142
+ report.to_demo_text() # str: Human-readable tree summary
143
+ report.to_json() # str: JSON formatted report
144
+ report.to_yaml() # str: YAML formatted report
145
+ report.to_markdown() # str: Markdown formatted report
146
+
147
+ # Pre-load models to avoid delay on first prediction
148
+ malwi.MalwiReport.load_models_into_memory()
149
  ```
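The properties above are enough to gate a CI job on a scan result. The following is a minimal sketch, not part of malwi itself: `exit_code_for` and `scan_and_gate` are hypothetical helper names, and the sketch assumes only the `MalwiReport.create` parameters and the `malicious`, `confidence`, and `to_json()` members listed above.

```python
import sys


def exit_code_for(report) -> int:
    """Return 1 if the report is flagged as malicious, else 0.

    `report` is any object exposing the `malicious` (bool) and
    `confidence` (float) properties documented above.
    """
    if report.malicious:
        print(f"malicious (confidence {report.confidence:.2f})", file=sys.stderr)
        return 1
    return 0


def scan_and_gate(path: str, threshold: float = 0.7) -> int:
    """Scan `path` with malwi and map the result to a process exit code."""
    # Imported lazily so exit_code_for() stays usable without malwi installed.
    import malwi

    report = malwi.MalwiReport.create(
        input_path=path,
        silent=True,
        malicious_threshold=threshold,
    )
    print(report.to_json())
    return exit_code_for(report)
```

In a CI script this could be wired up as `sys.exit(scan_and_gate("src/"))`, so that a flagged scan fails the job.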
 
150
 
151
+ ### `MalwiObject`
152
+ ```python
153
+ obj = report.all_objects[0]
154
+
155
+ # Core properties
156
+ obj.name # str: Function/class/module name
157
+ obj.file_path # str: Path to source file
158
+ obj.language # str: Programming language ('python'/'javascript')
159
+ obj.maliciousness # float|None: ML confidence score (0.0-1.0)
160
+ obj.warnings # List[str]: Compilation warnings/errors
161
+
162
+ # Source code and AST compilation
163
+ obj.file_source_code # str: Complete content of source file
164
+ obj.source_code # str|None: Extracted source for this specific object
165
+ obj.byte_code # List[Instruction]|None: Compiled AST bytecode
166
+ obj.location # Tuple[int,int]|None: Start and end line numbers
167
+ obj.embedding_count # int: Number of DistilBERT tokens (cached)
168
+
169
+ # Analysis methods
170
+ obj.predict() # dict: Run ML prediction and update maliciousness
171
+ obj.to_tokens() # List[str]: Extract tokens for analysis
172
+ obj.to_token_string() # str: Space-separated token string
173
+ obj.to_string() # str: Bytecode as readable string
174
+ obj.to_hash() # str: SHA256 hash of bytecode
175
+ obj.to_dict() # dict: Serializable representation
176
+ obj.to_yaml() # str: YAML formatted output
177
+ obj.to_json() # str: JSON formatted output
178
+
179
+ # Class methods
180
+ MalwiObject.all_tokens(language="python") # List[str]: All possible tokens
181
  ```
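To turn the per-object scores into a per-file view, the documented `file_path`, `name`, and `maliciousness` properties can be grouped. This is a sketch under assumptions: `worst_by_file` is a hypothetical helper, and it treats a `maliciousness` of `None` as "not yet scored" and skips it.

```python
from collections import defaultdict


def worst_by_file(objects, threshold=0.7):
    """Group objects scoring above `threshold` by file, highest score first."""
    grouped = defaultdict(list)
    for obj in objects:
        score = obj.maliciousness
        if score is None or score <= threshold:
            continue  # unscored, or not above the threshold
        grouped[obj.file_path].append((score, obj.name))
    # Sort each file's findings so the most suspicious object comes first.
    return {
        path: sorted(entries, reverse=True)
        for path, entries in grouped.items()
    }
```

With a real report this would be called as `worst_by_file(report.all_objects, report.threshold)`.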
182
 
183
+ ## Why malwi?
184
 
185
+ Malicious actors are increasingly [targeting open-source projects](https://arxiv.org/pdf/2404.04991), introducing packages designed to compromise security.
 
 
186
 
187
+ Common malicious behaviors include:
188
 
189
+ - **Data exfiltration**: Theft of sensitive information such as credentials, API keys, or user data.
190
+ - **Backdoors**: Unauthorized remote access to systems, enabling attackers to exploit vulnerabilities.
191
+ - **Destructive actions**: Deliberate sabotage, including file deletion, database corruption, or application disruption.
192
 
193
+ ## How does it work?
 
194
 
195
+ malwi is based on the design of [_Zero Day Malware Detection with Alpha: Fast DBI with Transformer Models for Real World Application_ (2025)](https://arxiv.org/pdf/2504.14886v1).
196
 
197
+ Imagine there is a function like:
198
 
199
+ ```python
200
+ def runcommand(value):
201
+ output = subprocess.run(value, shell=True, capture_output=True)
202
+ return [output.stdout, output.stderr]
203
  ```
204
 
205
+ ### 1. Files are parsed into an Abstract Syntax Tree with [Tree-sitter](https://tree-sitter.github.io/tree-sitter/index.html)
206
 
207
+ ```
208
+ module [0, 0] - [3, 0]
209
+ function_definition [0, 0] - [2, 41]
210
+ name: identifier [0, 4] - [0, 14]
211
+ parameters: parameters [0, 14] - [0, 21]
212
+ identifier [0, 15] - [0, 20]
213
+ ...
214
+ ```
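For readers without Tree-sitter installed, the same idea can be explored with Python's built-in `ast` module. This is a stand-in, not what malwi uses: node names differ (`FunctionDef` instead of `function_definition`), and `ast.unparse` needs Python 3.9 or newer.

```python
import ast

SOURCE = """\
def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
"""

tree = ast.parse(SOURCE)

# The first top-level node is the function definition.
func = tree.body[0]
summary = (type(func).__name__, func.name, [a.arg for a in func.args.args])
print(summary)  # ('FunctionDef', 'runcommand', ['value'])

# Collect every call expression together with its keyword argument names.
calls = [
    (ast.unparse(node.func), [kw.arg for kw in node.keywords])
    for node in ast.walk(tree)
    if isinstance(node, ast.Call)
]
print(calls)  # [('subprocess.run', ['shell', 'capture_output'])]
```

Parsing only inspects the source text; nothing in the scanned file is executed.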
215
 
216
+ ### 2. The AST is transpiled to dummy bytecode
 
 
 
 
 
 
 
217
 
218
+ The bytecode is enriched with security-related instructions (for example, `subprocess` is mapped to `PROCESS_MANAGEMENT`).
219
 
220
+ ```
221
+ TARGETED_FILE PUSH_NULL LOAD_GLOBAL PROCESS_MANAGEMENT LOAD_ATTR run LOAD_PARAM value LOAD_CONST BOOLEAN LOAD_CONST BOOLEAN KW_NAMES shell capture_output CALL STRING_VERSION STORE_GLOBAL output LOAD_GLOBAL output LOAD_ATTR stdout LOAD_GLOBAL output LOAD_ATTR stderr BUILD_LIST STRING_VERSION RETURN_VALUE
222
+ ```
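A rough feel for this token stream can be had with CPython's own `dis` module. The sketch below is an illustration only: it uses real CPython bytecode rather than malwi's dummy bytecode, the `placeholder` scheme and `to_tokens` helper are hypothetical, and opcode names vary between Python versions, so the exact output is not shown.

```python
import dis
import types

SOURCE = """\
def runcommand(value):
    output = subprocess.run(value, shell=True, capture_output=True)
    return [output.stdout, output.stderr]
"""


def placeholder(value):
    """Replace a constant with a coarse type label (hypothetical scheme)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, str):
        return "STRING"
    return type(value).__name__.upper()


def to_tokens(code):
    """Flatten a code object into opcode names, identifiers and placeholders."""
    tokens = []
    for ins in dis.get_instructions(code):
        tokens.append(ins.opname.lower())
        value = ins.argval
        if isinstance(value, tuple) and all(isinstance(v, str) for v in value):
            tokens.extend(value)  # keyword names or fused local-variable names
        elif ins.opcode in dis.hasconst:
            tokens.append(placeholder(value))
        elif isinstance(value, str):
            tokens.append(value)  # global, attribute or local name
    return tokens


# Compile without executing, then pick the function's code object.
module_code = compile(SOURCE, "example.py", "exec")
func_code = next(c for c in module_code.co_consts if isinstance(c, types.CodeType))
tokens = to_tokens(func_code)
print(" ".join(tokens))
```

Replacing literals with type labels keeps the vocabulary small, which is the same motivation behind tokens such as `BOOLEAN` in the stream above.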
223
 
224
+ ### 3. The bytecode is fed into a pre-trained [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert)
225
 
226
+ A DistilBERT model trained on [malwi-samples](https://github.com/schirrmacher/malwi-samples) is used to identify suspicious code patterns.
227
 
228
+ ```
229
+ => Maliciousness: 0.98
230
+ ```
231
 
232
+ ## Benchmarks?
233
 
234
+ ```
235
+ training_loss: 0.0110
236
+ epochs_completed: 3.0000
237
+ original_train_samples: 598540.0000
238
+ windowed_train_features: 831865.0000
239
+ original_validation_samples: 149636.0000
240
+ windowed_validation_features: 204781.0000
241
+ benign_samples_used: 734930.0000
242
+ malicious_samples_used: 13246.0000
243
+ benign_to_malicious_ratio: 60.0000
244
+ vocab_size: 30522.0000
245
+ max_length: 512.0000
246
+ window_stride: 128.0000
247
+ batch_size: 16.0000
248
+ eval_loss: 0.0107
249
+ eval_accuracy: 0.9980
250
+ eval_f1: 0.9521
251
+ eval_precision: 0.9832
252
+ eval_recall: 0.9229
253
+ eval_runtime: 115.5982
254
+ eval_samples_per_second: 1771.4900
255
+ eval_steps_per_second: 110.7200
256
+ epoch: 3.0000
257
+ ```
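The reported figures can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the log above; the variable names are illustrative.

```python
precision = 0.9832  # eval_precision
recall = 0.9229     # eval_recall

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # 0.9521, matching eval_f1

train, validation = 598_540, 149_636
benign, malicious = 734_930, 13_246

# Both ways of counting the dataset give the same total.
total = train + validation
assert total == benign + malicious
print(total)                          # 748176
print(f"{validation / total:.0%}")    # 20%
print(f"{benign / malicious:.1f}")    # 55.5
```

The realized benign-to-malicious ratio of about 55.5 is below the logged `benign_to_malicious_ratio` of 60, which suggests, as an assumption, that the logged value is a configured upper bound rather than a measured one.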
258
 
259
+ ## Contributing & Support
 
 
 
260
 
261
+ - Found a bug or have a feature request? [Open an issue](https://github.com/schirrmacher/malwi/issues).
262
+ - Do you have access to malicious packages in Rust, Go, or other languages? [Contact via GitHub profile](https://github.com/schirrmacher).
263
+ - Struggling with false-positive findings? [Create a pull request](https://github.com/schirrmacher/malwi-samples/pulls).
264
 
265
+ ## Research
266
 
267
+ ### Prerequisites
268
 
269
  1. **Package Manager**: Install [uv](https://docs.astral.sh/uv/) for fast Python dependency management
270
+ 2. **Training Data**: The research CLI will automatically clone [malwi-samples](https://github.com/schirrmacher/malwi-samples) when needed
 
 
 
 
 
271
 
272
+ ### Quick Start
273
 
274
  ```bash
275
  # Install dependencies
 
278
  # Run tests
279
  uv run pytest tests
280
 
281
+ # Train a model from scratch (full pipeline with automatic data download)
282
+ ./research download preprocess train
283
  ```
284
 
285
+ #### Individual Pipeline Steps
 
 
 
 
 
 
 
 
 
 
286
  ```bash
287
+ # 1. Download training data (clones malwi-samples + downloads repositories)
288
+ ./research download
289
 
290
+ # 2. Data preprocessing only (parallel processing, ~4 min on 32 cores)
291
+ ./research preprocess --language python
292
 
293
+ # 3. Model training only (tokenizer + DistilBERT, ~40 minutes on NVIDIA RTX 4090)
294
+ ./research train
295
  ```
296
 
297
+ ## Limitations
 
 
 
 
 
 
 
 
 
 
 
298
 
299
+ The malicious dataset includes some boilerplate functions, such as setup functions, which can also appear in benign code. These cause false positives during scans. The goal is to triage and reduce such false positives to improve malwi's accuracy.
300
 
301
+ ## What's next?
 
 
302
 
303
+ The first iteration focuses on **maliciousness of Python source code**.
 
304
 
305
+ Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
 
306