schirrmacher committed on
Commit bd5323e · verified · 1 Parent(s): 7e32428

Update README.md

Files changed (1)
  1. README.md +92 -24
README.md CHANGED
@@ -5,25 +5,28 @@ license: mit
5
 
6
  <img src="malwi-logo.png" alt="Logo">
7
 
8
- Detect Python malware _fast_ - no internet, no expensive hardware, no fees.
9
 
10
- malwi is specialized in detecting **zero-day vulnerabilities**, for classifying code as safe or harmful.
11
 
12
- Open-source software made in Europe.
13
- Based on open research, open code, open data.
14
- 🇪🇺🤘🕊️
15
 
16
- 1) **Install**
17
  ```
18
  pip install --user malwi
19
  ```
20
 
21
- 2) **Run**
22
  ```
23
  malwi examples/malicious
24
  ```
25
 
26
- 3) **Evaluate**: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
27
  ```
28
  __ __
29
  .--------.---.-| .--.--.--|__|
@@ -123,7 +126,7 @@ The DistilBERT model makes the final maliciousness decision based on the token p
123
  | F1 Score | 0.944 |
124
  | Recall | 0.906 |
125
  | Precision | 0.984 |
126
- | Training time | ~5 hours |
127
  | Hardware | NVIDIA RTX 4090 |
128
  | Epochs | 3 |
129
 
@@ -138,27 +141,92 @@ The first iteration focuses on **maliciousness of Python source code**.
138
 
139
  Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
140
 
141
- ## Support
142
 
143
- Do you have access to malicious Rust, Go, whatever packages? **Contact me.**
 
144
 
145
- ### Develop
146
 
147
- **Prerequisites:**
148
- - [uv](https://docs.astral.sh/uv/)
149
- - Download [malwi-samples](https://github.com/schirrmacher/malwi-samples) in the same parent folder
150
 
151
  ```bash
152
- # Download and process data
153
- cmds/download_and_preprocess_distilbert.sh
154
 
155
- # Complete pipelines
156
- cmds/preprocess_and_train_distilbert.sh # Data → Tokenizer → DistilBERT
157
 
158
- # Individual data preprocessing
159
- cmds/preprocess_data.sh # Process data for ML training
160
 
161
- # Individual model training
162
- cmds/train_tokenizer.sh # Train custom tokenizer
163
- cmds/train_distilbert.sh # Train DistilBERT model
164
  ```
 
5
 
6
  <img src="malwi-logo.png" alt="Logo">
7
 
8
+ ## **malwi** detects Python malware using AI.
9
 
10
+ It specializes in finding **zero-day vulnerabilities** and can classify code as malicious or benign without requiring internet access.
11
 
12
+ ### Key Features
13
+ - 🔍 Detects unknown malware patterns through AI analysis
14
+ - 🔒 Runs completely offline - no data leaves your machine
15
+ - ⚡ Fast scanning of entire codebases
16
+ - 🚫 No external dependencies or cloud services required
17
+ - 📖 Open-source project built on research and open data 🇪🇺
18
 
19
+ ### 1) Install
20
  ```
21
  pip install --user malwi
22
  ```
23
 
24
+ ### 2) Run
25
  ```
26
  malwi examples/malicious
27
  ```
28
 
29
+ ### 3) Evaluate: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
30
  ```
31
  __ __
32
  .--------.---.-| .--.--.--|__|
 
126
  | F1 Score | 0.944 |
127
  | Recall | 0.906 |
128
  | Precision | 0.984 |
129
+ | Training time | ~1 hour |
130
  | Hardware | NVIDIA RTX 4090 |
131
  | Epochs | 3 |
132
 
 
141
 
142
  Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
143
 
144
+ ## Contributing & Support
145
 
146
+ ### 🐛 Report Issues
147
+ Found a bug or have a feature request? [Open an issue](https://github.com/schirrmacher/malwi/issues)
148
 
149
+ ### 📊 Share Malware Samples
150
+ Have access to malicious packages in Rust, Go, or other languages? Your contributions can help expand malwi's detection capabilities:
151
+ - **Email**: [Contact via GitHub profile](https://github.com/schirrmacher)
152
+ - **Submit samples**: Follow responsible disclosure practices
153
 
154
+ ### 💬 Community
155
+ - **Discussions**: Share ideas and ask questions in [GitHub Discussions](https://github.com/schirrmacher/malwi/discussions)
156
+ - **Security**: Report security vulnerabilities privately via the GitHub Security tab
157
+
158
+ ## Development
159
+
160
+ ### 🛠️ Prerequisites
161
+
162
+ 1. **Package Manager**: Install [uv](https://docs.astral.sh/uv/) for fast Python dependency management
163
+ 2. **Training Data**: Clone [malwi-samples](https://github.com/schirrmacher/malwi-samples) in the parent directory:
164
+ ```bash
165
+ cd ..
166
+ git clone https://github.com/schirrmacher/malwi-samples.git
167
+ cd malwi
168
+ ```
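Before running the pipeline scripts, it can help to confirm that the samples repository sits next to the `malwi` checkout. The following is a minimal sketch, assuming the sibling-directory layout described above. The helper name `samples_present` is hypothetical and not part of malwi.

```bash
# Hypothetical helper (not part of malwi): succeeds when malwi-samples
# is a sibling directory of the given repository root.
samples_present() {
  repo_root="${1:-.}"   # defaults to the current directory
  [ -d "$repo_root/../malwi-samples" ]
}

if samples_present; then
  echo "malwi-samples found"
else
  echo "malwi-samples missing: clone it into the parent directory first"
fi
```

Run it from the root of the `malwi` checkout, or pass the checkout path as the first argument.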
169
+
170
+ ### 🚀 Quick Start
171
+
172
+ ```bash
173
+ # Install dependencies
174
+ uv sync
175
+
176
+ # Run tests
177
+ uv run pytest tests
178
+
179
+ # Train a model from scratch (full pipeline)
180
+ ./cmds/preprocess_and_train_distilbert.sh
181
+ ```
182
+
183
+ ### 📚 Training Pipeline
184
+
185
+ The training pipeline consists of three stages that can be run together or independently:
186
+
187
+ #### **Complete Pipeline** (Recommended)
188
+ ```bash
189
+ # Data preprocessing → Tokenizer training → Model training
190
+ ./cmds/preprocess_and_train_distilbert.sh
191
+ ```
192
+
193
+ #### **Individual Stages**
194
+ ```bash
195
+ # 1. Data Preprocessing (parallel by default, ~5-7 min on 8 cores)
196
+ ./cmds/preprocess_data.sh
197
+
198
+ # 2. Tokenizer Training (~2 min)
199
+ ./cmds/train_tokenizer.sh
200
+
201
+ # 3. Model Training (~5 hours on NVIDIA RTX 4090)
202
+ ./cmds/train_distilbert.sh
203
+ ```
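When the stages are run individually, each one depends on the output of the previous one, so it is usually worth stopping at the first failure. The sketch below shows one way to chain them. The `run_stages` function is a hypothetical wrapper, not something shipped in `cmds/`, and it only assumes that each stage script exits non-zero on error.

```bash
# Hypothetical wrapper (not part of malwi): run each command in order
# and stop at the first one that exits with a non-zero status.
run_stages() {
  for stage in "$@"; do
    echo "==> $stage"
    "$stage" || { echo "stage failed: $stage" >&2; return 1; }
  done
}

# Usage from the repository root (left commented out so the sketch has no side effects):
# run_stages ./cmds/preprocess_data.sh ./cmds/train_tokenizer.sh ./cmds/train_distilbert.sh
```

Because the function returns as soon as a stage fails, a failed preprocessing run does not start the much longer model training step.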
204
+
205
+ ### ⚙️ Configuration
206
+
207
+ ```bash
208
+ # Customize parallel processing (preprocessing)
209
+ NUM_PROCESSES=16 ./cmds/preprocess_data.sh
210
+
211
+ # Train smaller/faster model
212
+ HIDDEN_SIZE=256 ./cmds/train_distilbert.sh
213
+
214
+ # Train larger/more accurate model
215
+ HIDDEN_SIZE=512 EPOCHS=5 ./cmds/train_distilbert.sh
216
+ ```
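The overrides above use the standard shell pattern of setting an environment variable for a single command. One way a script could pick up such values is default parameter expansion, sketched below. This is an illustration, not the contents of the actual `cmds/` scripts, and the fallback values are placeholders: the README confirms only that the reported model was trained for 3 epochs, not what the scripts default to.

```bash
# Illustrative only: the fallback values are placeholders, not malwi's real defaults.
show_settings() {
  # ${VAR:-fallback} expands to the fallback only when VAR is unset or empty
  echo "processes=${NUM_PROCESSES:-8} hidden_size=${HIDDEN_SIZE:-384} epochs=${EPOCHS:-3}"
}

show_settings
```

Calling it as `NUM_PROCESSES=16 show_settings` inside a script would report the overridden process count while the other two settings keep their fallbacks.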
217
+
218
+ ### 🧪 Testing & Quality
219
 
220
  ```bash
221
+ # Run tests
222
+ uv run pytest tests
223
 
224
+ # Code formatting
225
+ uv run ruff format .
226
 
227
+ # Linting
228
+ uv run ruff check .
229
 
230
+ # Regenerate test data (after compiler changes)
231
+ uv run python util/regenerate_test_data.py
 
232
  ```