Update README.md
README.md
CHANGED
|
@@ -5,25 +5,28 @@ license: mit
|
|
| 5 |
|
| 6 |
<img src="malwi-logo.png" alt="Logo">
|
| 7 |
|
| 8 |
-
|
| 9 |
|
| 10 |
-
|
| 11 |
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
|
| 16 |
-
1)
|
| 17 |
```
|
| 18 |
pip install --user malwi
|
| 19 |
```
|
| 20 |
|
| 21 |
-
2)
|
| 22 |
```
|
| 23 |
malwi examples/malicious
|
| 24 |
```
|
| 25 |
|
| 26 |
-
3)
|
| 27 |
```
|
| 28 |
__ __
|
| 29 |
.--------.---.-| .--.--.--|__|
|
|
@@ -123,7 +126,7 @@ The DistilBERT model makes the final maliciousness decision based on the token p
|
|
| 123 |
| F1 Score | 0.944 |
|
| 124 |
| Recall | 0.906 |
|
| 125 |
| Precision | 0.984 |
|
| 126 |
-
| Training time | ~
|
| 127 |
| Hardware | NVIDIA RTX 4090 |
|
| 128 |
| Epochs | 3 |
|
| 129 |
|
|
@@ -138,27 +141,92 @@ The first iteration focuses on **maliciousness of Python source code**.
|
|
| 138 |
|
| 139 |
Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
|
| 140 |
|
| 141 |
-
## Support
|
| 142 |
|
| 143 |
-
|
|
|
|
| 144 |
|
| 145 |
-
###
|
| 146 |
|
| 147 |
-
|
| 148 |
-
- [
|
| 149 |
-
-
|
| 150 |
|
| 151 |
```bash
|
| 152 |
-
#
|
| 153 |
-
|
| 154 |
|
| 155 |
-
#
|
| 156 |
-
|
| 157 |
|
| 158 |
-
#
|
| 159 |
-
|
| 160 |
|
| 161 |
-
#
|
| 162 |
-
|
| 163 |
-
cmds/train_distilbert.sh # Train DistilBERT model
|
| 164 |
```
|
|
|
|
| 5 |
|
| 6 |
<img src="malwi-logo.png" alt="Logo">
|
| 7 |
|
| 8 |
+
## **malwi** detects Python malware using AI.
|
| 9 |
|
| 10 |
+
It specializes in detecting **zero-day malware** (previously unseen malicious packages) and classifies code as malicious or benign without requiring internet access.
|
| 11 |
|
| 12 |
+
### Key Features
|
| 13 |
+
- Detects unknown malware patterns through AI analysis
|
| 14 |
+
- Runs completely offline: no data leaves your machine
|
| 15 |
+
- ⚡ Fast scanning of entire codebases
|
| 16 |
+
- 🚫 No external dependencies or cloud services required
|
| 17 |
+
- Open-source project built on research and open data 🇪🇺
|
| 18 |
|
| 19 |
+
### 1) Install
|
| 20 |
```
|
| 21 |
pip install --user malwi
|
| 22 |
```
|
| 23 |
|
| 24 |
+
### 2) Run
|
| 25 |
```
|
| 26 |
malwi examples/malicious
|
| 27 |
```
|
| 28 |
|
| 29 |
+
### 3) Evaluate: a [recent zero-day](https://socket.dev/blog/malicious-pypi-package-targets-discord-developers-with-RAT) detected with high confidence
|
| 30 |
```
|
| 31 |
__ __
|
| 32 |
.--------.---.-| .--.--.--|__|
|
|
|
|
| 126 |
| F1 Score | 0.944 |
|
| 127 |
| Recall | 0.906 |
|
| 128 |
| Precision | 0.984 |
|
| 129 |
+
| Training time | ~1 hour |
|
| 130 |
| Hardware | NVIDIA RTX 4090 |
|
| 131 |
| Epochs | 3 |
|
| 132 |
|
|
|
|
| 141 |
|
| 142 |
Future iterations will cover malware scanning for more languages (JavaScript, Rust, Go) and more formats (binaries, logs).
|
| 143 |
|
| 144 |
+
## Contributing & Support
|
| 145 |
|
| 146 |
+
### Report Issues
|
| 147 |
+
Found a bug or have a feature request? [Open an issue](https://github.com/schirrmacher/malwi/issues).
|
| 148 |
|
| 149 |
+
### Share Malware Samples
|
| 150 |
+
Have access to malicious packages in Rust, Go, or other languages? Your contributions can help expand malwi's detection capabilities:
|
| 151 |
+
- **Email**: [Contact via GitHub profile](https://github.com/schirrmacher)
|
| 152 |
+
- **Submit samples**: Follow responsible disclosure practices
|
| 153 |
|
| 154 |
+
### 💬 Community
|
| 155 |
+
- **Discussions**: Share ideas and ask questions in [GitHub Discussions](https://github.com/schirrmacher/malwi/discussions)
|
| 156 |
+
- **Security**: Report security vulnerabilities privately via the GitHub Security tab
|
| 157 |
+
|
| 158 |
+
## Development
|
| 159 |
+
|
| 160 |
+
### 🛠️ Prerequisites
|
| 161 |
+
|
| 162 |
+
1. **Package Manager**: Install [uv](https://docs.astral.sh/uv/) for fast Python dependency management
|
| 163 |
+
2. **Training Data**: Clone [malwi-samples](https://github.com/schirrmacher/malwi-samples) in the parent directory:
|
| 164 |
+
```bash
|
| 165 |
+
cd ..
|
| 166 |
+
git clone https://github.com/schirrmacher/malwi-samples.git
|
| 167 |
+
cd malwi
|
| 168 |
+
```
|
| 169 |
+
|
| 170 |
+
### Quick Start
|
| 171 |
+
|
| 172 |
+
```bash
|
| 173 |
+
# Install dependencies
|
| 174 |
+
uv sync
|
| 175 |
+
|
| 176 |
+
# Run tests
|
| 177 |
+
uv run pytest tests
|
| 178 |
+
|
| 179 |
+
# Train a model from scratch (full pipeline)
|
| 180 |
+
./cmds/preprocess_and_train_distilbert.sh
|
| 181 |
+
```
|
| 182 |
+
|
| 183 |
+
### Training Pipeline
|
| 184 |
+
|
| 185 |
+
The training pipeline consists of three stages that can be run together or independently:
|
| 186 |
+
|
| 187 |
+
#### **Complete Pipeline** (Recommended)
|
| 188 |
+
```bash
|
| 189 |
+
# Data preprocessing → Tokenizer training → Model training
|
| 190 |
+
./cmds/preprocess_and_train_distilbert.sh
|
| 191 |
+
```
|
| 192 |
+
|
| 193 |
+
#### **Individual Stages**
|
| 194 |
+
```bash
|
| 195 |
+
# 1. Data Preprocessing (parallel by default, ~5-7 min on 8 cores)
|
| 196 |
+
./cmds/preprocess_data.sh
|
| 197 |
+
|
| 198 |
+
# 2. Tokenizer Training (~2 min)
|
| 199 |
+
./cmds/train_tokenizer.sh
|
| 200 |
+
|
| 201 |
+
# 3. Model Training (~5 hours on NVIDIA RTX 4090)
|
| 202 |
+
./cmds/train_distilbert.sh
|
| 203 |
+
```
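
The three stages above depend on each other, so a later stage should not start if an earlier one fails. A minimal sketch of that ordering follows; `run_stages` is a hypothetical helper written for illustration and is not part of malwi's `cmds/` scripts.

```bash
# Hypothetical helper (not part of malwi): run each command in order and
# stop at the first one that exits with a non-zero status.
run_stages() {
  for stage in "$@"; do
    echo "==> ${stage}"
    "${stage}" || { echo "stage failed: ${stage}" >&2; return 1; }
  done
}

# Usage from the repository root, with the script paths listed above:
# run_stages ./cmds/preprocess_data.sh ./cmds/train_tokenizer.sh ./cmds/train_distilbert.sh
```

Stopping early matters most here because the model training stage is by far the longest; a failed preprocessing run is cheaper to catch before training begins.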
|
| 204 |
+
|
| 205 |
+
### ⚙️ Configuration
|
| 206 |
+
|
| 207 |
+
```bash
|
| 208 |
+
# Customize parallel processing (preprocessing)
|
| 209 |
+
NUM_PROCESSES=16 ./cmds/preprocess_data.sh
|
| 210 |
+
|
| 211 |
+
# Train smaller/faster model
|
| 212 |
+
HIDDEN_SIZE=256 ./cmds/train_distilbert.sh
|
| 213 |
+
|
| 214 |
+
# Train larger/more accurate model
|
| 215 |
+
HIDDEN_SIZE=512 EPOCHS=5 ./cmds/train_distilbert.sh
|
| 216 |
+
```
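
The `VAR=value ./script` form used above sets the variable for that single invocation only. As a rough sketch of how a script can pick up such overrides, the snippet below reads each variable with a fallback default. The `show_config` helper and the default values are assumptions for illustration; they are not taken from malwi's scripts.

```bash
set -eu

# Hypothetical helper: resolve settings from the environment, falling back
# to illustrative defaults (assumed values, not malwi's actual defaults).
show_config() {
  num_processes="${NUM_PROCESSES:-4}"
  hidden_size="${HIDDEN_SIZE:-256}"
  epochs="${EPOCHS:-3}"
  echo "processes=${num_processes} hidden_size=${hidden_size} epochs=${epochs}"
}

show_config
```

Because `${NAME:-default}` only applies the default when the variable is unset or empty, the same script works both with and without overrides on the command line.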
|
| 217 |
+
|
| 218 |
+
### 🧪 Testing & Quality
|
| 219 |
|
| 220 |
```bash
|
| 221 |
+
# Run tests
|
| 222 |
+
uv run pytest tests
|
| 223 |
|
| 224 |
+
# Code formatting
|
| 225 |
+
uv run ruff format .
|
| 226 |
|
| 227 |
+
# Linting
|
| 228 |
+
uv run ruff check .
|
| 229 |
|
| 230 |
+
# Regenerate test data (after compiler changes)
|
| 231 |
+
uv run python util/regenerate_test_data.py
|
|
|
|
| 232 |
```
|