Malware Detection with PyTorch
A PyTorch implementation of a feed-forward neural network for binary malware classification, supporting CPU, CUDA, and Apple MPS acceleration.
Features
- Multi-device support: Automatically detects and uses the best available device (CUDA GPU > Apple MPS > CPU)
- Flexible data loading: Supports subsampling of large datasets
- Robust training: Includes validation, early stopping, and learning rate scheduling
- Comprehensive evaluation: Multiple metrics including accuracy, precision, recall, F1-score, and AUC
- Easy inference: Standalone script for making predictions on new data
- Data utilities: Tools for dataset validation, analysis, and sample data generation
Requirements
Python Version
- Python 3.12
Dependencies
- PyTorch ≥ 2.4.0 (with Python 3.12 support)
- NumPy ≥ 1.24.0
- Pandas ≥ 2.0.0
- Scikit-learn ≥ 1.3.0
- Matplotlib ≥ 3.7.0
- Seaborn ≥ 0.12.0
Installation
pip install -r requirements.txt
Dataset Format
Your CSV file should have the following structure:
- sha256: File hash identifier
- label: Binary label (0 = benign, 1 = malicious)
- feature_0, feature_1, ..., feature_N: Numerical features for machine learning
- rl_fs_t, rl_ls_const_positives (optional): Additional metadata features
Example:
sha256,label,rl_fs_t,rl_ls_const_positives,feature_0,feature_1,feature_2,...
abc123...,0,0.5,0.2,1.23,4.56,7.89,...
def456...,1,0.8,0.9,2.34,5.67,8.90,...
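The column layout above can be checked with a few lines of Python. The following is a minimal, dependency-free sketch and not the logic of data_utils.py: the function name `load_rows`, the sample hashes, and the sample values are hypothetical, and it assumes that only columns starting with `feature_` are model inputs (the optional metadata columns are left out).

```python
import csv
import io

REQUIRED = ("sha256", "label")


def load_rows(handle):
    """Read the CSV and return (feature_names, hashes, labels, features)."""
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED if col not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")

    # Assumption: every column starting with "feature_" is a model input;
    # rl_fs_t and rl_ls_const_positives are ignored here.
    feature_names = [col for col in header if col.startswith("feature_")]

    hashes, labels, features = [], [], []
    for row in reader:
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label}")
        hashes.append(row["sha256"])
        labels.append(label)
        features.append([float(row[name]) for name in feature_names])
    return feature_names, hashes, labels, features


# Hypothetical two-row sample; the hashes and values are made up.
SAMPLE = """sha256,label,rl_fs_t,rl_ls_const_positives,feature_0,feature_1,feature_2
aaa,0,0.5,0.2,1.23,4.56,7.89
bbb,1,0.8,0.9,2.34,5.67,8.90
"""

names, hashes, labels, features = load_rows(io.StringIO(SAMPLE))
print(names, labels, features[0])
```

With pandas, the same split would amount to selecting the `feature_` columns from the loaded DataFrame.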
Usage
1. Data Validation and Analysis
Validate your dataset:
python data_utils.py --action validate --input_path train.csv
Analyse your dataset:
python data_utils.py --action analyse --input_path train.csv
2. Training
Basic training:
python malware_classifier.py --data_path train.csv
Training with options:
python malware_classifier.py \
--data_path train.csv \
--subsample_ratio 0.1 \
--batch_size 512 \
--epochs 100 \
--learning_rate 0.001 \
--output_dir ./outputs
Command Line Arguments
malware_classifier.py
- --data_path: Path to CSV training data (default: train.csv)
- --subsample_ratio: Ratio to subsample dataset (0.0-1.0, optional)
- --batch_size: Training batch size (default: 512)
- --epochs: Number of training epochs (default: 100)
- --learning_rate: Learning rate (default: 0.001)
- --output_dir: Directory to save outputs (default: ./outputs)
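For readers who want to see how the flags and defaults listed above map onto a parser, here is a minimal `argparse` sketch. It mirrors only what is documented here; the actual parser in malware_classifier.py may differ. The helper names `build_parser` and `ratio` are hypothetical, and rejecting a ratio of exactly 0 is an assumption.

```python
import argparse


def ratio(text):
    """Parse a subsample ratio; assumes 0 < ratio <= 1 is the valid range."""
    value = float(text)
    if not 0.0 < value <= 1.0:
        raise argparse.ArgumentTypeError(f"ratio must be in (0, 1], got {value}")
    return value


def build_parser():
    parser = argparse.ArgumentParser(description="Train the malware classifier")
    parser.add_argument("--data_path", default="train.csv",
                        help="Path to CSV training data")
    parser.add_argument("--subsample_ratio", type=ratio, default=None,
                        help="Optional fraction of the dataset to keep")
    parser.add_argument("--batch_size", type=int, default=512)
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--output_dir", default="./outputs")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--subsample_ratio", "0.1", "--epochs", "50"])
print(args.subsample_ratio, args.epochs, args.batch_size)
```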
Model Architecture
The neural network uses a feed-forward architecture with:
- Input layer: Size matches number of features (typically ~2381)
- Hidden layers: [512, 256, 128] neurons with BatchNorm, ReLU, and Dropout
- Output layer: Single neuron with Sigmoid activation for binary classification
- Regularization: Dropout (0.3) and weight decay (1e-5)
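The architecture described above could be assembled roughly as follows. This is a sketch under stated assumptions, not the repository's model class: the function names `layer_plan` and `build_classifier` are hypothetical, and the ordering Linear, BatchNorm, ReLU, Dropout within each hidden block is assumed. PyTorch is imported lazily so that the layer-size arithmetic can be inspected without it.

```python
def layer_plan(input_dim, hidden=(512, 256, 128)):
    """Return (in_features, out_features) for each Linear layer, output included."""
    dims = [input_dim, *hidden, 1]
    return list(zip(dims[:-1], dims[1:]))


def build_classifier(input_dim, hidden=(512, 256, 128), dropout=0.3):
    """Build the feed-forward network as an nn.Sequential (requires PyTorch)."""
    from torch import nn  # lazy import: only needed when the model is built

    plan = layer_plan(input_dim, hidden)
    layers = []
    # Hidden blocks; the Linear -> BatchNorm -> ReLU -> Dropout order is assumed.
    for in_features, out_features in plan[:-1]:
        layers += [
            nn.Linear(in_features, out_features),
            nn.BatchNorm1d(out_features),
            nn.ReLU(),
            nn.Dropout(dropout),
        ]
    # Output layer: one neuron with a sigmoid, giving a probability of "malicious".
    in_features, out_features = plan[-1]
    layers += [nn.Linear(in_features, out_features), nn.Sigmoid()]
    return nn.Sequential(*layers)


print(layer_plan(2381))
```

Weight decay would be applied through the optimizer, for example `torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)`; the choice of Adam is an assumption, since the optimizer is not named here.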
Device Selection
The implementation automatically selects the best available device:
- CUDA GPU: If NVIDIA GPU with CUDA is available
- Apple MPS: If running on Apple Silicon Mac
- CPU: Fallback option
The selected device is logged at runtime with one of the following messages:
INFO - Using CUDA GPU: NVIDIA GeForce RTX 4090
INFO - Using Apple MPS
INFO - Using CPU
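The priority order above (CUDA, then MPS, then CPU) can be sketched as below. The function names `choose_device_name` and `detect_device` are hypothetical and the repository's own implementation may be organised differently. The priority logic is kept separate from the PyTorch probing so that it can be exercised on a machine without PyTorch installed.

```python
import logging

logger = logging.getLogger(__name__)


def choose_device_name(cuda_ok, mps_ok):
    """Apply the documented priority: CUDA GPU > Apple MPS > CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device():
    """Probe PyTorch for available backends and return a torch.device."""
    import torch  # lazy import: only needed when a device is requested

    name = choose_device_name(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )
    if name == "cuda":
        logger.info("Using CUDA GPU: %s", torch.cuda.get_device_name(0))
    elif name == "mps":
        logger.info("Using Apple MPS")
    else:
        logger.info("Using CPU")
    return torch.device(name)


print(choose_device_name(cuda_ok=False, mps_ok=True))
```

A model and its batches would then be moved with `.to(detect_device())` before training.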
Output Files
Training Outputs
- malware_classifier.pth: Trained model checkpoint
- training_history.json: Training and validation metrics per epoch
Analysis Outputs
- label_distribution.png: Visualization of class distribution
- feature_correlations.png: Top features correlated with labels
- feature_distributions.png: Distribution comparison by class
- dataset_statistics.csv: Summary statistics
Performance Monitoring
The training script provides detailed logging:
- Training and validation loss per epoch
- Accuracy, precision, recall, F1-score, and AUC
- Automatic model checkpointing based on best validation F1-score
- Learning rate scheduling with plateau detection
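To make the metrics and the best-F1 checkpointing rule concrete, here is a small dependency-free sketch. It is illustrative only: `binary_metrics` and `BestF1Tracker` are hypothetical names, the patience value is made up, and the training script itself may rely on scikit-learn (for example `sklearn.metrics.f1_score` and `roc_auc_score`) rather than hand-written formulas. AUC is omitted because it needs predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


class BestF1Tracker:
    """Track the best validation F1 and count epochs without improvement."""

    def __init__(self, patience=10):  # patience is a hypothetical default
        self.patience = patience
        self.best_f1 = float("-inf")
        self.stale_epochs = 0

    def update(self, f1):
        """Return True when f1 is a new best, i.e. when a checkpoint should be saved."""
        if f1 > self.best_f1:
            self.best_f1 = f1
            self.stale_epochs = 0
            return True
        self.stale_epochs += 1
        return False

    @property
    def should_stop(self):
        """Early stopping triggers after `patience` epochs without improvement."""
        return self.stale_epochs >= self.patience


metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
print(metrics)
```

In a training loop, `update` would be called once per epoch with the validation F1, saving the model whenever it returns True and breaking out of the loop once `should_stop` is True. For the plateau-based learning rate schedule, PyTorch provides `torch.optim.lr_scheduler.ReduceLROnPlateau`; whether the script uses that class is an assumption.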
Example Workflow
# 1. Validate the dataset
python data_utils.py --action validate --input_path sample_train.csv
# 2. Analyse the dataset
python data_utils.py --action analyse --input_path sample_train.csv
# 3. Train the model
python malware_classifier.py --data_path sample_train.csv --epochs 50