Malware Detection with PyTorch

A PyTorch implementation of a feed-forward neural network for binary malware classification, supporting CPU, CUDA, and Apple MPS acceleration.

Features

  • Multi-device support: Automatically detects and uses the best available device (CUDA GPU > Apple MPS > CPU)
  • Flexible data loading: Supports subsampling of large datasets
  • Robust training: Includes validation, early stopping, and learning rate scheduling
  • Comprehensive evaluation: Multiple metrics including accuracy, precision, recall, F1-score, and AUC
  • Easy inference: Standalone script for making predictions on new data
  • Data utilities: Tools for dataset validation, analysis, and sample data generation

Requirements

Python Version

  • Python 3.12

Dependencies

  • PyTorch ≥ 2.4.0 (with Python 3.12 support)
  • NumPy ≥ 1.24.0
  • Pandas ≥ 2.0.0
  • Scikit-learn ≥ 1.3.0
  • Matplotlib ≥ 3.7.0
  • Seaborn ≥ 0.12.0

Installation

pip install -r requirements.txt

Dataset Format

Your CSV file should contain the following columns:

  • sha256: File hash identifier
  • label: Binary label (0 = benign, 1 = malicious)
  • feature_0, feature_1, ..., feature_N: Numerical features for machine learning
  • rl_fs_t, rl_ls_const_positives (optional): Additional metadata features

Example:

sha256,label,rl_fs_t,rl_ls_const_positives,feature_0,feature_1,feature_2,...
abc123...,0,0.5,0.2,1.23,4.56,7.89,...
def456...,1,0.8,0.9,2.34,5.67,8.90,...
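As a rough illustration of the layout above, the sketch below parses a small in-memory CSV with the standard library and separates identifiers, labels, and feature columns. The sample rows, the helper names (split_columns, load_rows), and the choice to set the optional metadata columns aside are assumptions for illustration; the project itself may load data differently (for example with Pandas).

```python
import csv
import io

# Hypothetical sample that mirrors the documented column layout.
SAMPLE_CSV = """sha256,label,rl_fs_t,rl_ls_const_positives,feature_0,feature_1,feature_2
aaa,0,0.5,0.2,1.23,4.56,7.89
bbb,1,0.8,0.9,2.34,5.67,8.90
"""

# Optional metadata columns named in the format description.
METADATA_COLUMNS = ("rl_fs_t", "rl_ls_const_positives")


def split_columns(header):
    """Return (feature_columns, metadata_columns) found in the header."""
    features = [c for c in header if c.startswith("feature_")]
    # Sort by numeric suffix so feature_10 comes after feature_2.
    features.sort(key=lambda c: int(c.split("_", 1)[1]))
    metadata = [c for c in METADATA_COLUMNS if c in header]
    return features, metadata


def load_rows(text):
    """Parse CSV text into (ids, feature_rows, labels)."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = {"sha256", "label"} - set(header)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    features, _metadata = split_columns(header)
    ids, X, y = [], [], []
    for row in reader:
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"label must be 0 or 1, got {label}")
        ids.append(row["sha256"])
        y.append(label)
        X.append([float(row[c]) for c in features])
    return ids, X, y


ids, X, y = load_rows(SAMPLE_CSV)
```

Whether the training script feeds the metadata columns to the model is not stated here, so this sketch leaves them out of the feature matrix.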

Usage

1. Data Validation and Analysis

Validate your dataset:

python data_utils.py --action validate --input_path train.csv

Analyse your dataset:

python data_utils.py --action analyse --input_path train.csv

2. Training

Basic training:

python malware_classifier.py --data_path train.csv

Training with options:

python malware_classifier.py \
    --data_path train.csv \
    --subsample_ratio 0.1 \
    --batch_size 512 \
    --epochs 100 \
    --learning_rate 0.001 \
    --output_dir ./outputs
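The --subsample_ratio option keeps only a fraction of the dataset. One way this could work is sketched below; the function name, the fixed seed, and the use of simple (non-stratified) random sampling are assumptions, not a description of the actual implementation.

```python
import random


def subsample(rows, ratio, seed=42):
    """Return a random subset holding roughly `ratio` of `rows`.

    Assumption: plain random sampling without stratification by label.
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0.0, 1.0]")
    if not rows:
        return []
    k = max(1, round(len(rows) * ratio))
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(rows, k)


rows = list(range(1000))
subset = subsample(rows, 0.1)
```

With a strongly imbalanced dataset, a stratified sample (sampling each class separately) would preserve the class ratio more reliably than this sketch.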

Command Line Arguments

malware_classifier.py

  • --data_path: Path to CSV training data (default: train.csv)
  • --subsample_ratio: Fraction of the dataset to keep when subsampling (0.0-1.0; optional, no default)
  • --batch_size: Training batch size (default: 512)
  • --epochs: Number of training epochs (default: 100)
  • --learning_rate: Learning rate (default: 0.001)
  • --output_dir: Directory to save outputs (default: ./outputs)
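For reference, the documented flags and defaults map onto an argparse parser along these lines. This is a sketch reconstructed from the list above; the help strings and the build_parser name are assumptions, and the real script may define further options.

```python
import argparse


def build_parser():
    """Build a parser covering the documented training options."""
    parser = argparse.ArgumentParser(description="Train the malware classifier")
    parser.add_argument("--data_path", type=str, default="train.csv",
                        help="Path to CSV training data")
    parser.add_argument("--subsample_ratio", type=float, default=None,
                        help="Fraction of the dataset to keep (0.0-1.0)")
    parser.add_argument("--batch_size", type=int, default=512,
                        help="Training batch size")
    parser.add_argument("--epochs", type=int, default=100,
                        help="Number of training epochs")
    parser.add_argument("--learning_rate", type=float, default=0.001,
                        help="Learning rate")
    parser.add_argument("--output_dir", type=str, default="./outputs",
                        help="Directory to save outputs")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--data_path", "sample_train.csv", "--epochs", "50"])
```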

Model Architecture

The neural network uses a feed-forward architecture with:

  • Input layer: Size matches the number of feature columns (typically around 2381)
  • Hidden layers: [512, 256, 128] neurons with BatchNorm, ReLU, and Dropout
  • Output layer: Single neuron with Sigmoid activation for binary classification
  • Regularization: Dropout (0.3) and weight decay (1e-5)
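The architecture described above can be sketched in PyTorch as follows. The class name MalwareClassifier, the layer ordering inside each hidden block (Linear, BatchNorm, ReLU, Dropout), and the choice of Adam as the optimizer carrying the weight decay are assumptions; only the layer sizes, dropout rate, and weight decay value come from the description.

```python
import torch
from torch import nn


class MalwareClassifier(nn.Module):  # hypothetical class name
    """Feed-forward binary classifier with [512, 256, 128] hidden layers."""

    def __init__(self, input_dim, hidden_dims=(512, 256, 128), dropout=0.3):
        super().__init__()
        layers = []
        prev = input_dim
        for width in hidden_dims:
            # Assumed ordering within each hidden block.
            layers += [
                nn.Linear(prev, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            prev = width
        # Single output neuron with Sigmoid, giving a probability in [0, 1].
        layers += [nn.Linear(prev, 1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


model = MalwareClassifier(input_dim=2381)

# Assumption: Adam optimizer; weight decay of 1e-5 as listed above.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)

# Forward pass on random input; eval mode disables dropout and uses
# BatchNorm running statistics.
model.eval()
with torch.no_grad():
    probs = model(torch.randn(4, 2381))
```

Because the output already passes through a Sigmoid, a loss such as nn.BCELoss would be the natural pairing in this sketch.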

Device Selection

The implementation automatically selects the best available device:

  1. CUDA GPU: Used if an NVIDIA GPU with CUDA is available
  2. Apple MPS: Used if running on an Apple Silicon Mac with MPS available
  3. CPU: Fallback when neither accelerator is available

The selected device is logged at runtime with one of the following messages:

INFO - Using CUDA GPU: NVIDIA GeForce RTX 4090
INFO - Using Apple MPS
INFO - Using CPU
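The priority order above can be expressed with PyTorch's availability checks. The function name select_device and the logger setup are assumptions; the log messages follow the examples shown above.

```python
import logging

import torch

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
logger = logging.getLogger(__name__)


def select_device():
    """Pick the best available device: CUDA, then Apple MPS, then CPU."""
    if torch.cuda.is_available():
        logger.info("Using CUDA GPU: %s", torch.cuda.get_device_name(0))
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        logger.info("Using Apple MPS")
        return torch.device("mps")
    logger.info("Using CPU")
    return torch.device("cpu")


device = select_device()
```

The model and each batch would then be moved to the chosen device with .to(device) before training or inference.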

Output Files

Training Outputs

  • malware_classifier.pth: Trained model checkpoint
  • training_history.json: Training and validation metrics per epoch

Analysis Outputs

  • label_distribution.png: Visualization of class distribution
  • feature_correlations.png: Top features correlated with labels
  • feature_distributions.png: Distribution comparison by class
  • dataset_statistics.csv: Summary statistics

Performance Monitoring

The training script provides detailed logging:

  • Training and validation loss per epoch
  • Accuracy, precision, recall, F1-score, and AUC
  • Automatic model checkpointing based on best validation F1-score
  • Learning rate scheduling with plateau detection
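Checkpointing on the best validation F1-score, combined with early stopping, could look roughly like the sketch below. The BestF1Tracker class, the patience value, and the sample F1 values are illustrative assumptions; the training script's actual stopping criterion and patience are not documented here.

```python
class BestF1Tracker:
    """Track the best validation F1 and signal when to stop early."""

    def __init__(self, patience=10):
        self.patience = patience  # assumed: epochs to wait without improvement
        self.best_f1 = float("-inf")
        self.best_epoch = None
        self.stale_epochs = 0

    def update(self, epoch, val_f1):
        """Return True when `val_f1` is a new best (save a checkpoint)."""
        if val_f1 > self.best_f1:
            self.best_f1 = val_f1
            self.best_epoch = epoch
            self.stale_epochs = 0
            return True
        self.stale_epochs += 1
        return False

    @property
    def should_stop(self):
        return self.stale_epochs >= self.patience


# Made-up validation F1 values, one per epoch, for demonstration only.
tracker = BestF1Tracker(patience=2)
saved_epochs = []
for epoch, val_f1 in enumerate([0.80, 0.85, 0.84, 0.83, 0.90]):
    if tracker.update(epoch, val_f1):
        saved_epochs.append(epoch)  # a real loop would call torch.save here
    if tracker.should_stop:
        break
```

In this run the loop stops after epoch 3, so the later 0.90 is never reached; a larger patience trades longer training for a lower chance of stopping too soon.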

Example Workflow

# 1. Validate the dataset
python data_utils.py --action validate --input_path sample_train.csv

# 2. Analyse the dataset
python data_utils.py --action analyse --input_path sample_train.csv

# 3. Train the model
python malware_classifier.py --data_path sample_train.csv --epochs 50