Malware Detection Model (Enhanced Baseline)
Overview
This repository contains a machine learning-based malware detection model designed to classify files as benign or malicious based on extracted numerical features.
The current version focuses on validating the end-to-end pipeline, including:
- Feature preprocessing
- Model training
- Serialization
- Deployment via Hugging Face Hub
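The first three pipeline stages can be sketched as follows. This is a minimal illustration, not the repository's actual training script: the use of `StandardScaler`, the random seed, the sample count, and the temporary output path are all assumptions. The dictionary keys `"model"` and `"scaler"` match what the Usage section loads, and the data is simulated.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Simulated data standing in for real extracted features (assumption).
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = rng.integers(0, 2, size=200)

# 1. Feature preprocessing (StandardScaler is an assumed choice).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Model training with the tree count and depth stated in this README.
model = RandomForestClassifier(n_estimators=300, max_depth=15, random_state=0)
model.fit(X_scaled, y)

# 3. Serialization: store model and scaler together in one dict.
bundle_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump({"model": model, "scaler": scaler}, bundle_path)

# Round-trip check: the restored bundle exposes the same two keys.
restored = joblib.load(bundle_path)
print(sorted(restored))
```

Keeping the scaler in the same file as the model means inference code cannot accidentally pair the model with a differently fitted scaler.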
Model Architecture
- Algorithm: Random Forest Classifier
- Number of trees: 300
- Max depth: 15
- Input features: 50 engineered features
The model is optimized for:
- Fast inference
- Robustness to noisy inputs
- Scalability for larger datasets
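As a rough way to confirm that a fitted forest matches the configuration above, the sketch below trains on simulated data and inspects the result. The data, seed, and sample count are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simulated training data with 50 features (assumption).
rng = np.random.default_rng(42)
X = rng.random((300, 50))
y = rng.integers(0, 2, size=300)

model = RandomForestClassifier(n_estimators=300, max_depth=15, random_state=42)
model.fit(X, y)

# Inspect the fitted forest: tree count, expected input width, deepest tree.
n_trees = len(model.estimators_)
deepest = max(tree.get_depth() for tree in model.estimators_)
print(n_trees, model.n_features_in_, deepest)
```

`max_depth` is an upper bound, so individual trees may stop growing before reaching depth 15.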
Features Used
The model operates on numerical features derived from file characteristics, such as:
- File entropy (randomness of bytes)
- File size
- Section count
- Import/export table size
- Byte distribution statistics
- Header metadata patterns
Note: The current version uses simulated data to validate the architecture. Integration with real-world datasets (e.g., EMBER) is planned.
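To make a few of the listed features concrete, here is a hedged sketch of how file entropy, file size, and simple byte distribution statistics could be computed from raw bytes. The function names and the particular statistics are hypothetical; they are not the repository's 50 engineered features.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte values, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over the byte values that actually occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def basic_features(data: bytes) -> dict:
    """A handful of illustrative numerical features for one file."""
    counts = Counter(data)
    total = len(data) or 1  # avoid division by zero for empty input
    printable = sum(c for b, c in counts.items() if 32 <= b < 127)
    return {
        "size": len(data),
        "entropy": byte_entropy(data),
        "zero_byte_ratio": counts.get(0, 0) / total,
        "printable_ratio": printable / total,
        "distinct_bytes": len(counts),
    }


# Every byte value once gives maximal entropy; a constant buffer gives none.
print(basic_features(bytes(range(256))))
print(basic_features(b"\x00" * 1024))
```

High entropy is often associated with packed or encrypted content, which is why it is a common input to detectors, but on its own it does not indicate malice.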
Usage
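import joblib
import numpy as np

# Load the serialized bundle, which holds the trained model and its scaler.
bundle = joblib.load("model.pkl")
model = bundle["model"]
scaler = bundle["scaler"]

# Placeholder input: one sample with 50 random feature values.
sample = np.random.rand(1, 50)

# Apply the scaler from the bundle, then classify (1 = malicious, otherwise benign).
sample_scaled = scaler.transform(sample)
prediction = model.predict(sample_scaled)
print("Malicious" if prediction[0] == 1 else "Benign")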
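If the bundle is fetched from the Hugging Face Hub rather than read from a local path, loading might look like the sketch below. The helper name `load_bundle_from_hub` and the repository id in the commented call are placeholders, and the file name `model.pkl` is assumed to match the local example.

```python
import joblib


def load_bundle_from_hub(repo_id: str, filename: str = "model.pkl") -> dict:
    """Download the serialized bundle from the Hub and load it (hypothetical helper)."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    # hf_hub_download caches the file locally and returns its path.
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    # joblib.load unpickles the file, so only load bundles from sources you trust.
    return joblib.load(local_path)


# Example call with a placeholder repository id:
# bundle = load_bundle_from_hub("your-username/your-model-repo")
# model, scaler = bundle["model"], bundle["scaler"]
```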