---
language: en
license: mit
library_name: scikit-learn
tags:
- malware-detection
- cybersecurity
- machine-learning
- random-forest
- classification
pipeline_tag: tabular-classification
---

# Malware Detection Model (Enhanced Baseline)

## 🧠 Overview
This repository contains a machine learning-based malware detection model designed to classify files as **benign or malicious** based on extracted numerical features.

The current version focuses on validating the **end-to-end pipeline**, including:
- Feature preprocessing
- Model training
- Serialization
- Deployment via Hugging Face Hub
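
The first three stages listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's training script: the use of `StandardScaler`, the random stand-in data, the small forest size, and the temporary output path are all assumptions made for brevity. Only the bundle layout (a dictionary with `"model"` and `"scaler"` keys) follows the Usage section. Uploading to the Hugging Face Hub is left out.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulated stand-in data: 200 samples, 50 features, random binary labels.
X = rng.random((200, 50))
y = rng.integers(0, 2, size=200)

# 1. Feature preprocessing (assumed here to be standard scaling).
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# 2. Model training (a deliberately small forest keeps this sketch fast).
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_scaled, y)

# 3. Serialization: store the classifier and scaler together in one dictionary.
bundle_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
joblib.dump({"model": model, "scaler": scaler}, bundle_path)

# Reload the bundle to confirm that it round-trips.
restored = joblib.load(bundle_path)
```
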

---

## ⚙️ Model Architecture
- Algorithm: **Random Forest Classifier**
- Number of trees: **300**
- Max depth: **15**
- Input features: **50 engineered features**

This configuration is chosen to favor:
- Fast inference
- Robustness to noisy inputs
- Scalability to larger datasets
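
For reference, the hyperparameters listed above map onto scikit-learn's `RandomForestClassifier` roughly as shown below. This is a sketch under stated assumptions: the random training data, the sample count, and `random_state` are placeholders, and any settings not listed in this card are left at the library defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with the documented input width of 50 features.
rng = np.random.default_rng(42)
X = rng.random((500, 50))
y = rng.integers(0, 2, size=500)

# Hyperparameters taken from the list above; random_state is an assumption
# added only to make the sketch reproducible.
clf = RandomForestClassifier(
    n_estimators=300,  # number of trees
    max_depth=15,      # upper bound on the depth of each tree
    random_state=42,
)
clf.fit(X, y)

# Inspect the fitted forest: tree count, input width, deepest tree.
deepest = max(tree.get_depth() for tree in clf.estimators_)
print(len(clf.estimators_), clf.n_features_in_, deepest)
```
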

---

## 📊 Features Used
The model operates on numerical features derived from file characteristics, such as:

- File entropy (randomness of bytes)
- File size
- Section count
- Import/export table size
- Byte distribution statistics
- Header metadata patterns
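
As an illustration of the simpler features in this list, the sketch below computes Shannon entropy, file size, and two byte-distribution statistics from raw bytes using only the standard library. The function names and the choice of statistics are hypothetical and are not taken from this repository's feature extractor. Section counts, import/export tables, and header metadata would need an executable-format parser and are not covered.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def basic_features(data: bytes) -> dict:
    """Hypothetical feature subset: size, entropy, and byte-value statistics."""
    total = len(data)
    mean = sum(data) / total if total else 0.0
    return {
        "file_size": total,
        "entropy": byte_entropy(data),
        "byte_mean": mean,
        "distinct_bytes": len(set(data)),
    }
```
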

> Note: The current version uses simulated data to validate the architecture. Integration with real-world datasets (e.g., EMBER) is planned.

---

## 🧪 Usage

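```python
import joblib
import numpy as np

# Load the serialized bundle, which stores the classifier and its fitted scaler.
bundle = joblib.load("model.pkl")
model = bundle["model"]
scaler = bundle["scaler"]

# Placeholder input: one sample with 50 features.
# Replace this with features extracted from a real file.
sample = np.random.rand(1, 50)
sample_scaled = scaler.transform(sample)

# Label 1 is reported as malicious; any other label is reported as benign.
prediction = model.predict(sample_scaled)
print("Malicious" if prediction[0] == 1 else "Benign")
```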