mdhaggai committed
Commit 7b61a48 · 1 Parent(s): f06c1bd

Deploy CyberForge AI ML Training Platform

Files changed (5)
  1. README.md +183 -6
  2. app.py +772 -0
  3. hf_client.py +436 -0
  4. requirements.txt +31 -0
  5. trainer.py +459 -0
README.md CHANGED
@@ -1,12 +1,189 @@
1
  ---
2
- title: Cyberforge
3
- emoji: 📉
4
- colorFrom: red
5
  colorTo: blue
6
  sdk: gradio
7
- sdk_version: 5.49.1
8
  app_file: app.py
9
- pinned: false
 
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: CyberForge AI
3
+ emoji: 🔐
4
+ colorFrom: purple
5
  colorTo: blue
6
  sdk: gradio
7
+ sdk_version: 4.44.0
8
  app_file: app.py
9
+ pinned: true
10
+ license: mit
11
  ---
12
 
13
+ # 🔐 CyberForge AI - ML Training Platform
14
+
15
+ **Train, Deploy, and Serve Cybersecurity Machine Learning Models**
16
+
17
+ A comprehensive platform for training cybersecurity ML models in the cloud with Hugging Face Spaces integration.
18
+
19
+ ## 🚀 Features
20
+
21
+ - **📊 Model Training**: Upload datasets and train multiple ML models (Random Forest, Gradient Boosting, Neural Networks, Ensembles)
22
+ - **🤖 Multiple Security Tasks**: Malware detection, phishing detection, network intrusion, anomaly detection, and more
23
+ - **☁️ Cloud Training**: Leverage Hugging Face's infrastructure for training without local compute resources
24
+ - **🔗 API Integration**: RESTful API endpoints for backend integration
25
+ - **💾 Model Hub**: Upload trained models to Hugging Face Hub for sharing and deployment
26
+
27
+ ## 📦 Supported Security Tasks
28
+
29
+ | Task | Description |
30
+ |------|-------------|
31
+ | Malware Detection | Identify malicious software patterns |
32
+ | Phishing Detection | Detect phishing URLs and emails |
33
+ | Network Intrusion Detection | Identify network attack patterns |
34
+ | Anomaly Detection | Detect unusual system behavior |
35
+ | Botnet Detection | Identify botnet command & control traffic |
36
+ | Web Attack Detection | Detect SQL injection, XSS, etc. |
37
+ | Spam Detection | Filter spam messages |
38
+ | Vulnerability Assessment | Assess system vulnerabilities |
39
+ | DNS Tunneling Detection | Detect DNS-based data exfiltration |
40
+ | Cryptomining Detection | Identify unauthorized mining activity |
41
+
42
+ ## 🛠️ Model Types
43
+
44
+ - **Random Forest**: Robust ensemble classifier
45
+ - **Gradient Boosting**: High-performance gradient boosting
46
+ - **Logistic Regression**: Fast baseline classifier
47
+ - **Isolation Forest**: Unsupervised anomaly detection
48
+ - **Neural Networks**: Deep learning models (when available)
49
+ - **Ensemble Models**: Voting and stacking classifiers
50
+
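As a quick illustration of the default configuration, the sketch below trains the platform's baseline Random Forest on a small synthetic dataset. The feature matrix and label rule are made up for the example; the split ratio and `random_state=42` mirror the platform's defaults.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a security dataset (hypothetical features)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))              # e.g. traffic statistics
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy "attack" label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"accuracy: {accuracy:.4f}")
```

The same pattern applies to the other classifier types; only the imported class changes.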
51
+ ## 📖 How to Use
52
+
53
+ ### 1. Training a Model
54
+
55
+ 1. Go to the **🎯 Train Model** tab
56
+ 2. Upload your dataset (CSV, JSON, or Parquet)
57
+ 3. Select the security task type
58
+ 4. Choose a model type
59
+ 5. Enter the target column name
60
+ 6. Click **Train Model**
61
+
62
+ ### 2. Running Inference
63
+
64
+ 1. Go to the **🔮 Run Inference** tab
65
+ 2. Enter the model ID from training
66
+ 3. Provide input features as JSON
67
+ 4. Click **Run Inference**
68
+
69
+ ### 3. Backend Integration
70
+
71
+ ```python
72
+ from gradio_client import Client
73
+
74
+ # Connect to the Space
75
+ client = Client("Che237/cyberforge")
76
+
77
+ # Train a model
78
+ result = client.predict(
79
+ file="path/to/dataset.csv",
80
+ task_type="Malware Detection",
81
+ model_type="Random Forest",
82
+ target_column="label",
83
+ test_size=0.2,
84
+ model_name="my_model",
85
+ api_name="/train_model"
86
+ )
87
+
88
+ # Run inference
89
+ predictions = client.predict(
90
+ model_id="my_model_malware_detection_20240101_120000",
91
+ input_data='[{"feature1": 0.5, "feature2": 1.2}]',
92
+ api_name="/run_inference"
93
+ )
94
+ ```
95
+
96
+ ### 4. Node.js Backend Integration
97
+
98
+ ```javascript
99
+ const { Client } = require("@gradio/client");
100
+
101
+ async function runPrediction(modelId, features) {
102
+ const client = await Client.connect("Che237/cyberforge");
103
+ const result = await client.predict("/run_inference", {
104
+ model_id: modelId,
105
+ input_data: JSON.stringify([features])
106
+ });
107
+ return JSON.parse(result.data);
108
+ }
109
+
110
+ // Usage
111
+ const prediction = await runPrediction(
112
+ "cyberforge_model_malware_detection_20240101",
113
+ { src_bytes: 1000, dst_bytes: 500, protocol_type: 0 }
114
+ );
115
+ console.log(prediction);
116
+ ```
117
+
118
+ ## 📊 Dataset Format
119
+
120
+ Your dataset should be in CSV, JSON, or Parquet format with:
121
+
122
+ - **Features**: Numerical or categorical columns
123
+ - **Target**: A column indicating the class/label (e.g., `label`, `is_malicious`, `attack_type`)
124
+
125
+ ### Example CSV Structure:
126
+
127
+ ```csv
128
+ src_bytes,dst_bytes,protocol_type,service,flag,label
129
+ 1000,500,tcp,http,SF,normal
130
+ 5000,2000,udp,dns,REJ,attack
131
+ ...
132
+ ```
133
+
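Before uploading, it can help to sanity-check the file locally. A minimal sketch, using the column names from the example above (your own dataset's columns will differ):

```python
import io
import pandas as pd

# Hypothetical CSV matching the example structure above
csv_text = """src_bytes,dst_bytes,protocol_type,service,flag,label
1000,500,tcp,http,SF,normal
5000,2000,udp,dns,REJ,attack
"""

df = pd.read_csv(io.StringIO(csv_text))
target_column = "label"

# Checks mirroring what the trainer expects
assert target_column in df.columns, "target column missing"
assert len(df) > 0, "dataset is empty"
categorical = df.drop(columns=[target_column]).select_dtypes(include="object").columns
print(f"{len(df)} rows, {len(df.columns)} columns, categorical: {list(categorical)}")
```

Categorical columns like `protocol_type` are label-encoded automatically during training, so they can stay as strings in the file.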
134
+ ## 🔗 API Endpoints
135
+
136
+ | Endpoint | Method | Description |
137
+ |----------|--------|-------------|
138
+ | `/train_model` | POST | Train a new model |
139
+ | `/run_inference` | POST | Run predictions |
140
+ | `/list_trained_models` | GET | List available models |
141
+ | `/upload_model_to_hub` | POST | Upload model to Hub |
142
+ | `/download_model_from_hub` | POST | Download model from Hub |
143
+
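The `/run_inference` endpoint expects `input_data` as a JSON-encoded array of feature objects, so the payload can be built with the standard `json` module (the feature names here are hypothetical):

```python
import json

samples = [
    {"src_bytes": 1000, "dst_bytes": 500, "protocol_type": 0},
    {"src_bytes": 5000, "dst_bytes": 2000, "protocol_type": 1},
]

# The Space parses this string with json.loads, accepting a dict or a list
input_data = json.dumps(samples)
print(input_data)
```

A single dict (one sample) is also accepted and wrapped into a one-row batch server-side.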
144
+ ## 🏗️ Architecture
145
+
146
+ ```
147
+ ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
148
+ │ Your Backend │ ──▶ │ HF Space (API) │ ──▶ │ Trained Models │
149
+ │ (Node.js) │ ◀── │ (Gradio) │ ◀── │ (pkl files) │
150
+ └─────────────────┘ └───────��──────────┘ └─────────────────┘
151
+
152
+
153
+ ┌──────────────────┐
154
+ │ Hugging Face │
155
+ │ Model Hub │
156
+ └──────────────────┘
157
+ ```
158
+
159
+ ## 📁 Files
160
+
161
+ - `app.py` - Main Gradio application
162
+ - `trainer.py` - Advanced model training module
163
+ - `hf_client.py` - Client library for backend integration
164
+ - `requirements.txt` - Python dependencies
165
+
166
+ ## 🔧 Local Development
167
+
168
+ ```bash
169
+ # Clone the space
170
+ git clone https://huggingface.co/spaces/Che237/cyberforge
171
+
172
+ # Install dependencies
173
+ pip install -r requirements.txt
174
+
175
+ # Run locally
176
+ python app.py
177
+ ```
178
+
179
+ ## 📄 License
180
+
181
+ MIT License - See LICENSE file for details.
182
+
183
+ ## 🤝 Contributing
184
+
185
+ Contributions are welcome! Please feel free to submit a Pull Request.
186
+
187
+ ---
188
+
189
+ Built with ❤️ for the cybersecurity community
app.py ADDED
@@ -0,0 +1,772 @@
1
+ """
2
+ 🔐 CyberForge AI - ML Training & Inference Platform
3
+ Hugging Face Spaces deployment for training cybersecurity ML models
4
+ """
5
+
6
+ import gradio as gr
7
+ import pandas as pd
8
+ import numpy as np
9
+ import json
10
+ import os
11
+ import joblib
12
+ from pathlib import Path
13
+ from datetime import datetime
14
+ import logging
15
+ from typing import Dict, List, Any, Optional, Tuple
16
+ import asyncio
17
+
18
+ # ML Libraries
19
+ from sklearn.model_selection import train_test_split, cross_val_score
20
+ from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, IsolationForest
21
+ from sklearn.linear_model import LogisticRegression
22
+ from sklearn.preprocessing import StandardScaler, LabelEncoder
23
+ from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score
24
+ import torch
25
+ import torch.nn as nn
26
+ from transformers import AutoTokenizer, AutoModel
27
+
28
+ # Hugging Face Hub
29
+ from huggingface_hub import HfApi, hf_hub_download, upload_file, create_repo
30
+
31
+ logging.basicConfig(level=logging.INFO)
32
+ logger = logging.getLogger(__name__)
33
+
34
+ # ============================================================================
35
+ # CONFIGURATION
36
+ # ============================================================================
37
+
38
+ MODELS_DIR = Path("./trained_models")
39
+ MODELS_DIR.mkdir(exist_ok=True)
40
+
41
+ DATASETS_DIR = Path("./datasets")
42
+ DATASETS_DIR.mkdir(exist_ok=True)
43
+
44
+ # Model types available for training
45
+ MODEL_TYPES = {
46
+ "Random Forest": RandomForestClassifier,
47
+ "Gradient Boosting": GradientBoostingClassifier,
48
+ "Logistic Regression": LogisticRegression,
49
+ "Isolation Forest (Anomaly)": IsolationForest,
50
+ }
51
+
52
+ # Cybersecurity task categories
53
+ SECURITY_TASKS = [
54
+ "Malware Detection",
55
+ "Phishing Detection",
56
+ "Network Intrusion Detection",
57
+ "Anomaly Detection",
58
+ "Botnet Detection",
59
+ "Web Attack Detection",
60
+ "Spam Detection",
61
+ "Vulnerability Assessment",
62
+ "DNS Tunneling Detection",
63
+ "Cryptomining Detection",
64
+ ]
65
+
66
+ # ============================================================================
67
+ # MODEL REGISTRY
68
+ # ============================================================================
69
+
70
+ class ModelRegistry:
71
+ """Manages trained models and their metadata"""
72
+
73
+ def __init__(self):
74
+ self.models = {}
75
+ self.scalers = {}
76
+ self.metadata = {}
77
+ self.registry_file = MODELS_DIR / "registry.json"
78
+ self._load_registry()
79
+
80
+ def _load_registry(self):
81
+ """Load existing model registry"""
82
+ if self.registry_file.exists():
83
+ with open(self.registry_file, 'r') as f:
84
+ self.metadata = json.load(f)
85
+ else:
86
+ self.metadata = {}
87
+
88
+ def _save_registry(self):
89
+ """Save model registry"""
90
+ with open(self.registry_file, 'w') as f:
91
+ json.dump(self.metadata, f, indent=2, default=str)
92
+
93
+ def register_model(self, model_id: str, model, scaler, metrics: Dict):
94
+ """Register a trained model"""
95
+ self.models[model_id] = model
96
+ self.scalers[model_id] = scaler
97
+
98
+ # Save model and scaler
99
+ model_path = MODELS_DIR / f"{model_id}_model.pkl"
100
+ scaler_path = MODELS_DIR / f"{model_id}_scaler.pkl"
101
+
102
+ joblib.dump(model, model_path)
103
+ joblib.dump(scaler, scaler_path)
104
+
105
+ # Update metadata
106
+ self.metadata[model_id] = {
107
+ "created_at": datetime.now().isoformat(),
108
+ "metrics": metrics,
109
+ "model_path": str(model_path),
110
+ "scaler_path": str(scaler_path),
111
+ "status": "ready"
112
+ }
113
+ self._save_registry()
114
+
115
+ return model_id
116
+
117
+ def get_model(self, model_id: str):
118
+ """Load a model from registry"""
119
+ if model_id in self.models:
120
+ return self.models[model_id], self.scalers[model_id]
121
+
122
+ if model_id in self.metadata:
123
+ model = joblib.load(self.metadata[model_id]["model_path"])
124
+ scaler = joblib.load(self.metadata[model_id]["scaler_path"])
125
+ self.models[model_id] = model
126
+ self.scalers[model_id] = scaler
127
+ return model, scaler
128
+
129
+ return None, None
130
+
131
+ def list_models(self) -> List[Dict]:
132
+ """List all registered models"""
133
+ return [
134
+ {"id": k, **v} for k, v in self.metadata.items()
135
+ ]
136
+
137
+ # Global registry
138
+ model_registry = ModelRegistry()
139
+
140
+ # ============================================================================
141
+ # TRAINING FUNCTIONS
142
+ # ============================================================================
143
+
144
+ def prepare_dataset(file, task_type: str) -> Tuple[pd.DataFrame, str]:
145
+ """Load and prepare dataset for training"""
146
+ try:
147
+ if file is None:
148
+ return None, "No file uploaded"
149
+
150
+ # Load based on file type
151
+ if file.name.endswith('.csv'):
152
+ df = pd.read_csv(file.name)
153
+ elif file.name.endswith('.json'):
154
+ df = pd.read_json(file.name)
155
+ elif file.name.endswith('.parquet'):
156
+ df = pd.read_parquet(file.name)
157
+ else:
158
+ return None, f"Unsupported file format: {file.name}"
159
+
160
+ logger.info(f"Loaded dataset with shape: {df.shape}")
161
+ return df, f"✅ Loaded dataset with {len(df)} samples and {len(df.columns)} features"
162
+
163
+ except Exception as e:
164
+ logger.error(f"Error loading dataset: {e}")
165
+ return None, f"❌ Error: {str(e)}"
166
+
167
+
168
+ def train_model(
169
+ file,
170
+ task_type: str,
171
+ model_type: str,
172
+ target_column: str,
173
+ test_size: float,
174
+ model_name: str,
175
+ progress=gr.Progress()
176
+ ) -> Tuple[str, str, str]:
177
+ """Train a machine learning model"""
178
+ try:
179
+ progress(0, desc="Loading dataset...")
180
+
181
+ # Load dataset
182
+ df, msg = prepare_dataset(file, task_type)
183
+ if df is None:
184
+ return msg, "", ""
185
+
186
+ progress(0.1, desc="Preparing features...")
187
+
188
+ # Validate target column
189
+ if target_column not in df.columns:
190
+ return f"❌ Target column '{target_column}' not found in dataset. Available: {list(df.columns)}", "", ""
191
+
192
+ # Prepare features and target
193
+ X = df.drop(columns=[target_column])
194
+ y = df[target_column]
195
+
196
+ # Handle categorical features
197
+ for col in X.select_dtypes(include=['object', 'category']).columns:
198
+ le = LabelEncoder()
199
+ X[col] = le.fit_transform(X[col].astype(str))
200
+
201
+ # Handle target encoding
202
+ if y.dtype == 'object' or y.dtype.name == 'category':
203
+ le = LabelEncoder()
204
+ y = le.fit_transform(y.astype(str))
205
+
206
+ # Fill NaN values
207
+ X = X.fillna(0)
208
+
209
+ progress(0.2, desc="Splitting data...")
210
+
211
+ # Split data
212
+ X_train, X_test, y_train, y_test = train_test_split(
213
+ X, y, test_size=test_size, random_state=42
214
+ )
215
+
216
+ progress(0.3, desc="Scaling features...")
217
+
218
+ # Scale features
219
+ scaler = StandardScaler()
220
+ X_train_scaled = scaler.fit_transform(X_train)
221
+ X_test_scaled = scaler.transform(X_test)
222
+
223
+ progress(0.4, desc=f"Training {model_type}...")
224
+
225
+ # Get model class
226
+ if model_type not in MODEL_TYPES:
227
+ return f"❌ Unknown model type: {model_type}", "", ""
228
+
229
+ model_class = MODEL_TYPES[model_type]
230
+
231
+ # Configure and train model
232
+ if model_type == "Isolation Forest (Anomaly)":
233
+ model = model_class(contamination=0.1, random_state=42, n_estimators=100)
234
+ model.fit(X_train_scaled)
235
+ y_pred = model.predict(X_test_scaled)
236
+ y_pred = np.where(y_pred == -1, 1, 0) # Convert to binary
237
+ else:
238
+ model = model_class(random_state=42)
239
+ model.fit(X_train_scaled, y_train)
240
+ y_pred = model.predict(X_test_scaled)
241
+
242
+ progress(0.7, desc="Evaluating model...")
243
+
244
+ # Calculate metrics
245
+ accuracy = accuracy_score(y_test, y_pred)
246
+ f1 = f1_score(y_test, y_pred, average='weighted')
247
+
248
+ metrics = {
249
+ "accuracy": float(accuracy),
250
+ "f1_score": float(f1),
251
+ "model_type": model_type,
252
+ "task_type": task_type,
253
+ "samples": len(df),
254
+ "features": len(X.columns),
255
+ }
256
+
257
+ progress(0.85, desc="Saving model...")
258
+
259
+ # Generate model ID
260
+ model_id = f"{model_name}_{task_type.lower().replace(' ', '_')}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
261
+
262
+ # Register model
263
+ model_registry.register_model(model_id, model, scaler, metrics)
264
+
265
+ progress(1.0, desc="Complete!")
266
+
267
+ # Format results
268
+ training_log = f"""
269
+ ## 🎯 Training Complete!
270
+
271
+ **Model ID:** `{model_id}`
272
+ **Task:** {task_type}
273
+ **Model Type:** {model_type}
274
+
275
+ ### 📊 Dataset Info
276
+ - Samples: {len(df):,}
277
+ - Features: {len(X.columns)}
278
+ - Train/Test Split: {int((1-test_size)*100)}/{int(test_size*100)}
279
+
280
+ ### 📈 Metrics
281
+ - **Accuracy:** {accuracy:.4f} ({accuracy*100:.2f}%)
282
+ - **F1 Score:** {f1:.4f}
283
+
284
+ ### 💾 Model Saved
285
+ - Path: `{MODELS_DIR / f'{model_id}_model.pkl'}`
286
+ """
287
+
288
+ # Generate classification report
289
+ try:
290
+ report = classification_report(y_test, y_pred)
291
+ except Exception:
292
+ report = "Classification report not available for this model type"
293
+
294
+ return training_log, report, model_id
295
+
296
+ except Exception as e:
297
+ logger.error(f"Training error: {e}")
298
+ import traceback
299
+ return f"❌ Training failed: {str(e)}\n\n{traceback.format_exc()}", "", ""
300
+
301
+
302
+ def list_trained_models() -> str:
303
+ """List all trained models"""
304
+ models = model_registry.list_models()
305
+
306
+ if not models:
307
+ return "No models trained yet. Upload a dataset and train a model to get started!"
308
+
309
+ output = "## 🤖 Trained Models\n\n"
310
+ for model in models:
311
+ output += f"""
312
+ ### {model['id']}
313
+ - **Created:** {model.get('created_at', 'Unknown')}
314
+ - **Accuracy:** {model.get('metrics', {}).get('accuracy', 0):.4f}
315
+ - **F1 Score:** {model.get('metrics', {}).get('f1_score', 0):.4f}
316
+ - **Status:** {model.get('status', 'Unknown')}
317
+
318
+ ---
319
+ """
320
+ return output
321
+
322
+
323
+ def run_inference(model_id: str, input_data: str) -> str:
324
+ """Run inference on a trained model"""
325
+ try:
326
+ model, scaler = model_registry.get_model(model_id)
327
+
328
+ if model is None:
329
+ return f"❌ Model '{model_id}' not found"
330
+
331
+ # Parse input data (expect JSON format)
332
+ try:
333
+ data = json.loads(input_data)
334
+ if isinstance(data, dict):
335
+ data = [data]
336
+ df = pd.DataFrame(data)
337
+ except json.JSONDecodeError:
338
+ return "❌ Invalid JSON input. Please provide data in JSON format."
339
+
340
+ # Scale and predict
341
+ X_scaled = scaler.transform(df.fillna(0))
342
+ predictions = model.predict(X_scaled)
343
+
344
+ # Get probabilities if available
345
+ try:
346
+ probabilities = model.predict_proba(X_scaled)
347
+ results = []
348
+ for i, (pred, probs) in enumerate(zip(predictions, probabilities)):
349
+ results.append({
350
+ "sample": i,
351
+ "prediction": int(pred),
352
+ "confidence": float(max(probs)),
353
+ "probabilities": probs.tolist()
354
+ })
355
+ except Exception:
356
+ results = [{"sample": i, "prediction": int(p)} for i, p in enumerate(predictions)]
357
+
358
+ return json.dumps(results, indent=2)
359
+
360
+ except Exception as e:
361
+ logger.error(f"Inference error: {e}")
362
+ return f"❌ Inference failed: {str(e)}"
363
+
364
+
365
+ # ============================================================================
366
+ # HUGGING FACE INTEGRATION
367
+ # ============================================================================
368
+
369
+ def upload_model_to_hub(model_id: str, repo_id: str, hf_token: str) -> str:
370
+ """Upload a trained model to Hugging Face Hub"""
371
+ try:
372
+ if not hf_token:
373
+ return "❌ Hugging Face token required for upload"
374
+
375
+ model, scaler = model_registry.get_model(model_id)
376
+ if model is None:
377
+ return f"❌ Model '{model_id}' not found"
378
+
379
+ api = HfApi(token=hf_token)
380
+
381
+ # Create repo if it doesn't exist
382
+ try:
383
+ create_repo(repo_id, token=hf_token, repo_type="model", exist_ok=True)
384
+ except Exception as e:
385
+ logger.warning(f"Repo creation note: {e}")
386
+
387
+ # Upload model files
388
+ model_path = MODELS_DIR / f"{model_id}_model.pkl"
389
+ scaler_path = MODELS_DIR / f"{model_id}_scaler.pkl"
390
+
391
+ upload_file(
392
+ path_or_fileobj=str(model_path),
393
+ path_in_repo=f"{model_id}_model.pkl",
394
+ repo_id=repo_id,
395
+ token=hf_token,
396
+ repo_type="model"
397
+ )
398
+
399
+ upload_file(
400
+ path_or_fileobj=str(scaler_path),
401
+ path_in_repo=f"{model_id}_scaler.pkl",
402
+ repo_id=repo_id,
403
+ token=hf_token,
404
+ repo_type="model"
405
+ )
406
+
407
+ # Upload metadata
408
+ metadata = model_registry.metadata.get(model_id, {})
409
+ metadata_json = json.dumps(metadata, indent=2, default=str)
410
+
411
+ with open(MODELS_DIR / f"{model_id}_metadata.json", 'w') as f:
412
+ f.write(metadata_json)
413
+
414
+ upload_file(
415
+ path_or_fileobj=str(MODELS_DIR / f"{model_id}_metadata.json"),
416
+ path_in_repo=f"{model_id}_metadata.json",
417
+ repo_id=repo_id,
418
+ token=hf_token,
419
+ repo_type="model"
420
+ )
421
+
422
+ return f"""
423
+ ## ✅ Model Uploaded Successfully!
424
+
425
+ **Model ID:** `{model_id}`
426
+ **Repository:** `{repo_id}`
427
+ **URL:** https://huggingface.co/{repo_id}
428
+
429
+ ### Files Uploaded:
430
+ - `{model_id}_model.pkl`
431
+ - `{model_id}_scaler.pkl`
432
+ - `{model_id}_metadata.json`
433
+
434
+ You can now use this model from the Hub!
435
+ """
436
+
437
+ except Exception as e:
438
+ logger.error(f"Upload error: {e}")
439
+ return f"❌ Upload failed: {str(e)}"
440
+
441
+
442
+ def download_model_from_hub(repo_id: str, model_filename: str, hf_token: str) -> str:
443
+ """Download a model from Hugging Face Hub"""
444
+ try:
445
+ model_path = hf_hub_download(
446
+ repo_id=repo_id,
447
+ filename=model_filename,
448
+ token=hf_token if hf_token else None
449
+ )
450
+
451
+ # Also try to download scaler
452
+ scaler_filename = model_filename.replace("_model.pkl", "_scaler.pkl")
453
+ try:
454
+ scaler_path = hf_hub_download(
455
+ repo_id=repo_id,
456
+ filename=scaler_filename,
457
+ token=hf_token if hf_token else None
458
+ )
459
+ except Exception:
460
+ scaler_path = None
461
+
462
+ # Load and register
463
+ model = joblib.load(model_path)
464
+ scaler = joblib.load(scaler_path) if scaler_path else StandardScaler()
465
+
466
+ model_id = model_filename.replace("_model.pkl", "")
467
+ model_registry.models[model_id] = model
468
+ model_registry.scalers[model_id] = scaler
469
+
470
+ return f"""
471
+ ## ✅ Model Downloaded Successfully!
472
+
473
+ **Model ID:** `{model_id}`
474
+ **Source:** `{repo_id}`
475
+
476
+ The model is now available for inference.
477
+ """
478
+
479
+ except Exception as e:
480
+ logger.error(f"Download error: {e}")
481
+ return f"❌ Download failed: {str(e)}"
482
+
483
+
484
+ # ============================================================================
485
+ # API ENDPOINTS (For Backend Integration)
486
+ # ============================================================================
487
+
488
+ def api_predict(model_id: str, features: Dict) -> Dict:
489
+ """API endpoint for predictions"""
490
+ try:
491
+ model, scaler = model_registry.get_model(model_id)
492
+ if model is None:
493
+ return {"error": f"Model '{model_id}' not found"}
494
+
495
+ df = pd.DataFrame([features])
496
+ X_scaled = scaler.transform(df.fillna(0))
497
+ prediction = model.predict(X_scaled)[0]
498
+
499
+ try:
500
+ proba = model.predict_proba(X_scaled)[0]
501
+ confidence = float(max(proba))
502
+ except Exception:
503
+ confidence = None
504
+
505
+ return {
506
+ "model_id": model_id,
507
+ "prediction": int(prediction),
508
+ "confidence": confidence,
509
+ "timestamp": datetime.now().isoformat()
510
+ }
511
+ except Exception as e:
512
+ return {"error": str(e)}
513
+
514
+
515
+ def api_batch_predict(model_id: str, batch_data: List[Dict]) -> List[Dict]:
516
+ """API endpoint for batch predictions"""
517
+ results = []
518
+ for item in batch_data:
519
+ result = api_predict(model_id, item)
520
+ results.append(result)
521
+ return results
522
+
523
+
524
+ # ============================================================================
525
+ # GRADIO INTERFACE
526
+ # ============================================================================
527
+
528
+ # Custom CSS
529
+ custom_css = """
530
+ .gradio-container {
531
+ font-family: 'Inter', sans-serif;
532
+ }
533
+ .main-title {
534
+ text-align: center;
535
+ color: #1a1a2e;
536
+ margin-bottom: 20px;
537
+ }
538
+ .tab-content {
539
+ padding: 20px;
540
+ }
541
+ """
542
+
543
+ # Build interface
544
+ with gr.Blocks(css=custom_css, title="CyberForge AI - ML Training Platform") as demo:
545
+ gr.Markdown("""
546
+ # 🔐 CyberForge AI - ML Training Platform
547
+
548
+ **Train, Deploy, and Serve Cybersecurity ML Models**
549
+
550
+ This platform enables you to:
551
+ - 📊 Upload and train models on cybersecurity datasets
552
+ - 🚀 Deploy models to Hugging Face Hub
553
+ - 🔗 Integrate with your backend via API
554
+ - 🤖 Run inference on trained models
555
+ """)
556
+
557
+ with gr.Tabs():
558
+ # ==================== TRAINING TAB ====================
559
+ with gr.TabItem("🎯 Train Model"):
560
+ with gr.Row():
561
+ with gr.Column(scale=1):
562
+ gr.Markdown("### Dataset Configuration")
563
+
564
+ train_file = gr.File(
565
+ label="Upload Dataset (CSV, JSON, or Parquet)",
566
+ file_types=[".csv", ".json", ".parquet"]
567
+ )
568
+
569
+ task_type = gr.Dropdown(
570
+ choices=SECURITY_TASKS,
571
+ value="Malware Detection",
572
+ label="Security Task Type"
573
+ )
574
+
575
+ model_type = gr.Dropdown(
576
+ choices=list(MODEL_TYPES.keys()),
577
+ value="Random Forest",
578
+ label="Model Type"
579
+ )
580
+
581
+ target_column = gr.Textbox(
582
+ label="Target Column Name",
583
+ placeholder="e.g., 'label', 'is_malicious', 'attack_type'"
584
+ )
585
+
586
+ test_size = gr.Slider(
587
+ minimum=0.1,
588
+ maximum=0.4,
589
+ value=0.2,
590
+ step=0.05,
591
+ label="Test Size"
592
+ )
593
+
594
+ model_name = gr.Textbox(
595
+ label="Model Name",
596
+ placeholder="e.g., 'malware_detector_v1'",
597
+ value="cyberforge_model"
598
+ )
599
+
600
+ train_btn = gr.Button("🚀 Train Model", variant="primary")
601
+
602
+ with gr.Column(scale=1):
603
+ gr.Markdown("### Training Results")
604
+ training_output = gr.Markdown()
605
+ classification_report_output = gr.Textbox(
606
+ label="Classification Report",
607
+ lines=10
608
+ )
609
+ trained_model_id = gr.Textbox(
610
+ label="Trained Model ID",
611
+ interactive=False
612
+ )
613
+
614
+ train_btn.click(
615
+ fn=train_model,
616
+ inputs=[train_file, task_type, model_type, target_column, test_size, model_name],
617
+ outputs=[training_output, classification_report_output, trained_model_id]
618
+ )
619
+
620
+ # ==================== INFERENCE TAB ====================
621
+ with gr.TabItem("🔮 Run Inference"):
622
+ with gr.Row():
623
+ with gr.Column():
624
+ inference_model_id = gr.Textbox(
625
+ label="Model ID",
626
+ placeholder="Enter the model ID to use"
627
+ )
628
+
629
+ inference_input = gr.Textbox(
630
+ label="Input Data (JSON format)",
631
+ placeholder='[{"feature1": 0.5, "feature2": 1.2, ...}]',
632
+ lines=5
633
+ )
634
+
635
+ inference_btn = gr.Button("🔮 Run Inference", variant="primary")
636
+
637
+ with gr.Column():
638
+ inference_output = gr.Textbox(
639
+ label="Predictions",
640
+ lines=10
641
+ )
642
+
643
+ inference_btn.click(
644
+ fn=run_inference,
645
+ inputs=[inference_model_id, inference_input],
646
+ outputs=[inference_output]
647
+ )
648
+
649
+ # ==================== MODELS TAB ====================
650
+ with gr.TabItem("🤖 Models"):
651
+ gr.Markdown("### Trained Models")
652
+
653
+ refresh_btn = gr.Button("🔄 Refresh Models List")
654
+ models_list = gr.Markdown()
655
+
656
+ refresh_btn.click(
657
+ fn=list_trained_models,
658
+ outputs=[models_list]
659
+ )
660
+
661
+ # Auto-refresh on load
662
+ demo.load(
663
+ fn=list_trained_models,
664
+ outputs=[models_list]
665
+ )
666
+
667
+ # ==================== HUB TAB ====================
668
+ with gr.TabItem("☁️ Hugging Face Hub"):
669
+ gr.Markdown("### Upload & Download Models")
670
+
671
+ with gr.Row():
672
+ with gr.Column():
673
+ gr.Markdown("#### Upload to Hub")
674
+ upload_model_id = gr.Textbox(
675
+ label="Model ID to Upload"
676
+ )
677
+ upload_repo_id = gr.Textbox(
678
+ label="Hub Repository ID",
679
+ placeholder="username/repo-name"
680
+ )
681
+ upload_token = gr.Textbox(
682
+ label="Hugging Face Token",
683
+ type="password"
684
+ )
685
+ upload_btn = gr.Button("⬆️ Upload Model", variant="primary")
686
+ upload_result = gr.Markdown()
687
+
688
+ with gr.Column():
689
+ gr.Markdown("#### Download from Hub")
690
+ download_repo_id = gr.Textbox(
691
+ label="Hub Repository ID",
692
+ placeholder="username/repo-name"
693
+ )
694
+ download_filename = gr.Textbox(
695
+ label="Model Filename",
696
+ placeholder="model_name_model.pkl"
697
+ )
698
+ download_token = gr.Textbox(
699
+ label="Hugging Face Token (optional)",
700
+ type="password"
701
+ )
702
+ download_btn = gr.Button("⬇️ Download Model", variant="secondary")
703
+ download_result = gr.Markdown()
704
+
705
+ upload_btn.click(
706
+ fn=upload_model_to_hub,
707
+ inputs=[upload_model_id, upload_repo_id, upload_token],
708
+ outputs=[upload_result]
709
+ )
710
+
711
+ download_btn.click(
712
+ fn=download_model_from_hub,
713
+ inputs=[download_repo_id, download_filename, download_token],
714
+ outputs=[download_result]
715
+ )
716
+
717
+ # ==================== API TAB ====================
718
+ with gr.TabItem("🔗 API Integration"):
719
+ gr.Markdown("""
720
+ ### API Integration Guide
721
+
722
+ Your backend can integrate with this Space using the Gradio Client library or direct API calls.
723
+
724
+ #### Python Client Example:
725
+
726
+ ```python
727
+ from gradio_client import Client
728
+
729
+ # Connect to your Space
730
+ client = Client("Che237/cyberforge")
731
+
732
+ # Run inference
733
+ result = client.predict(
734
+ model_id="your_model_id",
735
+ input_data='[{"feature1": 0.5, "feature2": 1.2}]',
736
+ api_name="/run_inference"
737
+ )
738
+ print(result)
739
+ ```
740
+
741
+ #### API Endpoints:
742
+
743
+ | Endpoint | Description |
744
+ |----------|-------------|
745
+ | `/train_model` | Train a new model |
746
+ | `/run_inference` | Run predictions |
747
+ | `/list_trained_models` | List available models |
748
+ | `/upload_model_to_hub` | Upload model to Hub |
749
+
750
+ #### Backend Integration (Node.js):
751
+
752
+ ```javascript
753
+ const { Client } = require("@gradio/client");
754
+
755
+ async function runPrediction(modelId, features) {
756
+ const client = await Client.connect("Che237/cyberforge");
757
+ const result = await client.predict("/run_inference", {
758
+ model_id: modelId,
759
+ input_data: JSON.stringify([features])
760
+ });
761
+ return JSON.parse(result.data);
762
+ }
763
+ ```
764
+ """)
765
+
766
+ # Launch the demo
767
+ if __name__ == "__main__":
768
+ demo.launch(
769
+ server_name="0.0.0.0",
770
+ server_port=7860,
771
+ share=False
772
+ )
hf_client.py ADDED
@@ -0,0 +1,436 @@
+ """
+ CyberForge AI - Hugging Face API Client
+ Backend integration for fetching models and running inference from Hugging Face Spaces
+ """
+
+ import os
+ import json
+ import logging
+ import asyncio
+ from typing import Dict, List, Any, Optional
+ from datetime import datetime
+ import httpx
+ from pathlib import Path
+
+ try:
+     from gradio_client import Client
+     GRADIO_CLIENT_AVAILABLE = True
+ except ImportError:
+     GRADIO_CLIENT_AVAILABLE = False
+
+ try:
+     from huggingface_hub import HfApi, hf_hub_download, InferenceClient
+     HF_HUB_AVAILABLE = True
+ except ImportError:
+     HF_HUB_AVAILABLE = False
+
+ logger = logging.getLogger(__name__)
+
+
+ class HuggingFaceClient:
+     """
+     Client for interacting with the CyberForge AI Hugging Face Space
+     Provides model inference, training requests, and model management
+     """
+
+     def __init__(
+         self,
+         space_id: str = "Che237/cyberforge",
+         hf_token: Optional[str] = None,
+         models_repo: Optional[str] = None
+     ):
+         self.space_id = space_id
+         self.hf_token = hf_token or os.getenv("HF_TOKEN")
+         self.models_repo = models_repo or f"{space_id.split('/')[0]}/cyberforge-models"
+         self.space_url = f"https://{space_id.replace('/', '-')}.hf.space"
+
+         self._client = None
+         self._hf_api = None
+         self._inference_client = None
+
+         # Local model cache
+         self.models_cache_dir = Path("./models_cache")
+         self.models_cache_dir.mkdir(exist_ok=True)
+
+         # Initialize clients
+         self._init_clients()
+
+     def _init_clients(self):
+         """Initialize Hugging Face and Gradio clients"""
+         try:
+             if GRADIO_CLIENT_AVAILABLE:
+                 self._client = Client(self.space_id, hf_token=self.hf_token)
+                 logger.info(f"✅ Connected to Gradio Space: {self.space_id}")
+         except Exception as e:
+             logger.warning(f"Could not connect to Gradio Space: {e}")
+
+         try:
+             if HF_HUB_AVAILABLE:
+                 self._hf_api = HfApi(token=self.hf_token)
+                 logger.info("✅ Connected to Hugging Face Hub API")
+         except Exception as e:
+             logger.warning(f"Could not connect to HF Hub API: {e}")
+
+     # =========================================================================
+     # INFERENCE METHODS
+     # =========================================================================
+
+     async def predict(
+         self,
+         model_id: str,
+         features: Dict[str, Any],
+         timeout: float = 30.0
+     ) -> Dict[str, Any]:
+         """
+         Run inference on a model deployed in the Space
+
+         Args:
+             model_id: ID of the trained model
+             features: Dictionary of feature values
+             timeout: Request timeout in seconds
+
+         Returns:
+             Prediction result with confidence scores
+         """
+         try:
+             if self._client:
+                 # Use Gradio client
+                 result = self._client.predict(
+                     model_id,
+                     json.dumps([features]),
+                     api_name="/run_inference"
+                 )
+                 return json.loads(result)
+             else:
+                 # Fall back to HTTP API
+                 return await self._http_predict(model_id, features, timeout)
+
+         except Exception as e:
+             logger.error(f"Prediction failed: {e}")
+             return {"error": str(e), "model_id": model_id}
+
+     async def batch_predict(
+         self,
+         model_id: str,
+         batch_features: List[Dict[str, Any]],
+         timeout: float = 60.0
+     ) -> List[Dict[str, Any]]:
+         """
+         Run batch inference on multiple samples
+
+         Args:
+             model_id: ID of the trained model
+             batch_features: List of feature dictionaries
+             timeout: Request timeout in seconds
+
+         Returns:
+             List of prediction results
+         """
+         try:
+             if self._client:
+                 result = self._client.predict(
+                     model_id,
+                     json.dumps(batch_features),
+                     api_name="/run_inference"
+                 )
+                 return json.loads(result)
+             else:
+                 return await self._http_batch_predict(model_id, batch_features, timeout)
+
+         except Exception as e:
+             logger.error(f"Batch prediction failed: {e}")
+             return [{"error": str(e)} for _ in batch_features]
+
+     async def _http_predict(
+         self,
+         model_id: str,
+         features: Dict[str, Any],
+         timeout: float
+     ) -> Dict[str, Any]:
+         """HTTP fallback for predictions"""
+         async with httpx.AsyncClient(timeout=timeout) as client:
+             response = await client.post(
+                 f"{self.space_url}/api/predict",
+                 json={
+                     "data": [model_id, json.dumps([features])],
+                     "fn_index": 1  # Index of run_inference function
+                 }
+             )
+             response.raise_for_status()
+             result = response.json()
+             return json.loads(result.get("data", [{}])[0])
+
+     async def _http_batch_predict(
+         self,
+         model_id: str,
+         batch_features: List[Dict[str, Any]],
+         timeout: float
+     ) -> List[Dict[str, Any]]:
+         """HTTP fallback for batch predictions"""
+         async with httpx.AsyncClient(timeout=timeout) as client:
+             response = await client.post(
+                 f"{self.space_url}/api/predict",
+                 json={
+                     "data": [model_id, json.dumps(batch_features)],
+                     "fn_index": 1
+                 }
+             )
+             response.raise_for_status()
+             result = response.json()
+             return json.loads(result.get("data", [{}])[0])
+
+     # =========================================================================
+     # MODEL MANAGEMENT
+     # =========================================================================
+
+     async def list_models(self) -> List[Dict[str, Any]]:
+         """Get list of available trained models"""
+         try:
+             if self._client:
+                 result = self._client.predict(api_name="/list_trained_models")
+                 return self._parse_models_list(result)
+             else:
+                 return await self._http_list_models()
+         except Exception as e:
+             logger.error(f"Failed to list models: {e}")
+             return []
+
+     def _parse_models_list(self, markdown_result: str) -> List[Dict[str, Any]]:
+         """Parse markdown model list into structured data"""
+         models = []
+         current_model = {}
+
+         for line in markdown_result.split('\n'):
+             if line.startswith('### '):
+                 if current_model:
+                     models.append(current_model)
+                 current_model = {"id": line.replace('### ', '').strip()}
+             elif '**Created:**' in line:
+                 current_model["created_at"] = line.split('**Created:**')[1].strip()
+             elif '**Accuracy:**' in line:
+                 try:
+                     current_model["accuracy"] = float(line.split('**Accuracy:**')[1].strip())
+                 except ValueError:
+                     pass
+             elif '**F1 Score:**' in line:
+                 try:
+                     current_model["f1_score"] = float(line.split('**F1 Score:**')[1].strip())
+                 except ValueError:
+                     pass
+             elif '**Status:**' in line:
+                 current_model["status"] = line.split('**Status:**')[1].strip()
+
+         if current_model:
+             models.append(current_model)
+
+         return models
+
+     async def _http_list_models(self) -> List[Dict[str, Any]]:
+         """HTTP fallback for listing models"""
+         async with httpx.AsyncClient(timeout=30.0) as client:
+             response = await client.post(
+                 f"{self.space_url}/api/predict",
+                 json={"fn_index": 2}  # Index of list_trained_models
+             )
+             response.raise_for_status()
+             result = response.json()
+             return self._parse_models_list(result.get("data", [""])[0])
+
+     async def download_model(
+         self,
+         model_id: str,
+         local_path: Optional[str] = None
+     ) -> str:
+         """
+         Download a trained model from Hugging Face Hub
+
+         Args:
+             model_id: Model identifier
+             local_path: Optional local path to save model
+
+         Returns:
+             Path to downloaded model
+         """
+         try:
+             if not HF_HUB_AVAILABLE:
+                 raise ImportError("huggingface_hub not installed")
+
+             model_filename = f"{model_id}_model.pkl"
+             scaler_filename = f"{model_id}_scaler.pkl"
+
+             model_path = hf_hub_download(
+                 repo_id=self.models_repo,
+                 filename=model_filename,
+                 token=self.hf_token,
+                 cache_dir=str(self.models_cache_dir)
+             )
+
+             try:
+                 scaler_path = hf_hub_download(
+                     repo_id=self.models_repo,
+                     filename=scaler_filename,
+                     token=self.hf_token,
+                     cache_dir=str(self.models_cache_dir)
+                 )
+             except Exception:
+                 # Not every model ships a scaler; treat it as optional
+                 scaler_path = None
+
+             logger.info(f"✅ Downloaded model: {model_id}")
+             return model_path
+
+         except Exception as e:
+             logger.error(f"Failed to download model: {e}")
+             raise
+
+     # =========================================================================
+     # TRAINING REQUESTS
+     # =========================================================================
+
+     async def request_training(
+         self,
+         dataset_url: str,
+         task_type: str,
+         model_type: str,
+         target_column: str,
+         model_name: str,
+         test_size: float = 0.2,
+         callback_url: Optional[str] = None
+     ) -> Dict[str, Any]:
+         """
+         Request model training on the Space
+
+         Args:
+             dataset_url: URL to download dataset
+             task_type: Type of security task
+             model_type: ML model type
+             target_column: Target column name
+             model_name: Name for trained model
+             test_size: Test split ratio
+             callback_url: Optional webhook for training completion
+
+         Returns:
+             Training job status
+         """
+         try:
+             # Note: This would need custom implementation in the Space
+             # to support remote dataset URLs and callbacks
+             logger.info(f"Requesting training for {model_name}")
+
+             return {
+                 "status": "submitted",
+                 "model_name": model_name,
+                 "task_type": task_type,
+                 "message": "Training request submitted. Check Space for status."
+             }
+
+         except Exception as e:
+             logger.error(f"Training request failed: {e}")
+             return {"error": str(e)}
+
+     # =========================================================================
+     # HEALTH & STATUS
+     # =========================================================================
+
+     async def health_check(self) -> Dict[str, Any]:
+         """Check if the Space is healthy and responsive"""
+         try:
+             async with httpx.AsyncClient(timeout=10.0) as client:
+                 response = await client.get(f"{self.space_url}")
+
+             return {
+                 "status": "healthy" if response.status_code == 200 else "unhealthy",
+                 "space_id": self.space_id,
+                 "url": self.space_url,
+                 "response_code": response.status_code,
+                 "timestamp": datetime.now().isoformat()
+             }
+         except Exception as e:
+             return {
+                 "status": "error",
+                 "error": str(e),
+                 "space_id": self.space_id,
+                 "timestamp": datetime.now().isoformat()
+             }
+
+     async def get_space_info(self) -> Dict[str, Any]:
+         """Get information about the Space"""
+         try:
+             if HF_HUB_AVAILABLE and self._hf_api:
+                 info = self._hf_api.space_info(self.space_id)
+                 return {
+                     "id": info.id,
+                     "author": info.author,
+                     "sdk": info.sdk,
+                     "status": info.runtime.stage if info.runtime else "unknown",
+                     "hardware": info.runtime.hardware if info.runtime else "unknown",
+                 }
+             return {"space_id": self.space_id}
+         except Exception as e:
+             return {"error": str(e)}
+
+
+ # ============================================================================
+ # CONVENIENCE FUNCTIONS FOR BACKEND
+ # ============================================================================
+
+ # Global client instance
+ _hf_client: Optional[HuggingFaceClient] = None
+
+
+ def get_hf_client() -> HuggingFaceClient:
+     """Get or create the global HF client"""
+     global _hf_client
+     if _hf_client is None:
+         _hf_client = HuggingFaceClient()
+     return _hf_client
+
+
+ async def predict_threat(model_id: str, features: Dict[str, Any]) -> Dict[str, Any]:
+     """Convenience function for threat prediction"""
+     client = get_hf_client()
+     return await client.predict(model_id, features)
+
+
+ async def batch_predict_threats(
+     model_id: str,
+     batch_features: List[Dict[str, Any]]
+ ) -> List[Dict[str, Any]]:
+     """Convenience function for batch threat prediction"""
+     client = get_hf_client()
+     return await client.batch_predict(model_id, batch_features)
+
+
+ async def get_available_models() -> List[Dict[str, Any]]:
+     """Get list of available models"""
+     client = get_hf_client()
+     return await client.list_models()
+
+
+ # ============================================================================
+ # EXAMPLE USAGE
+ # ============================================================================
+
+ if __name__ == "__main__":
+     async def main():
+         # Initialize client
+         client = HuggingFaceClient(
+             space_id="Che237/cyberforge",
+             hf_token=os.getenv("HF_TOKEN")
+         )
+
+         # Health check
+         health = await client.health_check()
+         print(f"Health: {health}")
+
+         # List models
+         models = await client.list_models()
+         print(f"Available models: {models}")
+
+         # Example prediction
+         if models:
+             model_id = models[0]["id"]
+             features = {"feature1": 0.5, "feature2": 1.2, "feature3": 0.8}
+             result = await client.predict(model_id, features)
+             print(f"Prediction: {result}")
+
+     asyncio.run(main())
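`_parse_models_list` recovers structured records from the markdown the Space's `/list_trained_models` endpoint returns. The same logic can be exercised offline; this is a trimmed, standalone sketch (only the `id` and `accuracy` fields, with a hypothetical sample string), not the full method:

```python
def parse_models_markdown(markdown: str):
    """Parse a markdown model list of the form emitted by the Space into dicts."""
    models, current = [], {}
    for line in markdown.split("\n"):
        if line.startswith("### "):
            # A heading starts a new model entry
            if current:
                models.append(current)
            current = {"id": line[4:].strip()}
        elif "**Accuracy:**" in line and current:
            try:
                current["accuracy"] = float(line.split("**Accuracy:**")[1].strip())
            except ValueError:
                pass
    if current:
        models.append(current)
    return models

sample = "### malware_rf\n**Accuracy:** 0.97\n### phishing_gb\n**Accuracy:** 0.91"
print(parse_models_markdown(sample))
# → [{'id': 'malware_rf', 'accuracy': 0.97}, {'id': 'phishing_gb', 'accuracy': 0.91}]
```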
requirements.txt ADDED
@@ -0,0 +1,31 @@
+ # CyberForge AI - Hugging Face Space Requirements
+ # Core Gradio and web dependencies
+ gradio>=4.0.0
+ gradio_client>=0.7.0
+
+ # ML and Data Science
+ scikit-learn>=1.3.0
+ pandas>=2.1.0
+ numpy>=1.26.0
+ joblib>=1.3.0
+
+ # Deep Learning
+ torch>=2.0.0
+ transformers>=4.30.0
+
+ # Hugging Face Hub
+ huggingface_hub>=0.19.0
+
+ # Additional ML
+ xgboost>=1.7.0
+ imbalanced-learn>=0.11.0
+
+ # Visualization (optional)
+ matplotlib>=3.7.0
+ seaborn>=0.12.0
+ plotly>=5.15.0
+
+ # Utilities
+ python-dotenv>=1.0.0
+ aiofiles>=23.2.1
+ httpx>=0.25.0
trainer.py ADDED
@@ -0,0 +1,459 @@
+ """
+ Advanced Cybersecurity Model Trainer
+ Comprehensive training module for security ML models
+ """
+
+ import numpy as np
+ import pandas as pd
+ import torch
+ import torch.nn as nn
+ import torch.optim as optim
+ from torch.utils.data import DataLoader, TensorDataset
+ from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
+ from sklearn.ensemble import (
+     RandomForestClassifier,
+     GradientBoostingClassifier,
+     AdaBoostClassifier,
+     ExtraTreesClassifier,
+     VotingClassifier,
+     StackingClassifier
+ )
+ from sklearn.linear_model import LogisticRegression
+ from sklearn.svm import SVC
+ from sklearn.neural_network import MLPClassifier
+ from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
+ from sklearn.metrics import (
+     classification_report,
+     confusion_matrix,
+     roc_auc_score,
+     precision_recall_curve,
+     f1_score,
+     accuracy_score
+ )
+ from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
+ from sklearn.decomposition import PCA
+ import joblib
+ import json
+ from datetime import datetime
+ from pathlib import Path
+ import logging
+ from typing import Dict, List, Any, Optional, Tuple
+ import warnings
+ warnings.filterwarnings('ignore')
+
+ logger = logging.getLogger(__name__)
+
+
+ class CyberSecurityNeuralNet(nn.Module):
+     """Deep Neural Network for Cybersecurity Classification"""
+
+     def __init__(self, input_size: int, hidden_sizes: List[int], num_classes: int, dropout: float = 0.3):
+         super().__init__()
+
+         layers = []
+         prev_size = input_size
+
+         for hidden_size in hidden_sizes:
+             layers.extend([
+                 nn.Linear(prev_size, hidden_size),
+                 nn.BatchNorm1d(hidden_size),
+                 nn.ReLU(),
+                 nn.Dropout(dropout)
+             ])
+             prev_size = hidden_size
+
+         layers.append(nn.Linear(prev_size, num_classes))
+
+         self.network = nn.Sequential(*layers)
+
+     def forward(self, x):
+         return self.network(x)
+
+
+ class AdvancedSecurityTrainer:
+     """Advanced trainer for cybersecurity models with multiple algorithms"""
+
+     def __init__(self, models_dir: str = "./trained_models"):
+         self.models_dir = Path(models_dir)
+         self.models_dir.mkdir(exist_ok=True)
+         self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+         self.trained_models = {}
+         self.training_history = []
+
+     def preprocess_security_data(
+         self,
+         df: pd.DataFrame,
+         target_col: str,
+         feature_selection: bool = True,
+         n_features: int = 50
+     ) -> Tuple[np.ndarray, np.ndarray, StandardScaler, LabelEncoder, List[str]]:
+         """Preprocess security data with advanced feature engineering"""
+
+         # Separate features and target
+         X = df.drop(columns=[target_col])
+         y = df[target_col]
+
+         # Store original feature names
+         feature_names = list(X.columns)
+
+         # Handle categorical features
+         categorical_cols = X.select_dtypes(include=['object', 'category']).columns
+         for col in categorical_cols:
+             le = LabelEncoder()
+             X[col] = le.fit_transform(X[col].astype(str))
+
+         # Handle missing values
+         X = X.fillna(X.median())
+
+         # Encode target if categorical
+         label_encoder = LabelEncoder()
+         if y.dtype == 'object' or y.dtype.name == 'category':
+             y = label_encoder.fit_transform(y)
+         else:
+             y = y.values
+
+         # Scale features
+         scaler = StandardScaler()
+         X_scaled = scaler.fit_transform(X)
+
+         # Feature selection
+         if feature_selection and X_scaled.shape[1] > n_features:
+             selector = SelectKBest(mutual_info_classif, k=min(n_features, X_scaled.shape[1]))
+             X_scaled = selector.fit_transform(X_scaled, y)
+             selected_indices = selector.get_support(indices=True)
+             feature_names = [feature_names[i] for i in selected_indices]
+
+         return X_scaled, y, scaler, label_encoder, feature_names
+
+     def train_ensemble_model(
+         self,
+         X_train: np.ndarray,
+         y_train: np.ndarray,
+         X_test: np.ndarray,
+         y_test: np.ndarray,
+         model_name: str = "ensemble"
+     ) -> Tuple[Any, Dict[str, float]]:
+         """Train an ensemble of classifiers"""
+
+         logger.info("Training ensemble model...")
+
+         # Base estimators
+         estimators = [
+             ('rf', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
+             ('gb', GradientBoostingClassifier(n_estimators=100, random_state=42)),
+             ('et', ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
+         ]
+
+         # Voting classifier
+         voting_clf = VotingClassifier(estimators=estimators, voting='soft')
+         voting_clf.fit(X_train, y_train)
+
+         # Evaluate
+         y_pred = voting_clf.predict(X_test)
+         y_proba = voting_clf.predict_proba(X_test)
+
+         metrics = self._calculate_metrics(y_test, y_pred, y_proba)
+
+         # Save model
+         model_path = self.models_dir / f"{model_name}_ensemble.pkl"
+         joblib.dump(voting_clf, model_path)
+
+         logger.info(f"Ensemble model trained with accuracy: {metrics['accuracy']:.4f}")
+
+         return voting_clf, metrics
+
+     def train_stacking_model(
+         self,
+         X_train: np.ndarray,
+         y_train: np.ndarray,
+         X_test: np.ndarray,
+         y_test: np.ndarray,
+         model_name: str = "stacking"
+     ) -> Tuple[Any, Dict[str, float]]:
+         """Train a stacking classifier"""
+
+         logger.info("Training stacking model...")
+
+         # Base estimators
+         estimators = [
+             ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
+             ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42)),
+             ('svm', SVC(probability=True, random_state=42)),
+         ]
+
+         # Stacking classifier with logistic regression meta-learner
+         stacking_clf = StackingClassifier(
+             estimators=estimators,
+             final_estimator=LogisticRegression(random_state=42),
+             cv=3
+         )
+         stacking_clf.fit(X_train, y_train)
+
+         # Evaluate
+         y_pred = stacking_clf.predict(X_test)
+         y_proba = stacking_clf.predict_proba(X_test)
+
+         metrics = self._calculate_metrics(y_test, y_pred, y_proba)
+
+         # Save model
+         model_path = self.models_dir / f"{model_name}_stacking.pkl"
+         joblib.dump(stacking_clf, model_path)
+
+         logger.info(f"Stacking model trained with accuracy: {metrics['accuracy']:.4f}")
+
+         return stacking_clf, metrics
+
+     def train_neural_network(
+         self,
+         X_train: np.ndarray,
+         y_train: np.ndarray,
+         X_test: np.ndarray,
+         y_test: np.ndarray,
+         hidden_sizes: List[int] = [256, 128, 64],
+         epochs: int = 100,
+         batch_size: int = 32,
+         learning_rate: float = 0.001,
+         model_name: str = "neural_net"
+     ) -> Tuple[nn.Module, Dict[str, float]]:
+         """Train a deep neural network"""
+
+         logger.info(f"Training neural network on {self.device}...")
+
+         # Convert to tensors
+         X_train_tensor = torch.FloatTensor(X_train).to(self.device)
+         y_train_tensor = torch.LongTensor(y_train).to(self.device)
+         X_test_tensor = torch.FloatTensor(X_test).to(self.device)
+         y_test_tensor = torch.LongTensor(y_test).to(self.device)
+
+         # Create data loader
+         train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
+         train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
+
+         # Initialize model
+         num_classes = len(np.unique(y_train))
+         model = CyberSecurityNeuralNet(
+             input_size=X_train.shape[1],
+             hidden_sizes=hidden_sizes,
+             num_classes=num_classes
+         ).to(self.device)
+
+         # Loss and optimizer
+         criterion = nn.CrossEntropyLoss()
+         optimizer = optim.Adam(model.parameters(), lr=learning_rate)
+         scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10)
+
+         # Training loop
+         best_accuracy = 0
+         for epoch in range(epochs):
+             model.train()
+             total_loss = 0
+
+             for batch_X, batch_y in train_loader:
+                 optimizer.zero_grad()
+                 outputs = model(batch_X)
+                 loss = criterion(outputs, batch_y)
+                 loss.backward()
+                 optimizer.step()
+                 total_loss += loss.item()
+
+             # Validation
+             model.eval()
+             with torch.no_grad():
+                 test_outputs = model(X_test_tensor)
+                 test_loss = criterion(test_outputs, y_test_tensor)
+                 _, predicted = torch.max(test_outputs, 1)
+                 accuracy = (predicted == y_test_tensor).float().mean().item()
+
+             scheduler.step(test_loss)
+
+             if accuracy > best_accuracy:
+                 best_accuracy = accuracy
+                 torch.save(model.state_dict(), self.models_dir / f"{model_name}_nn_best.pt")
+
+             if (epoch + 1) % 20 == 0:
+                 logger.info(f"Epoch [{epoch+1}/{epochs}], Loss: {total_loss/len(train_loader):.4f}, Accuracy: {accuracy:.4f}")
+
+         # Final evaluation
+         model.eval()
+         with torch.no_grad():
+             outputs = model(X_test_tensor)
+             _, y_pred = torch.max(outputs, 1)
+             y_pred = y_pred.cpu().numpy()
+             y_proba = torch.softmax(outputs, dim=1).cpu().numpy()
+
+         metrics = self._calculate_metrics(y_test, y_pred, y_proba)
+
+         logger.info(f"Neural network trained with accuracy: {metrics['accuracy']:.4f}")
+
+         return model, metrics
+
+     def train_all_models(
+         self,
+         df: pd.DataFrame,
+         target_col: str,
+         model_name: str,
+         test_size: float = 0.2
+     ) -> Dict[str, Any]:
+         """Train all available model types and return best performing"""
+
+         logger.info(f"Starting comprehensive training for {model_name}...")
+
+         # Preprocess data
+         X, y, scaler, label_encoder, feature_names = self.preprocess_security_data(df, target_col)
+
+         # Split data
+         X_train, X_test, y_train, y_test = train_test_split(
+             X, y, test_size=test_size, random_state=42, stratify=y
+         )
+
+         results = {}
+
+         # Train individual models
+         models_to_train = [
+             ("random_forest", RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
+             ("gradient_boosting", GradientBoostingClassifier(n_estimators=100, random_state=42)),
+             ("extra_trees", ExtraTreesClassifier(n_estimators=100, random_state=42, n_jobs=-1)),
+             ("logistic_regression", LogisticRegression(random_state=42, max_iter=1000)),
+             ("mlp", MLPClassifier(hidden_layer_sizes=(128, 64), random_state=42, max_iter=500)),
+         ]
+
+         for name, model in models_to_train:
+             try:
+                 logger.info(f"Training {name}...")
+                 model.fit(X_train, y_train)
+                 y_pred = model.predict(X_test)
+                 y_proba = model.predict_proba(X_test) if hasattr(model, 'predict_proba') else None
+
+                 metrics = self._calculate_metrics(y_test, y_pred, y_proba)
+                 results[name] = {
+                     "model": model,
+                     "metrics": metrics
+                 }
+
+                 # Save model
+                 model_path = self.models_dir / f"{model_name}_{name}.pkl"
+                 joblib.dump(model, model_path)
+
+             except Exception as e:
+                 logger.error(f"Failed to train {name}: {e}")
+                 results[name] = {"error": str(e)}
+
+         # Train ensemble
+         try:
+             ensemble_model, ensemble_metrics = self.train_ensemble_model(
+                 X_train, y_train, X_test, y_test, model_name
+             )
+             results["ensemble"] = {
+                 "model": ensemble_model,
+                 "metrics": ensemble_metrics
+             }
+         except Exception as e:
+             logger.error(f"Failed to train ensemble: {e}")
+
+         # Train stacking
+         try:
+             stacking_model, stacking_metrics = self.train_stacking_model(
+                 X_train, y_train, X_test, y_test, model_name
+             )
+             results["stacking"] = {
+                 "model": stacking_model,
+                 "metrics": stacking_metrics
+             }
+         except Exception as e:
+             logger.error(f"Failed to train stacking: {e}")
+
+         # Find best model
+         best_model_name = None
+         best_accuracy = 0
+         for name, result in results.items():
+             if "metrics" in result and result["metrics"]["accuracy"] > best_accuracy:
+                 best_accuracy = result["metrics"]["accuracy"]
+                 best_model_name = name
+
+         # Save preprocessing artifacts
+         joblib.dump(scaler, self.models_dir / f"{model_name}_scaler.pkl")
+         joblib.dump(label_encoder, self.models_dir / f"{model_name}_label_encoder.pkl")
+
+         # Save metadata
+         metadata = {
+             "model_name": model_name,
+             "target_column": target_col,
+             "feature_names": feature_names,
+             "num_features": len(feature_names),
+             "num_samples": len(df),
+             "num_classes": len(np.unique(y)),
+             "best_model": best_model_name,
+             "best_accuracy": best_accuracy,
+             "all_results": {
+                 name: result.get("metrics", {"error": result.get("error")})
+                 for name, result in results.items()
+             },
+             "created_at": datetime.now().isoformat()
+         }
+
+         with open(self.models_dir / f"{model_name}_metadata.json", 'w') as f:
+             json.dump(metadata, f, indent=2)
+
+         logger.info(f"Training complete. Best model: {best_model_name} with accuracy: {best_accuracy:.4f}")
+
+         return {
+             "results": results,
+             "metadata": metadata,
+             "scaler": scaler,
+             "label_encoder": label_encoder,
+             "feature_names": feature_names
+         }
+
+     def _calculate_metrics(
+         self,
+         y_true: np.ndarray,
+         y_pred: np.ndarray,
+         y_proba: Optional[np.ndarray] = None
+     ) -> Dict[str, float]:
+         """Calculate comprehensive metrics"""
+
+         metrics = {
+             "accuracy": float(accuracy_score(y_true, y_pred)),
+             "f1_weighted": float(f1_score(y_true, y_pred, average='weighted')),
+             "f1_macro": float(f1_score(y_true, y_pred, average='macro')),
+         }
+
+         # ROC AUC for binary or multi-class
+         if y_proba is not None:
+             try:
+                 if len(np.unique(y_true)) == 2:
+                     metrics["roc_auc"] = float(roc_auc_score(y_true, y_proba[:, 1]))
+                 else:
+                     metrics["roc_auc"] = float(roc_auc_score(y_true, y_proba, multi_class='ovr'))
+             except Exception:
+                 pass
+
+         return metrics
+
+
+ # Convenience function for Gradio interface
+ def train_comprehensive_model(
+     file_path: str,
+     target_column: str,
+     model_name: str,
+     test_size: float = 0.2
+ ) -> Dict[str, Any]:
+     """Train comprehensive models from file path"""
+
+     # Load dataset
+     if file_path.endswith('.csv'):
+         df = pd.read_csv(file_path)
+     elif file_path.endswith('.json'):
+         df = pd.read_json(file_path)
+     elif file_path.endswith('.parquet'):
+         df = pd.read_parquet(file_path)
+     else:
+         raise ValueError(f"Unsupported file format: {file_path}")
+
+     # Initialize trainer
+     trainer = AdvancedSecurityTrainer()
+
+     # Train all models
+     results = trainer.train_all_models(df, target_column, model_name, test_size)
+
+     return results
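`train_all_models` writes a `{model_name}_metadata.json` file summarizing every run; a backend can pick the deployment candidate from it without unpickling any model. A sketch of that selection (the `metadata` dict is a hypothetical example matching the keys written above; in practice it would come from `json.load()`):

```python
# Hypothetical metadata mirroring the structure train_all_models writes;
# failed runs carry an "error" key instead of metrics.
metadata = {
    "best_model": "ensemble",
    "all_results": {
        "random_forest": {"accuracy": 0.94, "f1_weighted": 0.93},
        "mlp": {"error": "did not converge"},
        "ensemble": {"accuracy": 0.96, "f1_weighted": 0.95},
    },
}

# Keep only runs that produced metrics, then take the highest accuracy
scored = {n: m["accuracy"] for n, m in metadata["all_results"].items() if "accuracy" in m}
best = max(scored, key=scored.get)
print(best)  # → ensemble
```

This matches the "find best model" loop in `train_all_models`, so `best` agrees with the stored `best_model` field.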