# CyberForge ML Notebooks

A production-ready ML pipeline for the CyberForge cybersecurity AI system.

## Notebook Structure

| # | Notebook | Purpose | Key Outputs |
|---|----------|---------|-------------|
| 00 | [environment_setup](00_environment_setup.ipynb) | Environment validation, dependencies | System readiness report |
| 01 | [data_acquisition](01_data_acquisition.ipynb) | Data collection from WebScraper API, HF | Normalized datasets |
| 02 | [feature_engineering](02_feature_engineering.ipynb) | URL, network, security feature extraction | Feature-engineered data |
| 03 | [model_training](03_model_training.ipynb) | Train detection models | Trained .pkl models |
| 04 | [agent_intelligence](04_agent_intelligence.ipynb) | Decision scoring, Gemini integration | Agent module |
| 05 | [model_validation](05_model_validation.ipynb) | Performance, edge case testing | Validation report |
| 06 | [backend_integration](06_backend_integration.ipynb) | API packaging, serialization | Backend package |
| 07 | [deployment_artifacts](07_deployment_artifacts.ipynb) | Docker, HF upload, documentation | Deployment package |

## Quick Start

1. **Configure environment:**
   ```bash
   cd ml-services
   # Ensure notebook_config.json has your API keys
   ```

2. **Run notebooks in order:**
   ```bash
   jupyter notebook notebooks/00_environment_setup.ipynb
   ```

3. **Or execute all notebooks in place (in order):**
   ```bash
   jupyter nbconvert --to notebook --execute --inplace notebooks/*.ipynb
   ```
   Without `--inplace`, nbconvert writes `*.nbconvert.ipynb` copies instead of updating the originals.

## Configuration

All notebooks use `../notebook_config.json` for configuration:

```json
{
  "datasets_dir": "../datasets",
  "hf_repo": "Che237/cyberforge-models",
  "gemini_api_key": "",
  "webscraper_api_key": "your_key"
}
```
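
As a reference, here is a minimal sketch of how a notebook might load and sanity-check this file. The `load_notebook_config` helper and its required-key check are illustrative, not part of the notebooks themselves:

```python
import json
from pathlib import Path

# Keys every notebook expects to find in notebook_config.json
REQUIRED_KEYS = {"datasets_dir", "hf_repo", "gemini_api_key", "webscraper_api_key"}

def load_notebook_config(path="../notebook_config.json") -> dict:
    """Read the shared notebook configuration and fail loudly on missing keys."""
    config = json.loads(Path(path).read_text())
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise KeyError(f"notebook_config.json is missing keys: {sorted(missing)}")
    return config
```

Failing fast on missing keys at the top of each notebook avoids confusing errors several cells later.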

## Output Directories

After running all notebooks:

```
ml-services/
β”œβ”€β”€ datasets/
β”‚   β”œβ”€β”€ processed/       # Cleaned datasets
β”‚   └── features/        # Feature-engineered data
β”œβ”€β”€ models/              # Trained models
β”‚   β”œβ”€β”€ phishing_detection/
β”‚   β”œβ”€β”€ malware_detection/
β”‚   └── model_registry.json
β”œβ”€β”€ agent/               # Agent intelligence module
β”œβ”€β”€ validation/          # Validation reports
β”œβ”€β”€ backend_package/     # Backend integration files
└── deployment/          # Deployment artifacts
```
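
Once notebook 03 has populated `models/`, downstream code can resolve models through `model_registry.json`. A sketch, assuming the registry maps a task name to a relative `.pkl` path (the exact schema depends on what notebook 03 writes):

```python
import json
import pickle
from pathlib import Path

def load_model(task: str, models_dir="models"):
    """Look up a task in model_registry.json and unpickle the trained model."""
    models_dir = Path(models_dir)
    registry = json.loads((models_dir / "model_registry.json").read_text())
    with open(models_dir / registry[task], "rb") as f:
        return pickle.load(f)
```

Usage would look like `load_model("phishing_detection")` from inside `ml-services/`.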

## Integration Points

### Backend (mlService.js)
- Use `backend_package/inference.py` or `backend_package/ml_client.js`
- Prediction endpoint: `POST /predict`
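
From Python, the prediction endpoint can be exercised with the standard library alone. A sketch, assuming the backend listens on `localhost:3000` and accepts a JSON body with a `url` field (both assumptions; match them to your `mlService.js` setup):

```python
import json
import urllib.request

API_BASE = "http://localhost:3000"  # assumption: wherever mlService.js is running

def build_predict_payload(url: str) -> dict:
    """Assemble the JSON body for POST /predict (field names are assumptions)."""
    return {"url": url}

def predict(url: str) -> dict:
    """POST a URL to the backend and return the parsed prediction response."""
    body = json.dumps(build_predict_payload(url)).encode("utf-8")
    req = urllib.request.Request(
        f"{API_BASE}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```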

### Desktop App (caido-app.js)
- Agent module: `agent/cyberforge_agent.py`
- Real-time analysis via backend API

### Hugging Face
- Models: `huggingface.co/Che237/cyberforge-models`
- Datasets: `huggingface.co/datasets/Che237/cyberforge-datasets`
- Space: `huggingface.co/spaces/Che237/cyberforge`
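
To pull trained models back down from the Hub, `huggingface_hub` can fetch individual files. A sketch, assuming the repo mirrors the local `models/` layout (e.g. `phishing_detection/model.pkl`; the file layout is an assumption):

```python
from huggingface_hub import hf_hub_download

REPO_ID = "Che237/cyberforge-models"

def model_filename(task: str, name: str = "model.pkl") -> str:
    """Build the in-repo path for a model file (layout is an assumption)."""
    return f"{task}/{name}"

def fetch_model(task: str) -> str:
    """Download one model file from the Hub and return its local cache path."""
    return hf_hub_download(repo_id=REPO_ID, filename=model_filename(task))
```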

## Requirements

- Python 3.11+
- scikit-learn >= 1.3.0
- pandas >= 2.0.0
- huggingface_hub >= 0.19.0
- google-generativeai >= 0.3.0

## License

MIT

### 3. **Network Security Analysis** 🌐
**File**: `network_security_analysis.ipynb`
**Purpose**: Network-specific security analysis and monitoring
**Runtime**: ~20-30 minutes
**Description**:
- Network traffic analysis
- Intrusion detection model training
- Port scanning detection
- Network anomaly detection

```bash
jupyter notebook network_security_analysis.ipynb
```

### 4. **Comprehensive AI Agent Training** πŸ€–
**File**: `ai_agent_comprehensive_training.ipynb`
**Purpose**: Advanced AI agent with full capabilities
**Runtime**: ~45-60 minutes
**Description**:
- Enhanced communication skills
- Web scraping and threat intelligence
- Real-time monitoring capabilities
- Natural language processing for security analysis
- **RUN LAST** - Integrates all previous models

```bash
jupyter notebook ai_agent_comprehensive_training.ipynb
```

## πŸ“Š Expected Outputs

After running all notebooks, you should have:

1. **Trained Models**: Saved in `../models/` directory
2. **Performance Metrics**: Evaluation reports and visualizations
3. **AI Agent**: Fully trained agent ready for deployment
4. **Configuration Files**: Model configs for production use

## πŸ”§ Troubleshooting

### Common Issues:

**Memory Errors**: 
- Reduce batch size in deep learning models
- Close other applications to free RAM
- Consider using smaller datasets for testing

**Package Installation Failures**:
- Update pip: `pip install --upgrade pip`
- Use conda if pip fails: `conda install <package>`
- Check Python version compatibility

**CUDA/GPU Issues**:
- For TensorFlow GPU: Install CUDA 11.8+ and cuDNN
- For CPU-only: Models will run slower but still work
- Check GPU availability with `tf.config.list_physical_devices('GPU')` (the older `tensorflow.test.is_gpu_available()` is deprecated in TensorFlow 2.x)

**Data Download Issues**:
- Ensure internet connection for Kaggle datasets
- Set up Kaggle API credentials if needed
- Some notebooks include fallback synthetic data generation
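
The `kaggle` package reads credentials either from `~/.kaggle/kaggle.json` or from environment variables, which can be set in-process before the first Kaggle import:

```python
import os

def configure_kaggle(username: str, key: str) -> None:
    """Expose Kaggle API credentials via the environment variables the
    kaggle package checks, as an alternative to ~/.kaggle/kaggle.json."""
    os.environ["KAGGLE_USERNAME"] = username
    os.environ["KAGGLE_KEY"] = key
```

Because the kaggle package authenticates at import time, call this before `import kaggle`.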

## πŸ“ Notes

- **First Run**: Initial execution takes longer due to package installation and data downloads
- **Subsequent Runs**: Much faster as dependencies are cached
- **Customization**: Modify hyperparameters in notebooks for different results
- **Production**: Use the saved models in the main application

## 🎯 Next Steps

After completing all notebooks:

1. **Deploy Models**: Copy trained models to production environment
2. **Integration**: Connect models with the desktop application
3. **Monitoring**: Set up model performance monitoring
4. **Updates**: Retrain models with new data periodically

## πŸ†˜ Support

If you encounter issues:
1. Check the troubleshooting section above
2. Verify all prerequisites are met
3. Review notebook outputs for specific error messages
4. Create an issue in the repository with error details

---

**Happy Training! πŸš€**