# CyberForge ML Notebooks
Production-ready ML pipeline for the CyberForge cybersecurity AI system.
## Notebook Structure
| # | Notebook | Purpose | Key Outputs |
|---|----------|---------|-------------|
| 00 | [environment_setup](00_environment_setup.ipynb) | Environment validation, dependencies | System readiness report |
| 01 | [data_acquisition](01_data_acquisition.ipynb) | Data collection from WebScraper API, HF | Normalized datasets |
| 02 | [feature_engineering](02_feature_engineering.ipynb) | URL, network, security feature extraction | Feature-engineered data |
| 03 | [model_training](03_model_training.ipynb) | Train detection models | Trained .pkl models |
| 04 | [agent_intelligence](04_agent_intelligence.ipynb) | Decision scoring, Gemini integration | Agent module |
| 05 | [model_validation](05_model_validation.ipynb) | Performance, edge case testing | Validation report |
| 06 | [backend_integration](06_backend_integration.ipynb) | API packaging, serialization | Backend package |
| 07 | [deployment_artifacts](07_deployment_artifacts.ipynb) | Docker, HF upload, documentation | Deployment package |
## Quick Start
1. **Configure environment:**
```bash
cd ml-services
# Ensure notebook_config.json has your API keys
```
2. **Run notebooks in order:**
```bash
jupyter notebook notebooks/00_environment_setup.ipynb
```
3. **Or run all:**
```bash
jupyter nbconvert --execute --to notebook notebooks/*.ipynb
```
## Configuration
All notebooks use `../notebook_config.json` for configuration:
```json
{
  "datasets_dir": "../datasets",
  "hf_repo": "Che237/cyberforge-models",
  "gemini_api_key": "",
  "webscraper_api_key": "your_key"
}
```
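The keys above can be read with a small loader at the top of each notebook. This is a sketch, not code from the notebooks; the fallback defaults are illustrative.

```python
import json
from pathlib import Path

# Sketch of a loader for notebook_config.json; not taken from the
# notebooks themselves. The defaults below are illustrative only.
DEFAULTS = {
    "datasets_dir": "../datasets",
    "hf_repo": "Che237/cyberforge-models",
    "gemini_api_key": "",
    "webscraper_api_key": "",
}

def load_config(path="../notebook_config.json"):
    """Merge notebook_config.json (if present) over the defaults above."""
    cfg = dict(DEFAULTS)
    p = Path(path)
    if p.exists():
        cfg.update(json.loads(p.read_text()))
    return cfg
```

Keeping empty-string defaults for the API keys lets the offline notebooks run without a config file at all.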
## Output Directories
After running all notebooks:
```
ml-services/
├── datasets/
│   ├── processed/        # Cleaned datasets
│   └── features/         # Feature-engineered data
├── models/               # Trained models
│   ├── phishing_detection/
│   ├── malware_detection/
│   └── model_registry.json
├── agent/                # Agent intelligence module
├── validation/           # Validation reports
├── backend_package/      # Backend integration files
└── deployment/           # Deployment artifacts
```
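A quick way to confirm a full run produced everything is to check this tree programmatically. A minimal sketch; the helper name is ours, not from the repo:

```python
from pathlib import Path

# Directories the tree above says should exist after a full run.
EXPECTED = [
    "datasets/processed",
    "datasets/features",
    "models",
    "agent",
    "validation",
    "backend_package",
    "deployment",
]

def missing_outputs(root="ml-services"):
    """Return the expected output directories that do not exist yet."""
    base = Path(root)
    return [d for d in EXPECTED if not (base / d).is_dir()]
```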
## Integration Points
### Backend (mlService.js)
- Use `backend_package/inference.py` or `backend_package/ml_client.js`
- Prediction endpoint: `POST /predict`
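A minimal client sketch for the `/predict` endpoint. The request body shape (`{"url": ...}`) and the port are assumptions here, so match them against `backend_package/inference.py`:

```python
import json
import urllib.request

# Sketch of a /predict client. The payload shape and base URL are
# assumptions -- check backend_package/inference.py for the real contract.
def build_predict_request(url, base="http://localhost:3000"):
    body = json.dumps({"url": url}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send with: urllib.request.urlopen(build_predict_request("https://example.com"))
```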
### Desktop App (caido-app.js)
- Agent module: `agent/cyberforge_agent.py`
- Real-time analysis via backend API
### Hugging Face
- Models: `huggingface.co/Che237/cyberforge-models`
- Datasets: `huggingface.co/datasets/Che237/cyberforge-datasets`
- Space: `huggingface.co/spaces/Che237/cyberforge`
## Requirements
- Python 3.11+
- scikit-learn >= 1.3.0
- pandas >= 2.0.0
- huggingface_hub >= 0.19.0
- google-generativeai >= 0.3.0
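The list above maps directly onto a pip requirements file; a sketch (the repo may already ship its own `requirements.txt`):

```text
scikit-learn>=1.3.0
pandas>=2.0.0
huggingface_hub>=0.19.0
google-generativeai>=0.3.0
```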
## License
MIT
### 3. **Network Security Analysis** 🌐
**File**: `network_security_analysis.ipynb`
**Purpose**: Network-specific security analysis and monitoring
**Runtime**: ~20-30 minutes
**Description**:
- Network traffic analysis
- Intrusion detection model training
- Port scanning detection
- Network anomaly detection
```bash
jupyter notebook network_security_analysis.ipynb
```
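Port-scan detection from the list above can start as simply as counting distinct destination ports per source. A generic heuristic sketch, not the notebook's actual model:

```python
from collections import defaultdict

# Generic port-scan heuristic, not the notebook's trained model:
# flag sources that touch many distinct destination ports.
def flag_port_scans(events, port_threshold=20):
    """events: iterable of (src_ip, dst_port) pairs from a capture window."""
    ports_by_src = defaultdict(set)
    for src, port in events:
        ports_by_src[src].add(port)
    return sorted(
        src for src, ports in ports_by_src.items()
        if len(ports) >= port_threshold
    )
```

A trained anomaly model replaces the fixed threshold, but the underlying feature (distinct-port fan-out per source) is the same idea.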
### 4. **Comprehensive AI Agent Training** 🤖
**File**: `ai_agent_comprehensive_training.ipynb`
**Purpose**: Advanced AI agent with full capabilities
**Runtime**: ~45-60 minutes
**Description**:
- Enhanced communication skills
- Web scraping and threat intelligence
- Real-time monitoring capabilities
- Natural language processing for security analysis
- **RUN LAST** - Integrates all previous models
```bash
jupyter notebook ai_agent_comprehensive_training.ipynb
```
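One way this final notebook can integrate the earlier models is a weighted score across their outputs. This is an illustrative sketch, not the notebook's actual logic; the weights and threshold are placeholders:

```python
# Illustrative sketch of combining per-model probabilities into one
# agent verdict. Weights and threshold are placeholders, not the
# notebook's real values.
def agent_verdict(scores, weights=None, threshold=0.5):
    """scores: dict of model name -> probability of maliciousness."""
    weights = weights or {name: 1.0 for name in scores}
    total = sum(weights[name] for name in scores)
    risk = sum(scores[name] * weights[name] for name in scores) / total
    return {"risk": round(risk, 3), "malicious": risk >= threshold}
```

In production the weights would come from validation performance, not equal voting.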
## 📊 Expected Outputs
After running all notebooks, you should have:
1. **Trained Models**: Saved in `../models/` directory
2. **Performance Metrics**: Evaluation reports and visualizations
3. **AI Agent**: Fully trained agent ready for deployment
4. **Configuration Files**: Model configs for production use
## 🔧 Troubleshooting
### Common Issues:
**Memory Errors**:
- Reduce batch size in deep learning models
- Close other applications to free RAM
- Consider using smaller datasets for testing
**Package Installation Failures**:
- Update pip: `pip install --upgrade pip`
- Use conda if pip fails: `conda install <package>`
- Check Python version compatibility
**CUDA/GPU Issues**:
- For TensorFlow GPU: Install CUDA 11.8+ and cuDNN
- For CPU-only: Models will run slower but still work
- Check GPU availability: `tf.config.list_physical_devices('GPU')` (the older `tf.test.is_gpu_available()` is deprecated)
**Data Download Issues**:
- Ensure internet connection for Kaggle datasets
- Set up Kaggle API credentials if needed
- Some notebooks include fallback synthetic data generation
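The synthetic fallback mentioned above might look like the following. The feature names here are invented for illustration, not the notebooks' real schema:

```python
import random

# Illustrative synthetic-data fallback; feature names are invented
# placeholders, not the notebooks' real schema.
def synthetic_url_rows(n, seed=42):
    """Generate n labeled rows with crudely label-correlated features."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)  # 1 = malicious
        rows.append({
            "url_length": rng.randint(10, 40) + label * rng.randint(20, 80),
            "num_subdomains": rng.randint(0, 2) + label * rng.randint(1, 3),
            "label": label,
        })
    return rows
```

A fixed seed keeps the fallback reproducible between runs.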
## 📝 Notes
- **First Run**: Initial execution takes longer due to package installation and data downloads
- **Subsequent Runs**: Much faster as dependencies are cached
- **Customization**: Modify hyperparameters in notebooks for different results
- **Production**: Use the saved models in the main application
## 🎯 Next Steps
After completing all notebooks:
1. **Deploy Models**: Copy trained models to production environment
2. **Integration**: Connect models with the desktop application
3. **Monitoring**: Set up model performance monitoring
4. **Updates**: Retrain models with new data periodically
## 🆘 Support
If you encounter issues:
1. Check the troubleshooting section above
2. Verify all prerequisites are met
3. Review notebook outputs for specific error messages
4. Create an issue in the repository with error details
---
**Happy Training! 🚀**