
CyberForge ML Notebooks

Production-ready ML pipeline for CyberForge cybersecurity AI system.

Notebook Structure

| #  | Notebook             | Purpose                                    | Key Outputs             |
|----|----------------------|--------------------------------------------|-------------------------|
| 00 | environment_setup    | Environment validation, dependencies       | System readiness report |
| 01 | data_acquisition     | Data collection from WebScraper API, HF    | Normalized datasets     |
| 02 | feature_engineering  | URL, network, security feature extraction  | Feature-engineered data |
| 03 | model_training       | Train detection models                     | Trained .pkl models     |
| 04 | agent_intelligence   | Decision scoring, Gemini integration       | Agent module            |
| 05 | model_validation     | Performance, edge case testing             | Validation report       |
| 06 | backend_integration  | API packaging, serialization               | Backend package         |
| 07 | deployment_artifacts | Docker, HF upload, documentation           | Deployment package      |

Quick Start

  1. Configure environment:

    cd ml-services
    # Ensure notebook_config.json has your API keys
    
  2. Run notebooks in order:

    jupyter notebook notebooks/00_environment_setup.ipynb
    
  3. Or run all:

    jupyter nbconvert --execute --to notebook notebooks/*.ipynb
    

Configuration

All notebooks use ../notebook_config.json for configuration:

{
  "datasets_dir": "../datasets",
  "hf_repo": "Che237/cyberforge-models",
  "gemini_api_key": "",
  "webscraper_api_key": "your_key"
}
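Inside a notebook, the config can be loaded and sanity-checked with a small helper. This is an illustrative sketch, not code from the repo; the choice of required keys is an assumption:

```python
import json

def load_config(path="../notebook_config.json"):
    """Load notebook_config.json and fail fast if required keys are empty.

    The set of required keys is an assumption for illustration; API keys
    like gemini_api_key may legitimately be blank in some setups.
    """
    with open(path) as f:
        config = json.load(f)
    missing = [k for k in ("datasets_dir", "hf_repo") if not config.get(k)]
    if missing:
        raise KeyError(f"notebook_config.json is missing values for: {missing}")
    return config
```

Failing fast here keeps later notebooks from crashing halfway through a long run because of a blank key.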

Output Directories

After running all notebooks:

ml-services/
├── datasets/
│   ├── processed/       # Cleaned datasets
│   └── features/        # Feature-engineered data
├── models/              # Trained models
│   ├── phishing_detection/
│   ├── malware_detection/
│   └── model_registry.json
├── agent/               # Agent intelligence module
├── validation/          # Validation reports
├── backend_package/     # Backend integration files
└── deployment/          # Deployment artifacts
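Assuming model_registry.json maps task names to model file paths (the exact schema is not documented here), a trained .pkl model could be looked up and loaded like this:

```python
import json
import pickle
from pathlib import Path

def load_model(task, models_dir=Path("ml-services/models")):
    """Look up a task (e.g. 'phishing_detection') in model_registry.json
    and unpickle its model file.

    The registry layout assumed here -- {"<task>": {"path": "<file>.pkl"}} --
    is a guess for illustration; check the actual file after notebook 03.
    """
    registry = json.loads((models_dir / "model_registry.json").read_text())
    entry = registry[task]
    with open(models_dir / entry["path"], "rb") as f:
        return pickle.load(f)
```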

Integration Points

Backend (mlService.js)

  • Use backend_package/inference.py or backend_package/ml_client.js
  • Prediction endpoint: POST /predict
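A minimal client for that endpoint might look like the following. The localhost URL and the {"url": ...} payload shape are assumptions; the real request/response schema is defined by backend_package:

```python
import json
from urllib import request

API_URL = "http://localhost:3000/predict"  # placeholder host/port

def build_request(url_to_check, api_url=API_URL):
    """Build a POST /predict request with a JSON body (assumed shape)."""
    payload = json.dumps({"url": url_to_check}).encode("utf-8")
    return request.Request(api_url, data=payload,
                           headers={"Content-Type": "application/json"})

def predict(url_to_check, api_url=API_URL):
    """Send the request and return the parsed JSON verdict."""
    with request.urlopen(build_request(url_to_check, api_url)) as resp:
        return json.load(resp)
```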

Desktop App (caido-app.js)

  • Agent module: agent/cyberforge_agent.py
  • Real-time analysis via backend API

Hugging Face

  • Models: huggingface.co/Che237/cyberforge-models
  • Datasets: huggingface.co/datasets/Che237/cyberforge-datasets
  • Space: huggingface.co/spaces/Che237/cyberforge

Requirements

  • Python 3.11+
  • scikit-learn >= 1.3.0
  • pandas >= 2.0.0
  • huggingface_hub >= 0.19.0
  • google-generativeai >= 0.3.0
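The 00_environment_setup notebook handles this validation; a stdlib-only sketch of such a check, where REQUIRED mirrors a subset of the list above, might look like:

```python
from importlib import metadata

# Minimum versions from the Requirements section (illustrative subset)
REQUIRED = {
    "scikit-learn": "1.3.0",
    "pandas": "2.0.0",
}

def parse_version(v):
    """Split a dotted version string into a tuple of ints for comparison."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def check_requirements(required=REQUIRED):
    """Report each package as 'ok', 'outdated', or 'missing'."""
    report = {}
    for pkg, minimum in required.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "missing"
            continue
        report[pkg] = ("ok" if parse_version(installed) >= parse_version(minimum)
                       else "outdated")
    return report
```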

License

MIT

3. Network Security Analysis 🌐

File: network_security_analysis.ipynb
Purpose: Network-specific security analysis and monitoring
Runtime: ~20-30 minutes
Description:

  • Network traffic analysis
  • Intrusion detection model training
  • Port scanning detection
  • Network anomaly detection

    jupyter notebook network_security_analysis.ipynb
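The notebook's port-scan detection is model-based; as a flavor of the underlying signal, here is a stdlib-only heuristic that flags sources touching many distinct destination ports. The function name and threshold are illustrative, not taken from the notebook:

```python
from collections import defaultdict

def flag_port_scanners(connections, port_threshold=100):
    """Flag source IPs that touch an unusually large number of distinct
    destination ports -- a crude port-scan heuristic.

    `connections` is an iterable of (src_ip, dst_port) pairs, e.g. parsed
    from flow logs.
    """
    ports_seen = defaultdict(set)
    for src_ip, dst_port in connections:
        ports_seen[src_ip].add(dst_port)
    return {ip for ip, ports in ports_seen.items()
            if len(ports) >= port_threshold}
```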

4. Comprehensive AI Agent Training 🤖

File: ai_agent_comprehensive_training.ipynb
Purpose: Advanced AI agent with full capabilities
Runtime: ~45-60 minutes
Description:

  • Enhanced communication skills
  • Web scraping and threat intelligence
  • Real-time monitoring capabilities
  • Natural language processing for security analysis
  • RUN LAST - Integrates all previous models

    jupyter notebook ai_agent_comprehensive_training.ipynb

📊 Expected Outputs

After running all notebooks, you should have:

  1. Trained Models: Saved in ../models/ directory
  2. Performance Metrics: Evaluation reports and visualizations
  3. AI Agent: Fully trained agent ready for deployment
  4. Configuration Files: Model configs for production use

🔧 Troubleshooting

Common Issues:

Memory Errors:

  • Reduce batch size in deep learning models
  • Close other applications to free RAM
  • Consider using smaller datasets for testing

Package Installation Failures:

  • Update pip: pip install --upgrade pip
  • Use conda if pip fails: conda install <package>
  • Check Python version compatibility

CUDA/GPU Issues:

  • For TensorFlow GPU: Install CUDA 11.8+ and cuDNN
  • For CPU-only: Models will run slower but still work
  • Check GPU availability: tf.config.list_physical_devices('GPU') (tensorflow.test.is_gpu_available() is deprecated in TF 2.x)

Data Download Issues:

  • Ensure internet connection for Kaggle datasets
  • Set up Kaggle API credentials if needed
  • Some notebooks include fallback synthetic data generation
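A fallback generator of that kind could be sketched as follows; the domains, patterns, and labels below are invented for illustration and are not the notebooks' actual fallback data:

```python
import random

def synthetic_url_samples(n=100, seed=42):
    """Generate labeled (url, is_phishing) pairs for offline testing.

    Deterministic given `seed`, so test runs are reproducible. All URLs
    here are made up; real notebooks would draw on richer templates.
    """
    rng = random.Random(seed)
    benign = ["https://example.com/home", "https://example.org/docs"]
    phishing = ["http://examp1e-login.xyz/verify", "http://secure-update.top/acct"]
    samples = []
    for _ in range(n):
        if rng.random() < 0.5:
            samples.append((rng.choice(benign), 0))
        else:
            samples.append((rng.choice(phishing), 1))
    return samples
```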

πŸ“ Notes

  • First Run: Initial execution takes longer due to package installation and data downloads
  • Subsequent Runs: Much faster as dependencies are cached
  • Customization: Modify hyperparameters in notebooks for different results
  • Production: Use the saved models in the main application

🎯 Next Steps

After completing all notebooks:

  1. Deploy Models: Copy trained models to production environment
  2. Integration: Connect models with the desktop application
  3. Monitoring: Set up model performance monitoring
  4. Updates: Retrain models with new data periodically

🆘 Support

If you encounter issues:

  1. Check the troubleshooting section above
  2. Verify all prerequisites are met
  3. Review notebook outputs for specific error messages
  4. Create an issue in the repository with error details

Happy Training! 🚀