
CyberForge ML Notebooks

Production-ready ML pipeline for CyberForge cybersecurity AI system.

Notebook Structure

| #  | Notebook             | Purpose                                    | Key Outputs             |
|----|----------------------|--------------------------------------------|-------------------------|
| 00 | environment_setup    | Environment validation, dependencies       | System readiness report |
| 01 | data_acquisition     | Data collection from WebScraper API, HF    | Normalized datasets     |
| 02 | feature_engineering  | URL, network, security feature extraction  | Feature-engineered data |
| 03 | model_training       | Train detection models                     | Trained .pkl models     |
| 04 | agent_intelligence   | Decision scoring, Gemini integration       | Agent module            |
| 05 | model_validation     | Performance, edge case testing             | Validation report       |
| 06 | backend_integration  | API packaging, serialization               | Backend package         |
| 07 | deployment_artifacts | Docker, HF upload, documentation           | Deployment package      |

Quick Start

  1. Configure environment:

    cd ml-services
    # Ensure notebook_config.json has your API keys
    
  2. Run notebooks in order:

    jupyter notebook notebooks/00_environment_setup.ipynb
    
  3. Or run all:

    jupyter nbconvert --execute --to notebook notebooks/*.ipynb
    

Configuration

All notebooks use ../notebook_config.json for configuration:

{
  "datasets_dir": "../datasets",
  "hf_repo": "Che237/cyberforge-models",
  "gemini_api_key": "",
  "webscraper_api_key": "your_key"
}
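Inside a notebook, the config can be loaded and sanity-checked with a small helper. This is an illustrative sketch, not code from the repo; the choice of required keys is an assumption:

```python
import json

def load_config(path="../notebook_config.json"):
    """Load notebook_config.json and fail fast if required keys are empty.

    The set of required keys is an assumption for illustration; API keys
    like gemini_api_key may legitimately be blank in some setups.
    """
    with open(path) as f:
        config = json.load(f)
    missing = [k for k in ("datasets_dir", "hf_repo") if not config.get(k)]
    if missing:
        raise KeyError(f"notebook_config.json is missing values for: {missing}")
    return config
```

Failing fast here keeps later notebooks from crashing halfway through a long run because of a blank key.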

Output Directories

After running all notebooks:

ml-services/
├── datasets/
│   ├── processed/       # Cleaned datasets
│   └── features/        # Feature-engineered data
├── models/              # Trained models
│   ├── phishing_detection/
│   ├── malware_detection/
│   └── model_registry.json
├── agent/               # Agent intelligence module
├── validation/          # Validation reports
├── backend_package/     # Backend integration files
└── deployment/          # Deployment artifacts
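Assuming model_registry.json maps task names to model file paths (the exact schema is not documented here), a trained .pkl model could be looked up and loaded like this:

```python
import json
import pickle
from pathlib import Path

def load_model(task, models_dir=Path("ml-services/models")):
    """Look up a task (e.g. 'phishing_detection') in model_registry.json
    and unpickle its model file.

    The registry layout assumed here -- {"<task>": {"path": "<file>.pkl"}} --
    is a guess for illustration; check the actual file after notebook 03.
    """
    registry = json.loads((models_dir / "model_registry.json").read_text())
    entry = registry[task]
    with open(models_dir / entry["path"], "rb") as f:
        return pickle.load(f)
```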

Integration Points

Backend (mlService.js)

  • Use backend_package/inference.py or backend_package/ml_client.js
  • Prediction endpoint: POST /predict
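A minimal client for that endpoint might look like the following. The localhost URL and the {"url": ...} payload shape are assumptions; the real request/response schema is defined by backend_package:

```python
import json
from urllib import request

API_URL = "http://localhost:3000/predict"  # placeholder host/port

def build_request(url_to_check, api_url=API_URL):
    """Build a POST /predict request with a JSON body (assumed shape)."""
    payload = json.dumps({"url": url_to_check}).encode("utf-8")
    return request.Request(api_url, data=payload,
                           headers={"Content-Type": "application/json"})

def predict(url_to_check, api_url=API_URL):
    """Send the request and return the parsed JSON verdict."""
    with request.urlopen(build_request(url_to_check, api_url)) as resp:
        return json.load(resp)
```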

Desktop App (caido-app.js)

  • Agent module: agent/cyberforge_agent.py
  • Real-time analysis via backend API

Hugging Face

  • Models: huggingface.co/Che237/cyberforge-models
  • Datasets: huggingface.co/datasets/Che237/cyberforge-datasets
  • Space: huggingface.co/spaces/Che237/cyberforge

Requirements

  • Python 3.11+
  • scikit-learn >= 1.3.0
  • pandas >= 2.0.0
  • huggingface_hub >= 0.19.0
  • google-generativeai >= 0.3.0
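The 00_environment_setup notebook handles this validation; a stdlib-only sketch of such a check, where REQUIRED mirrors a subset of the list above, might look like:

```python
from importlib import metadata

# Minimum versions from the Requirements section (illustrative subset)
REQUIRED = {
    "scikit-learn": "1.3.0",
    "pandas": "2.0.0",
}

def parse_version(v):
    """Split a dotted version string into a tuple of ints for comparison."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

def check_requirements(required=REQUIRED):
    """Report each package as 'ok', 'outdated', or 'missing'."""
    report = {}
    for pkg, minimum in required.items():
        try:
            installed = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = "missing"
            continue
        report[pkg] = ("ok" if parse_version(installed) >= parse_version(minimum)
                       else "outdated")
    return report
```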

License

MIT

3. Network Security Analysis 🌐

File: network_security_analysis.ipynb
Purpose: Network-specific security analysis and monitoring
Runtime: ~20-30 minutes
Description:

  • Network traffic analysis
  • Intrusion detection model training
  • Port scanning detection
  • Network anomaly detection

    jupyter notebook network_security_analysis.ipynb
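The notebook's port-scan detection is model-based; as a flavor of the underlying signal, here is a stdlib-only heuristic that flags sources touching many distinct destination ports. The function name and threshold are illustrative, not taken from the notebook:

```python
from collections import defaultdict

def flag_port_scanners(connections, port_threshold=100):
    """Flag source IPs that touch an unusually large number of distinct
    destination ports -- a crude port-scan heuristic.

    `connections` is an iterable of (src_ip, dst_port) pairs, e.g. parsed
    from flow logs.
    """
    ports_seen = defaultdict(set)
    for src_ip, dst_port in connections:
        ports_seen[src_ip].add(dst_port)
    return {ip for ip, ports in ports_seen.items()
            if len(ports) >= port_threshold}
```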

4. Comprehensive AI Agent Training 🤖

File: ai_agent_comprehensive_training.ipynb
Purpose: Advanced AI agent with full capabilities
Runtime: ~45-60 minutes
Description:

  • Enhanced communication skills
  • Web scraping and threat intelligence
  • Real-time monitoring capabilities
  • Natural language processing for security analysis
  • RUN LAST - Integrates all previous models

    jupyter notebook ai_agent_comprehensive_training.ipynb

📊 Expected Outputs

After running all notebooks, you should have:

  1. Trained Models: Saved in ../models/ directory
  2. Performance Metrics: Evaluation reports and visualizations
  3. AI Agent: Fully trained agent ready for deployment
  4. Configuration Files: Model configs for production use

🔧 Troubleshooting

Common Issues:

Memory Errors:

  • Reduce batch size in deep learning models
  • Close other applications to free RAM
  • Consider using smaller datasets for testing

Package Installation Failures:

  • Update pip: pip install --upgrade pip
  • Use conda if pip fails: conda install <package>
  • Check Python version compatibility

CUDA/GPU Issues:

  • For TensorFlow GPU: Install CUDA 11.8+ and cuDNN
  • For CPU-only: Models will run slower but still work
  • Check GPU availability: tf.config.list_physical_devices('GPU') (tensorflow.test.is_gpu_available() is deprecated in TF 2.x)

Data Download Issues:

  • Ensure internet connection for Kaggle datasets
  • Set up Kaggle API credentials if needed
  • Some notebooks include fallback synthetic data generation
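A fallback generator of that kind could be sketched as follows; the domains, patterns, and labels below are invented for illustration and are not the notebooks' actual fallback data:

```python
import random

def synthetic_url_samples(n=100, seed=42):
    """Generate labeled (url, is_phishing) pairs for offline testing.

    Deterministic given `seed`, so test runs are reproducible. All URLs
    here are made up; real notebooks would draw on richer templates.
    """
    rng = random.Random(seed)
    benign = ["https://example.com/home", "https://example.org/docs"]
    phishing = ["http://examp1e-login.xyz/verify", "http://secure-update.top/acct"]
    samples = []
    for _ in range(n):
        if rng.random() < 0.5:
            samples.append((rng.choice(benign), 0))
        else:
            samples.append((rng.choice(phishing), 1))
    return samples
```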

πŸ“ Notes

  • First Run: Initial execution takes longer due to package installation and data downloads
  • Subsequent Runs: Much faster as dependencies are cached
  • Customization: Modify hyperparameters in notebooks for different results
  • Production: Use the saved models in the main application

🎯 Next Steps

After completing all notebooks:

  1. Deploy Models: Copy trained models to production environment
  2. Integration: Connect models with the desktop application
  3. Monitoring: Set up model performance monitoring
  4. Updates: Retrain models with new data periodically

🆘 Support

If you encounter issues:

  1. Check the troubleshooting section above
  2. Verify all prerequisites are met
  3. Review notebook outputs for specific error messages
  4. Create an issue in the repository with error details

Happy Training! 🚀