---
title: Dataset Explorer
emoji: 🔐
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: apache-2.0
tags:
- cybersecurity
- datasets
- data-explorer
- analytics
- visualization
---

# 🔐 Cybersecurity Dataset Explorer

A comprehensive Gradio Space to explore and analyze 80+ cybersecurity datasets from HuggingFace.

## Features

### 🔍 Search & Filter
- Search by keyword across dataset names, descriptions, and tags
- Filter by language (English, Chinese, Korean, Italian, French, Russian, etc.)
- Filter by category (AI, Defensive, Offensive, Compliance)
- Filter by popularity (minimum downloads and likes)
- View results in interactive tables

### 📊 Dataset Details
- Comprehensive metadata for each dataset
- Statistics (downloads, likes, size, language)
- Complete tag listings
- Direct links to HuggingFace repositories
- Mock preview functionality (shows structure)

### 📈 Statistics & Visualizations
Interactive charts powered by Plotly:
- **Category Distribution**: Pie chart showing dataset distribution across categories
- **Language Distribution**: Bar chart of top 10 languages
- **Top Downloads**: Horizontal bar chart of most popular datasets
- **Size Distribution**: Distribution of dataset sizes

### 📥 Export Capabilities
- Export filtered results to CSV format
- Export filtered results to JSON format
- Download data for offline analysis

### 🎨 Dark Theme
Beautiful dark theme optimized for readability with:
- High contrast colors
- Interactive hover effects
- Responsive layout
- Professional visualization styling

## Dataset Categories

### AI (27 datasets)
Datasets for training and evaluating AI/ML models in cybersecurity:
- Instruction-tuning datasets
- ShareGPT format conversations
- Question-answering pairs
- Synthetic training data
- Fine-tuning datasets

### Defensive (28 datasets)
Blue team, security operations, and threat detection:
- Threat intelligence
- Incident response
- Security operations
- Detection rules (SIGMA, YARA, Suricata)
- Honeypot data
- News and threat feeds

### Offensive (10 datasets)
Red team, penetration testing, and security research:
- Penetration testing techniques
- Exploit databases
- Attack scenarios
- Vulnerability data
- CVE databases

### Compliance (5 datasets)
Regulatory frameworks and standards:
- NIST Cybersecurity Framework
- ISO/IEC 27001
- Taiwan Cybersecurity Law
- Compliance training data

## Top Datasets

1. **ethanolivertroy/nist-cybersecurity-training** (8,000 downloads)
   - Largest open-source NIST cybersecurity training dataset
   - 100K-1M samples for LLM fine-tuning

2. **clydeiii/cybersecurity** (4,000 downloads)
   - APT notes from GitHub
   - Threat intelligence focus

3. **vinitvek/cybersecurityattacks** (2,300 downloads)
   - Cybersecurity attacks dataset
   - 10K-100K samples

4. **Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset** (786 downloads, 78 likes)
   - 53,202 instruction-tuning examples
   - Defensive security focus

5. **AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0** (353 downloads)
   - 83,920 high-quality training triples
   - Defensive cybersecurity

## Statistics

- **Total Datasets**: 80
- **Total Downloads**: 18,000+
- **Languages**: 10+ (English, Chinese, Korean, Italian, French, Russian, etc.)
- **Size Range**: <1K to 10M+ samples

## Usage

### Search Examples

1. **Find NIST-related datasets**:
   - Keyword: "NIST"
   - Category: Compliance

2. **Find penetration testing datasets**:
   - Keyword: "penetration" or "pentest"
   - Category: Offensive

3. **Find instruction-tuning datasets**:
   - Keyword: "instruction"
   - Category: AI
   - Min Downloads: 100

4. **Find threat intelligence datasets**:
   - Keyword: "threat"
   - Category: Defensive

### Export Workflow

1. Apply desired filters
2. Click "Search Datasets"
3. Click "Export to CSV" or "Export to JSON"
4. Download the file from the interface

## Technologies

- **Gradio 4.44.1**: Interactive web interface
- **Pandas 2.1.4**: Data manipulation and filtering
- **Plotly 5.18.0**: Interactive visualizations
- **HuggingFace Datasets 2.16.1**: Dataset metadata

## Data Sources

All datasets are publicly available on HuggingFace Hub. This explorer provides:
- Curated metadata from 80 cybersecurity datasets
- Filtering and search capabilities
- Visual analytics
- Export functionality

To access actual dataset content, click the HuggingFace URL for any dataset.

## Development

### Local Setup

```bash
pip install -r requirements.txt
python app.py
```

### File Structure

```
dataset-explorer/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
└── README.md           # This file
```

## Future Enhancements

- Live dataset preview (load actual samples)
- Full-text search within dataset content
- Advanced filtering (by date, size range)
- Dataset comparison tool
- API integration for real-time updates
- Custom visualization builder
- Dataset recommendation engine

## License

Apache 2.0

## Author

**AYI-NEDJIMI**

## Acknowledgments

Special thanks to the HuggingFace community and all dataset creators who make their cybersecurity datasets publicly available.

---

**Note**: This is a metadata explorer. To download and use the actual datasets, visit the HuggingFace links provided in the interface.
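
## Appendix: Filter and Export Sketch

The search filters and export workflow described under Usage can be sketched in a few lines of Python. This is a minimal illustration, not the code in `app.py`: the app is listed as using Pandas, while this sketch uses only the standard library so it runs without dependencies. The record fields, the function names (`search`, `to_csv`, `to_json`), and the category assigned to each sample record are assumptions made for the example; download counts and descriptions come from the Top Datasets list.

```python
import csv
import io
import json

# Hypothetical metadata records. Field names and category labels are
# assumptions; download counts are taken from the Top Datasets list.
DATASETS = [
    {"name": "ethanolivertroy/nist-cybersecurity-training",
     "category": "Compliance", "downloads": 8000,
     "description": "Largest open-source NIST cybersecurity training dataset"},
    {"name": "clydeiii/cybersecurity",
     "category": "Defensive", "downloads": 4000,
     "description": "APT notes from GitHub"},
    {"name": "Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset",
     "category": "Defensive", "downloads": 786,
     "description": "53,202 instruction-tuning examples"},
    {"name": "AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0",
     "category": "Defensive", "downloads": 353,
     "description": "83,920 high-quality training triples"},
]

FIELDS = ["name", "category", "downloads", "description"]


def search(records, keyword="", category=None, min_downloads=0):
    """Return records matching a case-insensitive keyword, an optional
    category, and a minimum download count."""
    needle = keyword.lower()
    return [
        r for r in records
        if needle in (r["name"] + " " + r["description"]).lower()
        and (category is None or r["category"] == category)
        and r["downloads"] >= min_downloads
    ]


def to_csv(records):
    """Serialize records to a CSV string with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize records to a pretty-printed JSON string."""
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    hits = search(DATASETS, keyword="NIST", category="Compliance")
    print(to_csv(hits))
    print(to_json(hits))
```

The keyword match here covers names and descriptions only; extending it to tags would mean adding a tags field to each record and including it in the searched text.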