---
title: Dataset Explorer
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: apache-2.0
tags:
- cybersecurity
- datasets
- data-explorer
- analytics
- visualization
---

# Cybersecurity Dataset Explorer

A Gradio Space for exploring and analyzing metadata for 80 cybersecurity datasets hosted on HuggingFace.

## Features

### Search & Filter
- Search by keyword across dataset names, descriptions, and tags
- Filter by language (English, Chinese, Korean, Italian, French, Russian, etc.)
- Filter by category (AI, Defensive, Offensive, Compliance)
- Filter by popularity (minimum downloads and likes)
- View results in interactive tables

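The filters listed above combine naturally as a logical AND. The following is a minimal sketch of that idea in plain Python, not the code in `app.py` (which uses pandas). The `DatasetMeta` record, its field names, the `filter_datasets` function, and the sample rows are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; field names are illustrative."""
    repo_id: str
    description: str
    language: str
    category: str
    downloads: int
    likes: int
    tags: list[str] = field(default_factory=list)


def filter_datasets(
    records: list[DatasetMeta],
    keyword: str = "",
    language: Optional[str] = None,
    category: Optional[str] = None,
    min_downloads: int = 0,
    min_likes: int = 0,
) -> list[DatasetMeta]:
    """Return the records that satisfy every active filter (logical AND)."""
    needle = keyword.strip().lower()
    matches = []
    for rec in records:
        # Keyword search covers the name, the description, and the tags.
        haystack = " ".join([rec.repo_id, rec.description, *rec.tags]).lower()
        if needle and needle not in haystack:
            continue
        if language is not None and rec.language != language:
            continue
        if category is not None and rec.category != category:
            continue
        if rec.downloads < min_downloads or rec.likes < min_likes:
            continue
        matches.append(rec)
    return matches


# Made-up sample records, for illustration only.
SAMPLE = [
    DatasetMeta("example-org/nist-qa", "NIST framework Q&A pairs",
                "English", "Compliance", 1200, 15, ["nist", "qa"]),
    DatasetMeta("example-org/pentest-notes", "Penetration testing write-ups",
                "English", "Offensive", 300, 4, ["pentest"]),
    DatasetMeta("example-org/threat-feed-zh", "Threat intelligence feed",
                "Chinese", "Defensive", 90, 2, ["threat-intel"]),
]

hits = filter_datasets(SAMPLE, keyword="nist", category="Compliance", min_downloads=100)
print([rec.repo_id for rec in hits])
```

Leaving a filter at its default (`None`, `0`, or an empty keyword) disables it, so calling `filter_datasets(SAMPLE)` returns every record.
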
### Dataset Details
- Comprehensive metadata for each dataset
- Statistics (downloads, likes, size, language)
- Complete tag listings
- Direct links to HuggingFace repositories
- Mock preview that shows each dataset's structure (no real samples are loaded)

### Statistics & Visualizations
Interactive charts powered by Plotly:
- **Category Distribution**: Pie chart showing dataset distribution across categories
- **Language Distribution**: Bar chart of top 10 languages
- **Top Downloads**: Horizontal bar chart of most popular datasets
- **Size Distribution**: Distribution of dataset sizes

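Each chart above is driven by a small aggregation over the metadata table. As a rough sketch of those aggregations, the standard-library code below computes the inputs for the category, language, and top-downloads charts. It deliberately omits the Plotly calls and does not use pandas, so it is not the app's implementation; the row layout, the function names, and the sample rows are assumptions made for illustration.

```python
from collections import Counter

# Made-up (repo_id, category, language, downloads) rows, for illustration only.
ROWS = [
    ("example-org/a", "AI", "English", 500),
    ("example-org/b", "Defensive", "English", 1200),
    ("example-org/c", "Defensive", "Chinese", 90),
    ("example-org/d", "Offensive", "English", 300),
    ("example-org/e", "AI", "Chinese", 40),
]


def category_counts(rows):
    """Count datasets per category (the slice sizes of a pie chart)."""
    return Counter(category for _, category, _, _ in rows)


def top_languages(rows, n=10):
    """Return the n most common languages as (language, count) pairs."""
    return Counter(language for _, _, language, _ in rows).most_common(n)


def top_downloads(rows, n=5):
    """Return the n most-downloaded datasets as (repo_id, downloads) pairs."""
    ranked = sorted(rows, key=lambda row: row[3], reverse=True)
    return [(repo_id, downloads) for repo_id, _, _, downloads in ranked[:n]]


print(dict(category_counts(ROWS)))
print(top_languages(ROWS, n=2))
print(top_downloads(ROWS, n=3))
```

The returned pairs map directly onto chart axes: labels and values for the pie chart, and category names with bar lengths for the two bar charts.
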
### Export Capabilities
- Export filtered results to CSV format
- Export filtered results to JSON format
- Download data for offline analysis

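The two export formats can be produced from the same list of filtered rows. Here is a minimal sketch using only Python's `csv` and `json` modules; the app itself may rely on pandas for this step. The column names, the helper names `to_csv` and `to_json`, and the sample rows are hypothetical.

```python
import csv
import io
import json

# Hypothetical filtered results; the keys are illustrative column names.
RESULTS = [
    {"repo_id": "example-org/a", "category": "AI", "downloads": 500, "likes": 12},
    {"repo_id": "example-org/b", "category": "Defensive", "downloads": 1200, "likes": 30},
]


def to_csv(rows):
    """Serialize rows to CSV text with a header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order follows the key order of the first row.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize rows to a pretty-printed JSON array."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


print(to_csv(RESULTS))
print(to_json(RESULTS))
```

Setting `ensure_ascii=False` keeps non-English descriptions readable in the JSON output, which matters for a catalog that spans several languages.
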
### Dark Theme
A dark theme optimized for readability, with:
- High contrast colors
- Interactive hover effects
- Responsive layout
- Professional visualization styling

## Dataset Categories

### AI (27 datasets)
Datasets for training and evaluating AI/ML models in cybersecurity:
- Instruction-tuning datasets
- ShareGPT format conversations
- Question-answering pairs
- Synthetic training data
- Fine-tuning datasets

### Defensive (28 datasets)
Blue team, security operations, and threat detection:
- Threat intelligence
- Incident response
- Security operations
- Detection rules (SIGMA, YARA, Suricata)
- Honeypot data
- News and threat feeds

### Offensive (10 datasets)
Red team, penetration testing, and security research:
- Penetration testing techniques
- Exploit databases
- Attack scenarios
- Vulnerability data
- CVE databases

### Compliance (5 datasets)
Regulatory frameworks and standards:
- NIST Cybersecurity Framework
- ISO/IEC 27001
- Taiwan Cybersecurity Law
- Compliance training data

## Top Datasets

1. **ethanolivertroy/nist-cybersecurity-training** (8,000 downloads)
   - Largest open-source NIST cybersecurity training dataset
   - 100K-1M samples for LLM fine-tuning

2. **clydeiii/cybersecurity** (4,000 downloads)
   - APT notes from GitHub
   - Threat intelligence focus

3. **vinitvek/cybersecurityattacks** (2,300 downloads)
   - Cybersecurity attacks dataset
   - 10K-100K samples

4. **Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset** (786 downloads, 78 likes)
   - 53,202 instruction-tuning examples
   - Defensive security focus

5. **AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0** (353 downloads)
   - 83,920 high-quality training triples
   - Defensive cybersecurity

## Statistics

- **Total Datasets**: 80
- **Total Downloads**: 18,000+
- **Languages**: 10+ (English, Chinese, Korean, Italian, French, Russian, etc.)
- **Size Range**: <1K to 10M+ samples

## Usage

### Search Examples

1. **Find NIST-related datasets**:
   - Keyword: "NIST"
   - Category: Compliance

2. **Find penetration testing datasets**:
   - Keyword: "penetration" or "pentest"
   - Category: Offensive

3. **Find instruction-tuning datasets**:
   - Keyword: "instruction"
   - Category: AI
   - Min Downloads: 100

4. **Find threat intelligence datasets**:
   - Keyword: "threat"
   - Category: Defensive

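The four searches above can also be written down as data, which makes them easy to reuse or test. The sketch below is one possible encoding, not something the interface is known to expose: the `PRESETS` dictionary, its keys, and the `matches` helper are hypothetical, and treating "penetration" or "pentest" as alternatives within one query is an assumption.

```python
# The four example searches as filter presets. "keywords" lists alternatives:
# a dataset matches if ANY of them appears in its text (case-insensitive).
PRESETS = {
    "nist": {"keywords": ["NIST"], "category": "Compliance", "min_downloads": 0},
    "pentest": {"keywords": ["penetration", "pentest"], "category": "Offensive", "min_downloads": 0},
    "instruction": {"keywords": ["instruction"], "category": "AI", "min_downloads": 100},
    "threat": {"keywords": ["threat"], "category": "Defensive", "min_downloads": 0},
}


def matches(preset, text, category, downloads):
    """Check one dataset's text, category, and download count against a preset."""
    text = text.lower()
    return (
        any(keyword.lower() in text for keyword in preset["keywords"])
        and category == preset["category"]
        and downloads >= preset["min_downloads"]
    )


# Made-up dataset descriptions, for illustration only.
print(matches(PRESETS["pentest"], "Pentest lab notes", "Offensive", 10))
print(matches(PRESETS["instruction"], "Instruction-tuning pairs", "AI", 50))
```
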
### Export Workflow

1. Apply desired filters
2. Click "Search Datasets"
3. Click "Export to CSV" or "Export to JSON"
4. Download the file from the interface

## Technologies

- **Gradio 4.44.1**: Interactive web interface
- **Pandas 2.1.4**: Data manipulation and filtering
- **Plotly 5.18.0**: Interactive visualizations
- **HuggingFace Datasets 2.16.1**: Dataset metadata

## Data Sources

All datasets are publicly available on HuggingFace Hub. This explorer provides:
- Curated metadata from 80 cybersecurity datasets
- Filtering and search capabilities
- Visual analytics
- Export functionality

To access actual dataset content, click the HuggingFace URL for any dataset.

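Dataset pages on the Hub live under `https://huggingface.co/datasets/` followed by the repository ID, so a link can be derived from the ID alone. The helper below is a small sketch of that mapping; the function name `dataset_url` is hypothetical, and it assumes IDs in the `owner/name` form used throughout this README.

```python
from urllib.parse import quote


def dataset_url(repo_id: str) -> str:
    """Build the Hub page URL for a dataset repo ID of the form 'owner/name'."""
    owner, sep, name = repo_id.partition("/")
    if not sep or not owner or not name:
        raise ValueError(f"expected 'owner/name', got {repo_id!r}")
    # Percent-encode each part so unusual characters cannot break the URL.
    return f"https://huggingface.co/datasets/{quote(owner)}/{quote(name)}"


print(dataset_url("Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset"))
```
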
## Development

### Local Setup

```bash
pip install -r requirements.txt
python app.py
```

### File Structure

```
dataset-explorer/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
└── README.md           # This file
```

## Future Enhancements

- Live dataset preview (load actual samples)
- Full-text search within dataset content
- Advanced filtering (by date, size range)
- Dataset comparison tool
- API integration for real-time updates
- Custom visualization builder
- Dataset recommendation engine

## License

Apache 2.0

## Author

**AYI-NEDJIMI**

## Acknowledgments

Special thanks to the HuggingFace community and all dataset creators who make their cybersecurity datasets publicly available.

---

**Note**: This is a metadata explorer. To download and use the actual datasets, visit the HuggingFace links provided in the interface.