title: Dataset Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: apache-2.0
tags:
  - cybersecurity
  - datasets
  - data-explorer
  - analytics
  - visualization

🔍 Cybersecurity Dataset Explorer

A Gradio Space for exploring and analyzing metadata for 80 cybersecurity datasets hosted on the HuggingFace Hub.

Features

🔍 Search & Filter

  • Search by keyword across dataset names, descriptions, and tags
  • Filter by language (English, Chinese, Korean, Italian, French, Russian, etc.)
  • Filter by category (AI, Defensive, Offensive, Compliance)
  • Filter by popularity (minimum downloads and likes)
  • View results in interactive tables
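
The filters above can be pictured as a chain of predicates applied to a metadata table. The following is a dependency-free sketch of that idea, not the code in app.py (which uses Pandas). The `DatasetMeta` record, its field names, the `search` function, and the sample rows are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional


@dataclass
class DatasetMeta:
    """Hypothetical metadata record; the field names are assumptions."""
    name: str
    description: str
    language: str
    category: str
    downloads: int
    likes: int
    tags: list[str] = field(default_factory=list)


def search(
    rows: Iterable[DatasetMeta],
    keyword: str = "",
    language: Optional[str] = None,
    category: Optional[str] = None,
    min_downloads: int = 0,
    min_likes: int = 0,
) -> list[DatasetMeta]:
    """Return the rows that satisfy every active filter."""
    needle = keyword.strip().lower()
    results = []
    for row in rows:
        # Keyword matches case-insensitively against name, description, and tags.
        haystack = " ".join([row.name, row.description, *row.tags]).lower()
        if needle and needle not in haystack:
            continue
        if language is not None and row.language != language:
            continue
        if category is not None and row.category != category:
            continue
        if row.downloads < min_downloads or row.likes < min_likes:
            continue
        results.append(row)
    return results


# Made-up sample rows, for illustration only.
SAMPLE = [
    DatasetMeta("example/nist-qa", "NIST framework question answering",
                "English", "Compliance", 500, 12, ["nist", "qa"]),
    DatasetMeta("example/pentest-notes", "Penetration testing write-ups",
                "English", "Offensive", 80, 3, ["pentest"]),
    DatasetMeta("example/threat-feed-zh", "Threat intelligence feed",
                "Chinese", "Defensive", 150, 5, ["threat"]),
]

hits = search(SAMPLE, keyword="NIST", category="Compliance")
print([row.name for row in hits])
```

Leaving a filter at its default (`None`, `0`, or an empty keyword) disables it, which mirrors leaving a control untouched in the interface.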

📊 Dataset Details

  • Comprehensive metadata for each dataset
  • Statistics (downloads, likes, size, language)
  • Complete tag listings
  • Direct links to HuggingFace repositories
  • Mock preview that shows the dataset structure only, not real samples

📈 Statistics & Visualizations

Interactive charts powered by Plotly:

  • Category Distribution: Pie chart showing dataset distribution across categories
  • Language Distribution: Bar chart of top 10 languages
  • Top Downloads: Horizontal bar chart of most popular datasets
  • Size Distribution: Distribution of dataset sizes
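
As a rough illustration of the category pie chart, the sketch below counts datasets per category and hands the result to Plotly Express. The helper names `category_counts` and `category_pie` are hypothetical and are not taken from app.py. Plotly is imported lazily so the counting step runs without it.

```python
from collections import Counter


def category_counts(categories):
    """Count how many datasets fall into each category."""
    return dict(Counter(categories))


def category_pie(counts):
    """Build a Plotly pie chart from a {category: count} mapping."""
    # Imported lazily so category_counts works without Plotly installed.
    import plotly.express as px

    return px.pie(
        names=list(counts.keys()),
        values=list(counts.values()),
        title="Dataset distribution by category",
    )


# Made-up category labels, one per dataset, for illustration only.
labels = ["AI", "Defensive", "AI", "Offensive"]
print(category_counts(labels))
```

With Plotly installed, `category_pie(category_counts(labels)).show()` would render the chart in a browser or notebook.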

📥 Export Capabilities

  • Export filtered results to CSV format
  • Export filtered results to JSON format
  • Download data for offline analysis
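
A minimal sketch of what the two export formats could look like, using only the standard library. The functions `to_csv` and `to_json` and the sample rows are assumptions for illustration; the app itself may build its exports differently (for example with Pandas).

```python
import csv
import io
import json


def to_csv(rows):
    """Serialize a list of dicts to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize a list of dicts to pretty-printed JSON text."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


# Made-up filtered results, for illustration only.
filtered = [
    {"name": "example/nist-qa", "category": "Compliance", "downloads": 500},
    {"name": "example/threat-feed", "category": "Defensive", "downloads": 150},
]

print(to_csv(filtered))
print(to_json(filtered))
```

Both functions return text, so the caller can write it to a file or pass it to a download widget.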

🎨 Dark Theme

A dark theme optimized for readability, with:

  • High contrast colors
  • Interactive hover effects
  • Responsive layout
  • Professional visualization styling

Dataset Categories

AI (27 datasets)

Datasets for training and evaluating AI/ML models in cybersecurity:

  • Instruction-tuning datasets
  • ShareGPT format conversations
  • Question-answering pairs
  • Synthetic training data
  • Fine-tuning datasets

Defensive (28 datasets)

Blue team, security operations, and threat detection:

  • Threat intelligence
  • Incident response
  • Security operations
  • Detection rules (SIGMA, YARA, Suricata)
  • Honeypot data
  • News and threat feeds

Offensive (10 datasets)

Red team, penetration testing, and security research:

  • Penetration testing techniques
  • Exploit databases
  • Attack scenarios
  • Vulnerability data
  • CVE databases

Compliance (5 datasets)

Regulatory frameworks and standards:

  • NIST Cybersecurity Framework
  • ISO/IEC 27001
  • Taiwan Cybersecurity Law
  • Compliance training data

Top Datasets

  1. ethanolivertroy/nist-cybersecurity-training (8,000 downloads)

    • Largest open-source NIST cybersecurity training dataset
    • 100K-1M samples for LLM fine-tuning
  2. clydeiii/cybersecurity (4,000 downloads)

    • APT notes from GitHub
    • Threat intelligence focus
  3. vinitvek/cybersecurityattacks (2,300 downloads)

    • Cybersecurity attacks dataset
    • 10K-100K samples
  4. Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset (786 downloads, 78 likes)

    • 53,202 instruction-tuning examples
    • Defensive security focus
  5. AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0 (353 downloads)

    • 83,920 high-quality training triples
    • Defensive cybersecurity

Statistics

  • Total Datasets: 80
  • Total Downloads: 18,000+
  • Languages: 10+ (English, Chinese, Korean, Italian, French, Russian, etc.)
  • Size Range: <1K to 10M+ samples

Usage

Search Examples

  1. Find NIST-related datasets:

    • Keyword: "NIST"
    • Category: Compliance
  2. Find penetration testing datasets:

    • Keyword: "penetration" or "pentest"
    • Category: Offensive
  3. Find instruction-tuning datasets:

    • Keyword: "instruction"
    • Category: AI
    • Min Downloads: 100
  4. Find threat intelligence datasets:

    • Keyword: "threat"
    • Category: Defensive

Export Workflow

  1. Apply desired filters
  2. Click "Search Datasets"
  3. Click "Export to CSV" or "Export to JSON"
  4. Download the file from the interface

Technologies

  • Gradio 4.44.1: Interactive web interface
  • Pandas 2.1.4: Data manipulation and filtering
  • Plotly 5.18.0: Interactive visualizations
  • HuggingFace Datasets 2.16.1: Dataset metadata

Data Sources

All datasets are publicly available on HuggingFace Hub. This explorer provides:

  • Curated metadata from 80 cybersecurity datasets
  • Filtering and search capabilities
  • Visual analytics
  • Export functionality

To access actual dataset content, click the HuggingFace URL for any dataset.
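
For readers who prefer to go from a dataset ID to its content programmatically, here is a small sketch. The helper names `hub_url` and `peek` are hypothetical. `peek` relies on the `datasets` library's streaming mode and assumes the dataset exposes a split called `train`, which is not true for every dataset.

```python
from itertools import islice

HUB_DATASET_URL = "https://huggingface.co/datasets/{repo_id}"


def hub_url(repo_id):
    """Return the Hub page URL for a dataset ID such as 'owner/name'."""
    return HUB_DATASET_URL.format(repo_id=repo_id)


def peek(repo_id, split="train", n=3):
    """Stream the first n records of a dataset without downloading it in full.

    Assumes the dataset has a split named `split`; adjust it if not.
    """
    # Imported lazily: needs the `datasets` package and network access.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


print(hub_url("Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset"))
```

Calling `peek("owner/name")` with a real dataset ID fetches a few records over the network; gated or script-based datasets may need extra arguments or authentication.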

Development

Local Setup

pip install -r requirements.txt
python app.py

File Structure

dataset-explorer/
├── app.py                 # Main Gradio application
├── requirements.txt       # Python dependencies
└── README.md             # This file

Future Enhancements

  • Live dataset preview (load actual samples)
  • Full-text search within dataset content
  • Advanced filtering (by date, size range)
  • Dataset comparison tool
  • API integration for real-time updates
  • Custom visualization builder
  • Dataset recommendation engine

License

Apache 2.0

Author

AYI-NEDJIMI

Acknowledgments

Special thanks to the HuggingFace community and all dataset creators who make their cybersecurity datasets publicly available.


Note: This is a metadata explorer. To download and use the actual datasets, visit the HuggingFace links provided in the interface.