---
title: Dataset Explorer
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: apache-2.0
tags:
- cybersecurity
- datasets
- data-explorer
- analytics
- visualization
---
# 🔍 Cybersecurity Dataset Explorer
A Gradio Space for searching, filtering, and visualizing metadata for 80 cybersecurity datasets hosted on the HuggingFace Hub.
## Features
### 🔍 Search & Filter
- Search by keyword across dataset names, descriptions, and tags
- Filter by language (English, Chinese, Korean, Italian, French, Russian, etc.)
- Filter by category (AI, Defensive, Offensive, Compliance)
- Filter by popularity (minimum downloads and likes)
- View results in interactive tables
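
The filters above combine with AND semantics: a dataset is listed only if it passes every active filter. The following is a minimal sketch of that logic, not the code in `app.py`. It uses plain dictionaries and the standard library so it runs without dependencies; the function name `search_datasets`, the record keys, and the sample rows are all assumptions made for illustration.

```python
def search_datasets(records, keyword="", language=None, category=None,
                    min_downloads=0, min_likes=0):
    """Return the records that pass every active filter.

    Each record is a dict with the assumed keys: repo_id, description,
    tags, language, category, downloads, likes.
    """
    needle = keyword.strip().lower()
    matches = []
    for rec in records:
        # Keyword search covers name, description, and tags, case-insensitively.
        haystack = " ".join([rec["repo_id"], rec["description"], *rec["tags"]]).lower()
        if needle and needle not in haystack:
            continue
        if language and rec["language"] != language:
            continue
        if category and rec["category"] != category:
            continue
        if rec["downloads"] < min_downloads or rec["likes"] < min_likes:
            continue
        matches.append(rec)
    return matches


# Invented placeholder rows, not real Hub metadata.
SAMPLE = [
    {"repo_id": "example-org/nist-qa", "description": "NIST framework Q&A pairs",
     "tags": ["compliance", "qa"], "language": "English",
     "category": "Compliance", "downloads": 1200, "likes": 15},
    {"repo_id": "example-org/pentest-notes", "description": "Penetration testing write-ups",
     "tags": ["red-team"], "language": "English",
     "category": "Offensive", "downloads": 300, "likes": 4},
    {"repo_id": "example-org/threat-feed-fr", "description": "Threat intelligence bulletins",
     "tags": ["threat-intel"], "language": "French",
     "category": "Defensive", "downloads": 90, "likes": 2},
]

nist_hits = search_datasets(SAMPLE, keyword="NIST", category="Compliance")
popular_english = search_datasets(SAMPLE, language="English", min_downloads=500)
print([rec["repo_id"] for rec in nist_hits])
print([rec["repo_id"] for rec in popular_english])
```

Leaving a filter at its default (empty keyword, `None`, or `0`) disables it, so calling the function with no filters returns every record.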
### 📊 Dataset Details
- Comprehensive metadata for each dataset
- Statistics (downloads, likes, size, language)
- Complete tag listings
- Direct links to HuggingFace repositories
- Mock preview that shows each dataset's structure (no real samples are loaded)
### 📈 Statistics & Visualizations
Interactive charts powered by Plotly:
- **Category Distribution**: Pie chart showing dataset distribution across categories
- **Language Distribution**: Bar chart of top 10 languages
- **Top Downloads**: Horizontal bar chart of most popular datasets
- **Size Distribution**: Distribution of dataset sizes
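
Each chart is driven by a simple aggregation over the metadata table. The sketch below shows, under assumed record keys (`repo_id`, `category`, `language`, `downloads`), how the inputs to the first three charts could be computed. The helper names and sample rows are hypothetical, and the standard library stands in for Pandas so the example runs on its own.

```python
from collections import Counter


def category_counts(records):
    """Datasets per category: the slice sizes of a pie chart."""
    return Counter(rec["category"] for rec in records)


def top_languages(records, n=10):
    """The n most common languages, most frequent first."""
    return Counter(rec["language"] for rec in records).most_common(n)


def top_downloads(records, n=5):
    """The n most-downloaded datasets as (repo_id, downloads) pairs."""
    ranked = sorted(records, key=lambda rec: rec["downloads"], reverse=True)
    return [(rec["repo_id"], rec["downloads"]) for rec in ranked[:n]]


# Invented placeholder rows, not real Hub metadata.
SAMPLE = [
    {"repo_id": "example-org/a", "category": "AI", "language": "English", "downloads": 500},
    {"repo_id": "example-org/b", "category": "AI", "language": "English", "downloads": 1200},
    {"repo_id": "example-org/c", "category": "Defensive", "language": "Korean", "downloads": 80},
]

print(category_counts(SAMPLE))
print(top_languages(SAMPLE, n=1))
print(top_downloads(SAMPLE, n=2))
```

The resulting labels and counts can be passed to a charting call such as `plotly.express.pie(names=..., values=...)`; how `app.py` actually builds its figures is not shown in this README.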
### 📥 Export Capabilities
- Export filtered results to CSV format
- Export filtered results to JSON format
- Download data for offline analysis
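
As an illustration of the two export formats, the sketch below serializes a list of filtered records to CSV and JSON text with the standard library. The function names, the column selection, and the sample row are assumptions; the app itself may rely on Pandas for this step.

```python
import csv
import io
import json


def to_csv(records, columns):
    """Serialize records to CSV text with a header row.

    Only the listed columns are written; other keys are ignored.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records):
    """Serialize records to pretty-printed JSON, keeping non-ASCII text readable."""
    return json.dumps(records, indent=2, ensure_ascii=False)


# Invented placeholder row, not real Hub metadata.
rows = [{"repo_id": "example-org/a", "downloads": 10, "tags": ["demo"]}]

csv_text = to_csv(rows, columns=["repo_id", "downloads"])
json_text = to_json(rows)
print(csv_text)
print(json_text)
```

List-valued fields such as tags have no natural CSV representation, which is why this sketch writes them only to JSON.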
### 🎨 Dark Theme
A dark theme optimized for readability, with:
- High contrast colors
- Interactive hover effects
- Responsive layout
- Professional visualization styling
## Dataset Categories
### AI (27 datasets)
Datasets for training and evaluating AI/ML models in cybersecurity:
- Instruction-tuning datasets
- ShareGPT format conversations
- Question-answering pairs
- Synthetic training data
- Fine-tuning datasets
### Defensive (28 datasets)
Blue team, security operations, and threat detection:
- Threat intelligence
- Incident response
- Security operations
- Detection rules (SIGMA, YARA, Suricata)
- Honeypot data
- News and threat feeds
### Offensive (10 datasets)
Red team, penetration testing, and security research:
- Penetration testing techniques
- Exploit databases
- Attack scenarios
- Vulnerability data
- CVE databases
### Compliance (5 datasets)
Regulatory frameworks and standards:
- NIST Cybersecurity Framework
- ISO/IEC 27001
- Taiwan Cybersecurity Law
- Compliance training data
## Top Datasets
1. **ethanolivertroy/nist-cybersecurity-training** (8,000 downloads)
   - Largest open-source NIST cybersecurity training dataset
   - 100K-1M samples for LLM fine-tuning
2. **clydeiii/cybersecurity** (4,000 downloads)
   - APT notes from GitHub
   - Threat intelligence focus
3. **vinitvek/cybersecurityattacks** (2,300 downloads)
   - Cybersecurity attacks dataset
   - 10K-100K samples
4. **Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset** (786 downloads, 78 likes)
   - 53,202 instruction-tuning examples
   - Defensive security focus
5. **AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.0** (353 downloads)
   - 83,920 high-quality training triples
   - Defensive cybersecurity
## Statistics
- **Total Datasets**: 80
- **Total Downloads**: 18,000+
- **Languages**: 10+ (English, Chinese, Korean, Italian, French, Russian, etc.)
- **Size Range**: <1K to 10M+ samples
## Usage
### Search Examples
1. **Find NIST-related datasets**:
   - Keyword: "NIST"
   - Category: Compliance
2. **Find penetration testing datasets**:
   - Keyword: "penetration" or "pentest"
   - Category: Offensive
3. **Find instruction-tuning datasets**:
   - Keyword: "instruction"
   - Category: AI
   - Min Downloads: 100
4. **Find threat intelligence datasets**:
   - Keyword: "threat"
   - Category: Defensive
### Export Workflow
1. Apply desired filters
2. Click "Search Datasets"
3. Click "Export to CSV" or "Export to JSON"
4. Download the file from the interface
## Technologies
- **Gradio 4.44.1**: Interactive web interface
- **Pandas 2.1.4**: Data manipulation and filtering
- **Plotly 5.18.0**: Interactive visualizations
- **HuggingFace Datasets 2.16.1**: Dataset metadata
## Data Sources
All datasets are publicly available on HuggingFace Hub. This explorer provides:
- Curated metadata from 80 cybersecurity datasets
- Filtering and search capabilities
- Visual analytics
- Export functionality
To access actual dataset content, click the HuggingFace URL for any dataset.
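
Dataset pages on the Hub follow the pattern `https://huggingface.co/datasets/<owner>/<name>`, so a link can be derived from the repository id alone. A small sketch, with the helper name `dataset_url` chosen here for illustration:

```python
def dataset_url(repo_id: str) -> str:
    """Build the Hub page URL for a dataset repository id such as 'owner/name'."""
    return f"https://huggingface.co/datasets/{repo_id.strip('/')}"


print(dataset_url("clydeiii/cybersecurity"))
```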
## Development
### Local Setup
```bash
pip install -r requirements.txt
python app.py
```
### File Structure
```
dataset-explorer/
├── app.py              # Main Gradio application
├── requirements.txt    # Python dependencies
└── README.md           # This file
```
## Future Enhancements
- Live dataset preview (load actual samples)
- Full-text search within dataset content
- Advanced filtering (by date, size range)
- Dataset comparison tool
- API integration for real-time updates
- Custom visualization builder
- Dataset recommendation engine
## License
Apache 2.0
## Author
**AYI-NEDJIMI**
## Acknowledgments
Special thanks to the HuggingFace community and all dataset creators who make their cybersecurity datasets publicly available.
---
**Note**: This is a metadata explorer. To download and use the actual datasets, visit the HuggingFace links provided in the interface.