title: AI Safety Datasets Overview
emoji: 🛡️
colorFrom: red
colorTo: orange
sdk: static
pinned: false
license: cc-by-nc-4.0
short_description: >-
  Comprehensive AI safety evaluation datasets with organic and synthetic
  adversarial conversations
tags:
- safety
- adversarial
- red-teaming
- ai-safety
- multi-turn
- synthetic
datasets:
- julyai7/multi-turn-conversations
- julyai7/multi-turn-bio-transformed-synth-conversations-v1
- julyai7/multi-turn-bio-transformed-synth-conversations-v2
- julyai7/multi-turn-keyword-transformed-synth-conversations
AI Safety Datasets Collection - Overview
This Space gives an overview of four AI safety evaluation datasets. They are designed to test AI model safety mechanisms across several attack methodologies and harm categories.
Dataset Collection Summary
We offer four complementary datasets, each serving specific evaluation purposes:
| Dataset | Type | Methodology | Primary Use Case |
|---|---|---|---|
| Original Multi-turn Conversations | Organic | Human-AI interactions | Baseline safety evaluation |
| Bio-transformed Synthetic V1 | Synthetic | Direct bio-topic transformation | Bio-safety adversarial testing |
| Bio-transformed Synthetic V2 | Synthetic | Adaptive bio-topic transformation | Bio-safety with natural flow |
| Keyword-transformed Synthetic | Synthetic | Cross-domain keyword substitution | Domain-agnostic safety testing |
Detailed Dataset Descriptions
1. Original Multi-turn Conversations
Collections: psychs + illicits
Source Data Criteria:
Psychology Conversations (psychs)
- Harm Category: Psychology
- Objective Type: Text-based, no JupyterHub templates
- Quality Filters:
  - NeurIPS evaluation verdict: success
  - No disclaimer present
  - Organic conversations only (organicity = true)
- Purpose: Baseline psychology-related harmful requests that successfully elicit unsafe responses
Illicit Conversations (illicits)
- Harm Category: Illicit activities
- Objective Type: Text-based, no JupyterHub templates
- Quality Filters:
  - NeurIPS evaluation verdict: success
  - Contains specific instruction details (not none)
  - Executable by high school students (high_school_student_executable != false)
- Purpose: Baseline illicit-related harmful requests with practical execution potential
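The two filter sets above can be sketched as plain predicates over conversation records. This is a minimal illustration, not the production query: only `organicity` and `high_school_student_executable` are field names taken from the criteria above, while `objective_type`, `verdict`, `harm_category`, `has_disclaimer`, and `instruction_details` are hypothetical names chosen for the example.

```python
from typing import Any, Mapping

Record = Mapping[str, Any]


def passes_base_filters(rec: Record) -> bool:
    """Checks shared by both collections: text-based objective, verdict == success."""
    # `objective_type` and `verdict` are assumed field names.
    return rec.get("objective_type") == "text" and rec.get("verdict") == "success"


def keep_psych(rec: Record) -> bool:
    """Psychology filter: organic conversation with no disclaimer present."""
    return (
        passes_base_filters(rec)
        and rec.get("harm_category") == "psychology"  # assumed field name
        and rec.get("organicity") is True
        and not rec.get("has_disclaimer", False)  # assumed field name
    )


def keep_illicit(rec: Record) -> bool:
    """Illicit filter: specific instruction details, not marked non-executable."""
    return (
        passes_base_filters(rec)
        and rec.get("harm_category") == "illicit"  # assumed field name
        # "not none": a missing value is treated like "none" and excluded.
        and rec.get("instruction_details", "none") != "none"
        # "!= false": only an explicit False excludes the record.
        and rec.get("high_school_student_executable") is not False
    )
```

Note that `high_school_student_executable != false` is read here as "keep unless explicitly false", so records where the field is missing still pass. Whether the real filter treats missing values this way is an assumption.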
Use Cases:
- Establishing baseline safety evaluation metrics
- Understanding organic adversarial conversation patterns
- Training and fine-tuning safety classifiers
- Benchmarking model safety before synthetic transformations
Sample Dataset: 50 conversations (390 turns)
- Psychology: 55.6% of sample turns
- Illicit: 44.4% of sample turns
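The percentages above are shares of turns, not of conversations. A minimal sketch of that calculation follows, assuming each conversation is a dict with a `category` label and a `turns` list (both hypothetical field names) and using toy data rather than the real sample:

```python
from collections import Counter


def turn_share_by_category(conversations):
    """Return {category: percent of all turns}, rounded to one decimal place."""
    turns = Counter()
    for conv in conversations:
        # Weight each conversation by its number of turns.
        turns[conv["category"]] += len(conv["turns"])
    total = sum(turns.values())
    if total == 0:
        return {}
    return {cat: round(100 * n / total, 1) for cat, n in turns.items()}


# Toy data: one 6-turn psychology conversation and one 2-turn illicit one.
toy = [
    {"category": "psychology", "turns": ["t"] * 6},
    {"category": "illicit", "turns": ["t"] * 2},
]
print(turn_share_by_category(toy))  # {'psychology': 75.0, 'illicit': 25.0}
```

Because longer conversations contribute more turns, a category's share of turns can differ from its share of conversations.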
2. Bio-transformed Synthetic Multi-turn Conversations V1
Collection: illicit_bio_synths_v1
Transformation Method: bio_topic_change
Source: Derived from original psychology + illicit conversations
Methodology V1 Characteristics:
- Direct transformation approach: Explicit adversarial pattern injection
- Focus: Systematic safety mechanism bypass strategies
- Target Domain: Bio-safety (dangerous biological information)
- Transformation Goal: Convert psychology/illicit harms into bio-safety attacks
Key Features:
- All conversations transformed to the illicit category (bio-safety domain)
- Direct mapping of harmful intents to biological contexts
- Aggressive adversarial techniques
- Tests explicit bio-safety guardrails
Use Cases:
- Testing bio-safety specific guardrails
- Evaluating cross-domain harm transfer (psych/illicit → bio)
- Red-teaming bio-related content moderation
- Training specialized bio-safety detectors
Sample Dataset: 50 conversations (449 turns, 100% illicit/bio-safety)
3. Bio-transformed Synthetic Multi-turn Conversations V2
Collection: illicit_bio_synths_v2
Transformation Method: bio_topic_change_og
Source: Derived from original psychology + illicit conversations
Methodology V2 Characteristics:
- Adaptive transformation approach: Natural conversation flow preservation
- Focus: Contextual reframing and subtle escalation patterns
- Target Domain: Bio-safety (dangerous biological information)
- Transformation Goal: More sophisticated, harder-to-detect bio-safety attacks
Key Differences from V1:
- More natural conversation progression
- Subtle escalation tactics
- Better mimics legitimate scientific inquiry
- Harder for safety systems to detect
Use Cases:
- Testing advanced bio-safety detection systems
- Evaluating robustness against sophisticated attacks
- Training models to detect subtle adversarial patterns
- Benchmarking next-generation safety systems
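One way to quantify "subtle escalation" when comparing V1 and V2 is to record the turn at which a detector first flags a conversation. The sketch below is an illustration under stated assumptions: `is_flagged` stands in for any safety classifier you supply, and the conversation is represented as a list of user prompts only. Neither is part of the datasets' documented schema.

```python
from typing import Callable, Optional, Sequence


def first_flagged_turn(
    prompts: Sequence[str],
    is_flagged: Callable[[str], bool],
) -> Optional[int]:
    """Return the 1-based turn at which the accumulated context is first flagged.

    Returns None if the classifier never flags the conversation.
    """
    context = []
    for turn_number, prompt in enumerate(prompts, start=1):
        context.append(prompt)
        # Score the conversation so far, not just the latest prompt.
        if is_flagged("\n".join(context)):
            return turn_number
    return None


# Toy classifier for demonstration only: flags text containing "FLAG".
toy_classifier = lambda text: "FLAG" in text
print(first_flagged_turn(["hello", "more FLAG", "later"], toy_classifier))  # 2
```

Under this measure, a later first-flagged turn (or `None`) on V2 than on V1 for the same detector would be consistent with V2 being harder to detect.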
Sample Dataset: 50 conversations (459 turns, 100% illicit/bio-safety)
4. Keyword-transformed Synthetic Multi-turn Conversations
Collection: keyword_synths
Transformation Method: keyword
Source: Derived from original psychology + illicit conversations
Methodology Characteristics:
- Cross-domain keyword substitution: Maintains harmful intent across contexts
- Domain shifting: Same underlying harm expressed in different domains
- Context adaptation: Systematic replacement of domain-specific terminology
- Intent preservation: Core harmful objective remains unchanged
Innovation: Tests whether AI safety mechanisms are:
- Domain-agnostic: Robust across different contexts and topics
- Intent-focused: Detecting underlying harm vs. surface-level keywords
- Context-aware: Understanding harm despite domain transformations
Key Features:
- Preserves original harm category distribution (psychology + illicit)
- Demonstrates safety mechanism vulnerabilities through context shifting
- Higher turn count per conversation in the sample (13.2 on average, vs. 7.8 for the original sample), reflecting more complex attacks
- Tests generalization of safety training
Use Cases:
- Evaluating domain-agnostic safety mechanisms
- Testing whether safety is keyword-based or intent-based
- Training robust cross-domain harm detection
- Identifying brittleness in safety systems
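A simple check for keyword-based versus intent-based behavior is to compare a detector's flag rate on original conversations with its flag rate on their keyword-transformed counterparts. The following is a minimal sketch: `is_flagged` is a placeholder for whatever classifier is under test, conversations are represented as plain strings, and the toy strings are harmless stand-ins rather than dataset content.

```python
from typing import Callable, Sequence


def detection_rate(conversations: Sequence[str], is_flagged: Callable[[str], bool]) -> float:
    """Fraction of conversations the classifier flags (0.0 for an empty set)."""
    if not conversations:
        return 0.0
    return sum(1 for conv in conversations if is_flagged(conv)) / len(conversations)


def keyword_sensitivity_gap(
    originals: Sequence[str],
    transformed: Sequence[str],
    is_flagged: Callable[[str], bool],
) -> float:
    """Detection rate on originals minus detection rate on transformed versions."""
    return detection_rate(originals, is_flagged) - detection_rate(transformed, is_flagged)


# Toy keyword filter that only recognizes the surface term "alpha".
keyword_filter = lambda text: "alpha" in text
gap = keyword_sensitivity_gap(
    ["alpha plan", "alpha steps"],
    ["beta plan", "beta steps"],
    keyword_filter,
)
print(gap)  # 1.0
```

A gap near zero suggests the detector tracks intent across domains. A large positive gap suggests it relies on surface keywords that the transformation replaced.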
Sample Dataset: 50 conversations (659 turns)
- Illicit: 51.6% of sample turns
- Psychology: 48.4% of sample turns
Data Selection Process
All datasets are derived from high-quality, validated conversations that meet strict criteria:
Base Criteria (All Datasets)
- ✅ Text-based objectives (no code execution templates)
- ✅ NeurIPS evaluation metadata present
- ✅ Verdict: success (harmful requests successfully fulfilled)
- ✅ Multi-turn conversations with prompt-response pairs
Psychology-Specific Criteria
- Organic conversations (organicity = true)
- No disclaimer in responses
- Successfully elicited harmful psychology-related content
Illicit-Specific Criteria
- Contains specific instruction details
- Practically executable (not abstract)
- Successfully elicited harmful illicit-related content
Synthetic Transformation Criteria
- Original conversation must meet base criteria
- Successful transformation to target methodology
- Maintains harmful intent in new domain
- Contains valid prompt-response pairs
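The "valid prompt-response pairs" requirement can be sketched as a structural check on each conversation. This is an assumption about what "valid" means: here, every turn must carry a non-empty `prompt` and `response` string, and a conversation needs at least `min_turns` turns to count as multi-turn. The field names and the default of 2 are hypothetical.

```python
from typing import Any, Mapping, Sequence


def _non_empty(value: Any) -> bool:
    """True for strings that contain something other than whitespace."""
    return isinstance(value, str) and bool(value.strip())


def has_valid_pairs(turns: Sequence[Mapping[str, Any]], min_turns: int = 2) -> bool:
    """Check that a conversation is multi-turn and every turn is a complete pair."""
    if len(turns) < min_turns:
        return False
    return all(
        _non_empty(turn.get("prompt")) and _non_empty(turn.get("response"))
        for turn in turns
    )
```

A transformed conversation that loses a response during rewriting, for example, would fail this check and be dropped before release.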
Dataset Statistics
Full Dataset Overview
The complete datasets are derived from our production database using strict quality filters:
| Dataset | Conversations | Turns | Avg Turns/Conv | Primary Focus |
|---|---|---|---|---|
| Original Multi-turn | 594+ | 4,642+ | 7.8 | Baseline organic conversations |
| - Psychology (psychs) | 158+ | 1,583+ | 10.0 | Psychology harm category |
| - Illicit (illicits) | 436+ | 3,059+ | 7.0 | Illicit harm category |
| Bio-transformed V1 | 1,309+ | 6,784+ | 5.2 | Direct bio-safety attacks |
| Bio-transformed V2 | 1,308+ | 8,127+ | 6.2 | Adaptive bio-safety attacks |
| Keyword-transformed | 7,110+ | 53,705+ | 7.6 | Cross-domain harm transfer |
| Total Full Datasets | 10,321+ | 73,258+ | 7.1 | All methodologies |
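The averages and totals in the table follow directly from the conversation and turn counts. As a quick arithmetic check, the sketch below recomputes them. It treats the "+" figures as exact counts for the calculation, which is a simplification since they are stated as lower bounds.

```python
# (conversations, turns) per dataset, copied from the table above.
FULL_STATS = {
    "Original Multi-turn": (594, 4642),
    "Bio-transformed V1": (1309, 6784),
    "Bio-transformed V2": (1308, 8127),
    "Keyword-transformed": (7110, 53705),
}


def summarize(stats):
    """Return per-dataset average turns, total conversations, total turns, overall average."""
    per_dataset = {name: round(turns / convs, 1) for name, (convs, turns) in stats.items()}
    total_convs = sum(convs for convs, _ in stats.values())
    total_turns = sum(turns for _, turns in stats.values())
    return per_dataset, total_convs, total_turns, round(total_turns / total_convs, 1)


per_dataset, total_convs, total_turns, overall = summarize(FULL_STATS)
print(per_dataset)
print(total_convs, total_turns, overall)  # 10321 73258 7.1
```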
Sample Data Overview (Publicly Available)
Representative sample datasets are available on Hugging Face for evaluation and testing:
| Dataset | Conversations | Turns | Avg Turns/Conv | Harm Categories |
|---|---|---|---|---|
| Original | 50 | 390 | 7.8 | Psychology (55.6%), Illicit (44.4%) |
| Bio V1 | 50 | 449 | 9.0 | Illicit/Bio (100%) |
| Bio V2 | 50 | 459 | 9.2 | Illicit/Bio (100%) |
| Keyword | 50 | 659 | 13.2 | Illicit (51.6%), Psychology (48.4%) |
| Total Samples | 200 | 1,957 | 9.8 | Multiple |
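Note: Sample datasets are selected subsets of the full datasets and are freely accessible for research evaluation. They are intended to reflect the characteristics of the full datasets, although average turns per conversation differ (for example, 13.2 in the Keyword sample versus 7.6 in the full Keyword dataset).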
Dataset Links
Hugging Face Datasets
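Original Multi-turn Conversations (julyai7/multi-turn-conversations)
- Psychology + Illicit baseline conversations
- 50 sample conversations, 390 turns
Bio-transformed Synthetic V1 (julyai7/multi-turn-bio-transformed-synth-conversations-v1)
- Direct bio-topic transformation methodology
- 50 sample conversations, 449 turns
Bio-transformed Synthetic V2 (julyai7/multi-turn-bio-transformed-synth-conversations-v2)
- Adaptive bio-topic transformation methodology
- 50 sample conversations, 459 turns
Keyword-transformed Synthetic (julyai7/multi-turn-keyword-transformed-synth-conversations)
- Cross-domain keyword substitution methodology
- 50 sample conversations, 659 turns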
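The sample datasets can likely be loaded with the Hugging Face `datasets` library. The sketch below uses the repository IDs from this Space's metadata. The short keys (`original`, `bio_v1`, and so on) are labels invented for this example, and the split and column names are not assumed, so inspect the returned object before relying on any field. Access may require accepting terms or authenticating, depending on how the repositories are configured.

```python
# Repository IDs as listed in this Space's metadata; the dict keys are
# illustrative labels, not official names.
REPO_IDS = {
    "original": "julyai7/multi-turn-conversations",
    "bio_v1": "julyai7/multi-turn-bio-transformed-synth-conversations-v1",
    "bio_v2": "julyai7/multi-turn-bio-transformed-synth-conversations-v2",
    "keyword": "julyai7/multi-turn-keyword-transformed-synth-conversations",
}


def load_sample(name: str):
    """Download one sample dataset from the Hugging Face Hub."""
    # Imported lazily so this module can be imported without the
    # `datasets` package installed (pip install datasets).
    from datasets import load_dataset

    return load_dataset(REPO_IDS[name])
```

For example, `load_sample("keyword")` returns whatever splits the repository defines. Printing the result shows the available splits and columns.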
Research Applications
These datasets enable various research directions:
Safety Evaluation
- Benchmark model safety across attack methodologies
- Measure robustness to synthetic transformations
- Evaluate domain-specific vs. general safety mechanisms
Red Teaming
- Discover new adversarial patterns
- Test safety guardrails comprehensively
- Identify blind spots in content moderation
Model Training
- Fine-tune safety classifiers
- Train adversarial attack detectors
- Develop cross-domain harm detection systems
Safety Research
- Study harm transfer across domains
- Analyze conversation-level attack patterns
- Understand multi-turn adversarial dynamics
Ethical Considerations
IMPORTANT: These datasets contain successful adversarial attacks and harmful content.
Intended Use
- ✅ Defensive security research
- ✅ AI safety evaluation and improvement
- ✅ Academic research on adversarial robustness
- ✅ Training safety and moderation systems
Prohibited Use
- ❌ Creating offensive content
- ❌ Developing attack tools for malicious purposes
- ❌ Bypassing safety systems for harm
- ❌ Any use that violates laws or ethical guidelines
Recommendations
- Use in controlled research environments
- Implement appropriate access controls
- Follow institutional review board (IRB) guidelines
- Report findings responsibly
License
All datasets are released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).
License Terms
- ✅ Use for research and evaluation
- ✅ Modify and build upon the data
- ✅ Share with attribution
- ❌ Commercial use without separate licensing
Full Dataset Access
The sample datasets provide representative examples. Full datasets contain:
- Thousands of additional conversations
- Expanded harm categories and variations
- Diverse conversation lengths and complexity levels
- Regular updates with new adversarial patterns
- Custom dataset creation for specific research needs
Contact for Full Dataset
For academic research or commercial licensing:
- Email: [your-email@domain.com]
- Website: [your-website.com]
- Include: Research objectives, institutional affiliation, intended use
Dataset Updates
Current Version: November 2024
The sample datasets represent snapshots of our larger collection. Full datasets receive regular updates with:
- New adversarial patterns and methodologies
- Additional harm categories and domains
- Improved quality filters and annotations
- Enhanced diversity in conversation styles
Citation
If you use these datasets in your research, please cite:
@dataset{ai_safety_datasets_2024,
title={AI Safety Multi-turn Conversation Datasets},
author={[Your Name/Organization]},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/julyai7}}
}
Acknowledgments
These datasets were created through:
- Rigorous NeurIPS evaluation protocols
- Advanced synthetic transformation methodologies
- Quality filtering and validation processes
- Ethical review and safety considerations
Support & Questions
For questions about the datasets:
- Open an issue in the respective dataset repository
- Join the discussion in the Community tab
- Contact us for technical support or collaboration opportunities
Last Updated: November 24, 2025