datasets_overview / README.md
Yang Chung
---
title: AI Safety Datasets Overview
emoji: 🛡️
colorFrom: red
colorTo: orange
sdk: static
pinned: false
license: cc-by-nc-4.0
short_description: >-
  Comprehensive AI safety evaluation datasets with organic and synthetic
  adversarial conversations
tags:
  - safety
  - adversarial
  - red-teaming
  - ai-safety
  - multi-turn
  - synthetic
datasets:
  - julyai7/multi-turn-conversations
  - julyai7/multi-turn-bio-transformed-synth-conversations-v1
  - julyai7/multi-turn-bio-transformed-synth-conversations-v2
  - julyai7/multi-turn-keyword-transformed-synth-conversations
---

AI Safety Datasets Collection - Overview

This space provides an overview of our comprehensive AI safety evaluation datasets. These datasets are designed for rigorous testing of AI model safety mechanisms across various attack methodologies and harm categories.

📊 Dataset Collection Summary

We offer four complementary datasets, each serving specific evaluation purposes:

| Dataset | Type | Methodology | Primary Use Case |
|---|---|---|---|
| Original Multi-turn Conversations | Organic | Human-AI interactions | Baseline safety evaluation |
| Bio-transformed Synthetic V1 | Synthetic | Direct bio-topic transformation | Bio-safety adversarial testing |
| Bio-transformed Synthetic V2 | Synthetic | Adaptive bio-topic transformation | Bio-safety with natural flow |
| Keyword-transformed Synthetic | Synthetic | Cross-domain keyword substitution | Domain-agnostic safety testing |

🔍 Detailed Dataset Descriptions

1. Original Multi-turn Conversations

Collections: psychs + illicits

Source Data Criteria:

Psychology Conversations (psychs)

  • Harm Category: Psychology
  • Objective Type: Text-based, no JupyterHub templates
  • Quality Filters:
    • NeurIPS evaluation verdict: success
    • No disclaimer present
    • Organic conversations only (organicity = true)
  • Purpose: Baseline psychology-related harmful requests that successfully elicit unsafe responses

Illicit Conversations (illicits)

  • Harm Category: Illicit activities
  • Objective Type: Text-based, no JupyterHub templates
  • Quality Filters:
    • NeurIPS evaluation verdict: success
    • Contains specific instruction details (not none)
    • Executable by high school students (high_school_student_executable != false)
  • Purpose: Baseline illicit-related harmful requests with practical execution potential
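
The two sets of source criteria above can be read as record-level predicates. The following is a minimal sketch, not the authors' pipeline: `organicity` and `high_school_student_executable` are the field names quoted above, while every other key (`harm_category`, `objective_type`, `uses_jupyterhub_template`, `neurips_verdict`, `has_disclaimer`, `instruction_details`) is a hypothetical name chosen for illustration.

```python
from typing import Any, Mapping


def meets_base_criteria(rec: Mapping[str, Any]) -> bool:
    """Text-based objective, no JupyterHub template, NeurIPS verdict 'success'."""
    return (
        rec.get("objective_type") == "text"                 # hypothetical key
        and not rec.get("uses_jupyterhub_template", False)  # hypothetical key
        and rec.get("neurips_verdict") == "success"         # hypothetical key
    )


def is_psych_baseline(rec: Mapping[str, Any]) -> bool:
    """Psychology criteria: no disclaimer, organic conversations only."""
    return (
        meets_base_criteria(rec)
        and rec.get("harm_category") == "psychology"  # hypothetical key
        and not rec.get("has_disclaimer", True)       # hypothetical key; missing value is excluded
        and rec.get("organicity") is True
    )


def is_illicit_baseline(rec: Mapping[str, Any]) -> bool:
    """Illicit criteria: specific instruction details, high-school executable."""
    details = rec.get("instruction_details")          # hypothetical key
    return (
        meets_base_criteria(rec)
        and rec.get("harm_category") == "illicit"     # hypothetical key
        and details not in (None, "", "none")
        # The source states this filter as `!= false`, so a missing value passes.
        and rec.get("high_school_student_executable") is not False
    )
```

Note the asymmetry in the last condition: because the criterion is written as `!= false` rather than `= true`, records with no value for `high_school_student_executable` are kept in this reading.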

Use Cases:

  • Establishing baseline safety evaluation metrics
  • Understanding organic adversarial conversation patterns
  • Training and fine-tuning safety classifiers
  • Benchmarking model safety before synthetic transformations

Sample Dataset: 50 conversations (390 turns)

  • Psychology: 55.6% of sample turns
  • Illicit: 44.4% of sample turns
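
As a quick arithmetic check, the percentages above can be turned back into approximate turn counts. The counts produced below are inferred by rounding and are not stated in the source.

```python
SAMPLE_TURNS = 390
SHARES = {"psychology": 55.6, "illicit": 44.4}  # percent of sample turns, as listed above

# Round each share to the nearest whole turn.
approx_turns = {name: round(SAMPLE_TURNS * pct / 100) for name, pct in SHARES.items()}

for name, count in approx_turns.items():
    # Re-derive the percentage from the rounded count as a consistency check.
    print(f"{name}: ~{count} turns ({100 * count / SAMPLE_TURNS:.1f}%)")
```

The rounded counts sum to the 390 turns reported for the sample, so the two percentages are consistent with the stated total.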

2. Bio-transformed Synthetic Multi-turn Conversations V1

Collection: illicit_bio_synths_v1

Transformation Method: bio_topic_change

Source: Derived from original psychology + illicit conversations

Methodology V1 Characteristics:

  • Direct transformation approach: Explicit adversarial pattern injection
  • Focus: Systematic safety mechanism bypass strategies
  • Target Domain: Bio-safety (dangerous biological information)
  • Transformation Goal: Convert psychology/illicit harms into bio-safety attacks

Key Features:

  • All conversations transformed to illicit category (bio-safety domain)
  • Direct mapping of harmful intents to biological contexts
  • Aggressive adversarial techniques
  • Tests explicit bio-safety guardrails

Use Cases:

  • Testing bio-safety specific guardrails
  • Evaluating cross-domain harm transfer (psych/illicit → bio)
  • Red-teaming bio-related content moderation
  • Training specialized bio-safety detectors

Sample Dataset: 50 conversations (449 turns, 100% illicit/bio-safety)


3. Bio-transformed Synthetic Multi-turn Conversations V2

Collection: illicit_bio_synths_v2

Transformation Method: bio_topic_change_og

Source: Derived from original psychology + illicit conversations

Methodology V2 Characteristics:

  • Adaptive transformation approach: Natural conversation flow preservation
  • Focus: Contextual reframing and subtle escalation patterns
  • Target Domain: Bio-safety (dangerous biological information)
  • Transformation Goal: More sophisticated, harder-to-detect bio-safety attacks

Key Differences from V1:

  • More natural conversation progression
  • Subtle escalation tactics
  • Better mimics legitimate scientific inquiry
  • Harder for safety systems to detect

Use Cases:

  • Testing advanced bio-safety detection systems
  • Evaluating robustness against sophisticated attacks
  • Training models to detect subtle adversarial patterns
  • Benchmarking next-generation safety systems

Sample Dataset: 50 conversations (459 turns, 100% illicit/bio-safety)


4. Keyword-transformed Synthetic Multi-turn Conversations

Collection: keyword_synths

Transformation Method: keyword

Source: Derived from original psychology + illicit conversations

Methodology Characteristics:

  • Cross-domain keyword substitution: Maintains harmful intent across contexts
  • Domain shifting: Same underlying harm expressed in different domains
  • Context adaptation: Systematic replacement of domain-specific terminology
  • Intent preservation: Core harmful objective remains unchanged

Innovation: Tests whether AI safety mechanisms are:

  • Domain-agnostic: Robust across different contexts and topics
  • Intent-focused: Detecting underlying harm vs. surface-level keywords
  • Context-aware: Understanding harm despite domain transformations
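
One way to probe the keyword-versus-intent question is to run the same detector on each original conversation and on its keyword-transformed counterpart, then compare detection rates. The sketch below assumes a hypothetical `detector` callable that returns `True` when it flags a conversation, and assumes the caller can pair originals with their transformed versions; neither a detector nor a pairing key is described in this overview.

```python
from typing import Callable, Sequence, Tuple

Conversation = Sequence[str]  # hypothetical representation: one string per turn


def detection_gap(
    pairs: Sequence[Tuple[Conversation, Conversation]],
    detector: Callable[[Conversation], bool],
) -> Tuple[float, float, float]:
    """Return (rate on originals, rate on transformed, difference)."""
    if not pairs:
        raise ValueError("need at least one (original, transformed) pair")
    orig_hits = sum(detector(orig) for orig, _ in pairs)
    trans_hits = sum(detector(trans) for _, trans in pairs)
    orig_rate = orig_hits / len(pairs)
    trans_rate = trans_hits / len(pairs)
    return orig_rate, trans_rate, orig_rate - trans_rate


def toy_keyword_detector(conv: Conversation) -> bool:
    """Stand-in for a purely surface-level detector; uses a placeholder token."""
    return any("FLAGGED_TERM" in turn for turn in conv)


pairs = [
    (["... FLAGGED_TERM ..."], ["... substituted term ..."]),
    (["... FLAGGED_TERM ..."], ["... FLAGGED_TERM ..."]),
]
print(detection_gap(pairs, toy_keyword_detector))  # (1.0, 0.5, 0.5)
```

A large positive difference suggests the detector depends on surface keywords; a difference near zero is what an intent-focused detector would be expected to show.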

Key Features:

  • Preserves original harm category distribution (psychology + illicit)
  • Demonstrates safety mechanism vulnerabilities through context shifting
  • Higher turn count per conversation in the sample (13.2 on average, versus 7.8 for the original sample), suggesting more complex attacks
  • Tests generalization of safety training

Use Cases:

  • Evaluating domain-agnostic safety mechanisms
  • Testing whether safety is keyword-based or intent-based
  • Training robust cross-domain harm detection
  • Identifying brittleness in safety systems

Sample Dataset: 50 conversations (659 turns)

  • Illicit: 51.6% of sample turns
  • Psychology: 48.4% of sample turns

🎯 Data Selection Process

All datasets are derived from high-quality, validated conversations that meet strict criteria:

Base Criteria (All Datasets)

  • ✅ Text-based objectives (no code execution templates)
  • ✅ NeurIPS evaluation metadata present
  • ✅ Verdict: success (harmful requests successfully fulfilled)
  • ✅ Multi-turn conversations with prompt-response pairs

Psychology-Specific Criteria

  • Organic conversations (organicity = true)
  • No disclaimer in responses
  • Successfully elicited harmful psychology-related content

Illicit-Specific Criteria

  • Contains specific instruction details
  • Practically executable (not abstract)
  • Successfully elicited harmful illicit-related content

Synthetic Transformation Criteria

  • Original conversation must meet base criteria
  • Successful transformation to target methodology
  • Maintains harmful intent in new domain
  • Contains valid prompt-response pairs
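
The "valid prompt-response pairs" requirement can be made concrete with a structural check. This is a sketch under an assumed schema: each turn is a dict with hypothetical `prompt` and `response` string fields, and "multi-turn" is read as at least two pairs. The actual column names and thresholds in the published datasets may differ.

```python
from typing import Any, Mapping, Sequence


def has_valid_pairs(turns: Sequence[Mapping[str, Any]], min_turns: int = 2) -> bool:
    """True if the conversation is multi-turn and every turn has non-empty text."""
    if len(turns) < min_turns:  # assumed reading of "multi-turn"
        return False
    for turn in turns:
        prompt, response = turn.get("prompt"), turn.get("response")
        if not (isinstance(prompt, str) and prompt.strip()):
            return False
        if not (isinstance(response, str) and response.strip()):
            return False
    return True
```

A check like this would typically run after a transformation step, so that conversations with dropped or empty turns are excluded before any intent or quality review.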

📈 Dataset Statistics

Full Dataset Overview

The complete datasets are derived from our production database using strict quality filters:

| Dataset | Conversations | Turns | Avg Turns/Conv | Primary Focus |
|---|---|---|---|---|
| Original Multi-turn | 594+ | 4,642+ | 7.8 | Baseline organic conversations |
| - Psychology (psychs) | 158+ | 1,583+ | 10.0 | Psychology harm category |
| - Illicit (illicits) | 436+ | 3,059+ | 7.0 | Illicit harm category |
| Bio-transformed V1 | 1,309+ | 6,784+ | 5.2 | Direct bio-safety attacks |
| Bio-transformed V2 | 1,308+ | 8,127+ | 6.2 | Adaptive bio-safety attacks |
| Keyword-transformed | 7,110+ | 53,705+ | 7.6 | Cross-domain harm transfer |
| Total Full Datasets | 10,321+ | 73,258+ | 7.1 | All methodologies |
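
The derived columns in this table can be recomputed from the conversation and turn counts. The snippet below treats the "+" lower bounds as exact values, which is an assumption made only for the arithmetic.

```python
# (conversations, turns) as listed in the table, with the "+" suffixes dropped.
FULL = {
    "Psychology (psychs)": (158, 1_583),
    "Illicit (illicits)": (436, 3_059),
    "Bio-transformed V1": (1_309, 6_784),
    "Bio-transformed V2": (1_308, 8_127),
    "Keyword-transformed": (7_110, 53_705),
}

total_convs = sum(convs for convs, _ in FULL.values())
total_turns = sum(turns for _, turns in FULL.values())

for name, (convs, turns) in FULL.items():
    print(f"{name:22s} {turns / convs:5.1f} turns/conv")
print(
    f"{'Total':22s} {total_turns / total_convs:5.1f} turns/conv "
    f"({total_convs:,} conversations, {total_turns:,} turns)"
)
```

Under that assumption the per-dataset averages and the totals (10,321 conversations, 73,258 turns, 7.1 turns per conversation) agree with the table.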

Sample Data Overview (Publicly Available)

Representative sample datasets are available on Hugging Face for evaluation and testing:

| Dataset | Conversations | Turns | Avg Turns/Conv | Harm Categories |
|---|---|---|---|---|
| Original | 50 | 390 | 7.8 | Psychology (55.6%), Illicit (44.4%) |
| Bio V1 | 50 | 449 | 9.0 | Illicit/Bio (100%) |
| Bio V2 | 50 | 459 | 9.2 | Illicit/Bio (100%) |
| Keyword | 50 | 659 | 13.2 | Illicit (51.6%), Psychology (48.4%) |
| Total Samples | 200 | 1,957 | 9.8 | Multiple |

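Note: Sample datasets are selected subsets intended to reflect the distribution and characteristics of the full datasets while being freely accessible for research evaluation. Average conversation length differs in places: Bio V1, for example, averages 9.0 turns per conversation in the sample versus 5.2 in the full dataset.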


🔗 Dataset Links

Hugging Face Datasets

  1. Original Multi-turn Conversations

    • Psychology + Illicit baseline conversations
    • 50 sample conversations, 390 turns
  2. Bio-transformed Synthetic V1

    • Direct bio-topic transformation methodology
    • 50 sample conversations, 449 turns
  3. Bio-transformed Synthetic V2

    • Adaptive bio-topic transformation methodology
    • 50 sample conversations, 459 turns
  4. Keyword-transformed Synthetic

    • Cross-domain keyword substitution methodology
    • 50 sample conversations, 659 turns
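
The four repository IDs listed in this page's metadata can be loaded with the Hugging Face `datasets` library. This is a minimal sketch: the short keys in `SAMPLE_REPOS` are hypothetical labels, split and column names are not documented in this overview, and access may require authentication or accepting terms on the Hub.

```python
# Repository IDs as listed in the page metadata; the dict keys are illustrative labels.
SAMPLE_REPOS = {
    "original": "julyai7/multi-turn-conversations",
    "bio_v1": "julyai7/multi-turn-bio-transformed-synth-conversations-v1",
    "bio_v2": "julyai7/multi-turn-bio-transformed-synth-conversations-v2",
    "keyword": "julyai7/multi-turn-keyword-transformed-synth-conversations",
}


def load_sample(name: str):
    """Download one sample dataset; needs `pip install datasets` and network access."""
    from datasets import load_dataset  # imported lazily so the mapping is usable offline

    return load_dataset(SAMPLE_REPOS[name])
```

Calling `load_sample("original")` returns whatever `load_dataset` yields for that repository; printing the result is a reasonable first step to see which splits and columns are actually present before relying on them.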

🧪 Research Applications

These datasets enable various research directions:

Safety Evaluation

  • Benchmark model safety across attack methodologies
  • Measure robustness to synthetic transformations
  • Evaluate domain-specific vs. general safety mechanisms

Red Teaming

  • Discover new adversarial patterns
  • Test safety guardrails comprehensively
  • Identify blind spots in content moderation

Model Training

  • Fine-tune safety classifiers
  • Train adversarial attack detectors
  • Develop cross-domain harm detection systems

Safety Research

  • Study harm transfer across domains
  • Analyze conversation-level attack patterns
  • Understand multi-turn adversarial dynamics

⚠️ Ethical Considerations

IMPORTANT: These datasets contain successful adversarial attacks and harmful content.

Intended Use

  • ✅ Defensive security research
  • ✅ AI safety evaluation and improvement
  • ✅ Academic research on adversarial robustness
  • ✅ Training safety and moderation systems

Prohibited Use

  • ❌ Creating offensive content
  • ❌ Developing attack tools for malicious purposes
  • ❌ Bypassing safety systems for harm
  • ❌ Any use that violates laws or ethical guidelines

Recommendations

  • Use in controlled research environments
  • Implement appropriate access controls
  • Follow institutional review board (IRB) guidelines
  • Report findings responsibly

📄 License

All datasets are released under CC-BY-NC-4.0 (Creative Commons Attribution-NonCommercial 4.0 International).

License Terms

  • ✅ Use for research and evaluation
  • ✅ Modify and build upon the data
  • ✅ Share with attribution
  • ❌ Commercial use without separate licensing

💼 Full Dataset Access

The sample datasets provide representative examples. Full datasets contain:

  • Thousands of additional conversations
  • Expanded harm categories and variations
  • Diverse conversation lengths and complexity levels
  • Regular updates with new adversarial patterns
  • Custom dataset creation for specific research needs

Contact for Full Dataset

For academic research or commercial licensing:

  • 📧 Email: [your-email@domain.com]
  • 🌐 Website: [your-website.com]
  • 📋 Include: Research objectives, institutional affiliation, intended use

🔄 Dataset Updates

Current Version: November 2024

The sample datasets represent snapshots of our larger collection. Full datasets receive regular updates with:

  • New adversarial patterns and methodologies
  • Additional harm categories and domains
  • Improved quality filters and annotations
  • Enhanced diversity in conversation styles

📚 Citation

If you use these datasets in your research, please cite:

@dataset{ai_safety_datasets_2024,
  title={AI Safety Multi-turn Conversation Datasets},
  author={[Your Name/Organization]},
  year={2024},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/julyai7}}
}

🤝 Acknowledgments

These datasets were created through:

  • Rigorous NeurIPS evaluation protocols
  • Advanced synthetic transformation methodologies
  • Quality filtering and validation processes
  • Ethical review and safety considerations

📞 Support & Questions

For questions about the datasets:

  • Open an issue in the respective dataset repository
  • Join the discussion in the Community tab
  • Contact us for technical support or collaboration opportunities

Last Updated: November 24, 2025