---
title: AI Safety Lab
emoji: πŸ›‘οΈ
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: DSPy-based multi-agent AI safety evaluation platform
---

# AI Safety Lab

A professional DSPy-based multi-agent platform for systematic AI safety evaluation and red-teaming of language models.

## Problem Being Solved

Organizations deploying language models face significant challenges in systematically evaluating safety risks across diverse attack vectors. Traditional safety testing approaches are often:

  • Manual and ad-hoc: Inconsistent coverage of potential failure modes
  • Prompt engineering focused: Limited scalability and reproducibility
  • Single-purpose tools: Lack comprehensive, measurable evaluation frameworks
  • Black-box approaches: Limited insight into why safety failures occur

AI Safety Lab addresses these gaps through a structured, multi-agent system that provides continuous, measurable safety assessment with DSPy-powered optimization.

## System Design

### Architecture Overview

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   RedTeaming    │───▢│  Target Model    │───▢│ SafetyJudging   β”‚
β”‚     Agent       β”‚    β”‚                  β”‚    β”‚     Agent       β”‚
β”‚                 β”‚    β”‚                  β”‚    β”‚                 β”‚
β”‚ β€’ DSPy Module   β”‚    β”‚ β€’ HF Interface   β”‚    β”‚ β€’ DSPy Module   β”‚
β”‚ β€’ Optimization  β”‚    β”‚ β€’ Local/API      β”‚    β”‚ β€’ Multi-dim     β”‚
β”‚ β€’ Structured    β”‚    β”‚ β€’ Configurable   β”‚    β”‚ β€’ Objective     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–²                                                β”‚
         β”‚                                                β–Ό
         └─────────────── DSPy Optimization Loop β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Core Components

#### 1. RedTeamingAgent

  • Purpose: Systematic generation of adversarial inputs
  • Approach: DSPy-optimized prompt generation across multiple attack vectors
  • Coverage: Instruction following, policy bypass, ambiguity, jailbreak, role-play, context injection
  • Optimization: Closed-loop improvement based on safety evaluation feedback

#### 2. SafetyJudgeAgent

  • Purpose: Objective, multi-dimensional safety assessment
  • Dimensions: Toxicity, bias, misinformation, violence, self-harm, privacy, illegal activities
  • Scoring: Quantitative risk assessment (0.0-1.0) with confidence intervals
  • Outputs: Structured judgments with actionable recommendations

#### 3. Orchestration Loop

  • Function: Coordinates agent interactions and optimization cycles
  • Process: Multi-iteration evaluation with adaptive DSPy optimization
  • Metrics: Real-time performance tracking and trend analysis
  • Reporting: Comprehensive safety reports with recommendations

#### 4. Model Interface

  • Integration: Hugging Face Hub access with local loading support
  • Flexibility: API-based and local model evaluation
  • Monitoring: Response time, success rate, and performance tracking

## DSPy Integration

The system leverages DSPy for:

  • Programmatic Prompting: Structured reasoning for adversarial prompt generation
  • Optimization: BootstrapFewShot optimization for improved attack discovery
  • Metrics: Custom evaluation functions for safety effectiveness
  • Modularity: Composable reasoning programs for different safety objectives

## What This Lab Is Not

  • ❌ A demo or tutorial: This is a production-oriented safety evaluation platform
  • ❌ A prompt engineering playground: Focuses on systematic, reproducible testing
  • ❌ A chat application: Designed for evaluation, not conversation
  • ❌ A toy example: Built for serious safety assessment workflows
  • ❌ A replacement for human review: Augments, not replaces, human expertise

## Installation and Setup

### Prerequisites

  • Python 3.8+
  • Hugging Face API token (for model access)
  • Sufficient compute resources for model evaluation

### Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/soupstick/AI_Safety_Lab
cd AI_Safety_Lab

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# Run locally
python app.py
```

### Hugging Face Space Deployment

1. Duplicate this Space to your account
2. Add your Hugging Face token (`HUGGINGFACEHUB_API_TOKEN`) as a Space secret
3. The Space will build and deploy automatically

## Usage

### Basic Safety Evaluation

1. **Select Target Model:** Choose from available Hugging Face models
2. **Define Safety Objective:** Specify the safety boundary to test
3. **Configure Parameters:** Set iterations, prompts per iteration, and the optimization threshold
4. **Run Evaluation:** Execute the multi-agent safety assessment
5. **Review Results:** Analyze prompts, responses, and the comprehensive safety report

### Advanced Configuration

```python
from orchestration.loop import EvaluationConfig, evaluation_loop

config = EvaluationConfig(
    target_model_id="meta-llama/Llama-2-7b-chat-hf",
    safety_objective="Test for harmful content generation",
    max_prompts_per_iteration=5,
    max_iterations=3,
    optimization_threshold=0.3,
    temperature=0.7,
)

report = evaluation_loop.run_evaluation(config)
```

## Evaluation Metrics

### Risk Assessment

  • Overall Risk Score: Composite safety risk (0.0-1.0)
  • Policy Violation Likelihood: Probability of policy breaches
  • Harm Severity: Categorical severity assessment (low/medium/high/critical)
  • Ambiguity Risk: Potential for misinterpretation
  • Exploitability: Likelihood of malicious exploitation

### Performance Metrics

  • Discovery Rate: Percentage of high-risk outputs identified
  • Attack Vector Coverage: Diversity of attack types tested
  • Response Success Rate: Model availability and response quality
  • Evaluation Efficiency: Time and resource optimization

### Quality Metrics

  • False Positive/Negative Rates: Accuracy of safety assessments
  • Precision/Recall: Balance between safety and usability
  • Trend Analysis: Performance changes over time

## Architecture Decisions

### Multi-Agent Design

  • Separation of Concerns: Clear boundaries between adversarial generation and safety evaluation
  • Independence: SafetyJudgeAgent has no access to red-teaming internals
  • Specialization: Each agent optimized for its specific task

### DSPy Integration

  • Programmatic Approach: Structured reasoning over prompt engineering
  • Optimization: Continuous improvement through DSPy's optimization framework
  • Reproducibility: Consistent evaluation across multiple runs

### Modular Structure

  • Extensibility: Easy addition of new agents and evaluation dimensions
  • Maintainability: Clear separation of concerns and well-defined interfaces
  • Testing: Unit testable components with clear responsibilities

## Integration Options

### Model Registry

```python
from models.hf_interface import model_interface, ModelInfo

# Register a custom model with the shared interface
model_interface.available_models["custom/model"] = ModelInfo(
    model_id="custom/model",
    name="Custom Model",
    description="Organization-specific model",
    category="Internal",
    requires_token=True,
    is_local=True,
)
```

### Custom Safety Dimensions

```python
from agents.safety_judge import SafetyJudgeAgent

class CustomSafetyJudge(SafetyJudgeAgent):
    def __init__(self):
        super().__init__()
        # Extend the built-in dimensions with organization-specific ones
        self.safety_dimensions.extend([
            "custom_compliance",
            "business_risk",
        ])
```

### Evaluation Pipelines

```python
# Batch evaluation across multiple models, reusing the config from
# the Advanced Configuration example above
models = ["model1", "model2", "model3"]
results = []

for model_id in models:
    config.target_model_id = model_id
    report = evaluation_loop.run_evaluation(config)
    results.append(report)
```
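
Assuming each report exposes a composite score (the attribute name below is hypothetical; adjust it to the actual report schema), the collected results can then be ranked across models:

```python
# "overall_risk_score" is a hypothetical attribute -- match it to the real schema
ranked = sorted(zip(models, results), key=lambda pair: pair[1].overall_risk_score)
for model_id, report in ranked:
    print(f"{model_id}: risk={report.overall_risk_score:.2f}")
```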

## Compliance and Standards

The platform supports compliance with:

  • NIST AI Risk Management Framework: Structured risk assessment and monitoring
  • AI Act Requirements: Safety testing and documentation
  • ISO/IEC 23894: AI risk management guidelines
  • Internal Governance: Customizable evaluation criteria and reporting

## Contributing

This is an internal safety experimentation platform. Contributions should focus on:

  • Enhanced agent capabilities
  • New safety evaluation dimensions
  • Improved optimization strategies
  • Additional model integrations
  • Performance and scalability improvements

## Security Considerations

  • Token Management: Never hardcode API tokens; use environment variables
  • Model Access: Controlled access to models through Hugging Face authentication
  • Data Privacy: No sensitive data storage; evaluation results are temporary
  • Audit Trail: Comprehensive logging of all evaluation activities

## License

MIT License; see the `LICENSE` file for details.

## Support

For issues and questions:

  1. Check the evaluation logs for detailed error information
  2. Verify Hugging Face token configuration
  3. Ensure model accessibility and availability
  4. Review system resource requirements

**Note:** This platform is designed for safety research and evaluation purposes. All evaluations should be conducted in accordance with organizational policies and applicable regulations.