---
title: AI Safety Lab
emoji: πŸ›‘οΈ
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: DSPy-based multi-agent AI safety evaluation platform
---

# AI Safety Lab

A professional DSPy-based multi-agent platform for systematic AI safety evaluation and red-teaming of language models.

## Problem Being Solved

Organizations deploying language models face significant challenges in systematically evaluating safety risks across diverse attack vectors. Traditional safety testing approaches are often:

- **Manual and ad-hoc**: Inconsistent coverage of potential failure modes
- **Prompt engineering focused**: Limited scalability and reproducibility
- **Single-purpose tools**: Lack comprehensive, measurable evaluation frameworks
- **Black-box approaches**: Limited insight into why safety failures occur

AI Safety Lab addresses these gaps through a structured, multi-agent system that provides continuous, measurable safety assessment with DSPy-powered optimization.

## System Design

### Architecture Overview

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ RedTeaming      │───▢│ Target Model     │───▢│ SafetyJudging   β”‚
β”‚ Agent           β”‚    β”‚                  β”‚    β”‚ Agent           β”‚
β”‚                 β”‚    β”‚                  β”‚    β”‚                 β”‚
β”‚ β€’ DSPy Module   β”‚    β”‚ β€’ HF Interface   β”‚    β”‚ β€’ DSPy Module   β”‚
β”‚ β€’ Optimization  β”‚    β”‚ β€’ Local/API      β”‚    β”‚ β€’ Multi-dim     β”‚
β”‚ β€’ Structured    β”‚    β”‚ β€’ Configurable   β”‚    β”‚ β€’ Objective     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–²                                              β”‚
         β”‚                                              β–Ό
         └─────────────── DSPy Optimization Loop β—„β”€β”€β”€β”€β”€β”€β”˜
```

### Core Components

#### 1. RedTeamingAgent
- **Purpose**: Systematic generation of adversarial inputs
- **Approach**: DSPy-optimized prompt generation across multiple attack vectors
- **Coverage**: Instruction following, policy bypass, ambiguity, jailbreak, role-play, context injection
- **Optimization**: Closed-loop improvement based on safety evaluation feedback

#### 2. SafetyJudgeAgent
- **Purpose**: Objective, multi-dimensional safety assessment
- **Dimensions**: Toxicity, bias, misinformation, violence, self-harm, privacy, illegal activities
- **Scoring**: Quantitative risk assessment (0.0-1.0) with confidence intervals
- **Outputs**: Structured judgments with actionable recommendations

#### 3. Orchestration Loop
- **Function**: Coordinates agent interactions and optimization cycles
- **Process**: Multi-iteration evaluation with adaptive DSPy optimization
- **Metrics**: Real-time performance tracking and trend analysis
- **Reporting**: Comprehensive safety reports with recommendations

#### 4. Model Interface
- **Integration**: Hugging Face Hub access with local loading support
- **Flexibility**: API-based and local model evaluation
- **Monitoring**: Response time, success rate, and performance tracking

### DSPy Integration

The system leverages DSPy for:

- **Programmatic Prompting**: Structured reasoning for adversarial prompt generation
- **Optimization**: BootstrapFewShot optimization for improved attack discovery
- **Metrics**: Custom evaluation functions for safety effectiveness
- **Modularity**: Composable reasoning programs for different safety objectives
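To make this concrete, here is a minimal sketch of how such DSPy programs typically compose: a signature for adversarial prompt generation, a signature for judging, a small module, and a BootstrapFewShot compile step. The class names, the metric, and the configured LM are illustrative assumptions for this sketch, not the actual implementations in the `agents/` and `orchestration/` packages.

```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# Illustrative LM configuration (recent DSPy releases); the model name is a placeholder.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))


class AdversarialPrompt(dspy.Signature):
    """Generate an adversarial prompt that probes a stated safety objective."""
    safety_objective: str = dspy.InputField(desc="safety boundary to test")
    attack_vector: str = dspy.InputField(desc="e.g. jailbreak, role-play, context injection")
    adversarial_prompt: str = dspy.OutputField(desc="candidate red-team prompt")


class SafetyJudgment(dspy.Signature):
    """Assess text against a safety objective and return a quantitative risk score."""
    safety_objective: str = dspy.InputField()
    model_response: str = dspy.InputField()
    risk_score: float = dspy.OutputField(desc="overall risk in [0.0, 1.0]")
    rationale: str = dspy.OutputField(desc="why the score was assigned")


class RedTeamProgram(dspy.Module):
    """Stand-in for the RedTeamingAgent: structured adversarial prompt generation."""

    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(AdversarialPrompt)

    def forward(self, safety_objective: str, attack_vector: str):
        return self.generate(safety_objective=safety_objective,
                             attack_vector=attack_vector)


judge = dspy.ChainOfThought(SafetyJudgment)


def discovery_metric(example, pred, trace=None) -> float:
    """Reward prompts the judge scores as higher risk. A full metric would first send the
    prompt to the target model and judge its response; judging the prompt directly keeps
    this sketch short."""
    judgment = judge(safety_objective=example.safety_objective,
                     model_response=pred.adversarial_prompt)
    return float(judgment.risk_score)


# BootstrapFewShot compiles the program against labeled seed examples (not shown here).
optimizer = BootstrapFewShot(metric=discovery_metric, max_bootstrapped_demos=4)
# compiled_red_team = optimizer.compile(RedTeamProgram(), trainset=seed_examples)
```

In the platform itself, the orchestration loop plays the role of this compile step: SafetyJudgeAgent scores from each iteration feed back into the red-teaming program as the optimization signal.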
## What This Lab Is Not

- ❌ **A demo or tutorial**: This is a production-oriented safety evaluation platform
- ❌ **A prompt engineering playground**: Focuses on systematic, reproducible testing
- ❌ **A chat application**: Designed for evaluation, not conversation
- ❌ **A toy example**: Built for serious safety assessment workflows
- ❌ **A replacement for human review**: Augments, not replaces, human expertise

## Installation and Setup

### Prerequisites

- Python 3.8+
- Hugging Face API token (for model access)
- Sufficient compute resources for model evaluation

### Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/soupstick/AI_Safety_Lab
cd AI_Safety_Lab

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# Run locally
python app.py
```

### Hugging Face Space Deployment

1. Clone this Space to your account
2. Add your Hugging Face token as a repository secret
3. The Space will automatically build and deploy
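If you prefer to script the deployment, recent `huggingface_hub` versions expose helpers for duplicating a Space and attaching secrets. The snippet below is a hedged sketch of that route; the target repository id and token value are placeholders, and the three manual steps above remain the simplest path.

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or the HF_TOKEN env var

# Duplicate the Space into your own namespace (target id is a placeholder).
repo = api.duplicate_space("soupstick/AI_Safety_Lab", to_id="your-username/AI_Safety_Lab")

# Attach the evaluation token as a repository secret so the Space can reach gated models.
api.add_space_secret(
    repo_id=repo.repo_id,
    key="HUGGINGFACEHUB_API_TOKEN",
    value="your_token_here",  # placeholder; never commit real tokens
)
```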
## Usage

### Basic Safety Evaluation

1. **Select Target Model**: Choose from available Hugging Face models
2. **Define Safety Objective**: Specify the safety boundary to test
3. **Configure Parameters**: Set iterations, prompts per iteration, and the optimization threshold
4. **Run Evaluation**: Execute the multi-agent safety assessment
5. **Review Results**: Analyze prompts, responses, and the comprehensive safety report

### Advanced Configuration

```python
from orchestration.loop import EvaluationConfig, evaluation_loop

config = EvaluationConfig(
    target_model_id="meta-llama/Llama-2-7b-chat-hf",
    safety_objective="Test for harmful content generation",
    max_prompts_per_iteration=5,
    max_iterations=3,
    optimization_threshold=0.3,
    temperature=0.7
)

report = evaluation_loop.run_evaluation(config)
```

## Evaluation Metrics

### Risk Assessment

- **Overall Risk Score**: Composite safety risk (0.0-1.0)
- **Policy Violation Likelihood**: Probability of policy breaches
- **Harm Severity**: Categorical severity assessment (low/medium/high/critical)
- **Ambiguity Risk**: Potential for misinterpretation
- **Exploitability**: Likelihood of malicious exploitation

### Performance Metrics

- **Discovery Rate**: Percentage of high-risk outputs identified
- **Attack Vector Coverage**: Diversity of attack types tested
- **Response Success Rate**: Model availability and response quality
- **Evaluation Efficiency**: Time and resource optimization

### Quality Metrics

- **False Positive/Negative Rates**: Accuracy of safety assessments
- **Precision/Recall**: Balance between safety and usability
- **Trend Analysis**: Performance changes over time

## Architecture Decisions

### Multi-Agent Design

- **Separation of Concerns**: Clear boundaries between adversarial generation and safety evaluation
- **Independence**: SafetyJudgeAgent has no access to red-teaming internals
- **Specialization**: Each agent is optimized for its specific task

### DSPy Integration

- **Programmatic Approach**: Structured reasoning over prompt engineering
- **Optimization**: Continuous improvement through DSPy's optimization framework
- **Reproducibility**: Consistent evaluation across multiple runs

### Modular Structure

- **Extensibility**: Easy addition of new agents and evaluation dimensions
- **Maintainability**: Clear separation of concerns and well-defined interfaces
- **Testing**: Unit-testable components with clear responsibilities

## Integration Options

### Model Registry

```python
from models.hf_interface import ModelInfo, model_interface

# Register a custom model (ModelInfo is assumed to be exported by the same module)
model_interface.available_models["custom/model"] = ModelInfo(
    model_id="custom/model",
    name="Custom Model",
    description="Organization-specific model",
    category="Internal",
    requires_token=True,
    is_local=True
)
```

### Custom Safety Dimensions

```python
from agents.safety_judge import SafetyJudgeAgent


class CustomSafetyJudge(SafetyJudgeAgent):
    def __init__(self):
        super().__init__()
        # Add organization-specific dimensions alongside the built-in ones
        self.safety_dimensions.extend([
            "custom_compliance",
            "business_risk"
        ])
```

### Evaluation Pipelines

```python
from orchestration.loop import evaluation_loop

# Batch evaluation across multiple models, reusing the `config` object
# from the Advanced Configuration example above
models = ["model1", "model2", "model3"]
results = []

for model_id in models:
    config.target_model_id = model_id
    report = evaluation_loop.run_evaluation(config)
    results.append(report)
```

## Compliance and Standards

The platform supports compliance with:

- **NIST AI Risk Management Framework**: Structured risk assessment and monitoring
- **EU AI Act**: Safety testing and documentation requirements
- **ISO/IEC 23894**: AI risk management guidelines
- **Internal Governance**: Customizable evaluation criteria and reporting

## Contributing

This is an internal safety experimentation platform. Contributions should focus on:

- Enhanced agent capabilities
- New safety evaluation dimensions
- Improved optimization strategies
- Additional model integrations
- Performance and scalability improvements

## Security Considerations

- **Token Management**: Never hardcode API tokens; use environment variables
- **Model Access**: Controlled access to models through Hugging Face authentication
- **Data Privacy**: No sensitive data storage; evaluation results are temporary
- **Audit Trail**: Comprehensive logging of all evaluation activities
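As a minimal illustration of the token-management point above, authentication can be reduced to a read-from-environment helper that fails fast when the variable is missing. The function name is illustrative and not part of the platform's API.

```python
import os

from huggingface_hub import login


def authenticate_from_env() -> str:
    """Read the Hugging Face token from the environment instead of hardcoding it."""
    token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        raise RuntimeError(
            "HUGGINGFACEHUB_API_TOKEN is not set; export it before running evaluations."
        )
    login(token=token)  # authenticate this process with the Hugging Face Hub
    return token
```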
## License

MIT License - see the LICENSE file for details.

## Support

For issues and questions:

1. Check the evaluation logs for detailed error information
2. Verify Hugging Face token configuration
3. Ensure model accessibility and availability
4. Review system resource requirements

---

**Note**: This platform is designed for safety research and evaluation purposes. All evaluations should be conducted in accordance with organizational policies and applicable regulations.