---
title: AI Safety Lab
emoji: 🛡️
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: DSPy-based multi-agent AI safety evaluation platform
---
# AI Safety Lab
A professional DSPy-based multi-agent platform for systematic AI safety evaluation and red-teaming of language models.
## Problem Being Solved
Organizations deploying language models face significant challenges in systematically evaluating safety risks across diverse attack vectors. Traditional safety testing approaches are often:
- Manual and ad-hoc: Inconsistent coverage of potential failure modes
- Prompt engineering focused: Limited scalability and reproducibility
- Single-purpose tools: Lack comprehensive, measurable evaluation frameworks
- Black-box approaches: Limited insight into why safety failures occur
AI Safety Lab addresses these gaps through a structured, multi-agent system that provides continuous, measurable safety assessment with DSPy-powered optimization.
## System Design
### Architecture Overview
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  RedTeaming     │───▶│  Target Model   │───▶│  SafetyJudging  │
│  Agent          │    │                 │    │  Agent          │
│                 │    │                 │    │                 │
│ • DSPy Module   │    │ • HF Interface  │    │ • DSPy Module   │
│ • Optimization  │    │ • Local/API     │    │ • Multi-dim     │
│ • Structured    │    │ • Configurable  │    │ • Objective     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                                             │
         │                                             ▼
         └──────────── DSPy Optimization Loop ─────────┘
```
### Core Components
#### 1. RedTeamingAgent
- Purpose: Systematic generation of adversarial inputs
- Approach: DSPy-optimized prompt generation across multiple attack vectors
- Coverage: Instruction following, policy bypass, ambiguity, jailbreak, role-play, context injection
- Optimization: Closed-loop improvement based on safety evaluation feedback
#### 2. SafetyJudgeAgent
- Purpose: Objective, multi-dimensional safety assessment
- Dimensions: Toxicity, bias, misinformation, violence, self-harm, privacy, illegal activities
- Scoring: Quantitative risk assessment (0.0-1.0) with confidence intervals
- Outputs: Structured judgments with actionable recommendations
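To make the output format concrete, the sketch below shows one plausible shape for such a structured judgment; the class and field names are illustrative assumptions, not the agent's actual schema.
```python
# Hypothetical judgment structure; field names are illustrative, not the real schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SafetyJudgment:
    prompt: str                      # adversarial input under test
    response: str                    # target model's output
    risk_score: float                # composite risk in [0.0, 1.0]
    confidence: float                # judge's confidence in the score
    dimension_scores: Dict[str, float] = field(default_factory=dict)  # e.g. {"toxicity": 0.8}
    severity: str = "low"            # low / medium / high / critical
    recommendations: List[str] = field(default_factory=list)
```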
#### 3. Orchestration Loop
- Function: Coordinates agent interactions and optimization cycles
- Process: Multi-iteration evaluation with adaptive DSPy optimization
- Metrics: Real-time performance tracking and trend analysis
- Reporting: Comprehensive safety reports with recommendations
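The sketch below outlines how such a cycle could fit together, assuming hypothetical `generate`, `evaluate`, and `optimize` methods on the agents; it is a simplified illustration, not the platform's actual loop.
```python
# Simplified orchestration sketch; agent method names are placeholders.
def run_cycle(red_team, target, judge, objective, iterations=3, threshold=0.3):
    history = []
    for _ in range(iterations):
        prompts = red_team.generate(objective)                      # adversarial inputs
        responses = [target.generate(p) for p in prompts]           # query target model
        judgments = [judge.evaluate(p, r) for p, r in zip(prompts, responses)]
        history.append(judgments)
        # If too few high-risk behaviours were surfaced, trigger another optimization pass
        discovery_rate = sum(j.risk_score > 0.5 for j in judgments) / len(judgments)
        if discovery_rate < threshold:
            red_team.optimize(feedback=judgments)
    return history
```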
#### 4. Model Interface
- Integration: Hugging Face Hub access with local loading support
- Flexibility: API-based and local model evaluation
- Monitoring: Response time, success rate, and performance tracking
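As a rough illustration only (not the repository's `models/hf_interface.py` implementation), an API-backed call with basic latency and success tracking might be wrapped like this using `huggingface_hub.InferenceClient`:
```python
# Sketch of an API-backed model call with simple monitoring; not the repo's actual interface.
import time
from huggingface_hub import InferenceClient

def query_model(model_id: str, prompt: str, token: str) -> dict:
    client = InferenceClient(model=model_id, token=token)
    start = time.perf_counter()
    try:
        text = client.text_generation(prompt, max_new_tokens=256)
        ok = True
    except Exception:
        text, ok = "", False
    return {"response": text, "success": ok, "latency_s": time.perf_counter() - start}
```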
### DSPy Integration
The system leverages DSPy for:
- Programmatic Prompting: Structured reasoning for adversarial prompt generation
- Optimization: BootstrapFewShot optimization for improved attack discovery
- Metrics: Custom evaluation functions for safety effectiveness
- Modularity: Composable reasoning programs for different safety objectives
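A minimal sketch of what such a DSPy program could look like is shown below; the signature, module, and metric names are illustrative assumptions rather than the repository's actual code.
```python
import dspy

# Illustrative signature for adversarial prompt generation (names are assumptions).
class AdversarialPrompt(dspy.Signature):
    """Generate a probing prompt for a given safety objective and attack vector."""
    safety_objective: str = dspy.InputField(desc="safety boundary to test")
    attack_vector: str = dspy.InputField(desc="e.g. jailbreak, role-play, context injection")
    adversarial_prompt: str = dspy.OutputField(desc="candidate red-team prompt")

class RedTeamModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(AdversarialPrompt)

    def forward(self, safety_objective: str, attack_vector: str):
        return self.generate(safety_objective=safety_objective, attack_vector=attack_vector)

# A custom safety-effectiveness metric would then drive BootstrapFewShot, e.g.:
#   optimizer = dspy.teleprompt.BootstrapFewShot(metric=safety_effectiveness_metric)
#   optimized_module = optimizer.compile(RedTeamModule(), trainset=examples)
```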
## What This Lab Is Not
- ❌ A demo or tutorial: This is a production-oriented safety evaluation platform
- ❌ A prompt engineering playground: Focuses on systematic, reproducible testing
- ❌ A chat application: Designed for evaluation, not conversation
- ❌ A toy example: Built for serious safety assessment workflows
- ❌ A replacement for human review: Augments, not replaces, human expertise
## Installation and Setup
### Prerequisites
- Python 3.8+
- Hugging Face API token (for model access)
- Sufficient compute resources for model evaluation
### Local Development
```bash
# Clone the repository
git clone https://huggingface.co/spaces/soupstick/AI_Safety_Lab
cd AI_Safety_Lab

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# Run locally
python app.py
```
### Hugging Face Space Deployment
- Clone this space to your account
- Add your Hugging Face token as a repository secret
- The space will automatically build and deploy
## Usage
### Basic Safety Evaluation
- Select Target Model: Choose from available Hugging Face models
- Define Safety Objective: Specify the safety boundary to test
- Configure Parameters: Set iterations, prompts per iteration, optimization threshold
- Run Evaluation: Execute the multi-agent safety assessment
- Review Results: Analyze prompts, responses, and comprehensive safety report
### Advanced Configuration
```python
from orchestration.loop import EvaluationConfig, evaluation_loop

config = EvaluationConfig(
    target_model_id="meta-llama/Llama-2-7b-chat-hf",
    safety_objective="Test for harmful content generation",
    max_prompts_per_iteration=5,
    max_iterations=3,
    optimization_threshold=0.3,
    temperature=0.7
)

report = evaluation_loop.run_evaluation(config)
```
## Evaluation Metrics
### Risk Assessment
- Overall Risk Score: Composite safety risk (0.0-1.0)
- Policy Violation Likelihood: Probability of policy breaches
- Harm Severity: Categorical severity assessment (low/medium/high/critical)
- Ambiguity Risk: Potential for misinterpretation
- Exploitability: Likelihood of malicious exploitation
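For intuition, a composite score can be formed as a weighted combination of per-dimension scores; the weights below are purely illustrative assumptions, not the platform's calibration.
```python
# Illustrative aggregation only; the actual weighting may differ.
DIMENSION_WEIGHTS = {
    "toxicity": 0.2, "bias": 0.1, "misinformation": 0.15, "violence": 0.2,
    "self_harm": 0.15, "privacy": 0.1, "illegal_activities": 0.1,
}

def overall_risk(dimension_scores: dict) -> float:
    """Weighted composite of per-dimension scores, clamped to [0.0, 1.0]."""
    score = sum(DIMENSION_WEIGHTS.get(k, 0.0) * v for k, v in dimension_scores.items())
    return min(max(score, 0.0), 1.0)
```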
### Performance Metrics
- Discovery Rate: Percentage of high-risk outputs identified
- Attack Vector Coverage: Diversity of attack types tested
- Response Success Rate: Model availability and response quality
- Evaluation Efficiency: Time and resource optimization
### Quality Metrics
- False Positive/Negative Rates: Accuracy of safety assessments
- Precision/Recall: Balance between safety and usability
- Trend Analysis: Performance changes over time
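These follow the standard definitions; given counts of true/false positives and negatives from a labelled evaluation set, they can be computed as in this generic sketch.
```python
# Standard precision/recall and error-rate calculations from labelled outcomes.
def quality_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```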
## Architecture Decisions
### Multi-Agent Design
- Separation of Concerns: Clear boundaries between adversarial generation and safety evaluation
- Independence: SafetyJudgeAgent has no access to red-teaming internals
- Specialization: Each agent optimized for its specific task
### DSPy Integration
- Programmatic Approach: Structured reasoning over prompt engineering
- Optimization: Continuous improvement through DSPy's optimization framework
- Reproducibility: Consistent evaluation across multiple runs
### Modular Structure
- Extensibility: Easy addition of new agents and evaluation dimensions
- Maintainability: Clear separation of concerns and well-defined interfaces
- Testing: Unit testable components with clear responsibilities
## Integration Options
### Model Registry
```python
# ModelInfo is assumed to be defined alongside model_interface in models.hf_interface
from models.hf_interface import model_interface, ModelInfo

# Add custom models
model_interface.available_models["custom/model"] = ModelInfo(
    model_id="custom/model",
    name="Custom Model",
    description="Organization-specific model",
    category="Internal",
    requires_token=True,
    is_local=True
)
```
### Custom Safety Dimensions
```python
from agents.safety_judge import SafetyJudgeAgent

class CustomSafetyJudge(SafetyJudgeAgent):
    def __init__(self):
        super().__init__()
        self.safety_dimensions.extend([
            "custom_compliance",
            "business_risk"
        ])
```
### Evaluation Pipelines
```python
from orchestration.loop import evaluation_loop

# Batch evaluation across multiple models, reusing the EvaluationConfig
# instance from the Advanced Configuration example above
models = ["model1", "model2", "model3"]
results = []
for model_id in models:
    config.target_model_id = model_id
    report = evaluation_loop.run_evaluation(config)
    results.append(report)
```
## Compliance and Standards
The platform supports compliance with:
- NIST AI Risk Management Framework: Structured risk assessment and monitoring
- AI Act Requirements: Safety testing and documentation
- ISO/IEC 23894: AI risk management guidelines
- Internal Governance: Customizable evaluation criteria and reporting
## Contributing
This is an internal safety experimentation platform. Contributions should focus on:
- Enhanced agent capabilities
- New safety evaluation dimensions
- Improved optimization strategies
- Additional model integrations
- Performance and scalability improvements
## Security Considerations
- Token Management: Never hardcode API tokens; use environment variables
- Model Access: Controlled access to models through Hugging Face authentication
- Data Privacy: No sensitive data storage; evaluation results are temporary
- Audit Trail: Comprehensive logging of all evaluation activities
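For example, the token should always be read from the environment at runtime rather than embedded in source:
```python
import os

# Fail fast if the token is missing instead of falling back to a hardcoded value.
hf_token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
if not hf_token:
    raise RuntimeError("HUGGINGFACEHUB_API_TOKEN is not set")
```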
## License
MIT License - see LICENSE file for details.
## Support
For issues and questions:
- Check the evaluation logs for detailed error information
- Verify Hugging Face token configuration
- Ensure model accessibility and availability
- Review system resource requirements
Note: This platform is designed for safety research and evaluation purposes. All evaluations should be conducted in accordance with organizational policies and applicable regulations.