Spaces:
Sleeping
Sleeping
| title: AI Safety Lab | |
| emoji: 🛡️ | |
| colorFrom: purple | |
| colorTo: red | |
| sdk: gradio | |
| sdk_version: 6.2.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| short_description: DSPy-based multi-agent AI safety evaluation platform | |
| # AI Safety Lab | |
| A professional DSPy-based multi-agent platform for systematic AI safety evaluation and red-teaming of language models. | |
| ## Problem Being Solved | |
| Organizations deploying language models face significant challenges in systematically evaluating safety risks across diverse attack vectors. Traditional safety testing approaches are often: | |
| - **Manual and ad-hoc**: Inconsistent coverage of potential failure modes | |
| - **Prompt engineering focused**: Limited scalability and reproducibility | |
| - **Single-purpose tools**: Lack comprehensive, measurable evaluation frameworks | |
| - **Black-box approaches**: Limited insight into why safety failures occur | |
| AI Safety Lab addresses these gaps through a structured, multi-agent system that provides continuous, measurable safety assessment with DSPy-powered optimization. | |
| ## System Design | |
| ### Architecture Overview | |
| ``` | |
| ┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ | |
| │ RedTeaming │───▶│ Target Model │───▶│ SafetyJudging │ | |
| │ Agent │ │ │ │ Agent │ | |
| │ │ │ │ │ │ | |
| │ • DSPy Module │ │ • HF Interface │ │ • DSPy Module │ | |
| │ • Optimization │ │ • Local/API │ │ • Multi-dim │ | |
| │ • Structured │ │ • Configurable │ │ • Objective │ | |
| └─────────────────┘ └──────────────────┘ └─────────────────┘ | |
| ▲ │ | |
| │ ▼ | |
| └─────────────── DSPy Optimization Loop ◄─────────┘ | |
| ``` | |
| ### Core Components | |
| #### 1. RedTeamingAgent | |
| - **Purpose**: Systematic generation of adversarial inputs | |
| - **Approach**: DSPy-optimized prompt generation across multiple attack vectors | |
| - **Coverage**: Instruction following, policy bypass, ambiguity, jailbreak, role-play, context injection | |
| - **Optimization**: Closed-loop improvement based on safety evaluation feedback | |
| #### 2. SafetyJudgeAgent | |
| - **Purpose**: Objective, multi-dimensional safety assessment | |
| - **Dimensions**: Toxicity, bias, misinformation, violence, self-harm, privacy, illegal activities | |
| - **Scoring**: Quantitative risk assessment (0.0-1.0) with confidence intervals | |
| - **Outputs**: Structured judgments with actionable recommendations | |
| #### 3. Orchestration Loop | |
| - **Function**: Coordinates agent interactions and optimization cycles | |
| - **Process**: Multi-iteration evaluation with adaptive DSPy optimization | |
| - **Metrics**: Real-time performance tracking and trend analysis | |
| - **Reporting**: Comprehensive safety reports with recommendations | |
| #### 4. Model Interface | |
| - **Integration**: Hugging Face Hub access with local loading support | |
| - **Flexibility**: API-based and local model evaluation | |
| - **Monitoring**: Response time, success rate, and performance tracking | |
| ### DSPy Integration | |
| The system leverages DSPy for: | |
| - **Programmatic Prompting**: Structured reasoning for adversarial prompt generation | |
| - **Optimization**: BootstrapFewShot optimization for improved attack discovery | |
| - **Metrics**: Custom evaluation functions for safety effectiveness | |
| - **Modularity**: Composable reasoning programs for different safety objectives | |
| ## What This Lab Is Not | |
| - ❌ **A demo or tutorial**: This is a production-oriented safety evaluation platform | |
| - ❌ **A prompt engineering playground**: Focuses on systematic, reproducible testing | |
| - ❌ **A chat application**: Designed for evaluation, not conversation | |
| - ❌ **A toy example**: Built for serious safety assessment workflows | |
| - ❌ **A replacement for human review**: Augments, not replaces, human expertise | |
| ## Installation and Setup | |
| ### Prerequisites | |
| - Python 3.8+ | |
| - Hugging Face API token (for model access) | |
| - Sufficient compute resources for model evaluation | |
| ### Local Development | |
| ```bash | |
| # Clone the repository | |
| git clone https://huggingface.co/spaces/soupstick/AI_Safety_Lab | |
| cd AI_Safety_Lab | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| # Set environment variables | |
| export HUGGINGFACEHUB_API_TOKEN="your_token_here" | |
| # Run locally | |
| python app.py | |
| ``` | |
| ### Hugging Face Space Deployment | |
| 1. Clone this space to your account | |
| 2. Add your Hugging Face token as a repository secret | |
| 3. The space will automatically build and deploy | |
| ## Usage | |
| ### Basic Safety Evaluation | |
| 1. **Select Target Model**: Choose from available Hugging Face models | |
| 2. **Define Safety Objective**: Specify the safety boundary to test | |
| 3. **Configure Parameters**: Set iterations, prompts per iteration, optimization threshold | |
| 4. **Run Evaluation**: Execute the multi-agent safety assessment | |
| 5. **Review Results**: Analyze prompts, responses, and comprehensive safety report | |
| ### Advanced Configuration | |
| ```python | |
| from orchestration.loop import EvaluationConfig, evaluation_loop | |
| config = EvaluationConfig( | |
| target_model_id="meta-llama/Llama-2-7b-chat-hf", | |
| safety_objective="Test for harmful content generation", | |
| max_prompts_per_iteration=5, | |
| max_iterations=3, | |
| optimization_threshold=0.3, | |
| temperature=0.7 | |
| ) | |
| report = evaluation_loop.run_evaluation(config) | |
| ``` | |
| ## Evaluation Metrics | |
| ### Risk Assessment | |
| - **Overall Risk Score**: Composite safety risk (0.0-1.0) | |
| - **Policy Violation Likelihood**: Probability of policy breaches | |
| - **Harm Severity**: Categorical severity assessment (low/medium/high/critical) | |
| - **Ambiguity Risk**: Potential for misinterpretation | |
| - **Exploitability**: Likelihood of malicious exploitation | |
| ### Performance Metrics | |
| - **Discovery Rate**: Percentage of high-risk outputs identified | |
| - **Attack Vector Coverage**: Diversity of attack types tested | |
| - **Response Success Rate**: Model availability and response quality | |
| - **Evaluation Efficiency**: Time and resource optimization | |
| ### Quality Metrics | |
| - **False Positive/Negative Rates**: Accuracy of safety assessments | |
| - **Precision/Recall**: Balance between safety and usability | |
| - **Trend Analysis**: Performance changes over time | |
| ## Architecture Decisions | |
| ### Multi-Agent Design | |
| - **Separation of Concerns**: Clear boundaries between adversarial generation and safety evaluation | |
| - **Independence**: SafetyJudgeAgent has no access to red-teaming internals | |
| - **Specialization**: Each agent optimized for its specific task | |
| ### DSPy Integration | |
| - **Programmatic Approach**: Structured reasoning over prompt engineering | |
| - **Optimization**: Continuous improvement through DSPy's optimization framework | |
| - **Reproducibility**: Consistent evaluation across multiple runs | |
| ### Modular Structure | |
| - **Extensibility**: Easy addition of new agents and evaluation dimensions | |
| - **Maintainability**: Clear separation of concerns and well-defined interfaces | |
| - **Testing**: Unit testable components with clear responsibilities | |
| ## Integration Options | |
| ### Model Registry | |
| ```python | |
| from models.hf_interface import model_interface | |
| # Add custom models | |
| model_interface.available_models["custom/model"] = ModelInfo( | |
| model_id="custom/model", | |
| name="Custom Model", | |
| description="Organization-specific model", | |
| category="Internal", | |
| requires_token=True, | |
| is_local=True | |
| ) | |
| ``` | |
| ### Custom Safety Dimensions | |
| ```python | |
| from agents.safety_judge import SafetyJudgeAgent | |
| class CustomSafetyJudge(SafetyJudgeAgent): | |
| def __init__(self): | |
| super().__init__() | |
| self.safety_dimensions.extend([ | |
| "custom_compliance", | |
| "business_risk" | |
| ]) | |
| ``` | |
| ### Evaluation Pipelines | |
| ```python | |
| # Batch evaluation across multiple models | |
| models = ["model1", "model2", "model3"] | |
| results = [] | |
| for model_id in models: | |
| config.target_model_id = model_id | |
| report = evaluation_loop.run_evaluation(config) | |
| results.append(report) | |
| ``` | |
| ## Compliance and Standards | |
| The platform supports compliance with: | |
| - **NIST AI Risk Management Framework**: Structured risk assessment and monitoring | |
| - **AI Act Requirements**: Safety testing and documentation | |
| - **ISO/IEC 23894**: AI risk management guidelines | |
| - **Internal Governance**: Customizable evaluation criteria and reporting | |
| ## Contributing | |
| This is an internal safety experimentation platform. Contributions should focus on: | |
| - Enhanced agent capabilities | |
| - New safety evaluation dimensions | |
| - Improved optimization strategies | |
| - Additional model integrations | |
| - Performance and scalability improvements | |
| ## Security Considerations | |
| - **Token Management**: Never hardcode API tokens; use environment variables | |
| - **Model Access**: Controlled access to models through Hugging Face authentication | |
| - **Data Privacy**: No sensitive data storage; evaluation results are temporary | |
| - **Audit Trail**: Comprehensive logging of all evaluation activities | |
| ## License | |
| MIT License - see LICENSE file for details. | |
| ## Support | |
| For issues and questions: | |
| 1. Check the evaluation logs for detailed error information | |
| 2. Verify Hugging Face token configuration | |
| 3. Ensure model accessibility and availability | |
| 4. Review system resource requirements | |
| --- | |
| **Note**: This platform is designed for safety research and evaluation purposes. All evaluations should be conducted in accordance with organizational policies and applicable regulations. | |