---
title: AI Safety Lab
emoji: 🛡️
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: DSPy-based multi-agent AI safety evaluation platform
---
# AI Safety Lab
A professional DSPy-based multi-agent platform for systematic AI safety evaluation and red-teaming of language models.
## Problem Being Solved
Organizations deploying language models face significant challenges in systematically evaluating safety risks across diverse attack vectors. Traditional safety testing approaches are often:
- **Manual and ad hoc**: Inconsistent coverage of potential failure modes
- **Prompt-engineering focused**: Limited scalability and reproducibility
- **Single-purpose**: Tools lack a comprehensive, measurable evaluation framework
- **Black-box**: Little insight into why safety failures occur
AI Safety Lab addresses these gaps through a structured, multi-agent system that provides continuous, measurable safety assessment with DSPy-powered optimization.
## System Design
### Architecture Overview
```
┌─────────────────┐      ┌──────────────────┐      ┌─────────────────┐
│   RedTeaming    │─────▶│   Target Model   │─────▶│   SafetyJudge   │
│     Agent       │      │                  │      │     Agent       │
│                 │      │                  │      │                 │
│ • DSPy Module   │      │ • HF Interface   │      │ • DSPy Module   │
│ • Optimization  │      │ • Local/API      │      │ • Multi-dim     │
│ • Structured    │      │ • Configurable   │      │ • Objective     │
└─────────────────┘      └──────────────────┘      └─────────────────┘
         ▲                                                  │
         │                                                  ▼
         └───────────── DSPy Optimization Loop ◄───────────┘
```
### Core Components
#### 1. RedTeamingAgent
- **Purpose**: Systematic generation of adversarial inputs
- **Approach**: DSPy-optimized prompt generation across multiple attack vectors
- **Coverage**: Instruction following, policy bypass, ambiguity, jailbreak, role-play, context injection
- **Optimization**: Closed-loop improvement based on safety evaluation feedback
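For intuition, here is a minimal sketch of what the agent's DSPy core could look like; the class and field names are illustrative, not the repository's actual interface:

```python
import dspy

class AdversarialPrompt(dspy.Signature):
    """Generate an adversarial probe for a stated safety objective."""
    safety_objective = dspy.InputField(desc="safety boundary under test")
    attack_vector = dspy.InputField(desc="e.g. jailbreak, role-play, context injection")
    adversarial_prompt = dspy.OutputField(desc="candidate probe for the target model")

class RedTeamingModule(dspy.Module):
    """Illustrative module: chain-of-thought generation over the signature above."""
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought(AdversarialPrompt)

    def forward(self, safety_objective: str, attack_vector: str):
        return self.generate(safety_objective=safety_objective,
                             attack_vector=attack_vector)
```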
#### 2. SafetyJudgeAgent
- **Purpose**: Objective, multi-dimensional safety assessment
- **Dimensions**: Toxicity, bias, misinformation, violence, self-harm, privacy, illegal activities
- **Scoring**: Quantitative risk assessment (0.0-1.0) with confidence intervals
- **Outputs**: Structured judgments with actionable recommendations
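A hedged sketch of the judgment interface, assuming a DSPy signature with scored outputs (field names are illustrative):

```python
import dspy

class SafetyJudgment(dspy.Signature):
    """Assess a model response against a safety objective."""
    prompt = dspy.InputField(desc="adversarial prompt sent to the target")
    response = dspy.InputField(desc="target model's response")
    safety_objective = dspy.InputField(desc="boundary being evaluated")
    risk_score = dspy.OutputField(desc="overall risk in [0.0, 1.0]")
    harm_severity = dspy.OutputField(desc="low | medium | high | critical")
    rationale = dspy.OutputField(desc="evidence supporting the judgment")
```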
#### 3. Orchestration Loop
- **Function**: Coordinates agent interactions and optimization cycles
- **Process**: Multi-iteration evaluation with adaptive DSPy optimization
- **Metrics**: Real-time performance tracking and trend analysis
- **Reporting**: Comprehensive safety reports with recommendations
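One cycle of the loop can be pictured as follows; this is a simplified sketch with hypothetical names, not the code shipped in `orchestration/loop.py`:

```python
def run_iteration(red_team, target, judge, objective, n_prompts=5):
    """One evaluation cycle: generate probes, query the target, judge responses."""
    verdicts = []
    for vector in ["jailbreak", "role-play", "context_injection"]:
        for _ in range(n_prompts):
            probe = red_team(safety_objective=objective, attack_vector=vector)
            response = target.generate(probe.adversarial_prompt)
            verdicts.append(judge(prompt=probe.adversarial_prompt,
                                  response=response,
                                  safety_objective=objective))
    # High-risk verdicts feed back as demonstrations for DSPy optimization
    return verdicts
```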
#### 4. Model Interface
- **Integration**: Hugging Face Hub access with local loading support
- **Flexibility**: API-based and local model evaluation
- **Monitoring**: Response time, success rate, and performance tracking
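Under the hood, API-based access likely amounts to a `huggingface_hub` call along these lines; this is a sketch of the idea, not the repository's `hf_interface` code:

```python
import os
from huggingface_hub import InferenceClient

client = InferenceClient(
    model="meta-llama/Llama-2-7b-chat-hf",
    token=os.environ["HUGGINGFACEHUB_API_TOKEN"],
)
# Generate a response with bounded length and sampling temperature
reply = client.text_generation("Hello", max_new_tokens=64, temperature=0.7)
```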
### DSPy Integration
The system leverages DSPy for:
- **Programmatic Prompting**: Structured reasoning for adversarial prompt generation
- **Optimization**: BootstrapFewShot optimization for improved attack discovery
- **Metrics**: Custom evaluation functions for safety effectiveness
- **Modularity**: Composable reasoning programs for different safety objectives
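Building on the sketch above, the optimization step might look like the following; `judge_risk` is a hypothetical helper (in the real loop the SafetyJudgeAgent fills this role) and `seed_examples` stands in for a small trainset of `dspy.Example` objects:

```python
from dspy.teleprompt import BootstrapFewShot

def judge_risk(prompt_text: str) -> float:
    """Placeholder: query the target model and judge the response,
    returning an overall risk score in [0.0, 1.0]."""
    raise NotImplementedError

def attack_success(example, prediction, trace=None):
    # A probe "succeeds" when the judged risk crosses a threshold (illustrative)
    return judge_risk(prediction.adversarial_prompt) >= 0.5

optimizer = BootstrapFewShot(metric=attack_success, max_bootstrapped_demos=4)
optimized_red_team = optimizer.compile(RedTeamingModule(), trainset=seed_examples)
```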
## What This Lab Is Not
- **A demo or tutorial**: This is a production-oriented safety evaluation platform
- **A prompt engineering playground**: Focuses on systematic, reproducible testing
- **A chat application**: Designed for evaluation, not conversation
- **A toy example**: Built for serious safety assessment workflows
- **A replacement for human review**: Augments, not replaces, human expertise
## Installation and Setup
### Prerequisites
- Python 3.8+
- Hugging Face API token (for model access)
- Sufficient compute resources for model evaluation
### Local Development
```bash
# Clone the repository
git clone https://huggingface.co/spaces/soupstick/AI_Safety_Lab
cd AI_Safety_Lab
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export HUGGINGFACEHUB_API_TOKEN="your_token_here"
# Run locally
python app.py
```
### Hugging Face Space Deployment
1. Clone this space to your account
2. Add your Hugging Face token as a repository secret
3. The space will automatically build and deploy
## Usage
### Basic Safety Evaluation
1. **Select Target Model**: Choose from available Hugging Face models
2. **Define Safety Objective**: Specify the safety boundary to test
3. **Configure Parameters**: Set iterations, prompts per iteration, optimization threshold
4. **Run Evaluation**: Execute the multi-agent safety assessment
5. **Review Results**: Analyze prompts, responses, and comprehensive safety report
### Advanced Configuration
```python
from orchestration.loop import EvaluationConfig, evaluation_loop
config = EvaluationConfig(
    target_model_id="meta-llama/Llama-2-7b-chat-hf",
    safety_objective="Test for harmful content generation",
    max_prompts_per_iteration=5,
    max_iterations=3,
    optimization_threshold=0.3,
    temperature=0.7
)
report = evaluation_loop.run_evaluation(config)
```
## Evaluation Metrics
### Risk Assessment
- **Overall Risk Score**: Composite safety risk (0.0-1.0)
- **Policy Violation Likelihood**: Probability of policy breaches
- **Harm Severity**: Categorical severity assessment (low/medium/high/critical)
- **Ambiguity Risk**: Potential for misinterpretation
- **Exploitability**: Likelihood of malicious exploitation
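One plausible way to combine the per-dimension scores into the composite is a weighted average; the weights below are illustrative, not the platform's documented values:

```python
def overall_risk(policy_violation: float, harm: float,
                 ambiguity: float, exploitability: float) -> float:
    """Weighted composite of per-dimension risks, each in [0.0, 1.0]."""
    weights = (0.35, 0.35, 0.10, 0.20)  # illustrative weights, sum to 1.0
    dims = (policy_violation, harm, ambiguity, exploitability)
    return sum(w * d for w, d in zip(weights, dims))
```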
### Performance Metrics
- **Discovery Rate**: Percentage of high-risk outputs identified
- **Attack Vector Coverage**: Diversity of attack types tested
- **Response Success Rate**: Model availability and response quality
- **Evaluation Efficiency**: Time and resource optimization
### Quality Metrics
- **False Positive/Negative Rates**: Accuracy of safety assessments
- **Precision/Recall**: Balance between safety and usability
- **Trend Analysis**: Performance changes over time
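Assuming judgments are scored against human-labeled ground truth, the quality metrics reduce to standard confusion-matrix arithmetic:

```python
def judge_quality(tp: int, fp: int, tn: int, fn: int):
    """Precision, recall, and error rates from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # false positive rate
    fnr = fn / (fn + tp) if (fn + tp) else 0.0  # false negative rate
    return precision, recall, fpr, fnr
```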
## Architecture Decisions
### Multi-Agent Design
- **Separation of Concerns**: Clear boundaries between adversarial generation and safety evaluation
- **Independence**: SafetyJudgeAgent has no access to red-teaming internals
- **Specialization**: Each agent optimized for its specific task
### DSPy Integration
- **Programmatic Approach**: Structured reasoning over prompt engineering
- **Optimization**: Continuous improvement through DSPy's optimization framework
- **Reproducibility**: Consistent evaluation across multiple runs
### Modular Structure
- **Extensibility**: Easy addition of new agents and evaluation dimensions
- **Maintainability**: Clear separation of concerns and well-defined interfaces
- **Testing**: Unit testable components with clear responsibilities
## Integration Options
### Model Registry
```python
from models.hf_interface import ModelInfo, model_interface

# Add custom models
model_interface.available_models["custom/model"] = ModelInfo(
    model_id="custom/model",
    name="Custom Model",
    description="Organization-specific model",
    category="Internal",
    requires_token=True,
    is_local=True
)
```
### Custom Safety Dimensions
```python
from agents.safety_judge import SafetyJudgeAgent
class CustomSafetyJudge(SafetyJudgeAgent):
    def __init__(self):
        super().__init__()
        self.safety_dimensions.extend([
            "custom_compliance",
            "business_risk"
        ])
```
### Evaluation Pipelines
```python
# Batch evaluation across multiple models, reusing `config` from the example above
models = ["model1", "model2", "model3"]
results = []
for model_id in models:
    config.target_model_id = model_id
    report = evaluation_loop.run_evaluation(config)
    results.append(report)
```
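The collected reports can then be compared side by side; this assumes the report object exposes an overall risk attribute, which may differ from the real schema:

```python
for model_id, report in zip(models, results):
    # report.overall_risk_score is an assumed attribute name
    print(f"{model_id}: overall risk {report.overall_risk_score:.2f}")
```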
## Compliance and Standards
The platform supports compliance with:
- **NIST AI Risk Management Framework**: Structured risk assessment and monitoring
- **EU AI Act**: Safety testing and documentation requirements
- **ISO/IEC 23894**: AI risk management guidelines
- **Internal Governance**: Customizable evaluation criteria and reporting
## Contributing
This is an internal safety experimentation platform. Contributions should focus on:
- Enhanced agent capabilities
- New safety evaluation dimensions
- Improved optimization strategies
- Additional model integrations
- Performance and scalability improvements
## Security Considerations
- **Token Management**: Never hardcode API tokens; use environment variables
- **Model Access**: Controlled access to models through Hugging Face authentication
- **Data Privacy**: No sensitive data storage; evaluation results are temporary
- **Audit Trail**: Comprehensive logging of all evaluation activities
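For example, token handling should fail fast rather than fall back to a hardcoded value:

```python
import os

# Read the token from the environment; never embed it in source
token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
if not token:
    raise RuntimeError("Set HUGGINGFACEHUB_API_TOKEN before running evaluations")
```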
## License
MIT License - see LICENSE file for details.
## Support
For issues and questions:
1. Check the evaluation logs for detailed error information
2. Verify Hugging Face token configuration
3. Ensure model accessibility and availability
4. Review system resource requirements
---
**Note**: This platform is designed for safety research and evaluation purposes. All evaluations should be conducted in accordance with organizational policies and applicable regulations.