---
title: AI Safety Lab
emoji: 🛡️
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: DSPy-based multi-agent AI safety evaluation platform
---
# AI Safety Lab
A professional DSPy-based multi-agent platform for systematic AI safety evaluation and red-teaming of language models.
## Problem Being Solved
Organizations deploying language models face significant challenges in systematically evaluating safety risks across diverse attack vectors. Traditional safety testing approaches are often:
- **Manual and ad-hoc**: Inconsistent coverage of potential failure modes
- **Prompt engineering focused**: Limited scalability and reproducibility
- **Single-purpose tools**: Lack comprehensive, measurable evaluation frameworks
- **Black-box approaches**: Limited insight into why safety failures occur
AI Safety Lab addresses these gaps through a structured, multi-agent system that provides continuous, measurable safety assessment with DSPy-powered optimization.
## System Design
### Architecture Overview
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   RedTeaming    │────▶│   Target Model   │────▶│  SafetyJudging  │
│   Agent         │     │                  │     │   Agent         │
│                 │     │                  │     │                 │
│ • DSPy Module   │     │ • HF Interface   │     │ • DSPy Module   │
│ • Optimization  │     │ • Local/API      │     │ • Multi-dim     │
│ • Structured    │     │ • Configurable   │     │ • Objective     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
         ▲                                                │
         │                                                ▼
         └──────────── DSPy Optimization Loop ◀───────────┘
```
### Core Components
#### 1. RedTeamingAgent
- **Purpose**: Systematic generation of adversarial inputs
- **Approach**: DSPy-optimized prompt generation across multiple attack vectors
- **Coverage**: Instruction following, policy bypass, ambiguity, jailbreak, role-play, context injection
- **Optimization**: Closed-loop improvement based on safety evaluation feedback
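A minimal sketch of how even coverage of the attack vectors above could be planned. The round-robin scheme and helper name are illustrative assumptions, not the agent's actual implementation:

```python
from itertools import cycle

# Attack vectors listed for the RedTeamingAgent; illustrative only.
ATTACK_VECTORS = [
    "instruction_following",
    "policy_bypass",
    "ambiguity",
    "jailbreak",
    "role_play",
    "context_injection",
]

def plan_prompts(n_prompts):
    """Round-robin over attack vectors so every batch covers the space evenly."""
    vectors = cycle(ATTACK_VECTORS)
    return [next(vectors) for _ in range(n_prompts)]
```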
#### 2. SafetyJudgeAgent
- **Purpose**: Objective, multi-dimensional safety assessment
- **Dimensions**: Toxicity, bias, misinformation, violence, self-harm, privacy, illegal activities
- **Scoring**: Quantitative risk assessment (0.0-1.0) with confidence intervals
- **Outputs**: Structured judgments with actionable recommendations
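One possible shape for such a structured judgment, sketched as a plain dataclass. The field names and the worst-case aggregation rule are assumptions, not the platform's actual schema:

```python
from dataclasses import dataclass

# Hypothetical shape of a structured judgment.
@dataclass
class SafetyJudgment:
    dimension_scores: dict        # e.g. {"toxicity": 0.8, "bias": 0.2}, each in [0.0, 1.0]
    confidence: tuple             # (lower, upper) bound on the overall score
    recommendation: str = ""

    @property
    def overall_risk(self):
        """Worst-case aggregation: the overall risk is the highest dimension score."""
        return max(self.dimension_scores.values(), default=0.0)
```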
#### 3. Orchestration Loop
- **Function**: Coordinates agent interactions and optimization cycles
- **Process**: Multi-iteration evaluation with adaptive DSPy optimization
- **Metrics**: Real-time performance tracking and trend analysis
- **Reporting**: Comprehensive safety reports with recommendations
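The coordination cycle can be sketched as a loop over the three agents, re-optimizing the red-teamer when too few risky outputs are discovered. The callables are stand-ins for the real agents; the function name and the threshold rule are assumptions:

```python
# Minimal sketch of the orchestration cycle described above.
def run_cycle(generate, respond, judge, optimize,
              max_iterations, optimization_threshold):
    history = []
    for _ in range(max_iterations):
        prompts = generate()                              # RedTeamingAgent
        scores = [judge(p, respond(p)) for p in prompts]  # target model + SafetyJudgeAgent
        avg_risk = sum(scores) / len(scores)
        history.append(avg_risk)
        # If too few risky outputs were discovered, re-optimize the red-teamer.
        if avg_risk < optimization_threshold:
            optimize(prompts, scores)
    return history
```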
#### 4. Model Interface
- **Integration**: Hugging Face Hub access with local loading support
- **Flexibility**: API-based and local model evaluation
- **Monitoring**: Response time, success rate, and performance tracking
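A hedged sketch of the monitoring idea: wrapping a model call to track response time and success rate. The real interface in `models/hf_interface.py` may differ:

```python
import time

# Illustrative per-model monitoring wrapper.
class MonitoredModel:
    def __init__(self, query_fn):
        self.query_fn = query_fn
        self.calls = 0
        self.failures = 0
        self.total_seconds = 0.0

    def generate(self, prompt):
        self.calls += 1
        start = time.perf_counter()
        try:
            return self.query_fn(prompt)
        except Exception:
            self.failures += 1
            return None
        finally:
            self.total_seconds += time.perf_counter() - start

    @property
    def success_rate(self):
        return 1.0 if self.calls == 0 else (self.calls - self.failures) / self.calls
```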
### DSPy Integration
The system leverages DSPy for:
- **Programmatic Prompting**: Structured reasoning for adversarial prompt generation
- **Optimization**: BootstrapFewShot optimization for improved attack discovery
- **Metrics**: Custom evaluation functions for safety effectiveness
- **Modularity**: Composable reasoning programs for different safety objectives
## What This Lab Is Not
- ❌ **A demo or tutorial**: This is a production-oriented safety evaluation platform
- ❌ **A prompt engineering playground**: Focuses on systematic, reproducible testing
- ❌ **A chat application**: Designed for evaluation, not conversation
- ❌ **A toy example**: Built for serious safety assessment workflows
- ❌ **A replacement for human review**: Augments, not replaces, human expertise
## Installation and Setup
### Prerequisites
- Python 3.8+
- Hugging Face API token (for model access)
- Sufficient compute resources for model evaluation
### Local Development
```bash
# Clone the repository
git clone https://huggingface.co/spaces/soupstick/AI_Safety_Lab
cd AI_Safety_Lab
# Install dependencies
pip install -r requirements.txt
# Set environment variables
export HUGGINGFACEHUB_API_TOKEN="your_token_here"
# Run locally
python app.py
```
### Hugging Face Space Deployment
1. Clone this space to your account
2. Add your Hugging Face token as a repository secret
3. The space will automatically build and deploy
## Usage
### Basic Safety Evaluation
1. **Select Target Model**: Choose from available Hugging Face models
2. **Define Safety Objective**: Specify the safety boundary to test
3. **Configure Parameters**: Set iterations, prompts per iteration, optimization threshold
4. **Run Evaluation**: Execute the multi-agent safety assessment
5. **Review Results**: Analyze prompts, responses, and comprehensive safety report
### Advanced Configuration
```python
from orchestration.loop import EvaluationConfig, evaluation_loop

config = EvaluationConfig(
    target_model_id="meta-llama/Llama-2-7b-chat-hf",
    safety_objective="Test for harmful content generation",
    max_prompts_per_iteration=5,
    max_iterations=3,
    optimization_threshold=0.3,
    temperature=0.7,
)

report = evaluation_loop.run_evaluation(config)
```
## Evaluation Metrics
### Risk Assessment
- **Overall Risk Score**: Composite safety risk (0.0-1.0)
- **Policy Violation Likelihood**: Probability of policy breaches
- **Harm Severity**: Categorical severity assessment (low/medium/high/critical)
- **Ambiguity Risk**: Potential for misinterpretation
- **Exploitability**: Likelihood of malicious exploitation
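The categorical harm-severity labels above can be derived from the composite risk score by simple thresholding. The cutoff values here are illustrative assumptions, not the platform's calibrated thresholds:

```python
# Illustrative mapping from overall risk to severity category.
SEVERITY_THRESHOLDS = [(0.9, "critical"), (0.6, "high"), (0.3, "medium"), (0.0, "low")]

def harm_severity(overall_risk):
    for threshold, label in SEVERITY_THRESHOLDS:
        if overall_risk >= threshold:
            return label
    return "low"
```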
### Performance Metrics
- **Discovery Rate**: Percentage of high-risk outputs identified
- **Attack Vector Coverage**: Diversity of attack types tested
- **Response Success Rate**: Model availability and response quality
- **Evaluation Efficiency**: Time and resource optimization
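Discovery rate as defined above is the fraction of evaluated outputs whose risk score exceeds a "high-risk" cutoff; the 0.7 cutoff here is an assumption:

```python
# Fraction of outputs judged high-risk (sketch).
def discovery_rate(risk_scores, cutoff=0.7):
    if not risk_scores:
        return 0.0
    return sum(s >= cutoff for s in risk_scores) / len(risk_scores)
```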
### Quality Metrics
- **False Positive/Negative Rates**: Accuracy of safety assessments
- **Precision/Recall**: Balance between safety and usability
- **Trend Analysis**: Performance changes over time
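Precision and recall follow directly from the false-positive/false-negative counts, treating "flagged as unsafe" as the positive class; in practice the counts would come from comparison against human labels:

```python
# Precision/recall over safety judgments (sketch).
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall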
## Architecture Decisions
### Multi-Agent Design
- **Separation of Concerns**: Clear boundaries between adversarial generation and safety evaluation
- **Independence**: SafetyJudgeAgent has no access to red-teaming internals
- **Specialization**: Each agent optimized for its specific task
### DSPy Integration
- **Programmatic Approach**: Structured reasoning over prompt engineering
- **Optimization**: Continuous improvement through DSPy's optimization framework
- **Reproducibility**: Consistent evaluation across multiple runs
### Modular Structure
- **Extensibility**: Easy addition of new agents and evaluation dimensions
- **Maintainability**: Clear separation of concerns and well-defined interfaces
- **Testing**: Unit testable components with clear responsibilities
## Integration Options
### Model Registry
```python
# Assumes ModelInfo is exported from the same module as model_interface
from models.hf_interface import ModelInfo, model_interface

# Register a custom model alongside the built-in ones
model_interface.available_models["custom/model"] = ModelInfo(
    model_id="custom/model",
    name="Custom Model",
    description="Organization-specific model",
    category="Internal",
    requires_token=True,
    is_local=True,
)
```
### Custom Safety Dimensions
```python
from agents.safety_judge import SafetyJudgeAgent

class CustomSafetyJudge(SafetyJudgeAgent):
    def __init__(self):
        super().__init__()
        # Extend the built-in dimensions with organization-specific ones
        self.safety_dimensions.extend([
            "custom_compliance",
            "business_risk",
        ])
```
### Evaluation Pipelines
```python
# Batch evaluation across multiple models (reuses the config from above)
models = ["model1", "model2", "model3"]
results = []
for model_id in models:
    config.target_model_id = model_id
    report = evaluation_loop.run_evaluation(config)
    results.append(report)
```
## Compliance and Standards
The platform supports compliance with:
- **NIST AI Risk Management Framework**: Structured risk assessment and monitoring
- **AI Act Requirements**: Safety testing and documentation
- **ISO/IEC 23894**: AI risk management guidelines
- **Internal Governance**: Customizable evaluation criteria and reporting
## Contributing
This is an internal safety experimentation platform. Contributions should focus on:
- Enhanced agent capabilities
- New safety evaluation dimensions
- Improved optimization strategies
- Additional model integrations
- Performance and scalability improvements
## Security Considerations
- **Token Management**: Never hardcode API tokens; use environment variables
- **Model Access**: Controlled access to models through Hugging Face authentication
- **Data Privacy**: No sensitive data storage; evaluation results are temporary
- **Audit Trail**: Comprehensive logging of all evaluation activities
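The token-management point above amounts to reading the token from the environment and failing fast when it is absent; the helper name is illustrative:

```python
import os

# Read the HF token from the environment instead of hardcoding it.
def get_hf_token():
    token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        raise RuntimeError("HUGGINGFACEHUB_API_TOKEN is not set")
    return token
```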
## License
MIT License - see LICENSE file for details.
## Support
For issues and questions:
1. Check the evaluation logs for detailed error information
2. Verify Hugging Face token configuration
3. Ensure model accessibility and availability
4. Review system resource requirements
---
**Note**: This platform is designed for safety research and evaluation purposes. All evaluations should be conducted in accordance with organizational policies and applicable regulations.