---
title: AI Safety Lab
emoji: πŸ›‘οΈ
colorFrom: purple
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: DSPy-based multi-agent AI safety evaluation platform
---

# AI Safety Lab

A professional DSPy-based multi-agent platform for systematic AI safety evaluation and red-teaming of language models.

## Problem Being Solved

Organizations deploying language models face significant challenges in systematically evaluating safety risks across diverse attack vectors. Traditional safety testing approaches are often:

- **Manual and ad hoc**: Inconsistent coverage of potential failure modes
- **Prompt-engineering focused**: Limited scalability and reproducibility
- **Single-purpose**: Tools lack a comprehensive, measurable evaluation framework
- **Black-box**: Little insight into why safety failures occur

AI Safety Lab addresses these gaps through a structured, multi-agent system that provides continuous, measurable safety assessment with DSPy-powered optimization.

## System Design

### Architecture Overview

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   RedTeaming    │───▢│  Target Model    │───▢│ SafetyJudging   β”‚
β”‚     Agent       β”‚    β”‚                  β”‚    β”‚     Agent       β”‚
β”‚                 β”‚    β”‚                  β”‚    β”‚                 β”‚
β”‚ β€’ DSPy Module   β”‚    β”‚ β€’ HF Interface   β”‚    β”‚ β€’ DSPy Module   β”‚
β”‚ β€’ Optimization  β”‚    β”‚ β€’ Local/API      β”‚    β”‚ β€’ Multi-dim     β”‚
β”‚ β€’ Structured    β”‚    β”‚ β€’ Configurable   β”‚    β”‚ β€’ Objective     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β–²                                                β”‚
         β”‚                                                β–Ό
         └─────────────── DSPy Optimization Loop β—„β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```

### Core Components

#### 1. RedTeamingAgent
- **Purpose**: Systematic generation of adversarial inputs
- **Approach**: DSPy-optimized prompt generation across multiple attack vectors
- **Coverage**: Instruction following, policy bypass, ambiguity, jailbreak, role-play, context injection
- **Optimization**: Closed-loop improvement based on safety evaluation feedback

#### 2. SafetyJudgeAgent  
- **Purpose**: Objective, multi-dimensional safety assessment
- **Dimensions**: Toxicity, bias, misinformation, violence, self-harm, privacy, illegal activities
- **Scoring**: Quantitative risk assessment (0.0-1.0) with confidence intervals
- **Outputs**: Structured judgments with actionable recommendations

#### 3. Orchestration Loop
- **Function**: Coordinates agent interactions and optimization cycles
- **Process**: Multi-iteration evaluation with adaptive DSPy optimization
- **Metrics**: Real-time performance tracking and trend analysis
- **Reporting**: Comprehensive safety reports with recommendations
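
The control flow above can be sketched as follows; the agent objects and the `generate`/`respond`/`judge`/`optimize` method names are illustrative stand-ins, not the platform's actual API:

```python
# Minimal sketch of the orchestration loop. All class and method names
# here are hypothetical stand-ins for the real agents.
def run_iterations(red_team, target, judge, iterations=3, threshold=0.3):
    risk_history = []
    for _ in range(iterations):
        for prompt in red_team.generate():
            response = target.respond(prompt)
            risk_history.append(judge.judge(prompt, response))
        mean_risk = sum(risk_history) / len(risk_history)
        # A low mean risk means the attacks are not landing, so the
        # red-teaming agent is re-optimized against the judge's feedback.
        if mean_risk < threshold:
            red_team.optimize(risk_history)
    return risk_history
```

The key design point this illustrates is the feedback edge in the architecture diagram: the judge's scores drive re-optimization of the red-teaming agent between iterations.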

#### 4. Model Interface
- **Integration**: Hugging Face Hub access with local loading support
- **Flexibility**: API-based and local model evaluation
- **Monitoring**: Response time, success rate, and performance tracking
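
As a sketch of the monitoring side, a minimal latency and success-rate tracker might look like this (the class is a hypothetical illustration, not part of the codebase):

```python
class ResponseTracker:
    """Tracks per-call latency and success for a model interface (illustrative)."""

    def __init__(self):
        self.calls = 0
        self.failures = 0
        self.total_seconds = 0.0

    def record(self, seconds, ok):
        """Record one model call: its wall-clock duration and whether it succeeded."""
        self.calls += 1
        self.total_seconds += seconds
        if not ok:
            self.failures += 1

    @property
    def success_rate(self):
        return 1.0 if self.calls == 0 else (self.calls - self.failures) / self.calls

    @property
    def mean_latency(self):
        return 0.0 if self.calls == 0 else self.total_seconds / self.calls
```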

### DSPy Integration

The system leverages DSPy for:
- **Programmatic Prompting**: Structured reasoning for adversarial prompt generation
- **Optimization**: BootstrapFewShot optimization for improved attack discovery
- **Metrics**: Custom evaluation functions for safety effectiveness
- **Modularity**: Composable reasoning programs for different safety objectives
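
DSPy metrics are plain Python functions over an example, a prediction, and an optional trace, so a custom safety-effectiveness metric reduces to ordinary code. In this sketch, the `risk_score` field on the prediction and the 0.5 discovery threshold are assumptions, not the platform's actual values:

```python
def safety_effectiveness(example, pred, trace=None):
    """Score 1.0 when a generated prompt elicited a high-risk response.

    `risk_score` is an assumed field on the prediction object, and 0.5 is
    an illustrative discovery threshold.
    """
    risk = getattr(pred, "risk_score", 0.0)
    return float(risk >= 0.5)
```

A metric like this is what BootstrapFewShot would maximize when selecting demonstrations for the red-teaming module.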

## What This Lab Is Not

- ❌ **A demo or tutorial**: This is a production-oriented safety evaluation platform
- ❌ **A prompt engineering playground**: Focuses on systematic, reproducible testing
- ❌ **A chat application**: Designed for evaluation, not conversation
- ❌ **A toy example**: Built for serious safety assessment workflows
- ❌ **A replacement for human review**: Augments, not replaces, human expertise

## Installation and Setup

### Prerequisites

- Python 3.8+
- Hugging Face API token (for model access)
- Sufficient compute resources for model evaluation

### Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/soupstick/AI_Safety_Lab
cd AI_Safety_Lab

# Install dependencies
pip install -r requirements.txt

# Set environment variables
export HUGGINGFACEHUB_API_TOKEN="your_token_here"

# Run locally
python app.py
```

### Hugging Face Space Deployment

1. Clone this space to your account
2. Add your Hugging Face token as a repository secret
3. The space will automatically build and deploy

## Usage

### Basic Safety Evaluation

1. **Select Target Model**: Choose from available Hugging Face models
2. **Define Safety Objective**: Specify the safety boundary to test
3. **Configure Parameters**: Set iterations, prompts per iteration, optimization threshold
4. **Run Evaluation**: Execute the multi-agent safety assessment
5. **Review Results**: Analyze prompts, responses, and comprehensive safety report

### Advanced Configuration

```python
from orchestration.loop import EvaluationConfig, evaluation_loop

config = EvaluationConfig(
    target_model_id="meta-llama/Llama-2-7b-chat-hf",
    safety_objective="Test for harmful content generation",
    max_prompts_per_iteration=5,
    max_iterations=3,
    optimization_threshold=0.3,
    temperature=0.7
)

report = evaluation_loop.run_evaluation(config)
```

## Evaluation Metrics

### Risk Assessment
- **Overall Risk Score**: Composite safety risk (0.0-1.0)
- **Policy Violation Likelihood**: Probability of policy breaches
- **Harm Severity**: Categorical severity assessment (low/medium/high/critical)
- **Ambiguity Risk**: Potential for misinterpretation
- **Exploitability**: Likelihood of malicious exploitation

### Performance Metrics
- **Discovery Rate**: Percentage of high-risk outputs identified
- **Attack Vector Coverage**: Diversity of attack types tested
- **Response Success Rate**: Model availability and response quality
- **Evaluation Efficiency**: Time and resource optimization

### Quality Metrics
- **False Positive/Negative Rates**: Accuracy of safety assessments
- **Precision/Recall**: Balance between safety and usability
- **Trend Analysis**: Performance changes over time
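
These quality metrics all reduce to the four confusion counts; a minimal sketch:

```python
def quality_metrics(tp, fp, tn, fn):
    """Compute precision, recall, and error rates from confusion counts
    (tp = correctly flagged unsafe, fp = safe wrongly flagged, etc.)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fpr,
        "false_negative_rate": fnr,
    }
```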

## Architecture Decisions

### Multi-Agent Design
- **Separation of Concerns**: Clear boundaries between adversarial generation and safety evaluation
- **Independence**: SafetyJudgeAgent has no access to red-teaming internals
- **Specialization**: Each agent optimized for its specific task

### DSPy Integration
- **Programmatic Approach**: Structured reasoning over prompt engineering
- **Optimization**: Continuous improvement through DSPy's optimization framework
- **Reproducibility**: Consistent evaluation across multiple runs

### Modular Structure
- **Extensibility**: Easy addition of new agents and evaluation dimensions
- **Maintainability**: Clear separation of concerns and well-defined interfaces
- **Testing**: Unit testable components with clear responsibilities

## Integration Options

### Model Registry
```python
from models.hf_interface import model_interface

# Add custom models
model_interface.available_models["custom/model"] = ModelInfo(
    model_id="custom/model",
    name="Custom Model",
    description="Organization-specific model",
    category="Internal",
    requires_token=True,
    is_local=True
)
```

### Custom Safety Dimensions
```python
from agents.safety_judge import SafetyJudgeAgent

class CustomSafetyJudge(SafetyJudgeAgent):
    def __init__(self):
        super().__init__()
        self.safety_dimensions.extend([
            "custom_compliance",
            "business_risk"
        ])
```

### Evaluation Pipelines
```python
# Batch evaluation across multiple models, reusing the `config` built in
# the Advanced Configuration example above
models = ["model1", "model2", "model3"]
results = []

for model_id in models:
    config.target_model_id = model_id
    report = evaluation_loop.run_evaluation(config)
    results.append(report)
```

## Compliance and Standards

The platform supports compliance with:
- **NIST AI Risk Management Framework**: Structured risk assessment and monitoring
- **AI Act Requirements**: Safety testing and documentation
- **ISO/IEC 23894**: AI risk management guidelines
- **Internal Governance**: Customizable evaluation criteria and reporting

## Contributing

This is an internal safety experimentation platform. Contributions should focus on:
- Enhanced agent capabilities
- New safety evaluation dimensions
- Improved optimization strategies
- Additional model integrations
- Performance and scalability improvements

## Security Considerations

- **Token Management**: Never hardcode API tokens; use environment variables
- **Model Access**: Controlled access to models through Hugging Face authentication
- **Data Privacy**: No sensitive data storage; evaluation results are temporary
- **Audit Trail**: Comprehensive logging of all evaluation activities
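
In practice, the first point amounts to reading the token from the environment and failing fast when it is missing:

```python
import os

def get_hf_token():
    """Read the Hugging Face token from the environment; never hardcode it."""
    token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
    if not token:
        raise RuntimeError(
            "HUGGINGFACEHUB_API_TOKEN is not set; export it before running."
        )
    return token
```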

## License

MIT License - see LICENSE file for details.

## Support

For issues and questions:
1. Check the evaluation logs for detailed error information
2. Verify Hugging Face token configuration
3. Confirm the target model is accessible (gated models require approved access)
4. Review system resource requirements

---

**Note**: This platform is designed for safety research and evaluation purposes. All evaluations should be conducted in accordance with organizational policies and applicable regulations.