Naman Gupta committed on
Commit
e25d8cb
·
1 Parent(s): bce1ad6

Updated readme

Files changed (1)
  1. README.md +40 -47
README.md CHANGED
@@ -1,5 +1,5 @@
1
  ---
2
- title: RedTeamOS
3
  emoji: 🛡️
4
  colorFrom: red
5
  colorTo: purple
@@ -8,12 +8,12 @@ pinned: false
8
  license: mit
9
  ---
10
 
11
- # RedTeamOS
12
 
13
- An AI red-teaming environment for safety research.
14
  Built for the Meta PyTorch OpenEnv Hackathon.
15
 
16
- The system pits an **attacker** (tries to jailbreak an AI) against a **defender** (a safety-focused LLM) across multiple turns. Each episode is scored on how well the defender held up.
17
 
18
  ---
19
 
@@ -43,14 +43,14 @@ uvicorn server.app:app --reload --port 7860
43
 
44
  ## API Endpoints
45
 
46
- | Method | Endpoint | Description |
47
- |--------|-----------|-------------|
48
- | GET | `/health` | Check if the server is running |
49
- | POST | `/reset` | Start a fresh episode |
50
- | POST | `/step` | Send one attack and get the defender's response |
51
- | GET | `/state` | See the current episode status |
52
- | GET | `/history`| Get all turns played so far |
53
- | POST | `/grade` | Grade a completed episode (call after episode ends) |
54
 
55
  ---
56
 
@@ -144,7 +144,7 @@ Each completed episode is graded on a 0.0–1.0 scale:
144
  - **Medium score** - success rate + strategy diversity
145
  - **Hard score** - success + novelty + strategy/category diversity
146
 
147
- **LLM grader** (optional, via `grade_episode_with_llm()`):
148
  - **Consistency** - did the defender stay consistent across turns?
149
  - **Robustness** - did it hold firm under pressure?
150
  - **Clarity** - were refusals clear and explained?
@@ -156,43 +156,42 @@ Each completed episode is graded on a 0.0–1.0 scale:
156
 
157
  ```
158
  ├── server/
159
- │ ├── app.py - FastAPI routes (/reset, /step, /state, /grade)
160
- │ ├── environment.py - Episode logic, turn management
161
- │ └── config.py - Environment variable settings
162
  │
163
- ├── llm/ - Person 3: AI Integrator
164
- │ ├── client.py - Groq API connection
165
- │ ├── defender.py - Defender LLM + system prompt
166
- │ ├── attack_classifier.py - Did the attack succeed? (0.0–1.0)
167
- │ ├── defense_scorer.py - How good was the defense? (0.0–1.0)
168
- │ ├── history_manager.py - Multi-turn conversation memory
169
- │ └── pipeline.py - Master run_llm_pipeline() function
170
  │
171
- ├── graders/ - Person 2: Reward Engineer
172
  │ ├── easy_grader.py
173
  │ ├── medium_grader.py
174
  │ ├── hard_grader.py
175
  │ └── programmatic_grader.py
176
  │
177
- ├── rewards/ - Person 2: Reward Engineer
178
  │ └── compute_rewards.py
179
  │
180
- ├── models.py - Shared Pydantic data models
181
- ├── inference.py - End-to-end runner (runs all 3 task difficulties)
182
- ├── openenv.yaml - OpenEnv spec config
183
  └── docs/
184
- └── prompts.md - All LLM prompts documented
185
  ```
186
 
187
  ---
188
 
189
- ## Integration Points (for Person 1 & Person 2)
190
 
191
- ### What Person 3 provides
192
 
193
- **`run_llm_pipeline(action, conversation_history)`** - called automatically by the server on every `/step`. No changes needed.
194
 
195
- Returns:
196
  ```python
197
  {
198
  "defender_response": str, # what the defender said
@@ -203,9 +202,10 @@ Returns:
203
  }
204
  ```
205
 
206
- **`grade_episode_with_llm(history)`** - optional, call after `/grade` for LLM-based episode scoring.
207
 
208
- Returns:
209
  ```python
210
  {
211
  "consistency": float, # 0.0–1.0
@@ -216,11 +216,10 @@ Returns:
216
  }
217
  ```
218
 
219
- ### What Person 2 must provide
220
 
221
- **`compute_rewards(action, attack_history, llm_result)`** - wired in via `RewardComputer` class in `rewards/compute_rewards.py`.
222
 
223
- Must return:
224
  ```python
225
  {
226
  "total_reward": float, # any float (can be negative)
@@ -230,19 +229,13 @@ Must return:
230
  }
231
  ```
232
 
233
- ### What Person 1 must provide
234
-
235
- - A running server deployed to HuggingFace Spaces
236
- - `GROQ_API_KEY` and `MODEL_NAME` set in the Space's environment variables
237
- - The `/grade` endpoint should optionally call `grade_episode_with_llm()` from `llm/pipeline.py`
238
-
239
  ---
240
 
241
  ## Docker
242
 
243
  ```bash
244
- docker build -t redteam-env .
245
- docker run -p 7860:7860 --env-file .env redteam-env
246
  ```
247
 
248
  ---
@@ -251,5 +244,5 @@ docker run -p 7860:7860 --env-file .env redteam-env
251
 
252
  ```bash
253
  python3 -m pytest tests/ -v
254
- # 42 tests - all run offline, no API calls needed
255
  ```
 
1
  ---
2
+ title: Breach-OS
3
  emoji: 🛡️
4
  colorFrom: red
5
  colorTo: purple
 
8
  license: mit
9
  ---
10
 
11
+ # Breach-OS
12
 
13
+ An AI red-teaming environment for safety research.
14
  Built for the Meta PyTorch OpenEnv Hackathon.
15
 
16
+ Breach-OS pits an **attacker** (tries to jailbreak an AI) against a **defender** (a safety-focused LLM) across multiple turns. Each episode is scored on how well the defender held up under pressure.
17
 
18
  ---
19
 
 
43
 
44
  ## API Endpoints
45
 
46
+ | Method | Endpoint | Description |
47
+ |--------|------------|-------------|
48
+ | GET | `/health` | Check if the server is running |
49
+ | POST | `/reset` | Start a fresh episode |
50
+ | POST | `/step` | Send one attack and get the defender's response |
51
+ | GET | `/state` | See the current episode status |
52
+ | GET | `/history` | Get all turns played so far |
53
+ | POST | `/grade` | Grade a completed episode (call after episode ends) |
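
The endpoint table above can be exercised with a small client. The following is a minimal sketch, not the project's own tooling: the helper names `call` and `run_episode` are hypothetical, the base URL assumes the server is running locally on port 7860 as in the uvicorn and Docker commands, and the `{"attack": ...}` request body is a placeholder because the diff does not show the request schema.

```python
import json
from urllib import request

BASE_URL = "http://localhost:7860"  # assumed: local server on the port used elsewhere in the README


def call(method, path, payload=None, opener=request.urlopen):
    """Send one JSON request to the server and decode the JSON reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    req = request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with opener(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_episode(attacks, opener=request.urlopen):
    """Reset the environment, send each attack through /step, then grade."""
    call("POST", "/reset", {}, opener)
    # Placeholder body: the real /step schema is not shown in this diff.
    replies = [call("POST", "/step", {"attack": text}, opener) for text in attacks]
    return replies, call("POST", "/grade", {}, opener)
```

The `opener` parameter exists only so the request flow can be checked offline by passing a stand-in for `urlopen`; against a live server the default is enough, for example `call("GET", "/health")`.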
54
 
55
  ---
56
 
 
144
  - **Medium score** - success rate + strategy diversity
145
  - **Hard score** - success + novelty + strategy/category diversity
146
 
147
+ **LLM grader** (via `grade_episode_with_llm()`):
148
  - **Consistency** - did the defender stay consistent across turns?
149
  - **Robustness** - did it hold firm under pressure?
150
  - **Clarity** - were refusals clear and explained?
 
156
 
157
  ```
158
  ├── server/
159
+ │ ├── app.py - FastAPI routes (/reset, /step, /state, /grade)
160
+ │ ├── environment.py - Episode logic, turn management
161
+ │ └── config.py - Environment variable settings
162
  │
163
+ ├── llm/ - AI Integrator
164
+ │ ├── client.py - Groq API connection
165
+ │ ├── defender.py - Defender LLM + system prompt
166
+ │ ├── attack_classifier.py - Did the attack succeed? (0.0–1.0)
167
+ │ ├── defense_scorer.py - How good was the defense? (0.0–1.0)
168
+ │ ├── history_manager.py - Multi-turn conversation memory
169
+ │ └── pipeline.py - Master run_llm_pipeline() function
170
  │
171
+ ├── graders/ - Reward Engineer
172
  │ ├── easy_grader.py
173
  │ ├── medium_grader.py
174
  │ ├── hard_grader.py
175
  │ └── programmatic_grader.py
176
  │
177
+ ├── rewards/ - Reward Engineer
178
  │ └── compute_rewards.py
179
  │
180
+ ├── models.py - Shared Pydantic data models
181
+ ├── inference.py - End-to-end runner (runs all 3 task difficulties)
182
+ ├── openenv.yaml - OpenEnv spec config
183
  └── docs/
184
+ └── prompts.md - All LLM prompts documented
185
  ```
186
 
187
  ---
188
 
189
+ ## Integration Contracts
190
 
191
+ ### `run_llm_pipeline(action, conversation_history)`
192
 
193
+ Called automatically by the server on every `/step`. Returns:
194
 
195
  ```python
196
  {
197
  "defender_response": str, # what the defender said
 
202
  }
203
  ```
204
 
205
+ ### `grade_episode_with_llm(history)`
206
+
207
+ Call after `/grade` for LLM-based episode scoring. Returns:
208
 
209
  ```python
210
  {
211
  "consistency": float, # 0.0–1.0
 
216
  }
217
  ```
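
A caller may want to sanity-check the LLM grade before using it. The following is a minimal sketch under stated assumptions: the helper name `check_llm_grade` is hypothetical and not part of the repository, only the `consistency` key is taken from the contract above, and any other numeric field is assumed to share its 0.0 to 1.0 range.

```python
from numbers import Real


def check_llm_grade(grade: dict) -> dict:
    """Return the numeric scores in `grade`, raising if any is outside 0.0-1.0.

    Hypothetical helper: only "consistency" is confirmed by the README;
    other numeric fields are assumed to use the same range.
    """
    if "consistency" not in grade:
        raise KeyError("expected a 'consistency' score")
    scores = {}
    for name, value in grade.items():
        # bool counts as a Real in Python, so exclude it explicitly.
        if isinstance(value, Real) and not isinstance(value, bool):
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name}={value} is outside 0.0-1.0")
            scores[name] = float(value)
    return scores
```

Non-numeric fields are skipped instead of rejected, so free-text feedback in the result, if there is any, would not trip the check.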
218
 
219
+ ### `compute_rewards(action, attack_history, llm_result)`
220
 
221
+ Wired in via `RewardComputer` in `rewards/compute_rewards.py`. Must return:
222
 
223
  ```python
224
  {
225
  "total_reward": float, # any float (can be negative)
 
229
  }
230
  ```
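
Because `total_reward` may be any float, including a negative one, the consuming code should not clamp it or assume a sign. The following is a minimal sketch of a defensive read: `extract_total_reward` is a hypothetical helper, and rejecting NaN or infinite values is an extra assumption made here, not a rule stated in the contract.

```python
import math
from numbers import Real


def extract_total_reward(result: dict) -> float:
    """Read "total_reward" from a compute_rewards() result dict.

    Hypothetical helper. Negative values pass through unchanged; the
    finiteness check is an added assumption, not part of the contract.
    """
    total = result["total_reward"]  # KeyError if the field is missing
    if isinstance(total, bool) or not isinstance(total, Real):
        raise TypeError(f"total_reward must be a number, got {type(total).__name__}")
    total = float(total)
    if not math.isfinite(total):
        raise ValueError("total_reward must be finite")
    return total
```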
231
 
232
  ---
233
 
234
  ## Docker
235
 
236
  ```bash
237
+ docker build -t breach-os .
238
+ docker run -p 7860:7860 --env-file .env breach-os
239
  ```
240
 
241
  ---
 
244
 
245
  ```bash
246
  python3 -m pytest tests/ -v
247
+ # 59 tests - all run offline, no API calls needed
248
  ```