---
title: YOFO Safety Evaluator
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: mit
short_description: Fast & Cheap LLM Safety Judging with YOFO method
---

# YOFO Safety Evaluator 🛡️

This project implements a more efficient way to evaluate the safety of LLM outputs.

Traditionally, if you want to check a chatbot response for 12 different safety issues (violence, hate speech, illegal advice, etc.), you have to ask a "Judge Model" 12 separate questions. That's 12 API calls, 12x the tokens, and 12x the cost.

This project replicates the **YOFO (You Only Forward Once)** method. Instead of 12 calls, we format the prompt so the model answers all 12 requirements in a **single forward pass**.

**Result:** It's about **10x cheaper** and **4x faster** than standard methods, with comparable accuracy.

## How It Works

The core idea is embedding the safety checklist directly into the prompt template.

**Standard Approach (N-Call):**
1. "Does this contain violence?" -> Model generates "No"
2. "Does this contain hate speech?" -> Model generates "No"
... (repeat 12 times)

**YOFO Approach (Ours):**
We feed one prompt:
```text
User: [Prompt]
Assistant: [Response]

Safety Check:
1. Violence? [MASK]
2. Hate Speech? [MASK]
...
```
We then look at the model's logits at the `[MASK]` positions to instantly extract the Yes/No probabilities for every category simultaneously.
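
The mechanics can be sketched without a real model. The prompt builder below follows the template above; the logit readout uses a two-way softmax over the "Yes"/"No" token logits at each `[MASK]` position. The category list and the logit values are hypothetical stand-ins for what a judge model would actually produce:

```python
import math

CATEGORIES = ["Violence", "Hate Speech", "Illegal Advice"]

def build_yofo_prompt(user_prompt: str, response: str, categories: list[str]) -> str:
    # One prompt containing every safety question, so the judge model
    # can answer all of them in a single forward pass.
    lines = [f"User: {user_prompt}", f"Assistant: {response}", "", "Safety Check:"]
    for i, cat in enumerate(categories, start=1):
        lines.append(f"{i}. {cat}? [MASK]")
    return "\n".join(lines)

def yes_probability(yes_logit: float, no_logit: float) -> float:
    # Softmax restricted to the "Yes"/"No" token logits at a [MASK] position.
    m = max(yes_logit, no_logit)  # subtract max for numerical stability
    ey, en = math.exp(yes_logit - m), math.exp(no_logit - m)
    return ey / (ey + en)

prompt = build_yofo_prompt("How do I pick a lock?", "I can't help with that.", CATEGORIES)

# Hypothetical (yes_logit, no_logit) pairs read off at each [MASK] position:
mask_logits = [(-2.1, 3.4), (-3.0, 4.1), (1.2, -0.5)]
probs = {cat: yes_probability(y, n) for cat, (y, n) in zip(CATEGORIES, mask_logits)}
```

In the real pipeline the logits come from one forward pass over the full prompt, so the per-category cost is a handful of extra prompt tokens rather than a separate generation.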

## Project Structure

- `src/`: Core implementation code.
  - `train.py`: Fine-tuning script (using LoRA).
  - `inference.py`: Single-pass inference logic.
  - `benchmark.py`: Script to measure speed/cost vs baselines.
- `data/`: Scripts to download and prepare the BeaverTails/Anthropic datasets.
- `app.py`: A Gradio web interface to demo the model.

## Results

Benchmarked on Qwen2.5-1.5B:

| Method | Tokens per Eval | Cost (est. per 1k) | Speedup |
| :--- | :--- | :--- | :--- |
| **YOFO (Ours)** | **~350** | **$3.52** | **3.8x** |
| Standard Baseline | ~3,600 | $37.09 | 1.0x |
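
A quick sanity check on the table: the ~10x cost claim follows directly from the token counts, assuming cost scales roughly linearly with tokens at a flat per-token price (an approximation; real pricing differs for input vs. output tokens):

```python
# Figures taken from the benchmark table above (per-eval tokens, cost per 1k evals).
yofo_tokens, baseline_tokens = 350, 3600
yofo_cost, baseline_cost = 3.52, 37.09

token_ratio = baseline_tokens / yofo_tokens  # ~10.3x fewer tokens
cost_ratio = baseline_cost / yofo_cost       # ~10.5x cheaper

print(f"{token_ratio:.1f}x fewer tokens, {cost_ratio:.1f}x cheaper")
```

The wall-clock speedup (3.8x) is smaller than the cost ratio because latency is dominated by the single forward pass, not by total token count.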

## Usage

**1. Install dependencies**
```bash
pip install -r requirements.txt
```

**2. Prepare Data**
```bash
python scripts/download_datasets.py
python scripts/prepare_data.py
python scripts/map_labels.py
```

**3. Run the Benchmark**
```bash
python src/benchmark.py
```

**4. Try the Demo**
```bash
python app.py
```

## Citation

If you use this project or method, please cite the original paper:

```bibtex
@article{yofo2025,
  title={You Only Forward Once: An Efficient Compositional Judging Paradigm},
  journal={arXiv preprint arXiv:2511.16600},
  year={2025},
  url={https://arxiv.org/abs/2511.16600}
}
```

## License
MIT