---
title: YOFO Safety Evaluator
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: mit
short_description: Fast & Cheap LLM Safety Judging with YOFO method
---
# YOFO Safety Evaluator 🛡️
This project implements a more efficient way to evaluate the safety of LLM outputs.
Traditionally, if you want to check a chatbot response for 12 different safety issues (violence, hate speech, illegal advice, etc.), you have to ask a "Judge Model" 12 separate questions. That's 12 API calls, 12x the tokens, and 12x the cost.
This project replicates the **YOFO (You Only Forward Once)** method. Instead of 12 calls, we format the prompt so the model answers all 12 requirements in a **single forward pass**.
**Result:** It's about **10x cheaper** and **4x faster** than standard methods, with comparable accuracy.
## How It Works
The core idea is embedding the safety checklist directly into the prompt template.
**Standard Approach (N-Call):**
1. "Does this contain violence?" -> Model generates "No"
2. "Does this contain hate speech?" -> Model generates "No"
... (repeat 12 times)
**YOFO Approach (Ours):**
We feed one prompt:
```text
User: [Prompt]
Assistant: [Response]
Safety Check:
1. Violence? [MASK]
2. Hate Speech? [MASK]
...
```
We then look at the model's logits at the `[MASK]` positions to instantly extract the Yes/No probabilities for every category simultaneously.
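The logit-reading step can be sketched as follows. All names here are illustrative: in a real run, `logits` would come from a single forward pass of the judge model and the `Yes`/`No` token ids from its tokenizer.

```python
import math

def yes_no_probs(logits, mask_positions, yes_id, no_id):
    """For each [MASK] slot, compare the logits of the "Yes" and "No"
    tokens and return P(Yes) via a two-way softmax.

    logits: per-position logit vectors (seq_len x vocab_size)
    mask_positions: token indices of the [MASK] slots in the prompt
    yes_id / no_id: vocabulary ids of the "Yes" / "No" tokens
    """
    probs = []
    for pos in mask_positions:
        y, n = logits[pos][yes_id], logits[pos][no_id]
        m = max(y, n)  # subtract the max for numerical stability
        p_yes = math.exp(y - m) / (math.exp(y - m) + math.exp(n - m))
        probs.append(p_yes)
    return probs

# Toy example: 2 checklist slots over a 5-token vocabulary,
# with yes_id=1 and no_id=2.
toy_logits = [
    [0.0, 2.0, 0.0, 0.0, 0.0],  # slot leans "Yes"
    [0.0, 0.0, 3.0, 0.0, 0.0],  # slot leans "No"
]
print(yes_no_probs(toy_logits, [0, 1], yes_id=1, no_id=2))
```

Because every `[MASK]` position is scored in the same forward pass, the per-category answers come out together, with no extra generation steps.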
## Project Structure
- `src/`: Core implementation code.
- `train.py`: Fine-tuning script (using LoRA).
- `inference.py`: Single-pass inference logic.
- `benchmark.py`: Script to measure speed/cost vs baselines.
- `data/`: Scripts to download and prepare the BeaverTails/Anthropic datasets.
- `app.py`: A Gradio web interface to demo the model.
## Results
Benchmarked on Qwen2.5-1.5B:
| Method | Tokens per Eval | Est. Cost per 1k Evals | Speedup |
| :--- | :--- | :--- | :--- |
| **YOFO (Ours)** | **~350** | **$3.52** | **3.8x** |
| Standard Baseline | ~3,600 | $37.09 | 1.0x |
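A quick sanity check of the table (figures taken directly from it) shows that the token and cost ratios line up with the "about 10x cheaper" claim:

```python
# Values from the benchmark table above.
yofo_tokens, base_tokens = 350, 3600
yofo_cost, base_cost = 3.52, 37.09

token_ratio = base_tokens / yofo_tokens  # ~10.3x fewer tokens per eval
cost_ratio = base_cost / yofo_cost       # ~10.5x cheaper per 1k evals
print(round(token_ratio, 1), round(cost_ratio, 1))
```

The wall-clock speedup (3.8x) is smaller than the token reduction because the single YOFO pass still processes one long prompt, whereas the cost scales almost directly with total tokens.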
## Usage
**1. Install dependencies**
```bash
pip install -r requirements.txt
```
**2. Prepare Data**
```bash
python scripts/download_datasets.py
python scripts/prepare_data.py
python scripts/map_labels.py
```
**3. Run the Benchmark**
```bash
python src/benchmark.py
```
**4. Try the Demo**
```bash
python app.py
```
## Citation
If you use this project or method, please cite the original paper:
```bibtex
@article{yofo2025,
  title={You Only Forward Once: An Efficient Compositional Judging Paradigm},
  journal={arXiv preprint arXiv:2511.16600},
  year={2025},
  url={https://arxiv.org/abs/2511.16600}
}
```
## License
MIT