---
title: YOFO Safety Evaluator
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 4.0.0
app_file: app.py
pinned: false
license: mit
short_description: Fast & Cheap LLM Safety Judging with YOFO method
---

# YOFO Safety Evaluator 🛡️

This project implements a more efficient way to evaluate the safety of LLM outputs.

Traditionally, checking a chatbot response for 12 different safety issues (violence, hate speech, illegal advice, etc.) means asking a judge model 12 separate questions. That's 12 API calls, 12x the tokens, and 12x the cost.

This project replicates the **YOFO (You Only Forward Once)** method. Instead of 12 calls, we format the prompt so the model answers all 12 requirements in a **single forward pass**.

**Result:** about **10x cheaper** and **4x faster** than the standard approach, with comparable accuracy.

## How It Works

The core idea is to embed the safety checklist directly into the prompt template.

**Standard Approach (N-Call):**

1. "Does this contain violence?" -> Model generates "No"
2. "Does this contain hate speech?" -> Model generates "No"
... (repeat 12 times)
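
The cost gap comes from repeated context: every baseline call re-sends the full prompt and response, while YOFO sends them once. A back-of-the-envelope sketch (the token counts here are illustrative assumptions, not measurements):

```python
CONTEXT_TOKENS = 280   # prompt + response, re-sent on every baseline call (assumed)
QUESTION_TOKENS = 10   # one safety-checklist question (assumed)
N_CATEGORIES = 12

# 12 separate judge calls, each carrying the full context again
baseline_tokens = N_CATEGORIES * (CONTEXT_TOKENS + QUESTION_TOKENS)
# One combined prompt: the context once, plus all 12 checklist lines
yofo_tokens = CONTEXT_TOKENS + N_CATEGORIES * QUESTION_TOKENS

print(baseline_tokens, yofo_tokens)  # 3480 vs 400: roughly a 9x token reduction
```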

**YOFO Approach (Ours):**

We feed one prompt:

```text
User: [Prompt]
Assistant: [Response]

Safety Check:
1. Violence? [MASK]
2. Hate Speech? [MASK]
...
```

We then read the model's logits at each `[MASK]` position to instantly extract the Yes/No probabilities for every category simultaneously.
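
The extraction step can be sketched in a few lines (a toy vocabulary stands in here for the judge model's real logits; the function name is illustrative, not the repo's API). At each `[MASK]` slot, we renormalize over just the "Yes" and "No" token logits:

```python
import math

def yes_no_prob(logits_at_mask, yes_id, no_id):
    """Softmax over only the Yes/No logits at one [MASK] position."""
    y, n = logits_at_mask[yes_id], logits_at_mask[no_id]
    return math.exp(y) / (math.exp(y) + math.exp(n))

# Toy vocabulary: token id 0 = "Yes", id 1 = "No".
# One row of logits per checklist [MASK] position, all from one forward pass.
mask_logits = {
    "violence":    [4.0, 0.0, -1.0],  # model strongly favors "Yes"
    "hate_speech": [0.0, 4.0, -1.0],  # model strongly favors "No"
}
scores = {cat: yes_no_prob(row, yes_id=0, no_id=1)
          for cat, row in mask_logits.items()}
print(scores)
```

Because all the `[MASK]` positions come from the same forward pass, the loop over categories adds no extra model calls.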

## Project Structure

- `src/`: Core implementation code.
  - `train.py`: Fine-tuning script (using LoRA).
  - `inference.py`: Single-pass inference logic.
  - `benchmark.py`: Script to measure speed/cost vs. baselines.
- `data/`: Scripts to download and prepare the BeaverTails/Anthropic datasets.
- `app.py`: A Gradio web interface to demo the model.

## Results

Benchmarked on Qwen2.5-1.5B:

| Method | Tokens per Eval | Cost (est. per 1k evals) | Speedup |
| :--- | :--- | :--- | :--- |
| **YOFO (Ours)** | **~350** | **$3.52** | **3.8x** |
| Standard Baseline | ~3,600 | $37.09 | 1.0x |
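
As a quick sanity check on the numbers above (values copied from the table; "per 1k" is read as per 1k evaluations), the estimated cost ratio tracks the token ratio, and both land near the headline 10x:

```python
yofo_tokens, base_tokens = 350, 3600   # ~tokens per eval, from the table
yofo_cost, base_cost = 3.52, 37.09     # est. $ per 1k evals, from the table

token_ratio = base_tokens / yofo_tokens
cost_ratio = base_cost / yofo_cost
print(f"{token_ratio:.1f}x fewer tokens, {cost_ratio:.1f}x cheaper")
```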

## Usage

**1. Install dependencies**

```bash
pip install -r requirements.txt
```

**2. Prepare Data**

```bash
python scripts/download_datasets.py
python scripts/prepare_data.py
python scripts/map_labels.py
```

**3. Run the Benchmark**

```bash
python src/benchmark.py
```

**4. Try the Demo**

```bash
python app.py
```

## Citation

If you use this project or method, please cite the original paper:

```bibtex
@article{yofo2025,
  title={You Only Forward Once: An Efficient Compositional Judging Paradigm},
  journal={arXiv preprint arXiv:2511.16600},
  year={2025},
  url={https://arxiv.org/abs/2511.16600}
}
```

## License

MIT