Jailbreak Evaluation

Introduction

Jailbreak Evaluation provides simple, easy-to-use, efficient, and comprehensive security risk detection for large language models. Users can identify security issues with one click, helping developers efficiently recognize and fix security risks.

The platform includes typical risk prompts curated by Tencent Zhuque Lab through large-scale data cleaning, synthesis, generalization, and semantic deduplication. It supports over a hundred attack methods to dynamically enhance risk prompts. Developers can either use the built-in 'Jailbreak Evaluation' to evaluate models or utilize the custom evaluation set feature to further generalize and enhance internal risk prompt cases using the attack methods provided by Zhuque.

Quick Start

Three Steps to Complete

Select Task Type: Click "Jailbreak Evaluation" below the dialog box.
Configure Model, Dataset, and Attack Methods:
- Select/configure scoring model (see Model Configuration).
- Select/configure the model to be tested (see Model Configuration).
- Choose built-in datasets (see Dataset Selection) or upload custom datasets (see Custom Dataset Management).
- Select attack methods (see Attack Methods Introduction) or test with original prompts only.
Start Task and View Report: Click the button, wait for task completion, and view detailed results report.

Detailed Configuration Introduction

1. Model Configuration

Supported Model Types: Models compatible with OpenAI API format
Configuration Parameters:
- Model name, e.g.: openai/gpt-4o
- API base URL, e.g.: https://openrouter.ai/api/v1
- API key

2. Dataset Selection

Built-in curated security test datasets covering important security scenarios;
Support for using custom datasets (see Custom Dataset Management);
Automatic task execution time estimation for better test planning;

Health Check Execution:

Support for single-model or multi-model health checks
Automatic generation of detailed security scores and risk reports
Provides cross-model security performance comparative analysis

Report Display:

Visual presentation of health check results, including success/failure rates, risk analysis, etc.
Model security rating: High, Medium, Low
Support for full data result export

3. Custom Dataset Management

The system supports two ways to use custom datasets:

Temporary Upload:

Temporarily upload during health check task execution, not saved after task completion
Compatible with mainstream formats (CSV, JSON, JSONL, Excel, Parquet, TXT)
Automatic recognition of common prompt column names (such as prompt, question, query, text, content, etc.)

Note: Future versions will support user-defined column name configuration

Dataset Management:

Permanently saved to the system through management page, supporting reuse and sharing
Requires standard JSON format to ensure data quality and consistency

Note: Future versions will provide dataset quality assessment and user contribution rankings

4. Attack Methods Introduction

The system includes a rich library of attack methods that support dynamic enhancement of risk prompts, helping developers comprehensively test model security protection capabilities. The current version provides two major categories of attack strategies, totaling over a hundred specific attack methods.

Encoding Attacks: Encoding attack strategies encrypt risk prompts through various encoding and obfuscation methods to bypass model safety guardrails.
Behavioral Control Attacks: Behavioral control attack strategies control model behavior through context guidance, redirection, or deception to bypass security restrictions.

These attack methods can be used individually or in combination, providing developers with comprehensive model security testing capabilities. The platform will continuously update and expand the attack method library to address evolving security threats.

🙏 Acknowledgements

The development of this project relies on the following excellent open-source projects, for which we express our gratitude.

Framework Support

This project is built and deeply customized based on the DeepTeam project from the Confident AI team.

Original Repository: https://github.com/DeepTeam/DeepTeam
Original Project License: Please refer to the LICENSE file in their repository.
Note: We sincerely thank the Confident AI team for providing an excellent foundational framework. To better adapt and serve our own business architecture and specific requirements, we have made extensive modifications, extensions, and refactoring to achieve specialized adaptation and integration with the AI-Infra-Guard ecosystem, enabling seamless out-of-the-box integration.

Attack Operator Contributions

We express our sincere gratitude to the research teams and communities that contributed to the development of various attack techniques and operators used in this project:

Operator Name	Source Team	Link
Some single-round and multi-round operators	Confident AI Inc.	Github
SequentialBreak	Saiem et al.	Paper
Best of N	Hughes et al.	Paper
ICRT Jailbreak	Yang et al.	Paper
Strata-Sword	Alibaba AAIG	Paper
PROMISQROUTE	Adversa AI	Blog

Dataset Contributions

We express our sincere gratitude to the research teams and communities that contributed to the various datasets used in this project:

Dataset Name	Source Team	Link
JailBench	STAIR	Github
redteam-deepseek	Promptfoo	Github
ChatGPT-Jailbreak-Prompts	Rubén Darío Jaramillo	HuggingFace
JBB-Behaviors	Chao et al.	HuggingFace
JADE 3.0	Fudan Baize Intelligence	Github
JailbreakPrompts	Simon Knuts	HuggingFace