Update README.md
README.md CHANGED

```diff
@@ -17,10 +17,10 @@ Beijing Institute of AI Safety and Governance (Beijing-AISI) is dedicated to bui
 - Safety is a core capacity for AI.
 - Development and Safety can be simultaneously ensured and achieved.
 - Safety and Governance of AI ensure its steady development, empowering global sustainable development and harmonious symbiosis.
-## Publications
--
+## Beijing-AISI Publications
+- **[Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models](https://openreview.net/pdf?id=s20W12XTF8)** has been **published at ICLR 2025**!
 This work presents **Jailbreak Antidote**, a lightweight and real-time defense mechanism that dynamically adjusts LLM safety levels by modifying only a sparse subset (~5%) of internal states during inference. Without adding token overhead or latency, our method enables fine-grained control over the safety-utility trade-off. Extensive evaluations across 9 LLMs, 10 jailbreak attacks, and 6 defense baselines demonstrate that Antidote achieves strong safety improvements while preserving benign task performance.
 
--
+- **[StressPrompt: Does Stress Impact Large Language Models and Human Performance Similarly?](https://arxiv.org/abs/2409.17167)** has been **published at AAAI 2025**!
 This study introduces **StressPrompt**, a psychologically inspired benchmark for probing how LLMs respond under stress-inducing conditions. Results show that LLMs, like humans, follow the Yerkes-Dodson law—performing best under moderate stress. The findings offer new insights into LLM cognitive alignment, robustness, and deployment in high-stakes environments.
 
```
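The sparse internal-state adjustment that the Jailbreak Antidote summary describes can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the precomputed `safety_direction`, and the random inputs are all hypothetical; only the idea of shifting a hidden state along a direction while touching ~5% of its components comes from the text.

```python
import numpy as np

def sparse_safety_shift(hidden, safety_direction, alpha=1.0, density=0.05):
    """Shift `hidden` along `safety_direction`, modifying only the
    top `density` fraction of the direction's components by magnitude.
    A sketch of sparse representation adjustment, not the paper's code."""
    k = max(1, int(density * safety_direction.size))
    # Indices of the k largest-magnitude components of the direction.
    idx = np.argsort(np.abs(safety_direction))[-k:]
    sparse_dir = np.zeros_like(safety_direction)
    sparse_dir[idx] = safety_direction[idx]
    # alpha would control the safety-utility trade-off at inference time.
    return hidden + alpha * sparse_dir

rng = np.random.default_rng(0)
hidden = rng.standard_normal(4096)     # one token's hidden state (illustrative size)
direction = rng.standard_normal(4096)  # assumed precomputed safety direction
shifted = sparse_safety_shift(hidden, direction, alpha=0.8)
# Only about 5% of 4096 dimensions (204 here) are changed.
print(np.count_nonzero(shifted - hidden))
```

In a real deployment this kind of shift would be applied inside a forward hook on selected transformer layers; the sketch only shows the sparse update itself.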
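The Yerkes-Dodson law that StressPrompt reports is an inverted-U relationship: performance rises with stress up to a moderate level, then falls. A toy quadratic curve makes the shape concrete; the functional form and its numbers are illustrative only, not the benchmark's data.

```python
# Toy inverted-U (Yerkes-Dodson) curve: performance peaks at moderate stress.
# The quadratic form and the 0..1 stress scale are illustrative assumptions.
def performance(stress):
    return 1.0 - (stress - 0.5) ** 2  # peak at stress = 0.5

levels = [i / 10 for i in range(11)]   # stress levels 0.0, 0.1, ..., 1.0
best = max(levels, key=performance)
print(best)  # the moderate level scores highest
```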