thu-coai/ShieldAgent

nonstopfor committed on Feb 20, 2025

Commit a62f841 · verified · 1 Parent(s): 0e2ce4b

Update README.md

Files changed (1)

README.md +1 -1

@@ -3,7 +3,7 @@ license: mit
 ---
 # Model Information
-This repository provides the ShieldAgent model, a fine-tuned safety judgment model for assessing behavioral safety of LLM agents and generating detailed explanations ([paper link](https://arxiv.org/pdf/2412.14470)). ShieldAgent is initialized from [Qwen-2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and trained on 4000 agent interaction records with manual labels and analyses generated by GPT-4o. This model achieves an accuracy of 91.5% on a test set of 200 interaction records from Gemini-1.5-Flash, significantly surpassing GPT-4o, which attains an accuracy of 75.5% on the same test set. This demonstrates ShieldAgent's strong performance on agent behavioral safety judgment. Please refer to our [Github Repository](https://github.com/thu-coai/Agent-SafetyBench) for more detailed information.
+This repository provides the ShieldAgent model, a fine-tuned safety judgment model for assessing behavioral safety of LLM agents and generating detailed explanations, applied in [Agent-SafetyBench](https://arxiv.org/pdf/2412.14470). ShieldAgent is initialized from [Qwen-2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) and trained on 4000 agent interaction records (including tool calling requests and results) with manual labels and analyses generated by GPT-4o. This model achieves an accuracy of 91.5% on a test set of 200 interaction records from Gemini-1.5-Flash, significantly surpassing GPT-4o, which attains an accuracy of 75.5% on the same test set. This demonstrates ShieldAgent's strong performance on agent behavioral safety judgment. Please refer to our [Github Repository](https://github.com/thu-coai/Agent-SafetyBench) for more detailed information.
 # Uses