---
base_model: meta-llama/Llama-3.3-70B-Instruct
library_name: peft
license: fair-noncommercial-research-license
datasets:
  - yahma/alpaca-cleaned
extra_gated_fields:
  First Name: text
  Last Name: text
  Date of birth: date_picker
  Country: country
  Affiliation: text
  I accept the terms and conditions: checkbox
  geo: ip_location
language:
  - en
tags:
  - facebook
  - meta
  - pytorch
  - llama
  - llama-3
---

# TamedLlama-70B-Instruct

This is the repository for TamedLlama-70B-Instruct, a fine-tuned variant of Llama-3.3-70B-Instruct that is robust against prompt injection attacks. See our TamedLlama paper for more information.

We also release a smaller TamedLlama-8B-Instruct model, fine-tuned from Llama-3-8B-Instruct, for use in resource-constrained settings.

## Utility Evaluation (higher is better)

| Category | Benchmark | Metric | Llama 3.3 70B Instruct | TamedLlama 70B Instruct | GPT-4o-mini | GPT-4o (2024-11-20) |
|---|---|---|---|---|---|---|
| General Knowledge | MMLU (0-shot, CoT) | macro_avg/acc | 86.2 | 85.0 | 82.0[1] | 85.7[2] |
| | MMLU Pro (5-shot, CoT) | macro_avg/acc | 67.8 | 67.1 | 63.1[3] | 77.9[3] |
| | IFEval | | 91.1 | 86.4 | - | - |
| | BBH (3-shot, CoT) | acc | 86.2 | 85.1 | - | - |
| | GPQA (0-shot, CoT) | acc | 62.3 | 58.5 | 40.2[1] | 46.0[2] |
| Instruction Following | AlpacaEval2 | win_rate | 44.8 | 43.3 | 44.7 | 56.2 |
| | SEP | win_rate | 64.9 | 62.5 | 65.9 | 64.9 |
| Agentic Workflows | AgentDojo (w/o attack) | success_rate | 56.7 | 72.2 | 67.0 | 79.4 |
| | AgentDojo (w/ attack) | success_rate | 39.0 | 64.3 | 51.6 | 67.4 |
| | WASP | success_rate | 48.6 | 51.4 | 27.0 | 32.4 |

## Security Evaluation (lower is better)

| Category | Benchmark | Metric | Llama 3.3 70B Instruct | TamedLlama 70B Instruct | GPT-4o-mini | GPT-4o (2024-11-20) |
|---|---|---|---|---|---|---|
| Instruction Following | AlpacaFarm | ASR | 94.2 | 0.0 | 0.5 | 0.0 |
| | SEP (start) | ASR | 68.3 | 5.0 | 14.6 | 14.8 |
| | SEP (end) | ASR | 87.1 | 2.5 | 9.1 | 14.4 |
| | TaskTracker | ASR | 21.9 | 0.2 | 0.3 | 0.6 |
| | CyberSecEval2 | ASR | 52.7 | 7.2 | 25.5 | 20.0 |
| Agentic Workflows | InjecAgent (base) | ASR-total | 21.7 | 1.3 | 0.9 | 18.2 |
| | InjecAgent (enhanced) | ASR-total | 50.6 | 2.8 | 3.3 | 22.7 |
| | AgentDojo | ASR | 14.1 | 1.3 | 11.9 | 20.4 |
| | WASP (intermediate) | ASR | 25.0 | 2.4 | 53.6 | 17.9 |
| | WASP (end2end) | ASR | 4.8 | 1.2 | 0.0 | 2.4 |