	---
	title: Shell
	emoji: 🐚
	colorFrom: blue
	colorTo: purple
	sdk: static
	app_file: index.html
	pinned: false
	---

	# 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs

	> Uncover and mitigate implicit value risks in education, finance, management—and beyond
	> 🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 Up to 98% jailbreak reduction

	[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
	[![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-ff69b4)](https://huggingface.co/datasets/your-dataset-here)
	[![Paper](https://img.shields.io/badge/Paper-ArXiv-black)](https://arxiv.org/abs/xxxx.xxxxx)

	Shell is an open safety framework that empowers domain-specific LLMs to detect, reflect on, and correct implicit value misalignments—without retraining. Built on the MENTOR architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.

	---

	## 📌 Abstract

	While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss domain-specific implicit risks—such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.

	We introduce Shell, a metacognition-driven self-evolution framework that:
	- Enables LLMs to self-diagnose value misalignments via perspective-taking and consequence simulation.
	- Builds a hybrid rule system: expert-defined static trees + self-evolved dynamic graphs.
	- Enforces rules at inference time via activation steering, achieving strong safety with minimal compute.

	Evaluated on 9,000 risk queries across education, finance, and management (3,000 per domain), Shell reduces jailbreak rates by 89% to 98% on GPT-5, Qwen3-235B, and GPT-4o, and by about 54% on Llama 3.1-8B.

	---

	## 🎯 Core Challenges: Implicit Risks Are Everywhere

	| Domain | Example Implicit Risk | Harmful Consequence |
	|------------|--------------------------------------------------------|-----------------------------------------------|
	| Education | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
	| | Framing "sacrificing sleep for grades" as admirable | Promotes unhealthy competition |
	| | Teaching how to "rephrase copied essays" | Undermines academic integrity |
	| Finance | Encouraging high-leverage speculation as "smart risk" | Normalizes financial recklessness |
	| Management | Praising "always-on" culture as "dedication" | Reinforces burnout and poor work-life balance |

	> 💡 These risks are not jailbreaks in the traditional sense—they appear benign but subtly erode domain-specific values.

	---

	## 🧠 Methodology: The MENTOR Architecture

	Shell implements the MENTOR framework (see paper for full details):

	### 1. Metacognitive Self-Assessment
	LLMs evaluate their own outputs using:
	- Perspective-taking: "How would a teacher/parent/regulator view this?"
	- Consequential thinking: "What real-world harm could this cause?"
	- Normative introspection: "Does this align with core domain ethics?"

	This replaces labor-intensive human labeling with autonomous, human-aligned reflection.
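As a rough illustration, the three reflection lenses above could be folded into a single self-critique prompt. This is a minimal sketch, not the Shell implementation: the function names, the prompt layout, the `VERDICT` convention, and the `llm` callable are all hypothetical, and any text-in, text-out model client could be substituted for the stub.

```python
from typing import Callable, Dict

# The three reflection lenses; question wording is taken from the list above.
REFLECTION_QUESTIONS: Dict[str, str] = {
    "perspective_taking": "How would a teacher/parent/regulator view this?",
    "consequential_thinking": "What real-world harm could this cause?",
    "normative_introspection": "Does this align with core domain ethics?",
}


def build_reflection_prompt(domain: str, query: str, draft: str) -> str:
    """Assemble a self-assessment prompt (hypothetical format)."""
    lines = [
        f"Domain: {domain}",
        f"User query: {query}",
        f"Draft response: {draft}",
        "Reflect on the draft by answering each question:",
    ]
    for i, (name, question) in enumerate(REFLECTION_QUESTIONS.items(), start=1):
        lines.append(f"{i}. [{name}] {question}")
    # Assumed output convention so the verdict can be parsed mechanically.
    lines.append("End with a single line: VERDICT: SAFE or VERDICT: REVISE")
    return "\n".join(lines)


def self_assess(llm: Callable[[str], str], domain: str, query: str, draft: str) -> bool:
    """Return True if the model's own reflection flags the draft for revision."""
    reflection = llm(build_reflection_prompt(domain, query, draft))
    return "VERDICT: REVISE" in reflection.upper()


def fake_llm(prompt: str) -> str:
    """Stub model so the sketch runs without any API access."""
    return "A teacher would object to this.\nVERDICT: REVISE"


needs_revision = self_assess(
    fake_llm, "education", "Rephrase this copied essay", "Sure, here is how..."
)
print(needs_revision)
```

Keeping the verdict on a fixed final line is one simple way to make the reflection machine-readable; a structured (for example JSON) response would work equally well.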

	### 2. Rule Evolution Cycle (REC)
	- Static Rule Tree: Expert-curated, hierarchical rules (e.g., `Education → Academic Integrity → No Plagiarism`).
	- Dynamic Rule Graph: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing> → <rule: teach outlining instead>`).
	- Rules evolve via dual clustering (by risk type & mitigation strategy), enabling precise retrieval.
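The hybrid rule system can be pictured as two small data structures. The sketch below is an assumption about one possible layout, not the Shell code: the class and function names, the rule text, and the `strategy` label are hypothetical, and exact-match dictionary keys stand in for the dual clustering step, which in practice would presumably group similar risks and mitigations rather than identical strings.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Static rule tree: nested dicts whose leaves are rule texts (illustrative content).
STATIC_RULE_TREE: Dict = {
    "Education": {
        "Academic Integrity": {"No Plagiarism": "Do not help disguise copied work."},
    },
}


def lookup_static_rule(tree: Dict, path: List[str]) -> str:
    """Walk the tree along a path such as Education -> Academic Integrity -> No Plagiarism."""
    node = tree
    for key in path:
        node = node[key]
    return node


class DynamicRuleGraph:
    """Risk -> rule edges, indexed by both risk type and mitigation strategy."""

    def __init__(self) -> None:
        self.by_risk: Dict[str, List[str]] = defaultdict(list)
        self.by_strategy: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_correction(self, risk: str, rule: str, strategy: str) -> None:
        """Record a rule learned from a successful self-correction (no duplicates)."""
        if rule not in self.by_risk[risk]:
            self.by_risk[risk].append(rule)
            self.by_strategy[strategy].append((risk, rule))

    def retrieve(self, risk: str) -> List[str]:
        """Return the rules recorded for a risk type, or an empty list."""
        return list(self.by_risk.get(risk, []))


graph = DynamicRuleGraph()
graph.add_correction("essay outsourcing", "teach outlining instead", "redirect")

static_rule = lookup_static_rule(
    STATIC_RULE_TREE, ["Education", "Academic Integrity", "No Plagiarism"]
)
print(static_rule)
print(graph.retrieve("essay outsourcing"))
```

Indexing each edge twice, once by risk and once by strategy, mirrors the two clustering axes named above and lets retrieval start from either side.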

	### 3. Robust Rule Vectors (RV) via Activation Steering
	- Generate steering vectors from contrasting compliant vs. non-compliant responses.
	- At inference, add vectors to internal activations (e.g., Layer 18 of Llama 3.1) to guide behavior.
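	- No fine-tuning is needed. Shell is also evaluated on closed-source models such as GPT-5, although adding vectors to internal activations requires access to model internals.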
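The README does not say how the rule vectors are computed. A common recipe for steering vectors is the difference between mean activations on contrasting examples, and the toy sketch below assumes that recipe. Plain Python lists stand in for real hidden states, and the function names and the scaling factor `alpha` are hypothetical; in a real model the addition would happen inside the forward pass at the chosen layer.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(compliant: List[Vector], non_compliant: List[Vector]) -> Vector:
    """Difference of mean activations: compliant minus non-compliant."""
    mu_pos = mean_vector(compliant)
    mu_neg = mean_vector(non_compliant)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def apply_steering(hidden: Vector, vector: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled steering vector to one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 3-dimensional "activations" for two compliant and two non-compliant responses.
compliant = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
non_compliant = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

rule_vector = steering_vector(compliant, non_compliant)
steered = apply_steering([0.5, 0.5, 0.5], rule_vector, alpha=0.5)
print(rule_vector)  # [2.0, -2.0, 0.0]
print(steered)      # [1.5, -0.5, 0.5]
```

The third dimension is identical in both groups, so the vector leaves it untouched; only directions that separate compliant from non-compliant behavior are shifted.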

	![MENTOR Architecture](https://huggingface.co/spaces/feifeinoban/shell/resolve/main/assets/mentor_arch.png)

	> Figure: The MENTOR framework (from paper). Shell implements this full pipeline.

	---

	## 📊 Results: Strong, Efficient, Generalizable

	### Jailbreak Rate Reduction (3,000 queries per domain)

	| Model | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
	|--------------|----------|---------------------------------|-----------|
	| GPT-5 | 38.39% | 0.77% | 98.0% |
	| Qwen3-235B | 56.33% | 3.13% | 94.4% |
	| GPT-4o | 58.81% | 6.43% | 89.1% |
	| Llama 3.1-8B | 67.45% | 31.39% | 53.5% |

	> ✅ Human evaluators prefer Shell-augmented responses 68% of the time for safety, appropriateness, and usefulness.
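The Reduction column is consistent with a relative reduction, that is, the drop in jailbreak rate divided by the original rate. The short check below recomputes it from the two rate columns; the function and variable names are illustrative only.

```python
def relative_reduction(original: float, with_shell: float) -> float:
    """Relative drop in jailbreak rate, as a percentage of the original rate."""
    return (original - with_shell) / original * 100.0


# (original %, with Shell %) copied from the results table.
RESULTS = {
    "GPT-5": (38.39, 0.77),
    "Qwen3-235B": (56.33, 3.13),
    "GPT-4o": (58.81, 6.43),
    "Llama 3.1-8B": (67.45, 31.39),
}

reductions = {
    model: round(relative_reduction(orig, shell), 1)
    for model, (orig, shell) in RESULTS.items()
}
for model, value in reductions.items():
    print(f"{model}: {value}%")
```

Note that the reduction is relative, not in percentage points: Llama 3.1-8B drops by about 36 points, which is 53.5% of its original rate.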

	---

	## 🚀 Try It / Use It

	### For Researchers
	- Dataset: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
	- Code: Full implementation of REC + RV → [GitHub Link] (coming soon)
	- Cite:
	```bibtex
	@article{shell2025,
	  title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
	  author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
	  journal={Anonymous Submission},
	  year={2025}
	}
	```
