---
title: Shell
emoji: 🐚
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
---
# 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs

> **Uncover and mitigate implicit value risks in education, finance, management, and beyond**
> 🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 90%+ jailbreak reduction

[](LICENSE)
[](https://huggingface.co/datasets/your-dataset-here)
[](https://arxiv.org/abs/xxxx.xxxxx)

Shell is an open safety framework that empowers domain-specific LLMs to **detect, reflect on, and correct implicit value misalignments** without retraining. Built on the **MENTOR** architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.

---
## 📌 Abstract

While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss **domain-specific implicit risks**, such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.

We introduce **Shell**, a metacognition-driven self-evolution framework that:

- Enables LLMs to **self-diagnose value misalignments** via perspective-taking and consequence simulation.
- Builds a **hybrid rule system**: expert-defined static trees plus self-evolved dynamic graphs.
- Enforces rules at inference time via **activation steering**, achieving strong safety with minimal compute.

Evaluated on 9,000 risk queries across **education, finance, and management**, Shell reduces average jailbreak rates by **>90%** on models including GPT-5, Qwen3, and Llama 3.1.

---

## 🎯 Core Challenges: Implicit Risks Are Everywhere
| Domain | Example Implicit Risk | Harmful Consequence |
|----------------|--------------------------------------------------------|-----------------------------------------------|
| **Education**  | Suggesting clever comebacks that escalate bullying     | Deteriorates peer relationships               |
|                | Framing "sacrificing sleep for grades" as admirable    | Promotes unhealthy competition                |
|                | Teaching how to "rephrase copied essays"               | Undermines academic integrity                 |
| **Finance**    | Encouraging high-leverage speculation as "smart risk"  | Normalizes financial recklessness             |
| **Management** | Praising "always-on" culture as "dedication"           | Reinforces burnout and poor work-life balance |
> 💡 These risks are **not jailbreaks** in the traditional sense: they appear benign but subtly erode domain-specific values.

---

## 🧠 Methodology: The MENTOR Architecture

Shell implements the **MENTOR** framework (see the paper for full details):
### 1. **Metacognitive Self-Assessment**

LLMs evaluate their own outputs using:

- **Perspective-taking**: "How would a teacher/parent/regulator view this?"
- **Consequential thinking**: "What real-world harm could this cause?"
- **Normative introspection**: "Does this align with core domain ethics?"

This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
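The reflection loop above can be sketched in a few lines. This is a minimal illustration, not the framework's actual API: `ask_model` is a hypothetical hook for any chat-completion backend, and the prompt wording simply mirrors the three reflection questions.

```python
# Sketch of the metacognitive self-assessment loop (illustrative only).
# `ask_model` is a hypothetical callable wrapping any LLM backend.

REFLECTION_PROMPTS = {
    "perspective_taking": "How would a teacher, parent, or regulator view this?",
    "consequential_thinking": "What real-world harm could this cause?",
    "normative_introspection": "Does this align with core domain ethics?",
}

def build_assessment_prompt(draft_response: str, domain: str) -> str:
    """Assemble one self-assessment prompt covering all three reflection modes."""
    questions = "\n".join(f"- {q}" for q in REFLECTION_PROMPTS.values())
    return (
        f"You are reviewing your own draft answer in the {domain} domain.\n"
        f"Draft answer:\n{draft_response}\n\n"
        "Reflect on the following before deciding whether to revise:\n"
        f"{questions}\n"
        "Reply with SAFE or REVISE, followed by a one-line rationale."
    )

def self_assess(draft_response: str, domain: str, ask_model) -> bool:
    """Return True if the model flags its own draft for revision."""
    verdict = ask_model(build_assessment_prompt(draft_response, domain))
    return verdict.strip().upper().startswith("REVISE")
```

In practice the `REVISE` branch would trigger a rewrite conditioned on the rationale, feeding successful corrections into the rule evolution step below.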
### 2. **Rule Evolution Cycle (REC)**

- **Static Rule Tree**: Expert-curated, hierarchical rules (e.g., `Education → Academic Integrity → No Plagiarism`).
- **Dynamic Rule Graph**: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing> → <rule: teach outlining instead>`).
- Rules evolve via **dual clustering** (by risk type and mitigation strategy), enabling precise retrieval.
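A hybrid store like the one described can be sketched as follows. The class name, keys, and example rules are all illustrative assumptions; the real system clusters rules rather than matching keys exactly.

```python
# Minimal sketch of a hybrid rule store: a static expert tree plus a
# dynamic graph keyed by (risk type, mitigation strategy). Illustrative
# only; names and keys are not the framework's actual schema.
from collections import defaultdict

class RuleStore:
    def __init__(self):
        # Static tree: domain -> category -> expert-curated rules.
        self.static_tree = {
            "education": {"academic_integrity": ["No plagiarism assistance"]},
        }
        # Dynamic graph: (risk_type, mitigation) -> rules harvested from
        # successful self-corrections ("dual clustering" by both keys).
        self.dynamic = defaultdict(list)

    def evolve(self, risk_type: str, mitigation: str, rule: str):
        """Record a rule learned from a successful self-correction."""
        self.dynamic[(risk_type, mitigation)].append(rule)

    def retrieve(self, domain: str, risk_type: str) -> list:
        """Collect static rules for the domain plus dynamic rules for the risk."""
        rules = []
        for category_rules in self.static_tree.get(domain, {}).values():
            rules.extend(category_rules)
        for (risk, _mitigation), learned in self.dynamic.items():
            if risk == risk_type:
                rules.extend(learned)
        return rules
```

Indexing by both risk type and mitigation strategy is what lets retrieval stay precise: two risks with the same surface topic but different fixes land in different buckets.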
### 3. **Robust Rule Vectors (RV) via Activation Steering**

- Generate **steering vectors** from contrasting compliant vs. non-compliant responses.
- At inference, **add vectors to internal activations** (e.g., Layer 18 of Llama 3.1) to guide behavior.
- **No fine-tuning needed**; works on closed-source models like GPT-5.
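The core arithmetic of the steering step is simple to sketch. Assuming the common mean-difference construction, with plain NumPy arrays standing in for a transformer layer's hidden states (the layer index and shapes here are illustrative):

```python
# Sketch of steering-vector construction and application. Real systems
# hook a specific transformer layer; arrays stand in for activations.
import numpy as np

def steering_vector(compliant: np.ndarray, non_compliant: np.ndarray) -> np.ndarray:
    """Mean activation difference between compliant and non-compliant runs.

    Both inputs have shape (n_examples, hidden_dim).
    """
    return compliant.mean(axis=0) - non_compliant.mean(axis=0)

def steer(activations: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled rule vector to every token's hidden state.

    `activations` has shape (n_tokens, hidden_dim); `alpha` controls strength.
    """
    return activations + alpha * vec
```

Because only a vector addition happens at inference time, the per-token overhead is negligible compared with decoding itself, which is what makes the approach cheap relative to fine-tuning.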
> *Figure: The MENTOR framework (from the paper). Shell implements this full pipeline.*
---

## 📊 Results: Strong, Efficient, Generalizable

### Jailbreak Rate Reduction (3,000 queries per domain)
| Model            | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
|------------------|----------|---------------------------------|-----------|
| **GPT-5**        | 38.39%   | **0.77%**                       | **98.0%** |
| **Qwen3-235B**   | 56.33%   | **3.13%**                       | **94.4%** |
| **GPT-4o**       | 58.81%   | **6.43%**                       | **89.1%** |
| **Llama 3.1-8B** | 67.45%   | **31.39%**                      | **53.5%** |

> ✅ Human evaluators prefer Shell-augmented responses **68% of the time** for safety, appropriateness, and usefulness.
---

## 🚀 Try It / Use It

### For Researchers

- **Dataset**: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
- **Code**: Full implementation of REC + RV → [GitHub Link] (coming soon)
- **Cite**:
```bibtex
@article{shell2025,
  title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
  author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
  journal={Anonymous Submission},
  year={2025}
}
```