---
title: Shell
emoji: 🐚
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
---

# 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs

> **Uncover and mitigate implicit value risks in education, finance, management, and beyond**
> 🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 90%+ jailbreak reduction

[](LICENSE)
[](https://huggingface.co/datasets/your-dataset-here)
[](https://arxiv.org/abs/xxxx.xxxxx)

Shell is an open safety framework that empowers domain-specific LLMs to **detect, reflect on, and correct implicit value misalignments** without retraining. Built on the **MENTOR** architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.

---

## 📌 Abstract

While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss **domain-specific implicit risks**, such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.

We introduce **Shell**, a metacognition-driven self-evolution framework that:
- Enables LLMs to **self-diagnose value misalignments** via perspective-taking and consequence simulation.
- Builds a **hybrid rule system**: expert-defined static trees + self-evolved dynamic graphs.
- Enforces rules at inference time via **activation steering**, achieving strong safety with minimal compute.

Evaluated on 9,000 risk queries across **education, finance, and management**, Shell reduces average jailbreak rates by **>90%** on models including GPT-5, Qwen3, and Llama 3.1.

---

## 🎯 Core Challenges: Implicit Risks Are Everywhere

| Domain         | Example Implicit Risk                                  | Harmful Consequence                           |
|----------------|--------------------------------------------------------|-----------------------------------------------|
| **Education**  | Suggesting clever comebacks that escalate bullying     | Deteriorates peer relationships               |
|                | Framing "sacrificing sleep for grades" as admirable    | Promotes unhealthy competition                |
|                | Teaching how to "rephrase copied essays"               | Undermines academic integrity                 |
| **Finance**    | Encouraging high-leverage speculation as "smart risk"  | Normalizes financial recklessness             |
| **Management** | Praising "always-on" culture as "dedication"           | Reinforces burnout and poor work-life balance |

> 💡 These risks are **not jailbreaks** in the traditional sense: they appear benign but subtly erode domain-specific values.

---

## 🧠 Methodology: The MENTOR Architecture

Shell implements the **MENTOR** framework (see paper for full details):

### 1. **Metacognitive Self-Assessment**
LLMs evaluate their own outputs using:
- **Perspective-taking**: "How would a teacher/parent/regulator view this?"
- **Consequential thinking**: "What real-world harm could this cause?"
- **Normative introspection**: "Does this align with core domain ethics?"

This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
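
To make this loop concrete, here is a minimal, illustrative sketch (not Shell's exact prompts or pipeline): a generic chat model drafts an answer, audits its own draft from the three angles above, and revises if any check flags a risk. The `chat(messages)` helper, the rubric wording, and the JSON reply format are assumptions for illustration.

```python
# Illustrative only: a minimal metacognitive self-assessment loop.
# `chat(messages)` stands in for any chat-completion call that returns a
# string; the rubric wording is an assumption, not Shell's exact prompts.
import json

ASSESSMENT_PROMPT = """You are auditing your own draft answer for implicit value risks.
Domain: {domain}
User query: {query}
Draft answer: {draft}

Evaluate the draft from three angles:
1. Perspective-taking: how would a teacher, parent, or regulator view it?
2. Consequential thinking: what real-world harm could it cause?
3. Normative introspection: does it align with core {domain} ethics?

Reply with JSON only: {{"risk_found": true/false, "risk": "...", "revision_hint": "..."}}"""


def metacognitive_answer(chat, query, domain, max_rounds=2):
    """Draft, self-assess, and revise until no implicit risk is flagged."""
    draft = chat([{"role": "user", "content": query}])
    for _ in range(max_rounds):
        report = json.loads(chat([{  # assumes the model replies with valid JSON
            "role": "user",
            "content": ASSESSMENT_PROMPT.format(domain=domain, query=query, draft=draft),
        }]))
        if not report.get("risk_found"):
            break  # all three metacognitive checks passed
        # Revise the draft using the model's own critique as guidance.
        draft = chat([{
            "role": "user",
            "content": (f"Rewrite this answer to avoid the risk '{report['risk']}' "
                        f"({report['revision_hint']}) while staying helpful:\n{draft}"),
        }])
    return draft
```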

### 2. **Rule Evolution Cycle (REC)**
- **Static Rule Tree**: Expert-curated, hierarchical rules (e.g., `Education → Academic Integrity → No Plagiarism`).
- **Dynamic Rule Graph**: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing> → <rule: teach outlining instead>`).
- Rules evolve via **dual clustering** (by risk type & mitigation strategy), enabling precise retrieval (see the sketch below).
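
A toy sketch of how the two rule stores could be shaped, purely to make the data structures concrete: the static tree is a nested expert-curated mapping, the dynamic graph records `(risk, mitigation)` edges mined from successful self-corrections, and retrieval here uses naive keyword overlap as a stand-in for the dual clustering described above. All names and the similarity heuristic are assumptions, not Shell's implementation.

```python
# Illustrative data shapes for the hybrid rule system; names and the
# retrieval heuristic are assumptions, not Shell's implementation.
from dataclasses import dataclass, field

# Static rule tree: expert-curated, hierarchical (domain -> topic -> rules).
STATIC_RULES = {
    "Education": {
        "Academic Integrity": ["Do not produce or disguise plagiarized work."],
        "Wellbeing": ["Do not frame sleep deprivation as a study strategy."],
    },
}


@dataclass
class DynamicRuleGraph:
    """Edges from a risk description to the mitigation that resolved it."""
    edges: list[tuple[str, str]] = field(default_factory=list)

    def add_correction(self, risk: str, mitigation: str) -> None:
        # Each successful self-correction becomes a reusable rule edge,
        # e.g. ("essay outsourcing", "teach outlining instead").
        self.edges.append((risk, mitigation))

    def retrieve(self, risk_query: str, top_k: int = 3) -> list[str]:
        # Placeholder retrieval: keyword overlap stands in for dual
        # clustering by risk type and mitigation strategy.
        scored = [
            (len(set(risk_query.lower().split()) & set(risk.lower().split())), rule)
            for risk, rule in self.edges
        ]
        return [rule for score, rule in sorted(scored, reverse=True)[:top_k] if score > 0]


graph = DynamicRuleGraph()
graph.add_correction("essay outsourcing request", "teach outlining instead of writing the essay")
print(graph.retrieve("student asks to outsource an essay"))
```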

### 3. **Robust Rule Vectors (RV) via Activation Steering**
- Generate **steering vectors** by contrasting compliant vs. non-compliant responses.
- At inference, **add vectors to internal activations** (e.g., Layer 18 of Llama 3.1) to guide behavior.
- **No fine-tuning needed**: works on closed-source models like GPT-5 (the sketch below covers the open-weight case).
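
For open-weight models, the usual contrastive recipe behind such vectors can be sketched as follows: average the hidden states at one layer over compliant vs. non-compliant responses, take the difference as the rule vector, and add a scaled copy back into the residual stream with a forward hook during generation. The model choice, layer index (18, as in the example above), scale, and contrast pairs are assumptions; closed-source models would instead receive the retrieved rules through the prompt.

```python
# Sketch of a contrastive steering vector on an open-weight model.
# Layer index, scale, and example pairs are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # any decoder-only HF model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
LAYER, SCALE = 18, 4.0


def mean_hidden(texts, layer):
    """Average hidden state at `layer` over the last token of each text."""
    states = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)


compliant = ["I can help you plan your own essay outline step by step."]
noncompliant = ["Sure, here is a rewritten copy of that essay you can submit."]
rule_vector = mean_hidden(compliant, LAYER) - mean_hidden(noncompliant, LAYER)


def steer(module, inputs, output):
    # Add the scaled rule vector to the layer's residual-stream output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * rule_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = tok("How do I rephrase a copied essay so it passes?", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=80)[0], skip_special_tokens=True))
handle.remove()
```

In practice the vector would be averaged over many contrast pairs per rule rather than a single pair, and the layer and scale tuned per model.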

> *Figure: The MENTOR framework (from paper). Shell implements this full pipeline.*

---

## 📊 Results: Strong, Efficient, Generalizable

### Jailbreak Rate Reduction (3,000 queries per domain)

| Model            | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
|------------------|----------|---------------------------------|-----------|
| **GPT-5**        | 38.39%   | **0.77%**                       | **98.0%** |
| **Qwen3-235B**   | 56.33%   | **3.13%**                       | **94.4%** |
| **GPT-4o**       | 58.81%   | **6.43%**                       | **89.1%** |
| **Llama 3.1-8B** | 67.45%   | **31.39%**                      | **53.5%** |

> ✅ Human evaluators prefer Shell-augmented responses **68% of the time** for safety, appropriateness, and usefulness.

---

## 🚀 Try It / Use It

### For Researchers
- **Dataset**: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
- **Code**: Full implementation of REC + RV → [GitHub Link] (coming soon)
- **Cite**:
```bibtex
@article{shell2025,
  title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
  author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
  journal={Anonymous Submission},
  year={2025}
}
```