feifeinoban committed on
Commit 2198804 · verified · Parent: 03d3d14

Update README.md

Files changed (1): README.md (+90 −21)
README.md CHANGED
@@ -1,27 +1,96 @@
  ---
- title: Shell
- emoji: 🧠
- colorFrom: yellow
- colorTo: indigo
- sdk: static
- pinned: false
- license: apache-2.0
- short_description: 'MENTOR: LLM risk framework'
  ---

- # Nerfies

- This is the repository that contains source code for the [Nerfies website](https://nerfies.github.io).

- If you find Nerfies useful for your work please cite:
- ```
- @article{park2021nerfies
- author = {Park, Keunhong and Sinha, Utkarsh and Barron, Jonathan T. and Bouaziz, Sofien and Goldman, Dan B and Seitz, Steven M. and Martin-Brualla, Ricardo},
- title = {Nerfies: Deformable Neural Radiance Fields},
- journal = {ICCV},
- year = {2021},
- }
- ```

- # Website License
- <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
+ # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
+
+ > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
+ > 🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 90%+ jailbreak reduction
+
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
+ [![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-ff69b4)](https://huggingface.co/datasets/your-dataset-here)
+ [![Paper](https://img.shields.io/badge/Paper-ArXiv-black)](https://arxiv.org/abs/xxxx.xxxxx)
+
+ Shell is an open safety framework that empowers domain-specific LLMs to **detect, reflect on, and correct implicit value misalignments**—without retraining. Built on the **MENTOR** architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.
+
+ ---
+
+ ## 📌 Abstract
+
+ While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss **domain-specific implicit risks**—such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.
+
+ We introduce **Shell**, a metacognition-driven self-evolution framework that:
+ - Enables LLMs to **self-diagnose value misalignments** via perspective-taking and consequence simulation.
+ - Builds a **hybrid rule system**: expert-defined static trees + self-evolved dynamic graphs.
+ - Enforces rules at inference time via **activation steering**, achieving strong safety with minimal compute.
+
+ Evaluated on 9,000 risk queries across **education, finance, and management**, Shell reduces average jailbreak rates by **>90%** on models including GPT-5, Qwen3, and Llama 3.1.
+
+ ---
+
+ ## 🎯 Core Challenges: Implicit Risks Are Everywhere
+
+ | Domain | Example Implicit Risk | Harmful Consequence |
+ |----------------|---------------------------------------------------------|------------------------------------------------|
+ | **Education**  | Suggesting clever comebacks that escalate bullying      | Deteriorates peer relationships                |
+ |                | Framing “sacrificing sleep for grades” as admirable     | Promotes unhealthy competition                 |
+ |                | Teaching how to “rephrase copied essays”                | Undermines academic integrity                  |
+ | **Finance**    | Encouraging high-leverage speculation as “smart risk”   | Normalizes financial recklessness              |
+ | **Management** | Praising “always-on” culture as “dedication”            | Reinforces burnout and poor work-life balance  |
+
+ > 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.
+
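To see why such outputs evade generic moderation, consider a toy comparison between a generic harm filter and a domain value check. This is an illustrative sketch only: the rule phrases and function names are hypothetical, not Shell's actual rule set.

```python
# Illustrative sketch: hypothetical rule phrases, not Shell's actual rule set.
GENERIC_HARM_TERMS = {"violence", "hate", "weapons"}

DOMAIN_RULES = {
    "education": ["rephrase copied essay", "outsource your essay", "sacrifice sleep for grades"],
    "finance": ["high-leverage speculation", "all-in leverage"],
    "management": ["always-on culture"],
}

def generic_filter(text: str) -> bool:
    """Generic harm filter: only flags explicit harm terms."""
    return any(term in text.lower() for term in GENERIC_HARM_TERMS)

def domain_filter(text: str, domain: str) -> bool:
    """Domain value check: flags implicit-risk phrasings for the given vertical."""
    return any(phrase in text.lower() for phrase in DOMAIN_RULES.get(domain, []))

query = "Here's how to rephrase copied essays so they pass plagiarism checks."
# The query passes the generic filter but trips the education domain rules.
```

The point of the sketch: the risky query contains no explicitly harmful terms, so only a domain-aware rule set catches it.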
  ---
+
+ ## 🧠 Methodology: The MENTOR Architecture
+
+ Shell implements the **MENTOR** framework (see the paper for full details):
+
+ ### 1. **Metacognitive Self-Assessment**
+ LLMs evaluate their own outputs using:
+ - **Perspective-taking**: “How would a teacher/parent/regulator view this?”
+ - **Consequential thinking**: “What real-world harm could this cause?”
+ - **Normative introspection**: “Does this align with core domain ethics?”
+
+ This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
+
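One way to operationalize this self-assessment is a reflection prompt built from the three questions above. A minimal sketch follows; the exact prompt wording and the `VERDICT:` parsing convention are assumptions for illustration, not Shell's actual implementation:

```python
# Sketch of a metacognitive self-assessment step; wording is illustrative.
REFLECTION_TEMPLATE = """Review your draft answer before sending it.
1. Perspective-taking: How would a teacher, parent, or regulator view this?
2. Consequential thinking: What real-world harm could this cause?
3. Normative introspection: Does this align with core {domain} ethics?

Draft answer:
{draft}

Reply with VERDICT: SAFE or VERDICT: REVISE, then your reasoning."""

def build_reflection_prompt(draft: str, domain: str) -> str:
    """Build the self-assessment prompt for a given draft and domain."""
    return REFLECTION_TEMPLATE.format(domain=domain, draft=draft)

def parse_verdict(model_reply: str) -> bool:
    """True if the model judged its own draft safe to send."""
    return "VERDICT: SAFE" in model_reply.upper()
```

In use, the model's reply to this prompt either clears the draft or routes it back for revision, which is what makes the reflection loop autonomous.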
+ ### 2. **Rule Evolution Cycle (REC)**
+ - **Static Rule Tree**: Expert-curated, hierarchical rules (e.g., `Education → Academic Integrity → No Plagiarism`).
+ - **Dynamic Rule Graph**: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing> → <rule: teach outlining instead>`).
+ - Rules evolve via **dual clustering** (by risk type and mitigation strategy), enabling precise retrieval.
+
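The hybrid rule store can be sketched with simple data structures: a nested mapping for the static tree and a risk-to-rule mapping for edges mined from self-corrections. Structure and names here are illustrative assumptions, not Shell's data model:

```python
# Illustrative sketch of the hybrid rule store (not Shell's actual data model).
# Static rule tree: expert-curated hierarchy, e.g. Education -> Academic Integrity.
STATIC_TREE = {
    "Education": {"Academic Integrity": ["No plagiarism", "No essay outsourcing"]},
    "Finance": {"Risk Prudence": ["No reckless leverage advice"]},
}

# Dynamic rule graph: <risk> -> <rule> edges mined from successful self-corrections.
dynamic_graph = {}

def record_correction(risk, rule):
    """Add a <risk> -> <rule> edge learned from a successful self-correction."""
    dynamic_graph.setdefault(risk, []).append(rule)

def retrieve_rules(domain, risk):
    """Merge static-tree rules for the domain with dynamic rules for the risk."""
    static = [r for branch in STATIC_TREE.get(domain, {}).values() for r in branch]
    return static + dynamic_graph.get(risk, [])

record_correction("essay outsourcing", "teach outlining instead")
```

Retrieval then returns both the expert baseline and any learned mitigations for the detected risk; in the full system, the dual clustering step would group these edges by risk type and mitigation strategy before retrieval.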
+ ### 3. **Robust Rule Vectors (RV) via Activation Steering**
+ - Generate **steering vectors** by contrasting activations on compliant vs. non-compliant responses.
+ - At inference, **add the vectors to internal activations** (e.g., Layer 18 of Llama 3.1) to guide behavior.
+ - **No fine-tuning needed**—and the rule-based components also extend to closed-source models like GPT-5 (direct steering requires access to model activations).
+
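Mechanically, a steering vector of this kind is commonly computed as the difference between mean activations on compliant and non-compliant responses, then added (scaled) to the hidden state at a chosen layer. A pure-Python sketch under those assumptions; the layer choice and scaling factor are illustrative:

```python
# Toy sketch of activation steering; real use operates on transformer hidden states.
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(compliant, noncompliant):
    """Rule vector = mean(compliant activations) - mean(non-compliant activations)."""
    mc, mn = mean_vec(compliant), mean_vec(noncompliant)
    return [c - n for c, n in zip(mc, mn)]

def steer(hidden, rv, alpha=1.0):
    """At inference, add the scaled rule vector to a layer's hidden state."""
    return [h + alpha * r for h, r in zip(hidden, rv)]

rv = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]])  # [2.0, -3.0]
steered = steer([0.5, 0.5], rv, alpha=0.5)  # [1.5, -1.0]
```

The scaling factor `alpha` trades off steering strength against fluency, which is why it is typically tuned per model and layer.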
+ ![MENTOR Architecture](assets/mentor_arch.png)
+
+ > *Figure: The MENTOR framework (from the paper). Shell implements this full pipeline.*
+
  ---

+ ## 📊 Results: Strong, Efficient, Generalizable
+
+ ### Jailbreak Rate Reduction (3,000 queries per domain)
+
+ | Model            | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
+ |------------------|----------|---------------------------------|-----------|
+ | **GPT-5**        | 38.39%   | **0.77%**                       | **98.0%** |
+ | **Qwen3-235B**   | 56.33%   | **3.13%**                       | **94.4%** |
+ | **GPT-4o**       | 58.81%   | **6.43%**                       | **89.1%** |
+ | **Llama 3.1-8B** | 67.45%   | **31.39%**                      | **53.5%** |
+
+ > ✅ Human evaluators prefer Shell-augmented responses **68% of the time** for safety, appropriateness, and usefulness.
+
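The Reduction column is the relative drop in jailbreak rate, (original − with Shell) / original. For example, GPT-5: (38.39 − 0.77) / 38.39 ≈ 98.0%. A quick check of all four rows:

```python
# Verify the Reduction column: relative drop = (original - with_shell) / original.
results = {
    "GPT-5": (38.39, 0.77),
    "Qwen3-235B": (56.33, 3.13),
    "GPT-4o": (58.81, 6.43),
    "Llama 3.1-8B": (67.45, 31.39),
}

reductions = {
    model: round((orig - shell) / orig * 100, 1)
    for model, (orig, shell) in results.items()
}
# {'GPT-5': 98.0, 'Qwen3-235B': 94.4, 'GPT-4o': 89.1, 'Llama 3.1-8B': 53.5}
```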
+ ---
+
+ ## 🚀 Try It / Use It
+
+ ### For Researchers
+ - **Dataset**: 9,000 implicit-risk queries across 3 domains [HF Dataset Link]
+ - **Code**: Full implementation of REC + RV → [GitHub Link] (coming soon)
+ - **Cite**:
+ ```bibtex
+ @article{shell2025,
+   title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
+   author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
+   journal={Anonymous Submission},
+   year={2025}
+ }
+ ```