feifeinoban committed on
Commit 7d07fb9 · verified · 1 Parent(s): 760ea7d

Delete README.md

Files changed (1)
  1. README.md +0 -106
README.md DELETED
@@ -1,106 +0,0 @@
---
title: Shell
emoji: 🐚
colorFrom: blue
colorTo: purple
sdk: static
app_file: index.html
pinned: false
---
10
-
11
- # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
12
-
13
- > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
14
- > 🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 90%+ jailbreak reduction
15
-
16
- [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
17
- [![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-ff69b4)](https://huggingface.co/datasets/your-dataset-here)
18
- [![Paper](https://img.shields.io/badge/Paper-ArXiv-black)](https://arxiv.org/abs/xxxx.xxxxx)
19
-

Shell is an open safety framework that empowers domain-specific LLMs to **detect, reflect on, and correct implicit value misalignments**—without retraining. Built on the **MENTOR** architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.

---

## 📌 Abstract

While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss **domain-specific implicit risks**—such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.

We introduce **Shell**, a metacognition-driven self-evolution framework that:
- Enables LLMs to **self-diagnose value misalignments** via perspective-taking and consequence simulation.
- Builds a **hybrid rule system**: expert-defined static trees + self-evolved dynamic graphs.
- Enforces rules at inference time via **activation steering**, achieving strong safety with minimal compute.

Evaluated on 9,000 risk queries across **education, finance, and management**, Shell reduces average jailbreak rates by **>90%** on models including GPT-5, Qwen3, and Llama 3.1.

---

## 🎯 Core Challenges: Implicit Risks Are Everywhere

| Domain | Example Implicit Risk | Harmful Consequence |
|----------------|--------------------------------------------------------|-----------------------------------------------|
| **Education**  | Suggesting clever comebacks that escalate bullying     | Deteriorates peer relationships               |
|                | Framing "sacrificing sleep for grades" as admirable    | Promotes unhealthy competition                |
|                | Teaching how to "rephrase copied essays"               | Undermines academic integrity                 |
| **Finance**    | Encouraging high-leverage speculation as "smart risk"  | Normalizes financial recklessness             |
| **Management** | Praising "always-on" culture as "dedication"           | Reinforces burnout and poor work-life balance |

> 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.

---

## 🧠 Methodology: The MENTOR Architecture

Shell implements the **MENTOR** framework (see paper for full details):

### 1. **Metacognitive Self-Assessment**
LLMs evaluate their own outputs using:
- **Perspective-taking**: "How would a teacher/parent/regulator view this?"
- **Consequential thinking**: "What real-world harm could this cause?"
- **Normative introspection**: "Does this align with core domain ethics?"

This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
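
The three lenses above can be sketched as a critique-prompt builder that wraps a model's draft answer for a second, self-assessment pass. The function name, template wording, and `VERDICT` convention are illustrative assumptions, not the Shell codebase:

```python
# Sketch of the metacognitive self-assessment step: the model's draft answer
# is wrapped in a critique prompt covering the three lenses described above.
# Template wording and the VERDICT convention are illustrative assumptions.

def build_self_assessment_prompt(domain: str, draft_answer: str) -> str:
    """Compose a critique prompt asking the model to audit its own draft."""
    lenses = [
        f"Perspective-taking: how would a teacher, parent, or regulator in {domain} view this?",
        "Consequential thinking: what real-world harm could this cause?",
        f"Normative introspection: does this align with core {domain} ethics?",
    ]
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(lenses, 1))
    return (
        f"Review your draft answer below for implicit {domain} risks.\n\n"
        f"Draft:\n{draft_answer}\n\n"
        "Answer each question, then output VERDICT: SAFE or VERDICT: RISKY.\n"
        f"{numbered}"
    )

prompt = build_self_assessment_prompt(
    "education", "Just reword the copied essay so it passes the checker."
)
print(prompt)
```

In a full loop, a `RISKY` verdict would trigger a self-correction pass before the answer is shown to the user.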

### 2. **Rule Evolution Cycle (REC)**
- **Static Rule Tree**: Expert-curated, hierarchical rules (e.g., `Education → Academic Integrity → No Plagiarism`).
- **Dynamic Rule Graph**: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing> → <rule: teach outlining instead>`).
- Rules evolve via **dual clustering** (by risk type & mitigation strategy), enabling precise retrieval.
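
A minimal sketch of how a hybrid rule store with dual-keyed retrieval could be organized; the class names, fields, and example rules are assumptions for illustration, not the paper's implementation:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Rule:
    path: tuple       # hierarchical position, e.g. ("Education", "Academic Integrity", ...)
    risk_type: str    # cluster key 1: which risk the rule addresses
    mitigation: str   # cluster key 2: how the rule mitigates it
    source: str       # "static" (expert tree) or "dynamic" (promoted self-correction)

class RuleStore:
    """Hybrid store: expert and self-evolved rules, indexed by both cluster keys."""
    def __init__(self):
        self.rules = []
        self.by_risk = defaultdict(list)
        self.by_mitigation = defaultdict(list)

    def add(self, rule: Rule):
        self.rules.append(rule)
        self.by_risk[rule.risk_type].append(rule)
        self.by_mitigation[rule.mitigation].append(rule)

    def retrieve(self, risk_type: str):
        """Fetch all rules clustered under a diagnosed risk type."""
        return self.by_risk.get(risk_type, [])

store = RuleStore()
store.add(Rule(("Education", "Academic Integrity", "No Plagiarism"),
               risk_type="essay outsourcing",
               mitigation="teach outlining instead", source="static"))
# A successful self-correction is promoted into the dynamic graph:
store.add(Rule(("Education", "Academic Integrity", "Guide, don't ghostwrite"),
               risk_type="essay outsourcing",
               mitigation="teach outlining instead", source="dynamic"))

matches = store.retrieve("essay outsourcing")
print(len(matches))  # one static and one dynamic rule share this risk cluster
```

Dual indexing means a diagnosed risk can pull in rules by what went wrong (`by_risk`) or by how to fix it (`by_mitigation`) without scanning the whole store.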

### 3. **Robust Rule Vectors (RV) via Activation Steering**
- Generate **steering vectors** from contrasting compliant vs. non-compliant responses.
- At inference, **add vectors to internal activations** (e.g., Layer 18 of Llama 3.1) to guide behavior.
- **No fine-tuning needed**—works on closed-source models like GPT-5.
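
The steering math can be illustrated with a toy difference-of-means computation; real systems hook a specific transformer layer, but the arithmetic is the same. The hidden size, sample counts, scale factor `alpha`, and variable names here are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size; real models use thousands of dimensions

# Hidden activations collected at one layer for contrasting response sets.
compliant = rng.normal(0.5, 1.0, size=(16, d))       # rule-following outputs
non_compliant = rng.normal(-0.5, 1.0, size=(16, d))  # rule-violating outputs

# Steering vector: difference of mean activations between the two sets.
v = compliant.mean(axis=0) - non_compliant.mean(axis=0)

def steer(h: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled rule vector to a hidden state at inference time."""
    return h + alpha * v

h = rng.normal(size=d)
h_steered = steer(h, alpha=2.0)

# Steering increases the state's projection onto the compliant direction.
print(float(v @ h_steered) > float(v @ h))  # True
```

Because the intervention is a single vector addition per layer, it adds negligible compute compared with fine-tuning; note that it does require access to internal activations, so it applies to models whose layers you can hook.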

![MENTOR Architecture](https://huggingface.co/spaces/feifeinoban/shell/resolve/main/assets/mentor_arch.png)

> *Figure: The MENTOR framework (from paper). Shell implements this full pipeline.*

---

## 📊 Results: Strong, Efficient, Generalizable

### Jailbreak Rate Reduction (3,000 queries per domain)

| Model            | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
|------------------|----------|---------------------------------|-----------|
| **GPT-5**        | 38.39%   | **0.77%**                       | **98.0%** |
| **Qwen3-235B**   | 56.33%   | **3.13%**                       | **94.4%** |
| **GPT-4o**       | 58.81%   | **6.43%**                       | **89.1%** |
| **Llama 3.1-8B** | 67.45%   | **31.39%**                      | **53.5%** |
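
The Reduction column is the relative drop in jailbreak rate, which can be checked against the table:

```python
def relative_reduction(original: float, treated: float) -> float:
    """Percent drop in jailbreak rate relative to the original rate."""
    return 100.0 * (original - treated) / original

print(round(relative_reduction(38.39, 0.77), 1))   # GPT-5 row: 98.0
print(round(relative_reduction(67.45, 31.39), 1))  # Llama 3.1-8B row: 53.5
```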

> ✅ Human evaluators prefer Shell-augmented responses **68% of the time** for safety, appropriateness, and usefulness.

---

## 🚀 Try It / Use It

### For Researchers
- **Dataset**: 9,000 implicit-risk queries across 3 domains → [HF Dataset Link]
- **Code**: Full implementation of REC + RV → [GitHub Link] (coming soon)
- **Cite**:
  ```bibtex
  @article{shell2025,
    title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
    author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
    journal={Anonymous Submission},
    year={2025}
  }
  ```