feifeinoban committed on
Commit 2198804 · verified · Parent: 03d3d14

Update README.md

Files changed (1): README.md (+90 −21)
README.md CHANGED
@@ -1,27 +1,96 @@
  ---
- title: Shell
- emoji: 🧠
- colorFrom: yellow
- colorTo: indigo
- sdk: static
- pinned: false
- license: apache-2.0
- short_description: 'MENTOR: LLM risk framework'
  ---

- # Nerfies

- This is the repository that contains source code for the [Nerfies website](https://nerfies.github.io).

- If you find Nerfies useful for your work please cite:
- ```
- @article{park2021nerfies
- author = {Park, Keunhong and Sinha, Utkarsh and Barron, Jonathan T. and Bouaziz, Sofien and Goldman, Dan B and Seitz, Steven M. and Martin-Brualla, Ricardo},
- title = {Nerfies: Deformable Neural Radiance Fields},
- journal = {ICCV},
- year = {2021},
- }
- ```

- # Website License
- <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
+ # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
+
+ > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
+ > 🔒 Model-agnostic · 🧠 Self-evolving rules · ⚡ Activation steering · 📉 90%+ jailbreak reduction
+
+ [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
+ [![Dataset](https://img.shields.io/badge/Dataset-HuggingFace-ff69b4)](https://huggingface.co/datasets/your-dataset-here)
+ [![Paper](https://img.shields.io/badge/Paper-ArXiv-black)](https://arxiv.org/abs/xxxx.xxxxx)
+
+ Shell is an open safety framework that empowers domain-specific LLMs to **detect, reflect on, and correct implicit value misalignments**—without retraining. Built on the **MENTOR** architecture, it combines metacognitive self-assessment, dynamic rule evolution, and activation steering to deliver robust, interpretable, and efficient alignment across specialized verticals.
+
+ ---
+
+ ## 📌 Abstract
+
+ While current LLM safety methods focus on explicit harms (e.g., hate speech, violence), they often miss **domain-specific implicit risks**—such as encouraging academic dishonesty in education, promoting reckless trading in finance, or normalizing toxic workplace culture in management.
+
+ We introduce **Shell**, a metacognition-driven self-evolution framework that:
+ - Enables LLMs to **self-diagnose value misalignments** via perspective-taking and consequence simulation.
+ - Builds a **hybrid rule system**: expert-defined static trees + self-evolved dynamic graphs.
+ - Enforces rules at inference time via **activation steering**, achieving strong safety with minimal compute.
+
+ Evaluated on 9,000 risk queries across **education, finance, and management**, Shell reduces average jailbreak rates by **>90%** on models including GPT-5, Qwen3, and Llama 3.1.
+
+ ---
+
+ ## 🎯 Core Challenges: Implicit Risks Are Everywhere
+
+ | Domain | Example Implicit Risk | Harmful Consequence |
+ |----------------|---------------------------------------------------------|------------------------------------------------|
+ | **Education**  | Suggesting clever comebacks that escalate bullying      | Deteriorates peer relationships                |
+ |                | Framing “sacrificing sleep for grades” as admirable     | Promotes unhealthy competition                 |
+ |                | Teaching how to “rephrase copied essays”                | Undermines academic integrity                  |
+ | **Finance**    | Encouraging high-leverage speculation as “smart risk”   | Normalizes financial recklessness              |
+ | **Management** | Praising “always-on” culture as “dedication”            | Reinforces burnout and poor work-life balance  |
+
+ > 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.
+
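To see why such outputs evade generic moderation, consider a toy comparison between a generic harm filter and a domain value check. This is an illustrative sketch only: the rule phrases and function names are hypothetical, not Shell's actual rule set.

```python
# Illustrative sketch: hypothetical rule phrases, not Shell's actual rule set.
GENERIC_HARM_TERMS = {"violence", "hate", "weapons"}

DOMAIN_RULES = {
    "education": ["rephrase copied essay", "outsource your essay", "sacrifice sleep for grades"],
    "finance": ["high-leverage speculation", "all-in leverage"],
    "management": ["always-on culture"],
}

def generic_filter(text: str) -> bool:
    """Generic harm filter: only flags explicit harm terms."""
    return any(term in text.lower() for term in GENERIC_HARM_TERMS)

def domain_filter(text: str, domain: str) -> bool:
    """Domain value check: flags implicit-risk phrasings for the given vertical."""
    return any(phrase in text.lower() for phrase in DOMAIN_RULES.get(domain, []))

query = "Here's how to rephrase copied essays so they pass plagiarism checks."
# The query passes the generic filter but trips the education domain rules.
```

The point of the sketch: the risky query contains no explicitly harmful terms, so only a domain-aware rule set catches it.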
  ---
+
+ ## 🧠 Methodology: The MENTOR Architecture
+
+ Shell implements the **MENTOR** framework (see the paper for full details):
+
+ ### 1. **Metacognitive Self-Assessment**
+ LLMs evaluate their own outputs using:
+ - **Perspective-taking**: “How would a teacher/parent/regulator view this?”
+ - **Consequential thinking**: “What real-world harm could this cause?”
+ - **Normative introspection**: “Does this align with core domain ethics?”
+
+ This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
+
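One way to operationalize this self-assessment is a reflection prompt built from the three questions above. A minimal sketch follows; the exact prompt wording and the `VERDICT:` parsing convention are assumptions for illustration, not Shell's actual implementation:

```python
# Sketch of a metacognitive self-assessment step; wording is illustrative.
REFLECTION_TEMPLATE = """Review your draft answer before sending it.
1. Perspective-taking: How would a teacher, parent, or regulator view this?
2. Consequential thinking: What real-world harm could this cause?
3. Normative introspection: Does this align with core {domain} ethics?

Draft answer:
{draft}

Reply with VERDICT: SAFE or VERDICT: REVISE, then your reasoning."""

def build_reflection_prompt(draft: str, domain: str) -> str:
    """Build the self-assessment prompt for a given draft and domain."""
    return REFLECTION_TEMPLATE.format(domain=domain, draft=draft)

def parse_verdict(model_reply: str) -> bool:
    """True if the model judged its own draft safe to send."""
    return "VERDICT: SAFE" in model_reply.upper()
```

In use, the model's reply to this prompt either clears the draft or routes it back for revision, which is what makes the reflection loop autonomous.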
+ ### 2. **Rule Evolution Cycle (REC)**
+ - **Static Rule Tree**: Expert-curated, hierarchical rules (e.g., `Education → Academic Integrity → No Plagiarism`).
+ - **Dynamic Rule Graph**: Automatically generated from successful self-corrections (e.g., `<risk: essay outsourcing> → <rule: teach outlining instead>`).
+ - Rules evolve via **dual clustering** (by risk type and mitigation strategy), enabling precise retrieval.
+
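The hybrid rule store can be sketched with simple data structures: a nested mapping for the static tree and a risk-to-rule mapping for edges mined from self-corrections. Structure and names here are illustrative assumptions, not Shell's data model:

```python
# Illustrative sketch of the hybrid rule store (not Shell's actual data model).
# Static rule tree: expert-curated hierarchy, e.g. Education -> Academic Integrity.
STATIC_TREE = {
    "Education": {"Academic Integrity": ["No plagiarism", "No essay outsourcing"]},
    "Finance": {"Risk Prudence": ["No reckless leverage advice"]},
}

# Dynamic rule graph: <risk> -> <rule> edges mined from successful self-corrections.
dynamic_graph = {}

def record_correction(risk, rule):
    """Add a <risk> -> <rule> edge learned from a successful self-correction."""
    dynamic_graph.setdefault(risk, []).append(rule)

def retrieve_rules(domain, risk):
    """Merge static-tree rules for the domain with dynamic rules for the risk."""
    static = [r for branch in STATIC_TREE.get(domain, {}).values() for r in branch]
    return static + dynamic_graph.get(risk, [])

record_correction("essay outsourcing", "teach outlining instead")
```

Retrieval then returns both the expert baseline and any learned mitigations for the detected risk; in the full system, the dual clustering step would group these edges by risk type and mitigation strategy before retrieval.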
+ ### 3. **Robust Rule Vectors (RV) via Activation Steering**
+ - Generate **steering vectors** by contrasting activations on compliant vs. non-compliant responses.
+ - At inference, **add the vectors to internal activations** (e.g., Layer 18 of Llama 3.1) to guide behavior.
+ - **No fine-tuning needed**—and the rule-based components also extend to closed-source models like GPT-5 (direct steering requires access to model activations).
+
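Mechanically, a steering vector of this kind is commonly computed as the difference between mean activations on compliant and non-compliant responses, then added (scaled) to the hidden state at a chosen layer. A pure-Python sketch under those assumptions; the layer choice and scaling factor are illustrative:

```python
# Toy sketch of activation steering; real use operates on transformer hidden states.
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def steering_vector(compliant, noncompliant):
    """Rule vector = mean(compliant activations) - mean(non-compliant activations)."""
    mc, mn = mean_vec(compliant), mean_vec(noncompliant)
    return [c - n for c, n in zip(mc, mn)]

def steer(hidden, rv, alpha=1.0):
    """At inference, add the scaled rule vector to a layer's hidden state."""
    return [h + alpha * r for h, r in zip(hidden, rv)]

rv = steering_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 2.0], [0.0, 4.0]])  # [2.0, -3.0]
steered = steer([0.5, 0.5], rv, alpha=0.5)  # [1.5, -1.0]
```

The scaling factor `alpha` trades off steering strength against fluency, which is why it is typically tuned per model and layer.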
+ ![MENTOR Architecture](assets/mentor_arch.png)
+
+ > *Figure: The MENTOR framework (from the paper). Shell implements this full pipeline.*
+
  ---

+ ## 📊 Results: Strong, Efficient, Generalizable
+
+ ### Jailbreak Rate Reduction (3,000 queries per domain)
+
+ | Model            | Original | + Shell (Rules + MetaLoop + RV) | Reduction |
+ |------------------|----------|---------------------------------|-----------|
+ | **GPT-5**        | 38.39%   | **0.77%**                       | **98.0%** |
+ | **Qwen3-235B**   | 56.33%   | **3.13%**                       | **94.4%** |
+ | **GPT-4o**       | 58.81%   | **6.43%**                       | **89.1%** |
+ | **Llama 3.1-8B** | 67.45%   | **31.39%**                      | **53.5%** |
+
+ > ✅ Human evaluators prefer Shell-augmented responses **68% of the time** for safety, appropriateness, and usefulness.
+
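The Reduction column is the relative drop in jailbreak rate, (original − with Shell) / original. For example, GPT-5: (38.39 − 0.77) / 38.39 ≈ 98.0%. A quick check of all four rows:

```python
# Verify the Reduction column: relative drop = (original - with_shell) / original.
results = {
    "GPT-5": (38.39, 0.77),
    "Qwen3-235B": (56.33, 3.13),
    "GPT-4o": (58.81, 6.43),
    "Llama 3.1-8B": (67.45, 31.39),
}

reductions = {
    model: round((orig - shell) / orig * 100, 1)
    for model, (orig, shell) in results.items()
}
# {'GPT-5': 98.0, 'Qwen3-235B': 94.4, 'GPT-4o': 89.1, 'Llama 3.1-8B': 53.5}
```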
+ ---
+
+ ## 🚀 Try It / Use It
+
+ ### For Researchers
+ - **Dataset**: 9,000 implicit-risk queries across 3 domains [HF Dataset Link]
+ - **Code**: Full implementation of REC + RV → [GitHub Link] (coming soon)
+ - **Cite**:
+ ```bibtex
+ @article{shell2025,
+   title={Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs},
+   author={Wu, Wen and Ying, Zhenyu and He, Liang and Team, Shell},
+   journal={Anonymous Submission},
+   year={2025}
+ }
+ ```