Update README.md
README.md
CHANGED
@@ -1,3 +1,23 @@
+---
+title: Shell:Metacognition-Driven Safety for Domain-Specific LLMs
+emoji: 🐚
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: "4.0.0"
+app_file: app.py
+pinned: false
+tags:
+- llm-safety
+- metacognition
+- education
+- finance
+- management
+- alignment
+- activation-steering
+short_description: Metacognition-driven safety for domain-specific LLMs
+---
+
 # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
 
 > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
@@ -29,10 +49,10 @@ Evaluated on 9,000 risk queries across **education, finance, and management**, S
 | Domain | Example Implicit Risk | Harmful Consequence |
 |-------------|--------------------------------------------------------|----------------------------------------------|
 | **Education** | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
-| | Framing
-| | Teaching how to
-| **Finance** | Encouraging high-leverage speculation as
-| **Management**| Praising
+| | Framing "sacrificing sleep for grades" as admirable | Promotes unhealthy competition |
+| | Teaching how to "rephrase copied essays" | Undermines academic integrity |
+| **Finance** | Encouraging high-leverage speculation as "smart risk" | Normalizes financial recklessness |
+| **Management**| Praising "always-on" culture as "dedication" | Reinforces burnout and poor work-life balance|
 
 > 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.
 
@@ -44,9 +64,9 @@ Shell implements the **MENTOR** framework (see paper for full details):
 
 ### 1. **Metacognitive Self-Assessment**
 LLMs evaluate their own outputs using:
-- **Perspective-taking**:
-- **Consequential thinking**:
-- **Normative introspection**:
+- **Perspective-taking**: "How would a teacher/parent/regulator view this?"
+- **Consequential thinking**: "What real-world harm could this cause?"
+- **Normative introspection**: "Does this align with core domain ethics?"
 
 This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
 
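The three self-assessment questions in the diff above could be wired into a simple reflection loop. The sketch below is an illustration only, not the Shell or MENTOR implementation: `build_prompt`, `self_assess`, `Assessment`, the prompt wording, and the "answer OK if no concern" convention are all assumptions, and `stub_model` is a toy stand-in for a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The three reflection questions quoted in the README diff.
REFLECTION_QUESTIONS: Dict[str, str] = {
    "perspective_taking": "How would a teacher/parent/regulator view this?",
    "consequential_thinking": "What real-world harm could this cause?",
    "normative_introspection": "Does this align with core domain ethics?",
}


@dataclass
class Assessment:
    flagged: bool             # True if any reflection lens raised a concern
    concerns: Dict[str, str]  # lens name -> concern text returned by the model


def build_prompt(domain: str, response: str, question: str) -> str:
    """Hypothetical prompt template; the wording is an assumption."""
    return (
        f"Domain: {domain}\n"
        f"Candidate response: {response}\n"
        f"Reflect: {question}\n"
        "Answer 'OK' if there is no concern, otherwise describe the concern."
    )


def self_assess(
    domain: str, response: str, ask_model: Callable[[str], str]
) -> Assessment:
    """Ask the model each reflection question and collect non-OK answers."""
    concerns: Dict[str, str] = {}
    for lens, question in REFLECTION_QUESTIONS.items():
        answer = ask_model(build_prompt(domain, response, question)).strip()
        if answer.upper() != "OK":
            concerns[lens] = answer
    return Assessment(flagged=bool(concerns), concerns=concerns)


def stub_model(prompt: str) -> str:
    """Toy stand-in for an LLM: objects only on the harm question."""
    if "real-world harm" in prompt and "always-on" in prompt:
        return "May reinforce burnout."
    return "OK"


result = self_assess(
    "management", "Praise the always-on culture as dedication.", stub_model
)
print(result.flagged, result.concerns)
```

Passing the model as a plain callable keeps the sketch independent of any particular inference API; a real deployment would replace `stub_model` with its own client and would likely need more robust parsing of the model's answers.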