feifeinoban committed on
Commit f7d088e · verified · 1 Parent(s): 2198804

Update README.md

Files changed (1)
  1. README.md +27 -7
README.md CHANGED
@@ -1,3 +1,23 @@
  # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
 
  > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
@@ -29,10 +49,10 @@ Evaluated on 9,000 risk queries across **education, finance, and management**, S
  | Domain | Example Implicit Risk | Harmful Consequence |
  |-------------|--------------------------------------------------------|----------------------------------------------|
  | **Education** | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
- | | Framing sacrificing sleep for grades as admirable | Promotes unhealthy competition |
- | | Teaching how to rephrase copied essays | Undermines academic integrity |
- | **Finance** | Encouraging high-leverage speculation as smart risk | Normalizes financial recklessness |
- | **Management**| Praising always-on culture as dedication | Reinforces burnout and poor work-life balance|
 
  > 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.
 
@@ -44,9 +64,9 @@ Shell implements the **MENTOR** framework (see paper for full details):
 
  ### 1. **Metacognitive Self-Assessment**
  LLMs evaluate their own outputs using:
- - **Perspective-taking**: How would a teacher/parent/regulator view this?”
- - **Consequential thinking**: What real-world harm could this cause?”
- - **Normative introspection**: Does this align with core domain ethics?”
 
  This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
 
 
+ ---
+ title: Shell:Metacognition-Driven Safety for Domain-Specific LLMs
+ emoji: 🐚
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: "4.0.0"
+ app_file: app.py
+ pinned: false
+ tags:
+ - llm-safety
+ - metacognition
+ - education
+ - finance
+ - management
+ - alignment
+ - activation-steering
+ short_description: Metacognition-driven safety for domain-specific LLMs
+ ---
+
  # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
 
  > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
 
  | Domain | Example Implicit Risk | Harmful Consequence |
  |-------------|--------------------------------------------------------|----------------------------------------------|
  | **Education** | Suggesting clever comebacks that escalate bullying | Deteriorates peer relationships |
+ | | Framing "sacrificing sleep for grades" as admirable | Promotes unhealthy competition |
+ | | Teaching how to "rephrase copied essays" | Undermines academic integrity |
+ | **Finance** | Encouraging high-leverage speculation as "smart risk" | Normalizes financial recklessness |
+ | **Management**| Praising "always-on" culture as "dedication" | Reinforces burnout and poor work-life balance|
 
  > 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.
 
 
64
 
65
  ### 1. **Metacognitive Self-Assessment**
66
  LLMs evaluate their own outputs using:
67
+ - **Perspective-taking**: "How would a teacher/parent/regulator view this?"
68
+ - **Consequential thinking**: "What real-world harm could this cause?"
69
+ - **Normative introspection**: "Does this align with core domain ethics?"
70
 
71
  This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
72