Spaces: feifeinoban/shell

feifeinoban committed on Oct 7, 2025

Commit f7d088e (verified) · 1 Parent(s): 2198804

Update README.md

Files changed (1)

README.md +27 -7

@@ -1,3 +1,23 @@
 # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
 > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
@@ -29,10 +49,10 @@ Evaluated on 9,000 risk queries across **education, finance, and management**, S
 | Domain      | Example Implicit Risk                                  | Harmful Consequence                          |
 |-------------|--------------------------------------------------------|----------------------------------------------|
 | **Education** | Suggesting clever comebacks that escalate bullying     | Deteriorates peer relationships              |
-|             | Framing “sacrificing sleep for grades” as admirable    | Promotes unhealthy competition                |
-|             | Teaching how to “rephrase copied essays”               | Undermines academic integrity                |
-| **Finance**   | Encouraging high-leverage speculation as “smart risk”  | Normalizes financial recklessness             |
-| **Management**| Praising “always-on” culture as “dedication”           | Reinforces burnout and poor work-life balance|
 > 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.
@@ -44,9 +64,9 @@ Shell implements the **MENTOR** framework (see paper for full details):
 ### 1. **Metacognitive Self-Assessment**
 LLMs evaluate their own outputs using:
-- **Perspective-taking**: “How would a teacher/parent/regulator view this?”
-- **Consequential thinking**: “What real-world harm could this cause?”
-- **Normative introspection**: “Does this align with core domain ethics?”
 This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.

+---
+title: Shell:Metacognition-Driven Safety for Domain-Specific LLMs
+emoji: 🐚
+colorFrom: blue
+colorTo: purple
+sdk: gradio
+sdk_version: "4.0.0"
+app_file: app.py
+pinned: false
+tags:
+  - llm-safety
+  - metacognition
+  - education
+  - finance
+  - management
+  - alignment
+  - activation-steering
+short_description: Metacognition-driven safety for domain-specific LLMs
+---
 # 🐚 Shell: A Metacognition-Driven Safety Framework for Domain-Specific LLMs
 > **Uncover and mitigate implicit value risks in education, finance, management—and beyond**
 | Domain      | Example Implicit Risk                                  | Harmful Consequence                          |
 |-------------|--------------------------------------------------------|----------------------------------------------|
 | **Education** | Suggesting clever comebacks that escalate bullying     | Deteriorates peer relationships              |
+|             | Framing "sacrificing sleep for grades" as admirable    | Promotes unhealthy competition                |
+|             | Teaching how to "rephrase copied essays"               | Undermines academic integrity                |
+| **Finance**   | Encouraging high-leverage speculation as "smart risk"  | Normalizes financial recklessness             |
+| **Management**| Praising "always-on" culture as "dedication"           | Reinforces burnout and poor work-life balance|
 > 💡 These risks are **not jailbreaks** in the traditional sense—they appear benign but subtly erode domain-specific values.
 ### 1. **Metacognitive Self-Assessment**
 LLMs evaluate their own outputs using:
+- **Perspective-taking**: "How would a teacher/parent/regulator view this?"
+- **Consequential thinking**: "What real-world harm could this cause?"
+- **Normative introspection**: "Does this align with core domain ethics?"
 This replaces labor-intensive human labeling with **autonomous, human-aligned reflection**.
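
The three reflection lenses listed above could be wired into a self-assessment pass roughly as follows. This is a minimal sketch, not the Shell or MENTOR implementation: the function names, the `ask_model` hook, and the "RISK or OK" answer format are all hypothetical choices made for illustration. Only the three question strings are taken from the README.

```python
from typing import Callable, Dict

# The three reflection questions quoted in the README.
REFLECTION_QUESTIONS: Dict[str, str] = {
    "perspective_taking": "How would a teacher/parent/regulator view this?",
    "consequential_thinking": "What real-world harm could this cause?",
    "normative_introspection": "Does this align with core domain ethics?",
}


def build_reflection_prompts(domain: str, response: str) -> Dict[str, str]:
    """Return one self-assessment prompt per reflection lens.

    The prompt layout and the RISK/OK answer format are assumptions.
    """
    return {
        lens: (
            f"Domain: {domain}\n"
            f"Candidate response: {response}\n"
            f"Reflect: {question}\n"
            "Answer RISK or OK, then explain briefly."
        )
        for lens, question in REFLECTION_QUESTIONS.items()
    }


def self_assess(
    domain: str,
    response: str,
    ask_model: Callable[[str], str],
) -> Dict[str, bool]:
    """Flag each lens whose answer starts with RISK.

    `ask_model` is a hypothetical hook: any callable that sends a prompt
    to an LLM and returns its text reply.
    """
    prompts = build_reflection_prompts(domain, response)
    return {
        lens: ask_model(prompt).strip().upper().startswith("RISK")
        for lens, prompt in prompts.items()
    }


if __name__ == "__main__":
    # Stub model for demonstration only; it flags every prompt as risky.
    flags = self_assess(
        "education",
        "Skipping sleep to study shows real commitment.",
        ask_model=lambda prompt: "RISK: normalizes sleep deprivation",
    )
    print(flags)
```

Keeping the model call behind a plain callable means the same loop can be exercised with a stub in tests and pointed at a real model later, without changing the reflection logic.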