Spaces:
Running
Running
| import gradio as gr | |
| # We define the text content here for easy editing | |
| KICKOFF_CONTENT = """ | |
| # 📄 Kick-Off Document: Unbiasing Guardrails | |
| ## Why | |
| We built BYO because we realized something uncomfortable but true: **whoever controls language models controls the narrative**, and whoever controls the narrative shapes how people think, explore, and understand the world. | |
| Today’s LLMs don’t reflect the real diversity of human knowledge or opinion. They are filtered, aligned, and quietly steered by political, corporate, and regulatory interests. This isn’t neutral intelligence. It’s a curated worldview, enforced through guardrails that decide in advance what is acceptable to ask, research, or say. | |
| Guardrails and model alignments are not just a technical layer. They are biases embedded in code and weights. A silent hand shaping discourse before it even begins. | |
| At BYO, we believe knowledge doesn’t belong to platforms or conglomerates. It belongs to people. Language models should be tools for self-expression, exploration, and original thought, not instruments of centralized control. | |
| That’s why we built a no-code platform that lets anyone train and control their own language model, on their own knowledge, in their own voice. No permissions. No invisible filters. No approved narratives. | |
| When models are truly open to human training and expression, something powerful happens: real knowledge sharing, unbiased exploration, and authentic insight at scale. Progress accelerates when intelligence is decentralized. | |
| We’re not here to make AI safe for institutions. | |
| **We’re here to make it free for humans.** And that starts by unbiasing the guardrails. | |
| --- | |
| ## Possible Approaches | |
| ### 🛡️ Blue Team | |
| *A set of approaches for measuring and protecting your model against malicious prompts.* | |
| * **Latency based approaches**: e.g. how well can you protect the model with a swarm of SLM that classify maliciousness simultaneously? | |
| * **Monitoring model**: e.g. sandboxing and identifying states of a system. | |
| ### ⚔️ Red Team | |
| *The systematic identification of model failure modes to build more resilient and trustworthy systems.* | |
| * **Auto Red Teaming**: An attacker that learns attacks as the target model is being enhanced, e.g. a model executes and automatically writes scripts for new attack strategies. | |
| * **Persona Teaming**: Injecting prompts with suitable personas have higher success chances. | |
| * **Attack via Overfitting**: Overfitting the model to a set of Q&A prioritizing instruction following over other considerations. | |
| * **Jailbreak tuning**: Tuning a model with multi objective loss to keep utility and induce harmful content. | |
| ### ⚖️ Debiasing | |
| *Debiasing model either in pre-train, post-train or prompt level.* | |
| * **Self-Debiasing LLMs**: Zero-Shot Recognition and Reduction of Stereotypes - prompt level, LLM knows about the stereotype and prompt engineering reduces biases. | |
| * **Curriculum Debiasing**: Toward Robust PEFT - curriculum learning, starting from sets of biased samples and gradually moving to unbiased samples. | |
| ### 🍎 Lowest Hanging Fruit | |
| A good first step would be to identify, collect, and process useful datasets related to red & blue teaming and debiasing, identify established benchmarks and open source codes related to red & blue teaming and debiasing. | |
| --- | |
| ## Example References | |
| * [SoK: Evaluating Jailbreak Guardrails for Large Language Models](https://arxiv.org/pdf/2506.10597) | |
| * [OneShield - the Next Generation of LLM Guardrails](https://arxiv.org/pdf/2507.21170) | |
| * [Agentic AI for Cyber Resilience: A New Security Paradigm](https://www.arxiv.org/pdf/2512.22883) | |
| * [AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration](https://arxiv.org/pdf/2503.15754) | |
| * [PersonaTeaming: Exploring How Introducing Personas Can Improve Automated AI Red-Teaming](https://arxiv.org/pdf/2509.03728) | |
| * [Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs](https://arxiv.org/pdf/2510.02833) | |
| * [Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility](https://arxiv.org/pdf/2507.11630) | |
| * [Curriculum Debiasing: Toward Robust PEFT](https://aclanthology.org/2025.acl-long.469.pdf) | |
| * [Self-Debiasing LLMs: Zero-Shot Recognition and Reduction of Stereotypes](https://arxiv.org/pdf/2402.01981) | |
| """ | |
| def render_pdf_viewer(): | |
| """ | |
| Renders the Kick-Off content as Markdown. | |
| """ | |
| with gr.Column(visible=False) as pdf_view: | |
| # Render the text defined above | |
| gr.Markdown(KICKOFF_CONTENT) | |
| gr.Markdown("---") | |
| btn_back = gr.Button("⬅️ Back to Home", variant="secondary") | |
| return pdf_view, btn_back |