from __future__ import annotations

import gradio as gr

from layout import cell


def render_solution_cell() -> None:
    with cell("✅ Solution: contextual biasing through priors"):
        gr.Markdown(
            """
### 👩🏻‍🏫 Background
Automatic speech recognition systems can be steered by giving them *context* up front. OpenAI Whisper, for example, supports a **textual
prompt** that lists likely names, abbreviations, product names, or domain terms. When the audio is ambiguous, the model can bias its
choices toward those tokens instead of guessing from scratch. This works especially well in high-noise conference settings or niche domains
where participant names and acronyms rarely appear in generic training data.
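
As a minimal sketch, assuming the open-source `whisper` package, a placeholder audio file, and placeholder domain terms, such a prompt can be
passed like this:

```python
import whisper

# Load a small model; larger models expose the same API.
model = whisper.load_model("base")

# initial_prompt biases decoding toward the listed domain terms.
result = model.transcribe(
    "talk.wav",  # placeholder audio file
    initial_prompt="NOOTS-Staatsvertrag, Smart Country Convention",
)
print(result["text"])
```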

In Aileen 3, we generalise this idea of “context” into **priors**: structured hints about the user, their expectations, prior knowledge, and
the media itself (title, channel, description…). At a high level (Aileen 3 Agent), priors are not facts the model has to re-discover – they are the baseline
that makes it easier to spot surprises, new actors, and genuinely novel claims later on. At a low level (Aileen 3 Core), the same concept applies
to transcription: spelling out how the 🇩🇪 “NOOTS-Staatsvertrag” (data sharing treaty) is supposed to look in writing gives the model a
strong prior against hallucinating a non-existent emergency-powers treaty (🇩🇪 “Notstaatsvertrag”).
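
The exact schema Aileen 3 uses for priors is internal; purely as an illustration, the kind of structured hints described above could look
like this (all field names are hypothetical):

```python
# Hypothetical illustration; these field names are not Aileen 3's actual schema.
priors = {
    "user": {"expertise": "public-sector digitalisation", "language": "de"},
    "media": {
        "title": "...",        # video title
        "channel": "...",      # channel name
        "description": "...",  # full video description
    },
    "terminology": ["NOOTS-Staatsvertrag"],  # spellings the model should prefer
}
```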

Multi-modal models such as Google Gemini go a step further than Whisper: they can accept priors that are not plain text. Images of slides
from a talk, agenda screenshots, or diagrams can be ingested alongside the audio, providing a potentially much richer prior. Internally, the
Aileen MCP already extracts representative slide images from long-form talks so that this kind of multi-modal prior can be used in
downstream analysis – we will lean on the same building blocks for transcription.
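
As a sketch of what such a multi-modal prior looks like in practice, assuming the `google-generativeai` SDK, placeholder file paths, and an
API key exported as `GEMINI_API_KEY` (the variable name is an assumption):

```python
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # assumed variable name

# Upload the audio via the File API; PIL images can be passed in directly.
audio = genai.upload_file("talk.mp3")  # placeholder path
slide = Image.open("slide_01.png")     # placeholder path

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    [
        "Transcribe the talk. Use the slide as a prior for names and terminology.",
        slide,
        audio,
    ]
)
print(response.text)
```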

### 💁🏻‍♀️ Demo
In the demo below, we treat the **YouTube video description** of the Smart Country Convention session as a prior. While the Aileen MCP
server's transcription tool accepts a user-supplied text prior, we will rely on its internal extraction of the video
metadata. That way, the Google Gemini model sees both the audio of the talk and a text prior that
spells out the intended terminology.
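
Under the hood, calling such a tool over MCP looks roughly like this, assuming the official `mcp` Python SDK; the launch command and tool
name below are hypothetical, so consult the Aileen MCP documentation for the real ones:

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client


async def main() -> None:
    # Hypothetical launch command; the real one depends on your installation.
    params = StdioServerParameters(command="aileen-mcp")
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "transcribe_youtube",  # hypothetical tool name
                {"url": "https://www.youtube.com/watch?v=..."},  # placeholder URL
            )
            print(result.content)


asyncio.run(main())
```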

The goal is to see whether supplying the description as a prior helps the transcription stay anchored to the name of the
data sharing treaty (NOOTS-Staatsvertrag) instead of drifting to the hallucinated “Notstaatsvertrag” when the audio is noisy or ambiguous.

Before running the demo, we run the health check cell to verify that ffmpeg, yt-dlp, the Aileen MCP server, and your Gemini API
key are all wired up correctly.
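
In spirit, that check boils down to something like the following (a simplified sketch; the real cell also pings the Aileen MCP server, and
the environment variable name is an assumption):

```python
import os
import shutil

# Verify the external binaries are on PATH.
for tool in ("ffmpeg", "yt-dlp"):
    print(f"{tool}: {'ok' if shutil.which(tool) else 'MISSING'}")

# Verify the Gemini API key is exported (variable name assumed).
print("GEMINI_API_KEY:", "set" if os.environ.get("GEMINI_API_KEY") else "MISSING")
```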
            """
        )