from __future__ import annotations

import gradio as gr

from layout import cell


def render_solution_cell() -> None:
    with cell("✅ Solution: contextual biasing through priors"):
        gr.Markdown(
            """
### 👩🏻‍🏫 Background

Automatic speech recognition systems can be steered by giving them *context* up front. OpenAI Whisper, for example, supports a **textual
prompt** that lists likely names, abbreviations, product names, or domain terms. When the audio is ambiguous, the model can bias its
choices toward those tokens instead of guessing from scratch. This works especially well in high-noise conference settings or in niche
domains where participant names and acronyms rarely appear in generic training data.
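
To make that mechanism concrete, here is a minimal sketch of passing such a textual prior via the OpenAI SDK's `prompt` parameter (the
file name and term list are placeholders, and this is illustrative rather than the code used by this Space):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

with open("talk.mp3", "rb") as audio:  # placeholder recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        # Domain terms the decoder should lean towards when the audio is ambiguous
        prompt="NOOTS-Staatsvertrag, Smart Country Convention, data sharing treaty",
    )

print(transcript.text)
```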

In Aileen 3, we generalise this idea of “context” into **priors**: structured hints about the user, their expectations, prior knowledge,
and the media itself (title, channel, description…). At a high level (Aileen 3 Agent), priors are not facts the model has to re-discover;
they are the baseline that makes it easier to spot surprises, new actors, and genuinely novel claims later on. At a low level (Aileen 3
Core), the same concept applies to transcription: spelling out how the 🇩🇪 “NOOTS-Staatsvertrag” (data sharing treaty) is supposed to look
in writing gives the model a strong prior against hallucinating emergency-state superpowers (🇩🇪 “Notstaatsvertrag”).

Multi-modal models such as Google Gemini go one step further than Whisper: they can accept priors that are not plain text. Images of
slides from a talk, agenda screenshots, or diagrams can be ingested alongside the audio to provide a potentially much richer prior.
Internally, the Aileen MCP already extracts representative slide images from long-form talks so that this kind of multi-modal prior can
be used in downstream analysis; we will lean on the same building blocks for transcription.
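
As a sketch of what such a multi-modal prior could look like (assuming the `google-generativeai` SDK; file names and the model id are
placeholders, not the Aileen MCP internals):

```python
import google.generativeai as genai

genai.configure(api_key="...")  # or set GOOGLE_API_KEY in the environment

audio = genai.upload_file("talk.m4a")          # the recording itself
slide = genai.upload_file("agenda_slide.png")  # a visual prior, e.g. an extracted slide

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    [
        "Transcribe the talk. Use the attached slide as context for names and "
        "domain terms such as 'NOOTS-Staatsvertrag'.",
        slide,
        audio,
    ]
)

print(response.text)
```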

### 💁🏻‍♀️ Demo

In the demo below, we are going to treat the **YouTube video description** of the Smart Country Convention session as a prior. While the
Aileen MCP server's transcription tool allows a user-supplied text prior to be passed, we will rely on its internal extraction of the
video metadata. That way, the Google Gemini model sees both the audio of the talk and a text prior that spells out the intended
terminology.
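
For intuition, a description prior can be pulled from the video metadata roughly like this (a sketch using yt-dlp's Python API; the
Aileen MCP server's own extraction may differ):

```python
from yt_dlp import YoutubeDL

VIDEO_URL = "https://www.youtube.com/watch?v=..."  # placeholder URL

with YoutubeDL({"quiet": True, "skip_download": True}) as ydl:
    info = ydl.extract_info(VIDEO_URL, download=False)

# Title, channel, and description together make up the textual prior for transcription.
prior = " / ".join(
    part for part in (info.get("title"), info.get("channel"), info.get("description")) if part
)
print(prior)
```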

The goal is to see whether supplying the description as a prior helps the transcription stay anchored on the name of the data sharing
treaty (NOOTS-Staatsvertrag) instead of hallucinating an emergency-state treaty (“Notstaatsvertrag”) when the audio is noisy or ambiguous.

Before running the demo, we will first run the health check cell to verify that ffmpeg, yt-dlp, the Aileen MCP server, and your Gemini API
key are all wired up correctly.
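
A minimal pre-flight sketch of such a check (the environment variable name and the exact checks are assumptions; the actual health check
cell does more):

```python
import os
import shutil

for tool in ("ffmpeg", "yt-dlp"):
    assert shutil.which(tool) is not None, tool + " not found on PATH"

# Assumed variable name; adjust to however your Gemini API key is configured.
assert os.environ.get("GEMINI_API_KEY"), "Gemini API key is not set"
```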
| """ | |
| ) | |