Spaces:

MCP-1st-Birthday
/

aileen3-core

Running

App Files Files Community

aileen3-core / demo /solution_cell.py

ndurner

problem/solution statement

9188dd0 18 days ago

raw

history blame contribute delete

2.84 kB

	from __future__ import annotations

	import gradio as gr

	from layout import cell


	def render_solution_cell() -> None:
	with cell("✅ Solution: contextual biasing through priors"):
	gr.Markdown(
	"""
	### 👩🏻‍🏫 Background
	Automatic speech recognition systems can be steered by giving them context up front. OpenAI Whisper, for example, supports a **textual
	prompt** that lists likely names, abbreviations, product names, or domain terms. When the audio is ambiguous, the model can bias its
	choices toward those tokens instead of guessing from scratch. This works especially well in high-noise conference settings or niche domains
	where participant names and acronyms rarely appear in generic training data.

	In Aileen 3, we generalise this idea of “context” into priors: structured hints about the user, their expectations, prior knowledge, and
	the media itself (title, channel, description…). On a high level (Aileen 3 Agent), priors are not facts the model has to re-discover – they are the baseline
	that makes it easier to spot surprises, new actors, and genuinely novel claims later on. On a low level (Aileen 3 Core), this concept applies
	to transcription: spelling out how the 🇩🇪 “NOOTS-Staatsvertrag” (data sharing treaty) is supposed to look in writing gives the model a
	strong prior against hallucinating emergency state super powers (🇩🇪 “Notstaatsvertrag”).

	Multi-modal models such as Google Gemini go one step further than Whisper: they may even accept priors that are not plain text.
	Images of slides from a talk,
	agenda screenshots, or diagrams can be ingested alongside the audio to potentially provide a much richer prior. Internally, the Aileen MCP already
	extracts representative slide images from long-form talks so that this kind of multi-modal prior can be used in downstream analysis – we
	will lean on the same building blocks for transcription.

	### 💁🏻‍♀️ Demo
	In the demo below, we are going to treat the YouTube video description of the Smart Country Convention session as a prior. While the Aileen MCP
	server's transcription tool allows a user supplied text prior to be passed, we will rely on its internal extraction of the video
	metadata. That way, the Google Gemini model sees both the audio of the talk and a text prior that
	spells out the intended terminology.

	The goal is to see whether supplying the description as a prior helps the transcription stay anchored on the term of the
	data sharing treaty (NOOTS-Staatsvertrag) instead of hallucinating emergency state (“Notstaatsvertrag”) when the audio is noisy or ambiguous.

	Before running the demo, we will first run the health check cell to verify that ffmpeg, yt-dlp, the Aileen MCP server, and your Gemini API
	key are all wired up correctly.
	"""
	)