# Aileen3 MCP Server

Lightweight stdio MCP server exposing Aileen 3’s media tools for use by the Gradio demo, Claude Desktop, and other MCP clients.
## Quick start

```bash
python -m pip install -e ./mcp
aileen3-mcp   # starts the stdio MCP server
```
The server entrypoint is `aileen3_mcp.server.make_app`, which registers all tools on a `FastMCP` instance.
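For clients other than Claude Desktop or the Gradio demo, the server can also be driven programmatically. Below is a minimal connection sketch assuming the official MCP Python SDK (`mcp` package) as the client; only the `aileen3-mcp` command comes from this repo, the rest is illustrative client code.

```python
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch aileen3-mcp as a subprocess and speak MCP over its stdio pipes.
    params = StdioServerParameters(command="aileen3-mcp")
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

asyncio.run(main())
```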
In short, the public tools are:

- `health`
- `search_youtube`
- `start_media_retrieval` / `get_media_retrieval_status`
- `start_slide_extraction` / `get_extracted_slides`
- `translate_slide`
- `start_media_analysis` / `get_media_analysis_result`
- `start_media_transcription` / `get_media_transcription_result`
These tools are designed to be called from an agentic chat interface that:

- first chooses a media `source` (optionally using `search_youtube`),
- then calls `start_media_retrieval`,
- and finally uses the `reference` token to drive analysis, transcription, or slide translation, as sketched below.
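As a rough illustration of that flow (assuming an open `ClientSession` named `session` as in the connection sketch above, and assuming the server returns its dict payloads as JSON text content):

```python
import json

async def analyse_first_hit(session, query: str) -> str:
    # 1. Shortlist candidate videos (no downloads happen here).
    hits = await session.call_tool("search_youtube", {"query": query, "max_results": 5})
    videos = json.loads(hits.content[0].text)["videos"]

    # 2. Retrieve the chosen source; the reference token drives everything downstream.
    #    (Assumes retrieval finishes within wait_seconds; see the polling sketch further down.)
    retrieval = await session.call_tool(
        "start_media_retrieval",
        {"source": videos[0]["webpage_url"], "wait_seconds": 54},
    )
    reference = json.loads(retrieval.content[0].text)["reference"]

    # 3. Use the reference for analysis (could equally be transcription or slides).
    analysis = await session.call_tool("start_media_analysis", {"reference": reference, "priors": {}})
    return analysis.content[0].text
```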
## MCP tools and definitions
### Health and search
- `health() -> { ok, detail, ffmpeg, gemini_api_key }`
  - Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that `ffmpeg` is callable and `GEMINI_API_KEY` is present.
  - Usage: Call before running longer flows to surface missing runtime dependencies early.
- `search_youtube(query: str, max_results: int = 10) -> { videos: [...] }`
  - Purpose: Fast YouTube search using `yt-dlp` (no downloads).
  - Arguments:
    - `query` (required): Free-form search terms (e.g. `"taler auditor bachelorthesis"`).
    - `max_results` (optional, default `10`, clamped to `1–50`).
  - Returns: `videos` list with `id`, `title`, `webpage_url`, `duration_seconds`, `channel`, `channel_id`.
  - Typical flow: Use from an agent to shortlist candidate videos before picking one `source` for retrieval.
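A small usage sketch (same `session` and JSON-text assumptions as above) that filters results before picking one `source`:

```python
import json

async def shortlist(session, query: str, max_minutes: int = 90) -> list[dict]:
    result = await session.call_tool("search_youtube", {"query": query, "max_results": 10})
    videos = json.loads(result.content[0].text)["videos"]
    # Keep only candidates short enough to be worth retrieving in full.
    return [v for v in videos if (v.get("duration_seconds") or 0) <= max_minutes * 60]
```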
### Media retrieval (entry point)
- `start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict`
  - Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
  - Arguments:
    - `source`: YouTube URL/ID, podcast URL, or other `yt-dlp`-supported locator.
    - `prefer_audio_only`: When `true`, prefer audio-first formats; use when visuals are not needed.
    - `wait_seconds`: How long to block before returning; if the job is still running, you get status + reference.
  - Returns:
    - On success: `{ reference, status: "done", metadata: {...}, cached? }`
    - In progress: `{ reference, status: "pending"|"running", progress?, job_id }`
    - On error: `{ is_error: true, status, detail, reference }`
  - Typical flow: This is the first call once you have chosen a `source`. The `reference` token is required for all downstream tools.
- `get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Poll the retrieval job or fetch cached metadata.
  - Returns:
    - `{ status: "done", reference, metadata }` when cached or finished.
    - `{ status: "pending"|"running", ... }` while in flight.
    - `{ status: "not_found", reference }` if no job or cache exists.
### Slides: extraction and translation
- `start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict`
  - Purpose: Extract representative slide stills from a downloaded video.
  - Note: Full media analysis (`start_media_analysis`) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
  - Returns: Standard job envelope with `slides` once done, or `status` + `job_id` while running.
- `get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Fetch extracted slides or current extraction status.
  - Returns: `{ status: "done", reference, slides: [...] }` on success, otherwise a job status or `{ status: "not_found" }`. Slides include indices that are used by `translate_slide`.
- `translate_slide(reference: str, slide_index: int, language: str) -> ImageContent`
  - Purpose: Translate a single slide image into another language using Gemini image-to-image.
  - Arguments:
    - `reference`: Token from `start_media_retrieval`.
    - `slide_index`: Zero-based index into `get_extracted_slides.slides[].index`.
    - `language`: Target language name (e.g. `"German"`, `"Spanish"`).
  - Returns: `ImageContent` with a base64-encoded translated slide image. Responses are cached per `(reference, language, slide_index)`.
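Unlike the other tools, `translate_slide` returns MCP image content rather than a JSON dict, so the client reads the base64 payload straight from the content item. A sketch assuming the standard MCP `ImageContent` fields (`data`, `mimeType`):

```python
import base64

async def save_translated_slide(
    session, reference: str, slide_index: int, language: str, path: str
) -> None:
    result = await session.call_tool(
        "translate_slide",
        {"reference": reference, "slide_index": slide_index, "language": language},
    )
    image = result.content[0]  # ImageContent: base64 data plus a mime type
    with open(path, "wb") as fh:
        fh.write(base64.b64decode(image.data))
```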
### Expectation-driven analysis
- `start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict`
  - Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing surprises and new actors instead of rehashing everything.
  - Arguments:
    - `reference`: Token produced by `start_media_retrieval`.
    - `priors`: Object with optional string fields:
      - `context`: Scene setting (participants, venue, goal, spelled names).
      - `expectations`: What the user already expects to hear.
      - `prior_knowledge`: What the user already knows from past work.
      - `questions`: Concrete questions to be answered.
  - Important: Only populate `priors` with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
  - Returns: Same job envelope pattern as retrieval. When `status: "done"`, the payload includes an `analysis` markdown briefing optimised for fast reading.
- `get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Poll for completion or fetch cached analysis for a `reference`.
  - Returns:
    - `status: "done"` with `analysis` text on success.
    - `status: "pending"|"running"` during processing.
    - Errors include `is_error: true`, `detail`, `reference`.
### Transcription
- `start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict`
  - Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
  - Arguments:
    - `reference`: From `start_media_retrieval`.
    - `context`: Optional grounding text with names, acronyms, or domain hints.
    - `prefer_audio_only`: When `true`, skip slide context for cheaper audio-only runs.
    - `wait_seconds`: Poll window before returning.
  - Returns: Job envelope, with `transcription` once `status: "done"`.
- `get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Retrieve a previously computed transcription or current job status.
  - Returns: Same pattern as `get_media_analysis_result`, but with `transcription` instead of `analysis`.
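Transcription follows the same start/poll pattern as analysis. A short sketch (same `session` and JSON-text assumptions as the earlier examples):

```python
import asyncio
import json

async def transcribe(session, reference: str, context: str = "") -> str:
    await session.call_tool(
        "start_media_transcription",
        {"reference": reference, "context": context, "wait_seconds": 55},
    )
    while True:
        result = await session.call_tool(
            "get_media_transcription_result", {"reference": reference, "wait_seconds": 0}
        )
        payload = json.loads(result.content[0].text)
        if payload.get("status") == "done":
            return payload["transcription"]
        if payload.get("is_error"):
            raise RuntimeError(payload.get("detail", "transcription failed"))
        await asyncio.sleep(10)
```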