
# Aileen3 MCP Server

Lightweight stdio MCP server exposing Aileen 3’s media tools for use by the Gradio demo, Claude Desktop, and other MCP clients.

## Quick start

```bash
python -m pip install -e ./mcp
aileen3-mcp  # starts the stdio MCP server
```

The server entrypoint is `aileen3_mcp.server.make_app`, which registers all tools on a `FastMCP` instance.

In short, the public tools are:

  • health
  • search_youtube
  • start_media_retrieval / get_media_retrieval_status
  • start_slide_extraction / get_extracted_slides
  • translate_slide
  • start_media_analysis / get_media_analysis_result
  • start_media_transcription / get_media_transcription_result

These tools are designed to be called from an agentic chat interface that:

  • first chooses a media source (optionally using search_youtube)
  • then calls start_media_retrieval
  • and finally uses the reference token to drive analysis, transcription, or slide translation.
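That flow can be sketched as follows; `call_tool` here is a hypothetical stand-in for whatever MCP client dispatcher the agent uses, not part of the server API:

```python
def run_flow(call_tool, query):
    """Sketch of the agentic flow: search, retrieve, then analyze.

    call_tool(name, **kwargs) is a hypothetical MCP tool dispatcher;
    it is not part of the aileen3_mcp API.
    """
    videos = call_tool("search_youtube", query=query, max_results=5)["videos"]
    source = videos[0]["webpage_url"]  # pick the top hit for brevity
    job = call_tool("start_media_retrieval", source=source)
    # The reference token drives every downstream tool.
    return call_tool("start_media_analysis", reference=job["reference"], priors={})
```

In practice the agent would inspect the candidate list and poll retrieval status rather than chaining calls blindly.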

## MCP tools and definitions

### Health and search

  • health() -> { ok, detail, ffmpeg, gemini_api_key }

    • Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that ffmpeg is callable and GEMINI_API_KEY is present.
    • Usage: Call before running longer flows to surface missing runtime dependencies early.
  • search_youtube(query: str, max_results: int = 10) -> { videos: [...] }

    • Purpose: Fast YouTube search using yt-dlp (no downloads).
    • Arguments:
      • query (required): Free-form search terms (e.g. "taler auditor bachelorthesis").
      • max_results (optional, default 10, clamped to 1–50).
    • Returns: videos list with id, title, webpage_url, duration_seconds, channel, channel_id.
    • Typical flow: Use from an agent to shortlist candidate videos before picking one source for retrieval.
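The clamping behaviour described for `max_results` can be sketched like this (hypothetical helper, not the server's actual code):

```python
def clamp_max_results(value=None, default=10, low=1, high=50):
    """Clamp max_results to the documented 1-50 range (default 10)."""
    if value is None:
        value = default
    return max(low, min(high, int(value)))
```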

### Media retrieval (entry point)

  • start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict

    • Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
    • Arguments:
      • source: YouTube URL/ID, podcast URL, or other yt-dlp-supported locator.
      • prefer_audio_only: When true, prefer audio-first formats; use when visuals are not needed.
• wait_seconds: How long to block waiting for completion; if the job is still running when the window elapses, you get the current status plus the reference token.
    • Returns:
      • On success: { reference, status: "done", metadata: {...}, cached? }
      • In progress: { reference, status: "pending"|"running", progress?, job_id }
      • On error: { is_error: true, status, detail, reference }
    • Typical flow: This is the first call once you have chosen a source. The reference token is required for all downstream tools.
  • get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Poll the retrieval job or fetch cached metadata.
    • Returns:
      • { status: "done", reference, metadata } when cached or finished.
      • { status: "pending"|"running", ... } while in flight.
      • { status: "not_found", reference } if no job or cache exists.
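The pending/running/done envelope lends itself to a small client-side polling loop. A sketch, assuming a `get_status` callable with the same shape as `get_media_retrieval_status`:

```python
import time

def wait_for_job(get_status, reference, timeout=300, interval=5):
    """Poll a status callable until the job leaves pending/running.

    get_status(reference) is assumed to return the job envelope
    described above ({"status": ..., "reference": ..., ...}).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = get_status(reference)
        if result.get("status") not in ("pending", "running"):
            return result  # done, error, or not_found
        time.sleep(interval)
    return {"status": "timeout", "reference": reference}
```

The same pattern applies to the other `get_*` tools, since they share the envelope.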

### Slides: extraction and translation

  • start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict

    • Purpose: Extract representative slide stills from a downloaded video.
    • Note: Full media analysis (start_media_analysis) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
    • Returns: Standard job envelope with slides once done or status + job_id while running.
  • get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Fetch extracted slides or current extraction status.
    • Returns: { status: "done", reference, slides: [...] } on success, otherwise a job status or { status: "not_found" }. Slides include indices that are used by translate_slide.
  • translate_slide(reference: str, slide_index: int, language: str) -> ImageContent

    • Purpose: Translate a single slide image into another language using Gemini image-to-image.
    • Arguments:
      • reference: Token from start_media_retrieval.
• slide_index: Zero-based slide index, matching the index field of the entries in get_extracted_slides.slides[].
      • language: Target language name (e.g. "German", "Spanish").
    • Returns: ImageContent with base64-encoded translated slide image. Responses are cached per (reference, language, slide_index).
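A client can persist the translated slide roughly as follows; note that the `{"data": ...}` shape is an assumption about how ImageContent is serialized, so adjust to the client library's actual type:

```python
import base64

def save_translated_slide(image_content, path):
    """Decode a base64 image payload and write it to disk.

    image_content is assumed to carry the base64 string in a "data"
    field (an assumption about the ImageContent serialization).
    """
    raw = base64.b64decode(image_content["data"])
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)
```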

### Expectation-driven analysis

  • start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict

    • Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing surprises and new actors instead of rehashing everything.
    • Arguments:
      • reference: Token produced by start_media_retrieval.
      • priors: Object with optional string fields:
        • context: Scene setting (participants, venue, goal, spelled names).
        • expectations: What the user already expects to hear.
        • prior_knowledge: What the user already knows from past work.
        • questions: Concrete questions to be answered.
    • Important: Only populate priors with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
    • Returns: Same job envelope pattern as retrieval. When status: "done", the payload includes an analysis markdown briefing optimised for fast reading.
  • get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Poll for completion or fetch cached analysis for a reference.
    • Returns:
      • status: "done" with analysis text on success.
      • status: "pending"|"running" during processing.
      • Errors include is_error: true, detail, reference.
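A small hypothetical helper for assembling the priors object from user-supplied text, dropping empty fields so the agent never submits invented or blank priors:

```python
def build_priors(context="", expectations="", prior_knowledge="", questions=""):
    """Build a priors object containing only the fields the user supplied."""
    fields = {
        "context": context,
        "expectations": expectations,
        "prior_knowledge": prior_knowledge,
        "questions": questions,
    }
    # Omit empty fields rather than sending blank strings.
    return {k: v for k, v in fields.items() if v}
```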

### Transcription

  • start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict

    • Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
    • Arguments:
      • reference: From start_media_retrieval.
      • context: Optional grounding text with names, acronyms, or domain hints.
      • prefer_audio_only: When true, skip slide context for cheaper audio-only runs.
      • wait_seconds: Poll window before returning.
    • Returns: Job envelope, with transcription once status: "done".
  • get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Retrieve a previously computed transcription or current job status.
    • Returns: Same pattern as get_media_analysis_result, but with transcription instead of analysis.
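Since analysis and transcription results share the same envelope but differ in payload key, a client-side accessor might look like this (sketch):

```python
def extract_text(result):
    """Pull the text payload from a finished job result.

    Per the return shapes above, analysis results carry an "analysis"
    field and transcription results a "transcription" field.
    """
    if result.get("status") != "done":
        raise ValueError(f"job not finished: {result.get('status')}")
    return result.get("analysis") or result.get("transcription")
```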