
# Aileen3 MCP Server

Lightweight stdio MCP server exposing Aileen 3’s media tools for use by the Gradio demo, Claude Desktop, and other MCP clients.

## Quick start

```bash
python -m pip install -e ./mcp
aileen3-mcp  # starts the stdio MCP server
```

The server entrypoint is `aileen3_mcp.server.make_app`, which registers all tools on a `FastMCP` instance.

In short, the public tools are:

  • health
  • search_youtube
  • start_media_retrieval / get_media_retrieval_status
  • start_slide_extraction / get_extracted_slides
  • translate_slide
  • start_media_analysis / get_media_analysis_result
  • start_media_transcription / get_media_transcription_result

These tools are designed to be called from an agentic chat interface that:

  • first chooses a media source (optionally using search_youtube)
  • then calls start_media_retrieval
  • and finally uses the reference token to drive analysis, transcription, or slide translation.
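That flow can be sketched as follows; `call_tool` here is a hypothetical stand-in for whatever MCP client dispatcher the agent uses, not part of the server API:

```python
def run_flow(call_tool, query):
    """Sketch of the agentic flow: search, retrieve, then analyze.

    call_tool(name, **kwargs) is a hypothetical MCP tool dispatcher;
    it is not part of the aileen3_mcp API.
    """
    videos = call_tool("search_youtube", query=query, max_results=5)["videos"]
    source = videos[0]["webpage_url"]  # pick the top hit for brevity
    job = call_tool("start_media_retrieval", source=source)
    # The reference token drives every downstream tool.
    return call_tool("start_media_analysis", reference=job["reference"], priors={})
```

In practice the agent would inspect the candidate list and poll retrieval status rather than chaining calls blindly.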

## MCP tools and definitions

### Health and search

  • health() -> { ok, detail, ffmpeg, gemini_api_key }

    • Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that ffmpeg is callable and GEMINI_API_KEY is present.
    • Usage: Call before running longer flows to surface missing runtime dependencies early.
  • search_youtube(query: str, max_results: int = 10) -> { videos: [...] }

    • Purpose: Fast YouTube search using yt-dlp (no downloads).
    • Arguments:
      • query (required): Free-form search terms (e.g. "taler auditor bachelorthesis").
      • max_results (optional, default 10, clamped to 1–50).
    • Returns: videos list with id, title, webpage_url, duration_seconds, channel, channel_id.
    • Typical flow: Use from an agent to shortlist candidate videos before picking one source for retrieval.
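The clamping behaviour described for `max_results` can be sketched like this (hypothetical helper, not the server's actual code):

```python
def clamp_max_results(value=None, default=10, low=1, high=50):
    """Clamp max_results to the documented 1-50 range (default 10)."""
    if value is None:
        value = default
    return max(low, min(high, int(value)))
```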

### Media retrieval (entry point)

  • start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict

    • Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
    • Arguments:
      • source: YouTube URL/ID, podcast URL, or other yt-dlp-supported locator.
      • prefer_audio_only: When true, prefer audio-first formats; use when visuals are not needed.
• wait_seconds: How long to block waiting for completion; if the job is still running when the window elapses, you get the current status plus the reference token.
    • Returns:
      • On success: { reference, status: "done", metadata: {...}, cached? }
      • In progress: { reference, status: "pending"|"running", progress?, job_id }
      • On error: { is_error: true, status, detail, reference }
    • Typical flow: This is the first call once you have chosen a source. The reference token is required for all downstream tools.
  • get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Poll the retrieval job or fetch cached metadata.
    • Returns:
      • { status: "done", reference, metadata } when cached or finished.
      • { status: "pending"|"running", ... } while in flight.
      • { status: "not_found", reference } if no job or cache exists.
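The pending/running/done envelope lends itself to a small client-side polling loop. A sketch, assuming a `get_status` callable with the same shape as `get_media_retrieval_status`:

```python
import time

def wait_for_job(get_status, reference, timeout=300, interval=5):
    """Poll a status callable until the job leaves pending/running.

    get_status(reference) is assumed to return the job envelope
    described above ({"status": ..., "reference": ..., ...}).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = get_status(reference)
        if result.get("status") not in ("pending", "running"):
            return result  # done, error, or not_found
        time.sleep(interval)
    return {"status": "timeout", "reference": reference}
```

The same pattern applies to the other `get_*` tools, since they share the envelope.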

### Slides: extraction and translation

  • start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict

    • Purpose: Extract representative slide stills from a downloaded video.
    • Note: Full media analysis (start_media_analysis) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
    • Returns: Standard job envelope with slides once done or status + job_id while running.
  • get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Fetch extracted slides or current extraction status.
    • Returns: { status: "done", reference, slides: [...] } on success, otherwise a job status or { status: "not_found" }. Slides include indices that are used by translate_slide.
  • translate_slide(reference: str, slide_index: int, language: str) -> ImageContent

    • Purpose: Translate a single slide image into another language using Gemini image-to-image.
    • Arguments:
      • reference: Token from start_media_retrieval.
• slide_index: Zero-based slide index, matching the index field of the entries in get_extracted_slides.slides[].
      • language: Target language name (e.g. "German", "Spanish").
    • Returns: ImageContent with base64-encoded translated slide image. Responses are cached per (reference, language, slide_index).
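A client can persist the translated slide roughly as follows; note that the `{"data": ...}` shape is an assumption about how ImageContent is serialized, so adjust to the client library's actual type:

```python
import base64

def save_translated_slide(image_content, path):
    """Decode a base64 image payload and write it to disk.

    image_content is assumed to carry the base64 string in a "data"
    field (an assumption about the ImageContent serialization).
    """
    raw = base64.b64decode(image_content["data"])
    with open(path, "wb") as f:
        f.write(raw)
    return len(raw)
```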

### Expectation-driven analysis

  • start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict

    • Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing surprises and new actors instead of rehashing everything.
    • Arguments:
      • reference: Token produced by start_media_retrieval.
      • priors: Object with optional string fields:
        • context: Scene setting (participants, venue, goal, spelled names).
        • expectations: What the user already expects to hear.
        • prior_knowledge: What the user already knows from past work.
        • questions: Concrete questions to be answered.
    • Important: Only populate priors with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
    • Returns: Same job envelope pattern as retrieval. When status: "done", the payload includes an analysis markdown briefing optimised for fast reading.
  • get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Poll for completion or fetch cached analysis for a reference.
    • Returns:
      • status: "done" with analysis text on success.
      • status: "pending"|"running" during processing.
      • Errors include is_error: true, detail, reference.
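A small hypothetical helper for assembling the priors object from user-supplied text, dropping empty fields so the agent never submits invented or blank priors:

```python
def build_priors(context="", expectations="", prior_knowledge="", questions=""):
    """Build a priors object containing only the fields the user supplied."""
    fields = {
        "context": context,
        "expectations": expectations,
        "prior_knowledge": prior_knowledge,
        "questions": questions,
    }
    # Omit empty fields rather than sending blank strings.
    return {k: v for k, v in fields.items() if v}
```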

### Transcription

  • start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict

    • Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
    • Arguments:
      • reference: From start_media_retrieval.
      • context: Optional grounding text with names, acronyms, or domain hints.
      • prefer_audio_only: When true, skip slide context for cheaper audio-only runs.
      • wait_seconds: Poll window before returning.
    • Returns: Job envelope, with transcription once status: "done".
  • get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict

    • Purpose: Retrieve a previously computed transcription or current job status.
    • Returns: Same pattern as get_media_analysis_result, but with transcription instead of analysis.
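Since analysis and transcription results share the same envelope but differ in payload key, a client-side accessor might look like this (sketch):

```python
def extract_text(result):
    """Pull the text payload from a finished job result.

    Per the return shapes above, analysis results carry an "analysis"
    field and transcription results a "transcription" field.
    """
    if result.get("status") != "done":
        raise ValueError(f"job not finished: {result.get('status')}")
    return result.get("analysis") or result.get("transcription")
```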