# Aileen3 MCP Server
Lightweight stdio MCP server exposing Aileen 3’s media tools for use by the Gradio demo, Claude Desktop, and other MCP clients.
## Quick start
```bash
python -m pip install -e ./mcp
aileen3-mcp # starts the stdio MCP server
```
The server entrypoint is `aileen3_mcp.server.make_app`, which registers all tools on a `FastMCP` instance.
The public tools are:
- `health`
- `search_youtube`
- `start_media_retrieval` / `get_media_retrieval_status`
- `start_slide_extraction` / `get_extracted_slides`
- `translate_slide`
- `start_media_analysis` / `get_media_analysis_result`
- `start_media_transcription` / `get_media_transcription_result`
These tools are designed to be called from an agentic chat interface that:
- first chooses a media `source` (optionally using `search_youtube`)
- then calls `start_media_retrieval`
- and finally uses the `reference` token to drive analysis, transcription, or slide translation.
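The flow above can be sketched as a plain function. This is illustrative only: `call_tool(name, args)` stands in for whatever MCP client call your agent runtime provides (e.g. `ClientSession.call_tool` in the official Python SDK), and the polling intervals are arbitrary:

```python
def run_analysis_flow(call_tool, query, priors):
    """Illustrative agent flow: search, retrieve, then analyse.

    `call_tool(name, args)` is a stand-in for an MCP client call.
    """
    # 1. Shortlist candidates and pick a source.
    videos = call_tool("search_youtube", {"query": query, "max_results": 5})["videos"]
    if not videos:
        return None
    source = videos[0]["webpage_url"]

    # 2. Start retrieval; keep the reference token for all later calls.
    job = call_tool("start_media_retrieval", {"source": source})
    reference = job["reference"]
    while job["status"] in ("pending", "running"):
        job = call_tool("get_media_retrieval_status",
                        {"reference": reference, "wait_seconds": 30})

    # 3. Drive downstream analysis with the reference.
    result = call_tool("start_media_analysis",
                       {"reference": reference, "priors": priors})
    while result["status"] in ("pending", "running"):
        result = call_tool("get_media_analysis_result",
                           {"reference": reference, "wait_seconds": 30})
    return result.get("analysis")
```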
## MCP tools and definitions
### Health and search
- `health() -> { ok, detail, ffmpeg, gemini_api_key }`
- Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that `ffmpeg` is callable and `GEMINI_API_KEY` is present.
- Usage: Call before running longer flows to surface missing runtime dependencies early.
- `search_youtube(query: str, max_results: int = 10) -> { videos: [...] }`
- Purpose: Fast YouTube search using `yt-dlp` (no downloads).
- Arguments:
- `query` (required): Free-form search terms (e.g. `"taler auditor bachelorthesis"`).
- `max_results` (optional, default `10`, clamped to `1–50`).
- Returns: `videos` list with `id`, `title`, `webpage_url`, `duration_seconds`, `channel`, `channel_id`.
- Typical flow: Use from an agent to shortlist candidate videos before picking one `source` for retrieval.
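The `max_results` clamping described above amounts to (a sketch of the documented behaviour, not the server's actual code):

```python
def clamp_max_results(value: int) -> int:
    # search_youtube clamps max_results to the documented 1-50 range.
    return max(1, min(50, value))
```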
### Media retrieval (entry point)
- `start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict`
- Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
- Arguments:
- `source`: YouTube URL/ID, podcast URL, or other `yt-dlp`-supported locator.
- `prefer_audio_only`: When `true`, prefer audio-first formats; use when visuals are not needed.
    - `wait_seconds`: How long to block before returning; if the job is still running when the window elapses, the response contains the current status plus the `reference`.
- Returns:
- On success: `{ reference, status: "done", metadata: {...}, cached? }`
- In progress: `{ reference, status: "pending"|"running", progress?, job_id }`
- On error: `{ is_error: true, status, detail, reference }`
- Typical flow: This is the first call once you have chosen a `source`. The `reference` token is required for all downstream tools.
- `get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict`
- Purpose: Poll the retrieval job or fetch cached metadata.
- Returns:
- `{ status: "done", reference, metadata }` when cached or finished.
- `{ status: "pending"|"running", ... }` while in flight.
- `{ status: "not_found", reference }` if no job or cache exists.
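All `get_*` tools share this envelope, so a client can poll them uniformly. A minimal sketch, again with a hypothetical `call_tool` standing in for the MCP client call:

```python
def poll_until_done(call_tool, tool_name, reference,
                    wait_seconds=30, max_polls=20):
    """Poll a status tool until the job leaves pending/running.

    Returns the final envelope: status may be "done", "not_found",
    or an error payload with is_error set.
    """
    for _ in range(max_polls):
        envelope = call_tool(tool_name, {"reference": reference,
                                         "wait_seconds": wait_seconds})
        if envelope["status"] not in ("pending", "running"):
            return envelope
    raise TimeoutError(f"{tool_name} still running for {reference}")
```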
### Slides: extraction and translation
- `start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict`
- Purpose: Extract representative slide stills from a downloaded video.
- Note: Full media analysis (`start_media_analysis`) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
- Returns: Standard job envelope with `slides` once done or `status` + `job_id` while running.
- `get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict`
- Purpose: Fetch extracted slides or current extraction status.
- Returns: `{ status: "done", reference, slides: [...] }` on success, otherwise a job status or `{ status: "not_found" }`. Slides include indices that are used by `translate_slide`.
- `translate_slide(reference: str, slide_index: int, language: str) -> ImageContent`
- Purpose: Translate a single slide image into another language using Gemini image-to-image.
- Arguments:
- `reference`: Token from `start_media_retrieval`.
    - `slide_index`: Zero-based slide index, as reported in `get_extracted_slides.slides[].index`.
- `language`: Target language name (e.g. `"German"`, `"Spanish"`).
- Returns: `ImageContent` with base64-encoded translated slide image. Responses are cached per `(reference, language, slide_index)`.
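Because the server caches per `(reference, language, slide_index)`, a client can mirror the same key to skip redundant round-trips. An illustrative client-side wrapper (the `call_tool` stub is hypothetical):

```python
_slide_cache = {}

def translate_slide_cached(call_tool, reference, slide_index, language):
    # Mirror the server's cache key to avoid repeat translate_slide calls.
    key = (reference, language, slide_index)
    if key not in _slide_cache:
        _slide_cache[key] = call_tool("translate_slide", {
            "reference": reference,
            "slide_index": slide_index,
            "language": language,
        })
    return _slide_cache[key]
```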
### Expectation-driven analysis
- `start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict`
- Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing *surprises* and *new actors* instead of rehashing everything.
- Arguments:
- `reference`: Token produced by `start_media_retrieval`.
- `priors`: Object with optional string fields:
- `context`: Scene setting (participants, venue, goal, spelled names).
- `expectations`: What the user already expects to hear.
- `prior_knowledge`: What the user already knows from past work.
- `questions`: Concrete questions to be answered.
- Important: Only populate `priors` with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
- Returns: Same job envelope pattern as retrieval. When `status: "done"`, the payload includes an `analysis` markdown briefing optimised for fast reading.
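A client might assemble `priors` along these lines, including only fields the user actually supplied (field names as listed above; the helper itself is illustrative):

```python
def build_priors(context="", expectations="", prior_knowledge="", questions=""):
    # Include only priors the user actually provided; never invent them.
    fields = {
        "context": context,
        "expectations": expectations,
        "prior_knowledge": prior_knowledge,
        "questions": questions,
    }
    return {key: value for key, value in fields.items() if value.strip()}
```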
- `get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict`
- Purpose: Poll for completion or fetch cached analysis for a `reference`.
- Returns:
- `status: "done"` with `analysis` text on success.
- `status: "pending"|"running"` during processing.
- Errors include `is_error: true`, `detail`, `reference`.
### Transcription
- `start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict`
- Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
- Arguments:
- `reference`: From `start_media_retrieval`.
- `context`: Optional grounding text with names, acronyms, or domain hints.
- `prefer_audio_only`: When `true`, skip slide context for cheaper audio-only runs.
- `wait_seconds`: Poll window before returning.
- Returns: Job envelope, with `transcription` once `status: "done"`.
- `get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict`
- Purpose: Retrieve a previously computed transcription or current job status.
- Returns: Same pattern as `get_media_analysis_result`, but with `transcription` instead of `analysis`.
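Since the two result tools differ only in the payload field, a client can handle both uniformly. A sketch under that assumption:

```python
RESULT_FIELDS = {
    "get_media_analysis_result": "analysis",
    "get_media_transcription_result": "transcription",
}

def extract_result(tool_name, envelope):
    # Both tools share one envelope; only the payload key differs.
    if envelope.get("is_error"):
        raise RuntimeError(envelope.get("detail", "job failed"))
    if envelope["status"] != "done":
        return None  # still pending/running, or not_found
    return envelope[RESULT_FIELDS[tool_name]]
```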