Commit 84eaabe by ndurner · Parent: 5f045d0

move detailed MCP interface doc to mcp/README.md

Files changed (3):
  1. .github/README.md +1 -1
  2. README.md +21 -90
  3. mcp/README.md +92 -2
.github/README.md CHANGED

````diff
@@ -132,7 +132,7 @@ In short, the public tools are:
 - `start_media_analysis` / `get_media_analysis_result`
 - `start_media_transcription` / `get_media_transcription_result`
 
-These tools are designed to be called from an agentic chat interface that:
+These tools are designed to be called from an agentic AI system that:
 
 - first chooses a media `source` (optionally using `search_youtube`)
 - then calls `start_media_retrieval`
````
README.md CHANGED

````diff
@@ -132,96 +132,27 @@ When integrating this MCP into your own agent or client:
 - Set transport-level timeouts generously (10–20 minutes) and rely on the tools’ `wait_seconds` argument plus status polling for progress.
 - Ensure `GEMINI_API_KEY` (and any optional `AILEEN3_*` variables you use) are visible in the environment of the MCP server process, not just the client.
 
-### 🛠️ MCP tools and definitions
-#### 🩺 Health and search
-
-- `health() -> { ok, detail, ffmpeg, gemini_api_key }`
-  - Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that `ffmpeg` is callable and `GEMINI_API_KEY` is present.
-  - Usage: Call before running longer flows to surface missing runtime dependencies early.
-
-- `search_youtube(query: str, max_results: int = 10) -> { videos: [...] }`
-  - Purpose: Fast YouTube search using `yt-dlp` (no downloads).
-  - Arguments:
-    - `query` (required): Free-form search terms (e.g. `"taler auditor bachelorthesis"`).
-    - `max_results` (optional, default `10`, clamped to `1–50`).
-  - Returns: `videos` list with `id`, `title`, `webpage_url`, `duration_seconds`, `channel`, `channel_id`.
-  - Typical flow: Use from an agent to shortlist candidate videos before picking one `source` for retrieval.
-
-#### 📺 Media retrieval (entry point)
-
-- `start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict`
-  - Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
-  - Arguments:
-    - `source`: YouTube URL/ID, podcast URL, or other `yt-dlp`-supported locator.
-    - `prefer_audio_only`: When `true`, prefer audio-first formats; use when visuals are not needed.
-    - `wait_seconds`: How long to block before returning; if the job is still running, you get status + reference.
-  - Returns:
-    - On success: `{ reference, status: "done", metadata: {...}, cached? }`
-    - In progress: `{ reference, status: "pending"|"running", progress?, job_id }`
-    - On error: `{ is_error: true, status, detail, reference }`
-  - Typical flow: This is the first call once you have chosen a `source`. The `reference` token is required for all downstream tools.
-
-- `get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict`
-  - Purpose: Poll the retrieval job or fetch cached metadata.
-  - Returns:
-    - `{ status: "done", reference, metadata }` when cached or finished.
-    - `{ status: "pending"|"running", ... }` while in flight.
-    - `{ status: "not_found", reference }` if no job or cache exists.
-
-#### 🖼️ Slides: extraction and translation
-
-- `start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict`
-  - Purpose: Extract representative slide stills from a downloaded video.
-  - Note: Full media analysis (`start_media_analysis`) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
-  - Returns: Standard job envelope with `slides` once done or `status` + `job_id` while running.
-
-- `get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict`
-  - Purpose: Fetch extracted slides or current extraction status.
-  - Returns: `{ status: "done", reference, slides: [...] }` on success, otherwise a job status or `{ status: "not_found" }`. Slides include indices that are used by `translate_slide`.
-
-- `translate_slide(reference: str, slide_index: int, language: str) -> ImageContent`
-  - Purpose: Translate a single slide image into another language using Gemini image-to-image.
-  - Arguments:
-    - `reference`: Token from `start_media_retrieval`.
-    - `slide_index`: Zero-based index into `get_extracted_slides.slides[].index`.
-    - `language`: Target language name (e.g. `"German"`, `"Spanish"`).
-  - Returns: `ImageContent` with base64-encoded translated slide image. Responses are cached per `(reference, language, slide_index)`.
-
-#### ⛳️ Expectation-driven analysis
-
-- `start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict`
-  - Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing *surprises* and *new actors* instead of rehashing everything.
-  - Arguments:
-    - `reference`: Token produced by `start_media_retrieval`.
-    - `priors`: Object with optional string fields:
-      - `context`: Scene setting (participants, venue, goal, spelled names).
-      - `expectations`: What the user already expects to hear.
-      - `prior_knowledge`: What the user already knows from past work.
-      - `questions`: Concrete questions to be answered.
-  - Important: Only populate `priors` with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
-  - Returns: Same job envelope pattern as retrieval. When `status: "done"`, the payload includes an `analysis` markdown briefing optimised for fast reading.
-
-- `get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict`
-  - Purpose: Poll for completion or fetch cached analysis for a `reference`.
-  - Returns:
-    - `status: "done"` with `analysis` text on success.
-    - `status: "pending"|"running"` during processing.
-    - Errors include `is_error: true`, `detail`, `reference`.
-
-#### ✍️ Transcription
-
-- `start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict`
-  - Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
-  - Arguments:
-    - `reference`: From `start_media_retrieval`.
-    - `context`: Optional grounding text with names, acronyms, or domain hints.
-    - `prefer_audio_only`: When `true`, skip slide context for cheaper audio-only runs.
-    - `wait_seconds`: Poll window before returning.
-  - Returns: Job envelope, with `transcription` once `status: "done"`.
-
-- `get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict`
-  - Purpose: Retrieve a previously computed transcription or current job status.
-  - Returns: Same pattern as `get_media_analysis_result`, but with `transcription` instead of `analysis`.
+### 🛠️ MCP tools overview
+
+All tools are registered in `aileen3_mcp.server.make_app` and exposed via a stdio MCP server for use by the Gradio demo, Claude Desktop, and other clients.
+
+In short, the public tools are:
+
+- `health`
+- `search_youtube`
+- `start_media_retrieval` / `get_media_retrieval_status`
+- `start_slide_extraction` / `get_extracted_slides`
+- `translate_slide`
+- `start_media_analysis` / `get_media_analysis_result`
+- `start_media_transcription` / `get_media_transcription_result`
+
+These tools are designed to be called from an agentic AI system that:
+
+- first chooses a media `source` (optionally using `search_youtube`)
+- then calls `start_media_retrieval`
+- and finally uses the `reference` token to drive analysis, transcription, or slide translation.
+
+For detailed tool contracts (arguments, return payloads, and error shapes), see `mcp/README.md`.
 
 ## 🏆 Hackathon Context & Journey
 Aileen 3 Core was built for the [MCP's 1st Birthday - Hosted by Anthropic and Gradio](https://huggingface.co/MCP-1st-Birthday) and serves as the backbone for the [Aileen 3 Agent](https://ndurner.de/links/aileen3-kaggle-writeup) (developed for the [AI Agents Intensive Course with Google](https://www.kaggle.com/learn-guide/5-day-agents)).
````
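The integration guidance in this diff (generous transport timeouts, the tools' `wait_seconds` argument plus status polling) can be sketched as a small client-side helper. This is an illustrative sketch, not code from the repo: `call_tool` stands in for whatever MCP client invocation you use, and only the envelope fields documented in the READMEs (`status`, `is_error`, `detail`) are assumed.

```python
import time


def wait_for_result(call_tool, poll_name, reference,
                    wait_seconds=30, timeout=1200, interval=5):
    """Poll a get_*_result / get_*_status tool until the job envelope settles.

    `call_tool(name, args)` is a placeholder for an MCP tool invocation.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        envelope = call_tool(poll_name, {"reference": reference,
                                         "wait_seconds": wait_seconds})
        status = envelope.get("status")
        if envelope.get("is_error") or status == "not_found":
            raise RuntimeError(envelope.get("detail", str(status)))
        if status == "done":
            return envelope
        # "pending" / "running": keep polling within the overall timeout
        time.sleep(interval)
    raise TimeoutError(f"{poll_name} for {reference} did not finish in time")
```

The outer `timeout` mirrors the advice to keep transport-level timeouts generous (10–20 minutes) while letting each individual call block only for `wait_seconds`.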
mcp/README.md CHANGED

````diff
@@ -9,7 +9,7 @@ python -m pip install -e ./mcp
 aileen3-mcp # starts the stdio MCP server
 ```
 
-The server entrypoint is `aileen3_mcp.server.make_app`, which registers all tools on a `FastMCP` instance. For a complete description of available tools (health probes, YouTube search, media retrieval, slide extraction and translation, analysis, transcription), see the project root `README.md` under **“MCP tools and interface”**.
+The server entrypoint is `aileen3_mcp.server.make_app`, which registers all tools on a `FastMCP` instance.
 
 In short, the public tools are:
 
@@ -27,4 +27,94 @@ These tools are designed to be called from an agentic chat interface that:
 - then calls `start_media_retrieval`
 - and finally uses the `reference` token to drive analysis, transcription, or slide translation.
 
-For detailed contracts (arguments, return payloads, and example usage), consult `README.md` in the repository root.
+## MCP tools and definitions
+
+### Health and search
+
+- `health() -> { ok, detail, ffmpeg, gemini_api_key }`
+  - Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that `ffmpeg` is callable and `GEMINI_API_KEY` is present.
+  - Usage: Call before running longer flows to surface missing runtime dependencies early.
+
+- `search_youtube(query: str, max_results: int = 10) -> { videos: [...] }`
+  - Purpose: Fast YouTube search using `yt-dlp` (no downloads).
+  - Arguments:
+    - `query` (required): Free-form search terms (e.g. `"taler auditor bachelorthesis"`).
+    - `max_results` (optional, default `10`, clamped to `1–50`).
+  - Returns: `videos` list with `id`, `title`, `webpage_url`, `duration_seconds`, `channel`, `channel_id`.
+  - Typical flow: Use from an agent to shortlist candidate videos before picking one `source` for retrieval.
+
+### Media retrieval (entry point)
+
+- `start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict`
+  - Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
+  - Arguments:
+    - `source`: YouTube URL/ID, podcast URL, or other `yt-dlp`-supported locator.
+    - `prefer_audio_only`: When `true`, prefer audio-first formats; use when visuals are not needed.
+    - `wait_seconds`: How long to block before returning; if the job is still running, you get status + reference.
+  - Returns:
+    - On success: `{ reference, status: "done", metadata: {...}, cached? }`
+    - In progress: `{ reference, status: "pending"|"running", progress?, job_id }`
+    - On error: `{ is_error: true, status, detail, reference }`
+  - Typical flow: This is the first call once you have chosen a `source`. The `reference` token is required for all downstream tools.
+
+- `get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict`
+  - Purpose: Poll the retrieval job or fetch cached metadata.
+  - Returns:
+    - `{ status: "done", reference, metadata }` when cached or finished.
+    - `{ status: "pending"|"running", ... }` while in flight.
+    - `{ status: "not_found", reference }` if no job or cache exists.
+
+### Slides: extraction and translation
+
+- `start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict`
+  - Purpose: Extract representative slide stills from a downloaded video.
+  - Note: Full media analysis (`start_media_analysis`) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
+  - Returns: Standard job envelope with `slides` once done or `status` + `job_id` while running.
+
+- `get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict`
+  - Purpose: Fetch extracted slides or current extraction status.
+  - Returns: `{ status: "done", reference, slides: [...] }` on success, otherwise a job status or `{ status: "not_found" }`. Slides include indices that are used by `translate_slide`.
+
+- `translate_slide(reference: str, slide_index: int, language: str) -> ImageContent`
+  - Purpose: Translate a single slide image into another language using Gemini image-to-image.
+  - Arguments:
+    - `reference`: Token from `start_media_retrieval`.
+    - `slide_index`: Zero-based index into `get_extracted_slides.slides[].index`.
+    - `language`: Target language name (e.g. `"German"`, `"Spanish"`).
+  - Returns: `ImageContent` with base64-encoded translated slide image. Responses are cached per `(reference, language, slide_index)`.
+
+### Expectation-driven analysis
+
+- `start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict`
+  - Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing *surprises* and *new actors* instead of rehashing everything.
+  - Arguments:
+    - `reference`: Token produced by `start_media_retrieval`.
+    - `priors`: Object with optional string fields:
+      - `context`: Scene setting (participants, venue, goal, spelled names).
+      - `expectations`: What the user already expects to hear.
+      - `prior_knowledge`: What the user already knows from past work.
+      - `questions`: Concrete questions to be answered.
+  - Important: Only populate `priors` with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
+  - Returns: Same job envelope pattern as retrieval. When `status: "done"`, the payload includes an `analysis` markdown briefing optimised for fast reading.
+
+- `get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict`
+  - Purpose: Poll for completion or fetch cached analysis for a `reference`.
+  - Returns:
+    - `status: "done"` with `analysis` text on success.
+    - `status: "pending"|"running"` during processing.
+    - Errors include `is_error: true`, `detail`, `reference`.
+
+### Transcription
+
+- `start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict`
+  - Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
+  - Arguments:
+    - `reference`: From `start_media_retrieval`.
+    - `context`: Optional grounding text with names, acronyms, or domain hints.
+    - `prefer_audio_only`: When `true`, skip slide context for cheaper audio-only runs.
+    - `wait_seconds`: Poll window before returning.
+  - Returns: Job envelope, with `transcription` once `status: "done"`.
+
+- `get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict`
+  - Purpose: Retrieve a previously computed transcription or current job status.
+  - Returns: Same pattern as `get_media_analysis_result`, but with `transcription` instead of `analysis`.
````
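Taken together, the tool contracts moved into `mcp/README.md` suggest a straightforward end-to-end driver: search, retrieve, then thread the `reference` token into analysis. The sketch below is hypothetical and not part of the repo; `call_tool(name, args)` is a placeholder for your MCP client's tool invocation, while the tool names and envelope fields are exactly those documented above.

```python
def run_analysis_flow(call_tool, query, priors):
    """Search -> retrieve -> analyze, threading the `reference` token through."""
    # 1. Shortlist candidates with search_youtube and pick a source.
    videos = call_tool("search_youtube",
                       {"query": query, "max_results": 5})["videos"]
    source = videos[0]["webpage_url"]

    # 2. Start retrieval; the reference token keys all downstream tools.
    job = call_tool("start_media_retrieval",
                    {"source": source, "wait_seconds": 54})
    reference = job["reference"]
    while job.get("status") in ("pending", "running"):
        job = call_tool("get_media_retrieval_status",
                        {"reference": reference, "wait_seconds": 30})
    if job.get("is_error"):
        raise RuntimeError(job.get("detail", "retrieval failed"))

    # 3. Expectation-driven analysis, with priors supplied by the user
    #    (per the docs, the agent must not invent priors).
    result = call_tool("start_media_analysis",
                       {"reference": reference, "priors": priors,
                        "wait_seconds": 55})
    while result.get("status") in ("pending", "running"):
        result = call_tool("get_media_analysis_result",
                           {"reference": reference, "wait_seconds": 30})
    if result.get("is_error"):
        raise RuntimeError(result.get("detail", "analysis failed"))
    return result["analysis"]
```

In a real client the polling loops would also enforce an overall deadline, as recommended by the transport-timeout guidance in `README.md`.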