Commit: rewording

Files changed:
- .github/README.md (+236 −42)
- README.md (+238 −44)
- demo/setup_cell.py (+1 −1)
- mcp/README.md (+15 −25)
- mcp/src/aileen3_mcp/media_tools.py (+1 −1)
.github/README.md
CHANGED
# Aileen 3 Core: Information Foraging MCP Server

<div style="display: flex; justify-content: center; gap: 10px; margin-bottom: 1em">
<a href="https://ndurner.de/links/aileen3-hf-space"><img alt="HuggingFace Space Badge" src="https://img.shields.io/badge/Gradio%206-HuggingFace%20Space-yellow?logo=gradio"></img></a>
<a href="https://ndurner.de/links/aileen3-linkedin"><img alt="LinkedIn Post Badge" src="https://img.shields.io/badge/🔗%20LinkedIn-Post-blue?logo=linkedin"></img></a>
<a href="https://ndurner.de/links/aileen3-hf-video"><img alt="MCP Demo Video Badge" src="https://img.shields.io/badge/MCP-Demo%20Video-red?logo=YouTube"></img></a>
🔜<a href="https://ndurner.de/links/aileen3-agent-github"><img alt="Agent Github Badge" src="https://img.shields.io/badge/Agent-Github-lightgray?logo=github"></img></a>
<a href="https://ndurner.de/links/aileen3-kaggle-video"><img alt="Agent Demo Video Badge" src="https://img.shields.io/badge/Agent-Demo%20Video-lightgray?logo=YouTube"></img></a>
<a href="https://ndurner.de/links/aileen3-kaggle-writeup"><img alt="Agent Kaggle Writeup Badge" src="https://img.shields.io/badge/Agent-Writeup-lightgray?logo=kaggle"></img></a>
</div>

> **"Information is surprises. You learn something when things don’t turn out the way you expected."** ⸺ Roger Schank

## ♨️ Problem: The Noise-Signal Ratio
Professionals working at the intersection of regulation and technology drink from a firehose of information. Staying current requires monitoring hours of conferences, webinars, and podcasts.

Standard AI **summarization fails** here because it creates "flat" summaries that rehash what you already know. It treats every sentence as equally important.

## ✅ Solution: Expectation-Driven Analysis
**Aileen 3 Core** is a Model Context Protocol (MCP) server designed for **Information Foraging**. Grounded in cognitive science, it models "novelty" as **prediction error**.

Instead of asking "Summarize this video," Aileen 3 Core allows users to task a Large Language Model with:
*"Here is what I already know, and here is what I expect the speaker to say. Tell me only where they deviate from this baseline."* As part of a larger agentic AI system, the prior knowledge can even be derived from a memory bank.

### Key Capabilities
* **⛳️ Expectation-Driven Briefings:** Uses Google Gemini to analyze audio/video against user-supplied priors (context, expectations, and knowledge gaps) to surface genuine surprises.
* **🔍 Context-Biased Transcription:** Prevents hallucinations (e.g., mistaking the German treaty "NOOTS" for "emergency state") by feeding media metadata as priors to the model.
* **🖼️ Visual Slide Extraction:** Automatically detects, extracts, and, on request, translates slide stills from video feeds, treating slides as high-density information artifacts.
* **🔌 Universal MCP Support:** Works with **Claude Desktop** or any custom agent.

---

## 🏗️ Architecture

Aileen 3 Core exposes tools that bridge the gap between raw media and reasoning agents.

```mermaid
graph LR
    User[User / Agent] -->|Priors & Expectations| MCP[Aileen 3 Core MCP]
    MCP -->|Retrieval| YT[YouTube/Media]
    MCP -->|Visuals| Slides[Slide Extraction]
    MCP -->|Audio| Trans[Transcription]
    MCP -->|Reasoning| Gemini[Google Gemini]

    YT --> Gemini
    Slides --> Gemini
    Trans --> Gemini

    Gemini -->|Briefing: Surprises Only| MCP
    Gemini -->|Localized slides| MCP
    MCP -->|High-Signal Update| User
```

## 🚀 Quick Start: Claude Desktop

Aileen 3 Core is designed to be the "eyes and ears" for your local LLM client.

1. **Install:**
   ```bash
   # Clone and install dependencies
   pip install -e ./mcp
   ```

2. Obtain a Google Gemini API key: [Google AI Studio](https://aistudio.google.com)

3. **Configure `claude_desktop_config.json`:** The Gemini API key is read from the environment, so it can also be set here:
   ```json
   {
     "mcpServers": {
       "aileen3-mcp": {
         "command": "python",
         "args": ["-m", "aileen3_mcp.server"],
         "env": {
           "GEMINI_API_KEY": "AI..."
         }
       }
     }
   }
   ```

4. Restart Claude

5. The Haiku 4.5 model is sufficient for basic tasks. To make your plans fully transparent to the LLM, refer to "aileen3" explicitly in the prompt, e.g.:

> Use aileen3 to translate slide 3 from YouTube video reference eXP-PvKcI9A to German.



### 🔍 Debugging
The message exchange and Claude-facing error messages can be read from the Claude log files:
```
tail -n 20 -F ~/Library/Logs/Claude/mcp*.log
```

## 🧪 The Gradio Space (Interactive Demo)

We have built a custom **Gradio 6** application that acts as a visual frontend for the MCP server. It demonstrates the pipeline step by step:

1. **Health Check:** Verifies `ffmpeg`, `yt-dlp`, and Gemini connectivity.
2. **Hallucination Check:** Demonstrates how lack of context leads to speech recognition errors.
3. **Context-biased Transcription:** Fixes these errors by establishing priors.
4. **Expectation-driven Analysis:** The core engine in action.
5. **Slide Translation:** Extracting and localizing visual assets.

[**👉 Try the Live Demo Here**](https://ndurner.de/links/aileen3-hf-space)

## 📘 MCP server overview

The MCP server is implemented in `mcp/src/aileen3_mcp` and exposes tools over stdio via `aileen3_mcp.server`. Google Gemini powers the analysis, transcription, and slide translation flows. Media retrieval is handled by `yt-dlp` and `ffmpeg`.

Environment prerequisites:

- `GEMINI_API_KEY` set to a valid Gemini API key
- `ffmpeg` installed and on `PATH`

Optional configuration:

- `AILEEN3_ANALYSIS_MODEL` to override the default Gemini model used for expectation-driven analysis (defaults to `gemini-flash-latest` for straightforward experimentation on the free tier of Google AI Studio; `gemini-3-pro-preview` recommended for accuracy).
- `AILEEN3_CACHE_DIR` to change the base cache directory (default: `~/.cache/aileen3`).
- `AILEEN3_DEBUG=1` to enable additional debug artefacts on disk.
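Taken together, these variables make up a minimal launch sketch (the key value is a placeholder; the model choice and cache path just restate the options above):

```shell
# Configure, then start the MCP server over stdio.
export GEMINI_API_KEY="AI..."                          # AI Studio key (placeholder)
export AILEEN3_ANALYSIS_MODEL="gemini-3-pro-preview"   # optional: accuracy over the free-tier default
export AILEEN3_CACHE_DIR="$HOME/.cache/aileen3"        # optional: shown with its default
python -m aileen3_mcp.server
```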

### ⭐️ Example client integration

The companion project [Aileen 3 Agent](https://ndurner.de/links/aileen3-agent-github) uses this MCP server via the `google.adk` `McpToolset`, spawning `aileen3_mcp.server` over stdio with:

- `command`: `sys.executable`
- `args`: `["-m", "aileen3_mcp.server"]`
- `env`: explicitly forwarding `GEMINI_API_KEY` into the MCP process
- `timeout`: `1200` seconds at the MCP transport level, to accommodate long-running video analysis and transcription jobs beyond the 30-second default

When integrating this MCP into your own agent or client:

- Set transport-level timeouts generously (10–20 minutes) and rely on the tools’ `wait_seconds` argument plus status polling for progress.
- Ensure `GEMINI_API_KEY` (and any optional `AILEEN3_*` variables you use) are visible in the environment of the MCP server process, not just the client.
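The `wait_seconds`-plus-polling pattern can be sketched as a small client-side helper. This is a hypothetical sketch, not part of the server: `get_status` stands in for whichever `get_*` tool your client invokes, and is assumed to return the job-envelope dicts documented below.

```python
import time


def poll_until_done(get_status, reference, wait_seconds=55, deadline_s=1200):
    """Poll a get_* tool until its job envelope reports completion.

    `get_status` is any callable taking (reference=..., wait_seconds=...)
    and returning dicts such as {"status": "running"} or
    {"status": "done", ...}, as described in the tool reference.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline_s:
        result = get_status(reference=reference, wait_seconds=wait_seconds)
        status = result.get("status")
        if status == "done":
            return result
        if result.get("is_error") or status == "not_found":
            raise RuntimeError(str(result.get("detail", status)))
        # "pending" / "running": loop again; wait_seconds already blocked
        # server-side, so no extra client-side sleep is needed.
    raise TimeoutError(f"job {reference!r} did not finish in {deadline_s}s")
```

The generous `deadline_s` mirrors the 1200-second transport timeout used by the Aileen 3 Agent integration above.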

### 🛠️ MCP tools and definitions

#### 🩺 Health and search

- `health() -> { ok, detail, ffmpeg, gemini_api_key }`
  - Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that `ffmpeg` is callable and `GEMINI_API_KEY` is present.
  - Usage: Call before running longer flows to surface missing runtime dependencies early.

- `search_youtube(query: str, max_results: int = 10) -> { videos: [...] }`
  - Purpose: Fast YouTube search using `yt-dlp` (no downloads).
  - Arguments:
    - `query` (required): Free-form search terms (e.g. `"taler auditor bachelorthesis"`).
    - `max_results` (optional, default `10`, clamped to `1–50`).
  - Returns: `videos` list with `id`, `title`, `webpage_url`, `duration_seconds`, `channel`, `channel_id`.
  - Typical flow: Use from an agent to shortlist candidate videos before picking one `source` for retrieval.
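As an illustration of that shortlisting flow, a hypothetical helper might rank the returned `videos` before retrieval (field names as documented above; the duration cap is an assumption):

```python
def shortlist_sources(videos, max_duration_s=3600):
    """Keep videos under a duration cap, shortest first, and return their
    `webpage_url` values as candidate `source` arguments for retrieval."""
    kept = [v for v in videos if (v.get("duration_seconds") or 0) <= max_duration_s]
    kept.sort(key=lambda v: v.get("duration_seconds") or 0)
    return [v["webpage_url"] for v in kept]
```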

#### 📺 Media retrieval (entry point)

- `start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict`
  - Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
  - Arguments:
    - `source`: YouTube URL/ID, podcast URL, or other `yt-dlp`-supported locator.
    - `prefer_audio_only`: When `true`, prefer audio-first formats; use when visuals are not needed.
    - `wait_seconds`: How long to block before returning; if the job is still running, you get status + reference.
  - Returns:
    - On success: `{ reference, status: "done", metadata: {...}, cached? }`
    - In progress: `{ reference, status: "pending"|"running", progress?, job_id }`
    - On error: `{ is_error: true, status, detail, reference }`
  - Typical flow: This is the first call once you have chosen a `source`. The `reference` token is required for all downstream tools.

- `get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Poll the retrieval job or fetch cached metadata.
  - Returns:
    - `{ status: "done", reference, metadata }` when cached or finished.
    - `{ status: "pending"|"running", ... }` while in flight.
    - `{ status: "not_found", reference }` if no job or cache exists.

#### 🖼️ Slides: extraction and translation

- `start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict`
  - Purpose: Extract representative slide stills from a downloaded video.
  - Note: Full media analysis (`start_media_analysis`) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
  - Returns: Standard job envelope with `slides` once done, or `status` + `job_id` while running.

- `get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Fetch extracted slides or current extraction status.
  - Returns: `{ status: "done", reference, slides: [...] }` on success, otherwise a job status or `{ status: "not_found" }`. Slides include indices that are used by `translate_slide`.

- `translate_slide(reference: str, slide_index: int, language: str) -> ImageContent`
  - Purpose: Translate a single slide image into another language using Gemini image-to-image.
  - Arguments:
    - `reference`: Token from `start_media_retrieval`.
    - `slide_index`: Zero-based index into `get_extracted_slides.slides[].index`.
    - `language`: Target language name (e.g. `"German"`, `"Spanish"`).
  - Returns: `ImageContent` with base64-encoded translated slide image. Responses are cached per `(reference, language, slide_index)`.

#### ⛳️ Expectation-driven analysis

- `start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict`
  - Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing *surprises* and *new actors* instead of rehashing everything.
  - Arguments:
    - `reference`: Token produced by `start_media_retrieval`.
    - `priors`: Object with optional string fields:
      - `context`: Scene setting (participants, venue, goal, spelled names).
      - `expectations`: What the user already expects to hear.
      - `prior_knowledge`: What the user already knows from past work.
      - `questions`: Concrete questions to be answered.
  - Important: Only populate `priors` with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
  - Returns: Same job envelope pattern as retrieval. When `status: "done"`, the payload includes an `analysis` markdown briefing optimised for fast reading.

- `get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Poll for completion or fetch cached analysis for a `reference`.
  - Returns:
    - `status: "done"` with `analysis` text on success.
    - `status: "pending"|"running"` during processing.
    - Errors include `is_error: true`, `detail`, `reference`.
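Assembling the `priors` object from user input can be sketched as a tiny hypothetical builder that drops empty fields, keeping the payload limited to what the user (or a trusted tool) actually supplied:

```python
def build_priors(context="", expectations="", prior_knowledge="", questions=""):
    """Assemble the optional-string `priors` object for start_media_analysis.

    Per the note above, only user- or trusted-tool-supplied text belongs
    here; empty fields are omitted rather than sent as blanks.
    """
    fields = {
        "context": context,
        "expectations": expectations,
        "prior_knowledge": prior_knowledge,
        "questions": questions,
    }
    return {key: value for key, value in fields.items() if value}
```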

#### ✍️ Transcription

- `start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict`
  - Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
  - Arguments:
    - `reference`: From `start_media_retrieval`.
    - `context`: Optional grounding text with names, acronyms, or domain hints.
    - `prefer_audio_only`: When `true`, skip slide context for cheaper audio-only runs.
    - `wait_seconds`: Poll window before returning.
  - Returns: Job envelope, with `transcription` once `status: "done"`.

- `get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Retrieve a previously computed transcription or current job status.
  - Returns: Same pattern as `get_media_analysis_result`, but with `transcription` instead of `analysis`.

## 🏆 Hackathon Context & Journey
Aileen 3 Core was built for [MCP's 1st Birthday - Hosted by Anthropic and Gradio](https://huggingface.co/MCP-1st-Birthday) and serves as the backbone for the [Aileen 3 Agent](https://ndurner.de/links/aileen3-kaggle-writeup) (developed for the [AI Agents Intensive Course with Google](https://www.kaggle.com/learn-guide/5-day-agents)).

While most agents are passive summarizers, Aileen 3 represents a shift toward **active information foraging**, enabling professionals to filter signal from an ocean of noise.

## 📦 Local Development

```bash
# Build the Docker image
docker build -t aileen3-core .

# Run the Gradio interface
docker run -it -p 7860:7860 aileen3-core
```

## 🛡️ Security & privacy
- Your Gemini key is used only server-side to call Gemini models.
- Media is downloaded to a cache for repeatability; clear `~/.cache/aileen3` to remove artefacts.
- No analytics or third-party telemetry included.

## 🚧 Limitations
- `translate_slide` does not currently benefit from priors; translation quality could be improved that way.
- No AI safety guardrails (tone, style, anti prompt-injection, ...)
- No cost control
- Hallucination risk: Aileen may make mistakes.
- Remote MCP operating mode not tested; it would rely on external access protection.

## 👾 Troubleshooting
- Gemini 401 “API keys are not supported…”: use an AI Studio key starting with “AI…”, not a Vertex key (“AQ…”).
- Long jobs: increase the transport timeout (10–20 min) and leverage `wait_seconds` plus polling via the `get_*` tools.
- YouTube access:
  * ensure YouTube is reachable
  * keep `yt-dlp` recent
  * if site JS protection breaks, install `yt-dlp-ejs` (see the Space health check).
README.md
CHANGED
---
title: Aileen 3 Core - Information Foraging MCP Server
emoji: 👩🏻‍💼
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: cc-by-4.0
short_description: Turns 45-minute conference videos into 2-minute "surprise briefs". Expectation-driven, slide-aware, Gemini-powered.
tags:
- building-mcp-track-enterprise
- building-mcp-track-customer
---
| 14 |
|
| 15 |
+
# Aileen 3 Core: Information Foraging MCP Server
|
| 16 |
+
|
| 17 |
+
<div style="display: flex; justify-content: center; gap: 10px; margin-bottom: 1em">
|
| 18 |
+
<a href="https://ndurner.de/links/aileen3-hf-space"><img alt="HuggingFace Space Badge" src="https://img.shields.io/badge/Gradio%206-HuggingFace%20Space-yellow?logo=gradio"></img></a>
|
| 19 |
+
<a href="https://ndurner.de/links/aileen3-linkedin"><img alt="LinkedIn Post Badge" src="https://img.shields.io/badge/🔗%20LinkedIn-Post-blue?logo=linkedin"></img></a>
|
| 20 |
+
<a href="https://ndurner.de/links/aileen3-hf-video"><img alt="MCP Demo Video Badge" src="https://img.shields.io/badge/MCP-Demo%20Video-red?logo=YouTube"></img></a>
|
| 21 |
+
🔜<a href="https://ndurner.de/links/aileen3-agent-github"><img alt="Agent Agent Github Badge" src="https://img.shields.io/badge/Agent-Github-lightgray?logo=github"></img></a>
|
| 22 |
+
<a href="https://ndurner.de/links/aileen3-kaggle-video"><img alt="Agent Demo Video Badge" src="https://img.shields.io/badge/Agent-Demo%20Video-lightgray?logo=YouTube"></img></a>
|
| 23 |
<a href="https://ndurner.de/links/aileen3-kaggle-writeup"><img alt="Agent Kaggle Writeup" src="https://img.shields.io/badge/Agent-Writeup-lightgray?logo=kaggle"></img></a>
|
|
|
|
| 24 |
</div>
|
| 25 |
|
| 26 |
+
> **"Information is surprises. You learn something when things don’t turn out the way you expected."** ⸺ Roger Schank
|
| 27 |
+
|
| 28 |
+
## ♨️ Problem: The Noise-Signal Ratio
|
| 29 |
+
Professionals working at the intersection of regulation and technology drink from a firehose of information. Staying current requires monitoring hours of conferences, webinars, and podcasts.
|
| 30 |
+
|
| 31 |
+
Standard AI **summarization fails** here because it creates "flat" summaries that rehash what you already know. It treats every sentence as equally important.
|
| 32 |
+
|
| 33 |
+
## ✅ Solution: Expectation-Driven Analysis
|
| 34 |
+
**Aileen 3 Core** is a Model Context Protocol (MCP) server designed for **Information Foraging**. Grounded in cognitive science, it models "novelty" as **prediction error**.
|
| 35 |
+
|
| 36 |
+
Instead of asking "Summarize this video," Aileen 3 Core allows users to task a Large Language Model with:
|
| 37 |
+
*"Here is what I already know, and here is what I expect the speaker to say. Tell me only where they deviate from this baseline."*. As part of a larger agentic AI system, the prior knowledge can even be derived from a memory bank.
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
### Key Capabilities
|
| 41 |
+
* **⛳️ Expectation-Driven Briefings:** Uses Google Gemini to analyze audio/video against user-supplied priors (context, expectations, and knowledge gaps) to surface genuine surprises.
|
| 42 |
+
* **🔍 Context-Biased Transcription:** Prevents hallucinations (e.g., confusing the German treaty "NOOTS" for "emergency state") by feeding media metadata as priors to the model.
|
| 43 |
+
* **🖼️ Visual Slide Extraction:** Automatically detects, extracts, and, on request, translates slide stills from video feeds, treating slides as high-density information artifacts.
|
| 44 |
+
* **🔌 Universal MCP Support:** Works with **Claude Desktop**, or any custom Agent.
|
| 45 |
+
|
| 46 |
+
---
|
| 47 |
+
|
| 48 |
+
## 🏗️ Architecture
|
| 49 |
+
|
| 50 |
+
Aileen 3 Core exposes tools that bridge the gap between raw media and reasoning agents.
|
| 51 |
+
|
| 52 |
+
```mermaid
|
| 53 |
+
graph LR
|
| 54 |
+
User[User / Agent] -->|Priors & Expectations| MCP[Aileen 3 Core MCP]
|
| 55 |
+
MCP -->|Retrieval| YT[YouTube/Media]
|
| 56 |
+
MCP -->|Visuals| Slides[Slide Extraction]
|
| 57 |
+
MCP -->|Audio| Trans[Transcription]
|
| 58 |
+
MCP -->|Reasoning| Gemini[Google Gemini]
|
| 59 |
+
|
| 60 |
+
YT --> Gemini
|
| 61 |
+
Slides --> Gemini
|
| 62 |
+
Trans --> Gemini
|
| 63 |
+
|
| 64 |
+
Gemini -->|Briefing: Surprises Only| MCP
|
| 65 |
+
Gemini -->|Localized slides| MCP
|
| 66 |
+
MCP -->|High-Signal Update| User
|
| 67 |
```
|
| 68 |
+
|
| 69 |
+
## 🚀 Quick Start: Claude Desktop
|
| 70 |
+
|
| 71 |
+
Aileen 3 Core is designed to be the "eyes and ears" for your local LLM client.
|
| 72 |
+
|
| 73 |
+
1. **Install:**
|
| 74 |
+
```bash
|
| 75 |
+
# Clone and install dependencies
|
| 76 |
+
pip install -e ./mcp
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
2. Obtain a Google Gemini API key: [Google AI Studio](https://aistudio.google.com)
|
| 80 |
+
|
| 81 |
+
3. **Configure `claude_desktop_config.json`:**. The Gemini API key will be read from the environment, so can also be set here:
|
| 82 |
+
```json
|
| 83 |
+
{
|
| 84 |
+
"mcpServers": {
|
| 85 |
+
"aileen3-mcp": {
|
| 86 |
+
"command": "python",
|
| 87 |
+
"args": ["-m", "aileen3_mcp.server"],
|
| 88 |
+
"env": {
|
| 89 |
+
"GEMINI_API_KEY": "AI..."
|
| 90 |
+
}
|
| 91 |
+
}
|
| 92 |
}
|
| 93 |
}
|
| 94 |
+
```
|
|
|
|
|
|
|
| 95 |
4. Restart Claude
|
| 96 |
|
| 97 |
+
5. The Haiku 4.5 model is sufficient for basic tasks. To make your plans fully transparent to the LLM, refer to "aileen3" explicitly in the prompt, e.g.:
|
|
|
|
| 98 |
> Use aileen3 to translate slide 3 from YouTube video reference eXP-PvKcI9A to German.
|
| 99 |
|
| 100 |

|
| 101 |
|
| 102 |
+
### 🔍 Debugging
|
| 103 |
The message exchange and Claude-facing error messages can be read from Claude log files:
|
| 104 |
```
|
| 105 |
tail -n 20 -F ~/Library/Logs/Claude/mcp*.log
|
| 106 |
```
|
| 107 |
|
|
|
|
| 108 |
|
| 109 |
+
## 🧪 The Gradio Space (Interactive Demo)
|
| 110 |
+
|
| 111 |
+
We have built a custom **Gradio 6** application that acts as a visual frontend for the MCP server. It demonstrates the pipeline step-by-step:
|
| 112 |
+
|
| 113 |
+
1. **Health Check:** Verifies `ffmpeg`, `yt-dlp`, and Gemini connectivity.
|
| 114 |
+
2. **Hallucination Check:** Demonstrates how lack of context leads to speech recognition errors.
|
| 115 |
+
3. **Context-biased Transcription:** Fixes these errors by establishing priors.
|
| 116 |
+
4. **Expectation-driven Analysis:** The core engine in action.
|
| 117 |
+
5. **Slide Translation:** Extracting and localizing visual assets.
|
| 118 |
+
|
| 119 |
+
[**👉 Try the Live Demo Here**](https://ndurner.de/links/aileen3-hf-space)
|
| 120 |
+
|
| 121 |
+
## 📘 MCP server overview
|
| 122 |
+
|
| 123 |
+
The MCP server is implemented in `mcp/src/aileen3_mcp` and exposes tools over stdio via `aileen3_mcp.server`. Google Gemini powers the analysis, transcription, and slide translation flows. Media retrieval is handled by `yt-dlp` and `ffmpeg`.
|
| 124 |
+
|
| 125 |
+
Environment prerequisites:
|
| 126 |
+
|
| 127 |
+
- `GEMINI_API_KEY` set to a valid Gemini API key
|
| 128 |
+
- `ffmpeg` installed and on `PATH`
|
| 129 |
+
|
| 130 |
+
Optional configuration:
|
| 131 |
+
|
| 132 |
+
- `AILEEN3_ANALYSIS_MODEL` to override the default Gemini model used for expectation-driven analysis (defaults to `gemini-flash-latest` for straightforward experimentation on the free tier of Google AI Studio; `gemini-3-pro-preview` recommended for accuracy).
|
| 133 |
+
- `AILEEN3_CACHE_DIR` to change the base cache directory (default: `~/.cache/aileen3`).
|
| 134 |
+
- `AILEEN3_DEBUG=1` to enable additional debug artefacts on disk.
|
| 135 |
+
|
| 136 |
+
### ⭐️ Example client integration
|
| 137 |
+
|
| 138 |
+
The companion project [Aileen 3 Agent](https://ndurner.de/links/aileen3-agent-github) uses this MCP server via the `google.adk` `McpToolset`, spawning `aileen3_mcp.server` over stdio with:
|
| 139 |
+
|
| 140 |
+
- `command`: `sys.executable`
|
| 141 |
+
- `args`: `["-m", "aileen3_mcp.server"]`
|
| 142 |
+
- `env`: explicitly forwarding `GEMINI_API_KEY` into the MCP process
|
| 143 |
+
- `timeout`: `1200` seconds at the MCP transport level, to accommodate long-running video analysis and transcription jobs beyond the 30 seconds default
|
| 144 |
+
|
| 145 |
+
When integrating this MCP into your own agent or client:
|
| 146 |
+
|
| 147 |
+
- Set transport-level timeouts generously (10–20 minutes) and rely on the tools’ `wait_seconds` argument plus status polling for progress.
|
| 148 |
+
- Ensure `GEMINI_API_KEY` (and any optional `AILEEN3_*` variables you use) are visible in the environment of the MCP server process, not just the client.
|
| 149 |
+
|
| 150 |
+
### 🛠️ MCP tools and definitions

#### 🩺 Health and search

- `health() -> { ok, detail, ffmpeg, gemini_api_key }`
  - Purpose: Lightweight health probe mirroring the Gradio demo’s health check. Confirms that `ffmpeg` is callable and `GEMINI_API_KEY` is present.
  - Usage: Call before running longer flows to surface missing runtime dependencies early.

- `search_youtube(query: str, max_results: int = 10) -> { videos: [...] }`
  - Purpose: Fast YouTube search using `yt-dlp` (no downloads).
  - Arguments:
    - `query` (required): Free-form search terms (e.g. `"taler auditor bachelorthesis"`).
    - `max_results` (optional, default `10`, clamped to `1–50`).
  - Returns: `videos` list with `id`, `title`, `webpage_url`, `duration_seconds`, `channel`, `channel_id`.
  - Typical flow: Use from an agent to shortlist candidate videos before picking one `source` for retrieval.
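As an illustration of agent-side use of the `videos` payload, here is a hypothetical shortlist filter. The field names match the documented return shape; the helper itself is not part of aileen3.

```python
def shortlist_videos(videos, max_minutes=60):
    """Hypothetical agent-side filter over `search_youtube` results.

    Keeps videos at most `max_minutes` long and returns (title, url) pairs;
    entries without a known duration are skipped.
    """
    picks = []
    for video in videos:
        duration = video.get("duration_seconds")
        if duration is not None and duration <= max_minutes * 60:
            picks.append((video["title"], video["webpage_url"]))
    return picks
```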
#### 📺 Media retrieval (entry point)

- `start_media_retrieval(source: str, prefer_audio_only: bool = False, wait_seconds: int = 54) -> dict`
  - Purpose: Download long-form media (YouTube, podcasts, HTTP URLs) and normalize basic metadata.
  - Arguments:
    - `source`: YouTube URL/ID, podcast URL, or other `yt-dlp`-supported locator.
    - `prefer_audio_only`: When `true`, prefer audio-first formats; use when visuals are not needed.
    - `wait_seconds`: How long to block before returning; if the job is still running, you get status + reference.
  - Returns:
    - On success: `{ reference, status: "done", metadata: {...}, cached? }`
    - In progress: `{ reference, status: "pending"|"running", progress?, job_id }`
    - On error: `{ is_error: true, status, detail, reference }`
  - Typical flow: This is the first call once you have chosen a `source`. The `reference` token is required for all downstream tools.

- `get_media_retrieval_status(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Poll the retrieval job or fetch cached metadata.
  - Returns:
    - `{ status: "done", reference, metadata }` when cached or finished.
    - `{ status: "pending"|"running", ... }` while in flight.
    - `{ status: "not_found", reference }` if no job or cache exists.
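The start/poll pattern above can be sketched as a small client-side loop. `call_tool(name, **kwargs)` stands in for however your MCP client invokes tools; the tool names and payload fields follow this README.

```python
import time

def wait_for_retrieval(call_tool, source, poll_interval=30, max_wait=1200):
    """Client-side polling sketch for the retrieval job envelope.

    `call_tool` is a placeholder for your MCP client's invoke function.
    Returns the last envelope seen: done, an error, or still pending/running
    once `max_wait` seconds have elapsed.
    """
    result = call_tool("start_media_retrieval", source=source, wait_seconds=54)
    waited = 0
    while result.get("status") in ("pending", "running") and waited < max_wait:
        time.sleep(poll_interval)
        waited += poll_interval
        result = call_tool(
            "get_media_retrieval_status",
            reference=result["reference"],
            wait_seconds=0,
        )
    return result
```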
#### 🖼️ Slides: extraction and translation

- `start_slide_extraction(reference: str, wait_seconds: int = 55) -> dict`
  - Purpose: Extract representative slide stills from a downloaded video.
  - Note: Full media analysis (`start_media_analysis`) automatically triggers slide extraction; call this explicitly only if you need slides on their own.
  - Returns: Standard job envelope with `slides` once done, or `status` + `job_id` while running.

- `get_extracted_slides(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Fetch extracted slides or the current extraction status.
  - Returns: `{ status: "done", reference, slides: [...] }` on success, otherwise a job status or `{ status: "not_found" }`. Slides include indices that are used by `translate_slide`.

- `translate_slide(reference: str, slide_index: int, language: str) -> ImageContent`
  - Purpose: Translate a single slide image into another language using Gemini image-to-image.
  - Arguments:
    - `reference`: Token from `start_media_retrieval`.
    - `slide_index`: Zero-based index into `get_extracted_slides.slides[].index`.
    - `language`: Target language name (e.g. `"German"`, `"Spanish"`).
  - Returns: `ImageContent` with a base64-encoded translated slide image. Responses are cached per `(reference, language, slide_index)`.
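A translated slide comes back as base64 image data. A minimal sketch of persisting it, assuming the payload is accessible as a dict with a `data` field (per the MCP `ImageContent` schema); some clients expose `.data` as an attribute instead, so adapt to your client's representation.

```python
import base64

def save_slide_image(image_payload, path):
    """Decode the base64 `data` field of an ImageContent-style payload to disk.

    Dict-style access is an assumption for this sketch; adjust if your MCP
    client returns a typed object instead of a dict.
    """
    with open(path, "wb") as fh:
        fh.write(base64.b64decode(image_payload["data"]))
    return path
```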
#### ⛳️ Expectation-driven analysis

- `start_media_analysis(reference: str, priors: object, wait_seconds: int = 55) -> dict`
  - Purpose: Run expectation-driven analysis over the media’s audio and slides, surfacing *surprises* and *new actors* instead of rehashing everything.
  - Arguments:
    - `reference`: Token produced by `start_media_retrieval`.
    - `priors`: Object with optional string fields:
      - `context`: Scene setting (participants, venue, goal, spelled names).
      - `expectations`: What the user already expects to hear.
      - `prior_knowledge`: What the user already knows from past work.
      - `questions`: Concrete questions to be answered.
  - Important: Only populate `priors` with information coming from the user or trusted tools (e.g. Memory Bank); do not invent priors in the agent.
  - Returns: Same job envelope pattern as retrieval. When `status: "done"`, the payload includes an `analysis` markdown briefing optimised for fast reading.

- `get_media_analysis_result(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Poll for completion or fetch the cached analysis for a `reference`.
  - Returns:
    - `status: "done"` with `analysis` text on success.
    - `status: "pending"|"running"` during processing.
    - Errors include `is_error: true`, `detail`, `reference`.
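To keep the "only user-supplied priors" constraint explicit in client code, a guard like the following could validate the priors object before calling the tool. This is a hypothetical helper; only the field names come from this README.

```python
ALLOWED_PRIOR_FIELDS = {"context", "expectations", "prior_knowledge", "questions"}

def build_priors(**fields):
    """Hypothetical client-side guard for the `priors` argument.

    Accepts only the documented prior fields and drops empty values, so the
    agent cannot smuggle in invented or malformed priors.
    """
    unknown = set(fields) - ALLOWED_PRIOR_FIELDS
    if unknown:
        raise ValueError(f"unsupported prior fields: {sorted(unknown)}")
    return {key: value for key, value in fields.items() if value}
```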
#### ✍️ Transcription

- `start_media_transcription(reference: str, context: str = "", prefer_audio_only: bool = False, wait_seconds: int = 55) -> dict`
  - Purpose: Produce a diarized, speaker-labelled transcription of the media’s audio channel.
  - Arguments:
    - `reference`: From `start_media_retrieval`.
    - `context`: Optional grounding text with names, acronyms, or domain hints.
    - `prefer_audio_only`: When `true`, skip slide context for cheaper audio-only runs.
    - `wait_seconds`: Poll window before returning.
  - Returns: Job envelope, with `transcription` once `status: "done"`.

- `get_media_transcription_result(reference: str, wait_seconds: int = 0) -> dict`
  - Purpose: Retrieve a previously computed transcription or the current job status.
  - Returns: Same pattern as `get_media_analysis_result`, but with `transcription` instead of `analysis`.
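Chaining the tools above, a two-step transcription flow looks roughly like this. `call_tool(name, **kwargs)` is again a placeholder for your MCP client's invoke function; long jobs may still come back as pending/running and then need polling via the corresponding `get_*` tools.

```python
def transcribe_source(call_tool, source, context=""):
    """Two-step sketch: retrieve media, then request a diarized transcript.

    Returns the retrieval envelope unchanged if the download is still in
    flight, so the caller can poll get_media_retrieval_status first.
    """
    retrieval = call_tool("start_media_retrieval", source=source, wait_seconds=54)
    if retrieval.get("status") != "done":
        return retrieval  # caller should poll get_media_retrieval_status
    return call_tool(
        "start_media_transcription",
        reference=retrieval["reference"],
        context=context,
        wait_seconds=55,
    )
```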
## 🏆 Hackathon Context & Journey

Aileen 3 Core was built for the [MCP's 1st Birthday - Hosted by Anthropic and Gradio](https://huggingface.co/MCP-1st-Birthday) and serves as the backbone for the [Aileen 3 Agent](https://ndurner.de/links/aileen3-kaggle-writeup) (developed for the [AI Agents Intensive Course with Google](https://www.kaggle.com/learn-guide/5-day-agents)).

While most agents are passive summarizers, Aileen 3 represents a shift toward **active information foraging**, enabling professionals to filter signal from an ocean of noise.
## 📦 Local Development

```bash
# Build the Docker image
docker build -t aileen3-core .

# Run the Gradio interface
docker run -it -p 7860:7860 aileen3-core
```
## 🛡️ Security & privacy

- Your Gemini key is used only server-side to call Gemini models.
- Media is downloaded to a cache for repeatability; clear `~/.cache/aileen3` to remove artefacts.
- No analytics or third-party telemetry are included.
## 🚧 Limitations

- `translate_slide` does not currently benefit from priors; feeding them in could improve translation quality.
- No AI safety guardrails (tone, style, prompt-injection defenses, ...)
- No cost controls
- Hallucination risk: Aileen may make mistakes.
- Remote MCP operating mode is untested; it would rely on external access protection.
## 👾 Troubleshooting

- Gemini 401 “API keys are not supported…”: use an AI Studio key starting with “AI…”, not a Vertex key (“AQ…”).
- Long jobs: increase the transport timeout (10–20 min) and leverage `wait_seconds` plus polling via the `get_*` tools.
- YouTube access:
  * ensure YouTube is reachable
  * keep `yt-dlp` recent
  * if the site’s JS protection breaks downloads, install `yt-dlp-ejs` (see the Space health check).
demo/setup_cell.py
CHANGED

@@ -12,7 +12,7 @@ def render_setup_cell() -> gr.Textbox:
     The returned textbox component is used by other cells to pass GEMINI_API_KEY
     into the MCP server environment.

-    This
+    This Space runs your key locally in the container to call Gemini. You can revoke it any time.
     """
     with cell("🔑 Setup: Gemini API key"):
         gr.Markdown(
mcp/README.md
CHANGED

@@ -1,6 +1,6 @@
 # Aileen3 MCP Server

-Lightweight MCP server exposing
+Lightweight stdio MCP server exposing Aileen 3’s media tools for use by the Gradio demo, Claude Desktop, and other MCP clients.

 ## Quick start

@@ -9,32 +9,22 @@ python -m pip install -e ./mcp
 aileen3-mcp  # starts the stdio MCP server
 ```

-The server
-
-2) `search_youtube` — finds YouTube videos using the yt-dlp Python API.
-
-   - **Arguments:**
-     - `query` (str, required): Free-form search terms, e.g. `"lofi hip hop beats"`.
-     - `max_results` (int, optional, default `10`, bounds `1–50`): number of videos to return.
-   - **Returns:** object with a `videos` array. Each entry includes `id`, `title`, `webpage_url`,
-     `duration_seconds`, `channel`, `channel_id`, `thumbnail_url`.
-   - **Usage note:** Keep `max_results` small (≤10) for faster responses. The tool only searches; it does not download media.
-
-{
-  "name": "search_youtube",
-  "arguments": {
-    "query": "python packaging tutorial",
-    "max_results": 5
-  }
-}
-```
-
-## ToDo
-* write proper project description: add to README.md and pyproject.toml
+The server entrypoint is `aileen3_mcp.server.make_app`, which registers all tools on a `FastMCP` instance. For a complete description of available tools (health probes, YouTube search, media retrieval, slide extraction and translation, analysis, transcription), see the project root `README.md` under **“MCP tools and definitions”**.
+
+In short, the public tools are:
+
+- `health`
+- `search_youtube`
+- `start_media_retrieval` / `get_media_retrieval_status`
+- `start_slide_extraction` / `get_extracted_slides`
+- `translate_slide`
+- `start_media_analysis` / `get_media_analysis_result`
+- `start_media_transcription` / `get_media_transcription_result`
+
+These tools are designed to be called from an agentic chat interface that:
+
+- first chooses a media `source` (optionally using `search_youtube`)
+- then calls `start_media_retrieval`
+- and finally uses the `reference` token to drive analysis, transcription, or slide translation.
+
+For detailed contracts (arguments, return payloads, and example usage), consult `README.md` in the repository root.
mcp/src/aileen3_mcp/media_tools.py
CHANGED

@@ -1302,7 +1302,7 @@ def register_media_tools(app: FastMCP) -> None:
     async def start_slide_extraction(ctx: Context, reference: str, wait_seconds: int = 55) -> dict:
         """Extract representative slide stills from a downloaded video.

-        Note: media analysis (start_media_analysis) includes slides extraction, so no need to call this function
+        Note: media analysis (start_media_analysis) includes slide extraction, so there is no need to call this function explicitly when aiming for full media analysis.
         """
         metadata = _load_json(_metadata_path(reference))
         if not metadata or not Path(metadata.get("download_path", "")).exists():