Deep Chavda committed on
Commit 4ccde7a · 0 parents

feat: initial release — PDF to Markdown MCP server

- FastMCP SSE server with two tools: pdf_to_markdown and pdf_to_structured_markdown
- Mistral OCR integration with lifespan-managed client
- Pydantic Settings: secrets in .env, non-secret config in development.yml
- Loguru structured logging across all layers
- CORS middleware + /health liveness probe
- Railway deployment config (railway.json, proxy_headers, PORT injection)
- .gitignore, sample.env, and uv lockfile included

.gitignore ADDED
@@ -0,0 +1,29 @@
+ # Python-generated files
+ __pycache__/
+ *.py[oc]
+ build/
+ dist/
+ wheels/
+ *.egg-info
+
+ # Virtual environments
+ .venv
+
+ # Secrets — never commit real credentials
+ .env
+
+ # Local dev output
+ output/
+ docs/
+ temp/
+ logs/
+
+ # Test files
+ test.py
+ tests/
+
+ # OS / editor noise
+ .DS_Store
+ .idea/
+ .vscode/
+
.python-version ADDED
@@ -0,0 +1 @@
+ 3.12
README.md ADDED
@@ -0,0 +1,281 @@
+ <p>
+ <div align="center">
+ <h1>
+ PDF to Markdown MCP
+ <br /> <br />
+ <a href="">
+ <img
+ src="https://img.shields.io/badge/python%20%7C%203.12-blue"
+ alt="Python 3.12"
+ />
+ </a>
+ <a href="https://github.com/astral-sh/uv">
+ <img
+ src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json"
+ alt="uv"
+ />
+ </a>
+ <a href="https://modelcontextprotocol.io/">
+ <img
+ src="https://img.shields.io/badge/MCP-FastMCP-6C47FF"
+ alt="FastMCP"
+ />
+ </a>
+ <a href="https://mistral.ai/">
+ <img
+ src="https://img.shields.io/badge/Mistral%20AI-FF7000?logoColor=white"
+ alt="Mistral AI"
+ />
+ </a>
+ <a href="https://www.starlette.io/">
+ <img
+ src="https://img.shields.io/badge/Starlette-ASGI-009688"
+ alt="Starlette"
+ />
+ </a>
+ <a href="https://www.uvicorn.org/">
+ <img
+ src="https://img.shields.io/badge/Uvicorn-server-4051B5"
+ alt="Uvicorn"
+ />
+ </a>
+ <a href="https://loguru.readthedocs.io/">
+ <img
+ src="https://img.shields.io/badge/Loguru-logging-FF6B6B"
+ alt="Loguru"
+ />
+ </a>
+ </h1>
+ </div>
+ </p>
+
+ An MCP (Model Context Protocol) server that converts PDFs and documents into Markdown using **Mistral OCR**.
+
+ ## Features
+
+ - **`pdf_to_markdown`** — Convert any publicly accessible PDF/document URL to merged Markdown.
+ - **`pdf_to_structured_markdown`** — Convert and get per-page structured output (page index, individual markdown, merged result).
+ - CORS-enabled SSE transport — connect from any MCP client or inspector.
+ - `/health` endpoint for liveness probing.
+ - Structured, colorized logging via Loguru.
+
+ ## Project Structure
+
+ ```
+ pdf_to_md_mcp/
+ ├── main.py                    # Entry point — uvicorn runner
+ ├── pyproject.toml
+ ├── sample.env                 # Secrets reference (copy to .env)
+ ├── development.yml            # Non-secret config (server, CORS, OCR model)
+ └── app/
+     ├── server.py              # ASGI app factory (MCP + CORS + health)
+     ├── core/
+     │   ├── config.py          # Pydantic settings (loads .env + development.yml)
+     │   ├── logger.py          # Loguru logger
+     │   ├── lifespan.py        # AppContext + Mistral client lifecycle
+     │   └── exceptions.py      # Domain exceptions
+     ├── services/
+     │   └── ocr_service.py     # Mistral OCR business logic
+     ├── tools/
+     │   └── markdown_tools.py  # @mcp.tool() definitions
+     └── utils/
+         ├── response.py        # create_response() helper
+         └── validators.py      # URL validation
+ ```
+
+ ## Setup
+
+ ```bash
+ # Install uv if not already installed
+ curl -LsSf https://astral.sh/uv/install.sh | sh
+
+ # Install dependencies
+ uv sync
+
+ # Configure secrets
+ cp sample.env .env
+ # Edit .env — set MISTRAL_API_KEY
+ # Non-secret config (server, CORS, OCR model) lives in development.yml
+ ```
+
+ ## Run
+
+ ```bash
+ uv run main.py
+ ```
+
+ Server starts at `http://127.0.0.1:8000` by default.
+
+ | Endpoint | Description |
+ | --- | --- |
+ | `GET /health` | Liveness probe |
+ | `GET /sse` | MCP SSE transport |
+ | `POST /messages/` | MCP message handler |
+
+ ## MCP Tools
+
+ ### `pdf_to_markdown`
+
+ Convert a document URL to merged Markdown (all pages concatenated).
+
+ **Input**
+
+ | Parameter | Type | Description |
+ | --- | --- | --- |
+ | `document_url` | `string` | Publicly accessible URL of a PDF or image document |
+
+ **Returns** — `string`
+
+ ```
+ # Introduction
+
+ This paper presents...
+
+ ## Section 2
+
+ ...
+ ```
+
+ ---
+
+ ### `pdf_to_structured_markdown`
+
+ Convert a document URL and get per-page structured output alongside the merged result.
+
+ **Input**
+
+ | Parameter | Type | Description |
+ | --- | --- | --- |
+ | `document_url` | `string` | Publicly accessible URL of a PDF or image document |
+
+ **Returns** — `object`
+
+ ```json
+ {
+   "page_count": 3,
+   "pages": [
+     { "index": 0, "markdown": "# Page 1\n..." },
+     { "index": 1, "markdown": "## Page 2\n..." },
+     { "index": 2, "markdown": "### Page 3\n..." }
+   ],
+   "markdown": "# Page 1\n...\n\n## Page 2\n...\n\n### Page 3\n..."
+ }
+ ```
+
+ ## Debugging with MCP Inspector
+
+ ```bash
+ npx -y @modelcontextprotocol/inspector
+ ```
+
+ Connect to `http://127.0.0.1:8000/sse` locally or your Railway URL in production.
+
+ ## Deploy to Railway
+
+ ### 1. Push to GitHub
+
+ ```bash
+ git init
+ git add .
+ git commit -m "initial commit"
+ gh repo create pdf-to-md-mcp --public --source=. --push
+ ```
+
+ ### 2. Create a Railway project
+
+ Go to [railway.app](https://railway.app) → **New Project** → **Deploy from GitHub repo** → select your repo.
+
+ Railway detects the `railway.json` and uses `uv run main.py` as the start command automatically.
+
+ ### 3. Set environment variables
+
+ In Railway → your service → **Variables**, add:
+
+ | Variable | Value |
+ | --- | --- |
+ | `MISTRAL_API_KEY` | your Mistral API key |
+ | `HOST` | `0.0.0.0` |
+
+ > `PORT` is injected automatically by Railway — do **not** set it manually.
+ > All other config (`MISTRAL_OCR_MODEL`, `LOG_LEVEL`, etc.) is read from `development.yml`.
+
+ ### 4. Deploy
+
+ Railway triggers a deploy on every push to your default branch. Once live, your public SSE URL will be:
+
+ ```
+ https://<your-service>.up.railway.app/sse
+ ```
+
+ Use that URL in any MCP client or pass it to the inspector:
+
+ ```bash
+ npx -y @modelcontextprotocol/inspector
+ # connect to: https://<your-service>.up.railway.app/sse
+ ```
+
+ ### Why it works
+
+ - Railway injects `PORT` as an env var — pydantic-settings reads env vars before `development.yml`, so it's picked up automatically.
+ - `HOST=0.0.0.0` (set via Railway Variables) overrides the local `127.0.0.1` default so the container is reachable.
+ - `proxy_headers=True` in `main.py` makes uvicorn trust Railway's `X-Forwarded-*` headers.
+ - `/health` is set as Railway's healthcheck path in `railway.json`.
+
+
+
+ ## Configuration
+
+ Configuration is split across two files to separate secrets from non-sensitive settings.
+
+ ### `.env` — Secrets only
+
+ ```dotenv
+ MISTRAL_API_KEY=your_mistral_api_key_here
+ ```
+
+ ### `development.yml` — Non-secret config
+
+ ```yaml
+ # Mistral
+ MISTRAL_OCR_MODEL: mistral-ocr-latest
+ MISTRAL_TABLE_FORMAT: markdown
+
+ # Server
+ APP_NAME: "Markdown & Layout Extractor"
+ HOST: "127.0.0.1"
+ PORT: 8000
+ LOG_LEVEL: INFO
+
+ # CORS
+ CORS_ALLOW_ORIGINS:
+   - "*"
+ CORS_ALLOW_METHODS:
+   - "*"
+ CORS_ALLOW_HEADERS:
+   - "*"
+ ```
+
+ **Priority (highest → lowest):** environment variables → `.env` → `development.yml`
+
+ ### All settings
+
+ | Variable | File | Default | Description |
+ | --- | --- | --- | --- |
+ | `MISTRAL_API_KEY` | `.env` | **required** | Mistral AI API key |
+ | `MISTRAL_OCR_MODEL` | `development.yml` | `mistral-ocr-latest` | OCR model identifier |
+ | `MISTRAL_TABLE_FORMAT` | `development.yml` | `markdown` | Table output format |
+ | `APP_NAME` | `development.yml` | `Markdown & Layout Extractor` | MCP server name |
+ | `HOST` | `development.yml` | `127.0.0.1` | Bind address |
+ | `PORT` | `development.yml` | `8000` | Bind port |
+ | `LOG_LEVEL` | `development.yml` | `INFO` | Log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
+ | `CORS_ALLOW_ORIGINS` | `development.yml` | `["*"]` | Allowed CORS origins |
+ | `CORS_ALLOW_METHODS` | `development.yml` | `["*"]` | Allowed HTTP methods |
+ | `CORS_ALLOW_HEADERS` | `development.yml` | `["*"]` | Allowed HTTP headers |
+
+ ## Design Notes
+
+ - **Single Starlette app** — `sse_app()` is the sole ASGI application; the health route and CORS middleware are injected directly onto it to prevent double-middleware stacking (which causes the `http.response.start` crash).
+ - **Separation of concerns** — Tools are thin wrappers around `OCRService`; business logic is independently testable.
+ - **Lifespan-managed client** — The Mistral client is initialized once at startup and shared across all tool calls.
+ - **Loguru logging** — Structured, colorized logs across all layers via Loguru.
+ - **Pydantic Settings** — Type-safe, `.env`-driven configuration with an LRU-cached singleton.
app/core/config.py ADDED
@@ -0,0 +1,69 @@
+ from typing import List, Tuple, Type
+ from functools import lru_cache
+
+ from pydantic import Field
+ from pydantic_settings import (
+     BaseSettings,
+     PydanticBaseSettingsSource,
+     SettingsConfigDict,
+     YamlConfigSettingsSource,
+ )
+
+
+ class Settings(BaseSettings):
+     """Centralized settings.
+
+     Priority (highest → lowest):
+         1. Environment variables
+         2. .env file          ← secrets only (MISTRAL_API_KEY)
+         3. development.yml    ← non-secret config (model, server, CORS)
+     """
+
+     model_config = SettingsConfigDict(
+         env_file=".env",
+         env_file_encoding="utf-8",
+         case_sensitive=True,
+         extra="ignore",
+     )
+
+     @classmethod
+     def settings_customise_sources(
+         cls,
+         settings_cls: Type[BaseSettings],
+         init_settings: PydanticBaseSettingsSource,
+         env_settings: PydanticBaseSettingsSource,
+         dotenv_settings: PydanticBaseSettingsSource,
+         file_secret_settings: PydanticBaseSettingsSource,
+     ) -> Tuple[PydanticBaseSettingsSource, ...]:
+         return (
+             init_settings,
+             env_settings,
+             dotenv_settings,
+             YamlConfigSettingsSource(settings_cls, yaml_file="development.yml"),
+             file_secret_settings,
+         )
+
+     # ── Mistral (secret in .env, rest in development.yml) ─────────────────────
+     MISTRAL_API_KEY: str
+     MISTRAL_OCR_MODEL: str = "mistral-ocr-latest"
+     MISTRAL_TABLE_FORMAT: str = "markdown"
+
+     # ── Server (development.yml) ───────────────────────────────────────────────
+     APP_NAME: str = "Markdown & Layout Extractor"
+     HOST: str = "127.0.0.1"
+     PORT: int = 8000
+     LOG_LEVEL: str = "INFO"
+
+     # ── CORS (development.yml) ─────────────────────────────────────────────────
+     CORS_ALLOW_ORIGINS: List[str] = Field(default_factory=lambda: ["*"])
+     CORS_ALLOW_METHODS: List[str] = Field(default_factory=lambda: ["*"])
+     CORS_ALLOW_HEADERS: List[str] = Field(default_factory=lambda: ["*"])
+
+
+ @lru_cache
+ def get_settings() -> Settings:
+     """Cached settings instance — call this everywhere instead of instantiating."""
+     return Settings()
+
+
+ settings = get_settings()
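The priority order in the `Settings` docstring (environment variables, then `.env`, then `development.yml`) can be pictured as a layered lookup. The sketch below uses only the standard library and is not how pydantic-settings resolves sources internally. The three sample dictionaries and `resolve_setting` are hypothetical names for illustration, and values stay plain strings here, whereas the real `Settings` class coerces them to the declared types.

```python
from collections import ChainMap

# Hypothetical stand-ins for the three sources, highest priority first.
env_vars = {"PORT": "8080", "HOST": "0.0.0.0"}  # e.g. set by the hosting platform
dotenv_file = {"MISTRAL_API_KEY": "example-key"}  # secrets from .env
yaml_file = {"HOST": "127.0.0.1", "PORT": "8000", "LOG_LEVEL": "INFO"}  # development.yml

# ChainMap returns the value from the first mapping that contains the key,
# so earlier mappings shadow later ones.
layered = ChainMap(env_vars, dotenv_file, yaml_file)


def resolve_setting(name: str) -> str:
    """Look up a setting, letting higher-priority sources win."""
    return layered[name]
```

With these sample values, `PORT` and `HOST` come from the environment layer, while `LOG_LEVEL` falls through to the YAML layer because no higher layer defines it.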
app/core/exceptions.py ADDED
@@ -0,0 +1,10 @@
+ class MCPExtractorError(Exception):
+     """Base exception for this application."""
+
+
+ class OCRProcessingError(MCPExtractorError):
+     """Raised when OCR / document conversion fails."""
+
+
+ class InvalidDocumentURLError(MCPExtractorError):
+     """Raised when the provided document URL is invalid or unreachable."""
app/core/lifespan.py ADDED
@@ -0,0 +1,27 @@
+ from collections.abc import AsyncIterator
+ from contextlib import asynccontextmanager
+ from dataclasses import dataclass
+
+ from mcp.server.fastmcp import FastMCP
+ from mistralai.client import Mistral
+
+ from app.core.config import settings
+ from app.core.logger import logger
+
+
+ @dataclass
+ class AppContext:
+     """Shared resources available to all tools via ctx.request_context.lifespan_context."""
+
+     mistral: Mistral
+
+
+ @asynccontextmanager
+ async def app_lifespan(server: FastMCP) -> AsyncIterator[AppContext]:
+     """Initialize and cleanly tear down shared clients."""
+     logger.info("Initializing Mistral client...")
+     client = Mistral(api_key=settings.MISTRAL_API_KEY)
+     try:
+         yield AppContext(mistral=client)
+     finally:
+         logger.info("Shutting down lifespan resources.")
app/core/logger.py ADDED
@@ -0,0 +1,20 @@
+ import sys
+
+ from loguru import logger
+
+ from app.core.config import settings
+
+
+ def _configure_logger() -> None:
+     logger.remove()
+     logger.add(
+         sys.stdout,
+         format="{time:YYYY-MM-DD HH:mm:ss} | {level:<8} | {name} | {message}",
+         level=settings.LOG_LEVEL,
+         colorize=True,
+     )
+
+
+ _configure_logger()
+
+ __all__ = ["logger"]
app/server.py ADDED
@@ -0,0 +1,52 @@
+ from mcp.server.fastmcp import FastMCP
+ from starlette.middleware.cors import CORSMiddleware
+ from starlette.requests import Request
+
+ from app.core.config import settings
+ from app.core.lifespan import app_lifespan
+ from app.core.logger import logger
+ from app.tools import register_markdown_tools
+ from app.utils.response import create_response
+
+
+ def _build_mcp_server() -> FastMCP:
+     """Build the FastMCP instance and register every tool."""
+     mcp = FastMCP(settings.APP_NAME, lifespan=app_lifespan)
+     register_markdown_tools(mcp)
+     return mcp
+
+
+ async def _health(_: Request):
+     """Simple liveness probe."""
+     return create_response(
+         status_value=True,
+         message="Service is healthy",
+         data={"app": settings.APP_NAME},
+     )
+
+
+ def create_app():
+     """Build the ASGI application.
+
+     Mounts CORS and the health route directly on the MCP Starlette app to avoid
+     nesting two Starlette instances (which produces a double http.response.start
+     crash with uvicorn).
+     """
+     logger.info("Building application: {}", settings.APP_NAME)
+     mcp = _build_mcp_server()
+
+     # sse_app() returns a Starlette instance — use it as the single ASGI app.
+     app = mcp.sse_app()
+
+     # Register the health route directly on the MCP Starlette app.
+     app.add_route("/health", _health, methods=["GET"])
+
+     # Add CORS once at the outermost middleware layer.
+     app = CORSMiddleware(
+         app,
+         allow_origins=settings.CORS_ALLOW_ORIGINS,
+         allow_methods=settings.CORS_ALLOW_METHODS,
+         allow_headers=settings.CORS_ALLOW_HEADERS,
+     )
+
+     return app
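`create_app()` wraps the Starlette app exactly once with a middleware. That pattern can be illustrated with plain ASGI callables. The sketch below does not use Starlette's `CORSMiddleware`. `AddHeaderMiddleware`, `inner_app`, and `run_once` are hypothetical stand-ins that show a single outer wrapper amending the one `http.response.start` message the inner app sends.

```python
import asyncio


async def inner_app(scope, receive, send):
    """Smallest possible ASGI app: one response start, one body."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


class AddHeaderMiddleware:
    """Hypothetical pure-ASGI middleware that appends one response header."""

    def __init__(self, app, name: bytes, value: bytes):
        self.app = app
        self.name = name
        self.value = value

    async def __call__(self, scope, receive, send):
        async def send_wrapper(message):
            # Only the response-start message carries headers.
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((self.name, self.value))
                message = {**message, "headers": headers}
            await send(message)

        await self.app(scope, receive, send_wrapper)


def run_once(app):
    """Drive the app with a fake request and collect what it sends."""
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "method": "GET", "path": "/health"}
    asyncio.run(app(scope, receive, send))
    return sent


# Wrap exactly once, at the outermost layer.
app = AddHeaderMiddleware(inner_app, b"access-control-allow-origin", b"*")
messages = run_once(app)
```

The inner app still emits a single response-start message, and the wrapper only edits that message on its way out.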
app/services/ocr_service.py ADDED
@@ -0,0 +1,88 @@
+ from typing import Any, Dict, List
+
+ from mistralai.client import Mistral
+
+ from app.core.logger import logger
+ from app.core.config import settings
+ from app.core.exceptions import OCRProcessingError
+
+
+ class OCRService:
+     """Encapsulates document-to-markdown conversion via Mistral OCR."""
+
+     def __init__(self, client: Mistral) -> None:
+         self._client = client
+         self._model = settings.MISTRAL_OCR_MODEL
+         self._table_format = settings.MISTRAL_TABLE_FORMAT
+
+     async def document_to_markdown(self, document_url: str) -> str:
+         """Convert a remote document (PDF / image) to markdown.
+
+         Args:
+             document_url: Public URL of the document.
+
+         Returns:
+             Concatenated markdown content (pages joined by blank lines).
+
+         Raises:
+             OCRProcessingError: If the OCR call fails or returns no pages.
+         """
+         logger.info("Starting OCR for document: {}", document_url)
+         try:
+             response = await self._client.ocr.process_async(
+                 model=self._model,
+                 document={
+                     "type": "document_url",
+                     "document_url": document_url,
+                 },
+                 table_format=self._table_format,
+             )
+         except Exception as exc:
+             logger.exception("Mistral OCR call failed")
+             raise OCRProcessingError(f"OCR processing failed: {exc}") from exc
+
+         pages = getattr(response, "pages", None) or []
+         if not pages:
+             raise OCRProcessingError("OCR returned no pages for the given document.")
+
+         markdown = "\n\n".join(
+             page.markdown for page in pages if getattr(page, "markdown", None)
+         )
+         logger.info("OCR succeeded: {} pages, {} chars", len(pages), len(markdown))
+         return markdown
+
+     async def document_to_structured(self, document_url: str) -> Dict[str, Any]:
+         """Convert a document and return per-page structure alongside merged markdown.
+
+         Useful when callers need page-level metadata (page index, individual markdown).
+         """
+         logger.info("Starting structured OCR for document: {}", document_url)
+         try:
+             response = await self._client.ocr.process_async(
+                 model=self._model,
+                 document={
+                     "type": "document_url",
+                     "document_url": document_url,
+                 },
+                 table_format=self._table_format,
+             )
+         except Exception as exc:
+             logger.exception("Mistral OCR call failed")
+             raise OCRProcessingError(f"OCR processing failed: {exc}") from exc
+
+         pages: List[Dict[str, Any]] = []
+         merged_parts: List[str] = []
+         for idx, page in enumerate(getattr(response, "pages", []) or []):
+             md = getattr(page, "markdown", "") or ""
+             pages.append({"index": idx, "markdown": md})
+             if md:
+                 merged_parts.append(md)
+
+         if not pages:
+             raise OCRProcessingError("OCR returned no pages for the given document.")
+
+         return {
+             "page_count": len(pages),
+             "pages": pages,
+             "markdown": "\n\n".join(merged_parts),
+         }
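The per-page loop in `document_to_structured` can be exercised without calling the OCR API. The sketch below copies that loop into a standalone function. `FakePage` and `merge_pages` are hypothetical names, and the page objects are assumed to expose only a `markdown` attribute.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class FakePage:
    """Hypothetical stand-in for one page of an OCR response."""

    markdown: Optional[str]


def merge_pages(raw_pages: List[FakePage]) -> Dict[str, Any]:
    """Index every page, but merge only the pages that have markdown."""
    pages: List[Dict[str, Any]] = []
    merged_parts: List[str] = []
    for idx, page in enumerate(raw_pages):
        md = getattr(page, "markdown", "") or ""
        pages.append({"index": idx, "markdown": md})
        if md:
            merged_parts.append(md)
    return {
        "page_count": len(pages),
        "pages": pages,
        "markdown": "\n\n".join(merged_parts),
    }


result = merge_pages([FakePage("# Title"), FakePage(None), FakePage("Body text")])
```

An empty page keeps its slot in `pages`, so indices stay aligned with the source document, but it contributes nothing to the merged string.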
app/tools/__init__.py ADDED
@@ -0,0 +1,3 @@
+ from app.tools.markdown_tools import register_markdown_tools
+
+ __all__ = ["register_markdown_tools"]
app/tools/markdown_tools.py ADDED
@@ -0,0 +1,60 @@
+ from typing import Any, Dict
+
+ from mcp.server.fastmcp import Context, FastMCP
+
+ from app.core.logger import logger
+ from app.services.ocr_service import OCRService
+ from app.utils.validators import validate_document_url
+ from app.core.exceptions import InvalidDocumentURLError, OCRProcessingError
+
+
+ def register_markdown_tools(mcp: FastMCP) -> None:
+     """Attach markdown-extraction tools to the given FastMCP server."""
+
+     @mcp.tool()
+     async def pdf_to_markdown(document_url: str, ctx: Context) -> str:
+         """Convert a PDF or document from a URL to Markdown using Mistral OCR.
+
+         Args:
+             document_url: Publicly accessible URL of the PDF / document.
+
+         Returns:
+             Markdown string (all pages concatenated).
+         """
+         try:
+             url = validate_document_url(document_url)
+         except InvalidDocumentURLError as exc:
+             logger.warning("Invalid URL rejected: {}", exc)
+             return f"Error: {exc}"
+
+         service = OCRService(client=ctx.request_context.lifespan_context.mistral)
+
+         try:
+             return await service.document_to_markdown(url)
+         except OCRProcessingError as exc:
+             logger.error("OCR failed for {}: {}", url, exc)
+             return f"Error: {exc}"
+
+     @mcp.tool()
+     async def pdf_to_structured_markdown(
+         document_url: str, ctx: Context
+     ) -> Dict[str, Any]:
+         """Convert a document to per-page structured markdown.
+
+         Returns:
+             Dict with keys: page_count (int), pages (list of {index, markdown}),
+             markdown (str, merged).
+         """
+         try:
+             url = validate_document_url(document_url)
+         except InvalidDocumentURLError as exc:
+             logger.warning("Invalid URL rejected: {}", exc)
+             return {"error": str(exc), "page_count": 0, "pages": [], "markdown": ""}
+
+         service = OCRService(client=ctx.request_context.lifespan_context.mistral)
+
+         try:
+             return await service.document_to_structured(url)
+         except OCRProcessingError as exc:
+             logger.error("Structured OCR failed for {}: {}", url, exc)
+             return {"error": str(exc), "page_count": 0, "pages": [], "markdown": ""}
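Both tools report failures as return values rather than raising, so a caller has to inspect the result to tell success from failure. A minimal sketch of such a check follows. `is_error_result` and `ToolResult` are hypothetical names that are not part of this commit, and the check assumes the two error shapes shown in the tool code (an `"Error: "` string prefix and an `"error"` key).

```python
from typing import Any, Dict, Union

ToolResult = Union[str, Dict[str, Any]]


def is_error_result(result: ToolResult) -> bool:
    """Return True when a tool result carries an error instead of content."""
    if isinstance(result, str):
        # pdf_to_markdown reports failures as "Error: <message>".
        return result.startswith("Error: ")
    # pdf_to_structured_markdown adds an "error" key to its failure payload.
    return "error" in result
```

One caveat of the string form: a document whose Markdown legitimately begins with `Error: ` would be misclassified by this check, which the structured payload avoids by using a dedicated key.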
app/utils/response.py ADDED
@@ -0,0 +1,30 @@
+ from typing import Any, Dict, List, Union
+ from fastapi import status as http_status
+ from fastapi.responses import JSONResponse
+
+
+ def create_response(
+     status_value: bool,
+     message: str,
+     data: Union[Dict[str, Any], List[Any], None] = None,
+     code: int = http_status.HTTP_200_OK,
+ ) -> JSONResponse:
+     """Create standardized JSON response.
+
+     Args:
+         status_value: Success/failure boolean.
+         message: Human-readable message.
+         data: Response payload (dict, list, or None).
+         code: HTTP status code (default: 200).
+
+     Returns:
+         JSONResponse with structure {status, message, data}.
+     """
+     return JSONResponse(
+         status_code=code,
+         content={
+             "status": status_value,
+             "message": message,
+             "data": data if data is not None else {},
+         },
+     )
app/utils/validators.py ADDED
@@ -0,0 +1,30 @@
+ from urllib.parse import urlparse
+ from app.core.exceptions import InvalidDocumentURLError
+
+
+ def validate_document_url(url: str) -> str:
+     """Validate that the given string is a well-formed http(s) URL.
+
+     Args:
+         url: The URL to validate.
+
+     Returns:
+         The trimmed URL.
+
+     Raises:
+         InvalidDocumentURLError: If the URL is malformed.
+     """
+     if not url or not isinstance(url, str):
+         raise InvalidDocumentURLError("Document URL must be a non-empty string.")
+
+     url = url.strip()
+     parsed = urlparse(url)
+
+     if parsed.scheme not in {"http", "https"}:
+         raise InvalidDocumentURLError(
+             f"Unsupported URL scheme: '{parsed.scheme}'. Use http or https."
+         )
+     if not parsed.netloc:
+         raise InvalidDocumentURLError("Document URL is missing a valid host.")
+
+     return url
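To see which inputs the validator accepts, the sketch below repeats its checks in a standalone form. The exception class is redefined locally so the snippet runs on its own, and `is_valid` is a hypothetical helper added for illustration. Note that the checks cover URL shape only and do not test whether the host is reachable.

```python
from urllib.parse import urlparse


class InvalidDocumentURLError(Exception):
    """Local copy of the domain exception so the sketch runs on its own."""


def validate_document_url(url: str) -> str:
    """Same checks as the validator: non-empty, http(s) scheme, has a host."""
    if not url or not isinstance(url, str):
        raise InvalidDocumentURLError("Document URL must be a non-empty string.")
    url = url.strip()
    parsed = urlparse(url)
    if parsed.scheme not in {"http", "https"}:
        raise InvalidDocumentURLError(
            f"Unsupported URL scheme: '{parsed.scheme}'. Use http or https."
        )
    if not parsed.netloc:
        raise InvalidDocumentURLError("Document URL is missing a valid host.")
    return url


def is_valid(url: str) -> bool:
    """Hypothetical helper: report the outcome as a boolean."""
    try:
        validate_document_url(url)
        return True
    except InvalidDocumentURLError:
        return False
```

Surrounding whitespace is trimmed before parsing, so a padded `https` URL is accepted and returned in trimmed form, while a bare `example.com/doc.pdf` is rejected because it has no scheme.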
development.yml ADDED
@@ -0,0 +1,18 @@
+ # ─── Mistral ───────────────────────────────────────────────────────────────────
+ MISTRAL_OCR_MODEL: mistral-ocr-latest
+ MISTRAL_TABLE_FORMAT: markdown
+
+ # ─── Server ────────────────────────────────────────────────────────────────────
+ # Local defaults — Railway overrides HOST and PORT via its Variables tab.
+ APP_NAME: "Markdown & Layout Extractor"
+ HOST: "127.0.0.1"
+ PORT: 8000
+ LOG_LEVEL: INFO
+
+ # ─── CORS ──────────────────────────────────────────────────────────────────────
+ CORS_ALLOW_ORIGINS:
+   - "*"
+ CORS_ALLOW_METHODS:
+   - "*"
+ CORS_ALLOW_HEADERS:
+   - "*"
main.py ADDED
@@ -0,0 +1,18 @@
+ import uvicorn
+
+ from app.server import create_app
+ from app.core.config import settings
+
+ app = create_app()
+
+
+ if __name__ == "__main__":
+     uvicorn.run(
+         app,
+         host=settings.HOST,
+         port=settings.PORT,
+         log_level=settings.LOG_LEVEL.lower(),
+         # Trust X-Forwarded-* headers from Railway's edge proxy.
+         proxy_headers=True,
+         forwarded_allow_ips="*",
+     )
pyproject.toml ADDED
@@ -0,0 +1,26 @@
+ [project]
+ name = "pdf-to-md-mcp"
+ version = "0.1.0"
+ description = "MCP server that converts PDFs and documents into Markdown using Mistral OCR"
+ readme = "README.md"
+ requires-python = ">=3.12"
+ dependencies = [
+     "fastapi==0.135.3",
+     "loguru==0.7.3",
+     "mcp[cli]==1.27.0",
+     "mistralai==2.3.2",
+     "pydantic-settings>=2.13.1",
+     "python-dotenv==1.2.2",
+     "pyyaml>=6.0.2",
+     "starlette==1.0.0",
+ ]
+
+ [dependency-groups]
+ dev = [
+     "black==26.3.1",
+ ]
+
+ [tool.black]
+ target-version = ["py312"]
+
+
railway.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "$schema": "https://railway.com/railway.schema.json",
+   "build": {
+     "builder": "NIXPACKS"
+   },
+   "deploy": {
+     "startCommand": "uv run main.py",
+     "healthcheckPath": "/health",
+     "healthcheckTimeout": 30,
+     "restartPolicyType": "ON_FAILURE",
+     "restartPolicyMaxRetries": 3
+   }
+ }
sample.env ADDED
@@ -0,0 +1 @@
+ MISTRAL_API_KEY=<YOUR-MISTRAL-API-KEY>
uv.lock ADDED
The diff for this file is too large to render. See raw diff