| title: Markdown Layout Extractor | |
| emoji: π | |
| colorFrom: red | |
| colorTo: yellow | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| <p> | |
| <div align="center"> | |
| <h1> | |
| PDF to Markdown MCP | |
| <br /> <br /> | |
| <a href=""> | |
| <img | |
| src="https://img.shields.io/badge/python%20%7C%203.12-blue" | |
| alt="Python 3.12" | |
| /> | |
| </a> | |
| <a href="https://github.com/astral-sh/uv"> | |
| <img | |
| src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json" | |
| alt="uv" | |
| /> | |
| </a> | |
| <a href="https://modelcontextprotocol.io/"> | |
| <img | |
| src="https://img.shields.io/badge/MCP-FastMCP-6C47FF" | |
| alt="FastMCP" | |
| /> | |
| </a> | |
| <a href="https://mistral.ai/"> | |
| <img | |
| src="https://img.shields.io/badge/Mistral%20AI-FF7000?logoColor=white" | |
| alt="Mistral AI" | |
| /> | |
| </a> | |
| <a href="https://www.starlette.io/"> | |
| <img | |
| src="https://img.shields.io/badge/Starlette-ASGI-009688" | |
| alt="Starlette" | |
| /> | |
| </a> | |
| <a href="https://www.uvicorn.org/"> | |
| <img | |
| src="https://img.shields.io/badge/Uvicorn-server-4051B5" | |
| alt="Uvicorn" | |
| /> | |
| </a> | |
| <a href="https://loguru.readthedocs.io/"> | |
| <img | |
| src="https://img.shields.io/badge/Loguru-logging-FF6B6B" | |
| alt="Loguru" | |
| /> | |
| </a> | |
| </h1> | |
| </div> | |
| </p> | |
| An MCP (Model Context Protocol) server that converts PDFs and documents into Markdown using **Mistral OCR**. | |
| ## Features | |
| - **`pdf_to_markdown`**: Convert any publicly accessible PDF/document URL to merged Markdown. | |
| - **`pdf_to_structured_markdown`**: Convert a document and get per-page structured output (page index, per-page Markdown, merged result). | |
| - CORS-enabled SSE transport: connect from any MCP client or inspector. | |
| - `/health` endpoint for liveness probing. | |
| - Structured, colorized logging via Loguru. | |
| ## Project Structure | |
| ``` | |
| pdf_to_md_mcp/ | |
| ├── main.py # Entry point (uvicorn runner) | |
| ├── pyproject.toml | |
| ├── sample.env # Secrets reference (copy to .env) | |
| ├── development.yml # Non-secret config (server, CORS, OCR model) | |
| └── app/ | |
|     ├── server.py # ASGI app factory (MCP + CORS + health) | |
|     ├── core/ | |
|     │   ├── config.py # Pydantic settings (loads .env + development.yml) | |
|     │   ├── logger.py # Loguru logger | |
|     │   ├── lifespan.py # AppContext + Mistral client lifecycle | |
|     │   └── exceptions.py # Domain exceptions | |
|     ├── services/ | |
|     │   └── ocr_service.py # Mistral OCR business logic | |
|     ├── tools/ | |
|     │   └── markdown_tools.py # @mcp.tool() definitions | |
|     └── utils/ | |
|         ├── response.py # create_response() helper | |
|         └── validators.py # URL validation | |
| ``` | |
| ## Setup | |
| ```bash | |
| # Install uv if not already installed | |
| curl -LsSf https://astral.sh/uv/install.sh | sh | |
| # Install dependencies | |
| uv sync | |
| # Configure secrets | |
| cp sample.env .env | |
| # Edit .env and set MISTRAL_API_KEY | |
| # Non-secret config (server, CORS, OCR model) lives in development.yml | |
| ``` | |
| ## Run | |
| ```bash | |
| uv run main.py | |
| ``` | |
| Server starts at `http://127.0.0.1:8000` by default. | |
| | Endpoint | Description | | |
| | --- | --- | | |
| | `GET /health` | Liveness probe | | |
| | `GET /sse` | MCP SSE transport | | |
| | `POST /messages/` | MCP message handler | | |
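
As a rough illustration of how the liveness endpoint above might be polled from a script, here is a minimal standard-library sketch. The helper names `health_url` and `check_health` are hypothetical, and treating any 2xx status as healthy is an assumption; the README does not specify the response body or status code.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the /health URL from a base URL such as http://127.0.0.1:8000."""
    return base_url.rstrip("/") + "/health"


def check_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with a 2xx status (assumed meaning of healthy)."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError both derive from OSError: the server is
        # unreachable or answered with an error status.
        return False


if __name__ == "__main__":
    print(check_health("http://127.0.0.1:8000"))
```

The same check should work against a deployed instance by passing its public base URL instead of the local one.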
| ## MCP Tools | |
| ### `pdf_to_markdown` | |
| Convert a document URL to merged Markdown (all pages concatenated). | |
| **Input** | |
| | Parameter | Type | Description | | |
| | --- | --- | --- | | |
| | `document_url` | `string` | Publicly accessible URL of a PDF or image document | | |
| **Returns:** `string` | |
| ``` | |
| # Introduction | |
| This paper presents... | |
| ## Section 2 | |
| ... | |
| ``` | |
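
Both tools take a publicly accessible `document_url`, and the project tree lists a `validators.py` for URL validation. The actual checks in that file are not shown here, so the following is only a sketch of what such a validator could look like. The function name `validate_document_url` and the rule of allowing only `http`/`https` URLs with a host are assumptions.

```python
from urllib.parse import urlparse

# Assumption: only web URLs are accepted, since the document must be publicly reachable.
ALLOWED_SCHEMES = {"http", "https"}


def validate_document_url(url: str) -> str:
    """Return the trimmed URL if it looks like a usable http(s) URL, else raise ValueError."""
    if not isinstance(url, str) or not url.strip():
        raise ValueError("document_url must be a non-empty string")

    cleaned = url.strip()
    parsed = urlparse(cleaned)

    if parsed.scheme not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported URL scheme: {parsed.scheme or '(none)'}")
    if not parsed.netloc:
        raise ValueError("document_url must include a host")
    return cleaned
```

A check like this only confirms the URL is well formed; whether the document is actually reachable is still decided when the OCR request is made.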
| --- | |
| ### `pdf_to_structured_markdown` | |
| Convert a document URL and get per-page structured output alongside the merged result. | |
| **Input** | |
| | Parameter | Type | Description | | |
| | --- | --- | --- | | |
| | `document_url` | `string` | Publicly accessible URL of a PDF or image document | | |
| **Returns:** `object` | |
| ```json | |
| { | |
| "page_count": 3, | |
| "pages": [ | |
| { "index": 0, "markdown": "# Page 1\n..." }, | |
| { "index": 1, "markdown": "## Page 2\n..." }, | |
| { "index": 2, "markdown": "### Page 3\n..." } | |
| ], | |
| "markdown": "# Page 1\n...\n\n## Page 2\n...\n\n### Page 3\n..." | |
| } | |
| ``` | |
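
To make the relationship between `pages` and the merged `markdown` field concrete, here is a small sketch that assembles an object of the shape shown above. The function name `build_structured_result` is hypothetical, and joining pages with a blank line (`"\n\n"`) is inferred from the sample output rather than taken from the service code.

```python
from typing import TypedDict


class Page(TypedDict):
    index: int
    markdown: str


class StructuredResult(TypedDict):
    page_count: int
    pages: list[Page]
    markdown: str


def build_structured_result(page_markdowns: list[str]) -> StructuredResult:
    """Wrap per-page Markdown strings into the structured response shape."""
    pages: list[Page] = [
        {"index": i, "markdown": md} for i, md in enumerate(page_markdowns)
    ]
    # Assumption: pages are merged in order, separated by one blank line.
    merged = "\n\n".join(page["markdown"] for page in pages)
    return {"page_count": len(pages), "pages": pages, "markdown": merged}
```

Under this assumption, the merged string returned by `pdf_to_markdown` would be the same value as the `markdown` field here.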
| ## Debugging with MCP Inspector | |
| ```bash | |
| npx -y @modelcontextprotocol/inspector | |
| ``` | |
| Connect to `http://127.0.0.1:8000/sse` locally, or to your Railway URL in production. | |
| ## Deploy to Railway | |
| ### 1. Push to GitHub | |
| ```bash | |
| git init | |
| git add . | |
| git commit -m "initial commit" | |
| gh repo create pdf-to-md-mcp --public --source=. --push | |
| ``` | |
| ### 2. Create a Railway project | |
| Go to [railway.app](https://railway.app) → **New Project** → **Deploy from GitHub repo** → select your repo. | |
| Railway detects `railway.json` and uses `uv run main.py` as the start command automatically. | |
| ### 3. Set environment variables | |
| In Railway → your service → **Variables**, add: | |
| | Variable | Value | | |
| |---|---| | |
| | `MISTRAL_API_KEY` | your Mistral API key | | |
| | `HOST` | `0.0.0.0` | | |
| > `PORT` is injected automatically by Railway; do **not** set it manually. | |
| > All other config (`MISTRAL_OCR_MODEL`, `LOG_LEVEL`, etc.) is read from `development.yml`. | |
| ### 4. Deploy | |
| Railway triggers a deploy on every push to your default branch. Once live, your public SSE URL will be: | |
| ``` | |
| https://<your-service>.up.railway.app/sse | |
| ``` | |
| Use that URL in any MCP client or pass it to the inspector: | |
| ```bash | |
| npx -y @modelcontextprotocol/inspector | |
| # connect to: https://<your-service>.up.railway.app/sse | |
| ``` | |
| ### Why it works | |
| - Railway injects `PORT` as an environment variable. pydantic-settings reads environment variables before `development.yml`, so the value is picked up automatically. | |
| - `HOST=0.0.0.0` (set via Railway Variables) overrides the local `127.0.0.1` default so the container is reachable. | |
| - `proxy_headers=True` in `main.py` makes uvicorn trust Railway's `X-Forwarded-*` headers. | |
| - `/health` is set as Railway's healthcheck path in `railway.json`. | |
| ## Configuration | |
| Configuration is split across two files to separate secrets from non-sensitive settings. | |
| ### `.env`: Secrets only | |
| ```dotenv | |
| MISTRAL_API_KEY=your_mistral_api_key_here | |
| ``` | |
| ### `development.yml`: Non-secret config | |
| ```yaml | |
| # Mistral | |
| MISTRAL_OCR_MODEL: mistral-ocr-latest | |
| MISTRAL_TABLE_FORMAT: markdown | |
| # Server | |
| APP_NAME: "Markdown & Layout Extractor" | |
| HOST: "127.0.0.1" | |
| PORT: 8000 | |
| LOG_LEVEL: INFO | |
| # CORS | |
| CORS_ALLOW_ORIGINS: | |
| - "*" | |
| CORS_ALLOW_METHODS: | |
| - "*" | |
| CORS_ALLOW_HEADERS: | |
| - "*" | |
| ``` | |
| **Priority (highest → lowest):** environment variables → `.env` → `development.yml` | |
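
The priority order above can be illustrated with a standard-library sketch. This is not how `config.py` is implemented (the project uses pydantic-settings); it only mimics the lookup order with `collections.ChainMap`. The function name `resolve_settings` and the sample values are assumptions.

```python
from collections import ChainMap
from typing import Any, Mapping


def resolve_settings(
    env: Mapping[str, Any],
    dotenv: Mapping[str, Any],
    yaml_config: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge three config layers; earlier arguments win over later ones."""
    # ChainMap searches its mappings left to right, so env beats .env,
    # which beats development.yml.
    return dict(ChainMap(dict(env), dict(dotenv), dict(yaml_config)))


if __name__ == "__main__":
    settings = resolve_settings(
        env={"HOST": "0.0.0.0", "PORT": "8080"},           # e.g. injected by the platform
        dotenv={"MISTRAL_API_KEY": "your_mistral_api_key_here"},
        yaml_config={"HOST": "127.0.0.1", "PORT": 8000, "LOG_LEVEL": "INFO"},
    )
    print(settings["HOST"], settings["PORT"], settings["LOG_LEVEL"])
```

Note that values coming from the environment are strings in this sketch; a typed settings model would normally coerce `PORT` to an integer.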
| ### All settings | |
| | Variable | File | Default | Description | | |
| | --- | --- | --- | --- | | |
| | `MISTRAL_API_KEY` | `.env` | **required** | Mistral AI API key | | |
| | `MISTRAL_OCR_MODEL` | `development.yml` | `mistral-ocr-latest` | OCR model identifier | | |
| | `MISTRAL_TABLE_FORMAT` | `development.yml` | `markdown` | Table output format | | |
| | `APP_NAME` | `development.yml` | `Markdown & Layout Extractor` | MCP server name | | |
| | `HOST` | `development.yml` | `127.0.0.1` | Bind address | | |
| | `PORT` | `development.yml` | `8000` | Bind port | | |
| | `LOG_LEVEL` | `development.yml` | `INFO` | Log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) | | |
| | `CORS_ALLOW_ORIGINS` | `development.yml` | `["*"]` | Allowed CORS origins | | |
| | `CORS_ALLOW_METHODS` | `development.yml` | `["*"]` | Allowed HTTP methods | | |
| | `CORS_ALLOW_HEADERS` | `development.yml` | `["*"]` | Allowed HTTP headers | | |
| ## Design Notes | |
| - **Single Starlette app**: `sse_app()` is the sole ASGI application; the health route and CORS middleware are attached directly to it to prevent double-middleware stacking (which causes the `http.response.start` crash). | |
| - **Separation of concerns**: Tools are thin wrappers around `OCRService`, so the business logic is independently testable. | |
| - **Lifespan-managed client**: The Mistral client is initialized once at startup and shared across all tool calls. | |
| - **Loguru logging**: Structured, colorized logs across all layers. | |
| - **Pydantic Settings**: Type-safe configuration loaded from `.env` and `development.yml`, exposed through an LRU-cached singleton. | |
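
The lifespan-managed client pattern mentioned above can be sketched with the standard library alone. `AppContext` is named in the project tree, but its fields are not shown, so the single `client` field is an assumption. `FakeOCRClient` and `app_lifespan` are hypothetical stand-ins and do not reflect the real Mistral client API.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import AsyncIterator


class FakeOCRClient:
    """Hypothetical stand-in for the real OCR client."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


@dataclass
class AppContext:
    # Assumption: the context carries one shared client instance.
    client: FakeOCRClient


@asynccontextmanager
async def app_lifespan(api_key: str) -> AsyncIterator[AppContext]:
    """Create the client once at startup and release it at shutdown."""
    client = FakeOCRClient(api_key)
    try:
        yield AppContext(client=client)
    finally:
        await client.aclose()


async def main() -> None:
    async with app_lifespan("your_mistral_api_key_here") as ctx:
        # Every tool call inside the lifespan would reuse ctx.client.
        print(ctx.client.closed)
    print(ctx.client.closed)


if __name__ == "__main__":
    asyncio.run(main())
```

Creating the client in one place keeps connection setup out of the tool functions and gives a single, predictable point for cleanup.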