Deep Chavda
---
title: Markdown Layout Extractor
emoji: 📄
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
---
<p>
<div align="center">
<h1>
PDF to Markdown MCP
<br /> <br />
<a href="">
<img
src="https://img.shields.io/badge/python%20%7C%203.12-blue"
alt="Python 3.12"
/>
</a>
<a href="https://github.com/astral-sh/uv">
<img
src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json"
alt="uv"
/>
</a>
<a href="https://modelcontextprotocol.io/">
<img
src="https://img.shields.io/badge/MCP-FastMCP-6C47FF"
alt="FastMCP"
/>
</a>
<a href="https://mistral.ai/">
<img
src="https://img.shields.io/badge/Mistral%20AI-FF7000?logoColor=white"
alt="Mistral AI"
/>
</a>
<a href="https://www.starlette.io/">
<img
src="https://img.shields.io/badge/Starlette-ASGI-009688"
alt="Starlette"
/>
</a>
<a href="https://www.uvicorn.org/">
<img
src="https://img.shields.io/badge/Uvicorn-server-4051B5"
alt="Uvicorn"
/>
</a>
<a href="https://loguru.readthedocs.io/">
<img
src="https://img.shields.io/badge/Loguru-logging-FF6B6B"
alt="Loguru"
/>
</a>
</h1>
</div>
</p>
An MCP (Model Context Protocol) server that converts PDFs and documents into Markdown using **Mistral OCR**.
## Features
- **`pdf_to_markdown`**: Convert any publicly accessible PDF/document URL to merged Markdown.
- **`pdf_to_structured_markdown`**: Convert and get per-page structured output (page index, individual markdown, merged result).
- CORS-enabled SSE transport: connect from any MCP client or inspector.
- `/health` endpoint for liveness probing.
- Structured, colorized logging via Loguru.
## Project Structure
```
pdf_to_md_mcp/
├── main.py                    # Entry point: uvicorn runner
├── pyproject.toml
├── sample.env                 # Secrets reference (copy to .env)
├── development.yml            # Non-secret config (server, CORS, OCR model)
└── app/
    ├── server.py              # ASGI app factory (MCP + CORS + health)
    ├── core/
    │   ├── config.py          # Pydantic settings (loads .env + development.yml)
    │   ├── logger.py          # Loguru logger
    │   ├── lifespan.py        # AppContext + Mistral client lifecycle
    │   └── exceptions.py      # Domain exceptions
    ├── services/
    │   └── ocr_service.py     # Mistral OCR business logic
    ├── tools/
    │   └── markdown_tools.py  # @mcp.tool() definitions
    └── utils/
        ├── response.py        # create_response() helper
        └── validators.py      # URL validation
```
## Setup
```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Configure secrets
cp sample.env .env
# Edit .env and set MISTRAL_API_KEY
# Non-secret config (server, CORS, OCR model) lives in development.yml
```
## Run
```bash
uv run main.py
```
The server starts at `http://127.0.0.1:8000` by default and exposes the following endpoints:
| Endpoint | Description |
| --- | --- |
| `GET /health` | Liveness probe |
| `GET /sse` | MCP SSE transport |
| `POST /messages/` | MCP message handler |
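A quick way to confirm the server is up is to request the liveness route. The sketch below uses only the standard library. The helper names `health_url` and `is_healthy` are illustrative and not part of this project, and the code assumes that `/health` answers with HTTP 200 when the server is alive (the response body is not inspected).

```python
from urllib.error import URLError
from urllib.request import urlopen


def health_url(base_url: str) -> str:
    """Build the liveness URL from a base such as http://127.0.0.1:8000."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if GET /health answers with HTTP 200 (assumed success code)."""
    try:
        with urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False
```

Calling `is_healthy("http://127.0.0.1:8000")` after `uv run main.py` should then report whether the local server is reachable.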
## MCP Tools
### `pdf_to_markdown`
Convert a document URL to merged Markdown (all pages concatenated).
**Input**
| Parameter | Type | Description |
| --- | --- | --- |
| `document_url` | `string` | Publicly accessible URL of a PDF or image document |
**Returns**: `string`
```
# Introduction
This paper presents...
## Section 2
...
```
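Both tools expect an absolute, publicly reachable URL. The project keeps its URL validation in `app/utils/validators.py`, whose contents are not shown here, so the following is only a minimal sketch of the kind of check a client could run before calling a tool. The function name `validate_document_url`, the restriction to `http`/`https`, and the use of `ValueError` are all assumptions.

```python
from urllib.parse import urlparse


def validate_document_url(document_url: str) -> str:
    """Return the trimmed URL if it is an absolute http(s) URL, else raise ValueError."""
    url = document_url.strip()
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError("URL has no host")
    return url
```

This only checks the shape of the URL. It does not verify that the document is actually reachable from the public internet.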
---
### `pdf_to_structured_markdown`
Convert a document URL and get per-page structured output alongside the merged result.
**Input**
| Parameter | Type | Description |
| --- | --- | --- |
| `document_url` | `string` | Publicly accessible URL of a PDF or image document |
**Returns**: `object`
```json
{
"page_count": 3,
"pages": [
{ "index": 0, "markdown": "# Page 1\n..." },
{ "index": 1, "markdown": "## Page 2\n..." },
{ "index": 2, "markdown": "### Page 3\n..." }
],
"markdown": "# Page 1\n...\n\n## Page 2\n...\n\n### Page 3\n..."
}
```
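A client can sanity-check this payload before using it. The sketch below is illustrative: the helper names are hypothetical, and the assumption that the merged `markdown` equals the page bodies joined by a blank line is inferred from the sample above rather than confirmed by the server code.

```python
def check_structured_result(payload: dict) -> None:
    """Raise ValueError if page_count or the page indices are inconsistent."""
    pages = payload["pages"]
    if payload["page_count"] != len(pages):
        raise ValueError("page_count does not match the number of pages")
    if [page["index"] for page in pages] != list(range(len(pages))):
        raise ValueError("page indices are not 0..n-1 in order")


def merge_pages(pages: list[dict]) -> str:
    """Join per-page markdown with a blank line (separator assumed from the sample)."""
    return "\n\n".join(page["markdown"] for page in pages)


# The sample response from above, written as a Python dict.
sample = {
    "page_count": 3,
    "pages": [
        {"index": 0, "markdown": "# Page 1\n..."},
        {"index": 1, "markdown": "## Page 2\n..."},
        {"index": 2, "markdown": "### Page 3\n..."},
    ],
    "markdown": "# Page 1\n...\n\n## Page 2\n...\n\n### Page 3\n...",
}
check_structured_result(sample)
```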
## Debugging with MCP Inspector
```bash
npx -y @modelcontextprotocol/inspector
```
In the inspector, select the SSE transport and connect to `http://127.0.0.1:8000/sse` locally, or to your Railway URL in production.
## Deploy to Railway
### 1. Push to GitHub
```bash
git init
git add .
git commit -m "initial commit"
gh repo create pdf-to-md-mcp --public --source=. --push
```
### 2. Create a Railway project
Go to [railway.app](https://railway.app) → **New Project** → **Deploy from GitHub repo** → select your repo.
Railway detects the `railway.json` and uses `uv run main.py` as the start command automatically.
### 3. Set environment variables
In Railway → your service → **Variables**, add:
| Variable | Value |
|---|---|
| `MISTRAL_API_KEY` | your Mistral API key |
| `HOST` | `0.0.0.0` |
> `PORT` is injected automatically by Railway. Do **not** set it manually.
> All other config (`MISTRAL_OCR_MODEL`, `LOG_LEVEL`, etc.) is read from `development.yml`.
### 4. Deploy
Railway triggers a deploy on every push to your default branch. Once live, your public SSE URL will be:
```
https://<your-service>.up.railway.app/sse
```
Use that URL in any MCP client or pass it to the inspector:
```bash
npx -y @modelcontextprotocol/inspector
# connect to: https://<your-service>.up.railway.app/sse
```
### Why it works
- Railway injects `PORT` as an environment variable. Environment variables take precedence over `development.yml` when pydantic-settings loads the config, so the injected port is picked up automatically.
- `HOST=0.0.0.0` (set via Railway Variables) overrides the local `127.0.0.1` default so the container is reachable.
- `proxy_headers=True` in `main.py` makes uvicorn trust Railway's `X-Forwarded-*` headers.
- `/health` is set as Railway's healthcheck path in `railway.json`.
## Configuration
Configuration is split across two files to separate secrets from non-sensitive settings.
### `.env` (secrets only)
```dotenv
MISTRAL_API_KEY=your_mistral_api_key_here
```
### `development.yml` (non-secret config)
```yaml
# Mistral
MISTRAL_OCR_MODEL: mistral-ocr-latest
MISTRAL_TABLE_FORMAT: markdown
# Server
APP_NAME: "Markdown & Layout Extractor"
HOST: "127.0.0.1"
PORT: 8000
LOG_LEVEL: INFO
# CORS
CORS_ALLOW_ORIGINS:
- "*"
CORS_ALLOW_METHODS:
- "*"
CORS_ALLOW_HEADERS:
- "*"
```
**Priority (highest → lowest):** environment variables → `.env` → `development.yml`
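The priority order can be pictured as a layered lookup in which the first source that defines a key wins. The sketch below models that with `collections.ChainMap` from the standard library. It is an illustration of the precedence rule only, not how `app/core/config.py` is implemented, and the function name `layered_settings` is hypothetical.

```python
from collections import ChainMap


def layered_settings(env: dict, dotenv: dict, yaml_cfg: dict) -> ChainMap:
    """Combine the three sources; ChainMap searches left to right, so env wins."""
    return ChainMap(env, dotenv, yaml_cfg)


# Illustrative values: a Railway-style environment overriding the YAML defaults.
settings = layered_settings(
    env={"HOST": "0.0.0.0", "PORT": "5123"},
    dotenv={"MISTRAL_API_KEY": "your_mistral_api_key_here"},
    yaml_cfg={"HOST": "127.0.0.1", "PORT": 8000, "LOG_LEVEL": "INFO"},
)
```

Here `settings["HOST"]` resolves to the environment value, while `settings["LOG_LEVEL"]` falls through to `development.yml`. Unlike pydantic-settings, this sketch does no type coercion, so the environment-supplied `PORT` stays a string.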
### All settings
| Variable | File | Default | Description |
| --- | --- | --- | --- |
| `MISTRAL_API_KEY` | `.env` | **required** | Mistral AI API key |
| `MISTRAL_OCR_MODEL` | `development.yml` | `mistral-ocr-latest` | OCR model identifier |
| `MISTRAL_TABLE_FORMAT` | `development.yml` | `markdown` | Table output format |
| `APP_NAME` | `development.yml` | `Markdown & Layout Extractor` | MCP server name |
| `HOST` | `development.yml` | `127.0.0.1` | Bind address |
| `PORT` | `development.yml` | `8000` | Bind port |
| `LOG_LEVEL` | `development.yml` | `INFO` | Log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
| `CORS_ALLOW_ORIGINS` | `development.yml` | `["*"]` | Allowed CORS origins |
| `CORS_ALLOW_METHODS` | `development.yml` | `["*"]` | Allowed HTTP methods |
| `CORS_ALLOW_HEADERS` | `development.yml` | `["*"]` | Allowed HTTP headers |
## Design Notes
- **Single Starlette app**: `sse_app()` is the sole ASGI application. The health route and CORS middleware are added directly to it, which avoids double-middleware stacking (the cause of the `http.response.start` crash).
- **Separation of concerns**: Tools are thin wrappers around `OCRService`, so the business logic is independently testable.
- **Lifespan-managed client**: The Mistral client is initialized once at startup and shared across all tool calls.
- **Loguru logging**: Structured, colorized logs across all layers.
- **Pydantic Settings**: Type-safe configuration loaded from `.env` and `development.yml`, exposed through an LRU-cached singleton.
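The LRU-cached singleton pattern mentioned in the last note can be sketched with `functools.lru_cache`. This is a stand-in under stated assumptions: the real project uses a Pydantic settings class, whereas this example uses a plain dataclass with two illustrative fields so that it runs with the standard library alone. The names `Settings` and `get_settings` are hypothetical.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Stand-in for the project's Pydantic settings class (fields are illustrative)."""

    host: str
    port: int


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Build Settings on the first call; later calls return the cached instance."""
    return Settings(
        host=os.environ.get("HOST", "127.0.0.1"),
        port=int(os.environ.get("PORT", "8000")),
    )
```

Because the result is cached, every caller shares one object and the environment is read only once. In tests, `get_settings.cache_clear()` forces the next call to rebuild it.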