title: Markdown Layout Extractor
emoji: π
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 7860
pinned: false
An MCP (Model Context Protocol) server that converts PDFs and documents into Markdown using Mistral OCR.
Features
- `pdf_to_markdown`: Convert any publicly accessible PDF/document URL to merged Markdown.
- `pdf_to_structured_markdown`: Convert and get per-page structured output (page index, individual markdown, merged result).
- CORS-enabled SSE transport: connect from any MCP client or inspector.
- `/health` endpoint for liveness probing.
- Structured, colorized logging via Loguru.
Project Structure
pdf_to_md_mcp/
├── main.py                   # Entry point: uvicorn runner
├── pyproject.toml
├── sample.env                # Secrets reference (copy to .env)
├── development.yml           # Non-secret config (server, CORS, OCR model)
└── app/
    ├── server.py             # ASGI app factory (MCP + CORS + health)
    ├── core/
    │   ├── config.py         # Pydantic settings (loads .env + development.yml)
    │   ├── logger.py         # Loguru logger
    │   ├── lifespan.py       # AppContext + Mistral client lifecycle
    │   └── exceptions.py     # Domain exceptions
    ├── services/
    │   └── ocr_service.py    # Mistral OCR business logic
    ├── tools/
    │   └── markdown_tools.py # @mcp.tool() definitions
    └── utils/
        ├── response.py       # create_response() helper
        └── validators.py     # URL validation
Setup
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Configure secrets
cp sample.env .env
# Edit .env and set MISTRAL_API_KEY
# Non-secret config (server, CORS, OCR model) lives in development.yml
Run
uv run main.py
Server starts at http://127.0.0.1:8000 by default.
| Endpoint | Description |
|---|---|
| `GET /health` | Liveness probe |
| `GET /sse` | MCP SSE transport |
| `POST /messages/` | MCP message handler |
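A liveness check against the `/health` endpoint can be scripted with the standard library. This is a minimal sketch: `health_url` and `is_healthy` are hypothetical helper names, and because the response body of `/health` is not documented here, only the HTTP status code is inspected.

```python
import urllib.request


def health_url(base_url: str) -> str:
    """Build the /health URL from a server base URL."""
    return base_url.rstrip("/") + "/health"


def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(health_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Calling `is_healthy("http://127.0.0.1:8000")` while the server is running locally should return `True`, assuming the endpoint answers with status 200.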
MCP Tools
pdf_to_markdown
Convert a document URL to merged Markdown (all pages concatenated).
Input
| Parameter | Type | Description |
|---|---|---|
| `document_url` | string | Publicly accessible URL of a PDF or image document |
Returns: string
# Introduction
This paper presents...
## Section 2
...
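Since `document_url` must be a publicly accessible URL, a syntactic check before calling the OCR service can reject obviously bad input early. The sketch below is an assumption about what a validator such as `utils/validators.py` might do, not the project's actual code; `validate_document_url` is a hypothetical name. It cannot confirm that the URL is reachable, only that it is an absolute `http` or `https` URL.

```python
from urllib.parse import urlparse


def validate_document_url(url: str) -> str:
    """Return the stripped URL if it is an absolute http(s) URL, else raise ValueError."""
    candidate = url.strip()
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"unsupported URL scheme: {parsed.scheme!r}")
    if not parsed.netloc:
        raise ValueError("URL has no host")
    return candidate
```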
pdf_to_structured_markdown
Convert a document URL and get per-page structured output alongside the merged result.
Input
| Parameter | Type | Description |
|---|---|---|
| `document_url` | string | Publicly accessible URL of a PDF or image document |
Returns: object
{
"page_count": 3,
"pages": [
{ "index": 0, "markdown": "# Page 1\n..." },
{ "index": 1, "markdown": "## Page 2\n..." },
{ "index": 2, "markdown": "### Page 3\n..." }
],
"markdown": "# Page 1\n...\n\n## Page 2\n...\n\n### Page 3\n..."
}
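The shape of this object can be reproduced from a list of per-page Markdown strings. The following is a sketch only: `build_structured_result` is a hypothetical helper, and the blank-line separator between pages is inferred from the example above rather than taken from the service code.

```python
def build_structured_result(page_markdowns: list[str]) -> dict:
    """Assemble the per-page and merged output from page-level Markdown."""
    pages = [
        {"index": index, "markdown": markdown}
        for index, markdown in enumerate(page_markdowns)
    ]
    return {
        "page_count": len(pages),
        "pages": pages,
        # Assumption: pages are merged with one blank line between them.
        "markdown": "\n\n".join(page_markdowns),
    }
```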
Debugging with MCP Inspector
npx -y @modelcontextprotocol/inspector
Connect to http://127.0.0.1:8000/sse locally or your Railway URL in production.
Deploy to Railway
1. Push to GitHub
git init
git add .
git commit -m "initial commit"
gh repo create pdf-to-md-mcp --public --source=. --push
2. Create a Railway project
Go to railway.app → New Project → Deploy from GitHub repo → select your repo.
Railway detects the railway.json and uses uv run main.py as the start command automatically.
3. Set environment variables
In Railway β your service β Variables, add:
| Variable | Value |
|---|---|
| `MISTRAL_API_KEY` | your Mistral API key |
| `HOST` | `0.0.0.0` |
`PORT` is injected automatically by Railway; do not set it manually.
All other config (`MISTRAL_OCR_MODEL`, `LOG_LEVEL`, etc.) is read from `development.yml`.
4. Deploy
Railway triggers a deploy on every push to your default branch. Once live, your public SSE URL will be:
https://<your-service>.up.railway.app/sse
Use that URL in any MCP client or pass it to the inspector:
npx -y @modelcontextprotocol/inspector
# connect to: https://<your-service>.up.railway.app/sse
Why it works
- Railway injects `PORT` as an env var; pydantic-settings reads env vars before `development.yml`, so it's picked up automatically.
- `HOST=0.0.0.0` (set via Railway Variables) overrides the local `127.0.0.1` default so the container is reachable.
- `proxy_headers=True` in `main.py` makes uvicorn trust Railway's `X-Forwarded-*` headers.
- `/health` is set as Railway's healthcheck path in `railway.json`.
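The way these pieces fit together at startup can be sketched as follows. This is not the project's `main.py`: the real project loads settings through pydantic-settings, whereas this stand-in reads `HOST` and `PORT` directly from the environment, and `server_kwargs` is a hypothetical helper name. The `host`, `port`, and `proxy_headers` keys match the keyword arguments of `uvicorn.run`.

```python
import os


def server_kwargs(env=None) -> dict:
    """Derive uvicorn keyword arguments from environment variables."""
    env = os.environ if env is None else env
    return {
        # Local default; Railway Variables set HOST=0.0.0.0.
        "host": env.get("HOST", "127.0.0.1"),
        # Railway injects PORT; environment values are strings, so convert.
        "port": int(env.get("PORT", "8000")),
        # Trust X-Forwarded-* headers set by the platform's proxy.
        "proxy_headers": True,
    }


# Usage (assumes uvicorn is installed and `app` is the ASGI application):
#   import uvicorn
#   uvicorn.run(app, **server_kwargs())
```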
Configuration
Configuration is split across two files to separate secrets from non-sensitive settings.
.env: Secrets only
MISTRAL_API_KEY=your_mistral_api_key_here
development.yml: Non-secret config
# Mistral
MISTRAL_OCR_MODEL: mistral-ocr-latest
MISTRAL_TABLE_FORMAT: markdown
# Server
APP_NAME: "Markdown & Layout Extractor"
HOST: "127.0.0.1"
PORT: 8000
LOG_LEVEL: INFO
# CORS
CORS_ALLOW_ORIGINS:
- "*"
CORS_ALLOW_METHODS:
- "*"
CORS_ALLOW_HEADERS:
- "*"
Priority (highest → lowest): environment variables → .env → development.yml
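The priority order can be illustrated with plain dictionaries. This sketch uses `collections.ChainMap` as a stand-in for the layering that pydantic-settings performs in the project; `resolve_settings` and the three input dictionaries are hypothetical, and no type coercion is done here.

```python
from collections import ChainMap


def resolve_settings(env_vars: dict, dotenv: dict, yaml_config: dict) -> dict:
    """Merge three config layers; earlier arguments win on conflicts."""
    return dict(ChainMap(env_vars, dotenv, yaml_config))


merged = resolve_settings(
    {"PORT": "5123"},                          # environment variables
    {"MISTRAL_API_KEY": "example-key"},        # .env
    {"PORT": 8000, "HOST": "127.0.0.1"},       # development.yml
)
# PORT comes from the environment layer; HOST falls through to development.yml.
```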
All settings
| Variable | File | Default | Description |
|---|---|---|---|
| `MISTRAL_API_KEY` | `.env` | required | Mistral AI API key |
| `MISTRAL_OCR_MODEL` | `development.yml` | `mistral-ocr-latest` | OCR model identifier |
| `MISTRAL_TABLE_FORMAT` | `development.yml` | `markdown` | Table output format |
| `APP_NAME` | `development.yml` | `Markdown & Layout Extractor` | MCP server name |
| `HOST` | `development.yml` | `127.0.0.1` | Bind address |
| `PORT` | `development.yml` | `8000` | Bind port |
| `LOG_LEVEL` | `development.yml` | `INFO` | Log level (`DEBUG`, `INFO`, `WARNING`, `ERROR`) |
| `CORS_ALLOW_ORIGINS` | `development.yml` | `["*"]` | Allowed CORS origins |
| `CORS_ALLOW_METHODS` | `development.yml` | `["*"]` | Allowed HTTP methods |
| `CORS_ALLOW_HEADERS` | `development.yml` | `["*"]` | Allowed HTTP headers |
Design Notes
- Single Starlette app: `sse_app()` is the sole ASGI application; the health route and CORS middleware are injected directly onto it to prevent double-middleware stacking (which causes the `http.response.start` crash).
- Separation of concerns: Tools are thin wrappers around `OCRService`; business logic is independently testable.
- Lifespan-managed client: The Mistral client is initialized once at startup and shared across all tool calls.
- Loguru logging: Structured, colorized logs across all layers via Loguru.
- Pydantic Settings: Type-safe, `.env`-driven configuration with an LRU-cached singleton.
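The LRU-cached singleton pattern mentioned in the last note can be sketched with the standard library alone. This is an illustration under assumptions: the project uses a Pydantic settings class, while this example substitutes a frozen dataclass, and the `Settings` fields and `get_settings` name are hypothetical.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Stand-in for the project's Pydantic settings class."""
    host: str
    port: int
    log_level: str


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Build the settings once; later calls return the cached instance."""
    return Settings(
        host=os.environ.get("HOST", "127.0.0.1"),
        port=int(os.environ.get("PORT", "8000")),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )
```

Every caller of `get_settings()` receives the same object, so the environment is read only once per process. In tests, `get_settings.cache_clear()` forces a reload.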