title: PDF MCP Server
emoji: π
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.35.0
app_file: app.py
pinned: false
license: mit
short_description: PDF processing tools accessible via MCP protocol
PDF MCP Server
PDF processing tools accessible via the Model Context Protocol (MCP)
This Hugging Face Space provides a PDF processing server that can be used as an MCP (Model Context Protocol) server by AI coding tools such as Cursor IDE.
Features
- Extract text from PDF files (single page or all pages)
- Get PDF metadata (title, author, page count, etc.)
- Extract images from PDFs and encode them as base64
- Render PDF pages as images at a configurable zoom factor
- Search text with optional case sensitivity
- Split PDF files by page ranges
- JSON-formatted responses for easy integration
- MCP compatibility for AI assistants
Usage in Cursor IDE
Add this configuration to your Cursor IDE MCP settings:
{
  "mcpServers": {
    "pdf-server": {
      "command": "npx",
      "args": [
        "mcp-remote",
        "https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse"
      ]
    }
  }
}
Replace YOUR-USERNAME with your Hugging Face username.
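If you prefer to generate this configuration rather than edit it by hand, the following is a minimal sketch. It assumes the Space keeps the name pdf-mcp-server and that the username is already in the form used in the Space URL. build_cursor_config is a hypothetical helper for illustration, not part of this Space.

```python
import json


def build_cursor_config(username: str, space_name: str = "pdf-mcp-server") -> str:
    """Return Cursor MCP settings as a JSON string (hypothetical helper)."""
    # Assumption: the Space is served at https://<username>-<space_name>.hf.space
    url = f"https://{username}-{space_name}.hf.space/gradio_api/mcp/sse"
    config = {
        "mcpServers": {
            "pdf-server": {
                "command": "npx",
                "args": ["mcp-remote", url],
            }
        }
    }
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(build_cursor_config("YOUR-USERNAME"))
```

The printed JSON can be pasted into the Cursor MCP settings as-is.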
Available MCP Tools
extract_text_from_pdf(pdf_path, page_number=None)
Extract text content from PDF files. If page_number is specified, extracts only that page; otherwise extracts all pages.
Parameters:
- pdf_path (str): Path to the PDF file
- page_number (int, optional): Specific page number (1-indexed)
Returns: JSON with extracted text and metadata
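The single-page versus all-pages behaviour can be sketched without any PDF library. The example below is an illustration only: it takes already-extracted page texts as input, and the function name select_page_text and the JSON field names (total_pages, pages, page, text, error) are assumptions, not the server's actual response schema.

```python
import json
from typing import List, Optional


def select_page_text(page_texts: List[str], page_number: Optional[int] = None) -> str:
    """Return JSON for one page (1-indexed) or for all pages (hypothetical schema)."""
    total = len(page_texts)
    if page_number is None:
        # No page given: return every page, numbered from 1.
        pages = [{"page": i, "text": t} for i, t in enumerate(page_texts, start=1)]
    elif 1 <= page_number <= total:
        # Convert the 1-indexed page number to a 0-based list index.
        pages = [{"page": page_number, "text": page_texts[page_number - 1]}]
    else:
        return json.dumps({"error": f"page_number must be between 1 and {total}"})
    return json.dumps({"total_pages": total, "pages": pages})
```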
get_pdf_metadata(pdf_path)
Get comprehensive metadata information from PDF files.
Parameters:
- pdf_path (str): Path to the PDF file
Returns: JSON with title, author, creation date, page count, etc.
extract_images_from_pdf(pdf_path, page_number=None)
Extract images from PDF files and return them as base64 encoded strings.
Parameters:
- pdf_path (str): Path to the PDF file
- page_number (int, optional): Specific page number (1-indexed)
Returns: JSON with base64 encoded images and metadata
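On the client side, base64 strings have to be decoded back into bytes before the images can be saved or displayed. The sketch below assumes a response with an images list whose entries carry the encoded bytes in a data field; these field names and the helper decode_images are hypothetical, so check the actual response in the web interface first.

```python
import base64
import json
from typing import List


def decode_images(response_json: str) -> List[bytes]:
    """Decode every base64 image in a tool response (hypothetical schema)."""
    payload = json.loads(response_json)
    # Assumption: images are listed under "images", each with a "data" field.
    return [base64.b64decode(image["data"]) for image in payload.get("images", [])]


if __name__ == "__main__":
    # Build a fake response so the sketch runs without the server.
    fake_bytes = b"\x89PNG\r\n\x1a\n"
    fake_response = json.dumps(
        {"images": [{"page": 1, "data": base64.b64encode(fake_bytes).decode("ascii")}]}
    )
    assert decode_images(fake_response) == [fake_bytes]
```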
render_pdf_page(pdf_path, page_number=1, zoom=2.0)
Render a specific page of PDF as a high-quality image.
Parameters:
- pdf_path (str): Path to the PDF file
- page_number (int): Page number to render (1-indexed)
- zoom (float): Zoom factor for rendering quality
Returns: JSON with base64 encoded page image
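To get a feel for what a given zoom value means, recall that PDF page sizes are measured in points (1/72 inch). Assuming the server scales the page uniformly by zoom, a zoom of 2.0 corresponds to roughly 144 DPI. The helper below, estimate_render_size, is a hypothetical illustration and only approximates the output size; the renderer may round differently.

```python
POINTS_PER_INCH = 72  # PDF page dimensions are expressed in points (1/72 inch)


def estimate_render_size(width_pt: float, height_pt: float, zoom: float = 2.0) -> dict:
    """Estimate the pixel size and DPI of a page rendered at a zoom factor."""
    if zoom <= 0:
        raise ValueError("zoom must be positive")
    return {
        "dpi": POINTS_PER_INCH * zoom,
        # Approximate: the actual renderer may round up or down by a pixel.
        "width_px": round(width_pt * zoom),
        "height_px": round(height_pt * zoom),
    }
```

For example, a US Letter page (612 x 792 points) at the default zoom comes out at about 1224 x 1584 pixels, so larger zoom values increase the size of the base64 payload quickly.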
search_text_in_pdf(pdf_path, search_term, case_sensitive=False)
Search for text within PDF files with optional case sensitivity.
Parameters:
- pdf_path (str): Path to the PDF file
- search_term (str): Text to search for
- case_sensitive (bool): Whether the search should be case sensitive
Returns: JSON with search results including page numbers and coordinates
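The effect of case_sensitive can be illustrated on plain strings. The sketch below is not the server's implementation: it searches already-extracted page texts and reports character offsets as a stand-in for the page coordinates the tool returns. The name search_pages and the result fields are assumptions.

```python
from typing import Dict, List


def search_pages(page_texts: List[str], search_term: str,
                 case_sensitive: bool = False) -> List[Dict[str, int]]:
    """Find every occurrence of search_term, reporting 1-indexed page numbers."""
    if not search_term:
        return []  # An empty term would match everywhere; treat it as no match.
    needle = search_term if case_sensitive else search_term.lower()
    results = []
    for page_number, text in enumerate(page_texts, start=1):
        haystack = text if case_sensitive else text.lower()
        start = 0
        while True:
            index = haystack.find(needle, start)
            if index == -1:
                break
            # Offsets refer to the (possibly lowercased) page text.
            results.append({"page": page_number, "offset": index})
            start = index + len(needle)
    return results
```

With case_sensitive=False the term "important" matches both "Important" and "important"; with case_sensitive=True only the exact casing matches.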
split_pdf_pages(pdf_path, start_page, end_page, output_path)
Extract specific page ranges from PDF files and save as new PDF.
Parameters:
- pdf_path (str): Path to the source PDF file
- start_page (int): Starting page number (1-indexed)
- end_page (int): Ending page number (1-indexed, inclusive)
- output_path (str): Path for the output PDF file
Returns: JSON with operation result and file information
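Because start_page and end_page are 1-indexed and inclusive while PyMuPDF numbers pages from 0, a range has to be validated and shifted before it is handed to the library. The following is a minimal sketch of that step; to_zero_based_range is a hypothetical helper, and the server's own validation and error messages may differ.

```python
from typing import Tuple


def to_zero_based_range(start_page: int, end_page: int, page_count: int) -> Tuple[int, int]:
    """Validate a 1-indexed inclusive range and return 0-based inclusive indices."""
    if page_count < 1:
        raise ValueError("document has no pages")
    if not 1 <= start_page <= page_count:
        raise ValueError(f"start_page must be between 1 and {page_count}")
    if not start_page <= end_page <= page_count:
        raise ValueError(f"end_page must be between {start_page} and {page_count}")
    # Both bounds stay inclusive; only the numbering base changes.
    return start_page - 1, end_page - 1
```

For a 10-page document, pages 2 through 4 map to the 0-based indices 1 through 3, and the resulting file contains three pages.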
Web Interface
This Space also provides a user-friendly web interface where you can:
- Upload PDF files directly in your browser
- Test all available operations with real-time results
- View formatted JSON responses
- Experiment with different parameters before using in your MCP client
MCP Protocol
The server implements the Model Context Protocol (MCP), which allows AI assistants to call these tools directly. The MCP endpoint is available at:
https://YOUR-USERNAME-pdf-mcp-server.hf.space/gradio_api/mcp/sse
Technical Details
- Framework: Gradio with MCP support
- PDF Processing: PyMuPDF (fitz) for high-performance PDF operations
- Image Processing: PIL/Pillow for image handling
- Protocol: Server-Sent Events (SSE) for MCP communication
- Format: JSON responses for all operations
Example Usage
# Example: Extract text from the first page (pages are 1-indexed)
result = extract_text_from_pdf("/path/to/document.pdf", page_number=1)
# Example: Case-sensitive search for text
result = search_text_in_pdf("/path/to/document.pdf", "important", case_sensitive=True)
# Example: Get metadata
result = get_pdf_metadata("/path/to/document.pdf")
Contributing
This project is open source. Feel free to contribute improvements or report issues.
License
MIT License - feel free to use this in your own projects!