| title: Document to Markdown Converter | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: gradio | |
| sdk_version: 4.44.0 | |
| app_file: app.py | |
| pinned: true | |
| license: mit | |
| python_version: 3.11 | |
| hardware: cpu-basic | |
| tags: | |
| - document-processing | |
| - markdown | |
| - pdf-converter | |
| - text-extraction | |
| short_description: Convert PDF and DOCX documents to Markdown format | |
| # Document to Markdown Converter | |
| Convert PDF and DOCX documents to Markdown format with intelligent structure analysis. | |
| ## Features | |
| ### Supported Formats | |
| - **PDF** - Extract text with formatting preservation | |
| - **Word Documents** (.docx) - Full formatting and structure conversion | |
| ### 🧠 Smart Processing | |
| - **Heading Detection** - Automatically detect headings based on styles and formatting | |
| - **Table Extraction** - Convert tables to Markdown format | |
| - **List Processing** - Preserve ordered and unordered lists | |
| - **Inline Formatting** - Maintain bold, italic, and other text formatting | |
| - **Structure Analysis** - Detailed document structure statistics | |
| ### ⚡ Key Capabilities | |
| - **Font-based Heading Detection** - Uses font size and styling to identify headings | |
| - **Style Recognition** - Recognizes Word document styles (Title, Heading 1-6) | |
| - **Table Conversion** - Converts complex tables to Markdown table format | |
| - **List Recognition** - Identifies and converts various list formats | |
| - **Text Formatting** - Preserves bold, italic formatting in Markdown syntax | |
| ## Usage | |
| ### Basic Processing | |
| 1. Upload a PDF or DOCX file | |
| 2. Click "Convert to Markdown" | |
| 3. View the converted Markdown in the output tab | |
| ### Options | |
| - **Structure Analysis**: Enable to see detailed document statistics | |
| - **Preview Mode**: Show only the first 500 characters of the output for a quick preview | |
| ### Output Tabs | |
| - **Markdown Output**: The complete converted Markdown text | |
| - **Structure Analysis**: Statistics about headings, lists, tables, etc. | |
| - **File Information**: Basic file details (name, type, size) | |
| ## Technical Details | |
| ### PDF Processing | |
| - Uses PyMuPDF (fitz) for text extraction | |
| - Analyzes font sizes to determine heading hierarchy | |
| - Preserves text formatting flags (bold, italic) | |
| - Processes text blocks while maintaining structure | |
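The font-size approach described above can be sketched without opening a PDF. This is a minimal illustration, not the Space's actual code: it assumes spans shaped like the dictionaries PyMuPDF returns from `page.get_text("dict")` (keys `text`, `size`, `flags`), and the helper names `heading_levels` and `span_to_markdown` are hypothetical. It also assumes the body font is the size covering the most characters.

```python
from collections import Counter

BOLD_FLAG = 16    # bit 4 of a PyMuPDF span's "flags" value
ITALIC_FLAG = 2   # bit 1 of a PyMuPDF span's "flags" value


def heading_levels(spans, max_level=6):
    """Map each font size larger than the body size to a heading level.

    Assumption: the body size is the one that covers the most characters.
    """
    weight = Counter()
    for span in spans:
        weight[round(span["size"], 1)] += len(span["text"])
    if not weight:
        return {}
    body_size = weight.most_common(1)[0][0]
    larger = sorted((s for s in weight if s > body_size), reverse=True)
    return {size: level for level, size in enumerate(larger[:max_level], start=1)}


def span_to_markdown(span, levels):
    """Render one span as a heading line or as inline-formatted text."""
    text = span["text"].strip()
    if not text:
        return ""
    level = levels.get(round(span["size"], 1))
    if level:
        return "#" * level + " " + text
    bold = span["flags"] & BOLD_FLAG
    italic = span["flags"] & ITALIC_FLAG
    if bold and italic:
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text


# Hand-written sample spans standing in for real PDF output.
spans = [
    {"text": "Annual Report", "size": 24.0, "flags": 16},
    {"text": "Summary", "size": 16.0, "flags": 16},
    {"text": "Revenue grew during the reporting period.", "size": 11.0, "flags": 0},
    {"text": "Important:", "size": 11.0, "flags": 16},
]
levels = heading_levels(spans)
lines = [span_to_markdown(s, levels) for s in spans]
```

Ranking sizes relative to the body size, rather than using fixed point thresholds, keeps the sketch independent of any one document's typography.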
| ### DOCX Processing | |
| - Uses python-docx for document parsing | |
| - Recognizes built-in Word styles | |
| - Extracts tables with proper formatting | |
| - Maintains paragraph-level formatting | |
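As a rough sketch of style-based heading conversion, the function below maps Word style names to Markdown heading prefixes. The function names are hypothetical, and treating "Title" as a level-1 heading is an assumption; the Space may map it differently.

```python
import re


def style_to_heading_level(style_name):
    """Return 1-6 for a heading style name, or None for any other style.

    Assumption: "Title" is treated as a level-1 heading.
    """
    if style_name == "Title":
        return 1
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    return int(match.group(1)) if match else None


def paragraph_to_markdown(style_name, text):
    """Prefix the text with '#' characters when the style is a heading."""
    level = style_to_heading_level(style_name)
    if level is None:
        return text
    return "#" * level + " " + text


# With python-docx, the inputs would come from each paragraph, e.g.:
#   for p in docx.Document(path).paragraphs:
#       paragraph_to_markdown(p.style.name, p.text)
examples = [
    paragraph_to_markdown("Title", "Project Plan"),
    paragraph_to_markdown("Heading 2", "Scope"),
    paragraph_to_markdown("Normal", "Plain body text."),
]
```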
| ### Structure Analysis | |
| The application analyzes: | |
| - **Headings**: Count by level (H1-H6) | |
| - **Lists**: Ordered vs unordered list items | |
| - **Tables**: Number of tables detected | |
| - **Paragraphs**: Regular text paragraphs | |
| - **Formatting**: Bold and italic text occurrences | |
| - **Statistics**: Word count, character count, total lines | |
| ## Installation | |
| ### Local Development | |
| ```bash | |
| # Clone the repository | |
| git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter | |
| cd document-to-markdown-converter | |
| # Install dependencies | |
| pip install -r requirements.txt | |
| # Run the application | |
| python app.py | |
| ``` | |
| ### Dependencies | |
| - `gradio>=4.0.0` - Web interface framework | |
| - `python-docx>=1.1.0` - Word document processing | |
| - `PyMuPDF>=1.23.0` - PDF processing library | |
| ## API | |
| ### Core Function | |
| ```python | |
| from typing import Any, Dict | |
| def extract_document_to_markdown(file_path: str) -> Dict[str, Any]: | |
|     """ | |
|     Extract document content and convert it to Markdown format. | |
|     Args: | |
|         file_path: Path to a PDF or DOCX file | |
|     Returns: | |
|         Dictionary containing: | |
|         - success: Boolean indicating success | |
|         - markdown: Converted Markdown content | |
|         - structure: Document structure analysis | |
|         - file_info: File metadata (name, type, size) | |
|         - preview: Short preview of content | |
|         - error: Error message if processing failed | |
|     """ | |
| ``` | |
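A caller would typically check `success` before reading the other keys. The helper below is a small sketch of that pattern. `describe_result` is a hypothetical name, and the `name` key inside `file_info` is an assumption based on the docstring's "(name, type, size)" description.

```python
from typing import Any, Dict


def describe_result(result: Dict[str, Any]) -> str:
    """Summarize a result dictionary shaped like the one documented above."""
    if not result.get("success"):
        return f"Conversion failed: {result.get('error', 'unknown error')}"
    info = result.get("file_info", {})
    name = info.get("name", "document")  # assumed key name
    markdown = result.get("markdown", "")
    return f"Converted {name}: {len(markdown)} characters of Markdown"


# Hand-written dictionaries standing in for real return values.
ok = {"success": True, "markdown": "# Title\n\nBody", "file_info": {"name": "report.pdf"}}
failed = {"success": False, "error": "Unsupported file type"}
messages = [describe_result(ok), describe_result(failed)]
```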
| ### Structure Analysis Output | |
| ```json | |
| { | |
|   "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0}, | |
|   "lists": {"ordered": 3, "unordered": 7}, | |
|   "tables": 2, | |
|   "paragraphs": 45, | |
|   "bold_text": 12, | |
|   "italic_text": 8, | |
|   "total_lines": 120, | |
|   "word_count": 2500, | |
|   "character_count": 15000 | |
| } | |
| ``` | |
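To make the shape of this output concrete, here is a simplified sketch that derives most of these counters from Markdown text with line-based rules. It is an illustration under assumptions, not the Space's analyzer: `analyze_structure` is a hypothetical name, a table is counted as any run of consecutive lines starting with `|`, and the `bold_text` and `italic_text` counters are omitted for brevity.

```python
import re


def analyze_structure(markdown):
    """Count headings, list items, tables, and paragraphs line by line."""
    lines = markdown.splitlines()
    stats = {
        "headings": {f"h{n}": 0 for n in range(1, 7)},
        "lists": {"ordered": 0, "unordered": 0},
        "tables": 0,
        "paragraphs": 0,
        "total_lines": len(lines),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }
    in_table = False
    for line in lines:
        stripped = line.strip()
        is_table_row = stripped.startswith("|")
        if is_table_row and not in_table:
            stats["tables"] += 1  # first row of a new table
        in_table = is_table_row
        heading = re.match(r"(#{1,6})\s", stripped)
        if heading:
            stats["headings"][f"h{len(heading.group(1))}"] += 1
        elif re.match(r"\d+\.\s", stripped):
            stats["lists"]["ordered"] += 1
        elif re.match(r"[-*+]\s", stripped):
            stats["lists"]["unordered"] += 1
        elif stripped and not is_table_row:
            stats["paragraphs"] += 1
    return stats


sample = (
    "# Title\n\nIntro paragraph.\n\n## Section\n\n"
    "- one\n- two\n\n1. first\n\n"
    "| a | b |\n| --- | --- |\n| 1 | 2 |\n"
)
stats = analyze_structure(sample)
```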
| ## Examples | |
| ### Converting a PDF | |
| 1. Upload a PDF file | |
| 2. The application will: | |
|    - Extract text from each page | |
|    - Detect headings based on font size | |
|    - Preserve bold/italic formatting | |
|    - Convert to clean Markdown | |
| ### Converting a DOCX | |
| 1. Upload a Word document | |
| 2. The application will: | |
|    - Parse document styles | |
|    - Convert headings based on style names | |
|    - Extract and format tables | |
|    - Maintain list structures | |
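The table step can be sketched as a function that turns rows of cell text into a Markdown table. This is an illustrative sketch, not the Space's code: `rows_to_markdown_table` is a hypothetical name, the first row is assumed to be the header, and short rows are padded with empty cells.

```python
def rows_to_markdown_table(rows):
    """Render a list of rows (first row = header) as a Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        # Escape pipes and flatten newlines so each cell stays on one line.
        cells = [str(c).replace("|", "\\|").replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


# With python-docx, rows could be collected from a table object, e.g.:
#   rows = [[cell.text for cell in row.cells] for row in table.rows]
rows = [["Name", "Qty"], ["Widget", "2"], ["Gasket\n(large)", "10"]]
table_md = rows_to_markdown_table(rows)
```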
| ## Limitations | |
| - **OCR**: Does not perform OCR, so image-based (scanned) PDFs yield little or no text | |
| - **Complex Layouts**: May not perfectly preserve complex document layouts | |
| - **Images**: Does not extract or convert embedded images | |
| - **Fonts**: PDF font analysis is limited to font size and bold/italic flags | |
| ## Contributing | |
| 1. Fork the repository | |
| 2. Create a feature branch | |
| 3. Make your changes | |
| 4. Test thoroughly | |
| 5. Submit a pull request | |
| ## License | |
| MIT License. See the LICENSE file for details. | |
| ## Support | |
| For issues and feature requests, please use the Community tab or create an issue on GitHub. | |
| --- | |
| *Built with ❤️ using Gradio, python-docx, and PyMuPDF* |