	---
	title: Document to Markdown Converter
	emoji: 📄
	colorFrom: blue
	colorTo: purple
	sdk: gradio
	sdk_version: 4.44.0
	app_file: app.py
	pinned: true
	license: mit
	python_version: 3.11
	hardware: cpu-basic
	tags:
	- document-processing
	- markdown
	- pdf-converter
	- text-extraction
	short_description: Convert PDF and DOCX documents to Markdown format
	---

	# 📄 Document to Markdown Converter

	Convert PDF and DOCX documents to Markdown, detecting headings, lists, tables, and inline formatting, with an optional structure analysis report.

	## Features

	### 📄 Supported Formats
	- PDF - Extract text with formatting preservation
	- Word Documents (.docx) - Full formatting and structure conversion

	### 🧠 Smart Processing
	- Heading Detection - Automatically detect headings based on styles and formatting
	- Table Extraction - Convert tables to Markdown format
	- List Processing - Preserve ordered and unordered lists
	- Inline Formatting - Maintain bold, italic, and other text formatting
	- Structure Analysis - Detailed document structure statistics

	### ⚡ Key Capabilities
	- Font-based Heading Detection - Uses font size and styling to identify headings
	- Style Recognition - Recognizes Word document styles (Title, Heading 1-6)
	- Table Conversion - Converts complex tables to Markdown table format
	- List Recognition - Identifies and converts various list formats
	- Text Formatting - Preserves bold, italic formatting in Markdown syntax
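As a rough illustration of list recognition, the sketch below normalizes a few common list markers to Markdown syntax. The function name `normalize_list_item` and the set of markers it accepts are assumptions made for this example; the Space's actual rules may differ.

```python
import re


def normalize_list_item(line: str) -> str:
    """Rewrite a line that looks like a list item into Markdown list syntax.

    Hypothetical helper: lines that do not look like list items are
    returned unchanged.
    """
    stripped = line.strip()

    # Numbered items such as "1. text" or "1) text" become "1. text".
    numbered = re.match(r"(\d+)[.)]\s+(.*)", stripped)
    if numbered:
        return f"{numbered.group(1)}. {numbered.group(2)}"

    # Bullet glyphs often found in extracted text become "- text".
    bullet = re.match(r"[•◦▪‣·\-*]\s+(.*)", stripped)
    if bullet:
        return f"- {bullet.group(1)}"

    return line


items = [normalize_list_item(s) for s in ["• apples", "2) pears", "plain text"]]
```

Normalizing every bullet glyph to `-` keeps the output consistent regardless of which symbol the source document used.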

	## Usage

	### Basic Processing
	1. Upload a PDF or DOCX file
	2. Click "Convert to Markdown"
	3. View the converted Markdown in the output tab

	### Options
	- Structure Analysis: Enable to see detailed document statistics
	- Preview Mode: Show only the first 500 characters for quick preview
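Preview Mode can be pictured as a simple truncation of the converted text. This is a minimal sketch: the name `make_preview` and the trailing ellipsis are assumptions, and only the 500-character limit comes from the option described above.

```python
def make_preview(markdown: str, limit: int = 500) -> str:
    """Return at most `limit` characters of the Markdown text.

    Hypothetical helper: an ellipsis is appended when text was cut off,
    which the real application may or may not do.
    """
    if len(markdown) <= limit:
        return markdown
    return markdown[:limit] + "..."


short_preview = make_preview("# Title\n\nA short document.")
long_preview = make_preview("x" * 600)
```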

	### Output Tabs
	- Markdown Output: The complete converted Markdown text
	- Structure Analysis: Statistics about headings, lists, tables, etc.
	- File Information: Basic file details (name, type, size)

	## Technical Details

	### PDF Processing
	- Uses PyMuPDF (fitz) for text extraction
	- Analyzes font sizes to determine heading hierarchy
	- Preserves text formatting flags (bold, italic)
	- Processes text blocks while maintaining structure
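The font-size approach can be sketched without opening a PDF at all. The example below works on plain dictionaries shaped like the spans PyMuPDF returns from `page.get_text("dict")` (each span has `text`, `size`, and `flags`; in `flags`, bit 1 marks italic and bit 4 marks bold). The helper names and the rule "every size larger than the body size is a heading" are assumptions for illustration, not the Space's actual logic.

```python
from collections import Counter

# Flag bits as documented for PyMuPDF text spans.
ITALIC = 1 << 1  # 2
BOLD = 1 << 4    # 16


def body_font_size(spans):
    """Return the font size covering the most characters (assumed body text)."""
    weights = Counter()
    for span in spans:
        weights[round(span["size"], 1)] += len(span["text"])
    return weights.most_common(1)[0][0]


def heading_levels(spans):
    """Map each size larger than the body size to a level (1 = largest, max 6)."""
    body = body_font_size(spans)
    sizes = {round(span["size"], 1) for span in spans}
    larger = sorted((size for size in sizes if size > body), reverse=True)
    return {size: min(level, 6) for level, size in enumerate(larger, start=1)}


def span_to_markdown(span, levels):
    """Render one span as a heading line or as inline-formatted text."""
    text = span["text"].strip()
    if not text:
        return ""
    level = levels.get(round(span["size"], 1))
    if level:
        return "#" * level + " " + text
    bold = bool(span["flags"] & BOLD)
    italic = bool(span["flags"] & ITALIC)
    if bold and italic:
        return f"***{text}***"
    if bold:
        return f"**{text}**"
    if italic:
        return f"*{text}*"
    return text


# Hand-written sample spans standing in for real PDF output.
spans = [
    {"text": "Annual Report", "size": 24.0, "flags": 16},
    {"text": "Overview", "size": 16.0, "flags": 16},
    {"text": "Revenue grew steadily over the year.", "size": 11.0, "flags": 0},
    {"text": "important", "size": 11.0, "flags": 16},
    {"text": "see appendix", "size": 11.0, "flags": 2},
]
levels = heading_levels(spans)
lines = [span_to_markdown(span, levels) for span in spans]
```

With a real document, the spans would be collected from the `lines` of each text block in `page.get_text("dict")["blocks"]`; image blocks carry no `lines` key and would need to be skipped.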

	### DOCX Processing
	- Uses python-docx for document parsing
	- Recognizes built-in Word styles
	- Extracts tables with proper formatting
	- Maintains paragraph-level formatting
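The two DOCX steps that matter most for the output, style-to-heading mapping and table conversion, can be sketched as pure functions. The names `heading_prefix` and `table_to_markdown` are hypothetical, and treating `Title` as a level-1 heading is an assumption. With python-docx, the inputs would come from `paragraph.style.name` and from `[[cell.text for cell in row.cells] for row in table.rows]`.

```python
import re


def heading_prefix(style_name: str) -> str:
    """Return the Markdown heading prefix for a Word style name, or '' for body text."""
    if style_name == "Title":
        return "# "  # Assumption: Title maps to the same level as Heading 1.
    match = re.fullmatch(r"Heading ([1-6])", style_name)
    if match:
        return "#" * int(match.group(1)) + " "
    return ""


def table_to_markdown(rows) -> str:
    """Convert rows of cell strings to a Markdown table; the first row is the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def format_row(row):
        # Escape pipes and flatten line breaks so each row stays on one line.
        cells = [cell.replace("|", "\\|").replace("\n", " ").strip() for cell in row]
        cells += [""] * (width - len(cells))  # Pad short rows.
        return "| " + " | ".join(cells) + " |"

    lines = [format_row(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [format_row(row) for row in rows[1:]]
    return "\n".join(lines)


heading = heading_prefix("Heading 2") + "Methods"
table = table_to_markdown([["Name", "Qty"], ["Apple", "3"]])
```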

	### Structure Analysis
	The application analyzes:
	- Headings: Count by level (H1-H6)
	- Lists: Ordered vs unordered list items
	- Tables: Number of tables detected
	- Paragraphs: Regular text paragraphs
	- Formatting: Bold and italic text occurrences
	- Statistics: Word count, character count, total lines

	## Installation

	### Local Development
	```bash
	# Clone the repository
	git clone https://huggingface.co/spaces/YOUR-USERNAME/document-to-markdown-converter
	cd document-to-markdown-converter

	# Install dependencies
	pip install -r requirements.txt

	# Run the application
	python app.py
	```

	### Dependencies
	- `gradio>=4.0.0` - Web interface framework
	- `python-docx>=1.1.0` - Word document processing
	- `PyMuPDF>=1.23.0` - PDF processing library

	## API

	### Core Function
	```python
	from typing import Any, Dict


	def extract_document_to_markdown(file_path: str) -> Dict[str, Any]:
	    """
	    Extract document content and convert it to Markdown format.

	    Args:
	        file_path: Path to a PDF or DOCX file.

	    Returns:
	        Dictionary containing:
	        - success: Boolean indicating success
	        - markdown: Converted Markdown content
	        - structure: Document structure analysis
	        - file_info: File metadata (name, type, size)
	        - preview: Short preview of content
	        - error: Error message if processing failed
	    """
	    ...  # Body omitted; only the interface is documented here.
	```
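A caller would typically check `success` before reading the other keys. The sketch below shows one way to do that. `render_result` is a hypothetical helper, and the sample dictionaries are hand-written stand-ins for real return values.

```python
from typing import Any, Dict


def render_result(result: Dict[str, Any], preview_only: bool = False) -> str:
    """Pick the text to display from a result dictionary shaped as documented above."""
    if not result.get("success"):
        return f"Conversion failed: {result.get('error') or 'unknown error'}"
    return result["preview"] if preview_only else result["markdown"]


# Hand-written examples of a successful and a failed result.
ok = {"success": True, "markdown": "# Title\n\nBody", "preview": "# Title", "error": None}
failed = {"success": False, "error": "Unsupported file type"}

full_text = render_result(ok)
preview_text = render_result(ok, preview_only=True)
error_text = render_result(failed)
```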

	### Structure Analysis Output
	```json
	{
	  "headings": {"h1": 2, "h2": 5, "h3": 8, "h4": 0, "h5": 0, "h6": 0},
	  "lists": {"ordered": 3, "unordered": 7},
	  "tables": 2,
	  "paragraphs": 45,
	  "bold_text": 12,
	  "italic_text": 8,
	  "total_lines": 120,
	  "word_count": 2500,
	  "character_count": 15000
	}
	```
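One plausible way to produce a dictionary with these keys is to scan the converted Markdown line by line. The function below is a sketch under that assumption: `analyze_structure` and its regular expressions are illustrative, and the real application may count differently (for example, by inspecting the source document rather than the Markdown).

```python
import re


def analyze_structure(markdown: str) -> dict:
    """Count structural elements in Markdown text (illustrative heuristics only)."""
    stats = {
        "headings": {f"h{n}": 0 for n in range(1, 7)},
        "lists": {"ordered": 0, "unordered": 0},
        "tables": 0,
        "paragraphs": 0,
        "bold_text": len(re.findall(r"\*\*[^*\n]+\*\*", markdown)),
        "italic_text": len(re.findall(r"(?<!\*)\*[^*\n]+\*(?!\*)", markdown)),
        "total_lines": len(markdown.splitlines()),
        "word_count": len(markdown.split()),
        "character_count": len(markdown),
    }
    in_table = False
    for line in markdown.splitlines():
        stripped = line.strip()
        is_table_row = stripped.startswith("|")
        # A table is counted once, at the first row of a run of "|" lines.
        if is_table_row and not in_table:
            stats["tables"] += 1
        in_table = is_table_row

        heading = re.match(r"(#{1,6})\s", stripped)
        if heading:
            stats["headings"][f"h{len(heading.group(1))}"] += 1
        elif re.match(r"\d+\.\s", stripped):
            stats["lists"]["ordered"] += 1
        elif re.match(r"[-*+]\s", stripped):
            stats["lists"]["unordered"] += 1
        elif stripped and not is_table_row:
            stats["paragraphs"] += 1
    return stats


sample = (
    "# Title\n\n"
    "Some **bold** and *italic* text.\n\n"
    "- one\n- two\n1. first\n\n"
    "| a | b |\n| --- | --- |\n| 1 | 2 |\n"
)
stats = analyze_structure(sample)
```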

	## Examples

	### Converting a PDF
	1. Upload a PDF file
	2. The application will:
	   - Extract text from each page
	   - Detect headings based on font size
	   - Preserve bold/italic formatting
	   - Convert to clean Markdown

	### Converting a DOCX
	1. Upload a Word document
	2. The application will:
	   - Parse document styles
	   - Convert headings based on style names
	   - Extract and format tables
	   - Maintain list structures

	## Limitations

	- OCR: Does not perform OCR on image-based PDFs
	- Complex Layouts: May not perfectly preserve complex document layouts
	- Images: Does not extract or convert embedded images
	- Fonts: PDF font analysis is limited to font size and bold/italic flags

	## Contributing

	1. Fork the repository
	2. Create a feature branch
	3. Make your changes
	4. Test thoroughly
	5. Submit a pull request

	## License

	MIT License - see LICENSE file for details.

	## Support

	For issues and feature requests, please use the Community tab or create an issue on GitHub.

	---

	Built with ❤️ using Gradio, python-docx, and PyMuPDF