---
title: Doc2Page - Document to Webpage Converter
emoji: πŸ„
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.47.2
app_file: app.py
pinned: false
license: apache-2.0
short_description: Convert docs to webpages using PaddleOCR and ERNIE
---
# 📄➡️🌐 Doc2Page - Document to Webpage Converter
Convert your PDF documents or images into beautiful, responsive HTML webpages!
## ✨ Features
- 📖 **Smart OCR**: Extract text from PDFs and images using PaddleOCR
- 🤖 **AI Enhancement**: Transform content into well-structured HTML using ERNIE
- 🎨 **Beautiful Output**: Generate responsive, styled webpages with modern CSS
- 🚀 **Easy Deployment**: Optional one-click deployment to GitHub Pages
- 📱 **Mobile Friendly**: Responsive design that works on all devices
## 🔧 How It Works
1. **Upload**: Drop your PDF or image file
2. **Extract**: PaddleOCR extracts text and structure
3. **Transform**: ERNIE converts the extracted content into styled HTML
4. **Deploy**: Optionally publish to GitHub Pages
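
The four steps above can be pictured as a small pipeline. The sketch below is illustrative only and is not taken from `app.py`: `convert`, `ConversionResult`, and the `extract`, `transform`, and `deploy` callables are hypothetical names standing in for the PaddleOCR, ERNIE, and GitHub Pages steps.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ConversionResult:
    html: str                  # the generated webpage
    url: Optional[str] = None  # set only when the page was deployed


def convert(
    path: str,
    extract: Callable[[str], str],
    transform: Callable[[str], str],
    deploy: Optional[Callable[[str], str]] = None,
) -> ConversionResult:
    """Run Extract -> Transform -> (optional) Deploy for one uploaded file.

    All three callables are placeholders (assumptions), not the app's real API.
    """
    text = extract(path)        # step 2: OCR returns text and structure
    page = transform(text)      # step 3: text becomes an HTML page
    url = deploy(page) if deploy is not None else None  # step 4 is optional
    return ConversionResult(html=page, url=url)
```

Passing the steps in as callables keeps the flow easy to test with stand-ins, for example `convert("scan.pdf", extract=lambda p: "Hello", transform=lambda t: f"<p>{t}</p>")`.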
## 📁 Supported Formats
- **PDFs**: `.pdf`
- **Images**: `.png`, `.jpg`, `.jpeg`, `.bmp`, `.tiff`
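
If you want to pre-check files before uploading them, the list above translates into a simple extension test. This is a sketch, not the app's own validation: `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical names, and treating extensions as case-insensitive is an assumption.

```python
from pathlib import Path

# Extensions copied from the list above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tiff"}


def is_supported(filename: str) -> bool:
    """Return True if the file's final extension is in the supported list.

    Case-insensitive matching is an assumption, not documented behavior.
    """
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

Only the final suffix is checked, so `report.PDF` passes while `notes.docx` and `archive.tar.gz` do not.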
## 🚀 Quick Start
1. Upload a document using the file picker
2. Click "Convert to Webpage"
3. Preview your generated webpage
4. Download the HTML file
5. Optionally deploy to GitHub Pages
## ⚙️ Configuration
**Setup using a `.env` file:**
1. Copy the example environment file:
```bash
cp .env.example .env
```
2. Edit the `.env` file with your credentials:
```bash
# Required API Configuration for PP-StructureV3
API_URL=your_pp_structurev3_api_url
API_TOKEN=your_api_token
# Optional ERNIE API Configuration for enhanced HTML generation
ERNIE_CLIENT_ID=your_client_id_here
ERNIE_CLIENT_SECRET=your_client_secret_here
```
**Note:** The `.env` file is loaded automatically when the application starts. The ERNIE credentials are optional: without them, the app falls back to its built-in HTML generator.
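
The required and optional settings can be read roughly as follows. This is a minimal sketch rather than the code in `app.py`: `Settings`, `load_settings`, and `use_ernie` are hypothetical names, raising an error for missing required values is an assumption, and the variables are assumed to already be in the environment once the `.env` file has been loaded.

```python
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    api_url: str
    api_token: str
    ernie_client_id: Optional[str]
    ernie_client_secret: Optional[str]

    @property
    def use_ernie(self) -> bool:
        # ERNIE is used only when both optional credentials are present.
        return bool(self.ernie_client_id and self.ernie_client_secret)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables named in .env from a mapping (os.environ by default)."""
    missing = [key for key in ("API_URL", "API_TOKEN") if not env.get(key)]
    if missing:
        # Assumption: missing required values are treated as a hard error.
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Settings(
        api_url=env["API_URL"],
        api_token=env["API_TOKEN"],
        ernie_client_id=env.get("ERNIE_CLIENT_ID") or None,
        ernie_client_secret=env.get("ERNIE_CLIENT_SECRET") or None,
    )
```

With only `API_URL` and `API_TOKEN` set, `load_settings().use_ernie` is `False`, which corresponds to the fallback path described in the note.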
## 🏗️ Technical Stack
- **Frontend**: Gradio for the web interface
- **OCR Engine**: PP-StructureV3 API (PaddlePaddle)
- **AI Processing**: ERNIE 4.5-X1.1-Preview (optional)
- **Image Processing**: Pillow
## 📝 Example Use Cases
- Convert research papers to web format
- Digitize scanned documents
- Create web-friendly versions of presentations
- Transform printed materials to responsive websites
- Archive documents in searchable HTML format
## 📄 License
This project is licensed under the Apache 2.0 License.