Spaces: omthakur1/python-doc-convert

omthakur1 committed on Feb 10

Commit

8e7152e

0 Parent(s):

v2.0: Add all PDF operations - PDF to Word, Image OCR, PDF Split/Merge


Files changed (4)

Dockerfile +44 -0
README.md +246 -0
app.py +433 -0
requirements.txt +8 -0

Dockerfile ADDED

	@@ -0,0 +1,44 @@

+# Hugging Face Spaces Dockerfile for Document Conversion
+# This installs LibreOffice for Word to PDF conversion
+FROM python:3.10-slim
+# Install LibreOffice, Tesseract OCR, and required system dependencies
+RUN apt-get update && apt-get install -y \
+    libreoffice \
+    libreoffice-writer \
+    libreoffice-calc \
+    libreoffice-impress \
+    tesseract-ocr \
+    tesseract-ocr-eng \
+    default-jre-headless \
+    libgl1 \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+# Set working directory
+WORKDIR /app
+# Copy requirements and install Python dependencies
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy application code
+COPY app.py .
+# Create temp directory for conversions
+RUN mkdir -p /tmp/conversions
+# Expose port 7860 (Hugging Face Spaces default)
+EXPOSE 7860
+# Set environment variables
+ENV PYTHONUNBUFFERED=1
+ENV PORT=7860
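+# Health check (stdlib urllib: `requests` is not in requirements.txt, and urlopen raises on a 503)
+HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health', timeout=5)"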
+# Run the application
+CMD ["gunicorn", "--bind", "0.0.0.0:7860", "--workers", "2", "--timeout", "60", "app:app"]
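
Docker marks the container unhealthy only when the HEALTHCHECK command exits non-zero, so the probe has to fail on an HTTP 503 as well as on a refused connection. The following is a minimal sketch of such a probe, assuming the `/health` route in `app.py` answers 503 when a dependency is missing. The helper name `is_healthy` is hypothetical and not part of this commit; it uses only the standard library, so it does not depend on anything in `requirements.txt`.

```python
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # HTTPError (for example a 503 from a degraded service) is a
        # URLError subclass; refused connections and timeouts land here too.
        return False
```

A wrapper script could call `sys.exit(0 if is_healthy("http://localhost:7860/health") else 1)` to turn the result into the exit code Docker expects.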

README.md ADDED

	@@ -0,0 +1,246 @@

+---
+title: Document Conversion API
+emoji: 📄
+colorFrom: blue
+colorTo: purple
+sdk: docker
+pinned: false
+license: apache-2.0
+app_port: 7860
+---
+# 📄 Document Conversion API - Word to PDF
+Free, self-hosted document conversion service using LibreOffice. Deploy on Hugging Face Spaces for unlimited FREE usage!
+## ✨ Features
+- **100% FREE** - No API keys, no limits, no credit card
+- **High Quality** - Uses LibreOffice for professional PDF conversion
+- **Fast** - Converts documents in seconds
+- **Self-Hosted** - Complete control and privacy
+- **Multiple Formats** - Supports DOCX, DOC, ODT, RTF, TXT → PDF
+## 🚀 Quick Deploy to Hugging Face Spaces
+### Step 1: Create a New Space
+1. Go to [Hugging Face Spaces](https://huggingface.co/spaces)
+2. Click **"Create new Space"**
+3. Fill in:
+   - **Space name**: `nextools-doc-converter` (or your choice)
+   - **License**: Apache 2.0
+   - **Select the SDK**: **Docker**
+   - **Space hardware**: CPU basic (FREE)
+   - **Visibility**: Public
+### Step 2: Upload Files
+Upload these 3 files to your Space:
+1. `Dockerfile`
+2. `app.py`
+3. `requirements.txt`
+### Step 3: Wait for Build
+- Hugging Face will automatically build your Docker container
+- Takes about 5-10 minutes (first time only)
+- Watch the logs for gunicorn's "Listening at: http://0.0.0.0:7860" line
+### Step 4: Get Your API URL
+Your API will be available at:
+```
+https://YOUR-USERNAME-nextools-doc-converter.hf.space
+```
+### Step 5: Add to Your Vercel .env.local
+```bash
+# Document Conversion API
+DOC_CONVERSION_API_URL=https://YOUR-USERNAME-nextools-doc-converter.hf.space
+```
+## 📡 API Usage
+### Convert Document to PDF
+**Endpoint:** `POST /convert`
+**cURL Example:**
+```bash
+curl -X POST \
+  https://YOUR-USERNAME-nextools-doc-converter.hf.space/convert \
+  -F "file=@document.docx" \
+  --output converted.pdf
+```
+**JavaScript Example:**
+```javascript
+const formData = new FormData();
+formData.append('file', file);
+const response = await fetch('https://YOUR-API-URL/convert', {
+  method: 'POST',
+  body: formData
+});
+const pdfBlob = await response.blob();
+```
+### Health Check
+**Endpoint:** `GET /health`
+```bash
+curl https://YOUR-API-URL/health
+```
+**Response:**
+```json
+{
+  "status": "healthy",
+  "services": {
+    "libreoffice": true,
+    "tesseract": true,
+    "pypdf2": true
+  },
+  "message": "All services running"
+}
+```
+## 🔧 Test Locally (Optional)
+### Using Docker:
+```bash
+# Build
+docker build -t doc-converter .
+# Run
+docker run -p 7860:7860 doc-converter
+# Test
+curl -X POST http://localhost:7860/convert \
+  -F "file=@test.docx" \
+  --output converted.pdf
+```
+### Using Python (requires LibreOffice installed):
+```bash
+# Install LibreOffice first:
+# Ubuntu/Debian: sudo apt install libreoffice
+# Mac: brew install libreoffice
+# Windows: Download from libreoffice.org
+# Install dependencies
+pip install -r requirements.txt
+# Run
+python app.py
+# Test
+curl -X POST http://localhost:7860/convert \
+  -F "file=@test.docx" \
+  --output converted.pdf
+```
+## 📊 Supported Formats
+### Input Formats:
+- `.docx` - Microsoft Word (2007+)
+- `.doc` - Microsoft Word (97-2003)
+- `.odt` - OpenDocument Text
+- `.rtf` - Rich Text Format
+- `.txt` - Plain Text
+### Output Format:
+- `.pdf` - PDF (Portable Document Format)
+## 🎯 Why Hugging Face Spaces?
+1. **FREE Forever** - No billing, no credit card
+2. **No Rate Limits** - Unlimited conversions
+3. **Usually Online** - Free Spaces go to sleep after a period of inactivity and wake on the next request
+4. **Fast** - Global CDN delivery
+5. **Easy Deploy** - Just upload files
+6. **Auto-Scaling** - Handles traffic spikes
+## 🔒 Security & Privacy
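+- Uploads are processed in memory or in a per-request temporary directory
+- The temporary directory is deleted after each conversion
+- Uploaded files are not kept; only generated filenames, sizes, and error messages are logged
+- CORS is open to all origins (`*`) by default; restrict it to your domains in production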
+- SSL/HTTPS encryption
+## 🐛 Troubleshooting
+### Build Failed?
+- Check Dockerfile syntax
+- Ensure all files are uploaded
+- Wait for LibreOffice installation to complete
+### Conversion Failed?
+- Check file format is supported
+- Verify file is not corrupted
+- Check logs in Hugging Face dashboard
+### Timeout?
+- Large files (>10MB) may take longer
+- Consider increasing the gunicorn `--timeout` in the Dockerfile (60s) and the `subprocess.run` timeout in `app.py` (30s)
+- Split large documents
+## 📝 Notes
+- **First conversion** may take 5-10 seconds (LibreOffice startup)
+- **Subsequent conversions** are much faster (~1-2 seconds)
+- **Maximum file size**: 50MB (configurable)
+- **Concurrent requests**: Supported with workers
+## 🔗 Integration with NexTools
+Update your `app/api/pdf-convert/route.ts`:
+```typescript
+// Use Hugging Face API for Word to PDF
+async function wordToPdf(fileBuffer: Buffer) {
+  const apiUrl = process.env.DOC_CONVERSION_API_URL;
+  if (!apiUrl) {
+    throw new Error('DOC_CONVERSION_API_URL not configured');
+  }
+  const formData = new FormData();
+  formData.append('file', new Blob([fileBuffer]), 'document.docx');
+  const response = await fetch(`${apiUrl}/convert`, {
+    method: 'POST',
+    body: formData,
+  });
+  if (!response.ok) {
+    throw new Error('Conversion failed');
+  }
+  const pdfBuffer = Buffer.from(await response.arrayBuffer());
+  return {
+    content: pdfBuffer.toString('base64'),
+    mimeType: 'application/pdf',
+    fileName: 'converted.pdf',
+    fileType: 'PDF',
+    pages: 1, // Calculate if needed
+  };
+}
+```
+## 📞 Support
+- **Issues**: Report on GitHub
+- **Questions**: Ask in Hugging Face discussions
+- **Updates**: Watch this repository
+## 📜 License
+Apache 2.0 License - Free for commercial and personal use
+---
+Made with ❤️ for NexTools - Your All-in-One SaaS Platform
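
The README shows cURL, JavaScript, and TypeScript clients for `POST /convert`. For callers working in Python, a minimal standard-library sketch might look like the following. It assumes the endpoint accepts a multipart upload under the field name `file`, as the cURL example does; the function names `build_multipart` and `convert_to_pdf` and the 90-second default timeout are illustrative choices, not part of this commit.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_to_pdf(api_url: str, src: Path, dest: Path, timeout: float = 90.0) -> Path:
    """POST `src` to {api_url}/convert and write the returned PDF to `dest`."""
    body, content_type = build_multipart("file", src.name, src.read_bytes())
    req = urllib.request.Request(
        f"{api_url.rstrip('/')}/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on the 4xx/5xx JSON error responses.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        dest.write_bytes(resp.read())
    return dest
```

The original filename is passed through because `app.py` decides whether a format is supported from the uploaded file's extension.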

app.py ADDED

	@@ -0,0 +1,433 @@

+"""
+Document Conversion API for Hugging Face Spaces
+Handles ALL PDF operations: Word↔PDF, Image→Text (OCR), PDF Merge/Split
+Self-hosted, FREE forever with unlimited usage!
+"""
+from flask import Flask, request, send_file, jsonify
+from flask_cors import CORS
+import subprocess
+import os
+import tempfile
+import uuid
+from pathlib import Path
+import logging
+from PyPDF2 import PdfReader, PdfWriter, PdfMerger
+import pytesseract
+from PIL import Image
+from pdf2docx import Converter
+from io import BytesIO
+import zipfile
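+app = Flask(__name__)
+# Reject uploads over 50 MB (the limit stated in the README) with HTTP 413
+app.config['MAX_CONTENT_LENGTH'] = 50 * 1024 * 1024
+CORS(app, origins=["*"])  # In production, replace * with your Vercel domain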
+# Configure logging
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+# Ensure LibreOffice is available
+def check_libreoffice():
+    """Check if LibreOffice is installed"""
+    try:
+        result = subprocess.run(
+            ['libreoffice', '--version'],
+            capture_output=True,
+            text=True,
+            timeout=5
+        )
+        logger.info(f"LibreOffice version: {result.stdout.strip()}")
+        return True
+    except Exception as e:
+        logger.error(f"LibreOffice not found: {e}")
+        return False
+# Ensure Tesseract is available
+def check_tesseract():
+    """Check if Tesseract OCR is installed"""
+    try:
+        result = subprocess.run(
+            ['tesseract', '--version'],
+            capture_output=True,
+            text=True,
+            timeout=5
+        )
+        logger.info(f"Tesseract version: {result.stdout.strip()}")
+        return True
+    except Exception as e:
+        logger.error(f"Tesseract not found: {e}")
+        return False
+@app.route('/')
+def root():
+    """Root endpoint with API info"""
+    lo_status = "Available ✅" if check_libreoffice() else "Not Found ❌"
+    tesseract_status = "Available ✅" if check_tesseract() else "Not Found ❌"
+    return {
+        'name': 'Document Conversion API',
+        'version': '2.0.0',
+        'backend': {
+            'LibreOffice': lo_status,
+            'Tesseract OCR': tesseract_status,
+            'PyPDF2': 'Available ✅'
+        },
+        'platform': 'Hugging Face Spaces',
+        'features': '100% FREE forever, Unlimited usage, No API keys needed',
+        'operations': {
+            'word-to-pdf': 'Convert Word/DOCX to PDF',
+            'pdf-to-word': 'Convert PDF to Word/DOCX',
+            'image-to-text': 'Extract text from images (OCR)',
+            'pdf-split': 'Split PDF into separate pages',
+            'pdf-merge': 'Merge multiple PDFs'
+        },
+        'endpoints': {
+            'POST /convert': 'Word to PDF conversion',
+            'POST /pdf-to-word': 'PDF to Word conversion',
+            'POST /image-to-text': 'Image OCR to text',
+            'POST /pdf-split': 'Split PDF pages',
+            'POST /pdf-merge': 'Merge multiple PDFs',
+            'GET /health': 'Health check'
+        }
+    }, 200
+@app.route('/health')
+def health_check():
+    """Health check endpoint"""
+    lo_available = check_libreoffice()
+    tesseract_available = check_tesseract()
+    return {
+        'status': 'healthy' if (lo_available and tesseract_available) else 'degraded',
+        'services': {
+            'libreoffice': lo_available,
+            'tesseract': tesseract_available,
+            'pypdf2': True
+        },
+        'message': 'All services running' if (lo_available and tesseract_available) else 'Some services unavailable'
+    }, 200 if (lo_available and tesseract_available) else 503
+@app.route('/convert', methods=['POST'])
+def word_to_pdf():
+    """Convert Word/Document to PDF using LibreOffice"""
+    if 'file' not in request.files:
+        return jsonify({'error': 'No file provided'}), 400
+    file = request.files['file']
+    if file.filename == '':
+        return jsonify({'error': 'Empty filename'}), 400
+    # Get file extension
+    file_ext = Path(file.filename).suffix.lower()
+    supported_exts = ['.docx', '.doc', '.odt', '.rtf', '.txt']
+    if file_ext not in supported_exts:
+        return jsonify({
+            'error': f'Unsupported file format: {file_ext}',
+            'supported': supported_exts
+        }), 400
+    # Create unique temporary directory
+    temp_dir = tempfile.mkdtemp()
+    unique_id = str(uuid.uuid4())
+    try:
+        input_filename = f"input_{unique_id}{file_ext}"
+        input_path = os.path.join(temp_dir, input_filename)
+        file.save(input_path)
+        logger.info(f"Converting {input_filename} to PDF...")
+        # Convert using LibreOffice
+        cmd = [
+            'libreoffice',
+            '--headless',
+            '--convert-to', 'pdf',
+            '--outdir', temp_dir,
+            input_path
+        ]
+        result = subprocess.run(
+            cmd,
+            capture_output=True,
+            text=True,
+            timeout=30,
+            cwd=temp_dir
+        )
+        if result.returncode != 0:
+            logger.error(f"LibreOffice error: {result.stderr}")
+            return jsonify({
+                'error': 'Conversion failed',
+                'details': result.stderr
+            }), 500
+        # Find output PDF
+        output_filename = input_filename.rsplit('.', 1)[0] + '.pdf'
+        output_path = os.path.join(temp_dir, output_filename)
+        if not os.path.exists(output_path):
+            return jsonify({
+                'error': 'PDF file not created',
+                'details': 'LibreOffice did not produce output file'
+            }), 500
+        file_size = os.path.getsize(output_path)
+        logger.info(f"Conversion successful! Output: {output_filename} ({file_size} bytes)")
+        return send_file(
+            output_path,
+            mimetype='application/pdf',
+            as_attachment=True,
+            download_name='converted.pdf'
+        )
+    except subprocess.TimeoutExpired:
+        logger.error("Conversion timeout")
+        return jsonify({'error': 'Conversion timeout (>30s)'}), 504
+    except Exception as e:
+        logger.error(f"Conversion error: {str(e)}")
+        return jsonify({
+            'error': 'Conversion failed',
+            'details': str(e)
+        }), 500
+    finally:
+        try:
+            import shutil
+            shutil.rmtree(temp_dir)
+        except Exception as e:
+            logger.warning(f"Cleanup warning: {e}")
+@app.route('/pdf-to-word', methods=['POST'])
+def pdf_to_word():
+    """Convert PDF to Word/DOCX using pdf2docx"""
+    if 'file' not in request.files:
+        return jsonify({'error': 'No file provided'}), 400
+    file = request.files['file']
+    if file.filename == '':
+        return jsonify({'error': 'Empty filename'}), 400
+    temp_dir = tempfile.mkdtemp()
+    try:
+        input_path = os.path.join(temp_dir, 'input.pdf')
+        output_path = os.path.join(temp_dir, 'output.docx')
+        file.save(input_path)
+        logger.info("Converting PDF to DOCX...")
+        # Convert PDF to DOCX; close the converter even if conversion fails
+        cv = Converter(input_path)
+        try:
+            cv.convert(output_path)
+        finally:
+            cv.close()
+        if not os.path.exists(output_path):
+            return jsonify({'error': 'DOCX file not created'}), 500
+        file_size = os.path.getsize(output_path)
+        logger.info(f"PDF to Word conversion successful! ({file_size} bytes)")
+        return send_file(
+            output_path,
+            mimetype='application/vnd.openxmlformats-officedocument.wordprocessingml.document',
+            as_attachment=True,
+            download_name='converted.docx'
+        )
+    except Exception as e:
+        logger.error(f"PDF to Word error: {str(e)}")
+        return jsonify({
+            'error': 'Conversion failed',
+            'details': str(e)
+        }), 500
+    finally:
+        try:
+            import shutil
+            shutil.rmtree(temp_dir)
+        except Exception as e:
+            logger.warning(f"Cleanup warning: {e}")
+@app.route('/image-to-text', methods=['POST'])
+def image_to_text():
+    """Extract text from image using Tesseract OCR"""
+    if 'file' not in request.files:
+        return jsonify({'error': 'No file provided'}), 400
+    file = request.files['file']
+    if file.filename == '':
+        return jsonify({'error': 'Empty filename'}), 400
+    try:
+        # Read image
+        image = Image.open(file.stream)
+        logger.info(f"Extracting text from image ({image.size})...")
+        # Perform OCR
+        text = pytesseract.image_to_string(image)
+        logger.info(f"OCR successful! Extracted {len(text)} characters")
+        # Create text file
+        text_content = f"Extracted Text from {file.filename}\n\n{text.strip()}"
+        # Return as downloadable text file
+        buffer = BytesIO()
+        buffer.write(text_content.encode('utf-8'))
+        buffer.seek(0)
+        return send_file(
+            buffer,
+            mimetype='text/plain',
+            as_attachment=True,
+            download_name='extracted-text.txt'
+        )
+    except Exception as e:
+        logger.error(f"OCR error: {str(e)}")
+        return jsonify({
+            'error': 'Text extraction failed',
+            'details': str(e)
+        }), 500
+@app.route('/pdf-split', methods=['POST'])
+def pdf_split():
+    """Split PDF into separate pages"""
+    if 'file' not in request.files:
+        return jsonify({'error': 'No file provided'}), 400
+    file = request.files['file']
+    if file.filename == '':
+        return jsonify({'error': 'Empty filename'}), 400
+    temp_dir = tempfile.mkdtemp()
+    try:
+        # Read PDF
+        pdf_reader = PdfReader(file.stream)
+        total_pages = len(pdf_reader.pages)
+        logger.info(f"Splitting PDF ({total_pages} pages)...")
+        # Create ZIP file for all pages
+        zip_path = os.path.join(temp_dir, 'split_pages.zip')
+        with zipfile.ZipFile(zip_path, 'w') as zipf:
+            for page_num in range(total_pages):
+                # Create new PDF for each page
+                pdf_writer = PdfWriter()
+                pdf_writer.add_page(pdf_reader.pages[page_num])
+                # Write to buffer
+                page_buffer = BytesIO()
+                pdf_writer.write(page_buffer)
+                page_buffer.seek(0)
+                # Add to ZIP
+                zipf.writestr(f'page_{page_num + 1}.pdf', page_buffer.read())
+        logger.info(f"Split successful! Created {total_pages} page files")
+        return send_file(
+            zip_path,
+            mimetype='application/zip',
+            as_attachment=True,
+            download_name='split_pages.zip'
+        )
+    except Exception as e:
+        logger.error(f"PDF split error: {str(e)}")
+        return jsonify({
+            'error': 'PDF split failed',
+            'details': str(e)
+        }), 500
+    finally:
+        try:
+            import shutil
+            shutil.rmtree(temp_dir)
+        except Exception as e:
+            logger.warning(f"Cleanup warning: {e}")
+@app.route('/pdf-merge', methods=['POST'])
+def pdf_merge():
+    """Merge multiple PDFs into one"""
+    if 'files' not in request.files:
+        return jsonify({'error': 'No files provided'}), 400
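+    # Only uploads that are actually PDFs count toward the 2-file minimum
+    files = [f for f in request.files.getlist('files')
+             if f.filename and f.filename.lower().endswith('.pdf')]
+    if len(files) < 2:
+        return jsonify({'error': 'At least 2 PDF files required'}), 400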
+    temp_dir = tempfile.mkdtemp()
+    try:
+        logger.info(f"Merging {len(files)} PDF files...")
+        # Merge PDFs
+        merger = PdfMerger()
+        for file in files:
+            if file.filename.lower().endswith('.pdf'):
+                merger.append(file.stream)
+        # Write merged PDF
+        output_path = os.path.join(temp_dir, 'merged.pdf')
+        merger.write(output_path)
+        merger.close()
+        file_size = os.path.getsize(output_path)
+        logger.info(f"Merge successful! Output: {file_size} bytes")
+        return send_file(
+            output_path,
+            mimetype='application/pdf',
+            as_attachment=True,
+            download_name='merged.pdf'
+        )
+    except Exception as e:
+        logger.error(f"PDF merge error: {str(e)}")
+        return jsonify({
+            'error': 'PDF merge failed',
+            'details': str(e)
+        }), 500
+    finally:
+        try:
+            import shutil
+            shutil.rmtree(temp_dir)
+        except Exception as e:
+            logger.warning(f"Cleanup warning: {e}")
+if __name__ == '__main__':
+    logger.info("🚀 Starting Document Conversion API...")
+    # Check dependencies on startup
+    if check_libreoffice():
+        logger.info("✅ LibreOffice is ready!")
+    else:
+        logger.warning("⚠️ LibreOffice not found")
+    if check_tesseract():
+        logger.info("✅ Tesseract OCR is ready!")
+    else:
+        logger.warning("⚠️ Tesseract not found")
+    # Run Flask app
+    port = int(os.environ.get('PORT', 7860))
+    app.run(host='0.0.0.0', port=port, debug=False)
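
Several handlers above return `send_file(path)` and then delete the temporary directory in `finally`. That appears to rely on the platform keeping an already-opened file readable after it is unlinked, which Linux does but some other systems do not. One alternative, sketched here under the assumption that outputs are small enough to hold in memory, is to copy the result into a `BytesIO` before cleanup and hand the buffer to `send_file`, as `image_to_text` already does. The helper name `load_and_cleanup` is hypothetical.

```python
import shutil
from io import BytesIO
from pathlib import Path


def load_and_cleanup(temp_dir: str, filename: str) -> BytesIO:
    """Read temp_dir/filename into memory, then remove temp_dir entirely."""
    try:
        data = Path(temp_dir, filename).read_bytes()
    finally:
        # Runs whether or not the read succeeded, so no temp files linger.
        shutil.rmtree(temp_dir, ignore_errors=True)
    return BytesIO(data)
```

A handler could then end with `return send_file(load_and_cleanup(temp_dir, 'merged.pdf'), mimetype='application/pdf', as_attachment=True, download_name='merged.pdf')` and drop its own `finally` cleanup.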

requirements.txt ADDED

	@@ -0,0 +1,8 @@

+Flask==3.0.0
+flask-cors==4.0.0
+Werkzeug==3.0.1
+gunicorn==21.2.0
+PyPDF2==3.0.1
+pytesseract==0.3.10
+Pillow==10.2.0
+pdf2docx==0.5.8