Space: BlueSkyXN/OCRmyPDF-HFS


BlueSkyXN committed on Apr 8, 2025

Commit 91b5bcf (1 parent: 486a4a6)

0.2.0


Files changed (6)

Dockerfile +11 -31
README.md +90 -33
entrypoint.sh +8 -2
main.py +134 -67
requirements.txt +1 -3
test/test.py +43 -0

Dockerfile CHANGED

@@ -1,46 +1,26 @@
-# Use the official Python 3.9 slim image as the base
-FROM python:3.9-slim
-# Set an environment variable to prevent interactive prompts during installation
-ENV DEBIAN_FRONTEND=noninteractive
-# Set the default port
 ENV PORT=8000
-# Update the package list and install system dependencies
-# OCRmyPDF needs these system dependencies even though OCRmyPDF itself is installed via pip
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    ghostscript \
-    tesseract-ocr \
-    tesseract-ocr-eng \
-    tesseract-ocr-chi-sim \
-    unpaper \
-    pngquant \
-    qpdf \
-    liblept5 \
-    libffi-dev \
-    # Build dependencies
-    build-essential \
-    python3-dev \
-    # Clean the apt cache to reduce image size
-    && rm -rf /var/lib/apt/lists/*
-# Set the working directory
 WORKDIR /app
-# Copy the Python dependency file
 COPY requirements.txt .
-# Install Python dependencies
-# --no-cache-dir reduces image size
 RUN pip install --no-cache-dir -r requirements.txt
-# Copy the FastAPI application code and entrypoint script
 COPY main.py .
 COPY entrypoint.sh .
-# Make the entrypoint script executable
 RUN chmod +x /app/entrypoint.sh
 # Expose the port
 EXPOSE 8000

+# Use the official OCRmyPDF Alpine image as the base
+FROM jbarlow83/ocrmypdf-alpine:latest
+# Set environment variables
 ENV PORT=8000
+ENV PYTHONUNBUFFERED=1
+# Install Python dependencies
 WORKDIR /app
 COPY requirements.txt .
 RUN pip install --no-cache-dir -r requirements.txt
+# Copy the application code and startup script
 COPY main.py .
 COPY entrypoint.sh .
+# Make the startup script executable
 RUN chmod +x /app/entrypoint.sh
+# Create the temporary working directory
+RUN mkdir -p /app/temp
+RUN chmod 777 /app/temp
 # Expose the port
 EXPOSE 8000

README.md CHANGED

@@ -1,5 +1,4 @@
 ---
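-# Configuration required by Hugging Face Spaces
 title: OCRmyPDF API Interface # Title shown on the Space page (customizable)
 emoji: 📄 # Emoji used as the Space icon (optional)
 colorFrom: blue # Theme gradient start color (optional)
@@ -9,50 +8,108 @@ app_port: 8000 # The port your FastAPI app listens on inside the container (must match
 pinned: false # Whether to pin this Space on your profile page (optional)
 ---
-# OCRmyPDF API on Hugging Face Spaces
-This Space provides a REST API that uses OCRmyPDF to add an OCR text layer to PDF files. This instance is configured to handle documents containing **English**, **Simplified Chinese**, and **digits**.
-## How to use
-Send a POST request to the `/ocr/` endpoint with the PDF file and the required parameters in the request body.
-**API endpoint:** `/ocr/`
-**Request method:** `POST`
-**Form data parameters (Form Data):**
-* `pdf_file`: (required) The PDF file to process.
-* `language`: (required) The language(s) used for OCR. Allowed values:
-    * `eng` (English only)
-    * `chi_sim` (Simplified Chinese only)
-    * `eng+chi_sim` (English and Simplified Chinese - **default**)
-* `force_ocr`: (optional) `true` or `false`. Force OCR even if the file already appears to contain text? (default: `false`)
-* `deskew`: (optional) `true` or `false`. Correct image skew before OCR? (default: `false`)
-* `optimize`: (optional) `0`, `1`, `2`, or `3`. PDF optimization level (0=none, 1=safe, 2=strong, 3=maximum). (default: `0`, for stability)
-**Success response:**
-* Status code: `200 OK`
-* Content-Type: `application/pdf`
-* Body: the processed PDF file with an OCR text layer.
-**Error responses:**
-* Status codes: `400`, `422`, `500`, `504`
-* Content-Type: `application/json`
-* Body: a JSON object with error details.
-**Other endpoints:**
-* `/`: GET - basic endpoint to check that the API is running.
-* `/supported-languages/`: GET - returns the list of supported language arguments.
-## Usage example (curl)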
-```bash
-curl -X POST \
-  -F "pdf_file=@/path/to/your/local/input.pdf" \
-  -F "language=eng+chi_sim" \
-  -F "deskew=true"

 ---
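 title: OCRmyPDF API Interface # Title shown on the Space page (customizable)
 emoji: 📄 # Emoji used as the Space icon (optional)
 colorFrom: blue # Theme gradient start color (optional)
 pinned: false # Whether to pin this Space on your profile page (optional)
 ---
+# OCRmyPDF API Service
+This project provides a FastAPI-based REST API that runs OCR on PDF files with OCRmyPDF and adds a searchable text layer. The API supports OCR for Chinese and English.
+## Deploying to Hugging Face Spaces
+### Method 1: Deploy directly from a GitHub repository
+1. Log in to your Hugging Face account
+2. Create a new Space:
+   - Click "Create New Space"
+   - Enter a name, for example "ocrmypdf-api"
+   - Select "Docker" as the Space SDK
+   - Choose suitable hardware (recommended: CPU-M or higher, to handle large PDFs)
+   - Enter the GitHub repository URL
+   - Click "Create Space"
+### Method 2: Upload files manually
+1. Create a new Space and select "Docker" as the Space SDK
+2. Upload the following files to the Space:
+   - `Dockerfile`
+   - `requirements.txt`
+   - `main.py`
+   - `entrypoint.sh`
+   - `README.md` (optional)
+3. The Space builds the Docker image and starts the service automatically
+## API Usage
+### Endpoints
+- `GET /` - API root check
+- `GET /health` - Health check; returns OCRmyPDF and Tesseract version information
+- `GET /supported-languages/` - List the supported languages
+- `POST /ocr/` - Process a PDF file
+### Example requests
+Using cURL: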
+```bash
+curl -X POST "https://your-space-name.hf.space/ocr/" \
+  -H "accept: application/json" \
+  -H "Content-Type: multipart/form-data" \
+  -F "pdf_file=@your_file.pdf" \
+  -F "language=eng+chi_sim" \
+  -F "force_ocr=false" \
+  -F "deskew=true" \
+  -F "optimize=1" \
+  --output processed.pdf
+```
+Using Python:
+```python
+import requests
+url = "https://your-space-name.hf.space/ocr/"
+payload = {
+    'language': 'eng+chi_sim',
+    'force_ocr': 'false',
+    'deskew': 'true',
+    'optimize': '1'
+}
+files = {
+    'pdf_file': open('your_file.pdf', 'rb')
+}
+response = requests.post(url, data=payload, files=files)
+# Save the processed PDF
+with open('processed.pdf', 'wb') as f:
+    f.write(response.content)
+```
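+## Parameters
+| Parameter | Type | Default | Description |
+|------|------|--------|------|
+| language | string | "eng+chi_sim" | OCR language; one of "eng" (English), "chi_sim" (Simplified Chinese), "eng+chi_sim" (Chinese and English) |
+| force_ocr | boolean | false | Force OCR on every page, even pages that already contain text |
+| deskew | boolean | false | Automatically straighten skewed pages before OCR |
+| optimize | integer | 0 | PDF optimization level: 0=none, 1=safe, 2=strong, 3=maximum |
+## Resource limits
+- Maximum file size: 200 MB
+- Maximum page count: 1000 pages
+- Processing timeout: 1800 seconds (30 minutes)
+## Performance notes
+- Processing large PDF files can take a long time
+- Higher optimization levels (2-3) significantly increase processing time and resource usage
+- If you hit timeouts, try reducing the file size or lowering the optimization level
+## Technical implementation
+This service is built on:
+- The official OCRmyPDF Docker image
+- The FastAPI framework
+- The Tesseract OCR engine (English and Simplified Chinese)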
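
Review note: the parameter table and resource limits in the new README can be checked on the client before uploading, which avoids waiting on a request the server would reject anyway. The following is a minimal sketch only. `validate_ocr_request` is a hypothetical helper that is not part of this repository, and it assumes the server enforces the limits exactly as the README lists them.

```python
import os

# Values copied from the README; the helper itself is hypothetical.
ALLOWED_LANGUAGES = {"eng", "chi_sim", "eng+chi_sim"}
MAX_FILE_SIZE_MB = 200


def validate_ocr_request(pdf_path, language="eng+chi_sim", optimize=0):
    """Return a list of problems; an empty list means the request looks valid."""
    problems = []
    if not pdf_path.lower().endswith(".pdf"):
        problems.append("file name must end with .pdf")
    if language not in ALLOWED_LANGUAGES:
        problems.append(f"unsupported language: {language}")
    if optimize not in (0, 1, 2, 3):
        problems.append(f"optimize must be 0-3, got {optimize}")
    if os.path.exists(pdf_path):
        size_mb = os.path.getsize(pdf_path) / (1024 * 1024)
        if size_mb > MAX_FILE_SIZE_MB:
            problems.append(f"file is {size_mb:.1f} MB, limit is {MAX_FILE_SIZE_MB} MB")
    else:
        problems.append(f"file not found: {pdf_path}")
    return problems
```

A caller would send the POST request only when the returned list is empty, and print the problems otherwise. The page-count limit is not checked here because that would require parsing the PDF.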

entrypoint.sh CHANGED

@@ -1,9 +1,15 @@
-#!/bin/bash
 # Print environment information for debugging
-echo "Starting OCRmyPDF API"
 echo "Environment: PORT=$PORT"
 # Make sure the correct port variable is used
 PORT="${PORT:-8000}"
 echo "Using port: $PORT"

+#!/bin/sh
 # Print environment information for debugging
+echo "Starting OCRmyPDF API Service"
 echo "Environment: PORT=$PORT"
+# Verify that OCRmyPDF is available
+echo "Checking OCRmyPDF installation..."
+ocrmypdf --version
+echo "Checking Tesseract installation..."
+tesseract --version | head -n 1
 # Make sure the correct port variable is used
 PORT="${PORT:-8000}"
 echo "Using port: $PORT"
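
Review note: the startup checks confirm that the `ocrmypdf` and `tesseract` binaries exist, but not that the language packs the API advertises (`eng`, `chi_sim`) are installed in the base image. A sketch of that check in Python follows. It assumes `tesseract --list-langs` prints a one-line header followed by one language code per line, which is the same assumption `main.py` makes when it slices off the first line. `missing_languages` is a hypothetical helper, and the sample output is illustrative only.

```python
def missing_languages(list_langs_output, required=("eng", "chi_sim")):
    """Return the required language codes absent from `tesseract --list-langs` output."""
    lines = list_langs_output.strip().splitlines()
    # Skip the header line, then collect one language code per remaining line.
    installed = {line.strip() for line in lines[1:] if line.strip()}
    return [lang for lang in required if lang not in installed]


# Illustrative output only; the real list depends on the image.
sample = 'List of available languages in "/usr/share/tessdata/" (3):\neng\nosd\nchi_sim\n'
print(missing_languages(sample))            # []
print(missing_languages(sample, ("deu",)))  # ['deu']
```

In the service, the input could come from `subprocess.run(['tesseract', '--list-langs'], capture_output=True, text=True).stdout`, and a non-empty result could be logged as a startup warning.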

main.py CHANGED

@@ -1,14 +1,14 @@
 import fastapi
-from fastapi import FastAPI, File, UploadFile, Form, HTTPException, Response
-from fastapi.responses import FileResponse
 import tempfile
 import os
 import shutil
 import logging
 import uuid
 import PyPDF2
-from typing import Literal # For precise type hints
-import ocrmypdf  # Import OCRmyPDF's Python API directly
 # Configure logging
 logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
@@ -21,13 +21,14 @@ SUPPORTED_LANG_ARGS = {
     "chi_sim": "Simplified Chinese only",
     "eng+chi_sim": "English and Simplified Chinese"
 }
-DEFAULT_LANGUAGE_ARG = "eng+chi_sim" # Default: mixed English and Chinese
-ALLOWED_LANGUAGES = Literal["eng", "chi_sim", "eng+chi_sim"] # Type hint for FastAPI
 # Resource limit configuration
 MAX_FILE_SIZE_MB = 200  # Maximum file size, in MB
 MAX_PAGES = 1000  # Maximum number of pages
 TIMEOUT_SECONDS = 1800  # OCR processing timeout, in seconds
 # ----------------
 # Initialize the FastAPI application
@@ -37,29 +38,58 @@ app = FastAPI(
     version="1.0.0"
 )
 @app.get("/", summary="API Root Check")
 async def read_root():
-    """Provides a simple check that the API is running."""
-    return {"message": f"OCRmyPDF API is running. Use POST /ocr/ to process PDFs. Supported languages: {list(SUPPORTED_LANG_ARGS.keys())}"}
 @app.get("/health", summary="Health Check")
 async def health_check():
-    """Provides a detailed health check of the API and its dependencies."""
     try:
-        # Check the OCRmyPDF version
-        ocrmypdf_version = ocrmypdf.__version__
-        # Check Tesseract
-        import subprocess
         tesseract_result = subprocess.run(['tesseract', '--version'], capture_output=True, text=True, timeout=5)
         tesseract_version = tesseract_result.stdout.split('\n')[0] if tesseract_result.returncode == 0 else "Not available"
         # Return the health status
         return {
             "status": "healthy",
             "ocrmypdf": ocrmypdf_version,
             "tesseract": tesseract_version,
-            "supported_languages": list(SUPPORTED_LANG_ARGS.keys())
         }
     except Exception as e:
         logger.error(f"Health check failed: {str(e)}")
@@ -70,12 +100,12 @@ async def health_check():
 @app.get("/supported-languages/", summary="List Supported Languages")
 async def get_supported_languages():
-    """Returns a dictionary of supported language arguments and descriptions."""
     return SUPPORTED_LANG_ARGS
 @app.post("/ocr/",
           summary="Perform OCR on PDF",
-          response_class=FileResponse, # Return a file by default on success
           responses={
               200: {
                   "content": {"application/pdf": {}},
@@ -91,13 +121,13 @@ async def run_ocr_on_pdf(
     pdf_file: UploadFile = File(..., description="The PDF file to be processed."),
     force_ocr: bool = Form(False, description="Force OCR even if text seems present?"),
     deskew: bool = Form(False, description="Deskew the image before OCR?"),
-    optimize: int = Form(0, description="PDF optimization level (0=None, 1=Safe, 2=Strong, 3=Max) - 0 recommended for stability in Spaces")
 ):
     """
-    Accepts a PDF file, performs OCR using the specified language(s),
-    and returns the processed PDF file.
     """
-    logger.info(f"Received request for language: {language}, force_ocr: {force_ocr}, deskew: {deskew}, optimize: {optimize}")
     # Basic file validation
     if not pdf_file.filename.lower().endswith(".pdf"):
@@ -114,10 +144,13 @@ async def run_ocr_on_pdf(
         )
     # Create a unique temporary working directory
-    temp_dir = tempfile.mkdtemp()
     # Define the input and output file paths inside the temporary directory
-    input_filename = f"input_{uuid.uuid4()}.pdf"
-    output_filename = f"output_{uuid.uuid4()}.pdf"
     input_path = os.path.join(temp_dir, input_filename)
     output_path = os.path.join(temp_dir, output_filename)
@@ -150,57 +183,84 @@ async def run_ocr_on_pdf(
             logger.error(f"Error checking PDF pages: {str(e)}")
             # Continue processing; do not abort the request
-        # Process the PDF with the OCRmyPDF Python API
-        logger.info("Starting OCR processing using OCRmyPDF Python API")
-        try:
-            ocrmypdf.ocr(
-                input_file=input_path,
-                output_file=output_path,
-                language=language,
-                force_ocr=force_ocr,
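-                skip_text=(not force_ocr),  # Skip pages that already have text unless OCR is forced
-                deskew=deskew,
-                optimize=optimize if optimize >= 0 and optimize <= 3 else 0,
-                jobs=1,  # Use a single worker in resource-constrained environments
-                progress_bar=False  # Disable the progress bar (not needed in a web service)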
-            )
-            logger.info("OCR processing completed successfully")
-        except ocrmypdf.exceptions.PriorOcrFoundError as e:
-            logger.warning(f"Prior OCR found: {str(e)}")
-            # In this case, copy the original file as the output
-            shutil.copy(input_path, output_path)
-            logger.info("Copied original file as it already contains OCR")
-        except ocrmypdf.exceptions.MissingDependencyError as e:
-            logger.error(f"Missing dependency: {str(e)}")
-            raise HTTPException(status_code=500, detail=f"OCR processing failed due to missing dependency: {str(e)}")
-        except ocrmypdf.exceptions.EncryptedPdfError as e:
-            logger.error(f"Encrypted PDF: {str(e)}")
-            raise HTTPException(status_code=400, detail="Cannot process encrypted PDF. Please remove the password protection first.")
-        except ocrmypdf.exceptions.BadArgsError as e:
-            logger.error(f"Bad arguments: {str(e)}")
-            raise HTTPException(status_code=400, detail=f"Invalid processing parameters: {str(e)}")
-        except Exception as e:
-            logger.error(f"OCR processing failed: {str(e)}")
-            raise HTTPException(status_code=500, detail="OCR processing failed. Please try again with different parameters.")
-        # Check that the output file exists
         if not os.path.exists(output_path):
-            error_message = "OCR processing seemed successful but output file was not found."
             logger.error(error_message)
             raise HTTPException(status_code=500, detail=error_message)
-        # OCR succeeded; prepare to return the file
         logger.info(f"OCR successful. Output file generated at '{output_path}'")
         # Build a friendly download file name
         download_filename = f"ocr_{pdf_file.filename}" if pdf_file.filename else "processed_document.pdf"
         # Return the processed file
         return FileResponse(
             path=output_path,
             media_type='application/pdf',
-            filename=download_filename
         )
     except HTTPException as http_exc:
         # Re-raise known HTTP exceptions
         raise http_exc
@@ -209,16 +269,23 @@ async def run_ocr_on_pdf(
         logger.error(f"An unexpected error occurred during OCR processing for file '{pdf_file.filename}': {e}", exc_info=True)
         raise HTTPException(status_code=500, detail=f"An unexpected server error occurred. Please try again later.")
     finally:
-        # Clean up the temporary directory and its contents whether or not processing succeeded
-        if os.path.exists(temp_dir):
-            logger.info(f"Cleaning up temporary directory: {temp_dir}")
-            try:
-                shutil.rmtree(temp_dir)
-                logger.info("Temporary directory cleaned up successfully.")
-            except Exception as cleanup_error:
-                 logger.error(f"Error cleaning up temporary directory {temp_dir}: {cleanup_error}", exc_info=True)
         # Make sure the uploaded file handle is closed
         await pdf_file.close()
 async def get_upload_file_size(upload_file: UploadFile) -> int:
     """Return the size of the uploaded file in bytes."""

 import fastapi
+from fastapi import FastAPI, File, UploadFile, Form, HTTPException, Response, BackgroundTasks
+from fastapi.responses import FileResponse, JSONResponse
+import subprocess
 import tempfile
 import os
 import shutil
 import logging
 import uuid
 import PyPDF2
+from typing import Literal, Optional, List
 # Configure logging
 logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
     "chi_sim": "Simplified Chinese only",
     "eng+chi_sim": "English and Simplified Chinese"
 }
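+DEFAULT_LANGUAGE_ARG = "eng+chi_sim"  # Default: mixed English and Chinese
+ALLOWED_LANGUAGES = Literal["eng", "chi_sim", "eng+chi_sim"]  # Type hint for FastAPI
 # Resource limit configuration
 MAX_FILE_SIZE_MB = 200  # Maximum file size, in MB
 MAX_PAGES = 1000  # Maximum number of pages
 TIMEOUT_SECONDS = 1800  # OCR processing timeout, in seconds
+TEMP_DIR = "/app/temp"  # Directory for temporary files
 # ----------------
 # Initialize the FastAPI application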
     version="1.0.0"
 )
+# Make sure the temporary directory exists
+os.makedirs(TEMP_DIR, exist_ok=True)
 @app.get("/", summary="API Root Check")
 async def read_root():
+    """Provide a simple check that the API is available."""
+    return {
+        "status": "running",
+        "service": "OCRmyPDF API",
+        "endpoints": {
+            "POST /ocr/": "Run OCR on a PDF file",
+            "GET /health": "Health check",
+            "GET /supported-languages/": "List supported languages"
+        },
+        "supported_languages": list(SUPPORTED_LANG_ARGS.keys())
+    }
 @app.get("/health", summary="Health Check")
 async def health_check():
+    """Provide a detailed health check of the API and its dependencies."""
     try:
+        # Check that OCRmyPDF is available
+        result = subprocess.run(['ocrmypdf', '--version'], capture_output=True, text=True, timeout=5)
+        ocrmypdf_version = result.stdout.strip() if result.returncode == 0 else "Not available"
+        # Check that Tesseract is available
         tesseract_result = subprocess.run(['tesseract', '--version'], capture_output=True, text=True, timeout=5)
         tesseract_version = tesseract_result.stdout.split('\n')[0] if tesseract_result.returncode == 0 else "Not available"
+        # Check the installed languages
+        langs_result = subprocess.run(['tesseract', '--list-langs'], capture_output=True, text=True, timeout=5)
+        available_langs = langs_result.stdout.strip().split('\n')[1:] if langs_result.returncode == 0 else []
+        # Check free disk space
+        disk_info = os.statvfs(TEMP_DIR)
+        free_space_mb = (disk_info.f_bavail * disk_info.f_frsize) / (1024 * 1024)
         # Return the health status
         return {
             "status": "healthy",
             "ocrmypdf": ocrmypdf_version,
             "tesseract": tesseract_version,
+            "available_languages": available_langs,
+            "disk_space": {
+                "free_mb": round(free_space_mb, 2),
+                "temp_dir": TEMP_DIR
+            },
+            "resource_limits": {
+                "max_file_size_mb": MAX_FILE_SIZE_MB,
+                "max_pages": MAX_PAGES,
+                "timeout_seconds": TIMEOUT_SECONDS
+            }
         }
     except Exception as e:
         logger.error(f"Health check failed: {str(e)}")
 @app.get("/supported-languages/", summary="List Supported Languages")
 async def get_supported_languages():
+    """Return a dictionary of supported language arguments and their descriptions."""
     return SUPPORTED_LANG_ARGS
 @app.post("/ocr/",
           summary="Perform OCR on PDF",
+          response_class=FileResponse,
           responses={
               200: {
                   "content": {"application/pdf": {}},
     pdf_file: UploadFile = File(..., description="The PDF file to be processed."),
     force_ocr: bool = Form(False, description="Force OCR even if text seems present?"),
     deskew: bool = Form(False, description="Deskew the image before OCR?"),
+    optimize: int = Form(0, description="PDF optimization level (0=None, 1=Safe, 2=Strong, 3=Max)"),
+    background_tasks: BackgroundTasks = None
 ):
     """
+    Accept a PDF file, run OCR with the specified language(s), and return the processed PDF.
     """
+    logger.info(f"Received request: filename={pdf_file.filename}, language={language}, force_ocr={force_ocr}, deskew={deskew}, optimize={optimize}")
     # Basic file validation
     if not pdf_file.filename.lower().endswith(".pdf"):
         )
     # Create a unique temporary working directory
+    session_id = str(uuid.uuid4())
+    temp_dir = os.path.join(TEMP_DIR, session_id)
+    os.makedirs(temp_dir, exist_ok=True)
     # Define the input and output file paths inside the temporary directory
+    input_filename = f"input_{session_id}.pdf"
+    output_filename = f"output_{session_id}.pdf"
     input_path = os.path.join(temp_dir, input_filename)
     output_path = os.path.join(temp_dir, output_filename)
             logger.error(f"Error checking PDF pages: {str(e)}")
             # Continue processing; do not abort the request
+        # Build the ocrmypdf command list, using the ocrmypdf preinstalled in the image
+        cmd = [
+            'ocrmypdf',
+            '-l', language,  # Language argument
+            '--jobs', '2',   # Parallel workers; adjust to the available resources
+        ]
+        # Add arguments based on the user's options
+        if force_ocr:
+            cmd.append('--force-ocr')
+        else:
+            # By default, skip pages that already contain text
+            cmd.append('--skip-text')
+        if deskew:
+            cmd.append('--deskew')
+        if 0 <= optimize <= 3:
+            cmd.extend(['--optimize', str(optimize)])
+        # Append the input and output file paths
+        cmd.extend([input_path, output_path])
+        command_str = ' '.join(cmd)
+        logger.info(f"Executing command: {command_str}")
+        # Run the command with a timeout
+        result = subprocess.run(
+            cmd,
+            capture_output=True,
+            text=True,
+            check=False,
+            timeout=TIMEOUT_SECONDS
+        )
+        # Check the command result
+        if result.returncode != 0:
+            # The document already contains OCR text
+            if "PriorOcrFoundError" in result.stderr:
+                logger.info("Document already contains OCR text. Returning original document.")
+                shutil.copy(input_path, output_path)
+            # The PDF is encrypted
+            elif "EncryptedPdfError" in result.stderr:
+                logger.error("PDF is encrypted and cannot be processed")
+                raise HTTPException(status_code=400, detail="Cannot process encrypted PDF. Please remove password protection first.")
+            # Any other error
+            else:
+                error_message = f"OCRmyPDF failed with exit code {result.returncode}."
+                logger.error(f"{error_message}\nStderr: {result.stderr[:1000]}\nStdout: {result.stdout[:1000]}")
+                raise HTTPException(status_code=500, detail="OCR processing failed. Please check your PDF file or try different parameters.")
+        # Verify that the output file exists
         if not os.path.exists(output_path):
+            error_message = "OCR command seemed successful but output file was not found."
             logger.error(error_message)
             raise HTTPException(status_code=500, detail=error_message)
+        # OCR succeeded
         logger.info(f"OCR successful. Output file generated at '{output_path}'")
         # Build a friendly download file name
         download_filename = f"ocr_{pdf_file.filename}" if pdf_file.filename else "processed_document.pdf"
+        # Register a background task to clean up the temporary directory
+        if background_tasks:
+            background_tasks.add_task(cleanup_temp_dir, temp_dir)
         # Return the processed file
         return FileResponse(
             path=output_path,
             media_type='application/pdf',
+            filename=download_filename,
+            background=background_tasks
         )
+    except subprocess.TimeoutExpired:
+        logger.error(f"OCR processing timed out after {TIMEOUT_SECONDS} seconds for file '{pdf_file.filename}'.")
+        raise HTTPException(status_code=504, detail=f"OCR processing took too long and timed out after {TIMEOUT_SECONDS} seconds. Try with a smaller file or disable heavy options.")
     except HTTPException as http_exc:
         # Re-raise known HTTP exceptions
         raise http_exc
         logger.error(f"An unexpected error occurred during OCR processing for file '{pdf_file.filename}': {e}", exc_info=True)
         raise HTTPException(status_code=500, detail=f"An unexpected server error occurred. Please try again later.")
     finally:
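         # Make sure the uploaded file handle is closed
         await pdf_file.close()
+        # The background task removes the temporary directory after the response is sent.
+        # It is only registered on the success path, so clean up here in every other case.
+        cleanup_registered = bool(background_tasks and background_tasks.tasks)
+        if not cleanup_registered and os.path.exists(temp_dir):
+            cleanup_temp_dir(temp_dir)
+def cleanup_temp_dir(temp_dir: str):
+    """Helper that removes a temporary directory and its contents."""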
+    try:
+        if os.path.exists(temp_dir):
+            logger.info(f"Cleaning up temporary directory: {temp_dir}")
+            shutil.rmtree(temp_dir)
+            logger.info("Temporary directory cleaned up successfully.")
+    except Exception as cleanup_error:
+        logger.error(f"Error cleaning up temporary directory {temp_dir}: {cleanup_error}", exc_info=True)
 async def get_upload_file_size(upload_file: UploadFile) -> int:
     """Return the size of the uploaded file in bytes."""
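
Review note: the new code decides how to respond by searching stderr for exception class names, which depends on the exact log text the CLI prints. One possible alternative is to branch on the process exit code. The sketch below assumes the exit codes documented for the OCRmyPDF CLI (6 when a page already has text, 8 for an encrypted input); those values should be verified against the version in the base image. `build_ocrmypdf_cmd` and `classify_exit_code` are hypothetical helpers, not part of this commit.

```python
# Assumed OCRmyPDF exit codes; verify against the installed version.
EXIT_ALREADY_DONE_OCR = 6
EXIT_ENCRYPTED_PDF = 8


def build_ocrmypdf_cmd(input_path, output_path, language="eng+chi_sim",
                       force_ocr=False, deskew=False, optimize=0, jobs=2):
    """Assemble the same kind of argument list main.py passes to subprocess.run."""
    cmd = ["ocrmypdf", "-l", language, "--jobs", str(jobs)]
    cmd.append("--force-ocr" if force_ocr else "--skip-text")
    if deskew:
        cmd.append("--deskew")
    if 0 <= optimize <= 3:
        cmd.extend(["--optimize", str(optimize)])
    cmd.extend([input_path, output_path])
    return cmd


def classify_exit_code(returncode):
    """Map an exit code to (HTTP status, action) for the /ocr/ handler."""
    if returncode == 0:
        return 200, "return_output"
    if returncode == EXIT_ALREADY_DONE_OCR:
        return 200, "return_original"
    if returncode == EXIT_ENCRYPTED_PDF:
        return 400, "reject_encrypted"
    return 500, "fail"


print(build_ocrmypdf_cmd("in.pdf", "out.pdf", deskew=True))
```

Keeping the command builder as a pure function also makes the flag logic testable without running OCR, and the handler would only need to act on the `(status, action)` pair.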

requirements.txt CHANGED

@@ -1,6 +1,4 @@
 fastapi==0.95.0
 uvicorn[standard]==0.22.0
 python-multipart==0.0.6
-PyPDF2==3.0.1
-# Install OCRmyPDF directly via pip
-ocrmypdf==15.4.3

 fastapi==0.95.0
 uvicorn[standard]==0.22.0
 python-multipart==0.0.6
+PyPDF2==3.0.1

test/test.py ADDED

@@ -0,0 +1,43 @@

+import requests
+import os
+import time
+# API endpoint
+api_url = "https://blueskyxn-ocrmypdf-hfs.hf.space/ocr/"
+pdf_path = r"F:\Download\20250401-113339.pdf"
+output_path = r"F:\Download\ocr_result_python.pdf"
+# Prepare the file and parameters
+files = {"pdf_file": open(pdf_path, "rb")}
+data = {
+    "language": "eng+chi_sim",
+    "deskew": "true",
+    "optimize": "1"
+}
+print(f"Processing file: {pdf_path}")
+print(f"File size: {os.path.getsize(pdf_path)/1024/1024:.2f} MB")
+start_time = time.time()
+try:
+    # Send the request
+    print("Sending request to the OCR API...")
+    response = requests.post(api_url, files=files, data=data)
+    # Handle the response
+    if response.status_code == 200:
+        # Save the processed PDF
+        with open(output_path, "wb") as f:
+            f.write(response.content)
+        print(f"PDF processed successfully! Elapsed: {time.time() - start_time:.2f} s")
+        print(f"Result saved to: {output_path}")
+    else:
+        print(f"Processing failed! Status code: {response.status_code}")
+        try:
+            error_details = response.json()
+            print(f"Error details: {error_details}")
+        except ValueError:
+            print(f"Response body: {response.text[:500]}...")
+finally:
+    # Make sure the file is closed
+    files["pdf_file"].close()