github-actions[bot] committed on
Commit 5778306 · 0 Parent(s):

Sync from https://github.com/ryanlinjui/menu-text-detection
.checkpoints/.gitkeep ADDED
File without changes
.env.example ADDED
@@ -0,0 +1,3 @@
+ HUGGINGFACE_TOKEN="HUGGINGFACE_TOKEN"
+ GEMINI_API_TOKEN="GEMINI_API_TOKEN"
+ OPENAI_API_TOKEN="OPENAI_API_TOKEN"
.github/workflows/sync.yml ADDED
@@ -0,0 +1,25 @@
+ name: Sync to Hugging Face Spaces
+
+ on:
+   push:
+     branches:
+       - main
+ jobs:
+   sync:
+     name: Sync
+     runs-on: ubuntu-latest
+     steps:
+       - name: Checkout Repository
+         uses: actions/checkout@v4
+
+       - name: Remove bad files
+         run: rm -rf examples assets
+
+       - name: Sync to Hugging Face Spaces
+         uses: JacobLinCool/huggingface-sync@v1
+         with:
+           github: ${{ secrets.GITHUB_TOKEN }}
+           user: ryanlinjui # Hugging Face username or organization name
+           space: menu-text-detection # Hugging Face space name
+           token: ${{ secrets.HF_TOKEN }} # Hugging Face token
+           python_version: 3.11 # Python version
.gitignore ADDED
@@ -0,0 +1,24 @@
+ # mac
+ .DS_Store
+
+ # cache
+ __pycache__
+
+ # datasets
+ datasets
+
+ # papers
+ docs/papers
+
+ # uv
+ .venv
+
+ # gradio
+ .gradio
+
+ # env
+ .env
+
+ # checkpoint
+ .checkpoints/*
+ !.checkpoints/.gitkeep
.python-version ADDED
@@ -0,0 +1 @@
+ 3.11
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 RyanLin
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,65 @@
+ ---
+ title: menu text detection
+ emoji: 🦄
+ colorFrom: indigo
+ colorTo: pink
+ sdk: gradio
+ python_version: 3.11
+ short_description: Extract structured menu information from images into JSON...
+ tags: ["document-understanding", "donut", "fine-tuning", "image-text-to-text", "transformer"]
+ ---
+
+ # Menu Text Detection System
+
+ Extract structured menu information from images into JSON using a fine-tuned E2E model or an LLM.
+
+ [![Gradio Space Demo](https://img.shields.io/badge/GradioSpace-Demo-important?logo=huggingface)](https://huggingface.co/spaces/ryanlinjui/menu-text-detection)
+ [![Hugging Face Models & Datasets](https://img.shields.io/badge/HuggingFace-Models_&_Datasets-important?logo=huggingface)](https://huggingface.co/collections/ryanlinjui/menu-text-detection-670ccf527626bb004bbfb39b)
+
+ https://github.com/user-attachments/assets/80e5d54c-f2c8-4593-ad9b-499e5b71d8f6
+
+ ## 🚀 Features
+ ### Overview
+ Currently extracts the following information from menu images:
+
+ - **Restaurant Name**
+ - **Business Hours**
+ - **Address**
+ - **Phone Number**
+ - **Dish Information**
+   - Name
+   - Price
+
+ > For the JSON schema, see the [tools directory](./tools); an illustrative example follows below.
+
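To make that schema concrete, here is an illustrative output (field values invented for the example; the authoritative definitions live in `tools/schema_gemini.json` and `tools/schema_openai.json`):

```python
# Illustrative only: field values are made up; see tools/ for the real schema.
example_output = {
    "restaurant": "Ryan's Diner",       # "" when the name is not on the menu
    "address": "123 Example Road",      # "" when not available
    "phone": "02-1234-5678",            # "" when not available
    "business_hours": "11:00-21:00",    # "" when not available
    "dishes": [
        {"name": "Beef Noodles", "price": "180"},
        {"name": "Iced Tea", "price": "-1"},  # -1 when the price is missing
    ],
}
```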
+ ### Supported Methods to Extract Menu Information
+ #### Fine-tuned E2E Model and Training Metrics
+ - [**Donut (Document Parsing Task)**](https://huggingface.co/ryanlinjui/donut-base-finetuned-menu) - Base model by [Clova AI (ECCV '22)](https://github.com/clovaai/donut)
+
+ #### LLM Function Calling
+ - Google Gemini API
+ - OpenAI GPT API
+
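Both providers sit behind the same `call` classmethod in `menu/llm/`, so switching models is a one-line change. A minimal sketch, assuming a valid API token and the repo's example image:

```python
from PIL import Image
from menu.llm import GeminiAPI, OpenAIAPI

image = Image.open("examples/menu-hd.jpg")

# Gemini and OpenAI share the same interface: call(images, model, token).
menu = GeminiAPI.call([image], "gemini-2.5-flash", "YOUR_GEMINI_API_TOKEN")
# menu = OpenAIAPI.call([image], "gpt-4o", "YOUR_OPENAI_API_TOKEN")
print(menu)  # dict following the schema in tools/
```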
+ ## 💻 Training / Fine-Tuning
+ ### Setup
+ Use [uv](https://github.com/astral-sh/uv) to set up the development environment:
+
+ ```bash
+ uv sync
+ ```
+
+ > Or use `pip install -r requirements.txt` if you run into any problems.
+
+ ### Training Script (Dataset Collection, Fine-Tuning)
+ Please refer to [`train.ipynb`](./train.ipynb). Use Jupyter Notebook for training:
+
+ ```bash
+ uv run jupyter-notebook
+ ```
+
+ > For VSCode users, please install the Jupyter extension, then select `.venv/bin/python` as your kernel.
+
+ ### Run Demo Locally
+ ```bash
+ uv run python app.py
+ ```
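`app.py` also registers the `handle` function as an API endpoint named `run` via `gr.api`, so the hosted Space can be queried programmatically. A hedged sketch with `gradio_client`; the exact endpoint path and argument handling may vary with the Gradio version:

```python
from gradio_client import Client

# Assumes the public Space and the "run" endpoint registered in app.py.
client = Client("ryanlinjui/menu-text-detection")
result = client.predict(
    images=["https://raw.githubusercontent.com/ryanlinjui/menu-text-detection/main/examples/menu-hd.jpg"],
    model="Donut (Document Parsing Task) Fine-tuned Model",
    api_token="",  # only needed for the LLM models
    api_name="/run",
)
print(result)  # JSON string with the extracted menu information
```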
app.py ADDED
@@ -0,0 +1,327 @@
+ import os
+ import json
+ import base64
+ import requests
+ from io import BytesIO
+ from typing import List
+
+ import gradio as gr
+ from PIL import Image
+ from dotenv import load_dotenv
+ from pillow_heif import register_heif_opener
+
+ from menu.llm import (
+     GeminiAPI,
+     OpenAIAPI
+ )
+ from menu.donut import DonutFinetuned
+
+ donut_finetuned = DonutFinetuned("ryanlinjui/donut-base-finetuned-menu")
+
+ register_heif_opener()
+ load_dotenv(override=True)
+ GEMINI_API_TOKEN = os.getenv("GEMINI_API_TOKEN", "")
+ OPENAI_API_TOKEN = os.getenv("OPENAI_API_TOKEN", "")
+
+ SOURCE_CODE_GH_URL = "https://github.com/ryanlinjui/menu-text-detection"
+ BADGE_URL = "https://img.shields.io/badge/GitHub_Code-Click_Here!!-default?logo=github"
+
+ GITHUB_RAW_URL = "https://raw.githubusercontent.com/ryanlinjui/menu-text-detection/main"
+ EXAMPLE_IMAGE_LIST = [
+     [f"{GITHUB_RAW_URL}/examples/menu-hd.jpg"],
+     [f"{GITHUB_RAW_URL}/examples/menu-vs.jpg"],
+     [f"{GITHUB_RAW_URL}/examples/menu-si.jpg"]
+ ]
+ FINETUNED_MODEL_LIST = [
+     "Donut (Document Parsing Task) Fine-tuned Model"
+ ]
+ LLM_MODEL_LIST = [
+     "gemini-2.5-pro",
+     "gemini-2.5-flash",
+     "gemini-2.0-flash",
+     "gpt-4.1",
+     "gpt-4o",
+     "o4-mini"
+ ]
+ CSS_STYLE = """
+ .image-panel img {
+     max-height: 500px;
+     margin-top: -100px;
+ }
+ .large-text textarea {
+     font-size: 20px !important;
+     height: 600px !important;
+     width: 100% !important;
+ }
+ .control-row {
+     margin-top: -10px !important;
+     margin-bottom: -10px !important;
+     align-items: center !important;
+     justify-content: center !important;
+ }
+ .page-info {
+     text-align: center !important;
+     font-size: 20px !important;
+     display: flex !important;
+     align-items: center !important;
+     justify-content: center !important;
+     height: 100% !important;
+     font-weight: 900 !important;
+     color: #374151; /* Darker gray for clarity */
+ }
+ .page-info p {
+     margin: 0 !important;
+     width: 100% !important;
+     text-align: center !important;
+ }
+ .upload-btn {
+     margin-top: 2px !important;
+     background-color: #e0f2fe !important; /* Light blue background */
+     color: #0369a1 !important; /* Dark blue text */
+     border: 1px solid #0ea5e9 !important;
+ }
+ .upload-btn:hover {
+     background-color: #bae6fd !important;
+ }
+ .clear-btn {
+     margin-top: 2px !important;
+ }
+ .image-container {
+     height: 650px !important;
+     display: flex;
+     flex-direction: column;
+     border: 1px solid #e5e7eb;
+     border-radius: 8px;
+     padding: 4px;
+ }
+ """
+
+ def handle(images: List[str], model: str, api_token: str) -> str:
+     if not images:
+         raise gr.Error("Please upload an image first.")
+
+     # Convert to PIL Images
+     pil_images = []
+     for img in images:
+         if img.startswith("http://") or img.startswith("https://"):
+             try:
+                 response = requests.get(img)
+                 response.raise_for_status()
+                 pil_images.append(Image.open(BytesIO(response.content)))
+             except Exception as e:
+                 raise gr.Error(f"Failed to load image from URL: {str(e)}")
+         elif img.startswith("data:image/") and ";base64," in img:
+             try:
+                 _, encoded = img.split(";base64,", 1)
+                 data = base64.b64decode(encoded)
+                 pil_images.append(Image.open(BytesIO(data)))
+             except Exception as e:
+                 raise gr.Error(f"Failed to decode Base64 image: {str(e)}")
+         else:
+             pil_images.append(Image.open(img))
+
+     if model == FINETUNED_MODEL_LIST[0]:
+         result = donut_finetuned.predict(pil_images[0])
+
+     elif model in LLM_MODEL_LIST:
+         if len(api_token) < 10:
+             raise gr.Error(f"Please provide a valid token for {model}.")
+         try:
+             if model in LLM_MODEL_LIST[:3]:
+                 result = GeminiAPI.call(pil_images, model, api_token)
+             else:
+                 result = OpenAIAPI.call(pil_images, model, api_token)
+         except Exception as e:
+             raise gr.Error(f"Failed to process with API model {model}: {str(e)}")
+     else:
+         raise gr.Error("Invalid model selection. Please choose a valid model.")
+
+     return json.dumps(result, indent=4, ensure_ascii=False, sort_keys=True)
+
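`handle` accepts local file paths, HTTP(S) URLs, or Base64 data URIs; note that the Donut branch only consumes the first image, while the LLM branch sends every page at once. A quick sanity check, assuming the repo's example image is available locally:

```python
# Hedged sketch: run from the repo root with examples/menu-hd.jpg present.
output = handle(
    images=["examples/menu-hd.jpg"],  # local path; URLs and data URIs also work
    model="Donut (Document Parsing Task) Fine-tuned Model",
    api_token="",  # unused for the fine-tuned model
)
print(output)  # pretty-printed JSON string
```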
+ def UserInterface() -> gr.Blocks:
+     with gr.Blocks(delete_cache=(86400, 86400)) as gradio_interface:
+         gr.HTML(f'<a href="{SOURCE_CODE_GH_URL}"><img src="{BADGE_URL}" alt="GitHub Code"/></a>')
+         gr.Markdown("# Menu Text Detection")
+
+         images_state = gr.State([])
+         current_index_state = gr.State(0)
+
+         with gr.Row():
+             with gr.Column(scale=1, min_width=500):
+                 gr.Markdown("## 📷 Menu Image")
+
+                 with gr.Column(elem_classes="image-container"):
+                     menu_image_display = gr.Image(
+                         label="Input menu image",
+                         type="filepath",
+                         elem_classes="image-panel",
+                         interactive=False,
+                         show_label=True,
+                         height="100%",
+                         width="100%"
+                     )
+                     with gr.Row(elem_classes="control-row"):
+                         prev_btn = gr.Button("◀️ Previous", variant="secondary", scale=1)
+                         with gr.Column(scale=2, min_width=50):
+                             page_info = gr.Markdown("Page 1 / 1", elem_classes="page-info")
+                         next_btn = gr.Button("Next ▶️", variant="secondary", scale=1)
+
+                 with gr.Row():
+                     upload_btn = gr.UploadButton(
+                         "📷 Upload Menu Images",
+                         file_types=["image"],
+                         file_count="multiple",
+                         scale=3,
+                         elem_classes="upload-btn",
+                         variant="primary"
+                     )
+                     clear_btn = gr.Button("🗑️ Remove", variant="stop", scale=1, elem_classes="clear-btn")
+
+                 gr.Markdown("## 🤖 Model Selection")
+                 model_choice_dropdown = gr.Dropdown(
+                     choices=FINETUNED_MODEL_LIST + LLM_MODEL_LIST,
+                     value=FINETUNED_MODEL_LIST[0],
+                     label="Select Text Detection Model"
+                 )
+
+                 api_token_textbox = gr.Textbox(
+                     label="API Token",
+                     placeholder="Enter your API token here...",
+                     type="password",
+                     visible=False
+                 )
+
+                 generate_button = gr.Button("Generate Menu Information", variant="primary")
+                 example_receiver = gr.Image(visible=False, label="Example Preview", type="filepath")
+
+                 examples_component = gr.Examples(
+                     examples=[[img_list[0]] for img_list in EXAMPLE_IMAGE_LIST],
+                     inputs=example_receiver,
+                     label="Example Menu Images"
+                 )
+
+             with gr.Column(scale=1):
+                 gr.Markdown("## 🍽️ Menu Info")
+                 menu_json_textbox = gr.Textbox(
+                     label="Output JSON",
+                     interactive=True,
+                     text_align="left",
+                     elem_classes="large-text"
+                 )
+
+         def update_display(images, index):
+             if not images:
+                 return None, "Page 1 / 1"
+             idx = max(0, min(index, len(images) - 1))
+             return images[idx], f"Page {idx + 1} / {len(images)}"
+
+         def on_upload(new_files, current_images):
+             if current_images is None:
+                 current_images = []
+             if new_files:
+                 new_paths = [f.name for f in new_files]
+                 current_images.extend(new_paths)
+             new_index = len(current_images) - 1
+             img, info = update_display(current_images, new_index)
+             return current_images, new_index, img, info
+
+         upload_btn.upload(
+             fn=on_upload,
+             inputs=[upload_btn, images_state],
+             outputs=[images_state, current_index_state, menu_image_display, page_info]
+         )
+
+         def on_clear(images, index):
+             if not images:
+                 return [], 0, None, "Page 1 / 1"
+
+             new_images = list(images)
+             if 0 <= index < len(new_images):
+                 new_images.pop(index)
+
+             if not new_images:
+                 return [], 0, None, "Page 1 / 1"
+
+             new_index = index
+             if new_index >= len(new_images):
+                 new_index = len(new_images) - 1
+
+             img, info = update_display(new_images, new_index)
+             return new_images, new_index, img, info
+
+         clear_btn.click(
+             fn=on_clear,
+             inputs=[images_state, current_index_state],
+             outputs=[images_state, current_index_state, menu_image_display, page_info]
+         )
+
+         def on_prev(images, index):
+             if not images:
+                 return 0, None, "Page 1 / 1"
+             new_index = max(0, index - 1)
+             img, info = update_display(images, new_index)
+             return new_index, img, info
+
+         def on_next(images, index):
+             if not images:
+                 return 0, None, "Page 1 / 1"
+             new_index = min(len(images) - 1, index + 1)
+             img, info = update_display(images, new_index)
+             return new_index, img, info
+
+         prev_btn.click(on_prev, [images_state, current_index_state], [current_index_state, menu_image_display, page_info])
+         next_btn.click(on_next, [images_state, current_index_state], [current_index_state, menu_image_display, page_info])
+
+         def on_example_click(evt: gr.SelectData):
+             if evt.index is None:
+                 return [], 0, None, "Page 1 / 1"
+
+             # Retrieve the full batch based on the clicked index
+             if 0 <= evt.index < len(EXAMPLE_IMAGE_LIST):
+                 current_images = EXAMPLE_IMAGE_LIST[evt.index]
+             else:
+                 current_images = []
+
+             new_index = 0
+             img, info = update_display(current_images, new_index)
+             return current_images, new_index, img, info
+
+         examples_component.dataset.select(
+             fn=on_example_click,
+             inputs=None,
+             outputs=[images_state, current_index_state, menu_image_display, page_info]
+         )
+
+         def update_token_visibility(choice):
+             if choice in LLM_MODEL_LIST:
+                 current_token = ""
+                 if choice in LLM_MODEL_LIST[:3]:
+                     current_token = GEMINI_API_TOKEN
+                 else:
+                     current_token = OPENAI_API_TOKEN
+                 return gr.update(visible=True, value=current_token)
+             else:
+                 return gr.update(visible=False)
+
+         model_choice_dropdown.change(
+             fn=update_token_visibility,
+             inputs=model_choice_dropdown,
+             outputs=api_token_textbox
+         )
+
+         generate_button.click(
+             fn=handle,
+             inputs=[images_state, model_choice_dropdown, api_token_textbox],
+             outputs=menu_json_textbox
+         )
+
+         gr.api(
+             fn=handle,
+             api_name="run"
+         )
+
+     return gradio_interface
+
+ if __name__ == "__main__":
+     demo = UserInterface()
+     demo.launch(css=CSS_STYLE)
menu/donut.py ADDED
@@ -0,0 +1,472 @@
+ """
+ This file is modified from the HuggingFace transformers tutorial script for fine-tuning Donut on a custom dataset.
+ It was converted from a `.ipynb` notebook into a module for better reusability and maintainability.
+ Reference: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb
+ """
+
+ import re
+ import random
+ from typing import Any, List, Tuple, Dict
+
+ import torch
+ import numpy as np
+ from PIL import Image
+ from tqdm.auto import tqdm
+ from nltk import edit_distance
+ import pytorch_lightning as pl
+ from datasets import DatasetDict
+ from donut import JSONParseEvaluator
+ from huggingface_hub import upload_folder
+ from pillow_heif import register_heif_opener
+ from pytorch_lightning.callbacks import Callback
+ from pytorch_lightning.loggers import TensorBoardLogger
+ from torch.utils.data import (
+     Dataset,
+     DataLoader
+ )
+ from transformers import (
+     DonutProcessor,
+     VisionEncoderDecoderModel,
+     VisionEncoderDecoderConfig
+ )
+
+ TASK_PROMPT_NAME = "<s_menu-text-detection>"
+ register_heif_opener()
+
+ class DonutFinetuned:
+     def __init__(self, pretrained_model_repo_id: str = "ryanlinjui/donut-test"):
+         self.device = (
+             "cuda"
+             if torch.cuda.is_available()
+             else "mps" if torch.backends.mps.is_available() else "cpu"
+         )
+         self.processor = DonutProcessor.from_pretrained(pretrained_model_repo_id)
+         self.model = VisionEncoderDecoderModel.from_pretrained(pretrained_model_repo_id)
+         self.model.eval()
+         self.model.to(self.device)
+         print(f"Using {self.device} device")
+
+     def predict(self, image: Image.Image) -> Dict[str, Any]:
+         # prepare encoder inputs
+         pixel_values = self.processor(image.convert("RGB"), return_tensors="pt").pixel_values
+         pixel_values = pixel_values.to(self.device)
+
+         # prepare decoder inputs
+         decoder_input_ids = self.processor.tokenizer(TASK_PROMPT_NAME, add_special_tokens=False, return_tensors="pt").input_ids
+         decoder_input_ids = decoder_input_ids.to(self.device)
+
+         # autoregressively generate sequence
+         outputs = self.model.generate(
+             pixel_values,
+             decoder_input_ids=decoder_input_ids,
+             max_length=self.model.decoder.config.max_position_embeddings,
+             early_stopping=True,
+             pad_token_id=self.processor.tokenizer.pad_token_id,
+             eos_token_id=self.processor.tokenizer.eos_token_id,
+             use_cache=True,
+             num_beams=1,
+             bad_words_ids=[[self.processor.tokenizer.unk_token_id]],
+             return_dict_in_generate=True
+         )
+
+         # turn into JSON
+         seq = self.processor.batch_decode(outputs.sequences)[0]
+         seq = seq.replace(self.processor.tokenizer.eos_token, "").replace(self.processor.tokenizer.pad_token, "")
+         seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
+         seq = self.processor.token2json(seq)
+         return seq
+
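`predict` can return structured data because Donut emits the JSON as a flat tagged token sequence, which `DonutProcessor.token2json` folds back into a dict. A rough illustration of that last step (the token string is invented for the example):

```python
# Hypothetical decoded sequence after the task-start and EOS tokens are stripped:
seq = "<s_restaurant>Ryan's Diner</s_restaurant><s_dishes><s_name>Latte</s_name><s_price>120</s_price></s_dishes>"
# processor.token2json(seq) would yield roughly:
# {"restaurant": "Ryan's Diner", "dishes": {"name": "Latte", "price": "120"}}
```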
+     def evaluate(self, dataset: Dataset, ground_truth_key: str = "ground_truth") -> Tuple[Dict[str, Any], List[Any]]:
+         output_list = []
+         accs = []
+         ted_accs = []
+         f1_accs = []
+
+         for idx, sample in tqdm(enumerate(dataset), total=len(dataset)):
+             seq = self.predict(sample["image"])
+             ground_truth = sample[ground_truth_key]
+
+             # Original JSON accuracy
+             evaluator = JSONParseEvaluator()
+             score = evaluator.cal_acc(seq, ground_truth)
+             accs.append(score)
+             output_list.append(seq)
+
+             # TED (Tree Edit Distance) Accuracy
+             # Convert predictions and ground truth to string format for comparison
+             pred_str = str(seq) if seq else ""
+             gt_str = str(ground_truth) if ground_truth else ""
+
+             # Calculate normalized edit distance (1 - normalized_edit_distance = accuracy)
+             if len(pred_str) == 0 and len(gt_str) == 0:
+                 ted_acc = 1.0
+             elif len(pred_str) == 0 or len(gt_str) == 0:
+                 ted_acc = 0.0
+             else:
+                 edit_dist = edit_distance(pred_str, gt_str)
+                 max_len = max(len(pred_str), len(gt_str))
+                 ted_acc = 1 - (edit_dist / max_len)
+             ted_accs.append(ted_acc)
+
+             # F1 Score Accuracy (character-level)
+             if len(pred_str) == 0 and len(gt_str) == 0:
+                 f1_acc = 1.0
+             elif len(pred_str) == 0 or len(gt_str) == 0:
+                 f1_acc = 0.0
+             else:
+                 # Character-level precision and recall
+                 pred_chars = set(pred_str)
+                 gt_chars = set(gt_str)
+
+                 if len(pred_chars) == 0:
+                     precision = 0.0
+                 else:
+                     precision = len(pred_chars.intersection(gt_chars)) / len(pred_chars)
+
+                 if len(gt_chars) == 0:
+                     recall = 0.0
+                 else:
+                     recall = len(pred_chars.intersection(gt_chars)) / len(gt_chars)
+
+                 if precision + recall == 0:
+                     f1_acc = 0.0
+                 else:
+                     f1_acc = 2 * (precision * recall) / (precision + recall)
+             f1_accs.append(f1_acc)
+
+         scores = {
+             "accuracies": accs,
+             "mean_accuracy": np.mean(accs),
+             "ted_accuracies": ted_accs,
+             "mean_ted_accuracy": np.mean(ted_accs),
+             "f1_accuracies": f1_accs,
+             "mean_f1_accuracy": np.mean(f1_accs),
+             "length": len(accs)
+         }
+         return scores, output_list
+
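To make the character-level F1 concrete: it compares *sets* of characters, so order and repetition are ignored. A small worked example under that definition:

```python
# pred_str = "abc", gt_str = "abd"; the shared characters are {"a", "b"}.
precision = 2 / 3  # two of the three predicted characters appear in the ground truth
recall = 2 / 3     # two of the three ground-truth characters were predicted
f1 = 2 * (precision * recall) / (precision + recall)  # = 2/3 ≈ 0.667
```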
+ class DonutTrainer:
+     processor = None
+     max_length = 768
+     image_size = [1280, 960]
+     added_tokens = []
+     train_dataloader = None
+     val_dataloader = None
+     huggingface_model_id = None
+
+     class DonutDataset(Dataset):
+         """
+         PyTorch Dataset for Donut. This class takes a HuggingFace Dataset as input.
+
+         Each row consists of an image path (png/jpg/jpeg) and ground-truth data (json/jsonl/txt),
+         and it will be converted into pixel_values (vectorized image) and labels (input_ids of the tokenized string).
+
+         Args:
+             dataset: HuggingFace DatasetDict containing the dataset to be used
+             max_length: the max number of tokens for the target sequences
+             split: whether to load the "train", "validation" or "test" split
+             ignore_id: ignore_index for torch.nn.CrossEntropyLoss
+             task_start_token: the special token to be fed to the decoder to conduct the target task
+             prompt_end_token: the special token at the end of the sequences
+             sort_json_key: whether or not to sort the JSON keys
+         """
+
+         def __init__(
+             self,
+             dataset: DatasetDict,
+             ground_truth_key: str,
+             max_length: int,
+             split: str = "train",
+             ignore_id: int = -100,
+             task_start_token: str = "<s>",
+             prompt_end_token: str = None,
+             sort_json_key: bool = True,
+         ):
+             super().__init__()
+
+             self.dataset = dataset[split]
+             self.ground_truth_key = ground_truth_key
+             self.max_length = max_length
+             self.split = split
+             self.ignore_id = ignore_id
+             self.task_start_token = task_start_token
+             self.prompt_end_token = prompt_end_token if prompt_end_token else task_start_token
+             self.sort_json_key = sort_json_key
+
+             self.dataset_length = len(self.dataset)
+
+             self.gt_token_sequences = []
+             for sample in self.dataset:
+                 ground_truth = sample[self.ground_truth_key]
+                 self.gt_token_sequences.append(
+                     [
+                         self.json2token(
+                             gt_json,
+                             update_special_tokens_for_json_key=self.split == "train",
+                             sort_json_key=self.sort_json_key,
+                         )
+                         + DonutTrainer.processor.tokenizer.eos_token
+                         for gt_json in [ground_truth]  # load json from list of json
+                     ]
+                 )
+
+             self.add_tokens([self.task_start_token, self.prompt_end_token])
+             self.prompt_end_token_id = DonutTrainer.processor.tokenizer.convert_tokens_to_ids(self.prompt_end_token)
+
+         def json2token(self, obj: Any, update_special_tokens_for_json_key: bool = True, sort_json_key: bool = True):
+             """
+             Convert an ordered JSON object into a token sequence
+             """
+             if type(obj) == dict:
+                 if len(obj) == 1 and "text_sequence" in obj:
+                     return obj["text_sequence"]
+                 else:
+                     output = ""
+                     if sort_json_key:
+                         keys = sorted(obj.keys(), reverse=True)
+                     else:
+                         keys = obj.keys()
+                     for k in keys:
+                         if update_special_tokens_for_json_key:
+                             self.add_tokens([fr"<s_{k}>", fr"</s_{k}>"])
+                         output += (
+                             fr"<s_{k}>"
+                             + self.json2token(obj[k], update_special_tokens_for_json_key, sort_json_key)
+                             + fr"</s_{k}>"
+                         )
+                     return output
+             elif type(obj) == list:
+                 return r"<sep/>".join(
+                     [self.json2token(item, update_special_tokens_for_json_key, sort_json_key) for item in obj]
+                 )
+             else:
+                 obj = str(obj)
+                 if f"<{obj}/>" in DonutTrainer.added_tokens:
+                     obj = f"<{obj}/>"  # for categorical special tokens
+                 return obj
+
+ def add_tokens(self, list_of_tokens: List[str]):
249
+ """
250
+ Add special tokens to tokenizer and resize the token embeddings of the decoder
251
+ """
252
+ newly_added_num = DonutTrainer.processor.tokenizer.add_tokens(list_of_tokens)
253
+ if newly_added_num > 0:
254
+ DonutTrainer.model.decoder.resize_token_embeddings(len(DonutTrainer.processor.tokenizer))
255
+ DonutTrainer.added_tokens.extend(list_of_tokens)
256
+
257
+ def __len__(self) -> int:
258
+ return self.dataset_length
259
+
260
+ def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
261
+ """
262
+ Load image from image_path of given dataset_path and convert into input_tensor and labels
263
+ Convert gt data into input_ids (tokenized string)
264
+ Returns:
265
+ input_tensor : preprocessed image
266
+ input_ids : tokenized gt_data
267
+ labels : masked labels (model doesn't need to predict prompt and pad token)
268
+ """
269
+ sample = self.dataset[idx]
270
+
271
+ # inputs
272
+ pixel_values = DonutTrainer.processor(sample["image"], random_padding=self.split == "train", return_tensors="pt").pixel_values
273
+ pixel_values = pixel_values.squeeze()
274
+
275
+ # targets
276
+ target_sequence = random.choice(self.gt_token_sequences[idx]) # can be more than one, e.g., DocVQA Task 1
277
+ input_ids = DonutTrainer.processor.tokenizer(
278
+ target_sequence,
279
+ add_special_tokens=False,
280
+ max_length=self.max_length,
281
+ padding="max_length",
282
+ truncation=True,
283
+ return_tensors="pt",
284
+ )["input_ids"].squeeze(0)
285
+
286
+ labels = input_ids.clone()
287
+ labels[labels == DonutTrainer.processor.tokenizer.pad_token_id] = self.ignore_id # model doesn't need to predict pad token
288
+ # labels[: torch.nonzero(labels == self.prompt_end_token_id).sum() + 1] = self.ignore_id # model doesn't need to predict prompt (for VQA)
289
+ return pixel_values, labels, target_sequence
290
+
291
+
292
+ class DonutModelPLModule(pl.LightningModule):
293
+ def __init__(self, config, processor, model):
294
+ super().__init__()
295
+ self.config = config
296
+ self.processor = processor
297
+ self.model = model
298
+
299
+ def training_step(self, batch, batch_idx):
300
+ pixel_values, labels, _ = batch
301
+
302
+ outputs = self.model(pixel_values, labels=labels)
303
+ loss = outputs.loss
304
+ self.log("train_loss", loss)
305
+ return loss
306
+
307
+ def validation_step(self, batch, batch_idx, dataset_idx=0):
308
+ pixel_values, labels, answers = batch
309
+ batch_size = pixel_values.shape[0]
310
+ # we feed the prompt to the model
311
+ decoder_input_ids = torch.full((batch_size, 1), self.model.config.decoder_start_token_id, device=self.device)
312
+
313
+ outputs = self.model.generate(pixel_values,
314
+ decoder_input_ids=decoder_input_ids,
315
+ max_length=DonutTrainer.max_length,
316
+ early_stopping=True,
317
+ pad_token_id=self.processor.tokenizer.pad_token_id,
318
+ eos_token_id=self.processor.tokenizer.eos_token_id,
319
+ use_cache=True,
320
+ num_beams=1,
321
+ bad_words_ids=[[self.processor.tokenizer.unk_token_id]],
322
+ return_dict_in_generate=True,)
323
+
324
+ predictions = []
325
+ for seq in self.processor.tokenizer.batch_decode(outputs.sequences):
326
+ seq = seq.replace(self.processor.tokenizer.eos_token, "").replace(self.processor.tokenizer.pad_token, "")
327
+ seq = re.sub(r"<.*?>", "", seq, count=1).strip() # remove first task start token
328
+ predictions.append(seq)
329
+
330
+ scores = []
331
+ for pred, answer in zip(predictions, answers):
332
+ pred = re.sub(r"(?:(?<=>) | (?=</s_))", "", pred)
333
+ # NOT NEEDED ANYMORE
334
+ # answer = re.sub(r"<.*?>", "", answer, count=1)
335
+ answer = answer.replace(self.processor.tokenizer.eos_token, "")
336
+ scores.append(edit_distance(pred, answer) / max(len(pred), len(answer)))
337
+
338
+ if self.config.get("verbose", False) and len(scores) == 1:
339
+ print(f"Prediction: {pred}")
340
+ print(f" Answer: {answer}")
341
+ print(f" Normed ED: {scores[0]}")
342
+
343
+ val_edit_distance = np.mean(scores)
344
+ self.log("val_edit_distance", val_edit_distance)
345
+ print(f"Validation Edit Distance: {val_edit_distance}")
346
+
347
+ return scores
348
+
349
+ def configure_optimizers(self):
350
+ # you could also add a learning rate scheduler if you want
351
+ optimizer = torch.optim.Adam(self.parameters(), lr=self.config.get("lr"))
352
+
353
+ return optimizer
354
+
355
+ def train_dataloader(self):
356
+ return DonutTrainer.train_dataloader
357
+
358
+ def val_dataloader(self):
359
+ return DonutTrainer.val_dataloader
360
+
361
+ class PushToHubCallback(Callback):
362
+ def on_train_epoch_end(self, trainer, pl_module):
363
+ print(f"Pushing model to the hub, epoch {trainer.current_epoch}")
364
+ pl_module.model.push_to_hub(DonutTrainer.huggingface_model_id, commit_message=f"Training in progress, epoch {trainer.current_epoch}")
365
+ self._upload_logs(trainer.logger.log_dir, trainer.current_epoch)
366
+
367
+ def on_train_end(self, trainer, pl_module):
368
+ print(f"Pushing model to the hub after training")
369
+ pl_module.processor.push_to_hub(DonutTrainer.huggingface_model_id,commit_message=f"Training done")
370
+ pl_module.model.push_to_hub(DonutTrainer.huggingface_model_id, commit_message=f"Training done")
371
+ self._upload_logs(trainer.logger.log_dir, "final")
372
+
373
+ def _upload_logs(self, log_dir: str, epoch_info):
374
+ try:
375
+ print(f"Attempting to upload logs from: {log_dir}")
376
+ upload_folder(folder_path=log_dir, repo_id=DonutTrainer.huggingface_model_id,
377
+ path_in_repo="tensorboard_logs",
378
+ commit_message=f"Upload logs - epoch {epoch_info}", ignore_patterns=["*.tmp", "*.lock"])
379
+ print(f"Successfully uploaded logs for epoch {epoch_info}")
380
+ except Exception as e:
381
+ print(f"Failed to upload logs: {e}")
382
+ pass
383
+
384
+ @classmethod
385
+ def train(
386
+ cls,
387
+ dataset: DatasetDict,
388
+ pretrained_model_repo_id: str,
389
+ huggingface_model_id: str,
390
+ epochs: int,
391
+ train_batch_size: int,
392
+ val_batch_size: int,
393
+ learning_rate: float,
394
+ val_check_interval: float,
395
+ check_val_every_n_epoch: int,
396
+ gradient_clip_val: float,
397
+ num_training_samples_per_epoch: int,
398
+ num_nodes: int,
399
+ warmup_steps: int,
400
+ ground_truth_key: str = "ground_truth"
401
+ ):
402
+ cls.huggingface_model_id = huggingface_model_id
403
+ config = VisionEncoderDecoderConfig.from_pretrained(pretrained_model_repo_id)
404
+ config.encoder.image_size = cls.image_size
405
+ config.decoder.max_length = cls.max_length
406
+
407
+ cls.processor = DonutProcessor.from_pretrained(pretrained_model_repo_id)
408
+ cls.model = VisionEncoderDecoderModel.from_pretrained(pretrained_model_repo_id, config=config)
409
+ cls.processor.image_processor.size = cls.image_size[::-1]
410
+ cls.processor.image_processor.do_align_long_axis = False
411
+
412
+ train_dataset = cls.DonutDataset(
413
+ dataset=dataset,
414
+ ground_truth_key=ground_truth_key,
415
+ max_length=cls.max_length,
416
+ split="train",
417
+ task_start_token=TASK_PROMPT_NAME,
418
+ prompt_end_token=TASK_PROMPT_NAME,
419
+ sort_json_key=True
420
+ )
421
+ val_dataset = cls.DonutDataset(
422
+ dataset=dataset,
423
+ ground_truth_key=ground_truth_key,
424
+ max_length=cls.max_length,
425
+ split="validation",
426
+ task_start_token=TASK_PROMPT_NAME,
427
+ prompt_end_token=TASK_PROMPT_NAME,
428
+ sort_json_key=True
429
+ )
430
+
431
+ cls.model.config.pad_token_id = cls.processor.tokenizer.pad_token_id
432
+ cls.model.config.decoder_start_token_id = cls.processor.tokenizer.convert_tokens_to_ids([TASK_PROMPT_NAME])[0]
433
+
434
+ cls.train_dataloader = DataLoader(train_dataset, batch_size=1, shuffle=True, num_workers=4)
435
+ cls.val_dataloader = DataLoader(val_dataset, batch_size=1, shuffle=False, num_workers=4)
436
+
437
+ config = {
438
+ "max_epochs": epochs,
439
+ "val_check_interval": val_check_interval, # how many times we want to validate during an epoch
440
+ "check_val_every_n_epoch": check_val_every_n_epoch,
441
+ "gradient_clip_val": gradient_clip_val,
442
+ "num_training_samples_per_epoch": num_training_samples_per_epoch,
443
+ "lr": learning_rate,
444
+ "train_batch_sizes": [train_batch_size],
445
+ "val_batch_sizes": [val_batch_size],
446
+ # "seed":2022,
447
+ "num_nodes": num_nodes,
448
+ "warmup_steps": warmup_steps, # 10%
449
+ "result_path": "./.checkpoints",
450
+ "verbose": True,
451
+ }
452
+ model_module = cls.DonutModelPLModule(config, cls.processor, cls.model)
453
+
454
+ device = (
455
+ "cuda"
456
+ if torch.cuda.is_available()
457
+ else "mps" if torch.backends.mps.is_available() else "cpu"
458
+ )
459
+ print(f"Using {device} device")
460
+ trainer = pl.Trainer(
461
+ accelerator="gpu" if device == "cuda" else "mps" if device == "mps" else "cpu",
462
+ devices=1 if device == "cuda" else 0,
463
+ max_epochs=config.get("max_epochs"),
464
+ val_check_interval=config.get("val_check_interval"),
465
+ check_val_every_n_epoch=config.get("check_val_every_n_epoch"),
466
+ gradient_clip_val=config.get("gradient_clip_val"),
467
+ precision=16 if device == "cuda" else 32, # we'll use mixed precision if device == "cuda"
468
+ num_sanity_val_steps=0,
469
+ logger=TensorBoardLogger(save_dir="./.checkpoints", name="donut_training", version=None),
470
+ callbacks=[cls.PushToHubCallback()]
471
+ )
472
+ trainer.fit(model_module)
menu/llm/__init__.py ADDED
@@ -0,0 +1,2 @@
+ from .gemini import GeminiAPI
+ from .openai import OpenAIAPI
menu/llm/base.py ADDED
@@ -0,0 +1,12 @@
+ from typing import List
+ from abc import ABC, abstractmethod
+
+ from PIL import Image
+
+ PROMPT = "The provided images display a menu. IMPORTANT: There may be MULTIPLE images representing different pages. You MUST examine EVERY image provided and combine all extracted information into the final result. Do not miss any dishes from any page."
+
+ class LLMBase(ABC):
+     @classmethod
+     @abstractmethod
+     def call(cls, images: List[Image.Image], model: str, token: str) -> dict:
+         raise NotImplementedError
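Any new provider only has to implement this one classmethod. A minimal sketch of a hypothetical provider (the class and its internals are invented purely to illustrate the contract):

```python
from typing import List

from PIL import Image

from menu.llm.base import LLMBase, PROMPT

class EchoAPI(LLMBase):
    """Hypothetical provider; illustrates the LLMBase contract only."""
    @classmethod
    def call(cls, images: List[Image.Image], model: str, token: str) -> dict:
        # A real implementation would send `images` plus PROMPT to the model
        # identified by `model`, authenticate with `token`, and return the
        # function-call arguments as a dict matching tools/schema_*.json.
        return {"restaurant": "", "address": "", "phone": "", "business_hours": "", "dishes": []}
```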
menu/llm/gemini.py ADDED
@@ -0,0 +1,36 @@
+ import json
+ from typing import List
+
+ from PIL import Image
+ from google import genai
+ from google.genai import types
+
+ from .base import LLMBase, PROMPT
+
+ FUNCTION_CALL = json.load(open("tools/schema_gemini.json", "r"))
+
+ class GeminiAPI(LLMBase):
+     @classmethod
+     def call(cls, images: List[Image.Image], model: str, token: str) -> dict:
+         client = genai.Client(api_key=token)  # Initialize the client with the API key
+
+         config = types.GenerateContentConfig(
+             tools=[types.Tool(function_declarations=[FUNCTION_CALL])],
+             tool_config={
+                 "function_calling_config": {
+                     "mode": "ANY",
+                     "allowed_function_names": [FUNCTION_CALL["name"]]
+                 }
+             }
+         )
+
+         response = client.models.generate_content(
+             model=model,
+             contents=images + [PROMPT],
+             config=config
+         )
+         if response.candidates[0].content.parts[0].function_call:
+             function_call = response.candidates[0].content.parts[0].function_call
+             return function_call.args
+
+         return {}
menu/llm/openai.py ADDED
@@ -0,0 +1,43 @@
+ import json
+ import base64
+ from io import BytesIO
+ from typing import List
+
+ from PIL import Image
+ from openai import OpenAI
+
+ from .base import LLMBase, PROMPT
+
+ FUNCTION_CALL = json.load(open("tools/schema_openai.json", "r"))
+
+ class OpenAIAPI(LLMBase):
+     @classmethod
+     def call(cls, images: List[Image.Image], model: str, token: str) -> dict:
+         client = OpenAI(api_key=token)  # Initialize the client with the API key
+
+         content = []
+         for image in images:
+             buffer = BytesIO()
+             image.convert("RGB").save(buffer, format="JPEG")  # JPEG has no alpha channel, so convert first
+             encode_img = base64.b64encode(buffer.getvalue()).decode("utf-8")
+             content.append({
+                 "type": "input_image",
+                 "image_url": f"data:image/jpeg;base64,{encode_img}",  # the Responses API expects a plain string here
+             })
+
+         content.append({"type": "input_text", "text": PROMPT})  # Responses API naming for text parts
+
+         response = client.responses.create(
+             model=model,
+             input=[
+                 {
+                     "role": "user",
+                     "content": content,
+                 }
+             ],
+             tools=[FUNCTION_CALL],
+         )
+         if response and response.output:
+             if hasattr(response.output[0], "arguments"):
+                 return json.loads(response.output[0].arguments)
+         return {}
menu/utils.py ADDED
@@ -0,0 +1,48 @@
+ from typing import Optional
+
+ from datasets import Dataset, DatasetDict
+
+ def split_dataset(
+     dataset: Dataset,
+     train: float,
+     validation: float,
+     test: float,
+     seed: Optional[int] = None
+ ) -> DatasetDict:
+     """
+     Split a single-split Hugging Face Dataset into train/validation/test subsets.
+
+     Args:
+         dataset (Dataset): The input dataset (e.g. load_dataset(...)['train']).
+         train (float): Proportion of data for the train split (0 < train < 1).
+         validation (float): Proportion of data for the validation split (0 < validation < 1).
+         test (float): Proportion of data for the test split (0 < test < 1).
+             Must satisfy train + validation + test == 1.0.
+         seed (int): Random seed for reproducibility (default: None).
+
+     Returns:
+         DatasetDict: A dictionary with keys "train", "validation", and "test".
+     """
+     # Verify ratios sum to 1.0
+     total = train + validation + test
+     if abs(total - 1.0) > 1e-8:
+         raise ValueError(f"train + validation + test must equal 1.0 (got {total})")
+
+     # First split: extract train vs. temp (validation + test)
+     temp_size = validation + test
+     split_1 = dataset.train_test_split(test_size=temp_size, seed=seed)
+     train_ds = split_1["train"]
+     temp_ds = split_1["test"]
+
+     # Second split: divide temp into validation vs. test
+     relative_test_size = test / temp_size
+     split_2 = temp_ds.train_test_split(test_size=relative_test_size, seed=seed)
+     validation_ds = split_2["train"]
+     test_ds = split_2["test"]
+
+     # Return a DatasetDict with all three splits
+     return DatasetDict({
+         "train": train_ds,
+         "validation": validation_ds,
+         "test": test_ds,
+     })
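A quick usage sketch (the dataset repo id is the one used in `train.ipynb`; the ratios must sum to 1.0 or a `ValueError` is raised):

```python
from datasets import load_dataset

from menu.utils import split_dataset

ds = load_dataset("ryanlinjui/menu-zh-TW")["train"]
splits = split_dataset(ds, train=0.8, validation=0.1, test=0.1, seed=42)
print({name: len(split) for name, split in splits.items()})  # sizes depend on the dataset
```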
pyproject.toml ADDED
@@ -0,0 +1,27 @@
+ [project]
+ authors = [{name = "ryanlinjui", email = "ryanlinjui@gmail.com"}]
+ name = "menu-text-detection"
+ version = "0.1.0"
+ description = "Extract structured menu information from images into JSON using a fine-tuned Donut E2E model."
+ readme = "README.md"
+ requires-python = "==3.11.*"
+ dependencies = [
+     "datasets>=3.6.0",
+     "dotenv>=0.9.9",
+     "google-genai>=1.14.0",
+     "gradio>=5.29.0",
+     "huggingface-hub>=0.31.1",
+     "matplotlib>=3.10.1",
+     "nltk>=3.9.1",
+     "notebook>=7.4.2",
+     "openai>=1.77.0",
+     "pillow>=11.2.1",
+     "pillow-heif>=0.22.0",
+     "protobuf>=6.30.2",
+     "pytorch-lightning>=2.5.2",
+     "sentencepiece>=0.2.0",
+     "tensorboardx>=2.6.2.2",
+     "transformers==4.49",
+     "torch==2.4.1",
+     "donut-python>=1.0.9",
+ ]
requirements.txt ADDED
@@ -0,0 +1,184 @@
+ aiofiles==24.1.0
+ aiohappyeyeballs==2.6.1
+ aiohttp==3.13.3
+ aiosignal==1.4.0
+ annotated-doc==0.0.4
+ annotated-types==0.7.0
+ anyio==4.12.1
+ appnope==0.1.4
+ argon2-cffi==25.1.0
+ argon2-cffi-bindings==25.1.0
+ arrow==1.4.0
+ asttokens==3.0.1
+ async-lru==2.0.5
+ attrs==25.4.0
+ babel==2.17.0
+ beautifulsoup4==4.14.3
+ bleach==6.3.0
+ brotli==1.2.0
+ certifi==2026.1.4
+ cffi==2.0.0
+ charset-normalizer==3.4.4
+ click==8.3.1
+ comm==0.2.3
+ contourpy==1.3.3
+ cycler==0.12.1
+ datasets==4.5.0
+ debugpy==1.8.19
+ decorator==5.2.1
+ defusedxml==0.7.1
+ dill==0.4.0
+ distro==1.9.0
+ donut-python==1.0.9
+ dotenv==0.9.9
+ executing==2.2.1
+ fastapi==0.128.0
+ fastjsonschema==2.21.2
+ ffmpy==1.0.0
+ filelock==3.20.3
+ fonttools==4.61.1
+ fqdn==1.5.1
+ frozenlist==1.8.0
+ fsspec==2025.10.0
+ google-auth==2.47.0
+ google-genai==1.58.0
+ gradio==6.3.0
+ gradio-client==2.0.3
+ groovy==0.1.2
+ h11==0.16.0
+ hf-xet==1.2.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface-hub==0.36.0
+ idna==3.11
+ ipykernel==7.1.0
+ ipython==9.9.0
+ ipython-pygments-lexers==1.1.1
+ isoduration==20.11.0
+ jedi==0.19.2
+ jinja2==3.1.6
+ jiter==0.12.0
+ joblib==1.5.3
+ json5==0.13.0
+ jsonpointer==3.0.0
+ jsonschema==4.26.0
+ jsonschema-specifications==2025.9.1
+ jupyter-client==8.8.0
+ jupyter-core==5.9.1
+ jupyter-events==0.12.0
+ jupyter-lsp==2.3.0
+ jupyter-server==2.17.0
+ jupyter-server-terminals==0.5.4
+ jupyterlab==4.5.2
+ jupyterlab-pygments==0.3.0
+ jupyterlab-server==2.28.0
+ kiwisolver==1.4.9
+ lark==1.3.1
+ lightning-utilities==0.15.2
+ markdown-it-py==4.0.0
+ markupsafe==3.0.3
+ matplotlib==3.10.8
+ matplotlib-inline==0.2.1
+ mdurl==0.1.2
+ mistune==3.2.0
+ mpmath==1.3.0
+ multidict==6.7.0
+ multiprocess==0.70.18
+ munch==4.0.0
+ nbclient==0.10.4
+ nbconvert==7.16.6
+ nbformat==5.10.4
+ nest-asyncio==1.6.0
+ networkx==3.6.1
+ nltk==3.9.2
+ notebook==7.5.2
+ notebook-shim==0.2.4
+ numpy==2.4.1
+ openai==2.15.0
+ orjson==3.11.5
+ overrides==7.7.0
+ packaging==25.0
+ pandas==2.3.3
+ pandocfilters==1.5.1
+ parso==0.8.5
+ pexpect==4.9.0
+ pillow==12.1.0
+ pillow-heif==1.1.1
+ platformdirs==4.5.1
+ prometheus-client==0.24.1
+ prompt-toolkit==3.0.52
+ propcache==0.4.1
+ protobuf==6.33.4
+ psutil==7.2.1
+ ptyprocess==0.7.0
+ pure-eval==0.2.3
+ pyarrow==22.0.0
+ pyasn1==0.6.1
+ pyasn1-modules==0.4.2
+ pycparser==2.23
+ pydantic==2.12.5
+ pydantic-core==2.41.5
+ pydub==0.25.1
+ pygments==2.19.2
+ pyparsing==3.3.1
+ python-dateutil==2.9.0.post0
+ python-dotenv==1.2.1
+ python-json-logger==4.0.0
+ python-multipart==0.0.21
+ pytorch-lightning==2.6.0
+ pytz==2025.2
+ pyyaml==6.0.3
+ pyzmq==27.1.0
+ referencing==0.37.0
+ regex==2026.1.15
+ requests==2.32.5
+ rfc3339-validator==0.1.4
+ rfc3986-validator==0.1.1
+ rfc3987-syntax==1.1.0
+ rich==14.2.0
+ rpds-py==0.30.0
+ rsa==4.9.1
+ ruamel-yaml==0.19.1
+ safehttpx==0.1.7
+ safetensors==0.7.0
+ sconf==0.2.5
+ semantic-version==2.10.0
+ send2trash==2.1.0
+ sentencepiece==0.2.1
+ setuptools==80.9.0
+ shellingham==1.5.4
+ six==1.17.0
+ sniffio==1.3.1
+ soupsieve==2.8.1
+ stack-data==0.6.3
+ starlette==0.50.0
+ sympy==1.14.0
+ tenacity==9.1.2
+ tensorboardx==2.6.4
+ terminado==0.18.1
+ timm==1.0.24
+ tinycss2==1.4.0
+ tokenizers==0.21.4
+ tomlkit==0.13.3
+ torch==2.4.1
+ torchmetrics==1.8.2
+ torchvision==0.19.1
+ tornado==6.5.4
+ tqdm==4.67.1
+ traitlets==5.14.3
+ transformers==4.49.0
+ typer==0.21.1
+ typing-extensions==4.15.0
+ typing-inspection==0.4.2
+ tzdata==2025.3
+ uri-template==1.3.0
+ urllib3==2.6.3
+ uvicorn==0.40.0
+ wcwidth==0.2.14
+ webcolors==25.10.0
+ webencodings==0.5.1
+ websocket-client==1.9.0
+ websockets==15.0.1
+ xxhash==3.6.0
+ yarl==1.22.0
+ zss==1.2.0
tools/schema_gemini.json ADDED
@@ -0,0 +1,44 @@
+ {
+     "name": "extract_menu_data",
+     "description": "Extract structured menu information from images.",
+     "parameters": {
+         "type": "object",
+         "properties": {
+             "restaurant": {
+                 "type": "string",
+                 "description": "Name of the restaurant. If the name is not available, it should be ''."
+             },
+             "address": {
+                 "type": "string",
+                 "description": "Address of the restaurant. If the address is not available, it should be ''."
+             },
+             "phone": {
+                 "type": "string",
+                 "description": "Phone number of the restaurant. If the phone number is not available, it should be ''."
+             },
+             "business_hours": {
+                 "type": "string",
+                 "description": "Business hours of the restaurant. If the business hours are not available, it should be ''."
+             },
+             "dishes": {
+                 "type": "array",
+                 "items": {
+                     "type": "object",
+                     "properties": {
+                         "name": {
+                             "type": "string",
+                             "description": "Name of the menu item."
+                         },
+                         "price": {
+                             "type": "string",
+                             "description": "Price of the menu item. If the price is not available, it should be -1."
+                         }
+                     },
+                     "required": ["name", "price"]
+                 },
+                 "description": "List of menu dish items."
+             }
+         },
+         "required": ["restaurant", "address", "phone", "business_hours", "dishes"]
+     }
+ }
tools/schema_openai.json ADDED
@@ -0,0 +1,47 @@
+ {
+     "type": "function",
+     "name": "extract_menu_data",
+     "description": "Extract structured menu information from images.",
+     "parameters": {
+         "type": "object",
+         "properties": {
+             "restaurant": {
+                 "type": "string",
+                 "description": "Name of the restaurant. If the name is not available, it should be ''."
+             },
+             "address": {
+                 "type": "string",
+                 "description": "Address of the restaurant. If the address is not available, it should be ''."
+             },
+             "phone": {
+                 "type": "string",
+                 "description": "Phone number of the restaurant. If the phone number is not available, it should be ''."
+             },
+             "business_hours": {
+                 "type": "string",
+                 "description": "Business hours of the restaurant. If the business hours are not available, it should be ''."
+             },
+             "dishes": {
+                 "type": "array",
+                 "items": {
+                     "type": "object",
+                     "properties": {
+                         "name": {
+                             "type": "string",
+                             "description": "Name of the menu item."
+                         },
+                         "price": {
+                             "type": "string",
+                             "description": "Price of the menu item. If the price is not available, it should be -1."
+                         }
+                     },
+                     "required": ["name", "price"],
+                     "additionalProperties": false
+                 },
+                 "description": "List of menu dish items."
+             }
+         },
+         "required": ["restaurant", "address", "phone", "business_hours", "dishes"],
+         "additionalProperties": false
+     }
+ }
train.ipynb ADDED
@@ -0,0 +1,235 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Login to HuggingFace (just login once)"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from huggingface_hub import interpreter_login\n",
+     "interpreter_login()"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Collect Menu Image Datasets\n",
+     "- Use `metadata.jsonl` to label the images' ground truth. You can visit [here](https://github.com/ryanlinjui/menu-text-detection/tree/main/examples) to see the examples.\n",
+     "- After finishing, push to HuggingFace Datasets.\n",
+     "- For labeling:\n",
+     "  - [Google AI Studio](https://aistudio.google.com) or [OpenAI ChatGPT](https://chatgpt.com).\n",
+     "  - Use function calling by API. Start the gradio app locally or visit [here](https://huggingface.co/spaces/ryanlinjui/menu-text-detection).\n",
+     "\n",
+     "### Menu Type\n",
+     "- **h**: horizontal menu\n",
+     "- **v**: vertical menu\n",
+     "- **d**: document-style menu\n",
+     "- **s**: in-scene menu (non-document style)\n",
+     "- **i**: irregular menu (menu with irregular text layout)\n",
+     "\n",
+     "> Please see the [examples](https://github.com/ryanlinjui/menu-text-detection/tree/main/examples) for more details."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import os\n",
+     "import json\n",
+     "\n",
+     "from PIL import Image\n",
+     "from pillow_heif import register_heif_opener\n",
+     "\n",
+     "from menu.llm import (\n",
+     "    GeminiAPI,\n",
+     "    OpenAIAPI\n",
+     ")\n",
+     "\n",
+     "IMAGE_DIR = \"datasets/images\"  # set your image directory here\n",
+     "SELECTED_MODEL = \"gemini-2.5-flash\"  # set model name here; refer to LLM_MODEL_LIST in app.py for more\n",
+     "API_TOKEN = \"\"  # set your API token here\n",
+     "SELECTED_FUNCTION = GeminiAPI  # set \"GeminiAPI\" or \"OpenAIAPI\"\n",
+     "\n",
+     "register_heif_opener()\n",
+     "\n",
+     "for file in os.listdir(IMAGE_DIR):\n",
+     "    print(f\"Processing image: {file}\")\n",
+     "    try:\n",
+     "        image = Image.open(os.path.join(IMAGE_DIR, file))  # keep as PIL Image; call() expects a list of PIL Images\n",
+     "        data = {\n",
+     "            \"file_name\": file,\n",
+     "            \"menu\": SELECTED_FUNCTION.call([image], SELECTED_MODEL, API_TOKEN)\n",
+     "        }\n",
+     "        with open(os.path.join(IMAGE_DIR, \"metadata.jsonl\"), \"a\", encoding=\"utf-8\") as metaf:\n",
+     "            metaf.write(json.dumps(data, ensure_ascii=False, sort_keys=True) + \"\\n\")\n",
+     "    except Exception as e:\n",
+     "        print(f\"Skipping invalid image '{file}': {e}\")\n",
+     "        continue"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Push Datasets to HuggingFace"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from datasets import load_dataset\n",
+     "\n",
+     "dataset = load_dataset(path=\"datasets/menu-zh-TW\")  # load dataset from the local directory including the metadata.jsonl and image files\n",
+     "dataset.push_to_hub(repo_id=\"ryanlinjui/menu-zh-TW\")  # push to the huggingface dataset hub"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Prepare the dataset for training"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from menu.utils import split_dataset\n",
+     "from datasets import load_dataset\n",
+     "\n",
+     "dataset = load_dataset(path=\"ryanlinjui/menu-zh-TW\")  # set your dataset repo id for training\n",
+     "dataset = split_dataset(dataset[\"train\"], train=0.8, validation=0.1, test=0.1, seed=42)  # (optional) use it if your dataset is not split into train/validation/test\n",
+     "print(f\"Dataset split: {len(dataset['train'])} train, {len(dataset['validation'])} validation, {len(dataset['test'])} test\")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Fine-tune Donut Model"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import logging\n",
+     "from menu.donut import DonutTrainer\n",
+     "\n",
+     "logging.getLogger(\"transformers\").setLevel(logging.ERROR)  # filter output messages from transformers\n",
+     "\n",
+     "DonutTrainer.train(\n",
+     "    dataset=dataset,\n",
+     "    pretrained_model_repo_id=\"naver-clova-ix/donut-base\",  # set your pretrained model repo id for fine-tuning\n",
+     "    ground_truth_key=\"menu\",  # set your ground truth key for training\n",
+     "    huggingface_model_id=\"ryanlinjui/donut-base-finetuned-menu\",  # set your huggingface model repo id for saving / pushing to the hub\n",
+     "    epochs=15,  # set your training epochs\n",
+     "    train_batch_size=8,  # set your training batch size\n",
+     "    val_batch_size=1,  # set your validation batch size\n",
+     "    learning_rate=3e-5,  # set your learning rate\n",
+     "    val_check_interval=0.5,  # how many times we want to validate during an epoch\n",
+     "    check_val_every_n_epoch=1,  # validate every n epochs\n",
+     "    gradient_clip_val=1.0,  # gradient clipping value for training stability\n",
+     "    num_training_samples_per_epoch=198,  # set num_training_samples_per_epoch = training set size\n",
+     "    num_nodes=1,  # number of nodes for distributed training\n",
+     "    warmup_steps=75  # number of warmup steps for the learning rate scheduler (roughly 10% of total steps)\n",
+     ")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Evaluate Donut Model"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import json\n",
+     "from datasets import load_dataset\n",
+     "\n",
+     "from menu.utils import split_dataset\n",
+     "from menu.donut import DonutFinetuned\n",
+     "\n",
+     "dataset = load_dataset(\"ryanlinjui/menu-zh-TW\")\n",
+     "dataset = split_dataset(dataset[\"train\"], train=0.8, validation=0.1, test=0.1, seed=42)  # (optional) use it if your dataset is not split into train/validation/test\n",
+     "donut_finetuned = DonutFinetuned(pretrained_model_repo_id=\"ryanlinjui/donut-base-finetuned-menu\")\n",
+     "scores, output_list = donut_finetuned.evaluate(dataset=dataset[\"test\"], ground_truth_key=\"menu\")\n",
+     "\n",
+     "print(\"Evaluation scores:\")\n",
+     "for key, value in scores.items():\n",
+     "    print(f\"{key}: {value}\")\n",
+     "\n",
+     "print(\"\\nSample outputs:\")\n",
+     "for output in output_list[:5]:\n",
+     "    print(json.dumps(output, ensure_ascii=False, indent=4))"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# Test Donut Model"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": null,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "from PIL import Image\n",
+     "from menu.donut import DonutFinetuned\n",
+     "\n",
+     "image = Image.open(\"./examples/menu-hd.jpg\")\n",
+     "\n",
+     "donut_finetuned = DonutFinetuned(pretrained_model_repo_id=\"ryanlinjui/donut-base-finetuned-menu\")\n",
+     "outputs = donut_finetuned.predict(image=image)\n",
+     "print(outputs)"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "menu-text-detection",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.11.12"
+   }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+ }
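For reference, each line the collection cell appends to `metadata.jsonl` pairs a file name with the extracted menu. A single-line example with invented values (keys are sorted alphabetically because the cell uses `sort_keys=True`):

```python
# One line of datasets/images/metadata.jsonl (stored as JSON; shown here as a Python dict):
{"file_name": "menu-hd.jpg", "menu": {"address": "", "business_hours": "", "dishes": [{"name": "Latte", "price": "120"}], "phone": "", "restaurant": "Ryan's Diner"}}
```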
uv.lock ADDED
The diff for this file is too large to render. See raw diff