Spaces: ryanlinjui/menu-text-detection

github-actions[bot] committed on Jul 3, 2025

Commit

6bd37dd

0 Parent(s):

Sync from https://github.com/ryanlinjui/menu-text-detection

Files changed (20)

.checkpoints/.gitkeep +0 -0
.env.example +3 -0
.github/workflows/sync.yml +25 -0
.gitignore +24 -0
.python-version +1 -0
LICENSE +21 -0
README.md +65 -0
app.py +157 -0
menu/donut.py +472 -0
menu/llm/__init__.py +2 -0
menu/llm/base.py +9 -0
menu/llm/gemini.py +36 -0
menu/llm/openai.py +39 -0
menu/utils.py +48 -0
pyproject.toml +27 -0
requirements.txt +183 -0
tools/schema_gemini.json +44 -0
tools/schema_openai.json +47 -0
train.ipynb +235 -0
uv.lock +0 -0

.checkpoints/.gitkeep ADDED

File without changes

.env.example ADDED

	@@ -0,0 +1,3 @@

+HUGGINGFACE_TOKEN="HUGGINGFACE_TOKEN"
+GEMINI_API_TOKEN="GEMINI_API_TOKEN"
+OPENAI_API_TOKEN="OPENAI_API_TOKEN"

.github/workflows/sync.yml ADDED

	@@ -0,0 +1,25 @@

+name: Sync to Hugging Face Spaces
+on:
+    push:
+        branches:
+            - main
+jobs:
+    sync:
+        name: Sync
+        runs-on: ubuntu-latest
+        steps:
+            - name: Checkout Repository
+              uses: actions/checkout@v4
+            - name: Remove bad files
+              run: rm -rf examples assets
+            - name: Sync to Hugging Face Spaces
+              uses: JacobLinCool/huggingface-sync@v1
+              with:
+                  github: ${{ secrets.GITHUB_TOKEN }}
+                  user: ryanlinjui # Hugging Face username or organization name
+                  space: menu-text-detection # Hugging Face space name
+                  token: ${{ secrets.HF_TOKEN }} # Hugging Face token
+                  python_version: 3.11 # Python version

.gitignore ADDED

	@@ -0,0 +1,24 @@

+# mac
+.DS_Store
+# cache
+__pycache__
+# datasets
+datasets
+# papers
+docs/papers
+# uv
+.venv
+# gradio
+.gradio
+# env
+.env
+# checkpoint
+.checkpoints/*
+!.checkpoints/.gitkeep

.python-version ADDED

	@@ -0,0 +1 @@


+3.11

LICENSE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 RyanLin
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md ADDED

	@@ -0,0 +1,65 @@

+---
+title: menu text detection
+emoji: 🦄
+colorFrom: indigo
+colorTo: pink
+sdk: gradio
+python_version: 3.11
+short_description: Extract structured menu information from images into JSON...
+tags: [ "donut","fine-tuning","image-to-text","transformer" ]
+---
+# Menu Text Detection System
+Extract structured menu information from images into JSON using a fine-tuned E2E model or LLM.
+[![Gradio Space Demo](https://img.shields.io/badge/GradioSpace-Demo-important?logo=huggingface)](https://huggingface.co/spaces/ryanlinjui/menu-text-detection)
+[![Hugging Face Models & Datasets](https://img.shields.io/badge/HuggingFace-Models_&_Datasets-important?logo=huggingface)](https://huggingface.co/collections/ryanlinjui/menu-text-detection-670ccf527626bb004bbfb39b)
+https://github.com/user-attachments/assets/80e5d54c-f2c8-4593-ad9b-499e5b71d8f6
+## 🚀 Features
+### Overview
+Currently extracts the following information from menu images:
+- **Restaurant Name**
+- **Business Hours**
+- **Address**
+- **Phone Number**
+- **Dish Information**
+  - Name
+  - Price
+> For the JSON schema, see the [tools directory](./tools).
+### Supported Methods to Extract Menu Information
+#### Fine-tuned E2E model and Training metrics
+- [**Donut (Document Parsing Task)**](https://huggingface.co/ryanlinjui/donut-base-finetuned-menu) - Base model by [Clova AI (ECCV ’22)](https://github.com/clovaai/donut)
+#### LLM Function Calling
+- Google Gemini API
+- OpenAI GPT API
+## 💻 Training / Fine-Tuning
+### Setup
+Use [uv](https://github.com/astral-sh/uv) to set up the development environment:
+```bash
+uv sync
+```
+> If `uv sync` fails, use `pip install -r requirements.txt` instead.
+### Training Script (Dataset Collection, Fine-Tuning)
+Please refer to [`train.ipynb`](./train.ipynb). Use Jupyter Notebook for training:
+```bash
+uv run jupyter-notebook
+```
+> For VSCode users, please install the Jupyter extension, then select `.venv/bin/python` as your kernel.
+### Run Demo Locally
+```bash
+uv run python app.py
+```
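As a rough illustration of the Overview list in the README above, the sketch below builds one menu record and serializes it with the same `json.dumps` arguments that `app.py` uses (`indent=4`, `ensure_ascii=False`, `sort_keys=True`). The key names and sample values are hypothetical; the actual schema lives in the `tools` directory and is not reproduced here.

```python
import json

# Hypothetical record: key names and values are illustrative only,
# not the schema defined in tools/schema_*.json.
menu = {
    "restaurant": "Example Noodle House",
    "business_hours": "11:00-21:00",
    "address": "1 Example Road",
    "phone": "02-0000-0000",
    "dishes": [
        {"name": "牛肉麵", "price": "150"},
        {"name": "Dumplings", "price": "80"},
    ],
}


def to_pretty_json(record: dict) -> str:
    # Readable indentation, non-ASCII text kept as-is,
    # and keys sorted so the output order is stable.
    return json.dumps(record, indent=4, ensure_ascii=False, sort_keys=True)


text = to_pretty_json(menu)
print(text)
```

With `ensure_ascii=False`, dish names written in non-Latin scripts stay readable in the output instead of being escaped as `\uXXXX` sequences.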

app.py ADDED

	@@ -0,0 +1,157 @@

+import os
+import json
+import gradio as gr
+from PIL import Image
+from dotenv import load_dotenv
+from pillow_heif import register_heif_opener
+from menu.llm import (
+    GeminiAPI,
+    OpenAIAPI
+)
+from menu.donut import DonutFinetuned
+register_heif_opener()
+load_dotenv(override=True)
+GEMINI_API_TOKEN = os.getenv("GEMINI_API_TOKEN", "")
+OPENAI_API_TOKEN = os.getenv("OPENAI_API_TOKEN", "")
+SOURCE_CODE_GH_URL = "https://github.com/ryanlinjui/menu-text-detection"
+BADGE_URL = "https://img.shields.io/badge/GitHub_Code-Click_Here!!-default?logo=github"
+GITHUB_RAW_URL = "https://raw.githubusercontent.com/ryanlinjui/menu-text-detection/main"
+EXAMPLE_IMAGE_LIST = [
+    f"{GITHUB_RAW_URL}/examples/menu-hd.jpg",
+    f"{GITHUB_RAW_URL}/examples/menu-vs.jpg",
+    f"{GITHUB_RAW_URL}/examples/menu-si.jpg"
+]
+FINETUNED_MODEL_LIST = [
+    "Donut (Document Parsing Task) Fine-tuned Model"
+]
+LLM_MODEL_LIST = [
+    "gemini-2.5-pro",
+    "gemini-2.5-flash",
+    "gemini-2.0-flash",
+    "gpt-4.1",
+    "gpt-4o",
+    "o4-mini"
+]
+donut_finetuned = DonutFinetuned("ryanlinjui/donut-base-finetuned-menu")
+def handle(image: Image.Image, model: str, api_token: str) -> str:
+    if image is None:
+        raise gr.Error("Please upload an image first.")
+    if model == FINETUNED_MODEL_LIST[0]:
+        result = donut_finetuned.predict(image)
+    elif model in LLM_MODEL_LIST:
+        if len(api_token) < 10:
+            raise gr.Error(f"Please provide a valid token for {model}.")
+        try:
+            if model in LLM_MODEL_LIST[:3]:
+                result = GeminiAPI.call(image, model, api_token)
+            else:
+                result = OpenAIAPI.call(image, model, api_token)
+        except Exception as e:
+            raise gr.Error(f"Failed to process with API model {model}: {str(e)}")
+    else:
+        raise gr.Error("Invalid model selection. Please choose a valid model.")
+    return json.dumps(result, indent=4, ensure_ascii=False, sort_keys=True)
+def UserInterface() -> gr.Blocks:
+    with gr.Blocks(
+        delete_cache=(86400, 86400),
+        css="""
+        .image-panel {
+            display: flex;
+            flex-direction: column;
+            height: 600px;
+        }
+        .image-panel img {
+            object-fit: contain;
+            max-height: 600px;
+            max-width: 600px;
+            width: 100%;
+        }
+        .large-text textarea {
+            font-size: 20px !important;
+            height: 600px !important;
+            width: 100% !important;
+        }
+        """
+    ) as gradio_interface:
+        gr.HTML(f'<a href="{SOURCE_CODE_GH_URL}"><img src="{BADGE_URL}" alt="GitHub Code"/></a>')
+        gr.Markdown("# Menu Text Detection")
+        with gr.Row():
+            with gr.Column(scale=1, min_width=500):
+                gr.Markdown("## 📷 Menu Image")
+                menu_image = gr.Image(
+                    type="pil",
+                    label="Input menu image",
+                    elem_classes="image-panel"
+                )
+                gr.Markdown("## 🤖 Model Selection")
+                model_choice_dropdown = gr.Dropdown(
+                    choices=FINETUNED_MODEL_LIST + LLM_MODEL_LIST,
+                    value=FINETUNED_MODEL_LIST[0],
+                    label="Select Text Detection Model"
+                )
+                api_token_textbox = gr.Textbox(
+                    label="API Token",
+                    placeholder="Enter your API token here...",
+                    type="password",
+                    visible=False
+                )
+                generate_button = gr.Button("Generate Menu Information", variant="primary")
+                gr.Examples(
+                    examples=EXAMPLE_IMAGE_LIST,
+                    inputs=menu_image,
+                    label="Example Menu Images"
+                )
+            with gr.Column(scale=1):
+                gr.Markdown("## 🍽️ Menu Info")
+                menu_json_textbox = gr.Textbox(
+                    label="Output JSON",
+                    interactive=True,
+                    text_align="left",
+                    elem_classes="large-text"
+                )
+        def update_token_visibility(choice):
+            if choice in LLM_MODEL_LIST:
+                current_token = ""
+                if choice in LLM_MODEL_LIST[:3]:
+                    current_token = GEMINI_API_TOKEN
+                else:
+                    current_token = OPENAI_API_TOKEN
+                return gr.update(visible=True, value=current_token)
+            else:
+                return gr.update(visible=False)
+        model_choice_dropdown.change(
+            fn=update_token_visibility,
+            inputs=model_choice_dropdown,
+            outputs=api_token_textbox
+        )
+        generate_button.click(
+            fn=handle,
+            inputs=[menu_image, model_choice_dropdown, api_token_textbox],
+            outputs=menu_json_textbox
+        )
+    return gradio_interface
+if __name__ == "__main__":
+    demo = UserInterface()
+    demo.launch()
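The branching in `handle` can be summarized as a small routing function. The sketch below is a simplified stand-in, not the app's code: `route_model` and the backend labels are hypothetical names, and it routes LLMs by name prefix rather than by list position (`LLM_MODEL_LIST[:3]`), which is one way to keep the routing correct if the list is ever reordered.

```python
FINETUNED_MODEL_LIST = ["Donut (Document Parsing Task) Fine-tuned Model"]
LLM_MODEL_LIST = [
    "gemini-2.5-pro", "gemini-2.5-flash", "gemini-2.0-flash",
    "gpt-4.1", "gpt-4o", "o4-mini",
]


def route_model(model: str) -> str:
    """Return a hypothetical backend label for the selected model name."""
    if model in FINETUNED_MODEL_LIST:
        return "donut"
    if model in LLM_MODEL_LIST:
        # Assumption: every Gemini model name starts with "gemini-";
        # all other listed LLMs go through the OpenAI client.
        return "gemini" if model.startswith("gemini-") else "openai"
    raise ValueError(f"Invalid model selection: {model!r}")


for name in FINETUNED_MODEL_LIST + LLM_MODEL_LIST:
    print(f"{name} -> {route_model(name)}")
```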

menu/donut.py ADDED

	@@ -0,0 +1,472 @@

+"""
+This file is adapted from the Hugging Face Transformers tutorial notebook for fine-tuning Donut on a custom dataset.
+The notebook code has been moved from the `.ipynb` into a module for better reusability and maintainability.
+Reference: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Donut/CORD/Fine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb
+"""
+import re
+import random
+from typing import Any, List, Tuple, Dict
+import torch
+import numpy as np
+from PIL import Image
+from tqdm.auto import tqdm
+from nltk import edit_distance
+import pytorch_lightning as pl
+from datasets import DatasetDict
+from donut import JSONParseEvaluator
+from huggingface_hub import upload_folder
+from pillow_heif import register_heif_opener
+from pytorch_lightning.callbacks import Callback
+from pytorch_lightning.loggers import TensorBoardLogger
+from torch.utils.data import (
+    Dataset,
+    DataLoader
+)
+from transformers import (
+    DonutProcessor,
+    VisionEncoderDecoderModel,
+    VisionEncoderDecoderConfig
+)
+TASK_PROMPT_NAME = "<s_menu-text-detection>"
+register_heif_opener()
+class DonutFinetuned:
+    def __init__(self, pretrained_model_repo_id: str = "ryanlinjui/donut-test"):
+        self.device = (
+            "cuda"
+            if torch.cuda.is_available()
+            else "mps" if torch.backends.mps.is_available() else "cpu"
+        )
+        self.processor = DonutProcessor.from_pretrained(pretrained_model_repo_id)
+        self.model = VisionEncoderDecoderModel.from_pretrained(pretrained_model_repo_id)
+        self.model.eval()
+        self.model.to(self.device)
+        print(f"Using {self.device} device")
+    def predict(self, image: Image.Image) -> Dict[str, Any]:
+        # prepare encoder inputs
+        pixel_values = self.processor(image.convert("RGB"), return_tensors="pt").pixel_values
+        pixel_values = pixel_values.to(self.device)
+        # prepare decoder inputs
+        decoder_input_ids = self.processor.tokenizer(TASK_PROMPT_NAME, add_special_tokens=False, return_tensors="pt").input_ids
+        decoder_input_ids = decoder_input_ids.to(self.device)
+        # autoregressively generate sequence
+        outputs = self.model.generate(
+                pixel_values,
+                decoder_input_ids=decoder_input_ids,
+                max_length=self.model.decoder.config.max_position_embeddings,
+                early_stopping=True,
+                pad_token_id=self.processor.tokenizer.pad_token_id,
+                eos_token_id=self.processor.tokenizer.eos_token_id,
+                use_cache=True,
+                num_beams=1,
+                bad_words_ids=[[self.processor.tokenizer.unk_token_id]],
+                return_dict_in_generate=True
+            )
+        # turn into JSON
+        seq = self.processor.batch_decode(outputs.sequences)[0]
+        seq = seq.replace(self.processor.tokenizer.eos_token, "").replace(self.processor.tokenizer.pad_token, "")
+        seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
+        seq = self.processor.token2json(seq)
+        return seq
+    def evaluate(self, dataset: Dataset, ground_truth_key: str = "ground_truth") -> Tuple[Dict[str, Any], List[Any]]:
+        output_list = []
+        accs = []
+        ted_accs = []
+        f1_accs = []
+        for idx, sample in tqdm(enumerate(dataset), total=len(dataset)):
+            seq = self.predict(sample["image"])
+            ground_truth = sample[ground_truth_key]
+            # Original JSON accuracy
+            evaluator = JSONParseEvaluator()
+            score = evaluator.cal_acc(seq, ground_truth)
+            accs.append(score)
+            output_list.append(seq)
+            # "TED" accuracy: normalized string edit distance (Levenshtein), not a true tree edit distance
+            # Convert predictions and ground truth to string format for comparison
+            pred_str = str(seq) if seq else ""
+            gt_str = str(ground_truth) if ground_truth else ""
+            # Calculate normalized edit distance (1 - normalized_edit_distance = accuracy)
+            if len(pred_str) == 0 and len(gt_str) == 0:
+                ted_acc = 1.0
+            elif len(pred_str) == 0 or len(gt_str) == 0:
+                ted_acc = 0.0
+            else:
+                edit_dist = edit_distance(pred_str, gt_str)
+                max_len = max(len(pred_str), len(gt_str))
+                ted_acc = 1 - (edit_dist / max_len)
+            ted_accs.append(ted_acc)
+            # F1 score over the sets of unique characters in each string
+            if len(pred_str) == 0 and len(gt_str) == 0:
+                f1_acc = 1.0
+            elif len(pred_str) == 0 or len(gt_str) == 0:
+                f1_acc = 0.0
+            else:
+                # Character-level precision and recall
+                pred_chars = set(pred_str)
+                gt_chars = set(gt_str)
+                if len(pred_chars) == 0:
+                    precision = 0.0
+                else:
+                    precision = len(pred_chars.intersection(gt_chars)) / len(pred_chars)
+                if len(gt_chars) == 0:
+                    recall = 0.0
+                else:
+                    recall = len(pred_chars.intersection(gt_chars)) / len(gt_chars)
+                if precision + recall == 0:
+                    f1_acc = 0.0
+                else:
+                    f1_acc = 2 * (precision * recall) / (precision + recall)
+            f1_accs.append(f1_acc)
+        scores = {
+            "accuracies": accs,
+            "mean_accuracy": np.mean(accs),
+            "ted_accuracies": ted_accs,
+            "mean_ted_accuracy": np.mean(ted_accs),
+            "f1_accuracies": f1_accs,
+            "mean_f1_accuracy": np.mean(f1_accs),
+            "length": len(accs)
+        }
+        return scores, output_list
+class DonutTrainer:
+    processor = None
+    max_length = 768
+    image_size = [1280, 960]
+    added_tokens = []
+    train_dataloader = None
+    val_dataloader = None
+    huggingface_model_id = None
+    class DonutDataset(Dataset):
+        """
+        PyTorch Dataset for Donut. This class takes a HuggingFace Dataset as input.
+        Each row consists of an image path (png/jpg/jpeg) and gt data (json/jsonl/txt),
+        and it will be converted into pixel_values (vectorized image) and labels (input_ids of the tokenized string).
+        Args:
+            dataset: HuggingFace DatasetDict containing the dataset to be used
+            max_length: the max number of tokens for the target sequences
+            split: whether to load "train", "validation" or "test" split
+            ignore_id: ignore_index for torch.nn.CrossEntropyLoss
+            task_start_token: the special token to be fed to the decoder to conduct the target task
+            prompt_end_token: the special token at the end of the sequences
+            sort_json_key: whether or not to sort the JSON keys
+        """
+        def __init__(
+            self,
+            dataset: DatasetDict,
+            ground_truth_key: str,
+            max_length: int,
+            split: str = "train",
+            ignore_id: int = -100,
+            task_start_token: str = "<s>",
+            prompt_end_token: str = None,
+            sort_json_key: bool = True,
+        ):
+            super().__init__()
+            self.dataset = dataset[split]
+            self.ground_truth_key = ground_truth_key
+            self.max_length = max_length
+            self.split = split
+            self.ignore_id = ignore_id
+            self.task_start_token = task_start_token
+            self.prompt_end_token = prompt_end_token if prompt_end_token else task_start_token
+            self.sort_json_key = sort_json_key
+            self.dataset_length = len(self.dataset)
+            self.gt_token_sequences = []
+            for sample in self.dataset:
+                ground_truth = sample[self.ground_truth_key]
+                self.gt_token_sequences.append(
+                    [
+                        self.json2token(
+                            gt_json,
+                            update_special_tokens_for_json_key=self.split == "train",
+                            sort_json_key=self.sort_json_key,
+                        )
+                        + DonutTrainer.processor.tokenizer.eos_token
+                        for gt_json in [ground_truth]  # load json from list of json
+                    ]
+                )
+            self.add_tokens([self.task_start_token, self.prompt_end_token])
+            self.prompt_end_token_id = DonutTrainer.processor.tokenizer.convert_tokens_to_ids(self.prompt_end_token)
+        def json2token(self, obj: Any, update_special_tokens_for_json_key: bool = True, sort_json_key: bool = True):
+            """
+            Convert an ordered JSON object into a token sequence
+            """
+            if type(obj) == dict:
+                if len(obj) == 1 and "text_sequence" in obj:
+                    return obj["text_sequence"]
+                else:
+                    output = ""
+                    if sort_json_key:
+                        keys = sorted(obj.keys(), reverse=True)
+                    else:
+                        keys = obj.keys()
+                    for k in keys:
+                        if update_special_tokens_for_json_key:
+                            self.add_tokens([fr"<s_{k}>", fr"</s_{k}>"])
+                        output += (
+                            fr"<s_{k}>"
+                            + self.json2token(obj[k], update_special_tokens_for_json_key, sort_json_key)
+                            + fr"</s_{k}>"
+                        )
+                    return output
+            elif type(obj) == list:
+                return r"<sep/>".join(
+                    [self.json2token(item, update_special_tokens_for_json_key, sort_json_key) for item in obj]
+                )
+            else:
+                obj = str(obj)
+                if f"<{obj}/>" in DonutTrainer.added_tokens:
+                    obj = f"<{obj}/>"  # for categorical special tokens
+                return obj
+        def add_tokens(self, list_of_tokens: List[str]):
+            """
+            Add special tokens to tokenizer and resize the token embeddings of the decoder
+            """
+            newly_added_num = DonutTrainer.processor.tokenizer.add_tokens(list_of_tokens)
+            if newly_added_num > 0:
+                DonutTrainer.model.decoder.resize_token_embeddings(len(DonutTrainer.processor.tokenizer))
+                DonutTrainer.added_tokens.extend(list_of_tokens)
+        def __len__(self) -> int:
+            return self.dataset_length
+        def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
+            """
+            Load image from image_path of given dataset_path and convert into input_tensor and labels
+            Convert gt data into input_ids (tokenized string)
+            Returns:
+                input_tensor : preprocessed image
+                input_ids : tokenized gt_data
+                labels : masked labels (model doesn't need to predict prompt and pad token)
+            """
+            sample = self.dataset[idx]
+            # inputs
+            pixel_values = DonutTrainer.processor(sample["image"], random_padding=self.split == "train", return_tensors="pt").pixel_values
+            pixel_values = pixel_values.squeeze()
+            # targets
+            target_sequence = random.choice(self.gt_token_sequences[idx])  # can be more than one, e.g., DocVQA Task 1
+            input_ids = DonutTrainer.processor.tokenizer(
+                target_sequence,
+                add_special_tokens=False,
+                max_length=self.max_length,
+                padding="max_length",
+                truncation=True,
+                return_tensors="pt",
+            )["input_ids"].squeeze(0)
+            labels = input_ids.clone()
+            labels[labels == DonutTrainer.processor.tokenizer.pad_token_id] = self.ignore_id  # model doesn't need to predict pad token
+            # labels[: torch.nonzero(labels == self.prompt_end_token_id).sum() + 1] = self.ignore_id  # model doesn't need to predict prompt (for VQA)
+            return pixel_values, labels, target_sequence
+    class DonutModelPLModule(pl.LightningModule):
+        def __init__(self, config, processor, model):
+            super().__init__()
+            self.config = config
+            self.processor = processor
+            self.model = model
+        def training_step(self, batch, batch_idx):
+            pixel_values, labels, _ = batch
+            outputs = self.model(pixel_values, labels=labels)
+            loss = outputs.loss
+            self.log("train_loss", loss)
+            return loss
+        def validation_step(self, batch, batch_idx, dataset_idx=0):
+            pixel_values, labels, answers = batch
+            batch_size = pixel_values.shape[0]
+            # we feed the prompt to the model
+            decoder_input_ids = torch.full((batch_size, 1), self.model.config.decoder_start_token_id, device=self.device)
+            outputs = self.model.generate(pixel_values,
+                                    decoder_input_ids=decoder_input_ids,
+                                    max_length=DonutTrainer.max_length,
+                                    early_stopping=True,
+                                    pad_token_id=self.processor.tokenizer.pad_token_id,
+                                    eos_token_id=self.processor.tokenizer.eos_token_id,
+                                    use_cache=True,
+                                    num_beams=1,
+                                    bad_words_ids=[[self.processor.tokenizer.unk_token_id]],
+                                    return_dict_in_generate=True,)
+            predictions = []
+            for seq in self.processor.tokenizer.batch_decode(outputs.sequences):
+                seq = seq.replace(self.processor.tokenizer.eos_token, "").replace(self.processor.tokenizer.pad_token, "")
+                seq = re.sub(r"<.*?>", "", seq, count=1).strip()  # remove first task start token
+                predictions.append(seq)
+            scores = []
+            for pred, answer in zip(predictions, answers):
+                pred = re.sub(r"(?:(?<=>) | (?=</s_))", "", pred)
+                # NOT NEEDED ANYMORE
+                # answer = re.sub(r"<.*?>", "", answer, count=1)
+                answer = answer.replace(self.processor.tokenizer.eos_token, "")
+                scores.append(edit_distance(pred, answer) / max(len(pred), len(answer)))
+                if self.config.get("verbose", False) and len(scores) == 1:
+                    print(f"Prediction: {pred}")
+                    print(f"    Answer: {answer}")
+                    print(f" Normed ED: {scores[0]}")
+            val_edit_distance = np.mean(scores)
+            self.log("val_edit_distance", val_edit_distance)
+            print(f"Validation Edit Distance: {val_edit_distance}")
+            return scores
+        def configure_optimizers(self):
+            # you could also add a learning rate scheduler if you want
+            optimizer = torch.optim.Adam(self.parameters(), lr=self.config.get("lr"))
+            return optimizer
+        def train_dataloader(self):
+            return DonutTrainer.train_dataloader
+        def val_dataloader(self):
+            return DonutTrainer.val_dataloader
+    class PushToHubCallback(Callback):
+        def on_train_epoch_end(self, trainer, pl_module):
+            print(f"Pushing model to the hub, epoch {trainer.current_epoch}")
+            pl_module.model.push_to_hub(DonutTrainer.huggingface_model_id, commit_message=f"Training in progress, epoch {trainer.current_epoch}")
+            self._upload_logs(trainer.logger.log_dir, trainer.current_epoch)
+        def on_train_end(self, trainer, pl_module):
+            print("Pushing model to the hub after training")
+            pl_module.processor.push_to_hub(DonutTrainer.huggingface_model_id, commit_message="Training done")
+            pl_module.model.push_to_hub(DonutTrainer.huggingface_model_id, commit_message="Training done")
+            self._upload_logs(trainer.logger.log_dir, "final")
+        def _upload_logs(self, log_dir: str, epoch_info):
+            try:
+                print(f"Attempting to upload logs from: {log_dir}")
+                upload_folder(folder_path=log_dir, repo_id=DonutTrainer.huggingface_model_id,
+                            path_in_repo="tensorboard_logs",
+                            commit_message=f"Upload logs - epoch {epoch_info}", ignore_patterns=["*.tmp", "*.lock"])
+                print(f"Successfully uploaded logs for epoch {epoch_info}")
+            except Exception as e:
+                print(f"Failed to upload logs: {e}")
+                pass
+    @classmethod
+    def train(
+        cls,
+        dataset: DatasetDict,
+        pretrained_model_repo_id: str,
+        huggingface_model_id: str,
+        epochs: int,
+        train_batch_size: int,
+        val_batch_size: int,
+        learning_rate: float,
+        val_check_interval: float,
+        check_val_every_n_epoch: int,
+        gradient_clip_val: float,
+        num_training_samples_per_epoch: int,
+        num_nodes: int,
+        warmup_steps: int,
+        ground_truth_key: str = "ground_truth"
+    ):
+        cls.huggingface_model_id = huggingface_model_id
+        config = VisionEncoderDecoderConfig.from_pretrained(pretrained_model_repo_id)
+        config.encoder.image_size = cls.image_size
+        config.decoder.max_length = cls.max_length
+        cls.processor = DonutProcessor.from_pretrained(pretrained_model_repo_id)
+        cls.model = VisionEncoderDecoderModel.from_pretrained(pretrained_model_repo_id, config=config)
+        cls.processor.image_processor.size = cls.image_size[::-1]
+        cls.processor.image_processor.do_align_long_axis = False
+        train_dataset = cls.DonutDataset(
+            dataset=dataset,
+            ground_truth_key=ground_truth_key,
+            max_length=cls.max_length,
+            split="train",
+            task_start_token=TASK_PROMPT_NAME,
+            prompt_end_token=TASK_PROMPT_NAME,
+            sort_json_key=True
+        )
+        val_dataset = cls.DonutDataset(
+            dataset=dataset,
+            ground_truth_key=ground_truth_key,
+            max_length=cls.max_length,
+            split="validation",
+            task_start_token=TASK_PROMPT_NAME,
+            prompt_end_token=TASK_PROMPT_NAME,
+            sort_json_key=True
+        )
+        cls.model.config.pad_token_id = cls.processor.tokenizer.pad_token_id
+        cls.model.config.decoder_start_token_id = cls.processor.tokenizer.convert_tokens_to_ids([TASK_PROMPT_NAME])[0]
+        cls.train_dataloader = DataLoader(train_dataset, batch_size=train_batch_size, shuffle=True, num_workers=4)
+        cls.val_dataloader = DataLoader(val_dataset, batch_size=val_batch_size, shuffle=False, num_workers=4)
+        config = {
+            "max_epochs": epochs,
+            "val_check_interval": val_check_interval, # how many times we want to validate during an epoch
+            "check_val_every_n_epoch": check_val_every_n_epoch,
+            "gradient_clip_val": gradient_clip_val,
+            "num_training_samples_per_epoch": num_training_samples_per_epoch,
+            "lr": learning_rate,
+            "train_batch_sizes": [train_batch_size],
+            "val_batch_sizes": [val_batch_size],
+            # "seed":2022,
+            "num_nodes": num_nodes,
+            "warmup_steps": warmup_steps, # 10%
+            "result_path": "./.checkpoints",
+            "verbose": True,
+        }
+        model_module = cls.DonutModelPLModule(config, cls.processor, cls.model)
+        device = (
+            "cuda"
+            if torch.cuda.is_available()
+            else "mps" if torch.backends.mps.is_available() else "cpu"
+        )
+        print(f"Using {device} device")
+        trainer = pl.Trainer(
+                accelerator="gpu" if device == "cuda" else "mps" if device == "mps" else "cpu",
+                devices=1,  # a single device on any accelerator; 0 is not a valid device count
+                max_epochs=config.get("max_epochs"),
+                val_check_interval=config.get("val_check_interval"),
+                check_val_every_n_epoch=config.get("check_val_every_n_epoch"),
+                gradient_clip_val=config.get("gradient_clip_val"),
+                precision=16 if device == "cuda" else 32, # we'll use mixed precision if device == "cuda"
+                num_sanity_val_steps=0,
+                logger=TensorBoardLogger(save_dir="./.checkpoints", name="donut_training", version=None),
+                callbacks=[cls.PushToHubCallback()]
+        )
+        trainer.fit(model_module)
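To make the training target concrete, here is a stripped-down sketch of the `json2token` conversion from `DonutDataset`, written as a free function with no tokenizer or model. It leaves out the special-token registration and the categorical-token branch, so it only shows how nested JSON flattens into the tagged sequence the decoder is trained on. The sample keys (`name`, `dishes`, `price`) are hypothetical.

```python
from typing import Any


def json2token(obj: Any, sort_json_key: bool = True) -> str:
    """Flatten nested JSON into a Donut-style tagged string (simplified sketch)."""
    if isinstance(obj, dict):
        if len(obj) == 1 and "text_sequence" in obj:
            return obj["text_sequence"]
        # Keys are visited in reverse-sorted order, as in the original method.
        keys = sorted(obj.keys(), reverse=True) if sort_json_key else list(obj.keys())
        return "".join(
            f"<s_{k}>{json2token(obj[k], sort_json_key)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        # List items are joined with a separator token.
        return "<sep/>".join(json2token(item, sort_json_key) for item in obj)
    return str(obj)


# Hypothetical sample record, for illustration only.
sample = {
    "name": "Cafe",
    "dishes": [{"name": "Tea", "price": 30}, {"name": "Pie", "price": 55}],
}
print(json2token(sample))
```

Because every key becomes a `<s_key>` / `</s_key>` pair, the full method also registers those tags as tokenizer tokens on the training split, which is why the decoder embeddings are resized there.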

menu/llm/__init__.py ADDED

	@@ -0,0 +1,2 @@


+from .gemini import GeminiAPI
+from .openai import OpenAIAPI

menu/llm/base.py ADDED

	@@ -0,0 +1,9 @@

+from abc import ABC, abstractmethod
+import numpy as np
+class LLMBase(ABC):
+    @classmethod
+    @abstractmethod
+    def call(cls, image: np.ndarray, model: str, token: str) -> dict:
+        raise NotImplementedError
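For reference, a minimal sketch of the abstract-classmethod pattern this base class relies on. `EchoAPI` is a hypothetical subclass used only for illustration; the point is that a classmethod receives `cls` as its first parameter, and that `ABC` refuses to instantiate a class whose abstract methods are not overridden.

```python
from abc import ABC, abstractmethod


class LLMBase(ABC):
    @classmethod
    @abstractmethod
    def call(cls, image, model: str, token: str) -> dict:
        raise NotImplementedError


class EchoAPI(LLMBase):  # hypothetical backend, for illustration only
    @classmethod
    def call(cls, image, model: str, token: str) -> dict:
        # A real backend would send the image to an API here.
        return {"backend": cls.__name__, "model": model}


# Classmethods are called on the class itself; no instance is needed.
print(EchoAPI.call(None, "demo-model", "token"))

try:
    LLMBase()  # abstract method not overridden, so instantiation fails
except TypeError as err:
    print(f"TypeError: {err}")
```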

menu/llm/gemini.py ADDED

	@@ -0,0 +1,36 @@

+import json
+import numpy as np
+from PIL import Image
+from google import genai
+from google.genai import types
+from .base import LLMBase
+FUNCTION_CALL = json.load(open("tools/schema_gemini.json", "r"))
+class GeminiAPI(LLMBase):
+    @classmethod
+    def call(cls, image: np.ndarray, model: str, token: str) -> dict:
+        client = genai.Client(api_key=token) # Initialize the client with the API key
+        encode_img = Image.fromarray(image) # Convert the image for the API
+        config = types.GenerateContentConfig(
+            tools=[types.Tool(function_declarations=[FUNCTION_CALL])],
+            tool_config={
+                "function_calling_config": {
+                    "mode": "ANY",
+                    "allowed_function_names": [FUNCTION_CALL["name"]]
+                }
+            }
+        )
+        response = client.models.generate_content(
+            model=model,
+            contents=[encode_img],
+            config=config
+        )
+        if response.candidates[0].content.parts[0].function_call:
+            function_call = response.candidates[0].content.parts[0].function_call
+            return function_call.args
+        return {}
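
The response handling above reads `candidates[0].content.parts[0]` directly, which assumes the function call is always the first part of the first candidate. A more defensive variant, sketched here under the assumption that the response exposes `candidates`, `content.parts`, and `function_call.args` as in the code above, scans all parts and returns the first function call it finds. `first_function_call_args` is a hypothetical helper, and the example fakes the response with `SimpleNamespace` so it runs without the SDK.

```python
from types import SimpleNamespace


def first_function_call_args(response) -> dict:
    """Return the args of the first function call in a response, or {} if none."""
    for candidate in getattr(response, "candidates", None) or []:
        content = getattr(candidate, "content", None)
        for part in getattr(content, "parts", None) or []:
            function_call = getattr(part, "function_call", None)
            if function_call is not None:
                return dict(function_call.args or {})
    return {}


if __name__ == "__main__":
    # Fake response: a text-only part followed by a part carrying the function call.
    text_part = SimpleNamespace(function_call=None)
    call_part = SimpleNamespace(function_call=SimpleNamespace(args={"restaurant": "Demo"}))
    fake = SimpleNamespace(candidates=[SimpleNamespace(content=SimpleNamespace(parts=[text_part, call_part]))])
    print(first_function_call_args(fake))
```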

menu/llm/openai.py ADDED Viewed

	@@ -0,0 +1,39 @@

+import json
+import base64
+from io import BytesIO
+import numpy as np
+from PIL import Image
+from openai import OpenAI
+from .base import LLMBase
+with open("tools/schema_openai.json", "r") as schema_file: # close the file handle after loading
+    FUNCTION_CALL = json.load(schema_file)
+class OpenAIAPI(LLMBase):
+    @classmethod
+    def call(cls, image: np.ndarray, model: str, token: str) -> dict:
+        client = OpenAI(api_key=token)  # Initialize the client with the API key
+        buffer = BytesIO()
+        Image.fromarray(image).save(buffer, format="JPEG")
+        encode_img = base64.b64encode(buffer.getvalue()).decode("utf-8") # Convert the image for the API
+        response = client.responses.create(
+            model=model,
+            input=[
+                {
+                    "role": "user",
+                    "content": [
+                        {
+                            "type": "input_image",
+                            "image_url": f"data:image/jpeg;base64,{encode_img}",
+                        },
+                    ],
+                }
+            ],
+            tools=[FUNCTION_CALL],
+        )
+        # Scan every output item: the function call is not guaranteed to be the first one
+        for item in (response.output if response else None) or []:
+            if hasattr(item, "arguments"):
+                return json.loads(item.arguments)
+        return {}
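
The image is passed to the API as a base64 data URL. The encoding step on its own looks like the sketch below. `to_data_url` is a hypothetical helper that takes already-encoded image bytes (for example the contents of the `BytesIO` buffer above), so it depends only on the standard library.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Wrap raw image bytes in a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


if __name__ == "__main__":
    # The first three bytes of a JPEG file, used here as a stand-in for real image data.
    print(to_data_url(b"\xff\xd8\xff"))
```

The MIME type in the URL should match the format the image was saved in, which is JPEG in the code above.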

menu/utils.py ADDED Viewed

	@@ -0,0 +1,48 @@

+from typing import Optional
+from datasets import Dataset, DatasetDict
+def split_dataset(
+    dataset: Dataset,
+    train: float,
+    validation: float,
+    test: float,
+    seed: Optional[int] = None
+) -> DatasetDict:
+    """
+    Split a single-split Hugging Face Dataset into train/validation/test subsets.
+    Args:
+        dataset (Dataset): The input dataset (e.g. load_dataset(...)['train']).
+        train (float): Proportion of data for the train split (0 < train < 1).
+        validation (float): Proportion of data for the validation split (0 < validation < 1).
+        test (float): Proportion of data for the test split (0 < test < 1).
+                            Must satisfy train + validation + test == 1.0.
+        seed (Optional[int]): Random seed for reproducibility (default: None).
+    Returns:
+        DatasetDict: A dictionary with keys "train", "validation", and "test".
+    """
+    # Verify ratios sum to 1.0
+    total = train + validation + test
+    if abs(total - 1.0) > 1e-8:
+        raise ValueError(f"train + validation + test must equal 1.0 (got {total})")
+    # First split: extract train vs. temp (validation + test)
+    temp_size = validation + test
+    split_1 = dataset.train_test_split(test_size=temp_size, seed=seed)
+    train_ds = split_1["train"]
+    temp_ds  = split_1["test"]
+    # Second split: divide temp into validation vs. test
+    relative_test_size = test / temp_size
+    split_2 = temp_ds.train_test_split(test_size=relative_test_size, seed=seed)
+    validation_ds  = split_2["train"]
+    test_ds = split_2["test"]
+    # Return a DatasetDict with all three splits
+    return DatasetDict({
+        "train":      train_ds,
+        "validation": validation_ds,
+        "test":       test_ds,
+    })
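
The two-stage split works because the second split is expressed relative to the held-out part: with train/validation/test = 0.8/0.1/0.1, the first split holds out 0.2 of the data and the second split takes 0.1 / 0.2 = 0.5 of that remainder as the test set. The sketch below reproduces only this arithmetic. `two_stage_fractions` is a hypothetical helper that does not touch the `datasets` library or its rounding of sample counts.

```python
import math


def two_stage_fractions(train: float, validation: float, test: float) -> tuple:
    """Return (temp_size, relative_test_size) for a two-stage train/validation/test split."""
    total = train + validation + test
    if not math.isclose(total, 1.0, abs_tol=1e-8):
        raise ValueError(f"train + validation + test must equal 1.0 (got {total})")
    temp_size = validation + test            # fraction held out by the first split
    relative_test_size = test / temp_size    # fraction of the held-out part used for test
    return temp_size, relative_test_size


if __name__ == "__main__":
    temp_size, relative_test_size = two_stage_fractions(0.8, 0.1, 0.1)
    # temp_size * relative_test_size recovers the requested overall test fraction.
    print(temp_size, relative_test_size, temp_size * relative_test_size)
```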

pyproject.toml ADDED Viewed

	@@ -0,0 +1,27 @@

+[project]
+authors = [{name = "ryanlinjui", email = "ryanlinjui@gmail.com"}]
+name = "menu-text-detection"
+version = "0.1.0"
+description = "Extract structured menu information from images into JSON using a fine-tuned Donut E2E model."
+readme = "README.md"
+requires-python = "==3.11.*"
+dependencies = [
+    "datasets>=3.6.0",
+    "dotenv>=0.9.9",
+    "google-genai>=1.14.0",
+    "gradio>=5.29.0",
+    "huggingface-hub>=0.31.1",
+    "matplotlib>=3.10.1",
+    "nltk>=3.9.1",
+    "notebook>=7.4.2",
+    "openai>=1.77.0",
+    "pillow>=11.2.1",
+    "pillow-heif>=0.22.0",
+    "protobuf>=6.30.2",
+    "pytorch-lightning>=2.5.2",
+    "sentencepiece>=0.2.0",
+    "tensorboardx>=2.6.2.2",
+    "transformers==4.49",
+    "torch==2.4.1",
+    "donut-python>=1.0.9",
+]

requirements.txt ADDED Viewed

	@@ -0,0 +1,183 @@

+aiofiles==24.1.0
+aiohappyeyeballs==2.6.1
+aiohttp==3.11.18
+aiosignal==1.3.2
+annotated-types==0.7.0
+anyio==4.9.0
+appnope==0.1.4
+argon2-cffi==23.1.0
+argon2-cffi-bindings==21.2.0
+arrow==1.3.0
+asttokens==3.0.0
+async-lru==2.0.5
+attrs==25.3.0
+babel==2.17.0
+beautifulsoup4==4.13.4
+bleach==6.2.0
+cachetools==5.5.2
+certifi==2025.4.26
+cffi==1.17.1
+charset-normalizer==3.4.2
+click==8.1.8
+comm==0.2.2
+contourpy==1.3.2
+cycler==0.12.1
+datasets==3.6.0
+debugpy==1.8.14
+decorator==5.2.1
+defusedxml==0.7.1
+dill==0.3.8
+distro==1.9.0
+donut-python==1.0.9
+dotenv==0.9.9
+executing==2.2.0
+fastapi==0.115.12
+fastjsonschema==2.21.1
+ffmpy==0.5.0
+filelock==3.18.0
+fonttools==4.57.0
+fqdn==1.5.1
+frozenlist==1.6.0
+fsspec==2025.3.0
+google-auth==2.40.1
+google-genai==1.14.0
+gradio==5.29.0
+gradio-client==1.10.0
+groovy==0.1.2
+h11==0.16.0
+hf-xet==1.1.0
+httpcore==1.0.9
+httpx==0.28.1
+huggingface-hub==0.31.1
+idna==3.10
+ipykernel==6.29.5
+ipython==9.2.0
+ipython-pygments-lexers==1.1.1
+isoduration==20.11.0
+jedi==0.19.2
+jinja2==3.1.6
+jiter==0.9.0
+joblib==1.5.0
+json5==0.12.0
+jsonpointer==3.0.0
+jsonschema==4.23.0
+jsonschema-specifications==2025.4.1
+jupyter-client==8.6.3
+jupyter-core==5.7.2
+jupyter-events==0.12.0
+jupyter-lsp==2.2.5
+jupyter-server==2.15.0
+jupyter-server-terminals==0.5.3
+jupyterlab==4.4.2
+jupyterlab-pygments==0.3.0
+jupyterlab-server==2.27.3
+kiwisolver==1.4.8
+lightning-utilities==0.14.3
+markdown-it-py==3.0.0
+markupsafe==3.0.2
+matplotlib==3.10.1
+matplotlib-inline==0.1.7
+mdurl==0.1.2
+mistune==3.1.3
+mpmath==1.3.0
+multidict==6.4.3
+multiprocess==0.70.16
+munch==4.0.0
+nbclient==0.10.2
+nbconvert==7.16.6
+nbformat==5.10.4
+nest-asyncio==1.6.0
+networkx==3.4.2
+nltk==3.9.1
+notebook==7.4.2
+notebook-shim==0.2.4
+numpy==2.2.5
+openai==1.77.0
+orjson==3.10.18
+overrides==7.7.0
+packaging==25.0
+pandas==2.2.3
+pandocfilters==1.5.1
+parso==0.8.4
+pexpect==4.9.0
+pillow==11.2.1
+pillow-heif==0.22.0
+platformdirs==4.3.8
+prometheus-client==0.21.1
+prompt-toolkit==3.0.51
+propcache==0.3.1
+protobuf==6.30.2
+psutil==7.0.0
+ptyprocess==0.7.0
+pure-eval==0.2.3
+pyarrow==20.0.0
+pyasn1==0.6.1
+pyasn1-modules==0.4.2
+pycparser==2.22
+pydantic==2.11.4
+pydantic-core==2.33.2
+pydub==0.25.1
+pygments==2.19.1
+pyparsing==3.2.3
+python-dateutil==2.9.0.post0
+python-dotenv==1.1.0
+python-json-logger==3.3.0
+python-multipart==0.0.20
+pytorch-lightning==2.5.2
+pytz==2025.2
+pyyaml==6.0.2
+pyzmq==26.4.0
+referencing==0.36.2
+regex==2024.11.6
+requests==2.32.3
+rfc3339-validator==0.1.4
+rfc3986-validator==0.1.1
+rich==14.0.0
+rpds-py==0.24.0
+rsa==4.9.1
+ruamel-yaml==0.18.14
+ruamel-yaml-clib==0.2.12
+ruff==0.11.8
+safehttpx==0.1.6
+safetensors==0.5.3
+sconf==0.2.5
+semantic-version==2.10.0
+send2trash==1.8.3
+sentencepiece==0.2.0
+setuptools==80.3.1
+shellingham==1.5.4
+six==1.17.0
+sniffio==1.3.1
+soupsieve==2.7
+stack-data==0.6.3
+starlette==0.46.2
+sympy==1.14.0
+tensorboardx==2.6.2.2
+terminado==0.18.1
+timm==1.0.16
+tinycss2==1.4.0
+tokenizers==0.21.1
+tomlkit==0.13.2
+torch==2.4.1
+torchmetrics==1.7.3
+torchvision==0.19.1
+tornado==6.4.2
+tqdm==4.67.1
+traitlets==5.14.3
+transformers==4.49.0
+typer==0.15.3
+types-python-dateutil==2.9.0.20241206
+typing-extensions==4.13.2
+typing-inspection==0.4.0
+tzdata==2025.2
+uri-template==1.3.0
+urllib3==2.4.0
+uvicorn==0.34.2
+wcwidth==0.2.13
+webcolors==24.11.1
+webencodings==0.5.1
+websocket-client==1.8.0
+websockets==15.0.1
+xxhash==3.5.0
+yarl==1.20.0
+zss==1.2.0

tools/schema_gemini.json ADDED Viewed

	@@ -0,0 +1,44 @@

+{
+    "name": "extract_menu_data",
+    "description": "Extract structured menu information from images.",
+    "parameters": {
+        "type": "object",
+        "properties": {
+            "restaurant": {
+                "type": "string",
+                "description": "Name of the restaurant. If the name is not available, it should be ''."
+            },
+            "address": {
+                "type": "string",
+                "description": "Address of the restaurant. If the address is not available, it should be ''."
+            },
+            "phone": {
+                "type": "string",
+                "description": "Phone number of the restaurant. If the phone number is not available, it should be ''."
+            },
+            "business_hours": {
+                "type": "string",
+                "description": "Business hours of the restaurant. If the business hours are not available, it should be ''."
+            },
+            "dishes": {
+                "type": "array",
+                "items": {
+                    "type": "object",
+                    "properties": {
+                        "name": {
+                            "type": "string",
+                            "description": "Name of the menu item."
+                        },
+                        "price": {
+                            "type": "string",
+                            "description": "Price of the menu item. If the price is not available, it should be '-1'."
+                        }
+                    },
+                    "required": ["name", "price"]
+                },
+                "description": "List of dishes on the menu."
+            }
+        },
+        "required": ["restaurant", "address", "phone", "business_hours", "dishes"]
+    }
+}

tools/schema_openai.json ADDED Viewed

	@@ -0,0 +1,47 @@

+{
+    "type": "function",
+    "name": "extract_menu_data",
+    "description": "Extract structured menu information from images.",
+    "parameters": {
+        "type": "object",
+        "properties": {
+            "restaurant": {
+                "type": "string",
+                "description": "Name of the restaurant. If the name is not available, it should be ''."
+            },
+            "address": {
+                "type": "string",
+                "description": "Address of the restaurant. If the address is not available, it should be ''."
+            },
+            "phone": {
+                "type": "string",
+                "description": "Phone number of the restaurant. If the phone number is not available, it should be ''."
+            },
+            "business_hours": {
+                "type": "string",
+                "description": "Business hours of the restaurant. If the business hours are not available, it should be ''."
+            },
+            "dishes": {
+                "type": "array",
+                "items": {
+                    "type": "object",
+                    "properties": {
+                        "name": {
+                            "type": "string",
+                            "description": "Name of the menu item."
+                        },
+                        "price": {
+                            "type": "string",
+                            "description": "Price of the menu item. If the price is not available, it should be '-1'."
+                        }
+                    },
+                    "required": ["name", "price"],
+                    "additionalProperties": false
+                },
+                "description": "List of dishes on the menu."
+            }
+        },
+        "required": ["restaurant", "address", "phone", "business_hours", "dishes"],
+        "additionalProperties": false
+    }
+}
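
Both schemas mark every field as required, so a returned menu can be sanity-checked before it is written to `metadata.jsonl`. The sketch below is one way to do that with plain Python. `missing_required` is a hypothetical helper that only checks for required keys (it is not a full JSON Schema validator), and the schema in the example is a trimmed copy of the one above.

```python
def missing_required(value: dict, schema: dict) -> list:
    """Return paths of required keys that are absent from value."""
    missing = [key for key in schema.get("required", []) if key not in value]
    for key, sub_schema in schema.get("properties", {}).items():
        if key not in value:
            continue
        if sub_schema.get("type") == "object" and isinstance(value[key], dict):
            missing += [f"{key}.{m}" for m in missing_required(value[key], sub_schema)]
        elif sub_schema.get("type") == "array" and isinstance(value[key], list):
            item_schema = sub_schema.get("items", {})
            for index, item in enumerate(value[key]):
                if isinstance(item, dict):
                    missing += [f"{key}[{index}].{m}" for m in missing_required(item, item_schema)]
    return missing


if __name__ == "__main__":
    # Trimmed version of the "parameters" object from the schema above.
    parameters = {
        "type": "object",
        "properties": {
            "restaurant": {"type": "string"},
            "dishes": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {"name": {"type": "string"}, "price": {"type": "string"}},
                    "required": ["name", "price"],
                },
            },
        },
        "required": ["restaurant", "dishes"],
    }
    print(missing_required({"restaurant": "Demo", "dishes": [{"name": "Tea"}]}, parameters))
```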

train.ipynb ADDED Viewed

	@@ -0,0 +1,235 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Login to HuggingFace (just login once)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import interpreter_login\n",
+    "interpreter_login()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Collect Menu Image Datasets\n",
+    "- Use `metadata.jsonl` to store each image's ground truth. See [here](https://github.com/ryanlinjui/menu-text-detection/tree/main/examples) for examples.\n",
+    "- After finishing, push to HuggingFace Datasets.\n",
+    "- For labeling:\n",
+    "    - [Google AI Studio](https://aistudio.google.com) or [OpenAI ChatGPT](https://chatgpt.com).\n",
+    "    - Use function calling via the API: start the Gradio app locally or visit [here](https://huggingface.co/spaces/ryanlinjui/menu-text-detection).\n",
+    "\n",
+    "### Menu Type\n",
+    "- **h**: horizontal menu\n",
+    "- **v**: vertical menu\n",
+    "- **d**: document-style menu\n",
+    "- **s**: in-scene menu (non-document style)\n",
+    "- **i**: irregular menu (menu with irregular text layout)\n",
+    "\n",
+    "> Please see the [examples](https://github.com/ryanlinjui/menu-text-detection/tree/main/examples) for more details."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import json\n",
+    "\n",
+    "import numpy as np\n",
+    "from PIL import Image\n",
+    "from pillow_heif import register_heif_opener\n",
+    "\n",
+    "from menu.llm import (\n",
+    "    GeminiAPI,\n",
+    "    OpenAIAPI\n",
+    ")\n",
+    "\n",
+    "IMAGE_DIR = \"datasets/images\"       # set your image directory here\n",
+    "SELECTED_MODEL = \"gemini-2.5-flash\" # set the model name here; see MODEL_LIST in app.py for more options\n",
+    "API_TOKEN = \"\"                      # set your API token here\n",
+    "SELECTED_FUNCTION = GeminiAPI       # set \"GeminiAPI\" or \"OpenAIAPI\"\n",
+    "\n",
+    "register_heif_opener()\n",
+    "\n",
+    "for file in os.listdir(IMAGE_DIR):\n",
+    "    print(f\"Processing image: {file}\")\n",
+    "    try:\n",
+    "        image = np.array(Image.open(os.path.join(IMAGE_DIR, file)))\n",
+    "        data = {\n",
+    "            \"file_name\": file,\n",
+    "            \"menu\": SELECTED_FUNCTION.call(image, SELECTED_MODEL, API_TOKEN)\n",
+    "        }\n",
+    "        with open(os.path.join(IMAGE_DIR, \"metadata.jsonl\"), \"a\", encoding=\"utf-8\") as metaf:\n",
+    "            metaf.write(json.dumps(data, ensure_ascii=False, sort_keys=True) + \"\\n\")\n",
+    "    except Exception as e:\n",
+    "        print(f\"Skipping invalid image '{file}': {e}\")\n",
+    "        continue"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Push Datasets to HuggingFace"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset\n",
+    "\n",
+    "dataset = load_dataset(path=\"datasets/menu-zh-TW\")      # load the dataset from the local directory containing metadata.jsonl and the image files\n",
+    "dataset.push_to_hub(repo_id=\"ryanlinjui/menu-zh-TW\")    # push to the huggingface dataset hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Prepare the dataset for training"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from menu.utils import split_dataset\n",
+    "from datasets import load_dataset\n",
+    "\n",
+    "dataset = load_dataset(path=\"ryanlinjui/menu-zh-TW\") # set your dataset repo id for training\n",
+    "dataset = split_dataset(dataset[\"train\"], train=0.8, validation=0.1, test=0.1, seed=42) # (optional) use it if your dataset is not split into train/validation/test\n",
+    "print(f\"Dataset split: {len(dataset['train'])} train, {len(dataset['validation'])} validation, {len(dataset['test'])} test\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Fine-tune Donut Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import logging\n",
+    "from menu.donut import DonutTrainer\n",
+    "\n",
+    "logging.getLogger(\"transformers\").setLevel(logging.ERROR) # filter output message from transformers\n",
+    "\n",
+    "DonutTrainer.train(\n",
+    "    dataset=dataset,\n",
+    "    pretrained_model_repo_id=\"naver-clova-ix/donut-base\",        # set your pretrained model repo id for fine-tuning\n",
+    "    ground_truth_key=\"menu\",                                     # set your ground truth key for training\n",
+    "    huggingface_model_id=\"ryanlinjui/donut-base-finetuned-menu\", # set your huggingface model repo id for saving / pushing to the hub\n",
+    "    epochs=15,                                                   # set your training epochs\n",
+    "    train_batch_size=8,                                          # set your training batch size\n",
+    "    val_batch_size=1,                                            # set your validation batch size\n",
+    "    learning_rate=3e-5,                                          # set your learning rate\n",
+    "    val_check_interval=0.5,                                      # validate after every 0.5 of a training epoch, i.e. twice per epoch\n",
+    "    check_val_every_n_epoch=1,                                   # run validation every N epochs\n",
+    "    gradient_clip_val=1.0,                                       # gradient clipping value for training stability\n",
+    "    num_training_samples_per_epoch=198,                          # set num_training_samples_per_epoch = training set size\n",
+    "    num_nodes=1,                                                 # number of nodes for distributed training\n",
+    "    warmup_steps=75                                              # number of warmup steps for the learning rate scheduler; 198/8*30/10 ~ 75 is 10% of the steps for 30 epochs (about 37 for epochs=15)\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Evaluate Donut Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from datasets import load_dataset\n",
+    "\n",
+    "from menu.utils import split_dataset\n",
+    "from menu.donut import DonutFinetuned\n",
+    "\n",
+    "dataset = load_dataset(\"ryanlinjui/menu-zh-TW\")\n",
+    "dataset = split_dataset(dataset[\"train\"], train=0.8, validation=0.1, test=0.1, seed=42)  # (optional) use it if your dataset is not split into train/validation/test\n",
+    "donut_finetuned = DonutFinetuned(pretrained_model_repo_id=\"ryanlinjui/donut-base-finetuned-menu\")\n",
+    "scores, output_list = donut_finetuned.evaluate(dataset=dataset[\"test\"], ground_truth_key=\"menu\")\n",
+    "\n",
+    "print(\"Evaluation scores:\")\n",
+    "for key, value in scores.items():\n",
+    "    print(f\"{key}: {value}\")\n",
+    "\n",
+    "print(\"\\nSample outputs:\")\n",
+    "for output in output_list[:5]:\n",
+    "    print(json.dumps(output, ensure_ascii=False, indent=4))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Test Donut Model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from PIL import Image\n",
+    "from menu.donut import DonutFinetuned\n",
+    "\n",
+    "image = Image.open(\"./examples/menu-hd.jpg\")\n",
+    "\n",
+    "donut_finetuned = DonutFinetuned(pretrained_model_repo_id=\"ryanlinjui/donut-base-finetuned-menu\")\n",
+    "outputs = donut_finetuned.predict(image=image)\n",
+    "print(outputs)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "menu-text-detection",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
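
The notebook passes `warmup_steps=75` with the note `198/8*30/10, 10%`, which corresponds to 10% of the optimizer steps for a 30-epoch run, while the same cell sets `epochs=15`. The arithmetic can be made explicit as below. This is a sketch under stated assumptions: `warmup_steps` is a hypothetical helper, and it assumes one optimizer step per batch (single device, no gradient accumulation) with the last partial batch kept.

```python
import math


def warmup_steps(num_samples: int, batch_size: int, epochs: int, warmup_percent: int = 10) -> int:
    """Number of warmup steps as a percentage of the total training steps."""
    steps_per_epoch = math.ceil(num_samples / batch_size)  # the last partial batch still counts
    total_steps = steps_per_epoch * epochs
    return total_steps * warmup_percent // 100


if __name__ == "__main__":
    print(warmup_steps(198, 8, epochs=30))  # value used in the notebook
    print(warmup_steps(198, 8, epochs=15))  # value for the epochs actually configured
```

Deriving the value from `epochs` keeps the warmup at the intended share of training when the epoch count changes.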

uv.lock ADDED Viewed

The diff for this file is too large to render. See raw diff