Spaces:

Liori25
/

CookBookAI

Liori25 committed about 1 month ago

Commit

4a6ebfb

verified ·

1 Parent(s): 06e7044

Upload IO_Pipeline.ipynb

IO_Pipeline.ipynb +639 -0

IO_Pipeline.ipynb ADDED

	@@ -0,0 +1,639 @@

+{
+  "nbformat": 4,
+  "nbformat_minor": 0,
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "name": "python3",
+      "display_name": "Python 3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "source": [
+        "# Part 4: Input-Output Pipeline"
+      ],
+      "metadata": {
+        "id": "JyoRTpDES8Tq"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "- Input: Image of a handwritten recipe\n",
+        "- Output: Text of the recipe"
+      ],
+      "metadata": {
+        "id": "-Ms7ezZJTepY"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from google.colab import files\n",
+        "\n",
+        "print(\"Please upload your handwritten recipe image (e.g. 'Recipe.jfif') from your computer:\")\n",
+        "uploaded = files.upload()"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 88
+        },
+        "id": "CfK_Cy_fUFnK",
+        "outputId": "b73eaa28-ad59-4326-c089-28e251ef16a5"
+      },
+      "execution_count": 4,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Please upload 'RecipeData_10K.csv' from your computer:\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "text/plain": [
+              "<IPython.core.display.HTML object>"
+            ],
+            "text/html": [
+              "\n",
+              "     <input type=\"file\" id=\"files-1101386e-69b8-4d66-b4de-58e3de6dcab7\" name=\"files[]\" multiple disabled\n",
+              "        style=\"border:none\" />\n",
+              "     <output id=\"result-1101386e-69b8-4d66-b4de-58e3de6dcab7\">\n",
+              "      Upload widget is only available when the cell has been executed in the\n",
+              "      current browser session. Please rerun this cell to enable.\n",
+              "      </output>\n"
+            ]
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Saving Recipe.jfif to Recipe.jfif\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "UoYUP6WTUmpc"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## OLD VERSION\n",
+        "To document my process throughout the project, I kept this part even though I won't be using it in the end, because the model used here (TrOCR, loaded via `TrOCRProcessor`) didn't achieve good results.\n",
+        "\n",
+        "You may skip this part to see the final IO pipeline in the next part."
+      ],
+      "metadata": {
+        "id": "hq0kcSzjS6Tr"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "from transformers import TrOCRProcessor, VisionEncoderDecoderModel\n",
+        "from PIL import Image\n",
+        "import torch\n",
+        "import numpy as np\n",
+        "import os  # used for os.path.basename when printing the file name"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "AWlqrv7kTBrE",
+        "outputId": "fa4af507-d82a-4606-880d-bca5b8ff5bc1"
+      },
+      "execution_count": 1,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "WARNING:torchao.kernel.intmm:Warning: Detected no triton, on systems without Triton certain kernels will not work\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 6,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "EDaLqbvsSqvq",
+        "outputId": "55a5db5f-00d3-4396-8ea1-3e9bfbbecbbd"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-large-handwritten and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']\n",
+            "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
+          ]
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "📄 Scanning Recipe.jfif...\n",
+            "\n",
+            "🤖 FULL DIGITIZED RECIPE:\n",
+            "==============================\n",
+            "1903\n",
+            "0 0\n",
+            "1930 1932\n",
+            "0 0\n",
+            "==============================\n"
+          ]
+        }
+      ],
+      "source": [
+        "# 1. SETUP\n",
+        "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
+        "processor = TrOCRProcessor.from_pretrained(\"microsoft/trocr-large-handwritten\")\n",
+        "model = VisionEncoderDecoderModel.from_pretrained(\"microsoft/trocr-large-handwritten\").to(device)\n",
+        "\n",
+        "def scan_recipe_line_by_line(image_path, line_height=80):\n",
+        "    \"\"\"\n",
+        "    Inputs:\n",
+        "        image_path: path to your 900x1200 image\n",
+        "        line_height: approximate height of one line of text in pixels\n",
+        "    \"\"\"\n",
+        "    full_image = Image.open(image_path).convert(\"RGB\")\n",
+        "    width, height = full_image.size\n",
+        "\n",
+        "    all_text = []\n",
+        "\n",
+        "    # 2. THE SCANNING LOOP\n",
+        "    # We move down the image in 'steps' (strips)\n",
+        "    print(f\"📄 Scanning {os.path.basename(image_path)}...\")\n",
+        "\n",
+        "    for top in range(0, height, line_height):\n",
+        "        # Define the box for the current line strip\n",
+        "        bottom = min(top + line_height, height)\n",
+        "        # (left, top, right, bottom)\n",
+        "        line_strip = full_image.crop((0, top, width, bottom))\n",
+        "\n",
+        "        # 3. PROCESS THE STRIP\n",
+        "        # We check if the strip has actual ink (isn't just white paper)\n",
+        "        if np.array(line_strip).std() < 5: # Skip blank strips\n",
+        "            continue\n",
+        "\n",
+        "        pixel_values = processor(images=line_strip, return_tensors=\"pt\").pixel_values.to(device)\n",
+        "\n",
+        "        with torch.no_grad():\n",
+        "            generated_ids = model.generate(pixel_values, max_new_tokens=50)\n",
+        "\n",
+        "        line_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
+        "\n",
+        "        # If the model found text, add it to our list\n",
+        "        if line_text.strip() and line_text.strip() != \"0\":\n",
+        "            all_text.append(line_text)\n",
+        "\n",
+        "    # 4. JOIN EVERYTHING\n",
+        "    return \"\\n\".join(all_text)\n",
+        "\n",
+        "# --- TEST THE PIPELINE ---\n",
+        "test_image = \"/content/Recipe.jfif\"\n",
+        "final_recipe = scan_recipe_line_by_line(test_image)\n",
+        "\n",
+        "print(\"\\n🤖 FULL DIGITIZED RECIPE:\")\n",
+        "print(\"=\"*30)\n",
+        "print(final_recipe)\n",
+        "print(\"=\"*30)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "\n",
+        "\n",
+        "---\n",
+        "\n"
+      ],
+      "metadata": {
+        "id": "qxTqJLBwUoPS"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "## Part 4: Second and final version of the IO pipeline"
+      ],
+      "metadata": {
+        "id": "PmEbXIqzTQIz"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "We implemented a serverless inference pipeline that uses the **Qwen2.5-VL vision-language model** (`Qwen/Qwen2.5-VL-7B-Instruct`) hosted on the Hugging Face Inference API. Unlike the TrOCR approach above, which needs the page cut into single-line strips before recognition and returns only plain text, this solution is end-to-end and generative: the model receives the whole image and directly outputs structured JSON. The architecture offloads the heavy computation to cloud-hosted GPUs, so the application can digitize complex handwritten recipes without local hardware acceleration."
+      ],
+      "metadata": {
+        "id": "wdygXOgvTJfK"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "import os\n",
+        "import json\n",
+        "import base64\n",
+        "from PIL import Image\n",
+        "import io\n",
+        "from huggingface_hub import InferenceClient"
+      ],
+      "metadata": {
+        "id": "ykczbBR4VCNL"
+      },
+      "execution_count": 7,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "class RecipeDigitalizerPipeline:\n",
+        "    def __init__(self):\n",
+        "        print(\"Connecting to Hugging Face API (Qwen Mode)...\")\n",
+        "        self.token = os.getenv(\"HF_TOKEN\")\n",
+        "\n",
+        "        # --- WE ARE STICKING TO QWEN ---\n",
+        "        # If 2.5 gives you trouble, you can try \"Qwen/Qwen2-VL-7B-Instruct\"\n",
+        "        self.model_id = \"Qwen/Qwen2.5-VL-7B-Instruct\"\n",
+        "\n",
+        "        self.client = InferenceClient(token=self.token)\n",
+        "\n",
+        "    def compress_image(self, image_path):\n",
+        "        \"\"\"\n",
+        "        Resizes the image so it doesn't crash the Free API.\n",
+        "        \"\"\"\n",
+        "        with Image.open(image_path) as img:\n",
+        "            if img.mode != 'RGB':\n",
+        "                img = img.convert('RGB')\n",
+        "\n",
+        "            # Resize: Free API often rejects images larger than 1024x1024\n",
+        "            max_size = 1024\n",
+        "            if max(img.size) > max_size:\n",
+        "                img.thumbnail((max_size, max_size))\n",
+        "\n",
+        "            # Save to memory as JPEG\n",
+        "            buffer = io.BytesIO()\n",
+        "            img.save(buffer, format=\"JPEG\", quality=70) # Quality 70 is enough for text\n",
+        "\n",
+        "            # Convert to Base64\n",
+        "            encoded_string = base64.b64encode(buffer.getvalue()).decode('utf-8')\n",
+        "            return f\"data:image/jpeg;base64,{encoded_string}\"\n",
+        "\n",
+        "    def run_pipeline(self, image_path):\n",
+        "        prompt = \"\"\"Extract the recipe from this image.\n",
+        "        Output strictly valid JSON with keys: title, ingredients (list), instructions (list), cuisine_type, difficulty.\n",
+        "        Do not include markdown formatting like ```json, just the raw JSON.\"\"\"\n",
+        "\n",
+        "        try:\n",
+        "            # 1. Compress Image (Solves 400 Bad Request)\n",
+        "            image_url = self.compress_image(image_path)\n",
+        "\n",
+        "            # 2. Call Qwen API\n",
+        "            response = self.client.chat.completions.create(\n",
+        "                model=self.model_id,\n",
+        "                messages=[\n",
+        "                    {\n",
+        "                        \"role\": \"user\",\n",
+        "                        \"content\": [\n",
+        "                            {\n",
+        "                                \"type\": \"image_url\",\n",
+        "                                \"image_url\": {\"url\": image_url}\n",
+        "                            },\n",
+        "                            {\"type\": \"text\", \"text\": prompt}\n",
+        "                        ]\n",
+        "                    }\n",
+        "                ],\n",
+        "                max_tokens=1024\n",
+        "            )\n",
+        "\n",
+        "            # 3. Clean Output\n",
+        "            raw_text = response.choices[0].message.content\n",
+        "            clean_json = raw_text.replace(\"```json\", \"\").replace(\"```\", \"\").strip()\n",
+        "\n",
+        "            # Extra safety: Find the first { and last }\n",
+        "            start = clean_json.find('{')\n",
+        "            end = clean_json.rfind('}') + 1\n",
+        "            if start != -1 and end > start:\n",
+        "                clean_json = clean_json[start:end]\n",
+        "\n",
+        "            return json.loads(clean_json)\n",
+        "\n",
+        "        except Exception as e:\n",
+        "            return {\"error\": f\"Qwen API Error: {str(e)}\"}"
+      ],
+      "metadata": {
+        "id": "I0XOgMjETSXw"
+      },
+      "execution_count": 8,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# --- PART 4: EXECUTION EXAMPLE ---\n",
+        "\n",
+        "if __name__ == \"__main__\":\n",
+        "    import os\n",
+        "\n",
+        "    # 1. AUTHENTICATION FIX\n",
+        "    try:\n",
+        "        from google.colab import userdata\n",
+        "        # Get the Colab secret named \"HF_TOKEN\"\n",
+        "        hf_token = userdata.get('HF_TOKEN')\n",
+        "\n",
+        "        # Inject it into the environment as 'HF_TOKEN' so the Pipeline class can find it\n",
+        "        os.environ[\"HF_TOKEN\"] = hf_token\n",
+        "        print(\"✅ Successfully loaded token from secret HF_TOKEN\")\n",
+        "\n",
+        "    except Exception as e:\n",
+        "        print(f\"⚠️ Warning: Could not load secret 'HF_TOKEN'. Make sure the name in the Key icon is exactly 'HF_TOKEN'.\")\n",
+        "        print(f\"Error details: {e}\")\n",
+        "\n",
+        "    # 2. INITIALIZE PIPELINE\n",
+        "    # Now this will work because we set os.environ[\"HF_TOKEN\"] above\n",
+        "    try:\n",
+        "        app = RecipeDigitalizerPipeline()\n",
+        "\n",
+        "        # 3. USER INPUT\n",
+        "        user_image = \"/content/Recipe.jfif\"\n",
+        "\n",
+        "        # 4. RUN PIPELINE\n",
+        "        if os.path.exists(user_image):\n",
+        "            print(f\"Processing {user_image}...\")\n",
+        "            ai_output = app.run_pipeline(user_image)\n",
+        "\n",
+        "            # 5. AI OUTPUT\n",
+        "            print(\"\\n--- FINAL DIGITAL OUTPUT ---\")\n",
+        "            print(json.dumps(ai_output, indent=4))\n",
+        "        else:\n",
+        "            print(f\"❌ Error: Image not found at {user_image}\")\n",
+        "\n",
+        "    except Exception as e:\n",
+        "        print(f\"❌ Application Error: {e}\")"
+      ],
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "EyXpPQGsTXkd",
+        "outputId": "10c5fa31-6731-45ec-b5cc-074d6d534bfc"
+      },
+      "execution_count": 15,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "✅ Successfully loaded token from secret HF_TOKEN\n",
+            "Connecting to Hugging Face API (Qwen Mode)...\n",
+            "Processing /content/Recipe.jfif...\n",
+            "\n",
+            "--- FINAL DIGITAL OUTPUT ---\n",
+            "{\n",
+            "    \"title\": \"Chocolate Chip Cookies\",\n",
+            "    \"ingredients\": [\n",
+            "        \"3 cups flour\",\n",
+            "        \"1 1/2 teaspoons baking soda\",\n",
+            "        \"1/4 teaspoon salt\",\n",
+            "        \"1/2 cup soften butter\",\n",
+            "        \"1/4 cup sugar\",\n",
+            "        \"1/2 cup brown sugar\",\n",
+            "        \"3 eggs\",\n",
+            "        \"2 teaspoons vanilla\",\n",
+            "        \"2 cups chocolate chips\"\n",
+            "    ],\n",
+            "    \"instructions\": [\n",
+            "        \"Preheat oven to 350\\u00b0 for about 15 minutes or roll out a cookie cake and bake for about 9 minutes.\"\n",
+            "    ],\n",
+            "    \"cuisine_type\": \"American\",\n",
+            "    \"difficulty\": \"Easy\"\n",
+            "}\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "In our test, the Qwen2.5-VL serverless pipeline clearly outperformed the TrOCR baseline from the old version. On the same image, TrOCR returned only hallucinated digits (for example '1903' and '1930 1932') and no recipe text, while the VLM returned the title, ingredients, and instructions as structured JSON. Because the VLM draws on language understanding, it can also resolve ambiguous handwriting: for instance, the model inferred 'sugar' from the noisy input 's_gar' using the culinary context, a semantic correction that the TrOCR pipeline did not provide."
+      ],
+      "metadata": {
+        "id": "JIZUnKOWTZqc"
+      }
+    },
+    {
+      "cell_type": "code",
+      "source": [],
+      "metadata": {
+        "id": "6kaTyYGBTZiL"
+      },
+      "execution_count": null,
+      "outputs": []
+    }
+  ]
+}
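
The "Clean Output" step in `run_pipeline` (strip code-fence markers, then keep everything from the first `{` to the last `}`) can be tried in isolation without calling the API. The block below is a minimal sketch of that step, not the notebook's code: the helper name `extract_json_object` and the sample reply are hypothetical, and the error dict only mirrors the `{"error": ...}` convention the notebook already uses.

```python
import json

# Build the three-backtick marker programmatically so this example can itself
# be shown inside a fenced code block.
FENCE = "`" * 3


def extract_json_object(raw_text):
    """Parse the first-'{'-to-last-'}' span of a model reply as JSON.

    Hypothetical helper (not part of the notebook). Returns a dict with an
    "error" key when no parsable JSON object is found.
    """
    # Drop code-fence markers the model may add despite the prompt.
    cleaned = raw_text.replace(FENCE + "json", "").replace(FENCE, "").strip()

    start = cleaned.find("{")
    end = cleaned.rfind("}")
    # find/rfind return -1 when the character is missing; also reject a
    # closing brace that comes before the opening one.
    if start == -1 or end < start:
        return {"error": "No JSON object found in model output"}

    try:
        return json.loads(cleaned[start:end + 1])
    except json.JSONDecodeError as exc:
        return {"error": f"Invalid JSON in model output: {exc}"}


if __name__ == "__main__":
    # Made-up reply that wraps the JSON in a fence and adds extra prose.
    reply = (
        "Here is the recipe:\n"
        + FENCE + "json\n"
        + '{"title": "Cookies", "ingredients": ["3 cups flour"]}\n'
        + FENCE
    )
    print(extract_json_object(reply))
```

Checking the brace positions before slicing, and catching `json.JSONDecodeError` separately, keeps a malformed reply distinguishable from a network or authentication failure, which the notebook's single `except Exception` reports under the same "Qwen API Error" label.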