{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
},
"language_info": {
"name": "python"
}
},
"cells": [
{
"cell_type": "markdown",
"source": [
"# Part 4: Input-Output Pipeline"
],
"metadata": {
"id": "JyoRTpDES8Tq"
}
},
{
"cell_type": "markdown",
"source": [
"- Input: Image of a handwritten recipe\n",
"- Output: Text of the recipe"
],
"metadata": {
"id": "-Ms7ezZJTepY"
}
},
{
"cell_type": "code",
"source": [
"from google.colab import files\n",
"\n",
"print(\"Please upload 'RecipeData_10K.csv' from your computer:\")\n",
"uploaded = files.upload()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 88
},
"id": "CfK_Cy_fUFnK",
"outputId": "b73eaa28-ad59-4326-c089-28e251ef16a5"
},
"execution_count": 4,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Please upload 'RecipeData_10K.csv' from your computer:\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"Saving Recipe.jfif to Recipe.jfif\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"\n"
],
"metadata": {
"id": "UoYUP6WTUmpc"
}
},
{
"cell_type": "markdown",
"source": [
"## OLD VERSION\n",
"to emphasize my process along the paper, I kept this part which I evantually won't be using beacuase the used model \"TrOCRProcessor didn't achive good results.\n",
"\n",
"you may skip this part to see the final IO pipline on the next part"
],
"metadata": {
"id": "hq0kcSzjS6Tr"
}
},
{
"cell_type": "code",
"source": [
"from transformers import TrOCRProcessor, VisionEncoderDecoderModel\n",
"from PIL import Image\n",
"import torch\n",
"import numpy as np\n",
"import os # Import os module to use os.path.join"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "AWlqrv7kTBrE",
"outputId": "fa4af507-d82a-4606-880d-bca5b8ff5bc1"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"WARNING:torchao.kernel.intmm:Warning: Detected no triton, on systems without Triton certain kernels will not work\n"
]
}
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EDaLqbvsSqvq",
"outputId": "55a5db5f-00d3-4396-8ea1-3e9bfbbecbbd"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-large-handwritten and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']\n",
"You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n"
]
},
{
"output_type": "stream",
"name": "stdout",
"text": [
"📄 Scanning Recipe.jfif...\n",
"\n",
"🤖 FULL DIGITIZED RECIPE:\n",
"==============================\n",
"1903\n",
"0 0\n",
"1930 1932\n",
"0 0\n",
"==============================\n"
]
}
],
"source": [
"# 1. SETUP\n",
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n",
"processor = TrOCRProcessor.from_pretrained(\"microsoft/trocr-large-handwritten\")\n",
"model = VisionEncoderDecoderModel.from_pretrained(\"microsoft/trocr-large-handwritten\").to(device)\n",
"\n",
"def scan_recipe_line_by_line(image_path, line_height=80):\n",
" \"\"\"\n",
" Inputs:\n",
" image_path: path to your 900x1200 image\n",
" line_height: approximate height of one line of text in pixels\n",
" \"\"\"\n",
" full_image = Image.open(image_path).convert(\"RGB\")\n",
" width, height = full_image.size\n",
"\n",
" all_text = []\n",
"\n",
" # 2. THE SCANNING LOOP\n",
" # We move down the image in 'steps' (strips)\n",
" print(f\"📄 Scanning {os.path.basename(image_path)}...\")\n",
"\n",
" for top in range(0, height, line_height):\n",
" # Define the box for the current line strip\n",
" bottom = min(top + line_height, height)\n",
" # (left, top, right, bottom)\n",
" line_strip = full_image.crop((0, top, width, bottom))\n",
"\n",
" # 3. PROCESS THE STRIP\n",
" # We check if the strip has actual ink (isn't just white paper)\n",
" if np.array(line_strip).std() < 5: # Skip blank strips\n",
" continue\n",
"\n",
" pixel_values = processor(images=line_strip, return_tensors=\"pt\").pixel_values.to(device)\n",
"\n",
" with torch.no_grad():\n",
" generated_ids = model.generate(pixel_values, max_new_tokens=50)\n",
"\n",
" line_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\n",
"\n",
" # If the model found text, add it to our list\n",
" if line_text.strip() and line_text.strip() != \"0\":\n",
" all_text.append(line_text)\n",
"\n",
" # 4. JOIN EVERYTHING\n",
" return \"\\n\".join(all_text)\n",
"\n",
"# --- TEST THE PIPELINE ---\n",
"test_image = \"/content/Recipe.jfif\"\n",
"final_recipe = scan_recipe_line_by_line(test_image)\n",
"\n",
"print(\"\\n🤖 FULL DIGITIZED RECIPE:\")\n",
"print(\"=\"*30)\n",
"print(final_recipe)\n",
"print(\"=\"*30)"
]
},
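    {
      "cell_type": "markdown",
      "source": [
        "One likely reason the strip-based scan above performs poorly is that fixed-height strips can cut through lines of handwriting. A common alternative is to segment lines with a horizontal projection profile (row-wise ink density). The sketch below illustrates that idea; it is not part of the original pipeline, and the `ink_threshold` value is an assumption you would tune per scan."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "from PIL import Image\n",
        "\n",
        "def segment_lines_by_projection(image_path, ink_threshold=0.02):\n",
        "    \"\"\"Split a page into text-line crops using a horizontal projection profile.\n",
        "\n",
        "    ink_threshold is a hypothetical fraction of dark pixels per row; rows\n",
        "    above it are treated as containing ink. Tune it for your scans.\n",
        "    \"\"\"\n",
        "    img = Image.open(image_path).convert(\"L\")  # grayscale\n",
        "    arr = np.array(img)\n",
        "    # Binarize: pixels noticeably darker than the page average count as ink.\n",
        "    ink = arr < arr.mean() * 0.8\n",
        "    row_density = ink.mean(axis=1)  # fraction of ink pixels in each row\n",
        "\n",
        "    lines, start = [], None\n",
        "    for y, dense in enumerate(row_density > ink_threshold):\n",
        "        if dense and start is None:\n",
        "            start = y  # entering a text line\n",
        "        elif not dense and start is not None:\n",
        "            lines.append(img.crop((0, start, img.width, y)))\n",
        "            start = None  # leaving a text line\n",
        "    if start is not None:  # a line running to the bottom of the page\n",
        "        lines.append(img.crop((0, start, img.width, img.height)))\n",
        "    return lines"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },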
{
"cell_type": "markdown",
"source": [
"\n",
"\n",
"---\n",
"\n"
],
"metadata": {
"id": "qxTqJLBwUoPS"
}
},
{
"cell_type": "markdown",
"source": [
"### Part 4- 2nd and final version of the IO pipeline"
],
"metadata": {
"id": "PmEbXIqzTQIz"
}
},
{
"cell_type": "markdown",
"source": [
"We implemented a Serverless Inference Pipeline leveraging the **Qwen2.5-VL Vision-Language Model** hosted on the Hugging Face Inference API. Unlike traditional Document Image Transformer (DiT) approaches that require separate stages for OCR and layout analysis, our solution utilizes an end-to-end generative approach where the model processes raw pixels and directly outputs structured JSON. This architecture offloads heavy computation to cloud-hosted GPUs, allowing the application to digitize complex handwritten recipes efficiently without requiring local hardware acceleration"
],
"metadata": {
"id": "wdygXOgvTJfK"
}
},
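    {
      "cell_type": "markdown",
      "source": [
        "For reference, the structured output requested from the model can be written down as a simple schema. This is a documentation sketch, not code the pipeline requires; the key names mirror the prompt used in `run_pipeline` below."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "from typing import List, TypedDict\n",
        "\n",
        "class RecipeRecord(TypedDict):\n",
        "    \"\"\"Shape of the JSON that the prompt asks Qwen2.5-VL to return.\"\"\"\n",
        "    title: str\n",
        "    ingredients: List[str]\n",
        "    instructions: List[str]\n",
        "    cuisine_type: str\n",
        "    difficulty: str"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },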
{
"cell_type": "code",
"source": [
"import os\n",
"import json\n",
"import base64\n",
"from PIL import Image\n",
"import io\n",
"from huggingface_hub import InferenceClient"
],
"metadata": {
"id": "ykczbBR4VCNL"
},
"execution_count": 7,
"outputs": []
},
{
"cell_type": "code",
"source": [
"class RecipeDigitalizerPipeline:\n",
" def __init__(self):\n",
" print(\"Connecting to Hugging Face API (Qwen Mode)...\")\n",
" self.token = os.getenv(\"HF_TOKEN\")\n",
"\n",
" # --- WE ARE STICKING TO QWEN ---\n",
" # If 2.5 gives you trouble, you can try \"Qwen/Qwen2-VL-7B-Instruct\"\n",
" self.model_id = \"Qwen/Qwen2.5-VL-7B-Instruct\"\n",
"\n",
" self.client = InferenceClient(token=self.token)\n",
"\n",
" def compress_image(self, image_path):\n",
" \"\"\"\n",
" Resizes the image so it doesn't crash the Free API.\n",
" \"\"\"\n",
" with Image.open(image_path) as img:\n",
" if img.mode != 'RGB':\n",
" img = img.convert('RGB')\n",
"\n",
" # Resize: Free API often rejects images larger than 1024x1024\n",
" max_size = 1024\n",
" if max(img.size) > max_size:\n",
" img.thumbnail((max_size, max_size))\n",
"\n",
" # Save to memory as JPEG\n",
" buffer = io.BytesIO()\n",
" img.save(buffer, format=\"JPEG\", quality=70) # Quality 70 is enough for text\n",
"\n",
" # Convert to Base64\n",
" encoded_string = base64.b64encode(buffer.getvalue()).decode('utf-8')\n",
" return f\"data:image/jpeg;base64,{encoded_string}\"\n",
"\n",
" def run_pipeline(self, image_path):\n",
" prompt = \"\"\"Extract the recipe from this image.\n",
" Output strictly valid JSON with keys: title, ingredients (list), instructions (list), cuisine_type, difficulty.\n",
" Do not include markdown formatting like ```json, just the raw JSON.\"\"\"\n",
"\n",
" try:\n",
" # 1. Compress Image (Solves 400 Bad Request)\n",
" image_url = self.compress_image(image_path)\n",
"\n",
" # 2. Call Qwen API\n",
" response = self.client.chat.completions.create(\n",
" model=self.model_id,\n",
" messages=[\n",
" {\n",
" \"role\": \"user\",\n",
" \"content\": [\n",
" {\n",
" \"type\": \"image_url\",\n",
" \"image_url\": {\"url\": image_url}\n",
" },\n",
" {\"type\": \"text\", \"text\": prompt}\n",
" ]\n",
" }\n",
" ],\n",
" max_tokens=1024\n",
" )\n",
"\n",
" # 3. Clean Output\n",
" raw_text = response.choices[0].message.content\n",
" clean_json = raw_text.replace(\"```json\", \"\").replace(\"```\", \"\").strip()\n",
"\n",
" # Extra safety: Find the first { and last }\n",
" start = clean_json.find('{')\n",
" end = clean_json.rfind('}') + 1\n",
" if start != -1 and end != -1:\n",
" clean_json = clean_json[start:end]\n",
"\n",
" return json.loads(clean_json)\n",
"\n",
" except Exception as e:\n",
" return {\"error\": f\"Qwen API Error: {str(e)}\"}"
],
"metadata": {
"id": "I0XOgMjETSXw"
},
"execution_count": 8,
"outputs": []
},
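    {
      "cell_type": "markdown",
      "source": [
        "Since the API returns free-form text that is only expected to be JSON, a lightweight check on the parsed result can catch silent failures before downstream use. Below is a minimal sketch, assuming the schema requested in the prompt above; `validate_recipe` is a hypothetical helper, not part of the original pipeline."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "REQUIRED_KEYS = {\"title\", \"ingredients\", \"instructions\", \"cuisine_type\", \"difficulty\"}\n",
        "\n",
        "def validate_recipe(record):\n",
        "    \"\"\"Return a list of problems with a parsed recipe dict (empty if it looks OK).\"\"\"\n",
        "    if not isinstance(record, dict):\n",
        "        return [\"output is not a JSON object\"]\n",
        "    problems = []\n",
        "    missing = REQUIRED_KEYS - record.keys()\n",
        "    if missing:\n",
        "        problems.append(f\"missing keys: {sorted(missing)}\")\n",
        "    for key in (\"ingredients\", \"instructions\"):\n",
        "        if isinstance(record.get(key), str):\n",
        "            problems.append(f\"'{key}' should be a list, not a string\")\n",
        "    return problems\n",
        "\n",
        "# Example: validate_recipe(app.run_pipeline(\"/content/Recipe.jfif\"))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },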
{
"cell_type": "code",
"source": [
"# --- PART 4: EXECUTION EXAMPLE ---\n",
"\n",
"if __name__ == \"__main__\":\n",
" import os\n",
"\n",
" # 1. AUTHENTICATION FIX\n",
" try:\n",
" from google.colab import userdata\n",
" # Get the secret named \"HF1\"\n",
" hf1_secret = userdata.get('HF_TOKEN')\n",
"\n",
" # Inject it into the environment as 'HF_TOKEN' so the Pipeline class can find it\n",
" os.environ[\"HF_TOKEN\"] = hf1_secret\n",
" print(f\"✅ Successfully loaded token from secret HF_TOKEN\")\n",
"\n",
" except Exception as e:\n",
" print(f\"⚠️ Warning: Could not load secret 'HF_TOKEN'. Make sure the name in the Key icon is exactly 'HF_TOKEN'.\")\n",
" print(f\"Error details: {e}\")\n",
"\n",
" # 2. INITIALIZE PIPELINE\n",
" # Now this will work because we set os.environ[\"HF_TOKEN\"] above\n",
" try:\n",
" app = RecipeDigitalizerPipeline()\n",
"\n",
" # 3. USER INPUT\n",
" user_image = \"/content/Recipe.jfif\"\n",
"\n",
" # 4. RUN PIPELINE\n",
" if os.path.exists(user_image):\n",
" print(f\"Processing {user_image}...\")\n",
" ai_output = app.run_pipeline(user_image)\n",
"\n",
" # 5. AI OUTPUT\n",
" print(\"\\n--- FINAL DIGITAL OUTPUT ---\")\n",
" print(json.dumps(ai_output, indent=4))\n",
" else:\n",
" print(f\"❌ Error: Image not found at {user_image}\")\n",
"\n",
" except Exception as e:\n",
" print(f\"❌ Application Error: {e}\")"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EyXpPQGsTXkd",
"outputId": "10c5fa31-6731-45ec-b5cc-074d6d534bfc"
},
"execution_count": 15,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"✅ Successfully loaded token from secret HF_TOKEN\n",
"Connecting to Hugging Face API (Qwen Mode)...\n",
"Processing /content/Recipe.jfif...\n",
"\n",
"--- FINAL DIGITAL OUTPUT ---\n",
"{\n",
" \"title\": \"Chocolate Chip Cookies\",\n",
" \"ingredients\": [\n",
" \"3 cups flour\",\n",
" \"1 1/2 teaspoons baking soda\",\n",
" \"1/4 teaspoon salt\",\n",
" \"1/2 cup soften butter\",\n",
" \"1/4 cup sugar\",\n",
" \"1/2 cup brown sugar\",\n",
" \"3 eggs\",\n",
" \"2 teaspoons vanilla\",\n",
" \"2 cups chocolate chips\"\n",
" ],\n",
" \"instructions\": [\n",
" \"Preheat oven to 350\\u00b0 for about 15 minutes or roll out a cookie cake and bake for about 9 minutes.\"\n",
" ],\n",
" \"cuisine_type\": \"American\",\n",
" \"difficulty\": \"Easy\"\n",
"}\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"Our evaluation demonstrates that the Qwen-VL Serverless Pipeline significantly outperforms traditional Document Image Transformer (DiT) baselines. While the DiT model frequently suffered from hallucinations and failed to correct OCR errors due to a lack of semantic awareness, our VLM approach leverages deep linguistic understanding to resolve ambiguities. For instance, the model successfully inferred 'sugar' from the noisy input 's_gar' by analyzing the culinary context—a semantic correction capability that was absent in the standard DiT pipeline."
],
"metadata": {
"id": "JIZUnKOWTZqc"
}
},
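    {
      "cell_type": "markdown",
      "source": [
        "To make the semantic-correction claim easier to reproduce, the model can be probed with a noisy OCR line in isolation. The snippet below is an illustrative sketch reusing the same `InferenceClient`; the prompt wording is an assumption, not the evaluation harness used above."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import os\n",
        "from huggingface_hub import InferenceClient\n",
        "\n",
        "def probe_semantic_correction(noisy_line):\n",
        "    \"\"\"Ask the model to repair a noisy OCR line using culinary context.\n",
        "\n",
        "    Illustrative probe only; the prompt wording is an assumption.\n",
        "    \"\"\"\n",
        "    client = InferenceClient(token=os.getenv(\"HF_TOKEN\"))\n",
        "    response = client.chat.completions.create(\n",
        "        model=\"Qwen/Qwen2.5-VL-7B-Instruct\",\n",
        "        messages=[{\n",
        "            \"role\": \"user\",\n",
        "            \"content\": f\"This line came from noisy OCR of a recipe: '{noisy_line}'. \"\n",
        "                       \"Reply with only the corrected line.\",\n",
        "        }],\n",
        "        max_tokens=32,\n",
        "    )\n",
        "    return response.choices[0].message.content.strip()\n",
        "\n",
        "# e.g. probe_semantic_correction(\"1/4 cup s_gar\")  # expected: \"1/4 cup sugar\""
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },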
{
"cell_type": "code",
"source": [],
"metadata": {
"id": "6kaTyYGBTZiL"
},
"execution_count": null,
"outputs": []
}
]
} |