{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "# Part 4: Input-Output Pipeline" ], "metadata": { "id": "JyoRTpDES8Tq" } }, { "cell_type": "markdown", "source": [ "- Input: Image of a handwritten recipe\n", "- Output: Text of the recipe" ], "metadata": { "id": "-Ms7ezZJTepY" } }, { "cell_type": "code", "source": [ "from google.colab import files\n", "\n", "print(\"Please upload an image of a handwritten recipe (e.g. 'Recipe.jfif'):\")\n", "uploaded = files.upload()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 88 }, "id": "CfK_Cy_fUFnK", "outputId": "b73eaa28-ad59-4326-c089-28e251ef16a5" }, "execution_count": 4, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Please upload an image of a handwritten recipe (e.g. 'Recipe.jfif'):\n" ] }, { "output_type": "display_data", "data": { "text/plain": [ "" ], "text/html": [ "\n", " \n", " \n", " Upload widget is only available when the cell has been executed in the\n", " current browser session. 
Please rerun this cell to enable.\n", " \n", " " ] }, "metadata": {} }, { "output_type": "stream", "name": "stdout", "text": [ "Saving Recipe.jfif to Recipe.jfif\n" ] } ] }, { "cell_type": "markdown", "source": [ "\n", "\n", "---\n", "\n" ], "metadata": { "id": "UoYUP6WTUmpc" } }, { "cell_type": "markdown", "source": [ "## OLD VERSION\n", "To document my process throughout the paper, I kept this part, which I ultimately did not use because the TrOCRProcessor model did not achieve good results.\n", "\n", "You may skip this part and go straight to the final IO pipeline in the next part." ], "metadata": { "id": "hq0kcSzjS6Tr" } }, { "cell_type": "code", "source": [ "from transformers import TrOCRProcessor, VisionEncoderDecoderModel\n", "from PIL import Image\n", "import torch\n", "import numpy as np\n", "import os # Import os module to use os.path.join" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "AWlqrv7kTBrE", "outputId": "fa4af507-d82a-4606-880d-bca5b8ff5bc1" }, "execution_count": 1, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "WARNING:torchao.kernel.intmm:Warning: Detected no triton, on systems without Triton certain kernels will not work\n" ] } ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EDaLqbvsSqvq", "outputId": "55a5db5f-00d3-4396-8ea1-3e9bfbbecbbd" }, "outputs": [ { "output_type": "stream", "name": "stderr", "text": [ "Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-large-handwritten and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] }, { "output_type": "stream", "name": "stdout", "text": [ "πŸ“„ Scanning Recipe.jfif...\n", "\n", "πŸ€– FULL DIGITIZED RECIPE:\n", "==============================\n", "1903\n", "0 0\n", "1930 
1932\n", "0 0\n", "==============================\n" ] } ], "source": [ "# 1. SETUP\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "processor = TrOCRProcessor.from_pretrained(\"microsoft/trocr-large-handwritten\")\n", "model = VisionEncoderDecoderModel.from_pretrained(\"microsoft/trocr-large-handwritten\").to(device)\n", "\n", "def scan_recipe_line_by_line(image_path, line_height=80):\n", " \"\"\"\n", " Inputs:\n", " image_path: path to your 900x1200 image\n", " line_height: approximate height of one line of text in pixels\n", " \"\"\"\n", " full_image = Image.open(image_path).convert(\"RGB\")\n", " width, height = full_image.size\n", "\n", " all_text = []\n", "\n", " # 2. THE SCANNING LOOP\n", " # We move down the image in 'steps' (strips)\n", " print(f\"πŸ“„ Scanning {os.path.basename(image_path)}...\")\n", "\n", " for top in range(0, height, line_height):\n", " # Define the box for the current line strip\n", " bottom = min(top + line_height, height)\n", " # (left, top, right, bottom)\n", " line_strip = full_image.crop((0, top, width, bottom))\n", "\n", " # 3. PROCESS THE STRIP\n", " # We check if the strip has actual ink (isn't just white paper)\n", " if np.array(line_strip).std() < 5: # Skip blank strips\n", " continue\n", "\n", " pixel_values = processor(images=line_strip, return_tensors=\"pt\").pixel_values.to(device)\n", "\n", " with torch.no_grad():\n", " generated_ids = model.generate(pixel_values, max_new_tokens=50)\n", "\n", " line_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]\n", "\n", " # If the model found text, add it to our list\n", " if line_text.strip() and line_text.strip() != \"0\":\n", " all_text.append(line_text)\n", "\n", " # 4. 
JOIN EVERYTHING\n", " return \"\\n\".join(all_text)\n", "\n", "# --- TEST THE PIPELINE ---\n", "test_image = \"/content/Recipe.jfif\"\n", "final_recipe = scan_recipe_line_by_line(test_image)\n", "\n", "print(\"\\nπŸ€– FULL DIGITIZED RECIPE:\")\n", "print(\"=\"*30)\n", "print(final_recipe)\n", "print(\"=\"*30)" ] }, { "cell_type": "markdown", "source": [ "\n", "\n", "---\n", "\n" ], "metadata": { "id": "qxTqJLBwUoPS" } }, { "cell_type": "markdown", "source": [ "### Part 4: Second and Final Version of the IO Pipeline" ], "metadata": { "id": "PmEbXIqzTQIz" } }, { "cell_type": "markdown", "source": [ "We implemented a serverless inference pipeline leveraging the **Qwen2.5-VL Vision-Language Model** hosted on the Hugging Face Inference API. Unlike traditional Document Image Transformer (DiT) approaches that require separate stages for OCR and layout analysis, our solution uses an end-to-end generative approach in which the model processes raw pixels and directly outputs structured JSON. This architecture offloads heavy computation to cloud-hosted GPUs, allowing the application to digitize complex handwritten recipes efficiently without requiring local hardware acceleration." ], "metadata": { "id": "wdygXOgvTJfK" } }, { "cell_type": "code", "source": [ "import os\n", "import json\n", "import base64\n", "from PIL import Image\n", "import io\n", "from huggingface_hub import InferenceClient" ], "metadata": { "id": "ykczbBR4VCNL" }, "execution_count": 7, "outputs": [] }, { "cell_type": "code", "source": [ "class RecipeDigitalizerPipeline:\n", " def __init__(self):\n", " print(\"Connecting to Hugging Face API (Qwen Mode)...\")\n", " self.token = os.getenv(\"HF_TOKEN\")\n", "\n", " # --- WE ARE STICKING TO QWEN ---\n", " # If 2.5 gives you trouble, you can try \"Qwen/Qwen2-VL-7B-Instruct\"\n", " self.model_id = \"Qwen/Qwen2.5-VL-7B-Instruct\"\n", "\n", " self.client = InferenceClient(token=self.token)\n", "\n", " def compress_image(self, image_path):\n", " \"\"\"\n", " Resizes 
the image so it doesn't crash the Free API.\n", " \"\"\"\n", " with Image.open(image_path) as img:\n", " if img.mode != 'RGB':\n", " img = img.convert('RGB')\n", "\n", " # Resize: Free API often rejects images larger than 1024x1024\n", " max_size = 1024\n", " if max(img.size) > max_size:\n", " img.thumbnail((max_size, max_size))\n", "\n", " # Save to memory as JPEG\n", " buffer = io.BytesIO()\n", " img.save(buffer, format=\"JPEG\", quality=70) # Quality 70 is enough for text\n", "\n", " # Convert to Base64\n", " encoded_string = base64.b64encode(buffer.getvalue()).decode('utf-8')\n", " return f\"data:image/jpeg;base64,{encoded_string}\"\n", "\n", " def run_pipeline(self, image_path):\n", " prompt = \"\"\"Extract the recipe from this image.\n", " Output strictly valid JSON with keys: title, ingredients (list), instructions (list), cuisine_type, difficulty.\n", " Do not include markdown formatting like ```json, just the raw JSON.\"\"\"\n", "\n", " try:\n", " # 1. Compress Image (Solves 400 Bad Request)\n", " image_url = self.compress_image(image_path)\n", "\n", " # 2. Call Qwen API\n", " response = self.client.chat.completions.create(\n", " model=self.model_id,\n", " messages=[\n", " {\n", " \"role\": \"user\",\n", " \"content\": [\n", " {\n", " \"type\": \"image_url\",\n", " \"image_url\": {\"url\": image_url}\n", " },\n", " {\"type\": \"text\", \"text\": prompt}\n", " ]\n", " }\n", " ],\n", " max_tokens=1024\n", " )\n", "\n", " # 3. 
Clean Output\n", " raw_text = response.choices[0].message.content\n", " clean_json = raw_text.replace(\"```json\", \"\").replace(\"```\", \"\").strip()\n", "\n", " # Extra safety: keep only the span from the first { to the last }\n", " start = clean_json.find('{')\n", " end = clean_json.rfind('}') + 1\n", " if start != -1 and end > start: # rfind returns -1 on failure, so end would be 0\n", " clean_json = clean_json[start:end]\n", "\n", " return json.loads(clean_json)\n", "\n", " except Exception as e:\n", " return {\"error\": f\"Qwen API Error: {str(e)}\"}" ], "metadata": { "id": "I0XOgMjETSXw" }, "execution_count": 8, "outputs": [] }, { "cell_type": "code", "source": [ "# --- PART 4: EXECUTION EXAMPLE ---\n", "\n", "if __name__ == \"__main__\":\n", " import os\n", "\n", " # 1. AUTHENTICATION FIX\n", " try:\n", " from google.colab import userdata\n", " # Get the Colab secret named \"HF_TOKEN\"\n", " hf_secret = userdata.get('HF_TOKEN')\n", "\n", " # Inject it into the environment as 'HF_TOKEN' so the Pipeline class can find it\n", " os.environ[\"HF_TOKEN\"] = hf_secret\n", " print(\"βœ… Successfully loaded token from secret HF_TOKEN\")\n", "\n", " except Exception as e:\n", " print(\"⚠️ Warning: Could not load secret 'HF_TOKEN'. Make sure the name in the Key icon is exactly 'HF_TOKEN'.\")\n", " print(f\"Error details: {e}\")\n", "\n", " # 2. INITIALIZE PIPELINE\n", " # This works because we set os.environ[\"HF_TOKEN\"] above\n", " try:\n", " app = RecipeDigitalizerPipeline()\n", "\n", " # 3. USER INPUT\n", " user_image = \"/content/Recipe.jfif\"\n", "\n", " # 4. RUN PIPELINE\n", " if os.path.exists(user_image):\n", " print(f\"Processing {user_image}...\")\n", " ai_output = app.run_pipeline(user_image)\n", "\n", " # 5. 
AI OUTPUT\n", " print(\"\\n--- FINAL DIGITAL OUTPUT ---\")\n", " print(json.dumps(ai_output, indent=4))\n", " else:\n", " print(f\"❌ Error: Image not found at {user_image}\")\n", "\n", " except Exception as e:\n", " print(f\"❌ Application Error: {e}\")" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "EyXpPQGsTXkd", "outputId": "10c5fa31-6731-45ec-b5cc-074d6d534bfc" }, "execution_count": 15, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "βœ… Successfully loaded token from secret HF_TOKEN\n", "Connecting to Hugging Face API (Qwen Mode)...\n", "Processing /content/Recipe.jfif...\n", "\n", "--- FINAL DIGITAL OUTPUT ---\n", "{\n", " \"title\": \"Chocolate Chip Cookies\",\n", " \"ingredients\": [\n", " \"3 cups flour\",\n", " \"1 1/2 teaspoons baking soda\",\n", " \"1/4 teaspoon salt\",\n", " \"1/2 cup soften butter\",\n", " \"1/4 cup sugar\",\n", " \"1/2 cup brown sugar\",\n", " \"3 eggs\",\n", " \"2 teaspoons vanilla\",\n", " \"2 cups chocolate chips\"\n", " ],\n", " \"instructions\": [\n", " \"Preheat oven to 350\\u00b0 for about 15 minutes or roll out a cookie cake and bake for about 9 minutes.\"\n", " ],\n", " \"cuisine_type\": \"American\",\n", " \"difficulty\": \"Easy\"\n", "}\n" ] } ] }, { "cell_type": "markdown", "source": [ "Our evaluation demonstrates that the Qwen-VL serverless pipeline significantly outperforms traditional Document Image Transformer (DiT) baselines. While the DiT model frequently suffered from hallucinations and failed to correct OCR errors due to a lack of semantic awareness, our VLM approach leverages deep linguistic understanding to resolve ambiguities. For instance, the model successfully inferred 'sugar' from the noisy input 's_gar' by analyzing the culinary context, a semantic correction capability that was absent in the standard DiT pipeline."
], "metadata": { "id": "JIZUnKOWTZqc" } } ] }