import os
import time

import google.generativeai as genai


def extract_bounding_box(image_path, description):
    """Ask Gemini for the bounding box of the winning horse in an image.

    Returns the model's raw text response, which should contain a
    (y_min, x_min, y_max, x_max) tuple normalized to 0-1000.
    """
    # Read the API key from the GOOGLE_API_KEY environment variable
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

    # Load the model
    model = genai.GenerativeModel(model_name="models/gemini-2.5-pro")

    # Upload the image so it can be passed as part of the prompt
    # (the MIME type is hard-coded, so the input is expected to be a PNG)
    upload_response = genai.upload_file(path=image_path, mime_type="image/png")

    # Poll until the uploaded file has finished processing
    while True:
        status = genai.get_file(upload_response.name)
        if status.state.name == "ACTIVE":
            break
        if status.state.name == "FAILED":
            # Without this check a failed upload would loop forever
            raise RuntimeError(f"Processing of uploaded file {image_path} failed")
        time.sleep(1)

    prompt = f'''
**Role:** You are an expert in precise object detection within static images.

**Task:** Your goal is to identify the **winning contestant** in the provided image and output the precise coordinates of its bounding box in the specified format, normalized to 0-1000.

**Input:**
1. **Image:** The image is provided in the prompt.
2. **Identification of Winner:** {description}

**Instructions for Identifying the Winner (if not explicitly provided by user):**
* Assume the "winner" is the horse that is:
    * Visibly in the most advanced position relative to other competitors (if any are visible).
    * Closest to or clearly crossing a discernible finish line (if one is present in the image).
    * Appearing most dominant or distinctly ahead if other cues are absent.
* If the image is a close-up of a single horse, that horse is the winner by default.
* Ensure you select the bounding box of the winner that best matches the description provided.

**Bounding Box Requirements:**
* The bounding box must encompass the **entire visible portion** of the identified winning horse, including its head, body, all visible legs, and tail.
* The bounding box should be the **tightest possible rectangle** around the horse.
* Avoid including significant background elements or other distinct entities (like other horses or jockeys) unless they are directly occluding a part of the winning horse.
* The coordinates should be normalized to 0-1000.

**Output Format:**
Provide the bounding box coordinates in the following format:
`(y_min, x_min, y_max, x_max)`
Where:
* `(x_min, y_min)` are the normalized coordinates of the top-left corner of the bounding box.
* `(x_max, y_max)` are the normalized coordinates of the bottom-right corner of the bounding box.
* Assume the coordinate system origin (0,0) is at the top-left corner of the provided image.

**Example Output:**
(y_min, x_min, y_max, x_max)
e.g., (480, 598, 608, 720)

**Important Considerations for the AI:**
* **Occlusion:** If the winning horse is partially occluded by other objects or racers, provide the bounding box for the visible parts of the *winning horse only*. Briefly note if significant occlusion might affect the bounding box's accuracy.
* **Ambiguity:** If, even with the provided image, identifying the *single* clear winner is highly ambiguous (e.g., a very tight photo finish with multiple horses equally positioned), state this ambiguity. If possible, provide bounding boxes for all equally plausible winners, labeling them distinctly if you can.
'''

    # Use the uploaded image in the prompt
    prompt_parts = [upload_response, prompt]

    response = model.generate_content(
        prompt_parts,
        generation_config={"temperature": 0.5},
    )
    return response.text
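

# The prompt above asks for boxes as (y_min, x_min, y_max, x_max), normalized
# to 0-1000, and extract_bounding_box returns the raw response text. The
# sketch below shows one possible way to turn that text into pixel boxes.
# parse_bounding_boxes, its regular expression, and the (left, top, right,
# bottom) output order are assumptions made for illustration and are not part
# of the original script. It also assumes the model follows the requested
# format; replies in any other layout are simply ignored.

```python
import re

# Matches four comma-separated integers in parentheses, e.g. (480, 598, 608, 720)
_BOX_PATTERN = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_bounding_boxes(response_text, image_width, image_height):
    """Hypothetical helper: extract boxes from the response and scale to pixels.

    Returns a list of (left, top, right, bottom) pixel tuples, one for each
    well-formed (y_min, x_min, y_max, x_max) tuple found in the text.
    """
    boxes = []
    for match in _BOX_PATTERN.finditer(response_text):
        y_min, x_min, y_max, x_max = (int(group) for group in match.groups())
        # Skip values outside the 0-1000 range requested in the prompt
        if not all(0 <= value <= 1000 for value in (y_min, x_min, y_max, x_max)):
            continue
        # Skip degenerate or inverted boxes
        if y_min >= y_max or x_min >= x_max:
            continue
        boxes.append((
            round(x_min * image_width / 1000),
            round(y_min * image_height / 1000),
            round(x_max * image_width / 1000),
            round(y_max * image_height / 1000),
        ))
    return boxes
```

# Returning a list rather than a single tuple covers the ambiguity case in the
# prompt, where the model may report more than one plausible winner.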