import os
import time

import google.generativeai as genai


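def extract_bounding_box(image_path, description):
    """Ask Gemini for the bounding box of the winning horse in an image.

    Returns the raw response text from the model.
    """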
    # Set your API key
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

    # Load the model (this script uses Gemini 2.5 Pro, which accepts image input)
    model = genai.GenerativeModel(model_name="models/gemini-2.5-pro")

    # Upload the image so it can be passed to the model alongside the prompt
    upload_response = genai.upload_file(path=image_path, mime_type="image/png")
    
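    # Poll until the uploaded file has finished processing
    while True:
        status = genai.get_file(upload_response.name)
        if status.state.name == "ACTIVE":
            break
        if status.state.name == "FAILED":
            raise RuntimeError(f"File processing failed: {upload_response.name}")
        time.sleep(1)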

    prompt = f'''
**Role:** You are an expert in precise object detection within static images.

**Task:**
Your goal is to identify the **winning contestant** in the provided image and output the precise coordinates of its bounding box in the format specified below, normalized to the range 0-1000.

**Input:**
1.  **Image:** The image is provided in the prompt
2.  **Identification of Winner:** {description}

**Instructions for Identifying the Winner (if not explicitly provided by user):**
*   Assume the "winner" is the horse that is:
    *   Visibly in the most advanced position relative to other competitors (if any are visible).
    *   Closest to or clearly crossing a discernible finish line (if one is present in the image).
    *   Appearing most dominant or distinctly ahead if other cues are absent.
    *   If the image is a close-up of a single horse, that horse is the winner by default.
    *   Ensure you select the bounding box of the winner that best matches the description provided.

**Bounding Box Requirements:**
*   The bounding box must encompass the **entire visible portion** of the identified winning horse, including its head, body, all visible legs, and tail.
*   The bounding box should be the **tightest possible rectangle** around the horse.
*   Avoid including significant background elements or other distinct entities (like other horses or jockeys) unless they are directly occluding a part of the winning horse.
*   The coordinates should be normalized to 0-1000.

**Output Format:**
Provide the bounding box coordinates in the following format:
`(y_min, x_min, y_max, x_max)`
Where:
*   `(x_min, y_min)` are the normalized (0-1000) coordinates of the top-left corner of the bounding box.
*   `(x_max, y_max)` are the normalized (0-1000) coordinates of the bottom-right corner of the bounding box.
*   Assume the coordinate system origin (0,0) is at the top-left corner of the provided image.

**Example Output:**
(y_min, x_min, y_max, x_max)
e.g., (480, 598, 608, 720)

**Important Considerations for the AI:**
*   **Occlusion:** If the winning horse is partially occluded by other objects or racers, provide the bounding box for the visible parts of the *winning horse only*. Briefly note if significant occlusion might affect the bounding box's accuracy.
*   **Ambiguity:** If, even with the provided image, identifying the *single* clear winner is highly ambiguous (e.g., a very tight photo finish with multiple horses equally positioned), state this ambiguity. If possible, provide bounding boxes for all equally plausible winners, labeling them distinctly if you can.
'''
    
    # Combine the uploaded image with the text prompt
    prompt_parts = [
        upload_response,
        prompt
    ]

    response = model.generate_content(prompt_parts, generation_config={
        "temperature": 0.5
    })

    return response.text
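

# The function above returns the model's raw text, so a caller still has to
# pull the numbers out and map them back onto the image. The sketch below shows
# one way to do that, assuming the reply contains a tuple in the
# `(y_min, x_min, y_max, x_max)` format requested by the prompt, normalized to
# 0-1000. `parse_bounding_box` and `to_pixel_box` are hypothetical helper names,
# not part of the original script, and the image size is supplied by the caller.

```python
import re

# Matches four comma-separated integers in parentheses, e.g. "(480, 598, 608, 720)".
_BOX_PATTERN = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)")


def parse_bounding_box(response_text):
    """Return the first (y_min, x_min, y_max, x_max) tuple found, or None."""
    match = _BOX_PATTERN.search(response_text)
    if match is None:
        return None
    y_min, x_min, y_max, x_max = (int(group) for group in match.groups())
    return y_min, x_min, y_max, x_max


def to_pixel_box(box, image_width, image_height):
    """Scale a 0-1000 normalized box to pixel (left, top, right, bottom)."""
    y_min, x_min, y_max, x_max = box
    left = round(x_min * image_width / 1000)
    top = round(y_min * image_height / 1000)
    right = round(x_max * image_width / 1000)
    bottom = round(y_max * image_height / 1000)
    return left, top, right, bottom


if __name__ == "__main__":
    # Example reply text, reusing the sample tuple from the prompt.
    box = parse_bounding_box("The winner is at (480, 598, 608, 720).")
    print(box)
    print(to_pixel_box(box, image_width=2000, image_height=1000))
```

# Only the first tuple in the reply is used; when the model reports several
# equally plausible winners, the remaining tuples would need separate handling.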