RacingDemo / src /models /bounding_box_extractor.py
Vlad Bastina
import os
import google.generativeai as genai
import time
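def extract_bounding_box(image_path, description):
    """Ask Gemini for the bounding box of the winning contestant in an image."""
    # Read the API key from the GOOGLE_API_KEY environment variable
    genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
    # Load the multimodal Gemini model
    model = genai.GenerativeModel(model_name="models/gemini-2.5-pro")
    # Upload the image so it can be passed as part of the prompt
    upload_response = genai.upload_file(path=image_path, mime_type="image/png")
    # Poll until the uploaded file is ready; stop early if processing failed
    while True:
        status = genai.get_file(upload_response.name)
        if status.state.name == "ACTIVE":
            break
        if status.state.name == "FAILED":
            raise RuntimeError(f"File processing failed: {upload_response.name}")
        time.sleep(1)
    prompt = f'''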
**Role:** You are an expert in precise object detection within static images.
**Task:**
Your goal is to identify the **winning contestant** in the provided image and output the precise coordinates of its bounding box in the format specified below, normalized to 0-1000.
**Input:**
1. **Image:** The image is provided in the prompt
2. **Identification of Winner:** {description}
**Instructions for Identifying the Winner (if not explicitly provided by user):**
* Assume the "winner" is the horse that is:
* Visibly in the most advanced position relative to other competitors (if any are visible).
* Closest to or clearly crossing a discernible finish line (if one is present in the image).
* Appearing most dominant or distinctly ahead if other cues are absent.
* If the image is a close-up of a single horse, that horse is the winner by default.
* Ensure you select the bounding box of the winner that best matches the description provided.
**Bounding Box Requirements:**
* The bounding box must encompass the **entire visible portion** of the identified winning horse, including its head, body, all visible legs, and tail.
* The bounding box should be the **tightest possible rectangle** around the horse.
* Avoid including significant background elements or other distinct entities (like other horses or jockeys) unless they are directly occluding a part of the winning horse.
* The coordinates should be normalized to 0-1000.
**Output Format:**
Provide the bounding box coordinates in the following format:
`(y_min, x_min, y_max, x_max)`
Where:
* `(x_min, y_min)` are the normalized (0-1000) coordinates of the top-left corner of the bounding box.
* `(x_max, y_max)` are the normalized (0-1000) coordinates of the bottom-right corner of the bounding box.
* Assume the coordinate system origin (0,0) is at the top-left corner of the provided image.
**Example Output:**
(y_min, x_min, y_max, x_max)
e.g., (480, 598, 608, 720)
**Important Considerations for the AI:**
* **Occlusion:** If the winning horse is partially occluded by other objects or racers, provide the bounding box for the visible parts of the *winning horse only*. Briefly note if significant occlusion might affect the bounding box's accuracy.
* **Ambiguity:** If, even with the provided image, identifying the *single* clear winner is highly ambiguous (e.g., a very tight photo finish with multiple horses equally positioned), state this ambiguity. If possible, provide bounding boxes for all equally plausible winners, labeling them distinctly if you can.
'''
    # Send the uploaded image together with the prompt
    prompt_parts = [
        upload_response,
        prompt,
    ]
    response = model.generate_content(
        prompt_parts,
        generation_config={"temperature": 0.5},
    )
    return response.text
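

# --- Illustrative addition (not part of the original module) ---
# A minimal sketch of how a caller might turn the text returned by
# extract_bounding_box into pixel coordinates. It assumes the model follows
# the requested `(y_min, x_min, y_max, x_max)` format with values normalized
# to 0-1000. The helper name `parse_bounding_box` and its parameters are
# hypothetical, and only the first box found in the text is used.
import re


def parse_bounding_box(response_text, image_width, image_height):
    """Return (x_min, y_min, x_max, y_max) in pixels, or None if no box is found."""
    # Match the first parenthesized group of four non-negative integers.
    match = re.search(
        r"\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\)", response_text
    )
    if match is None:
        return None
    y_min, x_min, y_max, x_max = (int(value) for value in match.groups())
    # Scale from the 0-1000 normalized range to pixel units.
    return (
        round(x_min * image_width / 1000),
        round(y_min * image_height / 1000),
        round(x_max * image_width / 1000),
        round(y_max * image_height / 1000),
    )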