88hours committed
Commit 17934c8 · verified · 1 Parent(s): 2047272

Upload folder using huggingface_hub
.gitignore ADDED
@@ -0,0 +1,8 @@
myenv
__pycache__
mm_rag/*
shared_data
.gradio
.env
.venv
.github
README.md ADDED
@@ -0,0 +1,178 @@
---
title: multimodel-rag-chat-with-videos
app_file: main_demo.py
sdk: gradio
sdk_version: 5.17.1
---
# Re-Architecting a Multimodal RAG System Pipeline: A Journey
I ported the course project locally and isolated each concept into its own runnable Python step.
The code is now simplified, refactored, and bug-fixed.
I migrated from Prediction Guard to Hugging Face.

[**Interactive Video Chat Demo and Multimodal RAG System Architecture**](https://learn.deeplearning.ai/courses/multimodal-rag-chat-with-videos/lesson/2/interactive-demo-and-multimodal-rag-system-architecture)

### A multimodal AI system should be able to understand both text and video content.

---

## Step 1 - Learn Gradio (UI) (30 mins)

Gradio is a powerful Python library for quickly building browser-based UIs. It supports hot reloading for fast development.

### Key Concepts:
- **fn**: The function wrapped by the UI.
- **inputs**: The Gradio components used for input (should match the function's arguments).
- **outputs**: The Gradio components used for output (should match the return values).

📖 [**Gradio Documentation**](https://www.gradio.app/docs/gradio/introduction)

Gradio includes **30+ built-in components**.

💡 **Tip**: For `inputs` and `outputs`, you can pass either:
- The **component name** as a string (e.g., `"textbox"`)
- An **instance of the component class** (e.g., `gr.Textbox()`)

### Sharing Your Demo
```python
demo.launch(share=True)  # Share your demo with just one extra parameter.
```

## Gradio Advanced Features

### **Gradio.Blocks**
Gradio provides `gr.Blocks`, a flexible way to design web apps with **custom layouts and complex interactions**:
- Arrange components freely on the page.
- Handle multiple data flows.
- Use outputs as inputs for other components.
- Dynamically update components based on user interaction.

### **Gradio.ChatInterface**
- Always set `type="messages"` in `gr.ChatInterface`.
- The default (`type="tuples"`) is **deprecated** and will be removed in future versions.
- For more UI flexibility, use `gr.Chatbot`.
- `gr.ChatInterface` supports **Markdown** (not tested yet).

---

## Step 2 - Learn Bridge Tower Embedding Model (Multimodal Learning) (15 mins)

Developed in collaboration with Intel, this model maps image-caption pairs into **512-dimensional vectors**.

### Measuring Similarity
- **Cosine Similarity** → Measures the angle between embedding vectors, so two images are similar when their embeddings point in the same direction (**efficient & commonly used**).
- **Euclidean Distance** → Measures the straight-line (L2) distance between embeddings; `cv2.NORM_L2` selects this norm when comparing two images.

### Converting to 2D for Visualization
- **UMAP** reduces 512D embeddings to **2D for display purposes**.

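The cosine measure above can be sketched directly with NumPy; the short 3-d vectors stand in for real 512-d BridgeTower embeddings.

```python
import numpy as np

# Cosine similarity: dot product of the vectors divided by the
# product of their lengths, i.e. the cosine of the angle between them.
def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([1.0, 0.0, 1.0])   # same direction -> similarity 1.0
v3 = np.array([0.0, 1.0, 0.0])   # orthogonal -> similarity 0.0

print(cosine_similarity(v1, v2))  # 1.0
print(cosine_similarity(v1, v3))  # 0.0
```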
## Preprocessing Videos for Multimodal RAG

### **Case 1: WEBVTT → Extracting Text Segments from Video**
- Converts video + text into structured metadata.
- Splits content into multiple segments.

### **Case 2: Whisper (Small) → Video Only**
- Extracts **audio** → `model.transcribe()`.
- Applies the `getSubs()` helper function to retrieve **WEBVTT** subtitles.
- Uses **Case 1** processing.

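As a rough pure-Python sketch of what a helper like `getSubs()` does, the function below turns Whisper-style segments (dicts with `start`, `end`, `text`) into a WEBVTT string. The name `segments_to_webvtt` is hypothetical; the project's actual helper may differ.

```python
# Hypothetical sketch: format Whisper-style segments as WEBVTT.

def _ts(seconds):
    # format seconds as HH:MM:SS.mmm
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int((seconds - int(seconds)) * 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"

def segments_to_webvtt(segments):
    lines = ["WEBVTT", ""]
    for seg in segments:  # each seg: {"start": float, "end": float, "text": str}
        lines.append(f"{_ts(seg['start'])} --> {_ts(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")
    return "\n".join(lines)

vtt = segments_to_webvtt([{"start": 0.0, "end": 2.5, "text": " Hello"}])
print(vtt)
```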
### **Case 3: LVLM → Video + Silent/Music Extraction**
- Uses **LLaVA (an LVLM model)** for **frame-based captioning**.
- Encodes each frame as a **Base64 image**.
- Extracts context and captions from video frames.
- Uses **Case 1** processing.

# Step 4 - What is LLaVA?
LLaVA (Large Language-and-Vision Assistant) is a large multimodal model that connects a vision encoder to a language model. It doesn't just see images: it understands them, reads the text embedded in them, and reasons about their context.

# Step 5 - What is a Vector Store?
A vector store is a specialized database designed to:

- Store and manage high-dimensional vector data efficiently
- Perform similarity-based searches, where K=1 returns the single most similar result
- In LanceDB specifically, store multiple data types:
  - Text content (captions)
  - Image file paths
  - Metadata
  - Vector embeddings

```python
_ = MultimodalLanceDB.from_text_image_pairs(
    texts=updated_vid1_trans + vid2_trans,
    image_paths=vid1_img_path + vid2_img_path,
    embedding=BridgeTowerEmbeddings(),
    metadatas=vid1_metadata + vid2_metadata,
    connection=db,
    table_name=TBL_NAME,
    mode="overwrite",
)
```
# Gotchas and Solutions
- **Image Processing**: When working with base64-encoded images, convert them to `PIL.Image` format before processing with BridgeTower.
- **Model Selection**: Using `BridgeTowerForContrastiveLearning` instead of PredictionGuard due to API access limitations.
- **Model Size**: The BridgeTower model requires a ~3.5GB download.
- **Image Downloads**: Some Flickr images may be unavailable; implement robust error handling.
- **Token Decoding**: The BridgeTower contrastive learning model works with embeddings, not token predictions.
- **Whisper**: Install from `git+https://github.com/openai/whisper.git`.

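The first gotcha (base64 → `PIL.Image` before BridgeTower) can be sketched like this; the tiny generated image is only for the round-trip demo.

```python
import base64
import io
from PIL import Image

# Decode a base64-encoded image into a PIL.Image before handing it to BridgeTower.
def b64_to_pil(b64_str):
    return Image.open(io.BytesIO(base64.b64decode(b64_str))).convert("RGB")

# round-trip demo with a tiny in-memory image
buf = io.BytesIO()
Image.new("RGB", (2, 2), color=(255, 0, 0)).save(buf, format="PNG")
b64_str = base64.b64encode(buf.getvalue()).decode()

img = b64_to_pil(b64_str)
print(img.size)  # (2, 2)
```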
# Install ffmpeg using brew
```bash
brew install ffmpeg
brew link ffmpeg
```

# Learning and Skills

## Technical Skills:
- Basic machine learning and deep learning
- Vector embeddings and similarity search
- Multimodal data processing

## Framework & Library Expertise:
- Hugging Face Transformers
- Gradio UI development
- LangChain integration (basic)
- PyTorch basics
- LanceDB vector storage

## AI/ML Concepts:
- Multimodal RAG system architecture
- Vector embeddings and similarity search
- Large Language Models (LLaVA)
- Image-text pair processing
- Dimensionality reduction techniques

## Multimedia Processing:
- Video frame extraction
- Audio transcription (Whisper)
- Image processing (PIL)
- Base64 encoding/decoding
- WebVTT handling

## System Design:
- Client-server architecture
- API endpoint design
- Data pipeline construction
- Vector store implementation
- Multimodal system integration

## Hugging Face
Remote: hf_origin
Branch: hf_main

title: Hg Demo
emoji: 😻
colorFrom: gray
colorTo: red
sdk: gradio
sdk_version: 5.18.0
app_file: app.py
pinned: false
license: mit
short_description: 'A space to keep AI work for demo '
gradio_utils.py ADDED
@@ -0,0 +1,483 @@
import gradio as gr
import io
import sys
import time
import dataclasses
from pathlib import Path
import os
from enum import auto, Enum
from typing import List, Tuple, Any
import lancedb
from utility import load_json_file
from mm_rag.embeddings.bridgetower_embeddings import BridgeTowerEmbeddings
from mm_rag.vectorstores.multimodal_lancedb import MultimodalLanceDB
from mm_rag.MLM.client import PredictionGuardClient
from mm_rag.MLM.lvlm import LVLM
from PIL import Image
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from moviepy.video.io.VideoFileClip import VideoFileClip
from utility import prediction_guard_llava_conv, encode_image, Conversation, lvlm_inference_with_conversation

server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"

# function to split video at a timestamp
def split_video(video_path, timestamp_in_ms, output_video_path: str = "./shared_data/splitted_videos", output_video_name: str = "video_tmp.mp4", play_before_sec: int = 3, play_after_sec: int = 3):
    timestamp_in_sec = int(timestamp_in_ms / 1000)
    # create the output_video_path folder if it does not exist
    Path(output_video_path).mkdir(parents=True, exist_ok=True)
    output_video = os.path.join(output_video_path, output_video_name)
    with VideoFileClip(video_path) as video:
        duration = video.duration
        start_time = max(timestamp_in_sec - play_before_sec, 0)
        end_time = min(timestamp_in_sec + play_after_sec, duration)
        new = video.subclip(start_time, end_time)
        new.write_videofile(output_video, audio_codec='aac')
    return output_video


prompt_template = """The transcript associated with the image is '{transcript}'. {user_query}"""

# define default rag_chain
def get_default_rag_chain():
    # declare host file
    LANCEDB_HOST_FILE = "./shared_data/.lancedb"
    # declare table name
    TBL_NAME = "demo_tbl"

    # initialize vectorstore
    db = lancedb.connect(LANCEDB_HOST_FILE)

    # initialize a BridgeTower embedder
    embedder = BridgeTowerEmbeddings()

    ## Creating a LanceDB vector store
    vectorstore = MultimodalLanceDB(uri=LANCEDB_HOST_FILE, embedding=embedder, table_name=TBL_NAME)
    ### creating a retriever for the vector store
    retriever_module = vectorstore.as_retriever(search_type='similarity', search_kwargs={"k": 1})

    # initialize a PredictionGuardClient
    client = PredictionGuardClient()
    # initialize LVLM with the given client
    lvlm_inference_module = LVLM(client=client)

    def prompt_processing(input):
        # get the retrieved results and user's query
        retrieved_results, user_query = input['retrieved_results'], input['user_query']
        # get the first retrieved result by default
        retrieved_result = retrieved_results[0]

        # get all metadata of the retrieved video segment
        metadata_retrieved_video_segment = retrieved_result.metadata['metadata']

        # get the transcript and the path to the extracted frame of the retrieved video segment
        transcript = metadata_retrieved_video_segment['transcript']
        frame_path = metadata_retrieved_video_segment['extracted_frame_path']
        return {
            'prompt': prompt_template.format(transcript=transcript, user_query=user_query),
            'image': frame_path,
            'metadata': metadata_retrieved_video_segment,
        }

    # initialize prompt processing module as a LangChain RunnableLambda of function prompt_processing
    prompt_processing_module = RunnableLambda(prompt_processing)

    # the output of this new chain will be a dictionary
    mm_rag_chain_with_retrieved_image = (
        RunnableParallel({"retrieved_results": retriever_module,
                          "user_query": RunnablePassthrough()})
        | prompt_processing_module
        | RunnableParallel({'final_text_output': lvlm_inference_module,
                            'input_to_lvlm': RunnablePassthrough()})
    )
    return mm_rag_chain_with_retrieved_image

class SeparatorStyle(Enum):
    """Different separator style."""
    SINGLE = auto()

@dataclasses.dataclass
class GradioInstance:
    """A class that keeps all conversation history."""
    system: str
    roles: List[str]
    messages: List[List[str]]
    offset: int
    sep_style: SeparatorStyle = SeparatorStyle.SINGLE
    sep: str = "\n"
    sep2: str = None
    version: str = "Unknown"
    path_to_img: str = None
    video_title: str = None
    path_to_video: str = None
    caption: str = None
    mm_rag_chain: Any = None

    skip_next: bool = False

    def _template_caption(self):
        out = ""
        if self.caption is not None:
            out = f"The caption associated with the image is '{self.caption}'. "
        return out

    def get_prompt_for_rag(self):
        messages = self.messages
        assert len(messages) == 2, "length of current conversation should be 2"
        assert messages[1][1] is None, "the first response message of current conversation should be None"
        ret = messages[0][1]
        return ret

    def get_conversation_for_lvlm(self):
        pg_conv = prediction_guard_llava_conv.copy()
        image_path = self.path_to_img
        b64_img = encode_image(image_path)
        for i, (role, msg) in enumerate(self.messages[self.offset:]):
            if msg is None:
                break
            if i == 0:
                pg_conv.append_message(prediction_guard_llava_conv.roles[0], [msg, b64_img])
            elif i == len(self.messages[self.offset:]) - 2:
                pg_conv.append_message(role, [prompt_template.format(transcript=self.caption, user_query=msg)])
            else:
                pg_conv.append_message(role, [msg])
        return pg_conv

    def append_message(self, role, message):
        self.messages.append([role, message])

    def get_images(self, return_pil=False):
        images = []
        if self.path_to_img is not None:
            path_to_image = self.path_to_img
            images.append(path_to_image)
        return images

    def to_gradio_chatbot(self):
        ret = []
        for i, (role, msg) in enumerate(self.messages[self.offset:]):
            if i % 2 == 0:
                if type(msg) is tuple:
                    import base64
                    from io import BytesIO
                    msg, image, image_process_mode = msg
                    max_hw, min_hw = max(image.size), min(image.size)
                    aspect_ratio = max_hw / min_hw
                    max_len, min_len = 800, 400
                    shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
                    longest_edge = int(shortest_edge * aspect_ratio)
                    W, H = image.size
                    if H > W:
                        H, W = longest_edge, shortest_edge
                    else:
                        H, W = shortest_edge, longest_edge
                    image = image.resize((W, H))
                    buffered = BytesIO()
                    image.save(buffered, format="JPEG")
                    img_b64_str = base64.b64encode(buffered.getvalue()).decode()
                    img_str = f'<img src="data:image/png;base64,{img_b64_str}" alt="user upload image" />'
                    msg = img_str + msg.replace('<image>', '').strip()
                    ret.append([msg, None])
                else:
                    ret.append([msg, None])
            else:
                ret[-1][-1] = msg
        return ret

    def copy(self):
        return GradioInstance(
            system=self.system,
            roles=self.roles,
            messages=[[x, y] for x, y in self.messages],
            offset=self.offset,
            sep_style=self.sep_style,
            sep=self.sep,
            sep2=self.sep2,
            version=self.version,
            mm_rag_chain=self.mm_rag_chain,
        )

    def dict(self):
        return {
            "system": self.system,
            "roles": self.roles,
            "messages": self.messages,
            "offset": self.offset,
            "sep": self.sep,
            "sep2": self.sep2,
            "path_to_img": self.path_to_img,
            "video_title": self.video_title,
            "path_to_video": self.path_to_video,
            "caption": self.caption,
        }

    def get_path_to_subvideos(self):
        if self.video_title is not None and self.path_to_img is not None:
            info = video_helper_map[self.video_title]
            path = info['path']
            prefix = info['prefix']
            vid_index = self.path_to_img.split('/')[-1]
            vid_index = vid_index.split('_')[-1]
            vid_index = vid_index.replace('.jpg', '')
            ret = f"{prefix}{vid_index}.mp4"
            ret = os.path.join(path, ret)
            return ret
        elif self.path_to_video is not None:
            return self.path_to_video
        return None

def get_gradio_instance(mm_rag_chain=None):
    if mm_rag_chain is None:
        mm_rag_chain = get_default_rag_chain()

    instance = GradioInstance(
        system="",
        roles=prediction_guard_llava_conv.roles,
        messages=[],
        offset=0,
        sep_style=SeparatorStyle.SINGLE,
        sep="\n",
        path_to_img=None,
        video_title=None,
        caption=None,
        mm_rag_chain=mm_rag_chain,
    )
    return instance

gr.set_static_paths(paths=["./assets/"])
theme = gr.themes.Base(
    primary_hue=gr.themes.Color(
        c100="#dbeafe", c200="#bfdbfe", c300="#93c5fd", c400="#60a5fa", c50="#eff6ff", c500="#0054ae", c600="#00377c", c700="#00377c", c800="#1e40af", c900="#1e3a8a", c950="#0a0c2b"),
    secondary_hue=gr.themes.Color(
        c100="#dbeafe", c200="#bfdbfe", c300="#93c5fd", c400="#60a5fa", c50="#eff6ff", c500="#0054ae", c600="#0054ae", c700="#0054ae", c800="#1e40af", c900="#1e3a8a", c950="#1d3660"),
).set(
    body_background_fill_dark='*primary_950',
    body_text_color_dark='*neutral_300',
    border_color_accent='*primary_700',
    border_color_accent_dark='*neutral_800',
    block_background_fill_dark='*primary_950',
    block_border_width='2px',
    block_border_width_dark='2px',
    button_primary_background_fill_dark='*primary_500',
    button_primary_border_color_dark='*primary_500'
)

css = '''
@font-face {
  font-family: IntelOne;
  src: url("/file=./assets/intelone-bodytext-font-family-regular.ttf");
}
.gradio-container {background-color: #0a0c2b}
table {
  border-collapse: collapse;
  border: none;
}
'''

## <td style="border-bottom:0"><img src="file/assets/DCAI_logo.png" height="300" width="300"></td>

# html_title = '''
# <table style="bordercolor=#0a0c2b; border=0">
# <tr style="height:150px; border:0">
# <td style="border:0"><img src="/file=../assets/intel-labs.png" height="100" width="100"></td>
# <td style="vertical-align:bottom; border:0">
# <p style="font-size:xx-large;font-family:IntelOne, Georgia, sans-serif;color: white;">
# Multimodal RAG:
# <br>
# Chat with Videos
# </p>
# </td>
# <td style="border:0"><img src="/file=../assets/gaudi.png" width="100" height="100"></td>
# <td style="border:0"><img src="/file=../assets/IDC7.png" width="300" height="350"></td>
# <td style="border:0"><img src="/file=../assets/prediction_guard3.png" width="120" height="120"></td>
# </tr>
# </table>
# '''

html_title = '''
<table style="bordercolor=#0a0c2b; border=0">
<tr style="height:150px; border:0">
<td style="border:0"><img src="/file=./assets/header.png"></td>
</tr>
</table>
'''

# <td style="border:0"><img src="/file=../assets/xeon.png" width="100" height="100"></td>
dropdown_list = [
    "What is the name of one of the astronauts?",
    "An astronaut's spacewalk",
    "What does the astronaut say?",
]

no_change_btn = gr.Button()
enable_btn = gr.Button(interactive=True)
disable_btn = gr.Button(interactive=False)

def clear_history(state, request: gr.Request):
    state = get_gradio_instance(state.mm_rag_chain)
    return (state, state.to_gradio_chatbot(), "", None) + (disable_btn,) * 1

def add_text(state, text, request: gr.Request):
    if len(text) <= 0:
        state.skip_next = True
        return (state, state.to_gradio_chatbot(), "", None) + (no_change_btn,) * 1

    text = text[:1536]  # Hard cut-off

    state.append_message(state.roles[0], text)
    state.append_message(state.roles[1], None)
    state.skip_next = False
    return (state, state.to_gradio_chatbot(), "") + (disable_btn,) * 1

def http_bot(
    state, request: gr.Request
):
    start_tstamp = time.time()

    if state.skip_next:
        # This generate call is skipped due to invalid inputs
        path_to_sub_videos = state.get_path_to_subvideos()
        yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (no_change_btn,) * 1
        return

    if len(state.messages) == state.offset + 2:
        # First round of conversation
        new_state = get_gradio_instance(state.mm_rag_chain)
        new_state.append_message(new_state.roles[0], state.messages[-2][1])
        new_state.append_message(new_state.roles[1], None)
        state = new_state

    all_images = state.get_images(return_pil=False)

    # Make requests
    is_very_first_query = True
    if len(all_images) == 0:
        # the first query needs to do RAG; construct the prompt
        prompt_or_conversation = state.get_prompt_for_rag()
    else:
        # subsequent queries, no need to do retrieval
        is_very_first_query = False
        prompt_or_conversation = state.get_conversation_for_lvlm()

    if is_very_first_query:
        executor = state.mm_rag_chain
    else:
        executor = lvlm_inference_with_conversation

    state.messages[-1][-1] = "▌"
    path_to_sub_videos = state.get_path_to_subvideos()
    yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (disable_btn,) * 1

    try:
        if is_very_first_query:
            # get response by invoking the executor chain
            response = executor.invoke(prompt_or_conversation)
            message = response['final_text_output']
            if 'metadata' in response['input_to_lvlm']:
                metadata = response['input_to_lvlm']['metadata']
                if (state.path_to_img is None
                        and 'input_to_lvlm' in response
                        and 'image' in response['input_to_lvlm']
                        ):
                    state.path_to_img = response['input_to_lvlm']['image']

                if state.path_to_video is None and 'video_path' in metadata:
                    video_path = metadata['video_path']
                    mid_time_ms = metadata['mid_time_ms']
                    splited_video_path = split_video(video_path, mid_time_ms)
                    state.path_to_video = splited_video_path

                if state.caption is None and 'transcript' in metadata:
                    state.caption = metadata['transcript']
            else:
                raise ValueError("Response's format is changed")
        else:
            # get the response message by directly calling the PredictionGuard API
            message = executor(prompt_or_conversation)

    except Exception as e:
        print(e)
        state.messages[-1][-1] = server_error_msg
        yield (state, state.to_gradio_chatbot(), None) + (
            enable_btn,
        )
        return

    state.messages[-1][-1] = message
    path_to_sub_videos = state.get_path_to_subvideos()
    # path_to_image = state.path_to_img
    # caption = state.caption
    # print(path_to_sub_videos)
    # print(path_to_image)
    # print('caption: ', caption)
    yield (state, state.to_gradio_chatbot(), path_to_sub_videos) + (enable_btn,) * 1

    finish_tstamp = time.time()
    return

def get_demo(rag_chain=None):
    if rag_chain is None:
        rag_chain = get_default_rag_chain()

    with gr.Blocks(theme=theme, css=css) as demo:
        # gr.Markdown(description)
        instance = get_gradio_instance(rag_chain)
        state = gr.State(instance)
        demo.load(
            None,
            None,
            js="""
            () => {
                const params = new URLSearchParams(window.location.search);
                if (!params.has('__theme')) {
                    params.set('__theme', 'dark');
                    window.location.search = params.toString();
                }
            }""",
        )
        gr.HTML(value=html_title)
        with gr.Row():
            with gr.Column(scale=4):
                video = gr.Video(height=512, width=512, elem_id="video", interactive=False)
            with gr.Column(scale=7):
                chatbot = gr.Chatbot(
                    elem_id="chatbot", label="Multimodal RAG Chatbot", height=512,
                )
        with gr.Row():
            with gr.Column(scale=8):
                # textbox.render()
                textbox = gr.Dropdown(
                    dropdown_list,
                    allow_custom_value=True,
                    # show_label=False,
                    # container=False,
                    label="Query",
                    info="Enter your query here or choose a sample from the dropdown list!"
                )
            with gr.Column(scale=1, min_width=50):
                submit_btn = gr.Button(
                    value="Send", variant="primary", interactive=True
                )
        with gr.Row(elem_id="buttons") as button_row:
            clear_btn = gr.Button(value="🗑️ Clear history", interactive=False)

        btn_list = [clear_btn]

        clear_btn.click(
            clear_history, [state], [state, chatbot, textbox, video] + btn_list
        )
        submit_btn.click(
            add_text,
            [state, textbox],
            [state, chatbot, textbox] + btn_list,
        ).then(
            http_bot,
            [state],
            [state, chatbot, video] + btn_list,
        )
    return demo
lrn_vector_embeddings.py ADDED
@@ -0,0 +1,111 @@
import json
import os
import numpy as np
from numpy.linalg import norm
import cv2
from io import StringIO, BytesIO
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
from tqdm import tqdm
import base64
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning, BridgeTowerForImageAndTextRetrieval, BridgeTowerForMaskedLM
import requests
from PIL import Image
import torch

url1 = 'http://farm3.staticflickr.com/2519/4126738647_cc436c111b_z.jpg'
cap1 = 'A motorcycle sits parked across from a herd of livestock'

url2 = 'http://farm3.staticflickr.com/2046/2003879022_1b4b466d1d_z.jpg'
cap2 = 'Motorcycle on platform to be worked on in garage'

url3 = 'https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_3x2.jpg'
cap3 = 'a cat laying down stretched out near a laptop'

img1 = {
    'flickr_url': url1,
    'caption': cap1,
    'image_path': './shared_data/motorcycle_1.jpg'
}

img2 = {
    'flickr_url': url2,
    'caption': cap2,
    'image_path': './shared_data/motorcycle_2.jpg'
}

img3 = {
    'flickr_url': url3,
    'caption': cap3,
    'image_path': './shared_data/cat_1.jpg'
}

def bt_embeddings_from_local(text, image):
    model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

    processed_inputs = processor(image, text, padding=True, return_tensors="pt")
    outputs = model(**processed_inputs)

    cross_modal_embeddings = outputs.cross_embeds
    text_embeddings = outputs.text_embeds
    image_embeddings = outputs.image_embeds
    return {
        'cross_modal_embeddings': cross_modal_embeddings,
        'text_embeddings': text_embeddings,
        'image_embeddings': image_embeddings
    }


def bt_scores_with_image_and_text_retrieval():
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw)
    texts = ["An image of two cats chilling on a couch", "A football player scoring a goal"]

    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-gaudi")
    model = BridgeTowerForImageAndTextRetrieval.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-gaudi")

    # forward pass for each candidate caption
    scores = dict()
    for text in texts:
        # prepare inputs
        encoding = processor(image, text, return_tensors="pt")
        outputs = model(**encoding)
        scores[text] = outputs.logits[0, 1].item()
    return scores


def bt_with_masked_input():
    url = "http://images.cocodataset.org/val2017/000000360943.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    text = "a <mask> looking out of the window"

    processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-gaudi")
    model = BridgeTowerForMaskedLM.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-gaudi")

    # prepare inputs
    encoding = processor(image, text, return_tensors="pt")

    # forward pass
    outputs = model(**encoding)

    token_ids = outputs.logits.argmax(dim=-1).squeeze(0).tolist()
    if isinstance(token_ids, list):
        results = processor.tokenizer.decode(token_ids)
    else:
        results = processor.tokenizer.decode([token_ids])

    print(results)
    return results

if __name__ == "__main__":
    for img in [img1, img2, img3]:
        embeddings = bt_embeddings_from_local(img['caption'], Image.open(img['image_path']))
        print(embeddings['cross_modal_embeddings'][0].shape)
main_demo.py ADDED
@@ -0,0 +1,142 @@
+ from pathlib import Path
+ import gradio as gr
+ from PIL import Image
+ import ollama
+ import lancedb
+ from utility import download_video, get_transcript_vtt, extract_meta_data
+ from utility import load_json_file, display_retrieved_results
+ from mm_rag.embeddings.bridgetower_embeddings import (
+     BridgeTowerEmbeddings
+ )
+ from mm_rag.vectorstores.multimodal_lancedb import MultimodalLanceDB
+ 
+ # declare host file
+ LANCEDB_HOST_FILE = "./shared_data/.lancedb"
+ # declare table name
+ TBL_NAME = "demo_tbl"
+ # initialize the vector store connection
+ db = lancedb.connect(LANCEDB_HOST_FILE)
+ # initialize a BridgeTower embedder
+ embedder = BridgeTowerEmbeddings()
+ 
+ vid_dir = "./shared_data/videos/yt_video"
+ Path(vid_dir).mkdir(parents=True, exist_ok=True)
+ 
+ 
+ def open_table():
+     # open a connection to table TBL_NAME
+     tbl = db.open_table(TBL_NAME)
+     print(f"There are {tbl.to_pandas().shape[0]} rows in the table")
+     # display the first 3 rows of the table
+     print(tbl.to_pandas()[['text', 'image_path']].head(3))
+ 
+ 
+ def store_in_rag():
+     # load the metadata file
+     vid_metadata_path = './shared_data/videos/yt_video/metadatas.json'
+     vid_metadata = load_json_file(vid_metadata_path)
+ 
+     vid_subs = [vid['transcript'] for vid in vid_metadata]
+     vid_img_path = [vid['extracted_frame_path'] for vid in vid_metadata]
+ 
+     # join each transcript segment with its neighbours (window of n = 7)
+     # so that every entry carries enough context to embed well
+     n = 7
+     updated_vid_subs = [
+         ' '.join(vid_subs[i - int(n / 2): i + int(n / 2)]) if i - int(n / 2) >= 0 else
+         ' '.join(vid_subs[0: i + int(n / 2)]) for i in range(len(vid_subs))
+     ]
+ 
+     # also write the updated transcripts back into the metadata
+     for i in range(len(updated_vid_subs)):
+         vid_metadata[i]['transcript'] = updated_vid_subs[i]
+ 
+     # pass mode="append" to add more entries to the vector store;
+     # pass mode="overwrite" to start with a fresh vector store
+     _ = MultimodalLanceDB.from_text_image_pairs(
+         texts=updated_vid_subs,
+         image_paths=vid_img_path,
+         embedding=embedder,
+         metadatas=vid_metadata,
+         connection=db,
+         table_name=TBL_NAME,
+         mode="overwrite",
+     )
+ 
+ 
+ def get_metadata_of_yt_video_with_captions(vid_url):
+     vid_filepath = download_video(vid_url, vid_dir)
+     vid_transcript_filepath = get_transcript_vtt(vid_url, vid_dir)
+     # should produce a lowercase file name without spaces
+     extract_meta_data(vid_dir, vid_filepath, vid_transcript_filepath)
+     store_in_rag()
+     open_table()
+     return vid_filepath
+ 
+ 
+ """
+ def chat_response_llvm(instruction):
+     # file_path = the_metadatas[0]
+     file_path = 'shared_data/videos/yt_video/extracted_frame/'
+     result = ollama.generate(
+         model='llava',
+         prompt=instruction,
+         images=[file_path],
+         stream=True
+     )['response']
+     return result
+ """
+ 
+ 
+ def return_top_k_most_similar_docs(query="show me a group of astronauts", max_docs=1):
+     # create a LanceDB vector store over the existing table
+     vectorstore = MultimodalLanceDB(
+         uri=LANCEDB_HOST_FILE,
+         embedding=embedder,
+         table_name=TBL_NAME)
+ 
+     # create a retriever for the vector store;
+     # search_type="similarity" makes the retriever perform similarity search,
+     # and search_kwargs={"k": max_docs} returns the top max_docs documents
+     retriever = vectorstore.as_retriever(
+         search_type='similarity',
+         search_kwargs={"k": max_docs})
+ 
+     results = retriever.invoke(query)
+     return results[0].page_content, Image.open(results[0].metadata['extracted_frame_path'])
+ 
+ 
+ def process_url_and_init(youtube_url):
+     vid_filepath = get_metadata_of_yt_video_with_captions(youtube_url)
+     return vid_filepath
+ 
+ 
+ def init_ui():
+     with gr.Blocks() as demo:
+         url_input = gr.Textbox(label="Enter YouTube URL", value="https://www.youtube.com/watch?v=7Hcg-rLYwdM")
+         submit_btn = gr.Button("Process Video")
+         chatbox = gr.Textbox(label="What question do you want to ask?", value="show me a group of astronauts")
+         response = gr.Textbox(label="Response", interactive=False)
+         video = gr.Video()
+         frame = gr.Image()
+         submit_btn2 = gr.Button("ASK")
+ 
+         submit_btn.click(fn=process_url_and_init, inputs=url_input, outputs=[video])
+         submit_btn2.click(fn=return_top_k_most_similar_docs, inputs=[chatbox], outputs=[response, frame])
+     return demo
+ 
+ 
+ if __name__ == '__main__':
+     demo = init_ui()
+     demo.launch(share=True)
+ 
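The transcript windowing used by `store_in_rag` is worth seeing in isolation: each short VTT caption is joined with its neighbours so the stored text carries enough context to embed well. A minimal pure-Python sketch (`window_context` is a hypothetical helper name; the arithmetic mirrors the list comprehension above):

```python
def window_context(subs, n=7):
    # join each caption with its n//2 neighbours on either side;
    # near the start of the list, clamp the window at index 0
    half = n // 2
    return [
        ' '.join(subs[i - half: i + half]) if i - half >= 0
        else ' '.join(subs[0: i + half])
        for i in range(len(subs))
    ]

print(window_context(['a', 'b', 'c', 'd', 'e'], n=3))
# -> ['a', 'a b', 'b c', 'c d', 'd e']
```

Note that the window is asymmetric (it extends `half` items back but only `half` items forward, exclusive), which is inherited from the original comprehension.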
requirements.txt ADDED
@@ -0,0 +1,22 @@
+ gradio
+ langchain-predictionguard
+ IPython
+ umap-learn
+ pytubefix
+ youtube_transcript_api
+ torch
+ transformers
+ matplotlib
+ seaborn
+ datasets
+ moviepy
+ webvtt-py
+ tqdm
+ lancedb
+ langchain-core
+ langchain-community
+ ollama
+ opencv-python
+ openai-whisper
+ huggingface_hub[cli]
s2_download_data.py ADDED
@@ -0,0 +1,49 @@
+ import requests
+ from PIL import Image
+ from IPython.display import display
+ 
+ # You can use your own uploaded images and captions.
+ # You are responsible for the legal use of any images you use.
+ 
+ url1 = 'http://farm3.staticflickr.com/2519/4126738647_cc436c111b_z.jpg'
+ cap1 = 'A motorcycle sits parked across from a herd of livestock'
+ 
+ url2 = 'http://farm3.staticflickr.com/2046/2003879022_1b4b466d1d_z.jpg'
+ cap2 = 'Motorcycle on platform to be worked on in garage'
+ 
+ url3 = 'https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_3x2.jpg'
+ cap3 = 'a cat laying down stretched out near a laptop'
+ 
+ img1 = {
+     'flickr_url': url1,
+     'caption': cap1,
+     'image_path': './shared_data/motorcycle_1.jpg'
+ }
+ 
+ img2 = {
+     'flickr_url': url2,
+     'caption': cap2,
+     'image_path': './shared_data/motorcycle_2.jpg'
+ }
+ 
+ img3 = {
+     'flickr_url': url3,
+     'caption': cap3,
+     'image_path': './shared_data/cat_1.jpg'
+ }
+ 
+ def download_images():
+     # download each image to its local path
+     imgs = [img1, img2, img3]
+     for img in imgs:
+         data = requests.get(img['flickr_url']).content
+         with open(img['image_path'], 'wb') as f:
+             f.write(data)
+ 
+     # display each downloaded image with its caption
+     for img in imgs:
+         image = Image.open(img['image_path'])
+         caption = img['caption']
+         display(image)
+         print(caption)
+ 
s3_data_to_vector_embedding.py ADDED
@@ -0,0 +1,61 @@
+ from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning
+ import torch
+ from PIL import Image
+ 
+ url1 = 'http://farm3.staticflickr.com/2519/4126738647_cc436c111b_z.jpg'
+ cap1 = 'A motorcycle sits parked across from a herd of livestock'
+ 
+ url2 = 'http://farm3.staticflickr.com/2046/2003879022_1b4b466d1d_z.jpg'
+ cap2 = 'Motorcycle on platform to be worked on in garage'
+ 
+ url3 = 'https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_3x2.jpg'
+ cap3 = 'a cat laying down stretched out near a laptop'
+ 
+ img1 = {
+     'flickr_url': url1,
+     'caption': cap1,
+     'image_path': './shared_data/motorcycle_1.jpg',
+     'tensor_path': './shared_data/motorcycle_1'
+ }
+ 
+ img2 = {
+     'flickr_url': url2,
+     'caption': cap2,
+     'image_path': './shared_data/motorcycle_2.jpg',
+     'tensor_path': './shared_data/motorcycle_2'
+ }
+ 
+ img3 = {
+     'flickr_url': url3,
+     'caption': cap3,
+     'image_path': './shared_data/cat_1.jpg',
+     'tensor_path': './shared_data/cat_1'
+ }
+ 
+ def bt_embeddings_from_local(text, image):
+     # load the BridgeTower contrastive model and its processor
+     model = BridgeTowerForContrastiveLearning.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
+     processor = BridgeTowerProcessor.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")
+ 
+     processed_inputs = processor(image, text, padding=True, return_tensors="pt")
+     outputs = model(**processed_inputs)
+ 
+     cross_modal_embeddings = outputs.cross_embeds
+     text_embeddings = outputs.text_embeds
+     image_embeddings = outputs.image_embeds
+     return {
+         'cross_modal_embeddings': cross_modal_embeddings,
+         'text_embeddings': text_embeddings,
+         'image_embeddings': image_embeddings
+     }
+ 
+ def save_embeddings():
+     for img in [img1, img2, img3]:
+         embedding = bt_embeddings_from_local(img['caption'], Image.open(img['image_path']))
+         print(embedding['cross_modal_embeddings'][0].shape)  # a torch.Tensor
+         torch.save(embedding['cross_modal_embeddings'][0], img['tensor_path'] + '.pt')
+ 
s4_calculate_distance.py ADDED
@@ -0,0 +1,83 @@
+ import numpy as np
+ from numpy.linalg import norm
+ import torch
+ from IPython.display import display
+ import cv2
+ 
+ url1 = 'http://farm3.staticflickr.com/2519/4126738647_cc436c111b_z.jpg'
+ cap1 = 'A motorcycle sits parked across from a herd of livestock'
+ 
+ url2 = 'http://farm3.staticflickr.com/2046/2003879022_1b4b466d1d_z.jpg'
+ cap2 = 'Motorcycle on platform to be worked on in garage'
+ 
+ url3 = 'https://i.natgeofe.com/n/548467d8-c5f1-4551-9f58-6817a8d2c45e/NationalGeographic_2572187_3x2.jpg'
+ cap3 = 'a cat laying down stretched out near a laptop'
+ 
+ img1 = {
+     'flickr_url': url1,
+     'caption': cap1,
+     'image_path': './shared_data/motorcycle_1.jpg',
+     'tensor_path': './shared_data/motorcycle_1'
+ }
+ 
+ img2 = {
+     'flickr_url': url2,
+     'caption': cap2,
+     'image_path': './shared_data/motorcycle_2.jpg',
+     'tensor_path': './shared_data/motorcycle_2'
+ }
+ 
+ img3 = {
+     'flickr_url': url3,
+     'caption': cap3,
+     'image_path': './shared_data/cat_1.jpg',
+     'tensor_path': './shared_data/cat_1'
+ }
+ 
+ def load_tensor(path):
+     return torch.load(path)
+ 
+ def load_embeddings():
+     ex1_embed = load_tensor(img1['tensor_path'] + '.pt')
+     ex2_embed = load_tensor(img2['tensor_path'] + '.pt')
+     ex3_embed = load_tensor(img3['tensor_path'] + '.pt')
+     return ex1_embed.data.numpy(), ex2_embed.data.numpy(), ex3_embed.data.numpy()
+ 
+ def cosine_similarity(vec1, vec2):
+     similarity = np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))
+     return similarity
+ 
+ def calculate_cosine_similarity():
+     ex1_embed, ex2_embed, ex3_embed = load_embeddings()
+     similarity1 = cosine_similarity(ex1_embed, ex2_embed)
+     similarity2 = cosine_similarity(ex1_embed, ex3_embed)
+     similarity3 = cosine_similarity(ex2_embed, ex3_embed)
+     return [similarity1, similarity2, similarity3]
+ 
+ def calculate_euclidean_distance():
+     ex1_embed, ex2_embed, ex3_embed = load_embeddings()
+     distance1 = cv2.norm(ex1_embed, ex2_embed, cv2.NORM_L2)
+     distance2 = cv2.norm(ex1_embed, ex3_embed, cv2.NORM_L2)
+     distance3 = cv2.norm(ex2_embed, ex3_embed, cv2.NORM_L2)
+     return [distance1, distance2, distance3]
+ 
+ def show_cosine_similarity():
+     similarities = calculate_cosine_similarity()
+     print("Cosine similarity between ex1_embed and ex2_embed is:")
+     display(similarities[0])
+     print("Cosine similarity between ex1_embed and ex3_embed is:")
+     display(similarities[1])
+     print("Cosine similarity between ex2_embed and ex3_embed is:")
+     display(similarities[2])
+ 
+ def show_euclidean_distance():
+     distances = calculate_euclidean_distance()
+     print("Euclidean distance between ex1_embed and ex2_embed is:")
+     display(distances[0])
+     print("Euclidean distance between ex1_embed and ex3_embed is:")
+     display(distances[1])
+     print("Euclidean distance between ex2_embed and ex3_embed is:")
+     display(distances[2])
+ 
+ show_cosine_similarity()
+ show_euclidean_distance()
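The two metrics computed above are simple enough to verify by hand. A dependency-free sketch of both (`cosine_similarity` and `euclidean_distance` here are pure-Python stand-ins for the numpy/cv2 versions in the file):

```python
import math

def cosine_similarity(v1, v2):
    # dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

def euclidean_distance(v1, v2):
    # straight-line (L2) distance between the two vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

a, b = [1.0, 0.0], [0.0, 1.0]
print(cosine_similarity(a, b))   # orthogonal vectors -> 0.0
print(euclidean_distance(a, b))  # -> sqrt(2)
```

Cosine similarity ignores vector magnitude (it only measures angle), which is why it is the usual choice for comparing embeddings; Euclidean distance is magnitude-sensitive.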
s5-how-to-umap.py ADDED
@@ -0,0 +1,137 @@
+ from IPython.display import display
+ from umap import UMAP
+ from sklearn.preprocessing import MinMaxScaler
+ import pandas as pd
+ from tqdm import tqdm
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from s3_data_to_vector_embedding import bt_embeddings_from_local
+ import random
+ import numpy as np
+ from datasets import load_dataset
+ 
+ # prompt templates
+ templates = [
+     'a picture of {}',
+     'an image of {}',
+     'a nice {}',
+     'a beautiful {}',
+ ]
+ 
+ # prepare a list of image-text pairs from the first [test_size] examples
+ def data_prep(hf_dataset_name, class_name, templates=templates, test_size=1000):
+     # streaming the dataset would avoid downloading it up front:
+     # dataset = load_dataset(hf_dataset_name, trust_remote_code=True, split='train', streaming=True)
+     dataset = load_dataset(hf_dataset_name)
+     # split the dataset with the given test_size
+     train_test_dataset = dataset['train'].train_test_split(test_size=test_size)
+     test_dataset = train_test_dataset['test']
+     print(test_dataset)
+     # build image-text pairs from the test split,
+     # filling a randomly chosen template with the class name
+     img_txt_pairs = []
+     for i in range(len(test_dataset)):
+         img_txt_pairs.append({
+             'caption': templates[random.randint(0, len(templates) - 1)].format(class_name),
+             'pil_img': test_dataset[i]['image']
+         })
+     return img_txt_pairs
+ 
+ 
+ def load_all_dataset():
+     car_img_txt_pairs = data_prep("tanganke/stanford_cars", 'car', test_size=50)
+     cat_img_txt_pairs = data_prep("yashikota/cat-image-dataset", 'cat', test_size=50)
+     return cat_img_txt_pairs, car_img_txt_pairs
+ 
+ 
+ # compute BridgeTower embeddings for the cat and car image-text pairs
+ def load_cat_and_car_embeddings():
+     # prepare image-text pairs
+     cat_img_txt_pairs, car_img_txt_pairs = load_all_dataset()
+ 
+     def load_embedding(img_txt_pair):
+         pil_img = img_txt_pair['pil_img']
+         caption = img_txt_pair['caption']
+         return bt_embeddings_from_local(caption, pil_img)
+ 
+     def load_all_embeddings_from_image_text_pairs(img_txt_pairs):
+         embeddings = []
+         for img_txt_pair in tqdm(img_txt_pairs, total=len(img_txt_pairs)):
+             embedding = load_embedding(img_txt_pair)
+             # detach the cross-modal embedding and convert it to numpy
+             cross_modal_embedding = embedding['cross_modal_embeddings'][0].detach().numpy()
+             embeddings.append(cross_modal_embedding)
+         # stack into a (num_pairs, embedding_dim) array
+         return np.stack(embeddings)
+ 
+     cat_embeddings = load_all_embeddings_from_image_text_pairs(cat_img_txt_pairs)
+     car_embeddings = load_all_embeddings_from_image_text_pairs(car_img_txt_pairs)
+     return cat_embeddings, car_embeddings
+ 
+ 
+ # transform high-dimensional vectors into 2D vectors using UMAP
+ def dimensionality_reduction(embeddings, labels):
+     # scale each embedding dimension to [0, 1] feature-wise
+     # (reshape(-1, 1) would wrongly flatten and scale all dimensions jointly)
+     X_scaled = MinMaxScaler().fit_transform(embeddings)
+     mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
+     df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
+     df_emb["label"] = labels
+     print(df_emb)
+     return df_emb
+ 
+ 
+ def show_umap_visualization():
+     def reduce_dimensions():
+         cat_embeddings, car_embeddings = load_cat_and_car_embeddings()
+         # stack the cat and car embeddings into one numpy array
+         all_embeddings = np.concatenate([cat_embeddings, car_embeddings])
+ 
+         # prepare labels for the two classes
+         labels = ['cat'] * len(cat_embeddings) + ['car'] * len(car_embeddings)
+ 
+         # compute the dimensionality reduction
+         return dimensionality_reduction(all_embeddings, labels)
+ 
+     reduced_dim_emb = reduce_dimensions()
+     # plot the 2D embeddings coloured by class
+     fig, ax = plt.subplots(figsize=(8, 6))
+     sns.set_style("whitegrid", {'axes.grid': False})
+     sns.scatterplot(data=reduced_dim_emb,
+                     x="X",
+                     y="Y",
+                     hue='label',
+                     palette='bright',
+                     ax=ax)
+     sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1))
+     plt.title('Scatter plot of images of cats and cars using UMAP')
+     plt.xlabel('X')
+     plt.ylabel('Y')
+     plt.show()
+ 
+ 
+ def an_example_of_cat_and_car_pair_data():
+     cat_img_txt_pairs, car_img_txt_pairs = load_all_dataset()
+     # display an example of a cat image-text pair
+     display(cat_img_txt_pairs[0]['caption'])
+     display(cat_img_txt_pairs[0]['pil_img'])
+ 
+     # display an example of a car image-text pair
+     display(car_img_txt_pairs[0]['caption'])
+     display(car_img_txt_pairs[0]['pil_img'])
+ 
+ 
+ if __name__ == '__main__':
+     show_umap_visualization()
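One subtle bug in the original UMAP step was scaling with `reshape(-1, 1)`, which flattens every embedding into a single column and scales all dimensions jointly. Min-max scaling should run per feature. A pure-Python sketch of the correct behaviour (`min_max_scale` is a hypothetical stand-in for `MinMaxScaler` on a 2-D samples-by-dimensions array):

```python
def min_max_scale(rows):
    # scale each column (feature) of `rows` to [0, 1] independently,
    # matching what MinMaxScaler does with a 2-D input;
    # a column with no spread maps to 0.0
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0
         for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]

print(min_max_scale([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]))
# -> [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Feature-wise scaling keeps each embedding dimension comparable without letting large-magnitude dimensions dominate the UMAP distance computation.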
s6_prepare_video_input.py ADDED
@@ -0,0 +1,90 @@
+ from pathlib import Path
+ import os
+ from os import path as osp
+ import whisper
+ from moviepy import VideoFileClip
+ from PIL import Image
+ from utility import download_video, extract_meta_data, get_transcript_vtt, getSubs
+ from urllib.request import urlretrieve
+ from IPython.display import display
+ import ollama
+ 
+ def demo_video_input_that_has_transcript():
+     # first video's url
+     vid_url = "https://www.youtube.com/watch?v=7Hcg-rLYwdM"
+ 
+     # download the YouTube video to ./shared_data/videos/video1
+     vid_dir = "./shared_data/videos/video1"
+     vid_filepath = download_video(vid_url, vid_dir)
+ 
+     # download the YouTube video's subtitles to ./shared_data/videos/video1
+     vid_transcript_filepath = get_transcript_vtt(vid_url, vid_dir)
+ 
+     return extract_meta_data(vid_dir, vid_filepath, vid_transcript_filepath)
+ 
+ def demo_video_input_that_has_no_transcript():
+     # second video's url
+     vid_url = (
+         "https://multimedia-commons.s3-us-west-2.amazonaws.com/"
+         "data/videos/mp4/010/a07/010a074acb1975c4d6d6e43c1faeb8.mp4"
+     )
+     vid_dir = "./shared_data/videos/video2"
+     vid_name = "toddler_in_playground.mp4"
+ 
+     # create the folder to which video2 will be downloaded
+     Path(vid_dir).mkdir(parents=True, exist_ok=True)
+     vid_filepath = urlretrieve(
+         vid_url,
+         osp.join(vid_dir, vid_name)
+     )[0]
+ 
+     path_to_video_no_transcript = vid_filepath
+ 
+     # declare where to save the .mp3 audio
+     path_to_extracted_audio_file = os.path.join(vid_dir, 'audio.mp3')
+ 
+     # extract the mp3 audio track from the mp4 video file
+     clip = VideoFileClip(path_to_video_no_transcript)
+     clip.audio.write_audiofile(path_to_extracted_audio_file)
+ 
+     # transcribe (and translate to English) the audio with Whisper
+     model = whisper.load_model("small")
+     options = dict(task="translate", best_of=1, language='en')
+     results = model.transcribe(path_to_extracted_audio_file, **options)
+ 
+     vtt = getSubs(results["segments"], "vtt")
+ 
+     # path to save the generated transcript of video2
+     path_to_generated_trans = osp.join(vid_dir, 'generated_video2.vtt')
+     # write the transcription to file
+     with open(path_to_generated_trans, 'w') as f:
+         f.write(vtt)
+ 
+     return extract_meta_data(vid_dir, vid_filepath, path_to_generated_trans)
+ 
+ 
+ def ask_llvm(instruction, file_path):
+     result = ollama.generate(
+         model='llava',
+         prompt=instruction,
+         images=[file_path],
+         stream=False
+     )['response']
+     img = Image.open(file_path, mode='r')
+     img = img.resize([int(i / 1.2) for i in img.size])
+     display(img)
+     print(result)
+ 
+ 
+ if __name__ == "__main__":
+     meta_data = demo_video_input_that_has_transcript()
+ 
+     meta_data1 = demo_video_input_that_has_no_transcript()
+     data = meta_data1[1]
+     caption = data['transcript']
+     print(f'Generated caption is: "{caption}"')
+     frame = Image.open(data['extracted_frame_path'])
+     display(frame)
+     instruction = "Can you describe the image?"
+     ask_llvm(instruction, data['extracted_frame_path'])
+     # print(meta_data)
+ 
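The `.vtt` transcripts produced here are keyed by timestamps like `00:01:05.500`, and downstream code needs them as milliseconds to align captions with extracted frames. A minimal sketch of the conversion, mirroring the arithmetic of `str2time` in `utility.py` (`vtt_time_to_ms` is a hypothetical name):

```python
def vtt_time_to_ms(strtime):
    # parse 'HH:MM:SS.mmm' (optionally quoted) into hours, minutes,
    # fractional seconds, then convert the total to milliseconds
    hrs, mins, secs = [float(c) for c in strtime.strip('"').split(':')]
    return (hrs * 3600 + mins * 60 + secs) * 1000

print(vtt_time_to_ms("00:01:05.500"))  # -> 65500.0
```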
s7_store_in_rag.py ADDED
@@ -0,0 +1,105 @@
+ from mm_rag.embeddings.bridgetower_embeddings import (
+     BridgeTowerEmbeddings
+ )
+ from mm_rag.vectorstores.multimodal_lancedb import MultimodalLanceDB
+ import lancedb
+ from utility import load_json_file, display_retrieved_results
+ 
+ # declare host file
+ LANCEDB_HOST_FILE = "./shared_data/.lancedb"
+ # declare table name
+ TBL_NAME = "test_tbl"
+ # initialize the vector store connection
+ db = lancedb.connect(LANCEDB_HOST_FILE)
+ # initialize a BridgeTower embedder
+ embedder = BridgeTowerEmbeddings()
+ 
+ 
+ def return_top_k_most_similar_docs(max_docs=3):
+     # return the top 3 most similar documents by default
+     # create a LanceDB vector store over the existing table
+     vectorstore = MultimodalLanceDB(
+         uri=LANCEDB_HOST_FILE,
+         embedding=embedder,
+         table_name=TBL_NAME)
+ 
+     # create a retriever for the vector store;
+     # search_type="similarity" makes the retriever perform similarity search,
+     # and search_kwargs={"k": max_docs} returns the top max_docs documents
+     retriever = vectorstore.as_retriever(
+         search_type='similarity',
+         search_kwargs={"k": max_docs})
+ 
+     query2 = (
+         "an astronaut's spacewalk "
+         "with an amazing view of the earth from space behind"
+     )
+     results2 = retriever.invoke(query2)
+     display_retrieved_results(results2)
+ 
+     query3 = "a group of astronauts"
+     results3 = retriever.invoke(query3)
+     display_retrieved_results(results3)
+ 
+ 
+ def open_table(tbl_name):
+     # open a connection to the table
+     tbl = db.open_table(tbl_name)
+     print(f"There are {tbl.to_pandas().shape[0]} rows in the table")
+     # display the first 3 rows of the table
+     print(tbl.to_pandas()[['text', 'image_path']].head(3))
+ 
+ 
+ def store_in_rag():
+     # load the metadata files
+     vid1_metadata_path = './shared_data/videos/video1/metadatas.json'
+     vid2_metadata_path = './shared_data/videos/video2/metadatas.json'
+     vid1_metadata = load_json_file(vid1_metadata_path)
+     vid2_metadata = load_json_file(vid2_metadata_path)
+ 
+     # collect transcripts and image paths
+     vid1_trans = [vid['transcript'] for vid in vid1_metadata]
+     vid1_img_path = [vid['extracted_frame_path'] for vid in vid1_metadata]
+ 
+     vid2_trans = [vid['transcript'] for vid in vid2_metadata]
+     vid2_img_path = [vid['extracted_frame_path'] for vid in vid2_metadata]
+ 
+     # for video1, join each transcript segment with its neighbours (n = 7)
+     n = 7
+     updated_vid1_trans = [
+         ' '.join(vid1_trans[i - int(n / 2): i + int(n / 2)]) if i - int(n / 2) >= 0 else
+         ' '.join(vid1_trans[0: i + int(n / 2)]) for i in range(len(vid1_trans))
+     ]
+ 
+     # also write the updated transcripts back into the metadata
+     for i in range(len(updated_vid1_trans)):
+         vid1_metadata[i]['transcript'] = updated_vid1_trans[i]
+ 
+     # pass mode="append" to add more entries to the vector store;
+     # pass mode="overwrite" to start with a fresh vector store
+     _ = MultimodalLanceDB.from_text_image_pairs(
+         texts=updated_vid1_trans + vid2_trans,
+         image_paths=vid1_img_path + vid2_img_path,
+         embedding=embedder,
+         metadatas=vid1_metadata + vid2_metadata,
+         connection=db,
+         table_name=TBL_NAME,
+         mode="overwrite",
+     )
+ 
+ 
+ if __name__ == "__main__":
+     # build the table first so the queries below have something to search
+     store_in_rag()
+     open_table(TBL_NAME)
+     return_top_k_most_similar_docs()
upload_huggingface.py ADDED
@@ -0,0 +1,8 @@
+ from huggingface_hub import HfApi
+ 
+ api = HfApi()
+ api.upload_large_folder(
+     repo_id="88hours/hg_demo",
+     repo_type="space",
+     folder_path="./",
+ )
utility.py ADDED
@@ -0,0 +1,693 @@
1
+ # Add your utilities or helper functions to this file.
2
+
3
+ import os
4
+ from pathlib import Path
5
+ from dotenv import load_dotenv, find_dotenv
6
+ from io import StringIO, BytesIO
7
+ import textwrap
8
+ from typing import Iterator, TextIO, List, Dict, Any, Optional, Sequence, Union
9
+ from enum import auto, Enum
10
+ import base64
11
+ import glob
12
+ import requests
13
+ from tqdm import tqdm
14
+ from pytubefix import YouTube, Stream
15
+ import webvtt
16
+ from youtube_transcript_api import YouTubeTranscriptApi
17
+ from youtube_transcript_api.formatters import WebVTTFormatter
18
+ from predictionguard import PredictionGuard
19
+ import cv2
20
+ import json
21
+ import PIL
22
+ from ollama import chat
23
+ from ollama import ChatResponse
24
+ from PIL import Image
25
+ import dataclasses
26
+ import random
27
+ from datasets import load_dataset
28
+ from os import path as osp
29
+ from IPython.display import display
30
+ from langchain_core.prompt_values import PromptValue
31
+ from langchain_core.messages import (
32
+ MessageLikeRepresentation,
33
+ )
34
+
35
+ MultimodalModelInput = Union[PromptValue, str, Sequence[MessageLikeRepresentation], Dict[str, Any]]
36
+
37
+ def get_from_dict_or_env(
38
+ data: Dict[str, Any], key: str, env_key: str, default: Optional[str] = None
39
+ ) -> str:
40
+ """Get a value from a dictionary or an environment variable."""
41
+ if key in data and data[key]:
42
+ return data[key]
43
+ else:
44
+ return get_from_env(key, env_key, default=default)
45
+
46
+ def get_from_env(key: str, env_key: str, default: Optional[str] = None) -> str:
47
+ """Get a value from a dictionary or an environment variable."""
48
+ if env_key in os.environ and os.environ[env_key]:
49
+ return os.environ[env_key]
50
+ else:
51
+ return default
52
+
53
+ def load_env():
54
+ _ = load_dotenv(find_dotenv())
55
+
56
+ def get_openai_api_key():
57
+ load_env()
58
+ openai_api_key = os.getenv("OPENAI_API_KEY")
59
+ return openai_api_key
60
+
61
+ def get_prediction_guard_api_key():
62
+ load_env()
63
+ PREDICTION_GUARD_API_KEY = os.getenv("PREDICTION_GUARD_API_KEY", None)
64
+ if PREDICTION_GUARD_API_KEY is None:
65
+ PREDICTION_GUARD_API_KEY = input("Please enter your Prediction Guard API Key: ")
66
+ return PREDICTION_GUARD_API_KEY
67
+
68
+ PREDICTION_GUARD_URL_ENDPOINT = os.getenv("DLAI_PREDICTION_GUARD_URL_ENDPOINT", "https://dl-itdc.predictionguard.com") ###"https://proxy-dl-itdc.predictionguard.com"
69
+
70
+ # prompt templates
71
+ templates = [
72
+ 'a picture of {}',
73
+ 'an image of {}',
74
+ 'a nice {}',
75
+ 'a beautiful {}',
76
+ ]
77
+
78
+ # function helps to prepare list image-text pairs from the first [test_size] data of a Huggingface dataset
79
+ def prepare_dataset_for_umap_visualization(hf_dataset, class_name, templates=templates, test_size=1000):
80
+ # load Huggingface dataset (download if needed)
81
+ dataset = load_dataset(hf_dataset, trust_remote_code=True)
82
+ # split dataset with specific test_size
83
+ train_test_dataset = dataset['train'].train_test_split(test_size=test_size)
84
+ # get the test dataset
85
+ test_dataset = train_test_dataset['test']
86
+ img_txt_pairs = []
87
+ for i in range(len(test_dataset)):
88
+ img_txt_pairs.append({
89
+ 'caption' : templates[random.randint(0, len(templates)-1)].format(class_name),
90
+ 'pil_img' : test_dataset[i]['image']
91
+ })
92
+ return img_txt_pairs
93
+
94
+
95
+ def download_video(video_url, path='/tmp/'):
96
+ print(f'Getting video information for {video_url}')
97
+ if not video_url.startswith('http'):
98
+ return os.path.join(path, video_url)
99
+
100
+ filepath = glob.glob(os.path.join(path, '*.mp4'))
101
+ if len(filepath) > 0:
102
+ print('Video already downloaded')
103
+ return filepath[0]
104
+
105
+ def progress_callback(stream: Stream, data_chunk: bytes, bytes_remaining: int) -> None:
106
+ pbar.update(len(data_chunk))
107
+
108
+ yt = YouTube(video_url, on_progress_callback=progress_callback)
109
+ stream = yt.streams.filter(progressive=True, file_extension='mp4', res='480p').desc().first()
110
+ if stream is None:
111
+ stream = yt.streams.filter(progressive=True, file_extension='mp4').order_by('resolution').desc().first()
112
+ if not os.path.exists(path):
113
+ os.makedirs(path)
114
+ filename = stream.default_filename.replace(' ', '_')
115
+ filepath = os.path.join(path, filename)
116
+
117
+ if not os.path.exists(filepath):
118
+ print('Downloading video from YouTube...')
119
+ pbar = tqdm(desc='Downloading video from YouTube', total=stream.filesize, unit="bytes")
120
+ stream.download(path, filename=filename)
121
+ pbar.close()
122
+ return filepath
123
+
124
+ def get_video_id_from_url(video_url):
125
+ """
126
+ Examples:
127
+ - http://youtu.be/SA2iWivDJiE
128
+ - http://www.youtube.com/watch?v=_oPAwA_Udwc&feature=feedu
129
+ - http://www.youtube.com/embed/SA2iWivDJiE
130
+ - http://www.youtube.com/v/SA2iWivDJiE?version=3&amp;hl=en_US
131
+ """
132
+ import urllib.parse
133
+ url = urllib.parse.urlparse(video_url)
134
+ if url.hostname == 'youtu.be':
135
+ return url.path[1:]
136
+ if url.hostname in ('www.youtube.com', 'youtube.com'):
137
+ if url.path == '/watch':
138
+ p = urllib.parse.parse_qs(url.query)
139
+ return p['v'][0]
140
+ if url.path[:7] == '/embed/':
141
+ return url.path.split('/')[2]
142
+ if url.path[:3] == '/v/':
143
+ return url.path.split('/')[2]
144
+
145
+ return video_url
146
+
147
+ # if this has transcript then download
148
+ def get_transcript_vtt(video_url, path='/tmp'):
149
+ video_id = get_video_id_from_url(video_url)
150
+ filepath = os.path.join(path,'captions.vtt')
151
+ if os.path.exists(filepath):
152
+ return filepath
153
+
154
+ transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=['en-GB', 'en'])
155
+ formatter = WebVTTFormatter()
156
+ webvtt_formatted = formatter.format_transcript(transcript)
157
+
158
+ with open(filepath, 'w', encoding='utf-8') as webvtt_file:
159
+ webvtt_file.write(webvtt_formatted)
160
+ webvtt_file.close()
161
+
162
+ return filepath
163
+
164
+
165
+ # helper to convert a time in seconds into the timestamp format used by .vtt and .srt files
+ def format_timestamp(seconds: float, always_include_hours: bool = False, fractionalSeperator: str = '.'):
+     assert seconds >= 0, "non-negative timestamp expected"
+     milliseconds = round(seconds * 1000.0)
+
+     hours = milliseconds // 3_600_000
+     milliseconds -= hours * 3_600_000
+
+     minutes = milliseconds // 60_000
+     milliseconds -= minutes * 60_000
+
+     seconds = milliseconds // 1_000
+     milliseconds -= seconds * 1_000
+
+     hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
+     return f"{hours_marker}{minutes:02d}:{seconds:02d}{fractionalSeperator}{milliseconds:03d}"
+
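For example, `format_timestamp` renders 3661.5 seconds as an hour-qualified VTT timestamp, and switches to the comma separator for SRT (standalone copy of the helper, for illustration):

```python
# Standalone copy of format_timestamp, for illustration only.
def format_timestamp(seconds, always_include_hours=False, fractionalSeperator='.'):
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)
    hours = milliseconds // 3_600_000
    milliseconds -= hours * 3_600_000
    minutes = milliseconds // 60_000
    milliseconds -= minutes * 60_000
    seconds = milliseconds // 1_000
    milliseconds -= seconds * 1_000
    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{seconds:02d}{fractionalSeperator}{milliseconds:03d}"

print(format_timestamp(3661.5))   # 01:01:01.500
print(format_timestamp(5.0))      # 00:05.000  (hours omitted when zero)
print(format_timestamp(5.0, always_include_hours=True,
                       fractionalSeperator=','))  # 00:00:05,000  (SRT style)
```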
+ # helper that converts a `webvtt` timestamp string into a time in milliseconds
+ def str2time(strtime):
+     # strip surrounding double quotes, if any
+     strtime = strtime.strip('"')
+     # split the timestamp into hours, minutes, and seconds
+     hrs, mins, seconds = [float(c) for c in strtime.split(':')]
+     # convert to total milliseconds
+     total_seconds = hrs * 60**2 + mins * 60 + seconds
+     total_milliseconds = total_seconds * 1000
+     return total_milliseconds
+
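`str2time` is the inverse direction: it turns an `HH:MM:SS.mmm` cue timestamp back into milliseconds, which is how the frame extractor further down picks the midpoint of each transcript segment. A standalone sketch:

```python
# Standalone copy of str2time, for illustration only.
def str2time(strtime):
    hrs, mins, seconds = [float(c) for c in strtime.strip('"').split(':')]
    return (hrs * 3600 + mins * 60 + seconds) * 1000

start_ms = str2time('00:00:05.000')
end_ms = str2time('00:01:30.500')
print(start_ms, end_ms)         # 5000.0 90500.0
print((start_ms + end_ms) / 2)  # segment midpoint: 47750.0
```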
+ # wrap a caption to maxLineWidth characters per line (no-op when the width is None or negative)
+ def _processText(text: str, maxLineWidth=None):
+     if maxLineWidth is None or maxLineWidth < 0:
+         return text
+
+     lines = textwrap.wrap(text, width=maxLineWidth, tabsize=4)
+     return '\n'.join(lines)
+
+ # resize an image while maintaining its aspect ratio
+ def maintain_aspect_ratio_resize(image, width=None, height=None, inter=cv2.INTER_AREA):
+     # grab the current image size
+     (h, w) = image.shape[:2]
+
+     # return the original image if no resizing is requested
+     if width is None and height is None:
+         return image
+
+     if width is None:
+         # calculate the ratio from the target height and construct the dimensions
+         r = height / float(h)
+         dim = (int(w * r), height)
+     else:
+         # calculate the ratio from the target width and construct the dimensions
+         r = width / float(w)
+         dim = (width, int(h * r))
+
+     # return the resized image
+     return cv2.resize(image, dim, interpolation=inter)
+
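The dimension arithmetic can be checked without OpenCV; this sketch mirrors only the ratio logic (`target_dims` is a hypothetical helper name, not part of the module):

```python
def target_dims(h, w, width=None, height=None):
    # mirrors the ratio logic of maintain_aspect_ratio_resize
    if width is None and height is None:
        return (w, h)
    if width is None:
        r = height / float(h)
        return (int(w * r), height)
    r = width / float(w)
    return (width, int(h * r))

# a 1280x720 frame resized to the height=350 used by the frame extractors below
print(target_dims(720, 1280, height=350))  # (622, 350)
# the same frame resized to width=640 instead
print(target_dims(720, 1280, width=640))   # (640, 360)
```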
+ # helper to convert transcripts generated by Whisper into a .vtt file
+ def write_vtt(transcript: Iterator[dict], file: TextIO, maxLineWidth=None):
+     print("WEBVTT\n", file=file)
+     for segment in transcript:
+         text = _processText(segment['text'], maxLineWidth).replace('-->', '->')
+
+         print(
+             f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n"
+             f"{text}\n",
+             file=file,
+             flush=True,
+         )
+
+ # helper to convert transcripts generated by Whisper into a .srt file
+ def write_srt(transcript: Iterator[dict], file: TextIO, maxLineWidth=None):
+     """
+     Write a transcript to a file in SRT format.
+     Example usage:
+         from pathlib import Path
+         from whisper.utils import write_srt
+
+         result = transcribe(model, audio_path, temperature=temperature, **args)
+
+         # save SRT
+         audio_basename = Path(audio_path).stem
+         with open(Path(output_dir) / (audio_basename + ".srt"), "w", encoding="utf-8") as srt:
+             write_srt(result["segments"], file=srt)
+     """
+     for i, segment in enumerate(transcript, start=1):
+         text = _processText(segment['text'].strip(), maxLineWidth).replace('-->', '->')
+
+         # write the SRT cue: index, time range, then the text
+         print(
+             f"{i}\n"
+             f"{format_timestamp(segment['start'], always_include_hours=True, fractionalSeperator=',')} --> "
+             f"{format_timestamp(segment['end'], always_include_hours=True, fractionalSeperator=',')}\n"
+             f"{text}\n",
+             file=file,
+             flush=True,
+         )
+
+ # render whisper segments as a subtitle document in the requested format
+ def getSubs(segments: Iterator[dict], format: str, maxLineWidth: int = -1) -> str:
+     segmentStream = StringIO()
+
+     if format == 'vtt':
+         write_vtt(segments, file=segmentStream, maxLineWidth=maxLineWidth)
+     elif format == 'srt':
+         write_srt(segments, file=segmentStream, maxLineWidth=maxLineWidth)
+     else:
+         raise ValueError("Unknown format " + format)
+
+     segmentStream.seek(0)
+     return segmentStream.read()
+
+ # base64-encode an image given either a file path or a PIL Image
+ def encode_image(image_path_or_PIL_img):
+     if isinstance(image_path_or_PIL_img, PIL.Image.Image):
+         # this is a PIL image
+         buffered = BytesIO()
+         image_path_or_PIL_img.save(buffered, format="JPEG")
+         return base64.b64encode(buffered.getvalue()).decode('utf-8')
+     else:
+         # this is an image path
+         with open(image_path_or_PIL_img, "rb") as image_file:
+             return base64.b64encode(image_file.read()).decode('utf-8')
+
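The file-path branch boils down to reading raw bytes and base64-encoding them; a tiny self-contained sketch (the bytes here are a made-up stand-in for real JPEG data):

```python
import base64, os, tempfile

# write a few stand-in bytes to a temporary "image" file
with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as f:
    f.write(b'\xff\xd8\xff')  # a real JPEG starts with these marker bytes
    path = f.name

# the path branch of encode_image: read bytes, base64-encode, decode to str
with open(path, 'rb') as image_file:
    b64 = base64.b64encode(image_file.read()).decode('utf-8')
os.remove(path)

print(b64)  # /9j/  (the familiar prefix of any base64-encoded JPEG)
```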
+ # check whether the given string (or bytes) is valid base64
+ def isBase64(sb):
+     try:
+         if isinstance(sb, str):
+             # if there is any non-ASCII here, an exception is thrown and the function returns False
+             sb_bytes = bytes(sb, 'ascii')
+         elif isinstance(sb, bytes):
+             sb_bytes = sb
+         else:
+             raise ValueError("Argument must be string or bytes")
+         return base64.b64encode(base64.b64decode(sb_bytes)) == sb_bytes
+     except Exception:
+         return False
+
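`isBase64` relies on a decode/re-encode round trip: only canonical base64 survives it unchanged. A standalone sketch:

```python
import base64

# Standalone copy of isBase64, for illustration only.
def isBase64(sb):
    try:
        sb_bytes = bytes(sb, 'ascii') if isinstance(sb, str) else sb
        return base64.b64encode(base64.b64decode(sb_bytes)) == sb_bytes
    except Exception:
        return False

encoded = base64.b64encode(b'frame bytes').decode('utf-8')
print(isBase64(encoded))        # True
print(isBase64('not base64!'))  # False
```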
+ # base64-encode an image given either a local file path or a URL
+ def encode_image_from_path_or_url(image_path_or_url):
+     try:
+         # try to open the argument as a URL to check whether it is valid
+         urlopen(image_path_or_url)
+         # it is a URL: download the content and encode it
+         return base64.b64encode(requests.get(image_path_or_url).content).decode('utf-8')
+     except Exception:
+         # otherwise treat it as a path to a local image
+         with open(image_path_or_url, "rb") as image_file:
+             return base64.b64encode(image_file.read()).decode('utf-8')
+
+ # helper to compute the joint embedding of a prompt and a base64-encoded image through PredictionGuard
+ def bt_embedding_from_prediction_guard(prompt, base64_image):
+     # get PredictionGuard client
+     client = _getPredictionGuardClient()
+     message = {"text": prompt}
+     if base64_image is not None and base64_image != "":
+         if not isBase64(base64_image):
+             raise TypeError("image input must be in base64 encoding!")
+         message['image'] = base64_image
+     response = client.embeddings.create(
+         model="bridgetower-large-itm-mlm-itc",
+         input=[message]
+     )
+     return response['data'][0]['embedding']
+
+
+ def load_json_file(file_path):
+     # open the JSON file in read mode
+     with open(file_path, 'r') as file:
+         data = json.load(file)
+     return data
+
+ def display_retrieved_results(results):
+     print(f'There is/are {len(results)} retrieved result(s)')
+     print()
+     for i, res in enumerate(results):
+         print(f'The caption of the {i + 1}-th retrieved result is:\n"{res.page_content}"')
+         print()
+         print(res)
+         # display(Image.open(res.metadata['metadata']['extracted_frame_path']))
+         print("------------------------------------------------------------")
+
+ class SeparatorStyle(Enum):
+     """Different separator styles."""
+     SINGLE = auto()
+
+ @dataclasses.dataclass
+ class Conversation:
+     """A class that keeps all conversation history."""
+     system: str
+     roles: List[str]
+     messages: List[List[str]]
+     map_roles: Dict[str, str]
+     version: str = "Unknown"
+     sep_style: SeparatorStyle = SeparatorStyle.SINGLE
+     sep: str = "\n"
+
+     def _get_prompt_role(self, role):
+         if self.map_roles is not None and role in self.map_roles.keys():
+             return self.map_roles[role]
+         else:
+             return role
+
+     def _build_content_for_first_message_in_conversation(self, first_message: List[str]):
+         content = []
+         if len(first_message) != 2:
+             raise TypeError("First message in Conversation needs to include a prompt and a base64-encoded image!")
+
+         prompt, b64_image = first_message[0], first_message[1]
+
+         # handling prompt
+         if prompt is None:
+             raise TypeError("API does not support None prompt yet")
+         content.append({
+             "type": "text",
+             "text": prompt
+         })
+         if b64_image is None:
+             raise TypeError("API does not support text-only conversation yet")
+
+         # handling image
+         if not isBase64(b64_image):
+             raise TypeError("Image in Conversation's first message must be stored under base64 encoding!")
+
+         content.append({
+             "type": "image_url",
+             "image_url": {
+                 "url": b64_image,
+             }
+         })
+         return content
+
+     def _build_content_for_follow_up_messages_in_conversation(self, follow_up_message: List[str]):
+         if follow_up_message is not None and len(follow_up_message) > 1:
+             raise TypeError("Follow-up message in Conversation must not include an image!")
+
+         # handling text prompt
+         if follow_up_message is None or follow_up_message[0] is None:
+             raise TypeError("Follow-up message in Conversation must include exactly one text message")
+
+         text = follow_up_message[0]
+         return text
+
+     def get_message(self):
+         messages = self.messages
+         api_messages = []
+         for i, msg in enumerate(messages):
+             role, message_content = msg
+             if i == 0:
+                 # get content for the very first message in the conversation
+                 content = self._build_content_for_first_message_in_conversation(message_content)
+             else:
+                 # get content for a follow-up message in the conversation
+                 content = self._build_content_for_follow_up_messages_in_conversation(message_content)
+
+             api_messages.append({
+                 "role": role,
+                 "content": content,
+             })
+         return api_messages
+
+     # this method represents a multi-turn chat as a single-turn prompt string
+     def serialize_messages(self):
+         messages = self.messages
+         ret = ""
+         if self.sep_style == SeparatorStyle.SINGLE:
+             if self.system is not None and self.system != "":
+                 ret = self.system + self.sep
+             for i, (role, message) in enumerate(messages):
+                 role = self._get_prompt_role(role)
+                 if message:
+                     if isinstance(message, List):
+                         # keep the text prompt only
+                         message = message[0]
+                     if i == 0:
+                         # do not include the role at the beginning
+                         ret += message
+                     else:
+                         ret += role + ": " + message
+                     if i < len(messages) - 1:
+                         # avoid including a separator at the end of the serialized message
+                         ret += self.sep
+                 else:
+                     ret += role + ":"
+         else:
+             raise ValueError(f"Invalid style: {self.sep_style}")
+
+         return ret
+
+     def append_message(self, role, message):
+         if len(self.messages) == 0:
+             # data verification for the very first message
+             assert role == self.roles[0], f"the very first message in conversation must be from role {self.roles[0]}"
+             assert len(message) == 2, "the very first message in conversation must include both a prompt and an image"
+             prompt, image = message[0], message[1]
+             assert prompt is not None, "prompt must not be None"
+             assert isBase64(image), "image must be under base64 encoding"
+         else:
+             # data verification for a follow-up message
+             assert role in self.roles, f"the follow-up message must be from one of the roles {self.roles}"
+             assert len(message) == 1, "the follow-up message must consist of one text message only, no image"
+
+         self.messages.append([role, message])
+
+     def copy(self):
+         return Conversation(
+             system=self.system,
+             roles=self.roles,
+             messages=[[x, y] for x, y in self.messages],
+             version=self.version,
+             map_roles=self.map_roles,
+         )
+
+     def dict(self):
+         return {
+             "system": self.system,
+             "roles": self.roles,
+             "messages": [[x, y[0] if len(y) == 1 else y] for x, y in self.messages],
+             "version": self.version,
+         }
+
+ prediction_guard_llava_conv = Conversation(
+     system="",
+     roles=("user", "assistant"),
+     messages=[],
+     version="Prediction Guard LLaVA endpoint Conversation v0",
+     sep_style=SeparatorStyle.SINGLE,
+     map_roles={
+         "user": "USER",
+         "assistant": "ASSISTANT"
+     }
+ )
+
+ # get a PredictionGuard client
+ def _getPredictionGuardClient():
+     PREDICTION_GUARD_API_KEY = get_prediction_guard_api_key()
+     client = PredictionGuard(
+         api_key=PREDICTION_GUARD_API_KEY,
+         url=PREDICTION_GUARD_URL_ENDPOINT,
+     )
+     return client
+
+ # helper to call the chat completion endpoint of PredictionGuard given a prompt and an image
+ def lvlm_inference(prompt, image, max_tokens: int = 200, temperature: float = 0.95, top_p: float = 0.1, top_k: int = 10):
+     # prepare conversation
+     conversation = prediction_guard_llava_conv.copy()
+     conversation.append_message(conversation.roles[0], [prompt, image])
+     return lvlm_inference_with_conversation(conversation, max_tokens=max_tokens, temperature=temperature, top_p=top_p, top_k=top_k)
+
+ def lvlm_inference_with_conversation(conversation, max_tokens: int = 200, temperature: float = 0.95, top_p: float = 0.1, top_k: int = 10):
+     # get PredictionGuard client
+     client = _getPredictionGuardClient()
+     # get messages from the conversation
+     messages = conversation.get_message()
+     # call the chat completion endpoint at Prediction Guard
+     response = client.chat.completions.create(
+         model="llava-1.5-7b-hf",
+         messages=messages,
+         max_tokens=max_tokens,
+         temperature=temperature,
+         top_p=top_p,
+         top_k=top_k,
+     )
+     return response['choices'][-1]['message']['content']
+
+ def lvlm_inference_with_ollama(conversation, max_tokens: int = 200, temperature: float = 0.95, top_p: float = 0.1, top_k: int = 10):
+     # stream a chat completion from the local Ollama server;
+     # sampling parameters are passed through Ollama's `options` dict
+     stream = chat(
+         model="llava-1.5-7b-hf",
+         messages=conversation,
+         stream=True,
+         options={
+             "temperature": temperature,
+             "num_predict": max_tokens,
+             "top_p": top_p,
+             "top_k": top_k,
+         },
+     )
+
+     # concatenate the streamed chunks into a single response string
+     response_data = ''
+     for chunk in stream:
+         response_data += chunk['message']['content']
+
+     return response_data
+
+ # function `extract_and_save_frames_and_metadata`:
+ #   receives as input a video and its transcript
+ #   extracts and saves frames and their metadata
+ #   returns the extracted metadata
+ def extract_and_save_frames_and_metadata(
+         path_to_video,
+         path_to_transcript,
+         path_to_save_extracted_frames,
+         path_to_save_metadatas):
+
+     # metadatas will store the metadata of all extracted frames
+     metadatas = []
+
+     # load video using cv2
+     video = cv2.VideoCapture(path_to_video)
+     # load transcript using webvtt
+     trans = webvtt.read(path_to_transcript)
+
+     # iterate over each video segment specified in the transcript file
+     for idx, transcript in enumerate(trans):
+         # get the start and end times in milliseconds
+         start_time_ms = str2time(transcript.start)
+         end_time_ms = str2time(transcript.end)
+         # get the time in ms exactly in the middle of start time and end time
+         mid_time_ms = (end_time_ms + start_time_ms) / 2
+         # get the transcript text, replacing newline symbols with spaces
+         text = transcript.text.replace("\n", ' ')
+         # grab the frame at the middle time
+         video.set(cv2.CAP_PROP_POS_MSEC, mid_time_ms)
+         success, frame = video.read()
+         if success:
+             # if the frame is extracted successfully, resize it
+             image = maintain_aspect_ratio_resize(frame, height=350)
+             # save the frame as a JPEG file
+             img_fname = f'frame_{idx}.jpg'
+             img_fpath = osp.join(
+                 path_to_save_extracted_frames, img_fname
+             )
+             cv2.imwrite(img_fpath, image)
+
+             # prepare the metadata
+             metadata = {
+                 'extracted_frame_path': img_fpath,
+                 'transcript': text,
+                 'video_segment_id': idx,
+                 'video_path': path_to_video,
+                 'mid_time_ms': mid_time_ms,
+             }
+             metadatas.append(metadata)
+
+         else:
+             print(f"ERROR! Cannot extract frame: idx = {idx}")
+
+     # save metadata of all extracted frames
+     fn = osp.join(path_to_save_metadatas, 'metadatas.json')
+     with open(fn, 'w') as outfile:
+         json.dump(metadatas, outfile)
+     return metadatas
+
+ def extract_meta_data(vid_dir, vid_filepath, vid_transcript_filepath):
+     # output paths to save extracted frames and their metadata
+     extracted_frames_path = osp.join(vid_dir, 'extracted_frame')
+     metadatas_path = vid_dir
+
+     # create these output folders if they do not exist
+     Path(extracted_frames_path).mkdir(parents=True, exist_ok=True)
+     Path(metadatas_path).mkdir(parents=True, exist_ok=True)
+
+     # call the function to extract frames and metadata
+     metadatas = extract_and_save_frames_and_metadata(
+         vid_filepath,
+         vid_transcript_filepath,
+         extracted_frames_path,
+         metadatas_path,
+     )
+     return metadatas
+
+ # function `extract_and_save_frames_and_metadata_with_fps`:
+ #   receives as input a video
+ #   extracts frames at a fixed rate, captions them, and saves frames and metadata
+ #   returns the extracted metadata
+ def extract_and_save_frames_and_metadata_with_fps(
+         lvlm_prompt,
+         path_to_video,
+         path_to_save_extracted_frames,
+         path_to_save_metadatas,
+         num_of_extracted_frames_per_second=1):
+
+     # metadatas will store the metadata of all extracted frames
+     metadatas = []
+
+     # load video using cv2
+     video = cv2.VideoCapture(path_to_video)
+
+     # get the frames per second
+     fps = video.get(cv2.CAP_PROP_FPS)
+     # hop = the number of frames to pass before a frame is extracted
+     hop = round(fps / num_of_extracted_frames_per_second)
+     curr_frame = 0
+     idx = -1
+     while True:
+         # iterate over all frames
+         ret, frame = video.read()
+         if not ret:
+             break
+         if curr_frame % hop == 0:
+             idx = idx + 1
+
+             # resize the extracted frame
+             image = maintain_aspect_ratio_resize(frame, height=350)
+             # save the frame as a JPEG file
+             img_fname = f'frame_{idx}.jpg'
+             img_fpath = osp.join(
+                 path_to_save_extracted_frames,
+                 img_fname
+             )
+             cv2.imwrite(img_fpath, image)
+
+             # generate a caption using lvlm_inference
+             b64_image = encode_image(img_fpath)
+             caption = lvlm_inference(lvlm_prompt, b64_image)
+
+             # prepare the metadata
+             metadata = {
+                 'extracted_frame_path': img_fpath,
+                 'transcript': caption,
+                 'video_segment_id': idx,
+                 'video_path': path_to_video,
+             }
+             metadatas.append(metadata)
+         curr_frame += 1
+
+     # save metadata of all extracted frames
+     metadatas_path = osp.join(path_to_save_metadatas, 'metadatas.json')
+     with open(metadatas_path, 'w') as outfile:
+         json.dump(metadatas, outfile)
+     return metadatas
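The frame-sampling arithmetic above can be checked in isolation: with `hop = round(fps / n)`, every `hop`-th frame index is kept. A minimal sketch with a hypothetical 30 fps video:

```python
# hypothetical 30 fps video, 1 extracted frame per second
fps = 30.0
num_of_extracted_frames_per_second = 1
hop = round(fps / num_of_extracted_frames_per_second)

# indices of the frames that would be extracted from the first ~3 seconds
kept = [i for i in range(91) if i % hop == 0]
print(hop)   # 30
print(kept)  # [0, 30, 60, 90]
```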