| import streamlit as st | |
| from streamlit_extras.switch_page_button import switch_page | |
| st.title("MiniGemini") | |
| st.success("""[Original tweet](https://x.com/mervenoyann/status/1783864388249694520) (April 26, 2024)""", icon="ℹ️") | |
| st.markdown(""" """) | |
| st.markdown(""" | |
| MiniGemini is the coolest VLM, let's explain 🧶 | |
| """) | |
| st.markdown(""" """) | |
| st.image("pages/MiniGemini/image_1.jpeg", use_column_width=True) | |
| st.markdown(""" """) | |
| st.markdown("""MiniGemini is a vision language model that understands both images and text, and generates both text and an image that best fits the context! 🤯 | |
| """) | |
| st.markdown(""" """) | |
| st.image("pages/MiniGemini/image_2.jpeg", use_column_width=True) | |
| st.markdown(""" """) | |
| st.markdown(""" | |
| This model has two image encoders (one CNN and one ViT) running in parallel to capture the details in images. | |
| I saw the same design in DocOwl 1.5. | |
| Then it has a decoder that outputs text and also a prompt to be sent to SDXL for image generation (which works very well!)""") | |
| st.markdown(""" """) | |
| st.image("pages/MiniGemini/image_3.jpeg", use_column_width=True) | |
| st.markdown(""" """) | |
| st.markdown("""They adopt CLIP's ViT as the low-resolution visual embedding encoder and a CNN-based encoder for high-resolution image encoding (specifically, a pre-trained ConvNeXt) | |
| """) | |
| st.markdown(""" """) | |
| st.image("pages/MiniGemini/image_4.jpeg", use_column_width=True) | |
| st.markdown(""" """) | |
| st.markdown("""Thanks to the second encoder, it can grasp details in images, which also comes in handy for e.g. document tasks (but see the examples below, they are mind-blowing IMO) | |
| """) | |
| st.markdown(""" """) | |
| st.image("pages/MiniGemini/image_5.jpeg", use_column_width=True) | |
| st.markdown(""" """) | |
| st.markdown("""According to their reporting, the model performs very well across many benchmarks compared to LLaVA 1.5 and Gemini Pro | |
| """) | |
| st.markdown(""" """) | |
| st.image("pages/MiniGemini/image_6.png", use_column_width=True) | |
| st.markdown(""" """) | |
| st.info(""" | |
| Resources: | |
| - [Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models](https://huggingface.co/papers/2403.18814) | |
| by Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, Jiaya Jia | |
| (2024) | |
| - [GitHub](https://github.com/dvlab-research/MGM) | |
| - [Link to Model Repository](https://huggingface.co/YanweiLi/MGM-13B-HD)""", icon="📚") | |
| st.markdown(""" """) | |
| st.markdown(""" """) | |
| st.markdown(""" """) | |
| col1, col2, col3 = st.columns(3) | |
| with col1: | |
|     if st.button('Previous paper', use_container_width=True): | |
|         switch_page("UDOP") | |
| with col2: | |
|     if st.button('Home', use_container_width=True): | |
|         switch_page("Home") | |
| with col3: | |
|     if st.button('Next paper', use_container_width=True): | |
|         switch_page("ColPali") | |