manhteky123 committed
Commit 1762ba0 · verified · 1 parent: c802cc8

Upload 1462 files

This view is limited to 50 files because the commit contains too many changes.

Files changed (50):
  1. .gitattributes +48 -0
  2. Dockerfile +81 -103
  3. LAVIS/.github/workflows/docs.yaml +38 -0
  4. LAVIS/.gitignore +159 -0
  5. LAVIS/.pre-commit-config.yaml +26 -0
  6. LAVIS/CODEOWNERS +2 -0
  7. LAVIS/CODE_OF_CONDUCT.md +105 -0
  8. LAVIS/LICENSE.txt +14 -0
  9. LAVIS/MANIFEST.in +8 -0
  10. LAVIS/README.md +328 -0
  11. LAVIS/SECURITY.md +7 -0
  12. LAVIS/app/__init__.py +26 -0
  13. LAVIS/app/calculate_coco_features.py +87 -0
  14. LAVIS/app/caption.py +98 -0
  15. LAVIS/app/classification.py +216 -0
  16. LAVIS/app/dataset_browser.py +240 -0
  17. LAVIS/app/image_text_match.py +87 -0
  18. LAVIS/app/main.py +25 -0
  19. LAVIS/app/multimodal_search.py +230 -0
  20. LAVIS/app/multipage.py +41 -0
  21. LAVIS/app/text_localization.py +105 -0
  22. LAVIS/app/utils.py +81 -0
  23. LAVIS/app/vqa.py +63 -0
  24. LAVIS/assets/demo-6.png +3 -0
  25. LAVIS/dataset_card/avsd_dialogue.md +32 -0
  26. LAVIS/dataset_card/coco_caption.md +41 -0
  27. LAVIS/dataset_card/coco_retrieval.md +36 -0
  28. LAVIS/dataset_card/conceptual_captions.md +37 -0
  29. LAVIS/dataset_card/didemo_retrieval.md +36 -0
  30. LAVIS/dataset_card/flickr_retrieval.md +34 -0
  31. LAVIS/dataset_card/gqa.md +32 -0
  32. LAVIS/dataset_card/imgs/NLVR2.png +3 -0
  33. LAVIS/dataset_card/imgs/avsd_dialogue.png +3 -0
  34. LAVIS/dataset_card/imgs/coco_caption.png +3 -0
  35. LAVIS/dataset_card/imgs/conceptual_captions.png +3 -0
  36. LAVIS/dataset_card/imgs/didemo.png +3 -0
  37. LAVIS/dataset_card/imgs/flickr30k.png +3 -0
  38. LAVIS/dataset_card/imgs/gqa.png +3 -0
  39. LAVIS/dataset_card/imgs/msrvtt.png +3 -0
  40. LAVIS/dataset_card/imgs/msrvtt_qa.png +3 -0
  41. LAVIS/dataset_card/imgs/msvd_qa.png +3 -0
  42. LAVIS/dataset_card/imgs/nocaps.png +3 -0
  43. LAVIS/dataset_card/imgs/sbu_caption.png +3 -0
  44. LAVIS/dataset_card/imgs/snli_ve.png +3 -0
  45. LAVIS/dataset_card/imgs/vqav2.png +3 -0
  46. LAVIS/dataset_card/msrvtt_qa.md +44 -0
  47. LAVIS/dataset_card/msrvtt_retrieval.md +24 -0
  48. LAVIS/dataset_card/msvd_qa.md +45 -0
  49. LAVIS/dataset_card/nlvr2.md +42 -0
  50. LAVIS/dataset_card/nocaps.md +38 -0
.gitattributes CHANGED
@@ -34,3 +34,51 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  emo/train.json filter=lfs diff=lfs merge=lfs -text
+ LAVIS/assets/demo-6.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/avsd_dialogue.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/coco_caption.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/conceptual_captions.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/didemo.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/flickr30k.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/gqa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/msrvtt_qa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/msrvtt.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/msvd_qa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/NLVR2.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/nocaps.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/sbu_caption.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/snli_ve.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/vqav2.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/docs/_static/architecture.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/docs/_static/merlion.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/lavis/configs/datasets/discriminatory_reasoning/discriminatory_dataset/objaverse_discrn.json filter=lfs diff=lfs merge=lfs -text
+ LAVIS/lavis/models/clip_models/pics/CLIP.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/black-cat.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/cat-sofa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dog.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dog2.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/00.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/01.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/02.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/03.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/04.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dress-model.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/green-skirt.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/jacket-letter-s/jacket-letter-s.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/pink-dress.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/pink-dress/pink-dress.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/shein-jacket/shein-jacket.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/teaser-website.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip2/blip2_illustration.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/Caption.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/demo.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/Illustration.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/QuestionGeneration.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/instructblip/comparison.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/instructblip/showcase.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/pnp-vqa/pnp_vqa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/assets/architecture.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/assets/data.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/demo/examples/audio/110714_wren.wav filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/demo/examples/audio/Group_of_Dogs_Barking.wav filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/demo/examples/point_cloud/banana.glb filter=lfs diff=lfs merge=lfs -text
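Every added entry follows the `.gitattributes` syntax `pattern attribute…`, and lines of exactly this shape are what `git lfs track <pattern>` appends. As a rough illustration (hypothetical helper names, not part of the commit), such entries can be parsed to list which patterns are LFS-tracked:

```python
def parse_gitattributes(text):
    """Parse .gitattributes lines into (pattern, attributes) pairs.

    Attributes of the form `key=value` become dict entries; bare attributes
    like `text` / `-text` are stored as True / False (set / unset).
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        pattern, *attrs = line.split()
        parsed = {}
        for attr in attrs:
            if "=" in attr:
                key, value = attr.split("=", 1)
                parsed[key] = value
            elif attr.startswith("-"):
                parsed[attr[1:]] = False
            else:
                parsed[attr] = True
        entries.append((pattern, parsed))
    return entries


def lfs_patterns(text):
    """Return the patterns whose `filter` attribute is `lfs`."""
    return [p for p, a in parse_gitattributes(text) if a.get("filter") == "lfs"]
```

For example, feeding in `"*.zst filter=lfs diff=lfs merge=lfs -text"` yields the pattern `*.zst` with `filter` set to `"lfs"` and `text` unset.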
Dockerfile CHANGED
@@ -1,14 +1,26 @@
- # Use conda base image with Python
- FROM continuumio/miniconda3:latest
  # Set working directory
  WORKDIR /app

  # Install system dependencies
  RUN apt-get update && apt-get install -y \
  git \
  build-essential \
- libgl1-mesa-dev \
  libglib2.0-0 \
  libsm6 \
  libxext6 \
@@ -16,114 +28,80 @@ RUN apt-get update && apt-get install -y \
  libgomp1 \
  libgcc-s1 \
  ffmpeg \
- libgtk-3-dev \
- pkg-config \
- curl \
- wget \
- sudo \
  && rm -rf /var/lib/apt/lists/*

- # Create a non-root user for better security and compatibility with HF Spaces
- RUN useradd -m -u 1000 user && \
- echo "user ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

- # Create conda environment with Python 3.8
- RUN conda create --name emovit python=3.8 -y

- # Make RUN commands use the new environment
- SHELL ["conda", "run", "-n", "emovit", "/bin/bash", "-c"]
-
- # Clone EmoVIT repository
- RUN git clone https://github.com/aimmemotion/EmoVIT.git
-
- # Set working directory to EmoVIT
- WORKDIR /app/EmoVIT
-
- # Install requirements_lavis.txt
  RUN pip install --no-cache-dir --upgrade pip setuptools wheel
- RUN pip install --no-cache-dir -r requirements_lavis.txt
-
- # Install PyTorch with CUDA 11.8 support
- RUN pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
-
- # Try to install salesforce-lavis first
- RUN pip install salesforce-lavis || echo "salesforce-lavis installation failed, proceeding with manual installation"
-
- # If salesforce-lavis installation failed, proceed with manual installation
- # Clone LAVIS repository and install manually
- WORKDIR /app
- RUN git clone https://github.com/salesforce/LAVIS.git
-
- WORKDIR /app/LAVIS
-
- # Install open3d to resolve the import issue
- RUN pip install open3d --no-cache-dir

- # Install LAVIS in editable mode
- RUN pip install -e .
-
- # Move back to EmoVIT directory
- WORKDIR /app/EmoVIT
-
- # Create lib directory and copy lavis folder as mentioned in instructions
- RUN mkdir -p lib
- RUN cp -r /app/LAVIS/lavis lib/
-
- # Copy local files to override cloned files if needed
- COPY . /app/EmoVIT/
-
- # Set up cache directories with proper permissions
- RUN mkdir -p /app/.cache/huggingface/hub && \
- mkdir -p /app/.cache/torch && \
- mkdir -p /app/.cache/datasets && \
- mkdir -p /app/.cache/transformers && \
- chmod -R 777 /app && \
- chown -R user:user /app
-
- # Set environment variables to use /app/.cache instead of root cache
- ENV CONDA_DEFAULT_ENV=emovit
- ENV PATH="/opt/conda/envs/emovit/bin:$PATH"
- ENV PYTHONPATH=/app/EmoVIT
- ENV HF_HOME=/app/.cache/huggingface
- ENV TORCH_HOME=/app/.cache/torch
- ENV HF_DATASETS_CACHE=/app/.cache/datasets
- ENV HUGGINGFACE_HUB_CACHE=/app/.cache/huggingface/hub
- ENV TRANSFORMERS_CACHE=/app/.cache/transformers
- ENV XDG_CACHE_HOME=/app/.cache
-
- # Switch to non-root user for running the application
- USER user
-
- # Ensure conda works for the user
- RUN conda init bash
-
- # Activate the conda environment by default
- RUN echo "conda activate emovit" >> ~/.bashrc
-
- # Create necessary directories
- RUN mkdir -p static/css templates
-
- # Make start script executable if it exists
- RUN if [ -f start.sh ]; then chmod +x start.sh; fi
-
- # Expose ports for potential web applications
- EXPOSE 7860 8000 8080 8501

  # Health check
- HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:7860/health || exit 1

- # Debug check - verify installations
- RUN python -c "import torch, transformers, numpy; \
- print('Torch:', torch.__version__); \
- print('Transformers:', transformers.__version__); \
- print('NumPy:', numpy.__version__);"
-
- # Test LAVIS import
- RUN python -c "import lavis; print('LAVIS import OK')" || echo "LAVIS import failed"
-
- # Test open3d import
- RUN python -c "import open3d; print('Open3D import OK')" || echo "Open3D import failed"

- # Default command
- CMD ["conda", "run", "-n", "emovit", "bash", "-c", "if [ -f start.sh ]; then ./start.sh; else bash; fi"]
 
+ FROM python:3.8-slim

  # Set working directory
  WORKDIR /app

+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV PYTHONPATH=/app:/app/lib
+ ENV TRANSFORMERS_CACHE=/app/.cache/transformers
+ ENV HF_HOME=/app/.cache/huggingface
+ ENV TORCH_HOME=/app/.cache/torch
+ ENV HF_DATASETS_CACHE=/app/.cache/datasets
+ ENV HUGGINGFACE_HUB_CACHE=/app/.cache/huggingface/hub
+ ENV PORT=7860
+ ENV DEBIAN_FRONTEND=noninteractive
+
  # Install system dependencies
  RUN apt-get update && apt-get install -y \
  git \
+ wget \
+ curl \
  build-essential \
+ libgl1-mesa-glx \
  libglib2.0-0 \
  libsm6 \
  libxext6 \
  libgomp1 \
  libgcc-s1 \
  ffmpeg \
  && rm -rf /var/lib/apt/lists/*

+ # Create cache directories
+ RUN mkdir -p /app/.cache/transformers \
+ && mkdir -p /app/.cache/huggingface/hub \
+ && mkdir -p /app/.cache/torch \
+ && mkdir -p /app/.cache/datasets \
+ && chmod -R 777 /app/.cache

+ # Copy requirements files first for better caching
+ COPY requirements_lavis.txt /app/
+ COPY requirements_emo.txt /app/

+ # Install Python dependencies
  RUN pip install --no-cache-dir --upgrade pip setuptools wheel

+ # Install PyTorch first (crucial for proper dependency resolution)
+ RUN pip install --no-cache-dir torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cpu
+
+ # Install other dependencies
+ RUN pip install --no-cache-dir -r requirements_lavis.txt || true
+ RUN pip install --no-cache-dir -r requirements_emo.txt || true
+
+ # Install additional dependencies for web app
+ RUN pip install --no-cache-dir \
+ flask \
+ pillow \
+ numpy \
+ opencv-python-headless \
+ transformers==4.28.0 \
+ accelerate \
+ bitsandbytes \
+ sentencepiece \
+ protobuf
+
+ # Install LAVIS
+ RUN pip install --no-cache-dir salesforce-lavis || \
+ (git clone https://github.com/salesforce/LAVIS.git /tmp/LAVIS && \
+ cd /tmp/LAVIS && \
+ pip install -e . && \
+ cp -r lavis /app/lib/ && \
+ rm -rf /tmp/LAVIS)
+
+ # Copy application files
+ COPY . /app/
+
+ # Copy LAVIS folder if it exists
+ COPY LAVIS/ /app/LAVIS/
+
+ # Ensure the blip2_vicuna_instruct.py is in the right place
+ RUN if [ -f "/app/blip2_vicuna_instruct.py" ]; then \
+ cp /app/blip2_vicuna_instruct.py /app/LAVIS/lavis/models/blip2_models/ 2>/dev/null || true; \
+ fi
+
+ # Create lib directory and ensure proper structure
+ RUN mkdir -p /app/lib && \
+ if [ -d "/app/LAVIS/lavis" ]; then cp -r /app/LAVIS/lavis /app/lib/; fi
+
+ # Make start script executable
+ RUN chmod +x /app/start.sh
+
+ # Create necessary directories for the application
+ RUN mkdir -p /app/templates /app/static /app/emo
+
+ # Set proper permissions
+ RUN chmod -R 755 /app && \
+ chmod -R 777 /app/.cache

  # Health check
+ HEALTHCHECK --interval=30s --timeout=30s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:7860/health || exit 1

+ # Expose port
+ EXPOSE 7860

+ # Use start script
+ CMD ["./start.sh"]
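The rewritten image keeps a `HEALTHCHECK` that runs `curl -f http://localhost:7860/health`, so whatever `start.sh` launches must answer `GET /health` on port 7860. A minimal stdlib sketch of such an endpoint (hypothetical — the repo's actual Flask app and `start.sh` are not shown in this 50-file view):

```python
import http.server
import json
import threading


class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Answers the Docker HEALTHCHECK probe with 200 and a small JSON body."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep container logs quiet


def start_health_server(port):
    """Serve /health in a background thread; returns the server for shutdown()."""
    server = http.server.HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With something like this listening on the container's port, `curl -f http://localhost:7860/health` exits 0 and the health check passes; otherwise the `-f` flag makes curl exit non-zero and the container is marked unhealthy.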
LAVIS/.github/workflows/docs.yaml ADDED
@@ -0,0 +1,38 @@
+ name: docs
+
+ on:
+   push:
+     branches: [ main ]
+   pull_request:
+     branches: [ main ]
+   release:
+     types: [ published ]
+
+ jobs:
+   build:
+
+     runs-on: ubuntu-18.04
+
+     steps:
+     - uses: actions/checkout@v2
+       with:
+         fetch-depth: 0
+     - name: Set up Python
+       uses: actions/setup-python@v2
+       with:
+         python-version: '3.8'
+     - name: Install dependencies
+       run: |
+         python -m pip install --upgrade pip setuptools wheel
+         sudo apt-get update
+         sudo apt-get install openjdk-11-jdk
+         sudo apt-get install pandoc
+     - name: Build Sphinx docs
+       run: |
+         docs/build_docs.sh
+     - name: Deploy to gh-pages
+       uses: peaceiris/actions-gh-pages@v3
+       if: ${{ github.ref == 'refs/heads/main' || github.event_name == 'release' }}
+       with:
+         github_token: ${{ secrets.GITHUB_TOKEN }}
+         publish_dir: docs/_build/html
LAVIS/.gitignore ADDED
@@ -0,0 +1,159 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # project-specific
+ output/
+ debug*/
+ *.bak
+ *.dir
+ *.dat
+ *.tsv
+ *.gz
+ *.csv
+ *.p
+ *.pdf
+
+ cache/
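Most of the ignore rules above are plain globs (`*.py[cod]`, `*.log`, `*.egg-info/`). As a rough approximation (real gitignore matching adds anchoring, directory-only rules like `build/`, and `!` negation, none of which `fnmatch` models), a bare file name can be checked against such patterns like this:

```python
import fnmatch


def is_ignored(name, patterns):
    """Approximate gitignore-style matching of a bare file name.

    Only handles simple glob patterns, including character classes like
    `*.py[cod]`; directory rules and `!` negations are NOT modeled here.
    """
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
```

For instance, `is_ignored("module.pyc", ["*.py[cod]"])` is true while `is_ignored("module.py", ["*.py[cod]"])` is false, because the `[cod]` class requires exactly one more character after `.py`.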
LAVIS/.pre-commit-config.yaml ADDED
@@ -0,0 +1,26 @@
+ repos:
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v4.1.0
+     hooks:
+       - id: trailing-whitespace
+       - id: check-ast
+       - id: no-commit-to-branch
+         args: ['--branch=main']
+       - id: check-added-large-files
+         args: ['--maxkb=5000']
+       - id: end-of-file-fixer
+
+   - repo: https://github.com/psf/black
+     rev: stable
+     hooks:
+       - id: black
+         language_version: python3.8
+
+   - repo: https://github.com/PyCQA/flake8
+     rev: 3.9.2
+     hooks:
+       - id: flake8
+         args: [
+           # only error for syntax errors and undefined names
+           "--select=E9,F63,F7,F82",
+         ]
LAVIS/CODEOWNERS ADDED
@@ -0,0 +1,2 @@
+ # Comment line immediately above ownership line is reserved for related gus information. Please be careful while editing.
+ #ECCN:Open Source
LAVIS/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,105 @@
+ # Salesforce Open Source Community Code of Conduct
+
+ ## About the Code of Conduct
+
+ Equality is a core value at Salesforce. We believe a diverse and inclusive
+ community fosters innovation and creativity, and are committed to building a
+ culture where everyone feels included.
+
+ Salesforce open-source projects are committed to providing a friendly, safe, and
+ welcoming environment for all, regardless of gender identity and expression,
+ sexual orientation, disability, physical appearance, body size, ethnicity, nationality,
+ race, age, religion, level of experience, education, socioeconomic status, or
+ other similar personal characteristics.
+
+ The goal of this code of conduct is to specify a baseline standard of behavior so
+ that people with different social values and communication styles can work
+ together effectively, productively, and respectfully in our open source community.
+ It also establishes a mechanism for reporting issues and resolving conflicts.
+
+ All questions and reports of abusive, harassing, or otherwise unacceptable behavior
+ in a Salesforce open-source project may be reported by contacting the Salesforce
+ Open Source Conduct Committee at ossconduct@salesforce.com.
+
+ ## Our Pledge
+
+ In the interest of fostering an open and welcoming environment, we as
+ contributors and maintainers pledge to making participation in our project and
+ our community a harassment-free experience for everyone, regardless of gender
+ identity and expression, sexual orientation, disability, physical appearance,
+ body size, ethnicity, nationality, race, age, religion, level of experience, education,
+ socioeconomic status, or other similar personal characteristics.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to creating a positive environment
+ include:
+
+ * Using welcoming and inclusive language
+ * Being respectful of differing viewpoints and experiences
+ * Gracefully accepting constructive criticism
+ * Focusing on what is best for the community
+ * Showing empathy toward other community members
+
+ Examples of unacceptable behavior by participants include:
+
+ * The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+ * Personal attacks, insulting/derogatory comments, or trolling
+ * Public or private harassment
+ * Publishing, or threatening to publish, others' private information—such as
+ a physical or electronic address—without explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+ * Advocating for or encouraging any of the above behaviors
+
+ ## Our Responsibilities
+
+ Project maintainers are responsible for clarifying the standards of acceptable
+ behavior and are expected to take appropriate and fair corrective action in
+ response to any instances of unacceptable behavior.
+
+ Project maintainers have the right and responsibility to remove, edit, or
+ reject comments, commits, code, wiki edits, issues, and other contributions
+ that are not aligned with this Code of Conduct, or to ban temporarily or
+ permanently any contributor for other behaviors that they deem inappropriate,
+ threatening, offensive, or harmful.
+
+ ## Scope
+
+ This Code of Conduct applies both within project spaces and in public spaces
+ when an individual is representing the project or its community. Examples of
+ representing a project or community include using an official project email
+ address, posting via an official social media account, or acting as an appointed
+ representative at an online or offline event. Representation of a project may be
+ further defined and clarified by project maintainers.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported by contacting the Salesforce Open Source Conduct Committee
+ at ossconduct@salesforce.com. All complaints will be reviewed and investigated
+ and will result in a response that is deemed necessary and appropriate to the
+ circumstances. The committee is obligated to maintain confidentiality with
+ regard to the reporter of an incident. Further details of specific enforcement
+ policies may be posted separately.
+
+ Project maintainers who do not follow or enforce the Code of Conduct in good
+ faith may face temporary or permanent repercussions as determined by other
+ members of the project's leadership and the Salesforce Open Source Conduct
+ Committee.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][contributor-covenant-home],
+ version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html.
+ It includes adaptions and additions from [Go Community Code of Conduct][golang-coc],
+ [CNCF Code of Conduct][cncf-coc], and [Microsoft Open Source Code of Conduct][microsoft-coc].
+
+ This Code of Conduct is licensed under the [Creative Commons Attribution 3.0 License][cc-by-3-us].
+
+ [contributor-covenant-home]: https://www.contributor-covenant.org (https://www.contributor-covenant.org/)
+ [golang-coc]: https://golang.org/conduct
+ [cncf-coc]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md
+ [microsoft-coc]: https://opensource.microsoft.com/codeofconduct/
+ [cc-by-3-us]: https://creativecommons.org/licenses/by/3.0/us/
LAVIS/LICENSE.txt ADDED
@@ -0,0 +1,14 @@
+ BSD 3-Clause License
+
+ Copyright (c) 2022 Salesforce, Inc.
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+ 3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
LAVIS/MANIFEST.in ADDED
@@ -0,0 +1,8 @@
+ recursive-include lavis/configs *.yaml *.json
+ recursive-include lavis/projects *.yaml *.json
+
+ recursive-exclude lavis/datasets/download_scripts *
+ recursive-exclude lavis/output *
+
+ include requirements.txt
+ include lavis/models/clip_models/bpe_simple_vocab_16e6.txt.gz
LAVIS/README.md ADDED
@@ -0,0 +1,328 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <p align="center">
2
+ <br>
3
+ <img src="docs/_static/logo_final.png" width="400"/>
4
+ <br>
5
+ <p>
6
+
7
+ <div align="center">
8
+ <a href="https://github.com/salesforce/LAVIS/releases"><img alt="Latest Release" src="https://img.shields.io/github/release/salesforce/LAVIS.svg" /></a>
9
+ <a href="https://opensource.salesforce.com/LAVIS/index.html">
10
+ <img alt="docs" src="https://github.com/salesforce/LAVIS/actions/workflows/docs.yaml/badge.svg"/>
11
+ <a href="https://opensource.org/licenses/BSD-3-Clause">
12
+ <img alt="license" src="https://img.shields.io/badge/License-BSD_3--Clause-blue.svg"/>
13
+ </a>
14
+ <a href="https://pepy.tech/project/salesforce-lavis">
15
+ <img alt="Downloads" src="https://pepy.tech/badge/salesforce-lavis">
16
+ </a>
17
+ </div>
18
+
19
+ <div align="center">
20
+ <a href="https://opensource.salesforce.com/LAVIS//latest/benchmark.html">Benchmark</a>,
21
+ <a href="https://arxiv.org/abs/2209.09019">Technical Report</a>,
22
+ <a href="https://opensource.salesforce.com/LAVIS//latest/index.html">Documentation</a>,
23
+ <a href="https://github.com/salesforce/LAVIS/tree/main/examples">Jupyter Notebook Examples</a>,
24
+ <a href="https://blog.salesforceairesearch.com/lavis-language-vision-library/">Blog</a>
25
+ </div>
26
+
27
+ # LAVIS - A Library for Language-Vision Intelligence
28
+
29
+ ## What's New: 🎉
30
+ * [Model Release] November 2023, released implementation of **X-InstructBLIP** <br>
31
+ [Paper](https://arxiv.org/pdf/2311.18799.pdf), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/xinstructblip), [Website](https://artemisp.github.io/X-InstructBLIP-page/), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/xinstructblip/demo/run_demo.ipynb)
32
+ > A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.
33
+ * [Model Release] July 2023, released implementation of **BLIP-Diffusion** <br>
34
+ [Paper](https://arxiv.org/abs/2305.06500), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion), [Website](https://dxli94.github.io/BLIP-Diffusion-website/)
35
+ > A text-to-image generation model that trains 20x than DreamBooth. Also facilitates zero-shot subject-driven generation and editing.
36
+ * [Model Release] May 2023, released implementation of **InstructBLIP** <br>
37
+ [Paper](https://arxiv.org/abs/2305.06500), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
38
+ > A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.
39
+ * [Model Release] Jan 2023, released implementation of **BLIP-2** <br>
40
+ [Paper](https://arxiv.org/abs/2301.12597), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb)
41
+ > A generic and efficient pre-training strategy that leverages off-the-shelf frozen pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (**65.0** vs **56.3**), establishing new state-of-the-art on zero-shot captioning (on NoCaps **121.6** CIDEr score vs previous best **113.2**). In addition, equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks the new **zero-shot instructed vision-to-language generation** capabilities for various interesting applications!
42
+ * Jan 2023, LAVIS is now available on [PyPI](https://pypi.org/project/salesforce-lavis/) for installation!
43
+ * [Model Release] Dec 2022, released implementation of **Img2LLM-VQA** (**CVPR 2023**, _"From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models"_, by Jiaxian Guo et al) <br>
44
+ [Paper](https://arxiv.org/pdf/2212.10846.pdf), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb)
45
+ > A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3), while in contrast requiring no end-to-end training!
46
+ * [Model Release] Oct 2022, released implementation of **PNP-VQA** (**EMNLP Findings 2022**, _"Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training"_, by Anthony T.M.H. et al), <br>
47
+ [Paper](https://arxiv.org/abs/2210.08773), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/pnp-vqa/pnp_vqa.ipynb)
48
+ > A modular zero-shot VQA framework that requires no PLM training, achieving SoTA zero-shot VQA performance.
49
+
50
+ ## Technical Report and Citing LAVIS
51
+ You can find more details in our [technical report](https://arxiv.org/abs/2209.09019).
52
+
53
+ **If you're using LAVIS in your research or applications, please cite it using this BibTeX**:
54
+ ```bibtex
55
+ @inproceedings{li-etal-2023-lavis,
56
+ title = "{LAVIS}: A One-stop Library for Language-Vision Intelligence",
57
+ author = "Li, Dongxu and
58
+ Li, Junnan and
59
+ Le, Hung and
60
+ Wang, Guangsen and
61
+ Savarese, Silvio and
62
+ Hoi, Steven C.H.",
63
+ booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
64
+ month = jul,
65
+ year = "2023",
66
+ address = "Toronto, Canada",
67
+ publisher = "Association for Computational Linguistics",
68
+ url = "https://aclanthology.org/2023.acl-demo.3",
69
+ pages = "31--41",
70
+ abstract = "We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.",
71
+ }
72
+ ```
73
+
74
+
75
+ ## Table of Contents
76
+ - [Introduction](#introduction)
77
+ - [Installation](#installation)
78
+ - [Getting Started](#getting-started)
79
+ - [Model Zoo](#model-zoo)
80
+ - [Image Captioning](#image-captioning)
81
+ - [Visual question answering (VQA)](#visual-question-answering-vqa)
82
+ - [Unified Feature Extraction Interface](#unified-feature-extraction-interface)
83
+ - [Load Datasets](#load-datasets)
84
+ - [Jupyter Notebook Examples](#jupyter-notebook-examples)
85
+ - [Resources and Tools](#resources-and-tools)
86
+ - [Documentations](#documentations)
87
+ - [Ethical and Responsible Use](#ethical-and-responsible-use)
88
+ - [Technical Report and Citing LAVIS](#technical-report-and-citing-lavis)
89
+ - [License](#license)
90
+
91
+ ## Introduction
92
+ LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets.
93
+ It features a unified interface design to access
94
+ - **10+** tasks
95
+ (retrieval, captioning, visual question answering, multimodal classification etc.);
96
+ - **20+** datasets (COCO, Flickr, Nocaps, Conceptual
97
+ Captions, SBU, etc.);
98
+ - **30+** pretrained weights of state-of-the-art foundation language-vision models and their task-specific adaptations, including [ALBEF](https://arxiv.org/pdf/2107.07651.pdf),
99
+ [BLIP](https://arxiv.org/pdf/2201.12086.pdf), [ALPRO](https://arxiv.org/pdf/2112.09583.pdf), [CLIP](https://arxiv.org/pdf/2103.00020.pdf).
100
+ <p align="center">
101
+ <br>
102
+ <img src="assets/demo-6.png"/>
103
+ <br>
104
+ </p>
105
+
106
+ Key features of LAVIS include:
107
+
108
+ - **Unified and Modular Interface**: makes it easy to leverage and repurpose existing modules (datasets, models, preprocessors), and to add new modules.
109
+
110
+ - **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.
111
+
112
+ - **Reproducible Model Zoo and Training Recipes**: easily replicate and extend state-of-the-art models on existing and new tasks.
113
+
114
+ - **Dataset Zoo and Automatic Downloading Tools**: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.
115
+
116
+
117
+ The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.
118
+
119
+ | Tasks | Supported Models | Supported Datasets |
120
+ | :--------------------------------------: | :----------------------: | :----------------------------------------: |
121
+ | Image-text Pre-training | ALBEF, BLIP | COCO, VisualGenome, SBU, ConceptualCaptions |
122
+ | Image-text Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
123
+ | Text-image Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
124
+ | Visual Question Answering | ALBEF, BLIP | VQAv2, OKVQA, A-OKVQA |
125
+ | Image Captioning | BLIP | COCO, NoCaps |
126
+ | Image Classification | CLIP | ImageNet |
127
+ | Natural Language Visual Reasoning (NLVR) | ALBEF, BLIP | NLVR2 |
128
+ | Visual Entailment (VE) | ALBEF | SNLI-VE |
129
+ | Visual Dialogue | BLIP | VisDial |
130
+ | Video-text Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
131
+ | Text-video Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
132
+ | Video Question Answering (VideoQA) | BLIP, ALPRO | MSRVTT, MSVD |
133
+ | Video Dialogue | VGD-GPT | AVSD |
134
+ | Multimodal Feature Extraction | ALBEF, CLIP, BLIP, ALPRO | customized |
135
+ | Text-to-image Generation | [COMING SOON] | |
136
+
137
+ ## Installation
138
+
139
+ 1. (Optional) Create a conda environment
140
+
141
+ ```bash
142
+ conda create -n lavis python=3.8
143
+ conda activate lavis
144
+ ```
145
+
146
+ 2. Install from [PyPI](https://pypi.org/project/salesforce-lavis/)
147
+ ```bash
148
+ pip install salesforce-lavis
149
+ ```
150
+
151
+ 3. Or, for development, you may build from source
152
+
153
+ ```bash
154
+ git clone https://github.com/salesforce/LAVIS.git
155
+ cd LAVIS
156
+ pip install -e .
157
+ ```
158
+
159
+ ## Getting Started
160
+ ### Model Zoo
161
+ The model zoo summarizes the models supported in LAVIS. To view it:
162
+ ```python
163
+ from lavis.models import model_zoo
164
+ print(model_zoo)
165
+ # ==================================================
166
+ # Architectures Types
167
+ # ==================================================
168
+ # albef_classification ve
169
+ # albef_feature_extractor base
170
+ # albef_nlvr nlvr
171
+ # albef_pretrain base
172
+ # albef_retrieval coco, flickr
173
+ # albef_vqa vqav2
174
+ # alpro_qa msrvtt, msvd
175
+ # alpro_retrieval msrvtt, didemo
176
+ # blip_caption base_coco, large_coco
177
+ # blip_classification base
178
+ # blip_feature_extractor base
179
+ # blip_nlvr nlvr
180
+ # blip_pretrain base
181
+ # blip_retrieval coco, flickr
182
+ # blip_vqa vqav2, okvqa, aokvqa
183
+ # clip_feature_extractor ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
184
+ # clip ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
185
+ # gpt_dialogue base
186
+ ```
187
+
188
+ Let’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from a local file.
189
+
190
+ ```python
191
+ import torch
192
+ from PIL import Image
193
+ # setup device to use
194
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
195
+ # load sample image
196
+ raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
197
+ ```
198
+
199
+ This example image shows [Merlion park](https://en.wikipedia.org/wiki/Merlion) ([source](https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/)), a landmark in Singapore.
200
+
201
+
202
+ ### Image Captioning
203
+ In this example, we use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each
204
+ pre-trained model with its preprocessors (transforms), accessed via ``load_model_and_preprocess()``.
205
+
206
+ ```python
207
+ import torch
208
+ from lavis.models import load_model_and_preprocess
209
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
210
+ # loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
211
+ # this also loads the associated image processors
212
+ model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
213
+ # preprocess the image
214
+ # vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
215
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
216
+ # generate caption
217
+ model.generate({"image": image})
218
+ # ['a large fountain spewing water into the air']
219
+ ```
220
+
221
+ ### Visual question answering (VQA)
222
+ The BLIP model is able to answer free-form questions about images in natural language.
223
+ To access the VQA model, simply replace the ``name`` and ``model_type`` arguments
224
+ passed to ``load_model_and_preprocess()``.
225
+
226
+ ```python
227
+ from lavis.models import load_model_and_preprocess
228
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
229
+ # ask a random question.
230
+ question = "Which city is this photo taken?"
231
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
232
+ question = txt_processors["eval"](question)
233
+ model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
234
+ # ['singapore']
235
+ ```
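Besides free-form generation, `predict_answers()` can also rank a fixed set of candidate answers (passing `inference_method="rank"` together with an `answer_list`); conceptually, each candidate gets a score and the highest-scoring one wins. A toy, LAVIS-free sketch of that ranking step (the candidate scores below are made up for illustration only):

```python
import math

def rank_answers(scores):
    """Pick the best candidate answer from raw scores via a softmax."""
    total = sum(math.exp(s) for s in scores.values())
    probs = {ans: math.exp(s) / total for ans, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

# hypothetical model scores for candidate answers
candidate_scores = {"singapore": 4.2, "hong kong": 1.1, "sydney": 0.3}
best, prob = rank_answers(candidate_scores)
print(best)  # singapore
```

In LAVIS itself, the scoring would come from the model; this only illustrates the selection logic.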
236
+
237
+ ### Unified Feature Extraction Interface
238
+
239
+ LAVIS provides a unified interface to extract features from each architecture.
240
+ To extract features, we load the feature extractor variants of each model.
241
+ The multimodal feature can be used for multimodal classification.
242
+ The low-dimensional unimodal features can be used to compute cross-modal similarity.
243
+
244
+
245
+ ```python
246
+ from lavis.models import load_model_and_preprocess
247
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
248
+ caption = "a large fountain spewing water into the air"
249
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
250
+ text_input = txt_processors["eval"](caption)
251
+ sample = {"image": image, "text_input": [text_input]}
252
+
253
+ features_multimodal = model.extract_features(sample)
254
+ print(features_multimodal.multimodal_embeds.shape)
255
+ # torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks
256
+
257
+ features_image = model.extract_features(sample, mode="image")
258
+ features_text = model.extract_features(sample, mode="text")
259
+ print(features_image.image_embeds.shape)
260
+ # torch.Size([1, 197, 768])
261
+ print(features_text.text_embeds.shape)
262
+ # torch.Size([1, 12, 768])
263
+
264
+ # low-dimensional projected features
265
+ print(features_image.image_embeds_proj.shape)
266
+ # torch.Size([1, 197, 256])
267
+ print(features_text.text_embeds_proj.shape)
268
+ # torch.Size([1, 12, 256])
269
+ similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
270
+ print(similarity)
271
+ # tensor([[0.2622]])
272
+ ```
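The projected features are unit-normalized, which is why a plain dot product between the `[:, 0, :]` (CLS) vectors above yields a cosine similarity. A self-contained torch sketch with synthetic tensors (shapes mirror the example; these are not actual LAVIS outputs):

```python
import torch
import torch.nn.functional as F

# synthetic stand-ins for image_embeds_proj and text_embeds_proj
image_embeds_proj = F.normalize(torch.randn(1, 197, 256), dim=-1)
text_embeds_proj = F.normalize(torch.randn(1, 12, 256), dim=-1)

# take the CLS vector of each modality and compute their dot product
img_cls = image_embeds_proj[:, 0, :]
txt_cls = text_embeds_proj[:, 0, :]
similarity = img_cls @ txt_cls.t()

# unit vectors: dot product equals cosine similarity
assert torch.allclose(
    similarity.squeeze(), F.cosine_similarity(img_cls, txt_cls).squeeze(), atol=1e-5
)
print(similarity.shape)  # torch.Size([1, 1])
```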
273
+
274
+ ### Load Datasets
275
+ LAVIS inherently supports a wide variety of common language-vision datasets by providing [automatic download tools](https://opensource.salesforce.com/LAVIS//latest/benchmark) to help download and organize these datasets. After downloading, to load the datasets, use the following code:
276
+
277
+ ```python
278
+ from lavis.datasets.builders import dataset_zoo
279
+ dataset_names = dataset_zoo.get_names()
280
+ print(dataset_names)
281
+ # ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
282
+ # 'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
283
+ # 'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
284
+ # 'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
285
+ ```
286
+ After downloading the images, we can use ``load_dataset()`` to obtain the dataset.
287
+ ```python
288
+ from lavis.datasets.builders import load_dataset
289
+ coco_dataset = load_dataset("coco_caption")
290
+ print(coco_dataset.keys())
291
+ # dict_keys(['train', 'val', 'test'])
292
+ print(len(coco_dataset["train"]))
293
+ # 566747
294
+ print(coco_dataset["train"][0])
295
+ # {'image': <PIL.Image.Image image mode=RGB size=640x480>,
296
+ # 'text_input': 'A woman wearing a net on her head cutting a cake. ',
297
+ # 'image_id': 0}
298
+ ```
299
+
300
+ If you already host a local copy of the dataset, you can pass in the ``vis_path`` argument to change the default location to load images.
301
+
302
+ ```python
303
+ coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)
304
+ ```
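Each split returned by `load_dataset()` behaves as a standard map-style PyTorch dataset, so it can be fed to a `DataLoader` for training or evaluation. A sketch using a dummy stand-in dataset (same sample-dict structure as the `coco_caption` example above; in practice you would pass `coco_dataset["train"]` instead):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyCaptionDataset(Dataset):
    """Stand-in mimicking {'image', 'text_input', 'image_id'} samples."""
    def __init__(self, n=8):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {
            "image": torch.zeros(3, 224, 224),
            "text_input": f"caption {idx}",
            "image_id": idx,
        }

loader = DataLoader(DummyCaptionDataset(), batch_size=4, shuffle=False)
batch = next(iter(loader))
print(batch["image"].shape)  # torch.Size([4, 3, 224, 224])
print(batch["text_input"])   # ['caption 0', 'caption 1', 'caption 2', 'caption 3']
```

The default collate function stacks image tensors into a batch and gathers the captions into a list.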
305
+
306
+ ## Jupyter Notebook Examples
307
+ See [examples](https://github.com/salesforce/LAVIS/tree/main/examples) for more inference examples, e.g. captioning, feature extraction, VQA, GradCam, zero-shot classification.
308
+
309
+ ## Resources and Tools
310
+ - **Benchmarks**: see [Benchmark](https://opensource.salesforce.com/LAVIS//latest/benchmark) for instructions to evaluate and train supported models.
311
+ - **Dataset Download and Browsing**: see [Dataset Download](https://opensource.salesforce.com/LAVIS//latest/benchmark) for instructions and automatic tools for downloading common language-vision datasets.
312
+ - **GUI Demo**: to run the demo locally, run `bash run_scripts/run_demo.sh` and then follow the instructions in the prompt to view the demo in a browser. A web demo is coming soon.
313
+
314
+
315
+ ## Documentations
316
+ For more details and advanced usages, please refer to
317
+ [documentation](https://opensource.salesforce.com/LAVIS//latest/index.html#).
318
+
319
+ ## Ethical and Responsible Use
320
+ We note that models in LAVIS provide no guarantees on their multimodal abilities; incorrect or biased predictions may be observed. In particular, the datasets and pretrained models utilized in LAVIS may contain socioeconomic biases which could result in misclassification and other unwanted behaviors such as offensive or inappropriate speech. We strongly recommend that users review the pre-trained models and overall system in LAVIS before practical adoption. We plan to improve the library by investigating and mitigating these potential biases and
321
+ inappropriate behaviors in the future.
322
+
323
+
324
+ ## Contact us
325
+ If you have any questions, comments or suggestions, please do not hesitate to contact us at lavis@salesforce.com.
326
+
327
+ ## License
328
+ [BSD 3-Clause License](LICENSE.txt)
LAVIS/SECURITY.md ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ ## Security
2
+
3
+ Please report any security issue to [security@salesforce.com](mailto:security@salesforce.com)
4
+ as soon as it is discovered. This library limits its runtime dependencies in
5
+ order to reduce the total cost of ownership as much as possible, but all consumers
6
+ should remain vigilant and have their security stakeholders review all third-party
7
+ products (3PP) like this one and their dependencies.
LAVIS/app/__init__.py ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ from PIL import Image
9
+ import requests
10
+
11
+ import streamlit as st
12
+ import torch
13
+
14
+
15
+ @st.cache()
16
+ def load_demo_image():
17
+ img_url = (
18
+ "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
19
+ )
20
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
21
+ return raw_image
22
+
23
+
24
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
25
+
26
+ cache_root = "/export/home/.cache/lavis/"
LAVIS/app/calculate_coco_features.py ADDED
@@ -0,0 +1,87 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ from PIL import Image
9
+ import requests
10
+ import torch
11
+
12
+ import os
13
+
14
+ from lavis.common.registry import registry
15
+ from lavis.processors import *
16
+ from lavis.models import *
17
+ from lavis.common.utils import build_default_model
18
+
19
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
20
+
21
+
22
+ def load_demo_image():
23
+ img_url = (
24
+ "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
25
+ )
26
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
27
+
28
+ return raw_image
29
+
30
+
31
+ def read_img(filepath):
32
+ raw_image = Image.open(filepath).convert("RGB")
33
+
34
+ return raw_image
35
+
36
+
37
+ # model
38
+ model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
39
+ feature_extractor = BlipFeatureExtractor(pretrained=model_url)
40
+
41
+ feature_extractor.eval()
42
+ feature_extractor = feature_extractor.to(device)
43
+
44
+ # preprocessors
45
+ vis_processor = BlipImageEvalProcessor(image_size=224)
46
+ text_processor = BlipCaptionProcessor()
47
+
48
+ # files to process
49
+ # file_root = "/export/home/.cache/lavis/coco/images/val2014"
50
+ file_root = "/export/home/.cache/lavis/coco/images/train2014"
51
+ filepaths = os.listdir(file_root)
52
+
53
+ print(len(filepaths))
54
+
55
+ caption = "dummy"
56
+
57
+ path2feat = dict()
58
+ bsz = 256
59
+
60
+ images_in_batch = []
61
+ filepaths_in_batch = []
62
+
63
+ for i, filename in enumerate(filepaths):
64
+ if i % bsz == 0 and i > 0:
65
+ images_in_batch = torch.cat(images_in_batch, dim=0).to(device)
66
+ with torch.no_grad():
67
+ image_features = feature_extractor(
68
+ images_in_batch, caption, mode="image", normalized=True
69
+ )[:, 0]
70
+
71
+ for filepath, image_feat in zip(filepaths_in_batch, image_features):
72
+ path2feat[os.path.basename(filepath)] = image_feat.detach().cpu()
73
+
74
+ images_in_batch = []
75
+ filepaths_in_batch = []
76
+
77
+ print(len(path2feat), image_features.shape)
78
+ else:
79
+ filepath = os.path.join(file_root, filename)
80
+
81
+ image = read_img(filepath)
82
+ image = vis_processor(image).unsqueeze(0)
83
+
84
+ images_in_batch.append(image)
85
+ filepaths_in_batch.append(filepath)
86
+
87
+ torch.save(path2feat, "path2feat_coco_train2014.pth")
LAVIS/app/caption.py ADDED
@@ -0,0 +1,98 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import streamlit as st
9
+ from app import device, load_demo_image
10
+ from app.utils import load_model_cache
11
+ from lavis.processors import load_processor
12
+ from PIL import Image
13
+
14
+
15
+ def app():
16
+ # ===== layout =====
17
+ model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
18
+
19
+ sampling_method = st.sidebar.selectbox(
20
+ "Sampling method:", ["Beam search", "Nucleus sampling"]
21
+ )
22
+
23
+ st.markdown(
24
+ "<h1 style='text-align: center;'>Image Description Generation</h1>",
25
+ unsafe_allow_html=True,
26
+ )
27
+
28
+ instructions = """Try the provided image or upload your own:"""
29
+ file = st.file_uploader(instructions)
30
+
31
+ use_beam = sampling_method == "Beam search"
32
+
33
+ col1, col2 = st.columns(2)
34
+
35
+ if file:
36
+ raw_img = Image.open(file).convert("RGB")
37
+ else:
38
+ raw_img = load_demo_image()
39
+
40
+ col1.header("Image")
41
+
42
+ w, h = raw_img.size
43
+ scaling_factor = 720 / w
44
+ resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
45
+
46
+ col1.image(resized_image, use_column_width=True)
47
+ col2.header("Description")
48
+
49
+ cap_button = st.button("Generate")
50
+
51
+ # ==== event ====
52
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
53
+
54
+ if cap_button:
55
+ if model_type.startswith("BLIP"):
56
+ blip_type = model_type.split("_")[1].lower()
57
+ model = load_model_cache(
58
+ "blip_caption",
59
+ model_type=f"{blip_type}_coco",
60
+ is_eval=True,
61
+ device=device,
62
+ )
63
+
64
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
65
+ captions = generate_caption(
66
+ model=model, image=img, use_nucleus_sampling=not use_beam
67
+ )
68
+
69
+ col2.write("\n\n".join(captions))
70
+
71
+
72
+ def generate_caption(
73
+ model, image, use_nucleus_sampling=False, num_beams=3, max_length=40, min_length=5
74
+ ):
75
+ samples = {"image": image}
76
+
77
+ captions = []
78
+ if use_nucleus_sampling:
79
+ for _ in range(5):
80
+ caption = model.generate(
81
+ samples,
82
+ use_nucleus_sampling=True,
83
+ max_length=max_length,
84
+ min_length=min_length,
85
+ top_p=0.9,
86
+ )
87
+ captions.append(caption[0])
88
+ else:
89
+ caption = model.generate(
90
+ samples,
91
+ use_nucleus_sampling=False,
92
+ num_beams=num_beams,
93
+ max_length=max_length,
94
+ min_length=min_length,
95
+ )
96
+ captions.append(caption[0])
97
+
98
+ return captions
LAVIS/app/classification.py ADDED
@@ -0,0 +1,216 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import plotly.graph_objects as go
9
+ import requests
10
+ import streamlit as st
11
+ import torch
12
+ from lavis.models import load_model
13
+ from lavis.processors import load_processor
14
+ from lavis.processors.blip_processors import BlipCaptionProcessor
15
+ from PIL import Image
16
+
17
+ from app import device, load_demo_image
18
+ from app.utils import load_blip_itm_model
19
+ from lavis.processors.clip_processors import ClipImageEvalProcessor
20
+
21
+
22
+ @st.cache()
23
+ def load_demo_image(img_url=None):
24
+ if not img_url:
25
+ img_url = "https://img.atlasobscura.com/yDJ86L8Ou6aIjBsxnlAy5f164w1rjTgcHZcx2yUs4mo/rt:fit/w:1200/q:81/sm:1/scp:1/ar:1/aHR0cHM6Ly9hdGxh/cy1kZXYuczMuYW1h/em9uYXdzLmNvbS91/cGxvYWRzL3BsYWNl/X2ltYWdlcy85MDll/MDRjOS00NTJjLTQx/NzQtYTY4MS02NmQw/MzI2YWIzNjk1ZGVk/MGZhMTJiMTM5MmZi/NGFfUmVhcl92aWV3/X29mX3RoZV9NZXJs/aW9uX3N0YXR1ZV9h/dF9NZXJsaW9uX1Bh/cmssX1NpbmdhcG9y/ZSxfd2l0aF9NYXJp/bmFfQmF5X1NhbmRz/X2luX3RoZV9kaXN0/YW5jZV8tXzIwMTQw/MzA3LmpwZw.jpg"
26
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
27
+ return raw_image
28
+
29
+
30
+ @st.cache(
31
+ hash_funcs={
32
+ torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
33
+ .cpu()
34
+ .numpy()
35
+ },
36
+ allow_output_mutation=True,
37
+ )
38
+ def load_model_cache(model_type, device):
39
+ if model_type == "blip":
40
+ model = load_model(
41
+ "blip_feature_extractor", model_type="base", is_eval=True, device=device
42
+ )
43
+ elif model_type == "albef":
44
+ model = load_model(
45
+ "albef_feature_extractor", model_type="base", is_eval=True, device=device
46
+ )
47
+ elif model_type == "CLIP_ViT-B-32":
48
+ model = load_model(
49
+ "clip_feature_extractor", "ViT-B-32", is_eval=True, device=device
50
+ )
51
+ elif model_type == "CLIP_ViT-B-16":
52
+ model = load_model(
53
+ "clip_feature_extractor", "ViT-B-16", is_eval=True, device=device
54
+ )
55
+ elif model_type == "CLIP_ViT-L-14":
56
+ model = load_model(
57
+ "clip_feature_extractor", "ViT-L-14", is_eval=True, device=device
58
+ )
59
+
60
+ return model
61
+
62
+
63
+ def app():
64
+ model_type = st.sidebar.selectbox(
65
+ "Model:",
66
+ ["ALBEF", "BLIP_Base", "CLIP_ViT-B-32", "CLIP_ViT-B-16", "CLIP_ViT-L-14"],
67
+ )
68
+ score_type = st.sidebar.selectbox("Score type:", ["Cosine", "Multimodal"])
69
+
70
+ # ===== layout =====
71
+ st.markdown(
72
+ "<h1 style='text-align: center;'>Zero-shot Classification</h1>",
73
+ unsafe_allow_html=True,
74
+ )
75
+
76
+ instructions = """Try the provided image or upload your own:"""
77
+ file = st.file_uploader(instructions)
78
+
79
+ st.header("Image")
80
+ if file:
81
+ raw_img = Image.open(file).convert("RGB")
82
+ else:
83
+ raw_img = load_demo_image()
84
+
85
+ st.image(raw_img) # , use_column_width=True)
86
+
87
+ col1, col2 = st.columns(2)
88
+
89
+ col1.header("Categories")
90
+
91
+ cls_0 = col1.text_input("category 1", value="merlion")
92
+ cls_1 = col1.text_input("category 2", value="sky")
93
+ cls_2 = col1.text_input("category 3", value="giraffe")
94
+ cls_3 = col1.text_input("category 4", value="fountain")
95
+ cls_4 = col1.text_input("category 5", value="marina bay")
96
+
97
+ cls_names = [cls_0, cls_1, cls_2, cls_3, cls_4]
98
+ cls_names = [cls_nm for cls_nm in cls_names if len(cls_nm) > 0]
99
+
100
+ if len(cls_names) != len(set(cls_names)):
101
+ st.error("Please provide unique class names")
102
+ return
103
+
104
+ button = st.button("Submit")
105
+
106
+ col2.header("Prediction")
107
+
108
+ # ===== event =====
109
+
110
+ if button:
111
+ if model_type.startswith("BLIP"):
112
+ text_processor = BlipCaptionProcessor(prompt="A picture of ")
113
+ cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]
114
+
115
+ if score_type == "Cosine":
116
+ vis_processor = load_processor("blip_image_eval").build(image_size=224)
117
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
118
+
119
+ feature_extractor = load_model_cache(model_type="blip", device=device)
120
+
121
+ sample = {"image": img, "text_input": cls_prompt}
122
+
123
+ with torch.no_grad():
124
+ image_features = feature_extractor.extract_features(
125
+ sample, mode="image"
126
+ ).image_embeds_proj[:, 0]
127
+ text_features = feature_extractor.extract_features(
128
+ sample, mode="text"
129
+ ).text_embeds_proj[:, 0]
130
+ sims = (image_features @ text_features.t())[
131
+ 0
132
+ ] / feature_extractor.temp
133
+
134
+ else:
135
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
136
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
137
+
138
+ model = load_blip_itm_model(device)
139
+
140
+             output = model(img, cls_prompt, match_head="itm")
+             sims = output[:, 1]
+
+             sims = torch.nn.Softmax(dim=0)(sims)
+             inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]
+
+     elif model_type.startswith("ALBEF"):
+         vis_processor = load_processor("blip_image_eval").build(image_size=224)
+         img = vis_processor(raw_img).unsqueeze(0).to(device)
+
+         text_processor = BlipCaptionProcessor(prompt="A picture of ")
+         cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]
+
+         feature_extractor = load_model_cache(model_type="albef", device=device)
+
+         sample = {"image": img, "text_input": cls_prompt}
+
+         with torch.no_grad():
+             image_features = feature_extractor.extract_features(
+                 sample, mode="image"
+             ).image_embeds_proj[:, 0]
+             text_features = feature_extractor.extract_features(
+                 sample, mode="text"
+             ).text_embeds_proj[:, 0]
+
+             st.write(image_features.shape)
+             st.write(text_features.shape)
+
+             sims = (image_features @ text_features.t())[0] / feature_extractor.temp
+
+         sims = torch.nn.Softmax(dim=0)(sims)
+         inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]
+
+     elif model_type.startswith("CLIP"):
+         if model_type == "CLIP_ViT-B-32":
+             model = load_model_cache(model_type="CLIP_ViT-B-32", device=device)
+         elif model_type == "CLIP_ViT-B-16":
+             model = load_model_cache(model_type="CLIP_ViT-B-16", device=device)
+         elif model_type == "CLIP_ViT-L-14":
+             model = load_model_cache(model_type="CLIP_ViT-L-14", device=device)
+         else:
+             raise ValueError(f"Unknown model type {model_type}")
+
+         if score_type == "Cosine":
+             # image_preprocess = ClipImageEvalProcessor(image_size=336)
+             image_preprocess = ClipImageEvalProcessor(image_size=224)
+             img = image_preprocess(raw_img).unsqueeze(0).to(device)
+
+             sample = {"image": img, "text_input": cls_names}
+
+             with torch.no_grad():
+                 clip_features = model.extract_features(sample)
+
+                 image_features = clip_features.image_embeds_proj
+                 text_features = clip_features.text_embeds_proj
+
+             sims = (100.0 * image_features @ text_features.T)[0].softmax(dim=-1)
+             inv_sims = sims.tolist()[::-1]
+         else:
+             st.warning("CLIP does not support multimodal scoring.")
+             return
+
+     fig = go.Figure(
+         go.Bar(
+             x=inv_sims,
+             y=cls_names[::-1],
+             text=["{:.2f}".format(s) for s in inv_sims],
+             orientation="h",
+         )
+     )
+     fig.update_traces(
+         textfont_size=12,
+         textangle=0,
+         textposition="outside",
+         cliponaxis=False,
+     )
+     col2.plotly_chart(fig, use_container_width=True)
LAVIS/app/dataset_browser.py ADDED
@@ -0,0 +1,240 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import random
+ from collections import OrderedDict
+ from functools import reduce
+
+ import streamlit as st
+ from lavis.common.registry import registry
+ from lavis.datasets.builders import dataset_zoo, load_dataset
+ from lavis.datasets.builders.base_dataset_builder import load_dataset_config
+ from PIL import Image
+
+ IMAGE_LAYOUT = 3, 4
+ VIDEO_LAYOUT = 1, 2
+
+ PREV_STR = "Prev"
+ NEXT_STR = "Next"
+
+
+ def sample_dataset(dataset, indices):
+     samples = [dataset.displ_item(idx) for idx in indices]
+
+     return samples
+
+
+ def get_concat_v(im1, im2):
+     margin = 5
+
+     canvas_size = (im1.width + im2.width + margin, max(im1.height, im2.height))
+     canvas = Image.new("RGB", canvas_size, "White")
+     canvas.paste(im1, (0, 0))
+     canvas.paste(im2, (im1.width + margin, 0))
+
+     return canvas
+
+
+ def resize_img_w(raw_img, new_w=224):
+     if isinstance(raw_img, list):
+         resized_imgs = [resize_img_w(img, 196) for img in raw_img]
+         # concatenate images
+         resized_image = reduce(get_concat_v, resized_imgs)
+     else:
+         w, h = raw_img.size
+         scaling_factor = new_w / w
+         resized_image = raw_img.resize(
+             (int(w * scaling_factor), int(h * scaling_factor))
+         )
+
+     return resized_image
+
+
+ def get_visual_key(dataset):
+     if "image" in dataset[0]:
+         return "image"
+     elif "image0" in dataset[0]:  # NLVR2 dataset
+         return "image"
+     elif "video" in dataset[0]:
+         return "video"
+     else:
+         raise ValueError("Visual key not found.")
+
+
+ def gather_items(samples, exclude=[]):
+     gathered = []
+
+     for s in samples:
+         ns = OrderedDict()
+         for k in s.keys():
+             if k not in exclude:
+                 ns[k] = s[k]
+
+         gathered.append(ns)
+
+     return gathered
+
+
+ @st.cache(allow_output_mutation=True)
+ def load_dataset_cache(name):
+     return load_dataset(name)
+
+
+ def format_text(text):
+     md = "\n\n".join([f"**{k}**: {v}" for k, v in text.items()])
+
+     return md
+
+
+ def show_samples(dataset, offset=0, is_next=False):
+     visual_key = get_visual_key(dataset)
+
+     num_rows, num_cols = IMAGE_LAYOUT if visual_key == "image" else VIDEO_LAYOUT
+     n_samples = num_rows * num_cols
+
+     if not shuffle:
+         if is_next:
+             start = min(int(start_idx) + offset + n_samples, len(dataset) - n_samples)
+         else:
+             start = max(0, int(start_idx) + offset - n_samples)
+
+         st.session_state.last_start = start
+         end = min(start + n_samples, len(dataset))
+
+         indices = list(range(start, end))
+     else:
+         indices = random.sample(range(len(dataset)), n_samples)
+     samples = sample_dataset(dataset, indices)
+
+     visual_info = (
+         iter([resize_img_w(s[visual_key]) for s in samples])
+         if visual_key == "image"
+         # else iter([s[visual_key] for s in samples])
+         else iter([s["file"] for s in samples])
+     )
+     text_info = gather_items(samples, exclude=["image", "video"])
+     text_info = iter([format_text(s) for s in text_info])
+
+     st.markdown(
+         """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
+         unsafe_allow_html=True,
+     )
+     for _ in range(num_rows):
+         with st.container():
+             for col in st.columns(num_cols):
+                 # col.text(next(text_info))
+                 # col.caption(next(text_info))
+                 try:
+                     col.markdown(next(text_info))
+                     if visual_key == "image":
+                         col.image(next(visual_info), use_column_width=True, clamp=True)
+                     elif visual_key == "video":
+                         col.markdown(
+                             "![Alt Text](https://media.giphy.com/media/vFKqnCdLPNOKc/giphy.gif)"
+                         )
+                 except StopIteration:
+                     break
+
+     st.markdown(
+         """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
+         unsafe_allow_html=True,
+     )
+
+     st.session_state.n_display = n_samples
+
+
+ if __name__ == "__main__":
+     st.set_page_config(
+         page_title="LAVIS Dataset Explorer",
+         # layout="wide",
+         initial_sidebar_state="expanded",
+     )
+
+     dataset_name = st.sidebar.selectbox("Dataset:", dataset_zoo.get_names())
+
+     function = st.sidebar.selectbox("Function:", ["Browser"], index=0)
+
+     if function == "Browser":
+         shuffle = st.sidebar.selectbox("Shuffled:", [True, False], index=0)
+
+         dataset = load_dataset_cache(dataset_name)
+         split = st.sidebar.selectbox("Split:", dataset.keys())
+
+         dataset_len = len(dataset[split])
+         st.success(
+             f"Loaded {dataset_name}/{split} with **{dataset_len}** records. **Image/video directory**: {dataset[split].vis_root}"
+         )
+
+         if "last_dataset" not in st.session_state:
+             st.session_state.last_dataset = dataset_name
+             st.session_state.last_split = split
+
+         if "last_start" not in st.session_state:
+             st.session_state.last_start = 0
+
+         if "start_idx" not in st.session_state:
+             st.session_state.start_idx = 0
+
+         if "shuffle" not in st.session_state:
+             st.session_state.shuffle = shuffle
+
+         if "first_run" not in st.session_state:
+             st.session_state.first_run = True
+         elif (
+             st.session_state.last_dataset != dataset_name
+             or st.session_state.last_split != split
+         ):
+             st.session_state.first_run = True
+
+             st.session_state.last_dataset = dataset_name
+             st.session_state.last_split = split
+         elif st.session_state.shuffle != shuffle:
+             st.session_state.shuffle = shuffle
+             st.session_state.first_run = True
+
+         if not shuffle:
+             n_col, p_col = st.columns([0.05, 1])
+
+             prev_button = n_col.button(PREV_STR)
+             next_button = p_col.button(NEXT_STR)
+
+         else:
+             next_button = st.button(NEXT_STR)
+
+         if not shuffle:
+             start_idx = st.sidebar.text_input(f"Begin from (total {dataset_len})", 0)
+
+             if not start_idx.isdigit():
+                 st.error(f"Input to 'Begin from' must be digits, found {start_idx}.")
+             else:
+                 if int(start_idx) != st.session_state.start_idx:
+                     st.session_state.start_idx = int(start_idx)
+                     st.session_state.last_start = int(start_idx)
+
+             if prev_button:
+                 show_samples(
+                     dataset[split],
+                     offset=st.session_state.last_start - st.session_state.start_idx,
+                     is_next=False,
+                 )
+
+         if next_button:
+             show_samples(
+                 dataset[split],
+                 offset=st.session_state.last_start - st.session_state.start_idx,
+                 is_next=True,
+             )
+
+         if st.session_state.first_run:
+             st.session_state.first_run = False
+
+             show_samples(
+                 dataset[split],
+                 offset=st.session_state.last_start - st.session_state.start_idx,
+                 is_next=True,
+             )
LAVIS/app/image_text_match.py ADDED
@@ -0,0 +1,87 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import numpy as np
+ import streamlit as st
+ import torch
+ from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
+ from lavis.processors import load_processor
+ from PIL import Image
+
+ from app import device, load_demo_image
+ from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model
+
+
+ def app():
+     model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
+
+     if model_type.startswith("BLIP"):
+         blip_type = model_type.split("_")[1]
+         model = load_blip_itm_model(device, model_type=blip_type)
+
+     vis_processor = load_processor("blip_image_eval").build(image_size=384)
+
+     st.markdown(
+         "<h1 style='text-align: center;'>Image Text Matching</h1>",
+         unsafe_allow_html=True,
+     )
+
+     values = list(range(1, 12))
+     default_layer_num = values.index(7)
+     layer_num = (
+         st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
+     )
+
+     instructions = """Try the provided image or upload your own:"""
+     file = st.file_uploader(instructions)
+
+     col1, col2 = st.columns(2)
+     col1.header("Image")
+     col2.header("GradCam")
+     if file:
+         raw_img = Image.open(file).convert("RGB")
+     else:
+         raw_img = load_demo_image()
+
+     w, h = raw_img.size
+     scaling_factor = 720 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+     col1.image(resized_image, use_column_width=True)
+
+     col3, col4 = st.columns(2)
+     col3.header("Text")
+     user_question = col3.text_input(
+         "Input your sentence!", "a woman sitting on the beach with a dog"
+     )
+     submit_button = col3.button("Submit")
+
+     col4.header("Matching score")
+
+     if submit_button:
+         tokenizer = init_bert_tokenizer()
+
+         img = vis_processor(raw_img).unsqueeze(0).to(device)
+         text_processor = load_processor("blip_caption").build()
+
+         qry = text_processor(user_question)
+
+         norm_img = np.float32(resized_image) / 255
+
+         qry_tok = tokenizer(qry, return_tensors="pt").to(device)
+         gradcam, output = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)
+
+         avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)
+
+         col2.image(avg_gradcam, use_column_width=True, clamp=True)
+         # output = model(img, question)
+         itm_score = torch.nn.functional.softmax(output, dim=1)
+         new_title = (
+             '<p style="text-align: left; font-size: 25px;">\n{:.3f}%</p>'.format(
+                 itm_score[0][1].item() * 100
+             )
+         )
+         col4.markdown(new_title, unsafe_allow_html=True)
LAVIS/app/main.py ADDED
@@ -0,0 +1,25 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ from app.multipage import MultiPage
+ from app import vqa, caption
+ from app import image_text_match as itm
+ from app import text_localization as tl
+ from app import multimodal_search as ms
+ from app import classification as cl
+
+
+ if __name__ == "__main__":
+     app = MultiPage()
+
+     app.add_page("Image Description Generation", caption.app)
+     app.add_page("Multimodal Search", ms.app)
+     app.add_page("Visual Question Answering", vqa.app)
+     app.add_page("Image Text Matching", itm.app)
+     app.add_page("Text Localization", tl.app)
+     app.add_page("Classification", cl.app)
+     app.run()
LAVIS/app/multimodal_search.py ADDED
@@ -0,0 +1,230 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import os
+
+ import numpy as np
+ import streamlit as st
+ import torch
+ import torch.nn.functional as F
+ from app import cache_root, device
+ from app.utils import (
+     getAttMap,
+     init_bert_tokenizer,
+     load_blip_itm_model,
+     read_img,
+     resize_img,
+ )
+ from lavis.models import load_model
+ from lavis.processors import load_processor
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_feat():
+     from lavis.common.utils import download_url
+
+     dirname = os.path.join(os.path.dirname(__file__), "assets")
+     filename = "path2feat_coco_train2014.pth"
+     filepath = os.path.join(dirname, filename)
+     url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/path2feat_coco_train2014.pth"
+
+     if not os.path.exists(filepath):
+         download_url(url=url, root=dirname, filename="path2feat_coco_train2014.pth")
+
+     path2feat = torch.load(filepath)
+     paths = sorted(path2feat.keys())
+
+     all_img_feats = torch.stack([path2feat[k] for k in paths], dim=0).to(device)
+
+     return path2feat, paths, all_img_feats
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_feature_extractor_model(device):
+     model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
+
+     model = load_model(
+         "blip_feature_extractor", model_type="base", is_eval=True, device=device
+     )
+     model.load_from_pretrained(model_url)
+
+     return model
+
+
+ def app():
+     # === layout ===
+     model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
+     file_root = os.path.join(cache_root, "coco/images/train2014/")
+
+     values = [12, 24, 48]
+     default_layer_num = values.index(24)
+     num_display = st.sidebar.selectbox(
+         "Number of images:", values, index=default_layer_num
+     )
+     show_gradcam = st.sidebar.selectbox("Show GradCam:", [True, False], index=1)
+     itm_ranking = st.sidebar.selectbox("Multimodal re-ranking:", [True, False], index=0)
+
+     # st.title('Multimodal Search')
+     st.markdown(
+         "<h1 style='text-align: center;'>Multimodal Search</h1>", unsafe_allow_html=True
+     )
+
+     # === event ===
+     vis_processor = load_processor("blip_image_eval").build(image_size=384)
+     text_processor = load_processor("blip_caption")
+
+     user_question = st.text_input(
+         "Search query", "A dog running on the grass.", help="Type something to search."
+     )
+     user_question = text_processor(user_question)
+     feature_extractor = load_feature_extractor_model(device)
+
+     # ======= ITC =========
+     sample = {"text_input": user_question}
+
+     with torch.no_grad():
+         text_feature = feature_extractor.extract_features(
+             sample, mode="text"
+         ).text_embeds_proj[0, 0]
+
+     path2feat, paths, all_img_feats = load_feat()
+     all_img_feats = all_img_feats.to(device)
+     all_img_feats = F.normalize(all_img_feats, dim=1)
+
+     num_cols = 4
+     num_rows = int(num_display / num_cols)
+
+     similarities = text_feature @ all_img_feats.T
+     indices = torch.argsort(similarities, descending=True)[:num_display]
+
+     top_paths = [paths[ind.detach().cpu().item()] for ind in indices]
+     sorted_similarities = [similarities[idx] for idx in indices]
+     filenames = [os.path.join(file_root, p) for p in top_paths]
+
+     # ========= ITM and GradCam ==========
+     bsz = 4  # max number of images to avoid cuda oom
+     if model_type.startswith("BLIP"):
+         blip_type = model_type.split("_")[1]
+
+     itm_model = load_blip_itm_model(device, model_type=blip_type)
+
+     tokenizer = init_bert_tokenizer()
+     queries_batch = [user_question] * bsz
+     queries_tok_batch = tokenizer(queries_batch, return_tensors="pt").to(device)
+
+     num_batches = int(num_display / bsz)
+
+     avg_gradcams = []
+     all_raw_images = []
+     itm_scores = []
+
+     for i in range(num_batches):
+         filenames_in_batch = filenames[i * bsz : (i + 1) * bsz]
+         raw_images, images = read_and_process_images(filenames_in_batch, vis_processor)
+         gradcam, itm_output = compute_gradcam_batch(
+             itm_model, images, queries_batch, queries_tok_batch
+         )
+
+         all_raw_images.extend([resize_img(r_img) for r_img in raw_images])
+         norm_imgs = [np.float32(r_img) / 255 for r_img in raw_images]
+
+         for norm_img, grad_cam in zip(norm_imgs, gradcam):
+             avg_gradcam = getAttMap(norm_img, grad_cam[0], blur=True)
+             avg_gradcams.append(avg_gradcam)
+
+         with torch.no_grad():
+             itm_score = torch.nn.functional.softmax(itm_output, dim=1)
+
+         itm_scores.append(itm_score)
+
+     # ========= ITM re-ranking =========
+     itm_scores = torch.cat(itm_scores)[:, 1]
+     if itm_ranking:
+         itm_scores_sorted, indices = torch.sort(itm_scores, descending=True)
+
+         avg_gradcams_sorted = []
+         all_raw_images_sorted = []
+         for idx in indices:
+             avg_gradcams_sorted.append(avg_gradcams[idx])
+             all_raw_images_sorted.append(all_raw_images[idx])
+
+         avg_gradcams = avg_gradcams_sorted
+         all_raw_images = all_raw_images_sorted
+
+     if show_gradcam:
+         images_to_show = iter(avg_gradcams)
+     else:
+         images_to_show = iter(all_raw_images)
+
+     for _ in range(num_rows):
+         with st.container():
+             for col in st.columns(num_cols):
+                 col.image(next(images_to_show), use_column_width=True, clamp=True)
+
+
+ def read_and_process_images(image_paths, vis_processor):
+     raw_images = [read_img(path) for path in image_paths]
+     images = [vis_processor(r_img) for r_img in raw_images]
+     images_tensors = torch.stack(images).to(device)
+
+     return raw_images, images_tensors
+
+
+ def compute_gradcam_batch(model, visual_input, text_input, tokenized_text, block_num=6):
+     model.text_encoder.base_model.base_model.encoder.layer[
+         block_num
+     ].crossattention.self.save_attention = True
+
+     output = model({"image": visual_input, "text_input": text_input}, match_head="itm")
+     loss = output[:, 1].sum()
+
+     model.zero_grad()
+     loss.backward()
+     with torch.no_grad():
+         mask = tokenized_text.attention_mask.view(
+             tokenized_text.attention_mask.size(0), 1, -1, 1, 1
+         )  # (bsz, 1, token_len, 1, 1)
+         token_length = mask.sum() - 2
+         token_length = token_length.cpu()
+         # grads and cams [bsz, num_head, seq_len, image_patch]
+         grads = model.text_encoder.base_model.base_model.encoder.layer[
+             block_num
+         ].crossattention.self.get_attn_gradients()
+         cams = model.text_encoder.base_model.base_model.encoder.layer[
+             block_num
+         ].crossattention.self.get_attention_map()
+
+         # assume using vit large with 576 num image patch
+         cams = cams[:, :, :, 1:].reshape(visual_input.size(0), 12, -1, 24, 24) * mask
+         grads = (
+             grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24)
+             * mask
+         )
+
+         gradcam = cams * grads
+         # [enc token gradcam, average gradcam across token, gradcam for individual token]
+         # gradcam = torch.cat((gradcam[0:1,:], gradcam[1:token_length+1, :].sum(dim=0, keepdim=True)/token_length, gradcam[1:, :]))
+         gradcam = gradcam.mean(1).cpu().detach()
+         gradcam = (
+             gradcam[:, 1 : token_length + 1, :].sum(dim=1, keepdim=True) / token_length
+         )
+
+     return gradcam, output
LAVIS/app/multipage.py ADDED
@@ -0,0 +1,41 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ """
+ This file is the framework for combining multiple Streamlit applications
+ through an object-oriented interface.
+ """
+
+ # Import necessary libraries
+ import streamlit as st
+
+
+ # Define the MultiPage class to manage the multiple apps in our program
+ class MultiPage:
+     """Framework for combining multiple Streamlit applications."""
+
+     def __init__(self) -> None:
+         """Initialize the list that stores all registered pages."""
+         self.pages = []
+
+     def add_page(self, title, func) -> None:
+         """Add a page to the project.
+
+         Args:
+             title (str): the title of the page to add to the list of apps.
+
+             func: the Python function that renders this page in Streamlit.
+         """
+
+         self.pages.append({"title": title, "function": func})
+
+     def run(self):
+         # Dropdown to select the page to run
+         page = st.sidebar.selectbox(
+             "Navigation", self.pages, format_func=lambda page: page["title"]
+         )
+
+         # Run the selected page's render function
+         page["function"]()
LAVIS/app/text_localization.py ADDED
@@ -0,0 +1,105 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import math
+
+ import numpy as np
+ import streamlit as st
+ from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
+ from lavis.processors import load_processor
+ from PIL import Image
+
+ from app import device, load_demo_image
+ from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model
+
+
+ def app():
+     model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
+
+     values = list(range(1, 12))
+     default_layer_num = values.index(7)
+     layer_num = (
+         st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
+     )
+
+     st.markdown(
+         "<h1 style='text-align: center;'>Text Localization</h1>", unsafe_allow_html=True
+     )
+
+     vis_processor = load_processor("blip_image_eval").build(image_size=384)
+     text_processor = load_processor("blip_caption")
+
+     tokenizer = init_bert_tokenizer()
+
+     instructions = "Try the provided image and text or use your own."
+     file = st.file_uploader(instructions)
+
+     query = st.text_input(
+         "Try a different input.", "A girl playing with her dog on the beach."
+     )
+
+     submit_button = st.button("Submit")
+
+     col1, col2 = st.columns(2)
+
+     if file:
+         raw_img = Image.open(file).convert("RGB")
+     else:
+         raw_img = load_demo_image()
+
+     col1.header("Image")
+     w, h = raw_img.size
+     scaling_factor = 720 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+     col1.image(resized_image, use_column_width=True)
+
+     col2.header("GradCam")
+
+     if submit_button:
+         if model_type.startswith("BLIP"):
+             blip_type = model_type.split("_")[1]
+             model = load_blip_itm_model(device, model_type=blip_type)
+
+             img = vis_processor(raw_img).unsqueeze(0).to(device)
+             qry = text_processor(query)
+
+             qry_tok = tokenizer(qry, return_tensors="pt").to(device)
+
+             norm_img = np.float32(resized_image) / 255
+
+             gradcam, _ = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)
+
+             avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)
+             col2.image(avg_gradcam, use_column_width=True, clamp=True)
+
+             num_cols = 4.0
+             num_tokens = len(qry_tok.input_ids[0]) - 2
+
+             num_rows = int(math.ceil(num_tokens / num_cols))
+
+             gradcam_iter = iter(gradcam[0][2:-1])
+             token_id_iter = iter(qry_tok.input_ids[0][1:-1])
+
+             for _ in range(num_rows):
+                 with st.container():
+                     for col in st.columns(int(num_cols)):
+                         # use an explicit None sentinel so a 0 token id is not treated as exhaustion
+                         token_id = next(token_id_iter, None)
+                         if token_id is None:
+                             break
+                         gradcam_img = next(gradcam_iter)
+
+                         word = tokenizer.decode([token_id])
+                         gradcam_todraw = getAttMap(norm_img, gradcam_img, blur=True)
+
+                         new_title = (
+                             '<p style="text-align: center; font-size: 25px;">{}</p>'.format(
+                                 word
+                             )
+                         )
+                         col.markdown(new_title, unsafe_allow_html=True)
+                         # st.image(image, channels="BGR")
+                         col.image(gradcam_todraw, use_column_width=True, clamp=True)
LAVIS/app/utils.py ADDED
@@ -0,0 +1,81 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import numpy as np
+ import streamlit as st
+ import torch
+ from lavis.models import BlipBase, load_model
+ from matplotlib import pyplot as plt
+ from PIL import Image
+ from scipy.ndimage import filters
+ from skimage import transform as skimage_transform
+
+
+ def resize_img(raw_img):
+     w, h = raw_img.size
+     scaling_factor = 240 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+     return resized_image
+
+
+ def read_img(filepath):
+     raw_image = Image.open(filepath).convert("RGB")
+
+     return raw_image
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_model_cache(name, model_type, is_eval, device):
+     return load_model(name, model_type, is_eval, device)
+
+
+ @st.cache(allow_output_mutation=True)
+ def init_bert_tokenizer():
+     tokenizer = BlipBase.init_tokenizer()
+     return tokenizer
+
+
+ def getAttMap(img, attMap, blur=True, overlap=True):
+     attMap -= attMap.min()
+     if attMap.max() > 0:
+         attMap /= attMap.max()
+     attMap = skimage_transform.resize(attMap, (img.shape[:2]), order=3, mode="constant")
+     if blur:
+         attMap = filters.gaussian_filter(attMap, 0.02 * max(img.shape[:2]))
+         attMap -= attMap.min()
+         attMap /= attMap.max()
+     cmap = plt.get_cmap("jet")
+     attMapV = cmap(attMap)
+     attMapV = np.delete(attMapV, 3, 2)
+     if overlap:
+         attMap = (
+             1 * (1 - attMap**0.7).reshape(attMap.shape + (1,)) * img
+             + (attMap**0.7).reshape(attMap.shape + (1,)) * attMapV
+         )
+     return attMap
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_blip_itm_model(device, model_type="base"):
+     model = load_model(
+         "blip_image_text_matching", model_type, is_eval=True, device=device
+     )
+     return model
LAVIS/app/vqa.py ADDED
@@ -0,0 +1,63 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import streamlit as st
+ from app import load_demo_image, device
+ from app.utils import load_model_cache
+ from lavis.processors import load_processor
+ from PIL import Image
+
+
+ def app():
+     model_type = st.sidebar.selectbox("Model:", ["BLIP"])
+
+     # ===== layout =====
+     st.markdown(
+         "<h1 style='text-align: center;'>Visual Question Answering</h1>",
+         unsafe_allow_html=True,
+     )
+
+     instructions = """Try the provided image or upload your own:"""
+     file = st.file_uploader(instructions)
+
+     col1, col2 = st.columns(2)
+
+     col1.header("Image")
+     if file:
+         raw_img = Image.open(file).convert("RGB")
+     else:
+         raw_img = load_demo_image()
+
+     w, h = raw_img.size
+     scaling_factor = 720 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+
+     col1.image(resized_image, use_column_width=True)
+     col2.header("Question")
+
+     user_question = col2.text_input("Input your question!", "What objects are there?")
+     qa_button = st.button("Submit")
+
+     col2.header("Answer")
+
+     # ===== event =====
+     vis_processor = load_processor("blip_image_eval").build(image_size=480)
+     text_processor = load_processor("blip_question").build()
+
+     if qa_button:
+         if model_type.startswith("BLIP"):
+             model = load_model_cache(
+                 "blip_vqa", model_type="vqav2", is_eval=True, device=device
+             )
+
+             img = vis_processor(raw_img).unsqueeze(0).to(device)
+             question = text_processor(user_question)
+
+             vqa_samples = {"image": img, "text_input": [question]}
+             answers = model.predict_answers(vqa_samples, inference_method="generate")
+
+             col2.write("\n".join(answers), use_column_width=True)
LAVIS/assets/demo-6.png ADDED

Git LFS Details

  • SHA256: 34cab965b30f15b772bee414771bf1b25ee784593aaccce730cad7f0302db220
  • Pointer size: 132 Bytes
  • Size of remote file: 3.12 MB
LAVIS/dataset_card/avsd_dialogue.md ADDED
@@ -0,0 +1,32 @@
+ ![Samples from the AVSD dataset](imgs/avsd_dialogue.png)
+ 
+ *Samples from the AVSD dataset. Image credit: https://arxiv.org/pdf/1901.09107.pdf*
+ 
+ # Audio-Visual Scene-Aware Dialogues (AVSD)
+ 
+ ## Description
+ [Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each grounded on a unique video. In the test split, 6 reference dialogue responses are provided for each sample.
+ 
+ 
+ ## Task
+ 
+ (from https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge)
+ 
+ In a **video-grounded dialogue task**, the system must generate responses to a user input in the context of a given dialog. This context consists of a dialog history (previous utterances by both user and system) together with the video and audio information that comprise the scene. The quality of a system's automatically generated sentences is evaluated with objective measures to determine whether the generated responses are natural and informative.
15
+
16
+ ## Metrics
17
+ Models are typically evaluated according to [BLEU](https://aclanthology.org/P02-1040/), [CIDER](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf), [METEOR](https://aclanthology.org/W05-0909/), and [ROUGE-L](https://aclanthology.org/W04-1013/) metrics.
18
+
19
+ ## Leaderboard
20
+
21
+ TBD
22
+
23
+
24
+ ## Auto-Downloading
25
+
26
+ Please refer to [benchmark webite](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instruction to download the dataset.
27
+
28
+
29
+ ## References
30
+ "Audio Visual Scene-Aware Dialog", Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
31
+
32
+
LAVIS/dataset_card/coco_caption.md ADDED
@@ -0,0 +1,41 @@
+ ![Samples from the COCO Caption dataset (image credit: https://arxiv.org/pdf/1504.00325.pdf).](imgs/coco_caption.png)
+
+ # Microsoft COCO Dataset (Captioning)
+
+ ## Description
+ The [Microsoft COCO Captions dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
+
+ ## Task
+
+ (from https://paperswithcode.com/task/image-captioning)
+
+ **Image captioning** is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
+
+ ## Metrics
+ Models are typically evaluated according to the [BLEU](https://aclanthology.org/P02-1040/) or [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) metric.
+
+ ## Leaderboard
+
+ (Ranked by CIDEr.)
+
+ | Rank | Model | BLEU-4 | CIDEr | METEOR | SPICE | Resources |
+ | ---- | :-----: | :----: | :---: | :----: | :---: | :----: |
+ | 1 | OFA | 44.9 | 154.9 | 32.5 | 26.6 | [paper](https://arxiv.org/abs/2202.03052), [code](https://github.com/OFA-Sys/OFA) |
+ | 2 | LEMON | 42.6 | 145.5 | 31.4 | 25.5 | [paper]() |
+ | 3 | CoCa | 40.9 | 143.6 | 33.9 | 24.7 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
+ | 4 | SimVLM | 40.6 | 143.3 | 33.7 | 25.4 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
+ | 5 | VinVL | 41.0 | 140.9 | 31.1 | 25.2 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | OSCAR | 40.7 | 140.0 | 30.6 | 24.5 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 7 | BLIP | 40.4 | 136.7 | 31.4 | 24.3 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
+ | 8 | M^2 | 39.1 | 131.2 | 29.2 | 22.6 | [paper](https://arxiv.org/pdf/1912.08226v2.pdf), [code](https://github.com/aimagelab/meshed-memory-transformer) |
+ | 9 | BUTD | 36.5 | 113.5 | 27.0 | 20.3 | [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention) |
+ | 10 | ClipCap | 32.2 | 108.4 | 27.1 | 20.1 | [paper](https://arxiv.org/pdf/2111.09734v1.pdf), [code](https://github.com/rmokady/clip_prefix_caption) |
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_coco.py
+ ```
+
+ ## References
+ "Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
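The BLEU metric referenced above scores a candidate caption by clipped n-gram precision against the reference captions, combined geometrically and scaled by a brevity penalty. A minimal plain-Python sketch (not the official `coco-caption` scorer, which also applies tokenization and corpus-level aggregation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precision with a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            precisions.append(0.0)
            continue
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0.0:
        return 0.0  # any zero precision zeroes the geometric mean
    # Brevity penalty against the closest reference length.
    ref_len = min((len(r) for r in references),
                  key=lambda L: (abs(L - len(candidate)), L))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to a reference scores 1.0; any candidate shorter than four tokens scores 0.0 under BLEU-4, since it has no 4-grams.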
LAVIS/dataset_card/coco_retrieval.md ADDED
@@ -0,0 +1,36 @@
+ ![Samples from the COCO Caption dataset (image credit: https://arxiv.org/pdf/1504.00325.pdf).](imgs/coco_caption.png)
+
+ # Microsoft COCO Dataset (Retrieval)
+
+ ## Description
+ The [Microsoft COCO dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
+
+ ## Task
+ Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the image-text retrieval recall score and IR to denote the text-image retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 2 | X-VLM | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 3 | ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 4 | ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | [paper](https://arxiv.org/abs/2102.05918) |
+ | 5 | VinVL | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 7 | UNITER | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_coco.py
+ ```
+
+ ## References
+ "Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
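The recall@k protocol above can be sketched as follows; `ranked` maps each query to its ranked gallery ids and `relevant` maps each query to its ground-truth ids (the names and toy data are illustrative, not from LAVIS):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant gallery item in the top-k."""
    hits = sum(1 for q, ids in ranked.items() if set(ids[:k]) & relevant[q])
    return hits / len(ranked)

# Toy gallery of three images and two text queries.
ranked = {"q1": ["img_a", "img_b", "img_c"], "q2": ["img_b", "img_a", "img_c"]}
relevant = {"q1": {"img_a"}, "q2": {"img_c"}}
```

Here recall@1 is 0.5 (only `q1` ranks a relevant item first) while recall@3 is 1.0; TR and IR are this same quantity computed with images or texts, respectively, as the queries.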
LAVIS/dataset_card/conceptual_captions.md ADDED
@@ -0,0 +1,37 @@
+ ![Samples from the Conceptual Captions dataset.](imgs/conceptual_captions.png)
+ (image credit: https://ai.google.com/research/ConceptualCaptions/download)
+
+ # Conceptual Captions Dataset
+
+ ## Description
+ (from https://huggingface.co/datasets/conceptual_captions)
+
+ Conceptual Captions 3M (CC3M) is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, an automatic pipeline extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.
+
+ Conceptual Captions 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).
+
+ ## Task
+
+ Image-language pre-training; image captioning.
+
+ ## Auto-Downloading
+ **Warning**: images of this dataset are downloaded by requesting URLs. Since URLs may disappear over time, the downloaded dataset is expected to be partial.
+
+ ### Conceptual Captions 3M
+ - Download images
+ ```
+ cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc3m.py
+ ```
+ - Create annotations by running the notebook
+ ```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_3m.ipynb```
+
+ ### Conceptual Captions 12M
+ - Download images
+ ```
+ cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc12m.py
+ ```
+ - Create annotations by running the notebook
+ ```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_12m.ipynb```
+
+ ## References
+ Edwin G. Ng, Bo Pang, Piyush Sharma and Radu Soricut. 2020. Understanding Guided Image Captioning Performance Across Domains. arXiv preprint arXiv:2012.02339.
LAVIS/dataset_card/didemo_retrieval.md ADDED
@@ -0,0 +1,36 @@
+ ![Samples from the DiDeMo dataset (image credit: https://www.di.ens.fr/~miech/datasetviz/).](imgs/didemo.png)
+
+ # DiDeMo Dataset (Retrieval)
+
+ ## Description
+ The [DiDeMo (Distinct Describable Moments)](https://github.com/LisaAnne/LocalizingMoments) dataset contains over 10,000 unedited personal videos collected from Flickr, annotated with roughly 40,000 natural-language descriptions that localize distinct moments in the videos.
+
+ ## Task
+ Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the video-text retrieval recall score and VR to denote the text-video retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ <!-- | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 2 | X-VLM | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 3 | ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 3 | ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | [paper](https://arxiv.org/abs/2102.05918) |
+ | 4 | VinVL | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 5 | OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | UNITER | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) | -->
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_didemo.py
+ ```
+
+ ## References
+ Anne Hendricks, Lisa, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. "Localizing moments in video with natural language." In Proceedings of the IEEE international conference on computer vision, pp. 5803-5812. 2017.
LAVIS/dataset_card/flickr_retrieval.md ADDED
@@ -0,0 +1,34 @@
+ ![Samples from the Flickr30k dataset (image credit: https://bryanplummer.com/Flickr30kEntities/).](imgs/flickr30k.png)
+
+ # Flickr30K Dataset (Retrieval)
+
+ ## Description
+ The [Flickr30k](https://bryanplummer.com/Flickr30kEntities/) dataset contains 31k+ images collected from Flickr, together with 5 reference sentences provided by human annotators for each image.
+
+ ## Task
+ Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the image-text retrieval recall score and IR to denote the text-image retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 97.2 | 99.9 | 100.0 | 87.5 | 97.7 | 98.9 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 2 | X-VLM | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 3 | ALBEF | 95.9 | 99.8 | 100.0 | 85.6 | 97.5 | 98.9 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 4 | ALIGN | 95.3 | 99.8 | 100.0 | 84.9 | 97.4 | 98.6 | [paper](https://arxiv.org/abs/2102.05918) |
+ | 5 | VILLA | 87.9 | 97.5 | 98.8 | 76.3 | 94.2 | 96.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | UNITER | 87.3 | 98.0 | 99.2 | 75.6 | 94.1 | 96.8 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
+
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_flickr.py
+ ```
+
+ ## References
+ Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik, Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, IJCV, 123(1):74-93, 2017. [paper]
LAVIS/dataset_card/gqa.md ADDED
@@ -0,0 +1,32 @@
+ ![Samples from the GQA dataset (image credit: https://arxiv.org/abs/1902.09506).](imgs/gqa.png)
+
+ # GQA Dataset
+
+ ## Description
+ (from https://cs.stanford.edu/people/dorarad/gqa/about.html)
+
+ GQA is a VQA dataset for real-world images which requires visual, spatial and compositional reasoning.
+ It consists of 22M questions and 110K images.
+
+ ## Task
+ (from https://arxiv.org/abs/1902.09506)
+
+ Given an image and a question, the model is required to output a correct answer.
+ GQA questions require spatial understanding, multiple reasoning skills and multi-step inference.
+
+ ## Metrics
+
+ The metrics are accuracy, consistency, validity, and plausibility. The most commonly reported metric is accuracy.
+
+ ## Leaderboard
+
+ TBD
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_gqa.py
+ ```
+
+ ## References
+ "GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering", Drew A. Hudson, Christopher D. Manning
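The accuracy metric above is an exact match between the generated answer and the ground truth. A minimal sketch with light answer normalization (lowercasing and whitespace collapsing — an assumption for illustration, not the official GQA evaluator):

```python
def normalize(answer):
    """Lowercase and collapse whitespace before comparison (illustrative only)."""
    return " ".join(answer.strip().lower().split())

def accuracy(predictions, references):
    """Exact-match accuracy between predicted and ground-truth answers."""
    pairs = list(zip(predictions, references))
    return sum(normalize(p) == normalize(r) for p, r in pairs) / len(pairs)
```

For example, `accuracy(["Blue ", "dog"], ["blue", "cat"])` returns 0.5: the first pair matches after normalization, the second does not.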
LAVIS/dataset_card/imgs/NLVR2.png ADDED

Git LFS Details

  • SHA256: e796c41791e58c8d36e404612f9bbdf16c99e07761094d362f57a746d8e002b3
  • Pointer size: 132 Bytes
  • Size of remote file: 2.49 MB
LAVIS/dataset_card/imgs/avsd_dialogue.png ADDED

Git LFS Details

  • SHA256: 68512b87cd234ab7d051d117c673b1770fd6a08c774ed3a16ec5e069833192aa
  • Pointer size: 132 Bytes
  • Size of remote file: 1.56 MB
LAVIS/dataset_card/imgs/coco_caption.png ADDED

Git LFS Details

  • SHA256: 950cfe987d6da5aad5ece7ca774ad73c1a24af7fcfda328536a8ce56eeaaf1b8
  • Pointer size: 132 Bytes
  • Size of remote file: 2.45 MB
LAVIS/dataset_card/imgs/conceptual_captions.png ADDED

Git LFS Details

  • SHA256: 5b6b13ab2fbc45076db85b0fbf1e0b1feb84fd0923a38c7db50ae09a5589b2ed
  • Pointer size: 132 Bytes
  • Size of remote file: 1.85 MB
LAVIS/dataset_card/imgs/didemo.png ADDED

Git LFS Details

  • SHA256: bca37c1c073a0ae47bebda0b4f37b3d933311e160cf4785398c6a2455465ddd0
  • Pointer size: 132 Bytes
  • Size of remote file: 2.63 MB
LAVIS/dataset_card/imgs/flickr30k.png ADDED

Git LFS Details

  • SHA256: 6397b8d87738dad19e1e2e5be652a6be6c9c50331023f1c5caa39ddbef981c24
  • Pointer size: 132 Bytes
  • Size of remote file: 1.44 MB
LAVIS/dataset_card/imgs/gqa.png ADDED

Git LFS Details

  • SHA256: 9aab093dd97c874c944f46991eb4bd018bbfc08f3a2fab3afbf27d71d2437866
  • Pointer size: 132 Bytes
  • Size of remote file: 3.12 MB
LAVIS/dataset_card/imgs/msrvtt.png ADDED

Git LFS Details

  • SHA256: defdac15aefc94420db8c64ca10eb19425e4454521b780bd391d032ec2f36a57
  • Pointer size: 132 Bytes
  • Size of remote file: 1.12 MB
LAVIS/dataset_card/imgs/msrvtt_qa.png ADDED

Git LFS Details

  • SHA256: 5518bff90f37cd9ddc2ca9eb96f9378075c341d3676c6708b0aabc9e7e78e758
  • Pointer size: 131 Bytes
  • Size of remote file: 335 kB
LAVIS/dataset_card/imgs/msvd_qa.png ADDED

Git LFS Details

  • SHA256: 05b7bf302d23a363996d0390bfd3c1ffc5f95b2b0f826c057af169ebd3d9e93f
  • Pointer size: 131 Bytes
  • Size of remote file: 388 kB
LAVIS/dataset_card/imgs/nocaps.png ADDED

Git LFS Details

  • SHA256: c087d4838a5d43612158a960bfd4200b3ef00303447ed27ba4ffc23a37588b6b
  • Pointer size: 132 Bytes
  • Size of remote file: 1.51 MB
LAVIS/dataset_card/imgs/sbu_caption.png ADDED

Git LFS Details

  • SHA256: 5fea4744bed8090a23f66689a8b4ca1fde008752f3830b24afe8d3eb81c2a2ea
  • Pointer size: 132 Bytes
  • Size of remote file: 1.52 MB
LAVIS/dataset_card/imgs/snli_ve.png ADDED

Git LFS Details

  • SHA256: 4660107e5ef8599fe587ebe6b8bf7b4b5561c796272f927a1d9e4aa5f8e15420
  • Pointer size: 132 Bytes
  • Size of remote file: 2.33 MB
LAVIS/dataset_card/imgs/vqav2.png ADDED

Git LFS Details

  • SHA256: 4101a7978e0870f95f5950cfe64697b2f1b7de725b056546ba950d27e2080bec
  • Pointer size: 132 Bytes
  • Size of remote file: 2.28 MB
LAVIS/dataset_card/msrvtt_qa.md ADDED
@@ -0,0 +1,44 @@
+ ![Samples from the MSRVTT-QA dataset (image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf).](imgs/msrvtt_qa.png)
+
+ # MSRVTT Dataset (Video Question Answering)
+
+ ## Description
+ The [MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. It was built by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.
+
+ The [MSRVTT-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on the MSR-VTT dataset, which is larger and has more complex scenes. The dataset contains 10K video clips and 243k question-answer pairs.
+
+ ## Task
+ Video question answering (VideoQA) is the task where a video and a natural language question are provided and the model needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).
+
+ ## Metrics
+ Accuracy.
+
+ ## Leaderboard
+ (Ranked by accuracy on test-dev.)
+ | Rank | Model | Acc. | Resources |
+ | ---- | :----: | :-------: | :-------: |
+ | 1 | ALPro | 42.1 | [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
+ | 2 | VQA-T | 41.5 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
+ | 3 | CoMVT | 39.5 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
+ | 4 | ClipBERT | 37.4 | [paper](https://arxiv.org/abs/2102.06183), [code](https://github.com/jayleicn/ClipBERT) |
+ | 5 | HCRN | 35.6 | [paper](https://arxiv.org/abs/2002.10698), [code](https://github.com/thaolmk54/hcrn-videoqa) |
+ | 6 | HGA | 35.5 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767), [code](https://github.com/Jumpin2/HGA) |
+ | 7 | DualVGR | 35.5 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf), [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
+ | 8 | SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
+ | 9 | HME | 33.0 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
+ | 10 | AMU | 32.5 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |
+
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_msrvtt.py
+ ```
+
+ ## References
+ Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.
+
+ Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.
LAVIS/dataset_card/msrvtt_retrieval.md ADDED
@@ -0,0 +1,24 @@
+ ![Samples from the MSRVTT dataset.](imgs/msrvtt.png)
+
+ # MSRVTT Dataset (Retrieval)
+
+ ## Description
+ The [MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. It was built by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.
+
+ ## Task
+ Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the video-text retrieval recall score and VR to denote the text-video retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ <!-- | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) | -->
+
+ ## References
+ Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.
LAVIS/dataset_card/msvd_qa.md ADDED
@@ -0,0 +1,45 @@
+ ![Samples from the MSVD-QA dataset (image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf).](imgs/msvd_qa.png)
+
+ # MSVD Dataset (Video Question Answering)
+
+ ## Description
+ The [MSVD-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on the [Microsoft Research Video Description Corpus](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/), which is used in many video captioning experiments. The MSVD-QA dataset has a total of 1,970 video clips and 50,505 question-answer pairs.
+
+ ## Task
+ Video question answering (VideoQA) is the task where a video and a natural language question are provided and the model needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).
+
+ ## Metrics
+ Accuracy.
+
+ ## Leaderboard
+ (Ranked by accuracy on test-dev.)
+ | Rank | Model | Acc. | Resources |
+ | ---- | :----: | :-------: | :-------: |
+ | 1 | VQA-T | 46.3 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
+ | 2 | ALPro | 45.9 | [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
+ | 3 | CoMVT | 42.6 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
+ | 4 | DualVGR | 39.0 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf), [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
+ | 5 | HCRN | 36.1 | [paper](https://arxiv.org/abs/2002.10698), [code](https://github.com/thaolmk54/hcrn-videoqa) |
+ | 6 | SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
+ | 7 | HGA | 34.7 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767), [code](https://github.com/Jumpin2/HGA) |
+ | 8 | HME | 33.7 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
+ | 9 | AMU | 32.0 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |
+ | 10 | ST-VQA | 31.3 | [paper](https://arxiv.org/pdf/1704.04497.pdf), [code](https://github.com/YunseokJANG/tgif-qa) |
+
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_msvd.py
+ ```
+
+ ## References
+ Chen, David, and William B. Dolan. "Collecting highly parallel data for paraphrase evaluation." In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 190-200. 2011.
+
+ Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.
LAVIS/dataset_card/nlvr2.md ADDED
@@ -0,0 +1,42 @@
1
+ ![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/NLVR2.png)
2
+
3
+ # Natural Language for Visual Reasoning for Real (NLVR2)
4
+
5
+ ## Description
6
+ (from https://lil.nlp.cornell.edu/nlvr/)
7
+
8
+ NLVR2 contains 107,292 examples of human-written English sentences grounded in pairs of photographs. NLVR2 retains the linguistic diversity of NLVR, while including much more visually complex images.
9
+
10
+ We only publicly release the sentence annotations and original image URLs, and scripts that download the images from the URLs. If you would like direct access to the images, please fill out this Google Form. This form asks for your basic information and asks you to agree to our Terms of Service.
11
+
12
+
13
+ ## Task
14
+ (from https://lil.nlp.cornell.edu/nlvr/)
15
+ The Natural Language for Visual Reasoning (NLVR) task is to determine whether a sentence is true about a visual input. The data was collected through crowdsourcings, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. This includes two corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.
16
+
17
+
18
+ ## Metrics
19
+ Accuracy.
+
+ ## Leaderboard
+ (Ranked by accuracy on dev.)
+ | Rank | Model | dev | test | Resources |
+ | ---- | :----: | :------: | :------: | :-------: |
+ | 1 | VLMo | 88.6 | 89.5 | [paper](https://arxiv.org/pdf/2111.02358.pdf) |
+ | 2 | CoCa | 86.1 | 87.0 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
+ | 3 | SimVLM | 84.5 | 85.2 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
+ | 4 | X-VLM | 84.4 | 84.8 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 5 | VinVL | 82.7 | 84.0 | [paper](https://arxiv.org/pdf/2101.00529.pdf), [code](https://github.com/pzzhang/VinVL) |
+ | 6 | ALBEF | 82.6 | 83.1 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 7 | BLIP | 82.2 | 82.2 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 8 | OSCAR | 78.1 | 78.4 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 9 | UNITER | 77.2 | 77.9 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
+ | 10 | SOHO | 76.4 | 77.3 | [paper](https://arxiv.org/pdf/2104.03135.pdf), [code](https://github.com/researchmm/soho) |
+
+
+ ## Downloading
+ Auto-downloading is not supported for this dataset. Please refer to https://lil.nlp.cornell.edu/nlvr/ and fill in the Google form to download the original images.
+
+
+ ## References
+ Suhr, Alane, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. "A corpus for reasoning about natural language grounded in photographs." arXiv preprint arXiv:1811.00491 (2018).
LAVIS/dataset_card/nocaps.md ADDED
@@ -0,0 +1,38 @@
+ ![Samples from the nocaps dataset (Image credit: "https://arxiv.org/abs/1812.08658").](imgs/nocaps.png)
+
+ # Nocaps
+
+ ## Description
+
+ (from https://nocaps.org/)
+
+ Our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps).
+
+
+ ## Task: Novel object captioning
+
+ (from https://nocaps.org/)
+
+ Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task, dubbed nocaps, for novel object captioning at scale.
+
+
+ ## Metrics
+ Models are typically evaluated according to the [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) or [SPICE](https://arxiv.org/abs/1607.08822) metric.
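CIDEr scores a candidate caption by the cosine similarity of its TF-IDF weighted n-gram vector against each reference caption, averaged over references (and, in the full metric, over n-gram orders with a length penalty). As a rough, self-contained illustration of that core idea only (unigrams, no stemming or penalty; this is not the official implementation, which lives in the `pycocoevalcap` toolkit), one might write:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cider_sketch(candidate, references, all_refs, n=1):
    """Toy CIDEr-style score: TF-IDF weighted n-gram cosine similarity
    between a candidate caption and each reference caption, averaged."""
    # Document frequency: in how many images' reference sets each n-gram occurs.
    df = Counter()
    for refs in all_refs:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        df.update(seen)
    num_images = len(all_refs)

    def tfidf(tokens):
        counts = Counter(ngrams(tokens, n))
        total = sum(counts.values())
        # Frequent n-grams (seen in many images' references) get weight ~0.
        return {g: (c / total) * math.log(num_images / max(df[g], 1))
                for g, c in counts.items()}

    def cosine(a, b):
        dot = sum(v * b.get(g, 0.0) for g, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    cand = tfidf(candidate.split())
    return sum(cosine(cand, tfidf(r.split())) for r in references) / len(references)

# Toy corpus: reference captions for two images.
refs_img1 = ["a dog runs in the park", "a dog playing in the grass"]
refs_img2 = ["a cat sits on a mat"]
corpus = [refs_img1, refs_img2]

print(cider_sketch("a dog runs in the park", refs_img1, corpus))  # high: overlaps refs
print(cider_sketch("a cat on a mat", refs_img1, corpus))          # low: off-topic
```

An on-topic candidate scores high against its image's references, while an off-topic one scores near zero because its distinctive n-grams never overlap them.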
+
+ ## Leaderboard
+
+ (Ranked by CIDEr)
+
+ | Rank | Model | val. CIDEr | val. SPICE | test CIDEr | test SPICE | Resources |
+ | ---- | :----: | :----: | :----: | :----: | :----: | :----: |
+ | 1 | CoCa | 122.4 | 15.5 | 120.6 | 15.5 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
+ | 2 | LEMON | 117.3 | 15.0 | 114.3 | 14.9 | [paper]() |
+ | 3 | BLIP | 113.2 | 14.8 | - | - | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
+ | 4 | SimVLM | 112.2 | - | 110.3 | 14.5 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
+ | 5 | VinVL | 105.1 | 14.4 | 103.7 | 14.4 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_nocaps.py
+ ```
+
+ ## References
+ Agrawal, Harsh, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. "Nocaps: Novel object captioning at scale." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948-8957. 2019.