manhteky123 committed
Commit 1762ba0 · verified · 1 parent: c802cc8

Upload 1462 files

This view is limited to 50 files because the commit contains too many changes.

Files changed (50):
  1. .gitattributes +48 -0
  2. Dockerfile +81 -103
  3. LAVIS/.github/workflows/docs.yaml +38 -0
  4. LAVIS/.gitignore +159 -0
  5. LAVIS/.pre-commit-config.yaml +26 -0
  6. LAVIS/CODEOWNERS +2 -0
  7. LAVIS/CODE_OF_CONDUCT.md +105 -0
  8. LAVIS/LICENSE.txt +14 -0
  9. LAVIS/MANIFEST.in +8 -0
  10. LAVIS/README.md +328 -0
  11. LAVIS/SECURITY.md +7 -0
  12. LAVIS/app/__init__.py +26 -0
  13. LAVIS/app/calculate_coco_features.py +87 -0
  14. LAVIS/app/caption.py +98 -0
  15. LAVIS/app/classification.py +216 -0
  16. LAVIS/app/dataset_browser.py +240 -0
  17. LAVIS/app/image_text_match.py +87 -0
  18. LAVIS/app/main.py +25 -0
  19. LAVIS/app/multimodal_search.py +230 -0
  20. LAVIS/app/multipage.py +41 -0
  21. LAVIS/app/text_localization.py +105 -0
  22. LAVIS/app/utils.py +81 -0
  23. LAVIS/app/vqa.py +63 -0
  24. LAVIS/assets/demo-6.png +3 -0
  25. LAVIS/dataset_card/avsd_dialogue.md +32 -0
  26. LAVIS/dataset_card/coco_caption.md +41 -0
  27. LAVIS/dataset_card/coco_retrieval.md +36 -0
  28. LAVIS/dataset_card/conceptual_captions.md +37 -0
  29. LAVIS/dataset_card/didemo_retrieval.md +36 -0
  30. LAVIS/dataset_card/flickr_retrieval.md +34 -0
  31. LAVIS/dataset_card/gqa.md +32 -0
  32. LAVIS/dataset_card/imgs/NLVR2.png +3 -0
  33. LAVIS/dataset_card/imgs/avsd_dialogue.png +3 -0
  34. LAVIS/dataset_card/imgs/coco_caption.png +3 -0
  35. LAVIS/dataset_card/imgs/conceptual_captions.png +3 -0
  36. LAVIS/dataset_card/imgs/didemo.png +3 -0
  37. LAVIS/dataset_card/imgs/flickr30k.png +3 -0
  38. LAVIS/dataset_card/imgs/gqa.png +3 -0
  39. LAVIS/dataset_card/imgs/msrvtt.png +3 -0
  40. LAVIS/dataset_card/imgs/msrvtt_qa.png +3 -0
  41. LAVIS/dataset_card/imgs/msvd_qa.png +3 -0
  42. LAVIS/dataset_card/imgs/nocaps.png +3 -0
  43. LAVIS/dataset_card/imgs/sbu_caption.png +3 -0
  44. LAVIS/dataset_card/imgs/snli_ve.png +3 -0
  45. LAVIS/dataset_card/imgs/vqav2.png +3 -0
  46. LAVIS/dataset_card/msrvtt_qa.md +44 -0
  47. LAVIS/dataset_card/msrvtt_retrieval.md +24 -0
  48. LAVIS/dataset_card/msvd_qa.md +45 -0
  49. LAVIS/dataset_card/nlvr2.md +42 -0
  50. LAVIS/dataset_card/nocaps.md +38 -0
.gitattributes CHANGED
@@ -34,3 +34,51 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  emo/train.json filter=lfs diff=lfs merge=lfs -text
+ LAVIS/assets/demo-6.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/avsd_dialogue.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/coco_caption.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/conceptual_captions.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/didemo.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/flickr30k.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/gqa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/msrvtt_qa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/msrvtt.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/msvd_qa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/NLVR2.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/nocaps.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/sbu_caption.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/snli_ve.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/dataset_card/imgs/vqav2.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/docs/_static/architecture.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/docs/_static/merlion.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/lavis/configs/datasets/discriminatory_reasoning/discriminatory_dataset/objaverse_discrn.json filter=lfs diff=lfs merge=lfs -text
+ LAVIS/lavis/models/clip_models/pics/CLIP.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/black-cat.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/cat-sofa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dog.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dog2.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/00.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/01.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/02.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/03.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dreambooth/dog8/04.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/dress-model.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/green-skirt.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/jacket-letter-s/jacket-letter-s.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/pink-dress.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/pink-dress/pink-dress.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/images/shein-jacket/shein-jacket.jpg filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip-diffusion/teaser-website.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/blip2/blip2_illustration.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/Caption.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/demo.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/Illustration.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/img2llm-vqa/QuestionGeneration.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/instructblip/comparison.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/instructblip/showcase.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/pnp-vqa/pnp_vqa.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/assets/architecture.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/assets/data.png filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/demo/examples/audio/110714_wren.wav filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/demo/examples/audio/Group_of_Dogs_Barking.wav filter=lfs diff=lfs merge=lfs -text
+ LAVIS/projects/xinstructblip/demo/examples/point_cloud/banana.glb filter=lfs diff=lfs merge=lfs -text
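Every added entry follows the `.gitattributes` syntax `pattern attribute…`, and lines of exactly this shape are what `git lfs track <pattern>` appends. As a rough illustration (hypothetical helper names, not part of the commit), such entries can be parsed to list which patterns are LFS-tracked:

```python
def parse_gitattributes(text):
    """Parse .gitattributes lines into (pattern, attributes) pairs.

    Attributes of the form `key=value` become dict entries; bare attributes
    like `text` / `-text` are stored as True / False (set / unset).
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        pattern, *attrs = line.split()
        parsed = {}
        for attr in attrs:
            if "=" in attr:
                key, value = attr.split("=", 1)
                parsed[key] = value
            elif attr.startswith("-"):
                parsed[attr[1:]] = False
            else:
                parsed[attr] = True
        entries.append((pattern, parsed))
    return entries


def lfs_patterns(text):
    """Return the patterns whose `filter` attribute is `lfs`."""
    return [p for p, a in parse_gitattributes(text) if a.get("filter") == "lfs"]
```

For example, feeding in `"*.zst filter=lfs diff=lfs merge=lfs -text"` yields the pattern `*.zst` with `filter` set to `"lfs"` and `text` unset.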
Dockerfile CHANGED
@@ -1,14 +1,26 @@
- # Use conda base image with Python
- FROM continuumio/miniconda3:latest
  # Set working directory
  WORKDIR /app

  # Install system dependencies
  RUN apt-get update && apt-get install -y \
  git \
  build-essential \
- libgl1-mesa-dev \
  libglib2.0-0 \
  libsm6 \
  libxext6 \
@@ -16,114 +28,80 @@ RUN apt-get update && apt-get install -y \
  libgomp1 \
  libgcc-s1 \
  ffmpeg \
- libgtk-3-dev \
- pkg-config \
- curl \
- wget \
- sudo \
  && rm -rf /var/lib/apt/lists/*

- # Create a non-root user for better security and compatibility with HF Spaces
- RUN useradd -m -u 1000 user && \
- echo "user ALL=(ALL) NOPASSWD:ALL" >> /etc/sudoers

- # Create conda environment with Python 3.8
- RUN conda create --name emovit python=3.8 -y

- # Make RUN commands use the new environment
- SHELL ["conda", "run", "-n", "emovit", "/bin/bash", "-c"]
-
- # Clone EmoVIT repository
- RUN git clone https://github.com/aimmemotion/EmoVIT.git
-
- # Set working directory to EmoVIT
- WORKDIR /app/EmoVIT
-
- # Install requirements_lavis.txt
  RUN pip install --no-cache-dir --upgrade pip setuptools wheel
- RUN pip install --no-cache-dir -r requirements_lavis.txt
-
- # Install PyTorch with CUDA 11.8 support
- RUN pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cu118
-
- # Try to install salesforce-lavis first
- RUN pip install salesforce-lavis || echo "salesforce-lavis installation failed, proceeding with manual installation"
-
- # If salesforce-lavis installation failed, proceed with manual installation
- # Clone LAVIS repository and install manually
- WORKDIR /app
- RUN git clone https://github.com/salesforce/LAVIS.git
-
- WORKDIR /app/LAVIS
-
- # Install open3d to resolve the import issue
- RUN pip install open3d --no-cache-dir

- # Install LAVIS in editable mode
- RUN pip install -e .
-
- # Move back to EmoVIT directory
- WORKDIR /app/EmoVIT
-
- # Create lib directory and copy lavis folder as mentioned in instructions
- RUN mkdir -p lib
- RUN cp -r /app/LAVIS/lavis lib/
-
- # Copy local files to override cloned files if needed
- COPY . /app/EmoVIT/
-
- # Set up cache directories with proper permissions
- RUN mkdir -p /app/.cache/huggingface/hub && \
- mkdir -p /app/.cache/torch && \
- mkdir -p /app/.cache/datasets && \
- mkdir -p /app/.cache/transformers && \
- chmod -R 777 /app && \
- chown -R user:user /app
-
- # Set environment variables to use /app/.cache instead of root cache
- ENV CONDA_DEFAULT_ENV=emovit
- ENV PATH="/opt/conda/envs/emovit/bin:$PATH"
- ENV PYTHONPATH=/app/EmoVIT
- ENV HF_HOME=/app/.cache/huggingface
- ENV TORCH_HOME=/app/.cache/torch
- ENV HF_DATASETS_CACHE=/app/.cache/datasets
- ENV HUGGINGFACE_HUB_CACHE=/app/.cache/huggingface/hub
- ENV TRANSFORMERS_CACHE=/app/.cache/transformers
- ENV XDG_CACHE_HOME=/app/.cache
-
- # Switch to non-root user for running the application
- USER user
-
- # Ensure conda works for the user
- RUN conda init bash
-
- # Activate the conda environment by default
- RUN echo "conda activate emovit" >> ~/.bashrc
-
- # Create necessary directories
- RUN mkdir -p static/css templates
-
- # Make start script executable if it exists
- RUN if [ -f start.sh ]; then chmod +x start.sh; fi
-
- # Expose ports for potential web applications
- EXPOSE 7860 8000 8080 8501

  # Health check
- HEALTHCHECK --interval=30s --timeout=30s --start-period=5s --retries=3 \
  CMD curl -f http://localhost:7860/health || exit 1

- # Debug check - verify installations
- RUN python -c "import torch, transformers, numpy; \
- print('Torch:', torch.__version__); \
- print('Transformers:', transformers.__version__); \
- print('NumPy:', numpy.__version__);"
-
- # Test LAVIS import
- RUN python -c "import lavis; print('LAVIS import OK')" || echo "LAVIS import failed"
-
- # Test open3d import
- RUN python -c "import open3d; print('Open3D import OK')" || echo "Open3D import failed"

- # Default command
- CMD ["conda", "run", "-n", "emovit", "bash", "-c", "if [ -f start.sh ]; then ./start.sh; else bash; fi"]
 
+ FROM python:3.8-slim

  # Set working directory
  WORKDIR /app

+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV PYTHONPATH=/app:/app/lib
+ ENV TRANSFORMERS_CACHE=/app/.cache/transformers
+ ENV HF_HOME=/app/.cache/huggingface
+ ENV TORCH_HOME=/app/.cache/torch
+ ENV HF_DATASETS_CACHE=/app/.cache/datasets
+ ENV HUGGINGFACE_HUB_CACHE=/app/.cache/huggingface/hub
+ ENV PORT=7860
+ ENV DEBIAN_FRONTEND=noninteractive
+
  # Install system dependencies
  RUN apt-get update && apt-get install -y \
  git \
+ wget \
+ curl \
  build-essential \
+ libgl1-mesa-glx \
  libglib2.0-0 \
  libsm6 \
  libxext6 \
  libgomp1 \
  libgcc-s1 \
  ffmpeg \
  && rm -rf /var/lib/apt/lists/*

+ # Create cache directories
+ RUN mkdir -p /app/.cache/transformers \
+ && mkdir -p /app/.cache/huggingface/hub \
+ && mkdir -p /app/.cache/torch \
+ && mkdir -p /app/.cache/datasets \
+ && chmod -R 777 /app/.cache

+ # Copy requirements files first for better caching
+ COPY requirements_lavis.txt /app/
+ COPY requirements_emo.txt /app/

+ # Install Python dependencies
  RUN pip install --no-cache-dir --upgrade pip setuptools wheel

+ # Install PyTorch first (crucial for proper dependency resolution)
+ RUN pip install --no-cache-dir torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1 --index-url https://download.pytorch.org/whl/cpu
+
+ # Install other dependencies
+ RUN pip install --no-cache-dir -r requirements_lavis.txt || true
+ RUN pip install --no-cache-dir -r requirements_emo.txt || true
+
+ # Install additional dependencies for web app
+ RUN pip install --no-cache-dir \
+ flask \
+ pillow \
+ numpy \
+ opencv-python-headless \
+ transformers==4.28.0 \
+ accelerate \
+ bitsandbytes \
+ sentencepiece \
+ protobuf
+
+ # Install LAVIS
+ RUN pip install --no-cache-dir salesforce-lavis || \
+ (git clone https://github.com/salesforce/LAVIS.git /tmp/LAVIS && \
+ cd /tmp/LAVIS && \
+ pip install -e . && \
+ cp -r lavis /app/lib/ && \
+ rm -rf /tmp/LAVIS)
+
+ # Copy application files
+ COPY . /app/
+
+ # Copy LAVIS folder if it exists
+ COPY LAVIS/ /app/LAVIS/
+
+ # Ensure the blip2_vicuna_instruct.py is in the right place
+ RUN if [ -f "/app/blip2_vicuna_instruct.py" ]; then \
+ cp /app/blip2_vicuna_instruct.py /app/LAVIS/lavis/models/blip2_models/ 2>/dev/null || true; \
+ fi
+
+ # Create lib directory and ensure proper structure
+ RUN mkdir -p /app/lib && \
+ if [ -d "/app/LAVIS/lavis" ]; then cp -r /app/LAVIS/lavis /app/lib/; fi
+
+ # Make start script executable
+ RUN chmod +x /app/start.sh
+
+ # Create necessary directories for the application
+ RUN mkdir -p /app/templates /app/static /app/emo
+
+ # Set proper permissions
+ RUN chmod -R 755 /app && \
+ chmod -R 777 /app/.cache

  # Health check
+ HEALTHCHECK --interval=30s --timeout=30s --start-period=60s --retries=3 \
  CMD curl -f http://localhost:7860/health || exit 1

+ # Expose port
+ EXPOSE 7860

+ # Use start script
+ CMD ["./start.sh"]
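The rewritten image keeps a `HEALTHCHECK` that runs `curl -f http://localhost:7860/health`, so whatever `start.sh` launches must answer `GET /health` on port 7860. A minimal stdlib sketch of such an endpoint (hypothetical — the repo's actual Flask app and `start.sh` are not shown in this 50-file view):

```python
import http.server
import json
import threading


class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Answers the Docker HEALTHCHECK probe with 200 and a small JSON body."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep container logs quiet


def start_health_server(port):
    """Serve /health in a background thread; returns the server for shutdown()."""
    server = http.server.HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With something like this listening on the container's port, `curl -f http://localhost:7860/health` exits 0 and the health check passes; otherwise the `-f` flag makes curl exit non-zero and the container is marked unhealthy.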
LAVIS/.github/workflows/docs.yaml ADDED
@@ -0,0 +1,38 @@
+ name: docs
+
+ on:
+   push:
+     branches: [ main ]
+   pull_request:
+     branches: [ main ]
+   release:
+     types: [ published ]
+
+ jobs:
+   build:
+
+     runs-on: ubuntu-18.04
+
+     steps:
+     - uses: actions/checkout@v2
+       with:
+         fetch-depth: 0
+     - name: Set up Python
+       uses: actions/setup-python@v2
+       with:
+         python-version: '3.8'
+     - name: Install dependencies
+       run: |
+         python -m pip install --upgrade pip setuptools wheel
+         sudo apt-get update
+         sudo apt-get install openjdk-11-jdk
+         sudo apt-get install pandoc
+     - name: Build Sphinx docs
+       run: |
+         docs/build_docs.sh
+     - name: Deploy to gh-pages
+       uses: peaceiris/actions-gh-pages@v3
+       if: ${{ github.ref == 'refs/heads/main' || github.event_name == 'release' }}
+       with:
+         github_token: ${{ secrets.GITHUB_TOKEN }}
+         publish_dir: docs/_build/html
LAVIS/.gitignore ADDED
@@ -0,0 +1,159 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+ cover/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ .pybuilder/
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ # For a library or package, you might want to ignore these files since the code is
+ # intended to run in multiple environments; otherwise, check them in:
+ # .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # poetry
+ # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
+ # This is especially recommended for binary packages to ensure reproducibility, and is more
+ # commonly ignored for libraries.
+ # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
+ #poetry.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
+
+ # pytype static type analyzer
+ .pytype/
+
+ # Cython debug symbols
+ cython_debug/
+
+ # project-specific
+ output/
+ debug*/
+ *.bak
+ *.dir
+ *.dat
+ *.tsv
+ *.gz
+ *.csv
+ *.p
+ *.pdf
+
+ cache/
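Most of the ignore rules above are plain globs (`*.py[cod]`, `*.log`, `*.egg-info/`). As a rough approximation (real gitignore matching adds anchoring, directory-only rules like `build/`, and `!` negation, none of which `fnmatch` models), a bare file name can be checked against such patterns like this:

```python
import fnmatch


def is_ignored(name, patterns):
    """Approximate gitignore-style matching of a bare file name.

    Only handles simple glob patterns, including character classes like
    `*.py[cod]`; directory rules and `!` negations are NOT modeled here.
    """
    return any(fnmatch.fnmatch(name, pattern) for pattern in patterns)
```

For instance, `is_ignored("module.pyc", ["*.py[cod]"])` is true while `is_ignored("module.py", ["*.py[cod]"])` is false, because the `[cod]` class requires exactly one more character after `.py`.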
LAVIS/.pre-commit-config.yaml ADDED
@@ -0,0 +1,26 @@
+ repos:
+   - repo: https://github.com/pre-commit/pre-commit-hooks
+     rev: v4.1.0
+     hooks:
+       - id: trailing-whitespace
+       - id: check-ast
+       - id: no-commit-to-branch
+         args: ['--branch=main']
+       - id: check-added-large-files
+         args: ['--maxkb=5000']
+       - id: end-of-file-fixer
+
+   - repo: https://github.com/psf/black
+     rev: stable
+     hooks:
+       - id: black
+         language_version: python3.8
+
+   - repo: https://github.com/PyCQA/flake8
+     rev: 3.9.2
+     hooks:
+       - id: flake8
+         args: [
+           # only error for syntax errors and undefined names
+           "--select=E9,F63,F7,F82",
+         ]
LAVIS/CODEOWNERS ADDED
@@ -0,0 +1,2 @@
+ # Comment line immediately above ownership line is reserved for related gus information. Please be careful while editing.
+ #ECCN:Open Source
LAVIS/CODE_OF_CONDUCT.md ADDED
@@ -0,0 +1,105 @@
+ # Salesforce Open Source Community Code of Conduct
+
+ ## About the Code of Conduct
+
+ Equality is a core value at Salesforce. We believe a diverse and inclusive
+ community fosters innovation and creativity, and are committed to building a
+ culture where everyone feels included.
+
+ Salesforce open-source projects are committed to providing a friendly, safe, and
+ welcoming environment for all, regardless of gender identity and expression,
+ sexual orientation, disability, physical appearance, body size, ethnicity, nationality,
+ race, age, religion, level of experience, education, socioeconomic status, or
+ other similar personal characteristics.
+
+ The goal of this code of conduct is to specify a baseline standard of behavior so
+ that people with different social values and communication styles can work
+ together effectively, productively, and respectfully in our open source community.
+ It also establishes a mechanism for reporting issues and resolving conflicts.
+
+ All questions and reports of abusive, harassing, or otherwise unacceptable behavior
+ in a Salesforce open-source project may be reported by contacting the Salesforce
+ Open Source Conduct Committee at ossconduct@salesforce.com.
+
+ ## Our Pledge
+
+ In the interest of fostering an open and welcoming environment, we as
+ contributors and maintainers pledge to making participation in our project and
+ our community a harassment-free experience for everyone, regardless of gender
+ identity and expression, sexual orientation, disability, physical appearance,
+ body size, ethnicity, nationality, race, age, religion, level of experience, education,
+ socioeconomic status, or other similar personal characteristics.
+
+ ## Our Standards
+
+ Examples of behavior that contributes to creating a positive environment
+ include:
+
+ * Using welcoming and inclusive language
+ * Being respectful of differing viewpoints and experiences
+ * Gracefully accepting constructive criticism
+ * Focusing on what is best for the community
+ * Showing empathy toward other community members
+
+ Examples of unacceptable behavior by participants include:
+
+ * The use of sexualized language or imagery and unwelcome sexual attention or
+ advances
+ * Personal attacks, insulting/derogatory comments, or trolling
+ * Public or private harassment
+ * Publishing, or threatening to publish, others' private information—such as
+ a physical or electronic address—without explicit permission
+ * Other conduct which could reasonably be considered inappropriate in a
+ professional setting
+ * Advocating for or encouraging any of the above behaviors
+
+ ## Our Responsibilities
+
+ Project maintainers are responsible for clarifying the standards of acceptable
+ behavior and are expected to take appropriate and fair corrective action in
+ response to any instances of unacceptable behavior.
+
+ Project maintainers have the right and responsibility to remove, edit, or
+ reject comments, commits, code, wiki edits, issues, and other contributions
+ that are not aligned with this Code of Conduct, or to ban temporarily or
+ permanently any contributor for other behaviors that they deem inappropriate,
+ threatening, offensive, or harmful.
+
+ ## Scope
+
+ This Code of Conduct applies both within project spaces and in public spaces
+ when an individual is representing the project or its community. Examples of
+ representing a project or community include using an official project email
+ address, posting via an official social media account, or acting as an appointed
+ representative at an online or offline event. Representation of a project may be
+ further defined and clarified by project maintainers.
+
+ ## Enforcement
+
+ Instances of abusive, harassing, or otherwise unacceptable behavior may be
+ reported by contacting the Salesforce Open Source Conduct Committee
+ at ossconduct@salesforce.com. All complaints will be reviewed and investigated
+ and will result in a response that is deemed necessary and appropriate to the
+ circumstances. The committee is obligated to maintain confidentiality with
+ regard to the reporter of an incident. Further details of specific enforcement
+ policies may be posted separately.
+
+ Project maintainers who do not follow or enforce the Code of Conduct in good
+ faith may face temporary or permanent repercussions as determined by other
+ members of the project's leadership and the Salesforce Open Source Conduct
+ Committee.
+
+ ## Attribution
+
+ This Code of Conduct is adapted from the [Contributor Covenant][contributor-covenant-home],
+ version 1.4, available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html.
+ It includes adaptions and additions from [Go Community Code of Conduct][golang-coc],
+ [CNCF Code of Conduct][cncf-coc], and [Microsoft Open Source Code of Conduct][microsoft-coc].
+
+ This Code of Conduct is licensed under the [Creative Commons Attribution 3.0 License][cc-by-3-us].
+
+ [contributor-covenant-home]: https://www.contributor-covenant.org (https://www.contributor-covenant.org/)
+ [golang-coc]: https://golang.org/conduct
+ [cncf-coc]: https://github.com/cncf/foundation/blob/master/code-of-conduct.md
+ [microsoft-coc]: https://opensource.microsoft.com/codeofconduct/
+ [cc-by-3-us]: https://creativecommons.org/licenses/by/3.0/us/
LAVIS/LICENSE.txt ADDED
@@ -0,0 +1,14 @@
+ BSD 3-Clause License
+
+ Copyright (c) 2022 Salesforce, Inc.
+ All rights reserved.
+
+ Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+ 3. Neither the name of Salesforce.com nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
LAVIS/MANIFEST.in ADDED
@@ -0,0 +1,8 @@
+ recursive-include lavis/configs *.yaml *.json
+ recursive-include lavis/projects *.yaml *.json
+
+ recursive-exclude lavis/datasets/download_scripts *
+ recursive-exclude lavis/output *
+
+ include requirements.txt
+ include lavis/models/clip_models/bpe_simple_vocab_16e6.txt.gz
LAVIS/README.md ADDED
@@ -0,0 +1,328 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <p align="center">
2
+ <br>
3
+ <img src="docs/_static/logo_final.png" width="400"/>
4
+ <br>
5
+ <p>
6
+
7
+ <div align="center">
8
+ <a href="https://github.com/salesforce/LAVIS/releases"><img alt="Latest Release" src="https://img.shields.io/github/release/salesforce/LAVIS.svg" /></a>
9
+ <a href="https://opensource.salesforce.com/LAVIS/index.html">
10
+ <img alt="docs" src="https://github.com/salesforce/LAVIS/actions/workflows/docs.yaml/badge.svg"/>
11
+ <a href="https://opensource.org/licenses/BSD-3-Clause">
12
+ <img alt="license" src="https://img.shields.io/badge/License-BSD_3--Clause-blue.svg"/>
13
+ </a>
14
+ <a href="https://pepy.tech/project/salesforce-lavis">
15
+ <img alt="Downloads" src="https://pepy.tech/badge/salesforce-lavis">
16
+ </a>
17
+ </div>
18
+
19
+ <div align="center">
20
+ <a href="https://opensource.salesforce.com/LAVIS//latest/benchmark.html">Benchmark</a>,
21
+ <a href="https://arxiv.org/abs/2209.09019">Technical Report</a>,
22
+ <a href="https://opensource.salesforce.com/LAVIS//latest/index.html">Documentation</a>,
23
+ <a href="https://github.com/salesforce/LAVIS/tree/main/examples">Jupyter Notebook Examples</a>,
24
+ <a href="https://blog.salesforceairesearch.com/lavis-language-vision-library/">Blog</a>
25
+ </div>
26
+
27
+ # LAVIS - A Library for Language-Vision Intelligence
28
+
29
+ ## What's New: 🎉
30
+ * [Model Release] November 2023, released implementation of **X-InstructBLIP** <br>
31
+ [Paper](https://arxiv.org/pdf/2311.18799.pdf), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/xinstructblip), [Website](https://artemisp.github.io/X-InstructBLIP-page/), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/xinstructblip/demo/run_demo.ipynb)
32
+ > A simple, yet effective, cross-modality framework built atop frozen LLMs that allows the integration of various modalities (image, video, audio, 3D) without extensive modality-specific customization.
33
+ * [Model Release] July 2023, released implementation of **BLIP-Diffusion** <br>
34
+ [Paper](https://arxiv.org/abs/2305.06500), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion), [Website](https://dxli94.github.io/BLIP-Diffusion-website/)
35
+ > A text-to-image generation model that trains 20x than DreamBooth. Also facilitates zero-shot subject-driven generation and editing.
36
+ * [Model Release] May 2023, released implementation of **InstructBLIP** <br>
37
+ [Paper](https://arxiv.org/abs/2305.06500), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip)
38
+ > A new vision-language instruction-tuning framework using BLIP-2 models, achieving state-of-the-art zero-shot generalization performance on a wide range of vision-language tasks.
39
+ * [Model Release] Jan 2023, released implementation of **BLIP-2** <br>
40
+ [Paper](https://arxiv.org/abs/2301.12597), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/blip2), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/examples/blip2_instructed_generation.ipynb)
41
+ > A generic and efficient pre-training strategy that leverages off-the-shelf frozen pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (**65.0** vs **56.3**), establishing new state-of-the-art on zero-shot captioning (on NoCaps **121.6** CIDEr score vs previous best **113.2**). In addition, equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks the new **zero-shot instructed vision-to-language generation** capabilities for various interesting applications!
42
+ * Jan 2023, LAVIS is now available on [PyPI](https://pypi.org/project/salesforce-lavis/) for installation!
43
+ * [Model Release] Dec 2022, released implementation of **Img2LLM-VQA** (**CVPR 2023**, _"From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models"_, by Jiaxian Guo et al) <br>
44
+ [Paper](https://arxiv.org/pdf/2212.10846.pdf), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/img2llm-vqa), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/img2llm-vqa/img2llm_vqa.ipynb)
45
+ > A plug-and-play module that enables off-the-shelf use of Large Language Models (LLMs) for visual question answering (VQA). Img2LLM-VQA surpasses Flamingo on zero-shot VQA on VQAv2 (61.9 vs 56.3), while in contrast requiring no end-to-end training!
46
+ * [Model Release] Oct 2022, released implementation of **PNP-VQA** (**EMNLP Findings 2022**, _"Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training"_, by Anthony T.M.H. et al), <br>
47
+ [Paper](https://arxiv.org/abs/2210.08773), [Project Page](https://github.com/salesforce/LAVIS/tree/main/projects/pnp-vqa), [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/LAVIS/blob/main/projects/pnp-vqa/pnp_vqa.ipynb)
48
+ > A modular zero-shot VQA framework that requires no PLM training, achieving SoTA zero-shot VQA performance.
49
+
50
+ ## Technical Report and Citing LAVIS
51
+ You can find more details in our [technical report](https://arxiv.org/abs/2209.09019).
52
+
53
+ **If you're using LAVIS in your research or applications, please cite it using this BibTeX**:
54
+ ```bibtex
55
+ @inproceedings{li-etal-2023-lavis,
56
+ title = "{LAVIS}: A One-stop Library for Language-Vision Intelligence",
57
+ author = "Li, Dongxu and
58
+ Li, Junnan and
59
+ Le, Hung and
60
+ Wang, Guangsen and
61
+ Savarese, Silvio and
62
+ Hoi, Steven C.H.",
63
+ booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
64
+ month = jul,
65
+ year = "2023",
66
+ address = "Toronto, Canada",
67
+ publisher = "Association for Computational Linguistics",
68
+ url = "https://aclanthology.org/2023.acl-demo.3",
69
+ pages = "31--41",
70
+ abstract = "We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications. LAVIS aims to serve as a one-stop comprehensive library that brings recent advancements in the language-vision field accessible for researchers and practitioners, as well as fertilizing future research and development. It features a unified interface to easily access state-of-the-art image-language, video-language models and common datasets. LAVIS supports training, evaluation and benchmarking on a rich variety of tasks, including multimodal classification, retrieval, captioning, visual question answering, dialogue and pre-training. In the meantime, the library is also highly extensible and configurable, facilitating future development and customization. In this technical report, we describe design principles, key components and functionalities of the library, and also present benchmarking results across common language-vision tasks.",
71
+ }
72
+ ```
73
+
74
+
75
+ ## Table of Contents
76
+ - [Introduction](#introduction)
77
+ - [Installation](#installation)
78
+ - [Getting Started](#getting-started)
79
+ - [Model Zoo](#model-zoo)
80
+ - [Image Captioning](#image-captioning)
81
+ - [Visual question answering (VQA)](#visual-question-answering-vqa)
82
+ - [Unified Feature Extraction Interface](#unified-feature-extraction-interface)
83
+ - [Load Datasets](#load-datasets)
84
+ - [Jupyter Notebook Examples](#jupyter-notebook-examples)
85
+ - [Resources and Tools](#resources-and-tools)
86
+ - [Documentations](#documentations)
87
+ - [Ethical and Responsible Use](#ethical-and-responsible-use)
88
+ - [Technical Report and Citing LAVIS](#technical-report-and-citing-lavis)
89
+ - [License](#license)
90
+
91
+ ## Introduction
92
+ LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets.
93
+ It features a unified interface design to access
94
+ - **10+** tasks
95
+ (retrieval, captioning, visual question answering, multimodal classification etc.);
96
+ - **20+** datasets (COCO, Flickr, Nocaps, Conceptual
97
+ Captions, SBU, etc.);
98
+ - **30+** pretrained weights of state-of-the-art foundation language-vision models and their task-specific adaptations, including [ALBEF](https://arxiv.org/pdf/2107.07651.pdf),
99
+ [BLIP](https://arxiv.org/pdf/2201.12086.pdf), [ALPRO](https://arxiv.org/pdf/2112.09583.pdf), [CLIP](https://arxiv.org/pdf/2103.00020.pdf).
100
+ <p align="center">
101
+ <br>
102
+ <img src="assets/demo-6.png"/>
103
+ <br>
104
+ </p>
105
+
106
+ Key features of LAVIS include:
107
+
108
+ - **Unified and Modular Interface**: makes it easy to leverage and repurpose existing modules (datasets, models, preprocessors), and to add new modules.
109
+
110
+ - **Easy Off-the-shelf Inference and Feature Extraction**: readily available pre-trained models let you take advantage of state-of-the-art multimodal understanding and generation capabilities on your own data.
111
+
112
+ - **Reproducible Model Zoo and Training Recipes**: easily replicate and extend state-of-the-art models on existing and new tasks.
113
+
114
+ - **Dataset Zoo and Automatic Downloading Tools**: it can be a hassle to prepare the many language-vision datasets. LAVIS provides automatic downloading scripts to help prepare a large variety of datasets and their annotations.
115
+
116
+
117
+ The following table shows the supported tasks, datasets and models in our library. This is a continuing effort and we are working on further growing the list.
118
+
119
+ | Tasks | Supported Models | Supported Datasets |
120
+ | :--------------------------------------: | :----------------------: | :----------------------------------------: |
121
+ | Image-text Pre-training | ALBEF, BLIP | COCO, VisualGenome, SBU, ConceptualCaptions |
122
+ | Image-text Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
123
+ | Text-image Retrieval | ALBEF, BLIP, CLIP | COCO, Flickr30k |
124
+ | Visual Question Answering | ALBEF, BLIP | VQAv2, OKVQA, A-OKVQA |
125
+ | Image Captioning | BLIP | COCO, NoCaps |
126
+ | Image Classification | CLIP | ImageNet |
127
+ | Natural Language Visual Reasoning (NLVR) | ALBEF, BLIP | NLVR2 |
128
+ | Visual Entailment (VE) | ALBEF | SNLI-VE |
129
+ | Visual Dialogue | BLIP | VisDial |
130
+ | Video-text Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
131
+ | Text-video Retrieval | BLIP, ALPRO | MSRVTT, DiDeMo |
132
+ | Video Question Answering (VideoQA) | BLIP, ALPRO | MSRVTT, MSVD |
133
+ | Video Dialogue | VGD-GPT | AVSD |
134
+ | Multimodal Feature Extraction | ALBEF, CLIP, BLIP, ALPRO | customized |
135
+ | Text-to-image Generation | [COMING SOON] | |
136
+
137
+ ## Installation
138
+
139
+ 1. (Optional) Create a conda environment
140
+
141
+ ```bash
142
+ conda create -n lavis python=3.8
143
+ conda activate lavis
144
+ ```
145
+
146
+ 2. Install from [PyPI](https://pypi.org/project/salesforce-lavis/)
147
+ ```bash
148
+ pip install salesforce-lavis
149
+ ```
150
+
151
+ 3. Or, for development, you may build from source
152
+
153
+ ```bash
154
+ git clone https://github.com/salesforce/LAVIS.git
155
+ cd LAVIS
156
+ pip install -e .
157
+ ```
158
+
159
+ ## Getting Started
160
+ ### Model Zoo
161
+ The model zoo summarizes the models supported in LAVIS. To view it:
162
+ ```python
163
+ from lavis.models import model_zoo
164
+ print(model_zoo)
165
+ # ==================================================
166
+ # Architectures Types
167
+ # ==================================================
168
+ # albef_classification ve
169
+ # albef_feature_extractor base
170
+ # albef_nlvr nlvr
171
+ # albef_pretrain base
172
+ # albef_retrieval coco, flickr
173
+ # albef_vqa vqav2
174
+ # alpro_qa msrvtt, msvd
175
+ # alpro_retrieval msrvtt, didemo
176
+ # blip_caption base_coco, large_coco
177
+ # blip_classification base
178
+ # blip_feature_extractor base
179
+ # blip_nlvr nlvr
180
+ # blip_pretrain base
181
+ # blip_retrieval coco, flickr
182
+ # blip_vqa vqav2, okvqa, aokvqa
183
+ # clip_feature_extractor ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
184
+ # clip ViT-B-32, ViT-B-16, ViT-L-14, ViT-L-14-336, RN50
185
+ # gpt_dialogue base
186
+ ```
187
+
188
+ Let’s see how to use models in LAVIS to perform inference on example data. We first load a sample image from a local file.
189
+
190
+ ```python
191
+ import torch
192
+ from PIL import Image
193
+ # setup device to use
194
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
195
+ # load sample image
196
+ raw_image = Image.open("docs/_static/merlion.png").convert("RGB")
197
+ ```
198
+
199
+ This example image shows [Merlion park](https://en.wikipedia.org/wiki/Merlion) ([source](https://theculturetrip.com/asia/singapore/articles/what-exactly-is-singapores-merlion-anyway/)), a landmark in Singapore.
200
+
201
+
202
+ ### Image Captioning
203
+ In this example, we use the BLIP model to generate a caption for the image. To make inference even easier, we also associate each
204
+ pre-trained model with its preprocessors (transforms), accessed via ``load_model_and_preprocess()``.
205
+
206
+ ```python
207
+ import torch
208
+ from lavis.models import load_model_and_preprocess
209
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
210
+ # loads BLIP caption base model, with finetuned checkpoints on MSCOCO captioning dataset.
211
+ # this also loads the associated image processors
212
+ model, vis_processors, _ = load_model_and_preprocess(name="blip_caption", model_type="base_coco", is_eval=True, device=device)
213
+ # preprocess the image
214
+ # vis_processors stores image transforms for "train" and "eval" (validation / testing / inference)
215
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
216
+ # generate caption
217
+ model.generate({"image": image})
218
+ # ['a large fountain spewing water into the air']
219
+ ```
220
+
221
+ ### Visual question answering (VQA)
222
+ The BLIP model is able to answer free-form questions about images in natural language.
223
+ To access the VQA model, simply replace the ``name`` and ``model_type`` arguments
224
+ passed to ``load_model_and_preprocess()``.
225
+
226
+ ```python
227
+ from lavis.models import load_model_and_preprocess
228
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_vqa", model_type="vqav2", is_eval=True, device=device)
229
+ # ask a random question.
230
+ question = "Which city is this photo taken?"
231
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
232
+ question = txt_processors["eval"](question)
233
+ model.predict_answers(samples={"image": image, "text_input": question}, inference_method="generate")
234
+ # ['singapore']
235
+ ```
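Besides free-form generation, `predict_answers()` can also rank a fixed set of candidate answers (passing `inference_method="rank"` together with an `answer_list`); conceptually, each candidate gets a score and the highest-scoring one wins. A toy, LAVIS-free sketch of that ranking step (the candidate scores below are made up for illustration only):

```python
import math

def rank_answers(scores):
    """Pick the best candidate answer from raw scores via a softmax."""
    total = sum(math.exp(s) for s in scores.values())
    probs = {ans: math.exp(s) / total for ans, s in scores.items()}
    best = max(probs, key=probs.get)
    return best, probs[best]

# hypothetical model scores for candidate answers
candidate_scores = {"singapore": 4.2, "hong kong": 1.1, "sydney": 0.3}
best, prob = rank_answers(candidate_scores)
print(best)  # singapore
```

In LAVIS itself, the scoring would come from the model; this only illustrates the selection logic.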
236
+
237
+ ### Unified Feature Extraction Interface
238
+
239
+ LAVIS provides a unified interface to extract features from each architecture.
240
+ To extract features, we load the feature extractor variants of each model.
241
+ The multimodal feature can be used for multimodal classification.
242
+ The low-dimensional unimodal features can be used to compute cross-modal similarity.
243
+
244
+
245
+ ```python
246
+ from lavis.models import load_model_and_preprocess
247
+ model, vis_processors, txt_processors = load_model_and_preprocess(name="blip_feature_extractor", model_type="base", is_eval=True, device=device)
248
+ caption = "a large fountain spewing water into the air"
249
+ image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
250
+ text_input = txt_processors["eval"](caption)
251
+ sample = {"image": image, "text_input": [text_input]}
252
+
253
+ features_multimodal = model.extract_features(sample)
254
+ print(features_multimodal.multimodal_embeds.shape)
255
+ # torch.Size([1, 12, 768]), use features_multimodal[:,0,:] for multimodal classification tasks
256
+
257
+ features_image = model.extract_features(sample, mode="image")
258
+ features_text = model.extract_features(sample, mode="text")
259
+ print(features_image.image_embeds.shape)
260
+ # torch.Size([1, 197, 768])
261
+ print(features_text.text_embeds.shape)
262
+ # torch.Size([1, 12, 768])
263
+
264
+ # low-dimensional projected features
265
+ print(features_image.image_embeds_proj.shape)
266
+ # torch.Size([1, 197, 256])
267
+ print(features_text.text_embeds_proj.shape)
268
+ # torch.Size([1, 12, 256])
269
+ similarity = features_image.image_embeds_proj[:,0,:] @ features_text.text_embeds_proj[:,0,:].t()
270
+ print(similarity)
271
+ # tensor([[0.2622]])
272
+ ```
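The projected features are unit-normalized, which is why a plain dot product between the `[:, 0, :]` (CLS) vectors above yields a cosine similarity. A self-contained torch sketch with synthetic tensors (shapes mirror the example; these are not actual LAVIS outputs):

```python
import torch
import torch.nn.functional as F

# synthetic stand-ins for image_embeds_proj and text_embeds_proj
image_embeds_proj = F.normalize(torch.randn(1, 197, 256), dim=-1)
text_embeds_proj = F.normalize(torch.randn(1, 12, 256), dim=-1)

# take the CLS vector of each modality and compute their dot product
img_cls = image_embeds_proj[:, 0, :]
txt_cls = text_embeds_proj[:, 0, :]
similarity = img_cls @ txt_cls.t()

# unit vectors: dot product equals cosine similarity
assert torch.allclose(
    similarity.squeeze(), F.cosine_similarity(img_cls, txt_cls).squeeze(), atol=1e-5
)
print(similarity.shape)  # torch.Size([1, 1])
```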
273
+
274
+ ### Load Datasets
275
+ LAVIS inherently supports a wide variety of common language-vision datasets by providing [automatic download tools](https://opensource.salesforce.com/LAVIS//latest/benchmark) to help download and organize these datasets. After downloading, to load the datasets, use the following code:
276
+
277
+ ```python
278
+ from lavis.datasets.builders import dataset_zoo
279
+ dataset_names = dataset_zoo.get_names()
280
+ print(dataset_names)
281
+ # ['aok_vqa', 'coco_caption', 'coco_retrieval', 'coco_vqa', 'conceptual_caption_12m',
282
+ # 'conceptual_caption_3m', 'didemo_retrieval', 'flickr30k', 'imagenet', 'laion2B_multi',
283
+ # 'msrvtt_caption', 'msrvtt_qa', 'msrvtt_retrieval', 'msvd_caption', 'msvd_qa', 'nlvr',
284
+ # 'nocaps', 'ok_vqa', 'sbu_caption', 'snli_ve', 'vatex_caption', 'vg_caption', 'vg_vqa']
285
+ ```
286
+ After downloading the images, we can use ``load_dataset()`` to obtain the dataset.
287
+ ```python
288
+ from lavis.datasets.builders import load_dataset
289
+ coco_dataset = load_dataset("coco_caption")
290
+ print(coco_dataset.keys())
291
+ # dict_keys(['train', 'val', 'test'])
292
+ print(len(coco_dataset["train"]))
293
+ # 566747
294
+ print(coco_dataset["train"][0])
295
+ # {'image': <PIL.Image.Image image mode=RGB size=640x480>,
296
+ # 'text_input': 'A woman wearing a net on her head cutting a cake. ',
297
+ # 'image_id': 0}
298
+ ```
299
+
300
+ If you already host a local copy of the dataset, you can pass in the ``vis_path`` argument to change the default location to load images.
301
+
302
+ ```python
303
+ coco_dataset = load_dataset("coco_caption", vis_path=YOUR_LOCAL_PATH)
304
+ ```
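Each split returned by `load_dataset()` behaves as a standard map-style PyTorch dataset, so it can be fed to a `DataLoader` for training or evaluation. A sketch using a dummy stand-in dataset (same sample-dict structure as the `coco_caption` example above; in practice you would pass `coco_dataset["train"]` instead):

```python
import torch
from torch.utils.data import DataLoader, Dataset

class DummyCaptionDataset(Dataset):
    """Stand-in mimicking {'image', 'text_input', 'image_id'} samples."""
    def __init__(self, n=8):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return {
            "image": torch.zeros(3, 224, 224),
            "text_input": f"caption {idx}",
            "image_id": idx,
        }

loader = DataLoader(DummyCaptionDataset(), batch_size=4, shuffle=False)
batch = next(iter(loader))
print(batch["image"].shape)  # torch.Size([4, 3, 224, 224])
print(batch["text_input"])   # ['caption 0', 'caption 1', 'caption 2', 'caption 3']
```

The default collate function stacks image tensors into a batch and gathers the captions into a list.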
305
+
306
+ ## Jupyter Notebook Examples
307
+ See [examples](https://github.com/salesforce/LAVIS/tree/main/examples) for more inference examples, e.g. captioning, feature extraction, VQA, GradCam, zero-shot classification.
308
+
309
+ ## Resources and Tools
310
+ - **Benchmarks**: see [Benchmark](https://opensource.salesforce.com/LAVIS//latest/benchmark) for instructions to evaluate and train supported models.
311
+ - **Dataset Download and Browsing**: see [Dataset Download](https://opensource.salesforce.com/LAVIS//latest/benchmark) for instructions and automatic tools for downloading common language-vision datasets.
312
+ - **GUI Demo**: to run the demo locally, run `bash run_scripts/run_demo.sh` and then follow the instructions in the prompt to view the demo in a browser. A web demo is coming soon.
313
+
314
+
315
+ ## Documentations
316
+ For more details and advanced usages, please refer to
317
+ [documentation](https://opensource.salesforce.com/LAVIS//latest/index.html#).
318
+
319
+ ## Ethical and Responsible Use
320
+ We note that models in LAVIS provide no guarantees on their multimodal abilities; incorrect or biased predictions may be observed. In particular, the datasets and pretrained models utilized in LAVIS may contain socioeconomic biases which could result in misclassification and other unwanted behaviors such as offensive or inappropriate speech. We strongly recommend that users review the pre-trained models and overall system in LAVIS before practical adoption. We plan to improve the library by investigating and mitigating these potential biases and
321
+ inappropriate behaviors in the future.
322
+
323
+
324
+ ## Contact us
325
+ If you have any questions, comments or suggestions, please do not hesitate to contact us at lavis@salesforce.com.
326
+
327
+ ## License
328
+ [BSD 3-Clause License](LICENSE.txt)
LAVIS/SECURITY.md ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ ## Security
2
+
3
+ Please report any security issue to [security@salesforce.com](mailto:security@salesforce.com)
4
+ as soon as it is discovered. This library limits its runtime dependencies in
5
+ order to reduce the total cost of ownership as much as possible, but all consumers
6
+ should remain vigilant and have their security stakeholders review all third-party
7
+ products (3PP) like this one and their dependencies.
LAVIS/app/__init__.py ADDED
@@ -0,0 +1,26 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ from PIL import Image
9
+ import requests
10
+
11
+ import streamlit as st
12
+ import torch
13
+
14
+
15
+ @st.cache()
16
+ def load_demo_image():
17
+ img_url = (
18
+ "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
19
+ )
20
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
21
+ return raw_image
22
+
23
+
24
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
25
+
26
+ cache_root = "/export/home/.cache/lavis/"
LAVIS/app/calculate_coco_features.py ADDED
@@ -0,0 +1,87 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ from PIL import Image
9
+ import requests
10
+ import torch
11
+
12
+ import os
13
+
14
+ from lavis.common.registry import registry
15
+ from lavis.processors import *
16
+ from lavis.models import *
17
+ from lavis.common.utils import build_default_model
18
+
19
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
20
+
21
+
22
+ def load_demo_image():
23
+ img_url = (
24
+ "https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg"
25
+ )
26
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
27
+
28
+ return raw_image
29
+
30
+
31
+ def read_img(filepath):
32
+ raw_image = Image.open(filepath).convert("RGB")
33
+
34
+ return raw_image
35
+
36
+
37
+ # model
38
+ model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
39
+ feature_extractor = BlipFeatureExtractor(pretrained=model_url)
40
+
41
+ feature_extractor.eval()
42
+ feature_extractor = feature_extractor.to(device)
43
+
44
+ # preprocessors
45
+ vis_processor = BlipImageEvalProcessor(image_size=224)
46
+ text_processor = BlipCaptionProcessor()
47
+
48
+ # files to process
49
+ # file_root = "/export/home/.cache/lavis/coco/images/val2014"
50
+ file_root = "/export/home/.cache/lavis/coco/images/train2014"
51
+ filepaths = os.listdir(file_root)
52
+
53
+ print(len(filepaths))
54
+
55
+ caption = "dummy"
56
+
57
+ path2feat = dict()
58
+ bsz = 256
59
+
60
+ images_in_batch = []
61
+ filepaths_in_batch = []
62
+
63
+ for i, filename in enumerate(filepaths):
64
+ if i % bsz == 0 and i > 0:
65
+ images_in_batch = torch.cat(images_in_batch, dim=0).to(device)
66
+ with torch.no_grad():
67
+ image_features = feature_extractor(
68
+ images_in_batch, caption, mode="image", normalized=True
69
+ )[:, 0]
70
+
71
+ for filepath, image_feat in zip(filepaths_in_batch, image_features):
72
+ path2feat[os.path.basename(filepath)] = image_feat.detach().cpu()
73
+
74
+ images_in_batch = []
75
+ filepaths_in_batch = []
76
+
77
+ print(len(path2feat), image_features.shape)
78
+ else:
79
+ filepath = os.path.join(file_root, filename)
80
+
81
+ image = read_img(filepath)
82
+ image = vis_processor(image).unsqueeze(0)
83
+
84
+ images_in_batch.append(image)
85
+ filepaths_in_batch.append(filepath)
86
+
87
+ torch.save(path2feat, "path2feat_coco_train2014.pth")
LAVIS/app/caption.py ADDED
@@ -0,0 +1,98 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import streamlit as st
9
+ from app import device, load_demo_image
10
+ from app.utils import load_model_cache
11
+ from lavis.processors import load_processor
12
+ from PIL import Image
13
+
14
+
15
+ def app():
16
+ # ===== layout =====
17
+ model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
18
+
19
+ sampling_method = st.sidebar.selectbox(
20
+ "Sampling method:", ["Beam search", "Nucleus sampling"]
21
+ )
22
+
23
+ st.markdown(
24
+ "<h1 style='text-align: center;'>Image Description Generation</h1>",
25
+ unsafe_allow_html=True,
26
+ )
27
+
28
+ instructions = """Try the provided image or upload your own:"""
29
+ file = st.file_uploader(instructions)
30
+
31
+ use_beam = sampling_method == "Beam search"
32
+
33
+ col1, col2 = st.columns(2)
34
+
35
+ if file:
36
+ raw_img = Image.open(file).convert("RGB")
37
+ else:
38
+ raw_img = load_demo_image()
39
+
40
+ col1.header("Image")
41
+
42
+ w, h = raw_img.size
43
+ scaling_factor = 720 / w
44
+ resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
45
+
46
+ col1.image(resized_image, use_column_width=True)
47
+ col2.header("Description")
48
+
49
+ cap_button = st.button("Generate")
50
+
51
+ # ==== event ====
52
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
53
+
54
+ if cap_button:
55
+ if model_type.startswith("BLIP"):
56
+ blip_type = model_type.split("_")[1].lower()
57
+ model = load_model_cache(
58
+ "blip_caption",
59
+ model_type=f"{blip_type}_coco",
60
+ is_eval=True,
61
+ device=device,
62
+ )
63
+
64
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
65
+ captions = generate_caption(
66
+ model=model, image=img, use_nucleus_sampling=not use_beam
67
+ )
68
+
69
+ col2.write("\n\n".join(captions))
70
+
71
+
72
+ def generate_caption(
73
+ model, image, use_nucleus_sampling=False, num_beams=3, max_length=40, min_length=5
74
+ ):
75
+ samples = {"image": image}
76
+
77
+ captions = []
78
+ if use_nucleus_sampling:
79
+ for _ in range(5):
80
+ caption = model.generate(
81
+ samples,
82
+ use_nucleus_sampling=True,
83
+ max_length=max_length,
84
+ min_length=min_length,
85
+ top_p=0.9,
86
+ )
87
+ captions.append(caption[0])
88
+ else:
89
+ caption = model.generate(
90
+ samples,
91
+ use_nucleus_sampling=False,
92
+ num_beams=num_beams,
93
+ max_length=max_length,
94
+ min_length=min_length,
95
+ )
96
+ captions.append(caption[0])
97
+
98
+ return captions
LAVIS/app/classification.py ADDED
@@ -0,0 +1,216 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ # Copyright (c) 2022, salesforce.com, inc.
3
+ # All rights reserved.
4
+ # SPDX-License-Identifier: BSD-3-Clause
5
+ # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
6
+ """
7
+
8
+ import plotly.graph_objects as go
9
+ import requests
10
+ import streamlit as st
11
+ import torch
12
+ from lavis.models import load_model
13
+ from lavis.processors import load_processor
14
+ from lavis.processors.blip_processors import BlipCaptionProcessor
15
+ from PIL import Image
16
+
17
+ from app import device, load_demo_image
18
+ from app.utils import load_blip_itm_model
19
+ from lavis.processors.clip_processors import ClipImageEvalProcessor
20
+
21
+
22
+ @st.cache()
23
+ def load_demo_image(img_url=None):
24
+ if not img_url:
25
+ img_url = "https://img.atlasobscura.com/yDJ86L8Ou6aIjBsxnlAy5f164w1rjTgcHZcx2yUs4mo/rt:fit/w:1200/q:81/sm:1/scp:1/ar:1/aHR0cHM6Ly9hdGxh/cy1kZXYuczMuYW1h/em9uYXdzLmNvbS91/cGxvYWRzL3BsYWNl/X2ltYWdlcy85MDll/MDRjOS00NTJjLTQx/NzQtYTY4MS02NmQw/MzI2YWIzNjk1ZGVk/MGZhMTJiMTM5MmZi/NGFfUmVhcl92aWV3/X29mX3RoZV9NZXJs/aW9uX3N0YXR1ZV9h/dF9NZXJsaW9uX1Bh/cmssX1NpbmdhcG9y/ZSxfd2l0aF9NYXJp/bmFfQmF5X1NhbmRz/X2luX3RoZV9kaXN0/YW5jZV8tXzIwMTQw/MzA3LmpwZw.jpg"
26
+ raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
27
+ return raw_image
28
+
29
+
30
+ @st.cache(
31
+ hash_funcs={
32
+ torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
33
+ .cpu()
34
+ .numpy()
35
+ },
36
+ allow_output_mutation=True,
37
+ )
38
+ def load_model_cache(model_type, device):
39
+ if model_type == "blip":
40
+ model = load_model(
41
+ "blip_feature_extractor", model_type="base", is_eval=True, device=device
42
+ )
43
+ elif model_type == "albef":
44
+ model = load_model(
45
+ "albef_feature_extractor", model_type="base", is_eval=True, device=device
46
+ )
47
+ elif model_type == "CLIP_ViT-B-32":
48
+ model = load_model(
49
+ "clip_feature_extractor", "ViT-B-32", is_eval=True, device=device
50
+ )
51
+ elif model_type == "CLIP_ViT-B-16":
52
+ model = load_model(
53
+ "clip_feature_extractor", "ViT-B-16", is_eval=True, device=device
54
+ )
55
+ elif model_type == "CLIP_ViT-L-14":
56
+ model = load_model(
57
+ "clip_feature_extractor", "ViT-L-14", is_eval=True, device=device
58
+ )
59
+
60
+ return model
61
+
62
+
63
+ def app():
64
+ model_type = st.sidebar.selectbox(
65
+ "Model:",
66
+ ["ALBEF", "BLIP_Base", "CLIP_ViT-B-32", "CLIP_ViT-B-16", "CLIP_ViT-L-14"],
67
+ )
68
+ score_type = st.sidebar.selectbox("Score type:", ["Cosine", "Multimodal"])
69
+
70
+ # ===== layout =====
71
+ st.markdown(
72
+ "<h1 style='text-align: center;'>Zero-shot Classification</h1>",
73
+ unsafe_allow_html=True,
74
+ )
75
+
76
+ instructions = """Try the provided image or upload your own:"""
77
+ file = st.file_uploader(instructions)
78
+
79
+ st.header("Image")
80
+ if file:
81
+ raw_img = Image.open(file).convert("RGB")
82
+ else:
83
+ raw_img = load_demo_image()
84
+
85
+ st.image(raw_img) # , use_column_width=True)
86
+
87
+ col1, col2 = st.columns(2)
88
+
89
+ col1.header("Categories")
90
+
91
+ cls_0 = col1.text_input("category 1", value="merlion")
92
+ cls_1 = col1.text_input("category 2", value="sky")
93
+ cls_2 = col1.text_input("category 3", value="giraffe")
94
+ cls_3 = col1.text_input("category 4", value="fountain")
95
+ cls_4 = col1.text_input("category 5", value="marina bay")
96
+
97
+ cls_names = [cls_0, cls_1, cls_2, cls_3, cls_4]
98
+ cls_names = [cls_nm for cls_nm in cls_names if len(cls_nm) > 0]
99
+
100
+ if len(cls_names) != len(set(cls_names)):
101
+ st.error("Please provide unique class names")
102
+ return
103
+
104
+ button = st.button("Submit")
105
+
106
+ col2.header("Prediction")
107
+
108
+ # ===== event =====
109
+
110
+ if button:
111
+ if model_type.startswith("BLIP"):
112
+ text_processor = BlipCaptionProcessor(prompt="A picture of ")
113
+ cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]
114
+
115
+ if score_type == "Cosine":
116
+ vis_processor = load_processor("blip_image_eval").build(image_size=224)
117
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
118
+
119
+ feature_extractor = load_model_cache(model_type="blip", device=device)
120
+
121
+ sample = {"image": img, "text_input": cls_prompt}
122
+
123
+ with torch.no_grad():
124
+ image_features = feature_extractor.extract_features(
125
+ sample, mode="image"
126
+ ).image_embeds_proj[:, 0]
127
+ text_features = feature_extractor.extract_features(
128
+ sample, mode="text"
129
+ ).text_embeds_proj[:, 0]
130
+ sims = (image_features @ text_features.t())[
131
+ 0
132
+ ] / feature_extractor.temp
133
+
134
+ else:
135
+ vis_processor = load_processor("blip_image_eval").build(image_size=384)
136
+ img = vis_processor(raw_img).unsqueeze(0).to(device)
137
+
138
+ model = load_blip_itm_model(device)
139
+
140
+             output = model(img, cls_prompt, match_head="itm")
+             sims = output[:, 1]
+
+             sims = torch.nn.Softmax(dim=0)(sims)
+             inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]
+
+     elif model_type.startswith("ALBEF"):
+         vis_processor = load_processor("blip_image_eval").build(image_size=224)
+         img = vis_processor(raw_img).unsqueeze(0).to(device)
+
+         text_processor = BlipCaptionProcessor(prompt="A picture of ")
+         cls_prompt = [text_processor(cls_nm) for cls_nm in cls_names]
+
+         feature_extractor = load_model_cache(model_type="albef", device=device)
+
+         sample = {"image": img, "text_input": cls_prompt}
+
+         with torch.no_grad():
+             image_features = feature_extractor.extract_features(
+                 sample, mode="image"
+             ).image_embeds_proj[:, 0]
+             text_features = feature_extractor.extract_features(
+                 sample, mode="text"
+             ).text_embeds_proj[:, 0]
+
+             st.write(image_features.shape)
+             st.write(text_features.shape)
+
+             sims = (image_features @ text_features.t())[0] / feature_extractor.temp
+
+         sims = torch.nn.Softmax(dim=0)(sims)
+         inv_sims = [sim * 100 for sim in sims.tolist()[::-1]]
+
+     elif model_type.startswith("CLIP"):
+         if model_type == "CLIP_ViT-B-32":
+             model = load_model_cache(model_type="CLIP_ViT-B-32", device=device)
+         elif model_type == "CLIP_ViT-B-16":
+             model = load_model_cache(model_type="CLIP_ViT-B-16", device=device)
+         elif model_type == "CLIP_ViT-L-14":
+             model = load_model_cache(model_type="CLIP_ViT-L-14", device=device)
+         else:
+             raise ValueError(f"Unknown model type {model_type}")
+
+         if score_type == "Cosine":
+             # image_preprocess = ClipImageEvalProcessor(image_size=336)
+             image_preprocess = ClipImageEvalProcessor(image_size=224)
+             img = image_preprocess(raw_img).unsqueeze(0).to(device)
+
+             sample = {"image": img, "text_input": cls_names}
+
+             with torch.no_grad():
+                 clip_features = model.extract_features(sample)
+
+                 image_features = clip_features.image_embeds_proj
+                 text_features = clip_features.text_embeds_proj
+
+             sims = (100.0 * image_features @ text_features.T)[0].softmax(dim=-1)
+             inv_sims = sims.tolist()[::-1]
+         else:
+             st.warning("CLIP does not support multimodal scoring.")
+             return
+
+     fig = go.Figure(
+         go.Bar(
+             x=inv_sims,
+             y=cls_names[::-1],
+             text=["{:.2f}".format(s) for s in inv_sims],
+             orientation="h",
+         )
+     )
+     fig.update_traces(
+         textfont_size=12,
+         textangle=0,
+         textposition="outside",
+         cliponaxis=False,
+     )
+     col2.plotly_chart(fig, use_container_width=True)
LAVIS/app/dataset_browser.py ADDED
@@ -0,0 +1,240 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import random
+ from collections import OrderedDict
+ from functools import reduce
+
+ import streamlit as st
+ from lavis.common.registry import registry
+ from lavis.datasets.builders import dataset_zoo, load_dataset
+ from lavis.datasets.builders.base_dataset_builder import load_dataset_config
+ from PIL import Image
+
+ IMAGE_LAYOUT = 3, 4
+ VIDEO_LAYOUT = 1, 2
+
+ PREV_STR = "Prev"
+ NEXT_STR = "Next"
+
+
+ def sample_dataset(dataset, indices):
+     samples = [dataset.displ_item(idx) for idx in indices]
+
+     return samples
+
+
+ def get_concat_v(im1, im2):
+     margin = 5
+
+     canvas_size = (im1.width + im2.width + margin, max(im1.height, im2.height))
+     canvas = Image.new("RGB", canvas_size, "White")
+     canvas.paste(im1, (0, 0))
+     canvas.paste(im2, (im1.width + margin, 0))
+
+     return canvas
+
+
+ def resize_img_w(raw_img, new_w=224):
+     if isinstance(raw_img, list):
+         resized_imgs = [resize_img_w(img, 196) for img in raw_img]
+         # concatenate images
+         resized_image = reduce(get_concat_v, resized_imgs)
+     else:
+         w, h = raw_img.size
+         scaling_factor = new_w / w
+         resized_image = raw_img.resize(
+             (int(w * scaling_factor), int(h * scaling_factor))
+         )
+
+     return resized_image
+
+
+ def get_visual_key(dataset):
+     if "image" in dataset[0]:
+         return "image"
+     elif "image0" in dataset[0]:  # NLVR2 dataset
+         return "image"
+     elif "video" in dataset[0]:
+         return "video"
+     else:
+         raise ValueError("Visual key not found.")
+
+
+ def gather_items(samples, exclude=[]):
+     gathered = []
+
+     for s in samples:
+         ns = OrderedDict()
+         for k in s.keys():
+             if k not in exclude:
+                 ns[k] = s[k]
+
+         gathered.append(ns)
+
+     return gathered
+
+
+ @st.cache(allow_output_mutation=True)
+ def load_dataset_cache(name):
+     return load_dataset(name)
+
+
+ def format_text(text):
+     md = "\n\n".join([f"**{k}**: {v}" for k, v in text.items()])
+
+     return md
+
+
+ def show_samples(dataset, offset=0, is_next=False):
+     visual_key = get_visual_key(dataset)
+
+     num_rows, num_cols = IMAGE_LAYOUT if visual_key == "image" else VIDEO_LAYOUT
+     n_samples = num_rows * num_cols
+
+     if not shuffle:
+         if is_next:
+             start = min(int(start_idx) + offset + n_samples, len(dataset) - n_samples)
+         else:
+             start = max(0, int(start_idx) + offset - n_samples)
+
+         st.session_state.last_start = start
+         end = min(start + n_samples, len(dataset))
+
+         indices = list(range(start, end))
+     else:
+         indices = random.sample(range(len(dataset)), n_samples)
+     samples = sample_dataset(dataset, indices)
+
+     visual_info = (
+         iter([resize_img_w(s[visual_key]) for s in samples])
+         if visual_key == "image"
+         # else iter([s[visual_key] for s in samples])
+         else iter([s["file"] for s in samples])
+     )
+     text_info = gather_items(samples, exclude=["image", "video"])
+     text_info = iter([format_text(s) for s in text_info])
+
+     st.markdown(
+         """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
+         unsafe_allow_html=True,
+     )
+     for _ in range(num_rows):
+         with st.container():
+             for col in st.columns(num_cols):
+                 # col.text(next(text_info))
+                 # col.caption(next(text_info))
+                 try:
+                     col.markdown(next(text_info))
+                     if visual_key == "image":
+                         col.image(next(visual_info), use_column_width=True, clamp=True)
+                     elif visual_key == "video":
+                         col.markdown(
+                             "![Alt Text](https://media.giphy.com/media/vFKqnCdLPNOKc/giphy.gif)"
+                         )
+                 except StopIteration:
+                     break
+
+     st.markdown(
+         """<hr style="height:1px;border:none;color:#c7ccd4;background-color:#c7ccd4;"/> """,
+         unsafe_allow_html=True,
+     )
+
+     st.session_state.n_display = n_samples
+
+
+ if __name__ == "__main__":
+     st.set_page_config(
+         page_title="LAVIS Dataset Explorer",
+         # layout="wide",
+         initial_sidebar_state="expanded",
+     )
+
+     dataset_name = st.sidebar.selectbox("Dataset:", dataset_zoo.get_names())
+
+     function = st.sidebar.selectbox("Function:", ["Browser"], index=0)
+
+     if function == "Browser":
+         shuffle = st.sidebar.selectbox("Shuffled:", [True, False], index=0)
+
+         dataset = load_dataset_cache(dataset_name)
+         split = st.sidebar.selectbox("Split:", dataset.keys())
+
+         dataset_len = len(dataset[split])
+         st.success(
+             f"Loaded {dataset_name}/{split} with **{dataset_len}** records. **Image/video directory**: {dataset[split].vis_root}"
+         )
+
+         if "last_dataset" not in st.session_state:
+             st.session_state.last_dataset = dataset_name
+             st.session_state.last_split = split
+
+         if "last_start" not in st.session_state:
+             st.session_state.last_start = 0
+
+         if "start_idx" not in st.session_state:
+             st.session_state.start_idx = 0
+
+         if "shuffle" not in st.session_state:
+             st.session_state.shuffle = shuffle
+
+         if "first_run" not in st.session_state:
+             st.session_state.first_run = True
+         elif (
+             st.session_state.last_dataset != dataset_name
+             or st.session_state.last_split != split
+         ):
+             st.session_state.first_run = True
+
+             st.session_state.last_dataset = dataset_name
+             st.session_state.last_split = split
+         elif st.session_state.shuffle != shuffle:
+             st.session_state.shuffle = shuffle
+             st.session_state.first_run = True
+
+         if not shuffle:
+             n_col, p_col = st.columns([0.05, 1])
+
+             prev_button = n_col.button(PREV_STR)
+             next_button = p_col.button(NEXT_STR)
+
+         else:
+             next_button = st.button(NEXT_STR)
+
+         if not shuffle:
+             start_idx = st.sidebar.text_input(f"Begin from (total {dataset_len})", 0)
+
+             if not start_idx.isdigit():
+                 st.error(f"Input to 'Begin from' must be digits, found {start_idx}.")
+             else:
+                 if int(start_idx) != st.session_state.start_idx:
+                     st.session_state.start_idx = int(start_idx)
+                     st.session_state.last_start = int(start_idx)
+
+             if prev_button:
+                 show_samples(
+                     dataset[split],
+                     offset=st.session_state.last_start - st.session_state.start_idx,
+                     is_next=False,
+                 )
+
+         if next_button:
+             show_samples(
+                 dataset[split],
+                 offset=st.session_state.last_start - st.session_state.start_idx,
+                 is_next=True,
+             )
+
+         if st.session_state.first_run:
+             st.session_state.first_run = False
+
+             show_samples(
+                 dataset[split],
+                 offset=st.session_state.last_start - st.session_state.start_idx,
+                 is_next=True,
+             )
LAVIS/app/image_text_match.py ADDED
@@ -0,0 +1,87 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import numpy as np
+ import streamlit as st
+ import torch
+ from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
+ from lavis.processors import load_processor
+ from PIL import Image
+
+ from app import device, load_demo_image
+ from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model
+
+
+ def app():
+     model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
+
+     if model_type.startswith("BLIP"):
+         blip_type = model_type.split("_")[1]
+         model = load_blip_itm_model(device, model_type=blip_type)
+
+     vis_processor = load_processor("blip_image_eval").build(image_size=384)
+
+     st.markdown(
+         "<h1 style='text-align: center;'>Image Text Matching</h1>",
+         unsafe_allow_html=True,
+     )
+
+     values = list(range(1, 12))
+     default_layer_num = values.index(7)
+     layer_num = (
+         st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
+     )
+
+     instructions = """Try the provided image or upload your own:"""
+     file = st.file_uploader(instructions)
+
+     col1, col2 = st.columns(2)
+     col1.header("Image")
+     col2.header("GradCam")
+     if file:
+         raw_img = Image.open(file).convert("RGB")
+     else:
+         raw_img = load_demo_image()
+
+     w, h = raw_img.size
+     scaling_factor = 720 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+     col1.image(resized_image, use_column_width=True)
+
+     col3, col4 = st.columns(2)
+     col3.header("Text")
+     user_question = col3.text_input(
+         "Input your sentence!", "a woman sitting on the beach with a dog"
+     )
+     submit_button = col3.button("Submit")
+
+     col4.header("Matching score")
+
+     if submit_button:
+         tokenizer = init_bert_tokenizer()
+
+         img = vis_processor(raw_img).unsqueeze(0).to(device)
+         text_processor = load_processor("blip_caption").build()
+
+         qry = text_processor(user_question)
+
+         norm_img = np.float32(resized_image) / 255
+
+         qry_tok = tokenizer(qry, return_tensors="pt").to(device)
+         gradcam, output = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)
+
+         avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)
+
+         col2.image(avg_gradcam, use_column_width=True, clamp=True)
+         # output = model(img, question)
+         itm_score = torch.nn.functional.softmax(output, dim=1)
+         new_title = (
+             '<p style="text-align: left; font-size: 25px;">\n{:.3f}%</p>'.format(
+                 itm_score[0][1].item() * 100
+             )
+         )
+         col4.markdown(new_title, unsafe_allow_html=True)
LAVIS/app/main.py ADDED
@@ -0,0 +1,25 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ from app.multipage import MultiPage
+ from app import vqa, caption
+ from app import image_text_match as itm
+ from app import text_localization as tl
+ from app import multimodal_search as ms
+ from app import classification as cl
+
+
+ if __name__ == "__main__":
+     app = MultiPage()
+
+     app.add_page("Image Description Generation", caption.app)
+     app.add_page("Multimodal Search", ms.app)
+     app.add_page("Visual Question Answering", vqa.app)
+     app.add_page("Image Text Matching", itm.app)
+     app.add_page("Text Localization", tl.app)
+     app.add_page("Classification", cl.app)
+     app.run()
LAVIS/app/multimodal_search.py ADDED
@@ -0,0 +1,230 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import os
+
+ import numpy as np
+ import streamlit as st
+ import torch
+ import torch.nn.functional as F
+ from app import cache_root, device
+ from app.utils import (
+     getAttMap,
+     init_bert_tokenizer,
+     load_blip_itm_model,
+     read_img,
+     resize_img,
+ )
+ from lavis.models import load_model
+ from lavis.processors import load_processor
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_feat():
+     from lavis.common.utils import download_url
+
+     dirname = os.path.join(os.path.dirname(__file__), "assets")
+     filename = "path2feat_coco_train2014.pth"
+     filepath = os.path.join(dirname, filename)
+     url = "https://storage.googleapis.com/sfr-vision-language-research/LAVIS/assets/path2feat_coco_train2014.pth"
+
+     if not os.path.exists(filepath):
+         download_url(url=url, root=dirname, filename="path2feat_coco_train2014.pth")
+
+     path2feat = torch.load(filepath)
+     paths = sorted(path2feat.keys())
+
+     all_img_feats = torch.stack([path2feat[k] for k in paths], dim=0).to(device)
+
+     return path2feat, paths, all_img_feats
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_feature_extractor_model(device):
+     model_url = "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth"
+
+     model = load_model(
+         "blip_feature_extractor", model_type="base", is_eval=True, device=device
+     )
+     model.load_from_pretrained(model_url)
+
+     return model
+
+
+ def app():
+     # === layout ===
+     model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
+     file_root = os.path.join(cache_root, "coco/images/train2014/")
+
+     values = [12, 24, 48]
+     default_layer_num = values.index(24)
+     num_display = st.sidebar.selectbox(
+         "Number of images:", values, index=default_layer_num
+     )
+     show_gradcam = st.sidebar.selectbox("Show GradCam:", [True, False], index=1)
+     itm_ranking = st.sidebar.selectbox("Multimodal re-ranking:", [True, False], index=0)
+
+     # st.title('Multimodal Search')
+     st.markdown(
+         "<h1 style='text-align: center;'>Multimodal Search</h1>", unsafe_allow_html=True
+     )
+
+     # === event ===
+     vis_processor = load_processor("blip_image_eval").build(image_size=384)
+     text_processor = load_processor("blip_caption")
+
+     user_question = st.text_input(
+         "Search query", "A dog running on the grass.", help="Type something to search."
+     )
+     user_question = text_processor(user_question)
+     feature_extractor = load_feature_extractor_model(device)
+
+     # ======= ITC =========
+     sample = {"text_input": user_question}
+
+     with torch.no_grad():
+         text_feature = feature_extractor.extract_features(
+             sample, mode="text"
+         ).text_embeds_proj[0, 0]
+
+     path2feat, paths, all_img_feats = load_feat()
+     all_img_feats = all_img_feats.to(device)
+     all_img_feats = F.normalize(all_img_feats, dim=1)
+
+     num_cols = 4
+     num_rows = int(num_display / num_cols)
+
+     similarities = text_feature @ all_img_feats.T
+     indices = torch.argsort(similarities, descending=True)[:num_display]
+
+     top_paths = [paths[ind.detach().cpu().item()] for ind in indices]
+     sorted_similarities = [similarities[idx] for idx in indices]
+     filenames = [os.path.join(file_root, p) for p in top_paths]
+
+     # ========= ITM and GradCam ==========
+     bsz = 4  # max number of images to avoid cuda oom
+     if model_type.startswith("BLIP"):
+         blip_type = model_type.split("_")[1]
+
+     itm_model = load_blip_itm_model(device, model_type=blip_type)
+
+     tokenizer = init_bert_tokenizer()
+     queries_batch = [user_question] * bsz
+     queries_tok_batch = tokenizer(queries_batch, return_tensors="pt").to(device)
+
+     num_batches = int(num_display / bsz)
+
+     avg_gradcams = []
+     all_raw_images = []
+     itm_scores = []
+
+     for i in range(num_batches):
+         filenames_in_batch = filenames[i * bsz : (i + 1) * bsz]
+         raw_images, images = read_and_process_images(filenames_in_batch, vis_processor)
+         gradcam, itm_output = compute_gradcam_batch(
+             itm_model, images, queries_batch, queries_tok_batch
+         )
+
+         all_raw_images.extend([resize_img(r_img) for r_img in raw_images])
+         norm_imgs = [np.float32(r_img) / 255 for r_img in raw_images]
+
+         for norm_img, grad_cam in zip(norm_imgs, gradcam):
+             avg_gradcam = getAttMap(norm_img, grad_cam[0], blur=True)
+             avg_gradcams.append(avg_gradcam)
+
+         with torch.no_grad():
+             itm_score = torch.nn.functional.softmax(itm_output, dim=1)
+
+         itm_scores.append(itm_score)
+
+     # ========= ITM re-ranking =========
+     itm_scores = torch.cat(itm_scores)[:, 1]
+     if itm_ranking:
+         itm_scores_sorted, indices = torch.sort(itm_scores, descending=True)
+
+         avg_gradcams_sorted = []
+         all_raw_images_sorted = []
+         for idx in indices:
+             avg_gradcams_sorted.append(avg_gradcams[idx])
+             all_raw_images_sorted.append(all_raw_images[idx])
+
+         avg_gradcams = avg_gradcams_sorted
+         all_raw_images = all_raw_images_sorted
+
+     if show_gradcam:
+         images_to_show = iter(avg_gradcams)
+     else:
+         images_to_show = iter(all_raw_images)
+
+     for _ in range(num_rows):
+         with st.container():
+             for col in st.columns(num_cols):
+                 col.image(next(images_to_show), use_column_width=True, clamp=True)
+
+
+ def read_and_process_images(image_paths, vis_processor):
+     raw_images = [read_img(path) for path in image_paths]
+     images = [vis_processor(r_img) for r_img in raw_images]
+     images_tensors = torch.stack(images).to(device)
+
+     return raw_images, images_tensors
+
+
+ def compute_gradcam_batch(model, visual_input, text_input, tokenized_text, block_num=6):
+     model.text_encoder.base_model.base_model.encoder.layer[
+         block_num
+     ].crossattention.self.save_attention = True
+
+     output = model({"image": visual_input, "text_input": text_input}, match_head="itm")
+     loss = output[:, 1].sum()
+
+     model.zero_grad()
+     loss.backward()
+     with torch.no_grad():
+         mask = tokenized_text.attention_mask.view(
+             tokenized_text.attention_mask.size(0), 1, -1, 1, 1
+         )  # (bsz, 1, token_len, 1, 1)
+         token_length = mask.sum() - 2
+         token_length = token_length.cpu()
+         # grads and cams [bsz, num_head, seq_len, image_patch]
+         grads = model.text_encoder.base_model.base_model.encoder.layer[
+             block_num
+         ].crossattention.self.get_attn_gradients()
+         cams = model.text_encoder.base_model.base_model.encoder.layer[
+             block_num
+         ].crossattention.self.get_attention_map()
+
+         # assume using vit large with 576 num image patch
+         cams = cams[:, :, :, 1:].reshape(visual_input.size(0), 12, -1, 24, 24) * mask
+         grads = (
+             grads[:, :, :, 1:].clamp(0).reshape(visual_input.size(0), 12, -1, 24, 24)
+             * mask
+         )
+
+         gradcam = cams * grads
+         # [enc token gradcam, average gradcam across token, gradcam for individual token]
+         # gradcam = torch.cat((gradcam[0:1,:], gradcam[1:token_length+1, :].sum(dim=0, keepdim=True)/token_length, gradcam[1:, :]))
+         gradcam = gradcam.mean(1).cpu().detach()
+         gradcam = (
+             gradcam[:, 1 : token_length + 1, :].sum(dim=1, keepdim=True) / token_length
+         )
+
+     return gradcam, output
LAVIS/app/multipage.py ADDED
@@ -0,0 +1,41 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ """
+ This file is the framework for combining multiple Streamlit applications
+ through an object-oriented interface.
+ """
+
+ # Import necessary libraries
+ import streamlit as st
+
+
+ # Define the MultiPage class to manage the multiple apps in our program
+ class MultiPage:
+     """Framework for combining multiple Streamlit applications."""
+
+     def __init__(self) -> None:
+         """Initialize the list that stores all registered pages."""
+         self.pages = []
+
+     def add_page(self, title, func) -> None:
+         """Add a page to the project.
+
+         Args:
+             title (str): the title of the page to add to the list of apps.
+
+             func: the Python function that renders this page in Streamlit.
+         """
+
+         self.pages.append({"title": title, "function": func})
+
+     def run(self):
+         # Dropdown to select the page to run
+         page = st.sidebar.selectbox(
+             "Navigation", self.pages, format_func=lambda page: page["title"]
+         )
+
+         # Run the selected page's render function
+         page["function"]()
LAVIS/app/text_localization.py ADDED
@@ -0,0 +1,105 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import math
+
+ import numpy as np
+ import streamlit as st
+ from lavis.models.blip_models.blip_image_text_matching import compute_gradcam
+ from lavis.processors import load_processor
+ from PIL import Image
+
+ from app import device, load_demo_image
+ from app.utils import getAttMap, init_bert_tokenizer, load_blip_itm_model
+
+
+ def app():
+     model_type = st.sidebar.selectbox("Model:", ["BLIP_base", "BLIP_large"])
+
+     values = list(range(1, 12))
+     default_layer_num = values.index(7)
+     layer_num = (
+         st.sidebar.selectbox("Layer number", values, index=default_layer_num) - 1
+     )
+
+     st.markdown(
+         "<h1 style='text-align: center;'>Text Localization</h1>", unsafe_allow_html=True
+     )
+
+     vis_processor = load_processor("blip_image_eval").build(image_size=384)
+     text_processor = load_processor("blip_caption")
+
+     tokenizer = init_bert_tokenizer()
+
+     instructions = "Try the provided image and text or use your own."
+     file = st.file_uploader(instructions)
+
+     query = st.text_input(
+         "Try a different input.", "A girl playing with her dog on the beach."
+     )
+
+     submit_button = st.button("Submit")
+
+     col1, col2 = st.columns(2)
+
+     if file:
+         raw_img = Image.open(file).convert("RGB")
+     else:
+         raw_img = load_demo_image()
+
+     col1.header("Image")
+     w, h = raw_img.size
+     scaling_factor = 720 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+     col1.image(resized_image, use_column_width=True)
+
+     col2.header("GradCam")
+
+     if submit_button:
+         if model_type.startswith("BLIP"):
+             blip_type = model_type.split("_")[1]
+             model = load_blip_itm_model(device, model_type=blip_type)
+
+             img = vis_processor(raw_img).unsqueeze(0).to(device)
+             qry = text_processor(query)
+
+             qry_tok = tokenizer(qry, return_tensors="pt").to(device)
+
+             norm_img = np.float32(resized_image) / 255
+
+             gradcam, _ = compute_gradcam(model, img, qry, qry_tok, block_num=layer_num)
+
+             avg_gradcam = getAttMap(norm_img, gradcam[0][1], blur=True)
+             col2.image(avg_gradcam, use_column_width=True, clamp=True)
+
+             num_cols = 4.0
+             num_tokens = len(qry_tok.input_ids[0]) - 2
+
+             num_rows = int(math.ceil(num_tokens / num_cols))
+
+             gradcam_iter = iter(gradcam[0][2:-1])
+             token_id_iter = iter(qry_tok.input_ids[0][1:-1])
+
+             for _ in range(num_rows):
+                 with st.container():
+                     for col in st.columns(int(num_cols)):
+                         # use an explicit None sentinel so a 0 token id is not treated as exhaustion
+                         token_id = next(token_id_iter, None)
+                         if token_id is None:
+                             break
+                         gradcam_img = next(gradcam_iter)
+
+                         word = tokenizer.decode([token_id])
+                         gradcam_todraw = getAttMap(norm_img, gradcam_img, blur=True)
+
+                         new_title = (
+                             '<p style="text-align: center; font-size: 25px;">{}</p>'.format(
+                                 word
+                             )
+                         )
+                         col.markdown(new_title, unsafe_allow_html=True)
+                         # st.image(image, channels="BGR")
+                         col.image(gradcam_todraw, use_column_width=True, clamp=True)
LAVIS/app/utils.py ADDED
@@ -0,0 +1,81 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import numpy as np
+ import streamlit as st
+ import torch
+ from lavis.models import BlipBase, load_model
+ from matplotlib import pyplot as plt
+ from PIL import Image
+ from scipy.ndimage import filters
+ from skimage import transform as skimage_transform
+
+
+ def resize_img(raw_img):
+     w, h = raw_img.size
+     scaling_factor = 240 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+     return resized_image
+
+
+ def read_img(filepath):
+     raw_image = Image.open(filepath).convert("RGB")
+
+     return raw_image
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_model_cache(name, model_type, is_eval, device):
+     return load_model(name, model_type, is_eval, device)
+
+
+ @st.cache(allow_output_mutation=True)
+ def init_bert_tokenizer():
+     tokenizer = BlipBase.init_tokenizer()
+     return tokenizer
+
+
+ def getAttMap(img, attMap, blur=True, overlap=True):
+     attMap -= attMap.min()
+     if attMap.max() > 0:
+         attMap /= attMap.max()
+     attMap = skimage_transform.resize(attMap, (img.shape[:2]), order=3, mode="constant")
+     if blur:
+         attMap = filters.gaussian_filter(attMap, 0.02 * max(img.shape[:2]))
+         attMap -= attMap.min()
+         attMap /= attMap.max()
+     cmap = plt.get_cmap("jet")
+     attMapV = cmap(attMap)
+     attMapV = np.delete(attMapV, 3, 2)
+     if overlap:
+         attMap = (
+             1 * (1 - attMap**0.7).reshape(attMap.shape + (1,)) * img
+             + (attMap**0.7).reshape(attMap.shape + (1,)) * attMapV
+         )
+     return attMap
+
+
+ @st.cache(
+     hash_funcs={
+         torch.nn.parameter.Parameter: lambda parameter: parameter.data.detach()
+         .cpu()
+         .numpy()
+     },
+     allow_output_mutation=True,
+ )
+ def load_blip_itm_model(device, model_type="base"):
+     model = load_model(
+         "blip_image_text_matching", model_type, is_eval=True, device=device
+     )
+     return model
LAVIS/app/vqa.py ADDED
@@ -0,0 +1,63 @@
+ """
+  # Copyright (c) 2022, salesforce.com, inc.
+  # All rights reserved.
+  # SPDX-License-Identifier: BSD-3-Clause
+  # For full license text, see the LICENSE file in the repo root or https://opensource.org/licenses/BSD-3-Clause
+ """
+
+ import streamlit as st
+ from app import load_demo_image, device
+ from app.utils import load_model_cache
+ from lavis.processors import load_processor
+ from PIL import Image
+
+
+ def app():
+     model_type = st.sidebar.selectbox("Model:", ["BLIP"])
+
+     # ===== layout =====
+     st.markdown(
+         "<h1 style='text-align: center;'>Visual Question Answering</h1>",
+         unsafe_allow_html=True,
+     )
+
+     instructions = """Try the provided image or upload your own:"""
+     file = st.file_uploader(instructions)
+
+     col1, col2 = st.columns(2)
+
+     col1.header("Image")
+     if file:
+         raw_img = Image.open(file).convert("RGB")
+     else:
+         raw_img = load_demo_image()
+
+     w, h = raw_img.size
+     scaling_factor = 720 / w
+     resized_image = raw_img.resize((int(w * scaling_factor), int(h * scaling_factor)))
+
+     col1.image(resized_image, use_column_width=True)
+     col2.header("Question")
+
+     user_question = col2.text_input("Input your question!", "What objects are there?")
+     qa_button = st.button("Submit")
+
+     col2.header("Answer")
+
+     # ===== event =====
+     vis_processor = load_processor("blip_image_eval").build(image_size=480)
+     text_processor = load_processor("blip_question").build()
+
+     if qa_button:
+         if model_type.startswith("BLIP"):
+             model = load_model_cache(
+                 "blip_vqa", model_type="vqav2", is_eval=True, device=device
+             )
+
+             img = vis_processor(raw_img).unsqueeze(0).to(device)
+             question = text_processor(user_question)
+
+             vqa_samples = {"image": img, "text_input": [question]}
+             answers = model.predict_answers(vqa_samples, inference_method="generate")
+
+             col2.write("\n".join(answers), use_column_width=True)
LAVIS/assets/demo-6.png ADDED

Git LFS Details

  • SHA256: 34cab965b30f15b772bee414771bf1b25ee784593aaccce730cad7f0302db220
  • Pointer size: 132 Bytes
  • Size of remote file: 3.12 MB
LAVIS/dataset_card/avsd_dialogue.md ADDED
@@ -0,0 +1,32 @@
+ ![Samples from the AVSD dataset](imgs/avsd_dialogue.png)
+ 
+ *Samples from the AVSD dataset. Image credit: https://arxiv.org/pdf/1901.09107.pdf*
+ 
+ # Audio-Visual Scene-Aware Dialogues (AVSD)
+ 
+ ## Description
+ [Audio-Visual Scene-Aware Dialogues (AVSD)](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) contains more than 10,000 dialogues, each grounded on a unique video. In the test split, 6 reference dialogue responses are provided for each sample.
+ 
+ 
+ ## Task
+ 
+ (from https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge)
+ 
+ In a **video-grounded dialogue task**, the system must generate responses to a user input in the context of a given dialog. This context consists of a dialog history (previous utterances by both user and system) together with the video and audio information that comprise the scene. The quality of a system's automatically generated sentences is evaluated with objective measures to determine whether the generated responses are natural and informative.
15
+
16
+ ## Metrics
17
+ Models are typically evaluated according to [BLEU](https://aclanthology.org/P02-1040/), [CIDER](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf), [METEOR](https://aclanthology.org/W05-0909/), and [ROUGE-L](https://aclanthology.org/W04-1013/) metrics.
18
+
19
+ ## Leaderboard
20
+
21
+ TBD
22
+
23
+
24
+ ## Auto-Downloading
25
+
26
+ Please refer to [benchmark webite](https://github.com/hudaAlamri/DSTC7-Audio-Visual-Scene-Aware-Dialog-AVSD-Challenge) for instruction to download the dataset.
27
+
28
+
29
+ ## References
30
+ "Audio Visual Scene-Aware Dialog", Huda Alamri, Vincent Cartillier, Abhishek Das, Jue Wang, Anoop Cherian, Irfan Essa, Dhruv Batra, Tim K. Marks, Chiori Hori, Peter Anderson, Stefan Lee, Devi Parikh
31
+
32
+
LAVIS/dataset_card/coco_caption.md ADDED
@@ -0,0 +1,41 @@
+ ![Samples from the COCO Caption dataset (image credit: https://arxiv.org/pdf/1504.00325.pdf).](imgs/coco_caption.png)
+
+ # Microsoft COCO Dataset (Captioning)
+
+ ## Description
+ The [Microsoft COCO Captions dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
+
+ ## Task
+
+ (from https://paperswithcode.com/task/image-captioning)
+
+ **Image captioning** is the task of describing the content of an image in words. This task lies at the intersection of computer vision and natural language processing. Most image captioning systems use an encoder-decoder framework, where an input image is encoded into an intermediate representation of the information in the image, and then decoded into a descriptive text sequence.
+
+ ## Metrics
+ Models are typically evaluated according to the [BLEU](https://aclanthology.org/P02-1040/) or [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) metric.
+
+ ## Leaderboard
+
+ (Ranked by CIDEr.)
+
+ | Rank | Model | BLEU-4 | CIDEr | METEOR | SPICE | Resources |
+ | ---- | :-----: | :----: | :---: | :----: | :---: | :----: |
+ | 1 | OFA | 44.9 | 154.9 | 32.5 | 26.6 | [paper](https://arxiv.org/abs/2202.03052), [code](https://github.com/OFA-Sys/OFA) |
+ | 2 | LEMON | 42.6 | 145.5 | 31.4 | 25.5 | [paper]() |
+ | 3 | CoCa | 40.9 | 143.6 | 33.9 | 24.7 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
+ | 4 | SimVLM | 40.6 | 143.3 | 33.7 | 25.4 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
+ | 5 | VinVL | 41.0 | 140.9 | 31.1 | 25.2 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | OSCAR | 40.7 | 140.0 | 30.6 | 24.5 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 7 | BLIP | 40.4 | 136.7 | 31.4 | 24.3 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
+ | 8 | M^2 | 39.1 | 131.2 | 29.2 | 22.6 | [paper](https://arxiv.org/pdf/1912.08226v2.pdf), [code](https://github.com/aimagelab/meshed-memory-transformer) |
+ | 9 | BUTD | 36.5 | 113.5 | 27.0 | 20.3 | [paper](https://arxiv.org/abs/1707.07998?context=cs), [code](https://github.com/peteanderson80/bottom-up-attention) |
+ | 10 | ClipCap | 32.2 | 108.4 | 27.1 | 20.1 | [paper](https://arxiv.org/pdf/2111.09734v1.pdf), [code](https://github.com/rmokady/clip_prefix_caption) |
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_coco.py
+ ```
+
+ ## References
+ "Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
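The BLEU metric referenced above scores a candidate caption by clipped n-gram precision against the reference captions, combined geometrically and scaled by a brevity penalty. A minimal plain-Python sketch (not the official `coco-caption` scorer, which also applies tokenization and corpus-level aggregation):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: clipped n-gram precision with a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        if not cand_counts:
            precisions.append(0.0)
            continue
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(clipped / sum(cand_counts.values()))
    if min(precisions) == 0.0:
        return 0.0  # any zero precision zeroes the geometric mean
    # Brevity penalty against the closest reference length.
    ref_len = min((len(r) for r in references),
                  key=lambda L: (abs(L - len(candidate)), L))
    bp = 1.0 if len(candidate) >= ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

A candidate identical to a reference scores 1.0; any candidate shorter than four tokens scores 0.0 under BLEU-4, since it has no 4-grams.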
LAVIS/dataset_card/coco_retrieval.md ADDED
@@ -0,0 +1,36 @@
+ ![Samples from the COCO Caption dataset (image credit: https://arxiv.org/pdf/1504.00325.pdf).](imgs/coco_caption.png)
+
+ # Microsoft COCO Dataset (Retrieval)
+
+ ## Description
+ The [Microsoft COCO dataset](https://github.com/tylin/coco-caption) contains over one and a half million captions describing over 330,000 images. For the training and validation images, five independent human-generated captions are provided for each image.
+
+ ## Task
+ Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the image-text retrieval recall score and IR to denote the text-image retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 2 | X-VLM | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 3 | ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 4 | ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | [paper](https://arxiv.org/abs/2102.05918) |
+ | 5 | VinVL | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 7 | UNITER | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_coco.py
+ ```
+
+ ## References
+ "Microsoft COCO Captions: Data Collection and Evaluation Server", Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, C. Lawrence Zitnick
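The recall@k protocol above can be sketched as follows; `ranked` maps each query to its ranked gallery ids and `relevant` maps each query to its ground-truth ids (the names and toy data are illustrative, not from LAVIS):

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant gallery item in the top-k."""
    hits = sum(1 for q, ids in ranked.items() if set(ids[:k]) & relevant[q])
    return hits / len(ranked)

# Toy gallery of three images and two text queries.
ranked = {"q1": ["img_a", "img_b", "img_c"], "q2": ["img_b", "img_a", "img_c"]}
relevant = {"q1": {"img_a"}, "q2": {"img_c"}}
```

Here recall@1 is 0.5 (only `q1` ranks a relevant item first) while recall@3 is 1.0; TR and IR are this same quantity computed with images or texts, respectively, as the queries.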
LAVIS/dataset_card/conceptual_captions.md ADDED
@@ -0,0 +1,37 @@
+ ![Samples from the Conceptual Captions dataset.](imgs/conceptual_captions.png)
+ (image credit: https://ai.google.com/research/ConceptualCaptions/download)
+
+ # Conceptual Captions Dataset
+
+ ## Description
+ (from https://huggingface.co/datasets/conceptual_captions)
+
+ Conceptual Captions 3M (CC3M) is a dataset consisting of ~3.3M images annotated with captions. In contrast with the curated style of other image caption annotations, Conceptual Captions images and their raw descriptions are harvested from the web, and therefore represent a wider variety of styles. More precisely, the raw descriptions are harvested from the Alt-text HTML attribute associated with web images. To arrive at the current version of the captions, an automatic pipeline extracts, filters, and transforms candidate image/caption pairs, with the goal of achieving a balance of cleanliness, informativeness, fluency, and learnability of the resulting captions.
+
+ Conceptual Captions 12M (CC12M) is a dataset with 12 million image-text pairs specifically meant to be used for vision-and-language pre-training. Its data collection pipeline is a relaxed version of the one used in Conceptual Captions 3M (CC3M).
+
+ ## Task
+
+ Image-language pre-training; image captioning.
+
+ ## Auto-Downloading
+ **Warning**: images of this dataset are downloaded by requesting URLs. Since URLs may disappear over time, the downloaded dataset is expected to be partial.
+
+ ### Conceptual Captions 3M
+ - Download images
+ ```
+ cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc3m.py
+ ```
+ - Create annotations by running the notebook
+ ```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_3m.ipynb```
+
+ ### Conceptual Captions 12M
+ - Download images
+ ```
+ cd lavis/datasets/download_scripts/DownloadConceptualCaptions && python download_data_cc12m.py
+ ```
+ - Create annotations by running the notebook
+ ```lavis/datasets/download_scripts/DownloadConceptualCaptions/create_annotation_12m.ipynb```
+
+ ## References
+ Edwin G. Ng, Bo Pang, Piyush Sharma and Radu Soricut. 2020. Understanding Guided Image Captioning Performance Across Domains. arXiv preprint arXiv:2012.02339.
LAVIS/dataset_card/didemo_retrieval.md ADDED
@@ -0,0 +1,36 @@
+ ![Samples from the DiDeMo dataset (image credit: https://www.di.ens.fr/~miech/datasetviz/).](imgs/didemo.png)
+
+ # DiDeMo Dataset (Retrieval)
+
+ ## Description
+ The [DiDeMo (Distinct Describable Moments)](https://github.com/LisaAnne/LocalizingMoments) dataset contains over 10,000 unedited personal videos collected from Flickr, annotated with roughly 40,000 natural-language descriptions that localize distinct moments in the videos.
+
+ ## Task
+ Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the video-text retrieval recall score and VR to denote the text-video retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ <!-- | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 2 | X-VLM | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 3 | ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 3 | ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | [paper](https://arxiv.org/abs/2102.05918) |
+ | 4 | VinVL | 75.4 | 92.9 | 96.2 | 58.8 | 83.5 | 90.3 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 5 | OSCAR | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | UNITER | 65.7 | 88.6 | 93.8 | 52.9 | 79.9 | 88.0 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) | -->
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_didemo.py
+ ```
+
+ ## References
+ Anne Hendricks, Lisa, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. "Localizing moments in video with natural language." In Proceedings of the IEEE international conference on computer vision, pp. 5803-5812. 2017.
LAVIS/dataset_card/flickr_retrieval.md ADDED
@@ -0,0 +1,34 @@
+ ![Samples from the Flickr30k dataset (image credit: https://bryanplummer.com/Flickr30kEntities/).](imgs/flickr30k.png)
+
+ # Flickr30K Dataset (Retrieval)
+
+ ## Description
+ The [Flickr30k](https://bryanplummer.com/Flickr30kEntities/) dataset contains 31k+ images collected from Flickr, together with 5 reference sentences provided by human annotators for each image.
+
+ ## Task
+ Cross-modal retrieval: (1) **image-text**: given an image as query, retrieve texts from a gallery; (2) **text-image**: given a text as query, retrieve images from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the image-text retrieval recall score and IR to denote the text-image retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 97.2 | 99.9 | 100.0 | 87.5 | 97.7 | 98.9 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 2 | X-VLM | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 3 | ALBEF | 95.9 | 99.8 | 100.0 | 85.6 | 97.5 | 98.9 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 4 | ALIGN | 95.3 | 99.8 | 100.0 | 84.9 | 97.4 | 98.6 | [paper](https://arxiv.org/abs/2102.05918) |
+ | 5 | VILLA | 87.9 | 97.5 | 98.8 | 76.3 | 94.2 | 96.8 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 6 | UNITER | 87.3 | 98.0 | 99.2 | 75.6 | 94.1 | 96.8 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
+
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_flickr.py
+ ```
+
+ ## References
+ Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik, Flickr30K Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, IJCV, 123(1):74-93, 2017. [paper]
LAVIS/dataset_card/gqa.md ADDED
@@ -0,0 +1,32 @@
+ ![Samples from the GQA dataset (image credit: https://arxiv.org/abs/1902.09506).](imgs/gqa.png)
+
+ # GQA Dataset
+
+ ## Description
+ (from https://cs.stanford.edu/people/dorarad/gqa/about.html)
+
+ GQA is a VQA dataset for real-world images which requires visual, spatial and compositional reasoning.
+ It consists of 22M questions and 110K images.
+
+ ## Task
+ (from https://arxiv.org/abs/1902.09506)
+
+ Given an image and a question, the model is required to output a correct answer.
+ GQA questions require spatial understanding, multiple reasoning skills and multi-step inference.
+
+ ## Metrics
+
+ The metrics are accuracy, consistency, validity, and plausibility. The most commonly reported metric is accuracy.
+
+ ## Leaderboard
+
+ TBD
+
+ ## Auto-Downloading
+
+ ```
+ cd lavis/datasets/download_scripts && python download_gqa.py
+ ```
+
+ ## References
+ "GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering", Drew A. Hudson, Christopher D. Manning
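The accuracy metric above is an exact match between the generated answer and the ground truth. A minimal sketch with light answer normalization (lowercasing and whitespace collapsing — an assumption for illustration, not the official GQA evaluator):

```python
def normalize(answer):
    """Lowercase and collapse whitespace before comparison (illustrative only)."""
    return " ".join(answer.strip().lower().split())

def accuracy(predictions, references):
    """Exact-match accuracy between predicted and ground-truth answers."""
    pairs = list(zip(predictions, references))
    return sum(normalize(p) == normalize(r) for p, r in pairs) / len(pairs)
```

For example, `accuracy(["Blue ", "dog"], ["blue", "cat"])` returns 0.5: the first pair matches after normalization, the second does not.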
LAVIS/dataset_card/imgs/NLVR2.png ADDED

Git LFS Details

  • SHA256: e796c41791e58c8d36e404612f9bbdf16c99e07761094d362f57a746d8e002b3
  • Pointer size: 132 Bytes
  • Size of remote file: 2.49 MB
LAVIS/dataset_card/imgs/avsd_dialogue.png ADDED

Git LFS Details

  • SHA256: 68512b87cd234ab7d051d117c673b1770fd6a08c774ed3a16ec5e069833192aa
  • Pointer size: 132 Bytes
  • Size of remote file: 1.56 MB
LAVIS/dataset_card/imgs/coco_caption.png ADDED

Git LFS Details

  • SHA256: 950cfe987d6da5aad5ece7ca774ad73c1a24af7fcfda328536a8ce56eeaaf1b8
  • Pointer size: 132 Bytes
  • Size of remote file: 2.45 MB
LAVIS/dataset_card/imgs/conceptual_captions.png ADDED

Git LFS Details

  • SHA256: 5b6b13ab2fbc45076db85b0fbf1e0b1feb84fd0923a38c7db50ae09a5589b2ed
  • Pointer size: 132 Bytes
  • Size of remote file: 1.85 MB
LAVIS/dataset_card/imgs/didemo.png ADDED

Git LFS Details

  • SHA256: bca37c1c073a0ae47bebda0b4f37b3d933311e160cf4785398c6a2455465ddd0
  • Pointer size: 132 Bytes
  • Size of remote file: 2.63 MB
LAVIS/dataset_card/imgs/flickr30k.png ADDED

Git LFS Details

  • SHA256: 6397b8d87738dad19e1e2e5be652a6be6c9c50331023f1c5caa39ddbef981c24
  • Pointer size: 132 Bytes
  • Size of remote file: 1.44 MB
LAVIS/dataset_card/imgs/gqa.png ADDED

Git LFS Details

  • SHA256: 9aab093dd97c874c944f46991eb4bd018bbfc08f3a2fab3afbf27d71d2437866
  • Pointer size: 132 Bytes
  • Size of remote file: 3.12 MB
LAVIS/dataset_card/imgs/msrvtt.png ADDED

Git LFS Details

  • SHA256: defdac15aefc94420db8c64ca10eb19425e4454521b780bd391d032ec2f36a57
  • Pointer size: 132 Bytes
  • Size of remote file: 1.12 MB
LAVIS/dataset_card/imgs/msrvtt_qa.png ADDED

Git LFS Details

  • SHA256: 5518bff90f37cd9ddc2ca9eb96f9378075c341d3676c6708b0aabc9e7e78e758
  • Pointer size: 131 Bytes
  • Size of remote file: 335 kB
LAVIS/dataset_card/imgs/msvd_qa.png ADDED

Git LFS Details

  • SHA256: 05b7bf302d23a363996d0390bfd3c1ffc5f95b2b0f826c057af169ebd3d9e93f
  • Pointer size: 131 Bytes
  • Size of remote file: 388 kB
LAVIS/dataset_card/imgs/nocaps.png ADDED

Git LFS Details

  • SHA256: c087d4838a5d43612158a960bfd4200b3ef00303447ed27ba4ffc23a37588b6b
  • Pointer size: 132 Bytes
  • Size of remote file: 1.51 MB
LAVIS/dataset_card/imgs/sbu_caption.png ADDED

Git LFS Details

  • SHA256: 5fea4744bed8090a23f66689a8b4ca1fde008752f3830b24afe8d3eb81c2a2ea
  • Pointer size: 132 Bytes
  • Size of remote file: 1.52 MB
LAVIS/dataset_card/imgs/snli_ve.png ADDED

Git LFS Details

  • SHA256: 4660107e5ef8599fe587ebe6b8bf7b4b5561c796272f927a1d9e4aa5f8e15420
  • Pointer size: 132 Bytes
  • Size of remote file: 2.33 MB
LAVIS/dataset_card/imgs/vqav2.png ADDED

Git LFS Details

  • SHA256: 4101a7978e0870f95f5950cfe64697b2f1b7de725b056546ba950d27e2080bec
  • Pointer size: 132 Bytes
  • Size of remote file: 2.28 MB
LAVIS/dataset_card/msrvtt_qa.md ADDED
@@ -0,0 +1,44 @@
+ ![Samples from the MSRVTT-QA dataset (image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf).](imgs/msrvtt_qa.png)
+
+ # MSRVTT Dataset (Video Question Answering)
+
+ ## Description
+ The [MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. It was built by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.
+
+ The [MSRVTT-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on the MSR-VTT dataset, which is larger and has more complex scenes. The dataset contains 10K video clips and 243k question-answer pairs.
+
+ ## Task
+ Video question answering (VideoQA) is the task where a video and a natural language question are provided and the model needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).
+
+ ## Metrics
+ Accuracy.
+
+ ## Leaderboard
+ (Ranked by accuracy on test-dev.)
+ | Rank | Model | Acc. | Resources |
+ | ---- | :----: | :-------: | :-------: |
+ | 1 | ALPro | 42.1 | [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
+ | 2 | VQA-T | 41.5 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
+ | 3 | CoMVT | 39.5 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
+ | 4 | ClipBERT | 37.4 | [paper](https://arxiv.org/abs/2102.06183), [code](https://github.com/jayleicn/ClipBERT) |
+ | 5 | HCRN | 35.6 | [paper](https://arxiv.org/abs/2002.10698), [code](https://github.com/thaolmk54/hcrn-videoqa) |
+ | 6 | HGA | 35.5 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767), [code](https://github.com/Jumpin2/HGA) |
+ | 7 | DualVGR | 35.5 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf), [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
+ | 8 | SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
+ | 9 | HME | 33.0 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
+ | 10 | AMU | 32.5 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |
+
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_msrvtt.py
+ ```
+
+ ## References
+ Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.
+
+ Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.
LAVIS/dataset_card/msrvtt_retrieval.md ADDED
@@ -0,0 +1,24 @@
+ ![Samples from the MSRVTT dataset.](imgs/msrvtt.png)
+
+ # MSRVTT Dataset (Retrieval)
+
+ ## Description
+ The [MSRVTT](https://www.microsoft.com/en-us/research/publication/msr-vtt-a-large-video-description-dataset-for-bridging-video-and-language/) dataset is a large-scale video benchmark for video understanding, especially the emerging task of translating video to text. It was built by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips totaling 41.2 hours and 200K clip-sentence pairs. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers.
+
+ ## Task
+ Cross-modal retrieval: (1) **video-text**: given a video as query, retrieve texts from a gallery; (2) **text-video**: given a text as query, retrieve videos from a gallery.
+
+ ## Metrics
+ The common metric is recall@k, which denotes the [recall score](https://en.wikipedia.org/wiki/Precision_and_recall) among the top-k retrieved results.
+
+ We use TR to denote the video-text retrieval recall score and VR to denote the text-video retrieval score.
+
+ ## Leaderboard
+ (Ranked by TR@1.)
+ <!-- | Rank | Model | TR@1 | TR@5 | TR@10 | IR@1 | IR@5 | IR@10 | Resources |
+ | ---- | :----: | :---: | :---: | :---: | :---: | :---: | :---: | :----: |
+ | 1 | BLIP | 82.4 | 95.4 | 97.9 | 65.1 | 86.3 | 91.8 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) | -->
+
+ ## References
+ Xu, Jun, Tao Mei, Ting Yao, and Yong Rui. "Msr-vtt: A large video description dataset for bridging video and language." In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5288-5296. 2016.
LAVIS/dataset_card/msvd_qa.md ADDED
@@ -0,0 +1,45 @@
+ ![Samples from the MSVD-QA dataset (image credit: http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf).](imgs/msvd_qa.png)
+
+ # MSVD Dataset (Video Question Answering)
+
+ ## Description
+ The [MSVD-QA](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf) dataset is based on the [Microsoft Research Video Description Corpus](https://www.cs.utexas.edu/users/ml/clamp/videoDescription/), which is used in many video captioning experiments. The MSVD-QA dataset has a total of 1,970 video clips and 50,505 question-answer pairs.
+
+ ## Task
+ Video question answering (VideoQA) is the task where a video and a natural language question are provided and the model needs to give the right answer (from [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf)).
+
+ ## Metrics
+ Accuracy.
+
+ ## Leaderboard
+ (Ranked by accuracy on test-dev.)
+ | Rank | Model | Acc. | Resources |
+ | ---- | :----: | :-------: | :-------: |
+ | 1 | VQA-T | 46.3 | [paper](https://openaccess.thecvf.com/content/ICCV2021/papers/Yang_Just_Ask_Learning_To_Answer_Questions_From_Millions_of_Narrated_ICCV_2021_paper.pdf), [code](https://github.com/antoyang/just-ask), [demo](http://videoqa.paris.inria.fr/) |
+ | 2 | ALPro | 45.9 | [paper](https://arxiv.org/abs/2112.09583), [code](https://github.com/salesforce/ALPRO), [blog](https://blog.salesforceairesearch.com/alpro/) |
+ | 3 | CoMVT | 42.6 | [paper](https://openaccess.thecvf.com/content/CVPR2021/papers/Seo_Look_Before_You_Speak_Visually_Contextualized_Utterances_CVPR_2021_paper.pdf) |
+ | 4 | DualVGR | 39.0 | [paper](https://arxiv.org/pdf/2107.04768v1.pdf), [code](https://github.com/NJUPT-MCC/DualVGR-VideoQA) |
+ | 5 | HCRN | 36.1 | [paper](https://arxiv.org/abs/2002.10698), [code](https://github.com/thaolmk54/hcrn-videoqa) |
+ | 6 | SSML | 35.1 | [paper](https://arxiv.org/abs/2003.03186) |
+ | 7 | HGA | 34.7 | [paper](https://ojs.aaai.org/index.php/AAAI/article/view/6767), [code](https://github.com/Jumpin2/HGA) |
+ | 8 | HME | 33.7 | [paper](https://arxiv.org/pdf/1904.04357.pdf), [code](https://github.com/fanchenyou/HME-VideoQA) |
+ | 9 | AMU | 32.0 | [paper](http://staff.ustc.edu.cn/~hexn/papers/mm17-videoQA.pdf), [code](https://github.com/xudejing/video-question-answering) |
+ | 10 | ST-VQA | 31.3 | [paper](https://arxiv.org/pdf/1704.04497.pdf), [code](https://github.com/YunseokJANG/tgif-qa) |
+
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_msvd.py
+ ```
+
+ ## References
+ Chen, David, and William B. Dolan. "Collecting highly parallel data for paraphrase evaluation." In Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, pp. 190-200. 2011.
+
+ Xu, Dejing, Zhou Zhao, Jun Xiao, Fei Wu, Hanwang Zhang, Xiangnan He, and Yueting Zhuang. "Video question answering via gradually refined attention over appearance and motion." In Proceedings of the 25th ACM international conference on Multimedia, pp. 1645-1653. 2017.
LAVIS/dataset_card/nlvr2.md ADDED
@@ -0,0 +1,42 @@
1
+ ![From https://arxiv.org/pdf/1505.00468.pdf.](imgs/NLVR2.png)
2
+
3
+ # Natural Language for Visual Reasoning for Real (NLVR2)
4
+
5
+ ## Description
6
+ (from https://lil.nlp.cornell.edu/nlvr/)
7
+
8
+ NLVR2 contains 107,292 examples of human-written English sentences grounded in pairs of photographs. NLVR2 retains the linguistic diversity of NLVR, while including much more visually complex images.
9
+
10
+ We only publicly release the sentence annotations and original image URLs, and scripts that download the images from the URLs. If you would like direct access to the images, please fill out this Google Form. This form asks for your basic information and asks you to agree to our Terms of Service.
11
+
12
+
13
+ ## Task
14
+ (from https://lil.nlp.cornell.edu/nlvr/)
15
+ The Natural Language for Visual Reasoning (NLVR) task is to determine whether a sentence is true about a visual input. The data was collected through crowdsourcings, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. This includes two corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.
16
+
17
+
18
+ ## Metrics
19
+ Accuracy.
+
+ ## Leaderboard
+ (Ranked by accuracy on dev.)
+ | Rank | Model | dev | test | Resources |
+ | ---- | :----: | :------: | :------: | :-------: |
+ | 1 | VLMo | 88.6 | 89.5 | [paper](https://arxiv.org/pdf/2111.02358.pdf) |
+ | 2 | CoCa | 86.1 | 87.0 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
+ | 3 | SimVLM | 84.5 | 85.2 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
+ | 4 | X-VLM | 84.4 | 84.8 | [paper](https://arxiv.org/pdf/2111.08276v3.pdf), [code](https://github.com/zengyan-97/X-VLM) |
+ | 5 | VinVL | 82.7 | 84.0 | [paper](https://arxiv.org/pdf/2101.00529.pdf), [code](https://github.com/pzzhang/VinVL) |
+ | 6 | ALBEF | 82.6 | 83.1 | [paper](https://arxiv.org/abs/2107.07651), [code](https://github.com/salesforce/ALBEF), [blog](https://blog.salesforceairesearch.com/align-before-fuse/) |
+ | 7 | BLIP | 82.2 | 82.2 | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP), [blog](https://blog.salesforceairesearch.com/blip-bootstrapping-language-image-pretraining/) |
+ | 8 | OSCAR | 78.1 | 78.4 | [paper](https://arxiv.org/pdf/2004.06165v5.pdf), [code](https://github.com/microsoft/Oscar) |
+ | 9 | UNITER | 77.2 | 77.9 | [paper](https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123750103.pdf), [code](https://github.com/ChenRocks/UNITER) |
+ | 10 | SOHO | 76.4 | 77.3 | [paper](https://arxiv.org/pdf/2104.03135.pdf), [code](https://github.com/researchmm/soho) |
+
+
+ ## Downloading
+ Auto-downloading is not supported for this dataset. Please refer to https://lil.nlp.cornell.edu/nlvr/ and fill in the Google form to download the original images.
+
+
+ ## References
+ Suhr, Alane, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. "A corpus for reasoning about natural language grounded in photographs." arXiv preprint arXiv:1811.00491 (2018).
LAVIS/dataset_card/nocaps.md ADDED
@@ -0,0 +1,38 @@
+ ![Samples from the nocaps dataset (Image credit: "https://arxiv.org/abs/1812.08658").](imgs/nocaps.png)
+
+ # Nocaps
+
+ ## Description
+
+ (from https://nocaps.org/)
+
+ Our benchmark consists of 166,100 human-generated captions describing 15,100 images from the Open Images validation and test sets. The associated training data consists of COCO image-caption pairs, plus Open Images image-level labels and object bounding boxes. Since Open Images contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps).
+
+
+ ## Task: Novel object captioning
+
+ (from https://nocaps.org/)
+
+ Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task, dubbed nocaps, for novel object captioning at scale.
+
+
+ ## Metrics
+ Models are typically evaluated according to the [CIDEr](https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Vedantam_CIDEr_Consensus-Based_Image_2015_CVPR_paper.pdf) or [SPICE](https://arxiv.org/abs/1607.08822) metric.
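CIDEr scores a candidate caption by the cosine similarity of its TF-IDF weighted n-gram vector against each reference caption, averaged over references (and, in the full metric, over n-gram orders with a length penalty). As a rough, self-contained illustration of that core idea only (unigrams, no stemming or penalty; this is not the official implementation, which lives in the `pycocoevalcap` toolkit), one might write:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cider_sketch(candidate, references, all_refs, n=1):
    """Toy CIDEr-style score: TF-IDF weighted n-gram cosine similarity
    between a candidate caption and each reference caption, averaged."""
    # Document frequency: in how many images' reference sets each n-gram occurs.
    df = Counter()
    for refs in all_refs:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref.split(), n))
        df.update(seen)
    num_images = len(all_refs)

    def tfidf(tokens):
        counts = Counter(ngrams(tokens, n))
        total = sum(counts.values())
        # Frequent n-grams (seen in many images' references) get weight ~0.
        return {g: (c / total) * math.log(num_images / max(df[g], 1))
                for g, c in counts.items()}

    def cosine(a, b):
        dot = sum(v * b.get(g, 0.0) for g, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    cand = tfidf(candidate.split())
    return sum(cosine(cand, tfidf(r.split())) for r in references) / len(references)

# Toy corpus: reference captions for two images.
refs_img1 = ["a dog runs in the park", "a dog playing in the grass"]
refs_img2 = ["a cat sits on a mat"]
corpus = [refs_img1, refs_img2]

print(cider_sketch("a dog runs in the park", refs_img1, corpus))  # high: overlaps refs
print(cider_sketch("a cat on a mat", refs_img1, corpus))          # low: off-topic
```

An on-topic candidate scores high against its image's references, while an off-topic one scores near zero because its distinctive n-grams never overlap them.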
+
+ ## Leaderboard
+
+ (Ranked by CIDEr)
+
+ | Rank | Model | val. CIDEr | val. SPICE | test CIDEr | test SPICE | Resources |
+ | ---- | :----: | :----: | :----: | :----: | :----: | :----: |
+ | 1 | CoCa | 122.4 | 15.5 | 120.6 | 15.5 | [paper](https://arxiv.org/pdf/2205.01917.pdf) |
+ | 2 | LEMON | 117.3 | 15.0 | 114.3 | 14.9 | [paper]() |
+ | 3 | BLIP | 113.2 | 14.8 | - | - | [paper](https://arxiv.org/pdf/2201.12086.pdf), [code](https://github.com/salesforce/BLIP), [demo](https://huggingface.co/spaces/Salesforce/BLIP) |
+ | 4 | SimVLM | 112.2 | - | 110.3 | 14.5 | [paper](https://openreview.net/pdf?id=GUrhfTuf_3) |
+ | 5 | VinVL | 105.1 | 14.4 | 103.7 | 14.4 | [paper](https://arxiv.org/pdf/2101.00529v2.pdf), [code](https://github.com/microsoft/Oscar) |
+ ## Auto-Downloading
+ ```
+ cd lavis/datasets/download_scripts && python download_nocaps.py
+ ```
+
+ ## References
+ Agrawal, Harsh, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. "Nocaps: Novel object captioning at scale." In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8948-8957. 2019.