List of projects hosted in Scenic
- AdaTape is an adaptive computation transformer with an elastic input sequence.
- Adversarial training is an implementation of modern forms of adversarial training that achieved state-of-the-art robustness results on image classification. This includes AdvProp and [Pyramid Adversarial Training Improves ViT Performance](https://arxiv.org/abs/2111.15121).
- AVATAR is a sequence-to-sequence AudioVisual ASR TrAnsformeR trained end-to-end from spectrograms and full-frame RGB for the task of audiovisual speech recognition (AV-ASR).
- Audiovisual Masked Autoencoders performs self-supervised learning on multiple modalities (audio and video) to improve representation learning for both unimodal and multimodal downstream tasks. Details can be found in the paper.
- Boundary Attention is a differentiable bottom-up model for detecting boundaries in high noise at any resolution. It uses a form of local attention to infer boundaries that include contours, corners, and junctions, all without rasterization. Details and a link to the paper can be found on its website.
- ViViT is a family of pure-transformer-based models for video classification that achieved state-of-the-art results. Details can be found in the paper.
- Tasseo is a project that uses transformer-based models for aberration detection from chromosome karyotype images.
- TokenLearner proposes dynamic tokenization of images and videos for faster and more accurate video/image processing tasks. More can be found in the paper.
- Token Turing Machines are a sequential, autoregressive transformer architecture with external memory. More can be found in the paper.
- FastViT is a project that explores ideas for making ViT faster via efficient transformers, in particular on higher-resolution inputs (more tokens and thus longer sequences).
- Omninet is a transformer model with omni-directional representations.
- CLAY is a Transformer-based pipeline for mobile UI layout denoising. Read more about this project in the CLAY paper.
- LOCA (paper) is a self-supervised method to train spatially-aware vision transformer features.
- MatViT is a MatFormer-based (paper) nested ViT architecture designed to offer elasticity under a variety of deployment constraints, where each Feed-Forward Network (FFN) block of a MatViT model is jointly optimized with a few nested smaller FFN blocks.
- MBT presents a transformer-based architecture that uses "fusion bottlenecks" for modality fusion at multiple layers. Details can be found in the paper.
- MTV presents a state-of-the-art transformer-based architecture for video classification. MTV consists of separate encoders to represent different views of the input video, with lateral connections and a global encoder to fuse information across views. More details are in the paper.
- OWL-ViT is an open-vocabulary object detector: given an image and a free-text query, it finds objects matching that query in the image. It can also do one-shot object detection, i.e. detect objects based on a single example image. More details are in the paper.
- NCR is a regularization method that encourages the network to make similar predictions for similar vectors in the feature space. Details can be found in the paper, where we used this method to learn with noisy labels.
- Point Cloud Transformer (PCT) is a Transformer-based model for performing inference (classification/segmentation) on point cloud data. Details can be found in the paper.
- PolyViT is a simple and effective model for co-training a single transformer backbone on multiple modalities and tasks, resulting in a parameter-efficient model that performs as well as or better than models trained on single modalities or tasks. Details can be found in the paper.
- Wrappers of T5 models in t5x.
- Vid2Seq is a single-stage dense video captioning model, pre-trained on unlabelled narrated videos. Details can be found in the paper.
- ObjectViViT uses object detection results from external object detectors to help action recognition. Details can be found in the paper.
- Verbs in Action (paper) uses LLMs to create hard negative pairs for contrastive learning, in order to improve the verb understanding of CLIP-based video-text models.
- UniVRD is a bottom-up visual relationship detector built upon pre-trained vision and language models. Details can be found in the paper.
- UnLoc proposes a unified architecture for video localization tasks, e.g. Temporal Action Localization, Moment Retrieval, and Action Segmentation. More details can be found in the paper.
- REVEAL is a Retrieval-Augmented Visual Language Model that learns to retrieve world knowledge from a diverse set of multimodal knowledge sources through end-to-end pre-training. Details can be found in the paper.
- GER-ALD is a novel generative framework for web-scale visual entity recognition. We represent each entity by a compact, discriminative, and semantic code that a generative model learns to auto-regressively decode. Details can be found in the paper.
- Streaming Dense Video Captioning (Streaming DVC) is a framework for dense captioning of long videos. Details can be found in the paper.
- Dense VOC is an end-to-end model for joint object detection, tracking, and captioning in videos. Details can be found in the paper.
Scenic projects
A typical project consists of models, trainers, configs, a runner, and some utility functions developed for the project.
Models
Models are entities that define the network architecture, loss function, and
metrics. Network architectures are built using Flax nn.Modules. Common loss
functions and metrics can be included via a Base Model, or defined within the
project itself for more specific use-cases.
To be accessible by the trainer, a model newly defined by a project needs to be
registered within that project. As an exception, the baseline models
are registered directly in model_lib.models.
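The contract described above can be sketched without any framework: a model bundles an architecture, a loss, and metrics behind one interface. The class and method names below are hypothetical, framework-free illustrations, not Scenic's actual base-model API (which is built on Flax nn.Modules):

```python
# Hypothetical outline of the model contract: architecture + loss + metrics.
# Names are illustrative; Scenic's real base class is Flax-based.
import abc
from typing import Callable, Dict, Sequence


class BaseModel(abc.ABC):
  """A project model: network architecture, loss function, and metrics."""

  @abc.abstractmethod
  def build_model(self) -> Callable[[Sequence[float]], Sequence[float]]:
    """Returns the network's forward function (a Flax nn.Module in practice)."""

  @abc.abstractmethod
  def loss_function(self, preds: Sequence[float],
                    targets: Sequence[float]) -> float:
    """Computes the training loss from model outputs and targets."""

  def get_metrics(self, preds, targets) -> Dict[str, float]:
    """Common metrics inherited from the base; projects may override."""
    return {'loss': self.loss_function(preds, targets)}


class ScaleRegressionModel(BaseModel):
  """Minimal concrete model used only to exercise the interface."""

  def build_model(self):
    return lambda x: [2.0 * v for v in x]  # Toy stand-in 'architecture'.

  def loss_function(self, preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


model = ScaleRegressionModel()
forward = model.build_model()
preds = forward([1.0, 2.0])                      # -> [2.0, 4.0]
metrics = model.get_metrics(preds, [2.0, 4.0])   # perfect fit: loss == 0.0
```

A concrete model only has to fill in the abstract pieces; shared metrics come along for free from the base class, mirroring how common losses and metrics can be inherited from a Base Model.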
Trainers
Trainers implement the training and evaluation loops of the model. Scenic
already provides standard trainers for classification, segmentation, and
adaptation (located in the train_lib module).
These trainers are registered directly in train_lib_deprecated/trainers.
Since they are carefully optimized for fast and efficient training on
accelerators (in particular TPUs), projects can fork them for further
customization. Projects need to register any new trainers they define within
their project, or they can simply use the standard Scenic trainers when no
modification is needed.
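Schematically, a trainer interleaves training steps with periodic evaluation. The sketch below is a deliberately framework-free outline of that loop; all names are made up for illustration and do not match Scenic's trainer signatures:

```python
# Schematic trainer loop: run train steps, evaluate every `eval_every` steps.
# Hypothetical names; not Scenic's actual trainer API.
from typing import Callable, Dict, List


def train(
    train_step: Callable[[int, Dict[str, float]], Dict[str, float]],
    evaluate: Callable[[Dict[str, float]], float],
    init_state: Dict[str, float],
    total_steps: int,
    eval_every: int,
) -> List[float]:
  """Runs the training loop, recording an eval result every `eval_every` steps."""
  state = init_state
  eval_results = []
  for step in range(1, total_steps + 1):
    state = train_step(step, state)      # One optimization step.
    if step % eval_every == 0:           # Periodic evaluation loop.
      eval_results.append(evaluate(state))
  return eval_results


# Toy usage: "training" decays a scalar loss; evaluation reads it back.
def toy_train_step(step, state):
  return {'loss': state['loss'] * 0.9}

def toy_evaluate(state):
  return state['loss']

results = train(toy_train_step, toy_evaluate, {'loss': 1.0},
                total_steps=10, eval_every=5)
```

A real trainer adds checkpointing, logging, and accelerator-specific optimizations around this same skeleton, which is why forking a standard trainer is usually easier than writing one from scratch.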
Configs
Config files are used to configure experiments. They define (hyper-)parameters for the selected model, trainer, and dataset (e.g. number of layers, frequency of logging, etc.).
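As an illustration, such a config might look like the following. This is a hypothetical sketch: Scenic configs are typically ml_collections.ConfigDict objects returned from a get_config() function, but every field name and value here is made up, not an actual schema:

```python
# Hypothetical Scenic-style config sketch; field names are illustrative.
import ml_collections


def get_config() -> ml_collections.ConfigDict:
  """Returns a (made-up) experiment configuration."""
  config = ml_collections.ConfigDict()
  config.experiment_name = 'my_project_classification'

  # Model hyper-parameters, selected by name from the project's registry.
  config.model_name = 'my_project_vit'
  config.model = ml_collections.ConfigDict()
  config.model.num_layers = 12
  config.model.hidden_size = 768

  # Dataset and trainer selection.
  config.dataset_name = 'my_project_dataset'
  config.trainer_name = 'classification_trainer'

  # Training schedule and logging frequency.
  config.batch_size = 256
  config.num_training_steps = 10_000
  config.log_summary_steps = 100
  return config
```

The string-valued `*_name` fields are what the registries (described below) resolve into actual model, trainer, and dataset objects.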
Binaries
Binaries bind models, trainers, and datasets together based on the config and
start the training. Usually, this is a main.py within the project that also
contains the registry for the project-specific models and trainers. Note that
baselines use Scenic's default binary main.py.
Registries
There are three types of objects that can be registered in Scenic:
model, trainer, and dataset. A registry could be any simple data structure
that maps a string name to an object, for instance, a python dictionary.
Scenic defines a dataset registry that uses ad-hoc importing to lazy-load
the code for the input pipeline of a requested dataset. This registry lives in
dataset_lib/datasets.py. There are common trainers and models that are
registered in train_lib_deprecated/trainers.py and model_lib/models.py. However,
a project can define its own dataset, model, and trainer and make a small
registry for these objects within the project, e.g. in the project's main.py,
so that the right model, trainer, and dataset can be selected via the names
specified in the config file.
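The registry idea is lightweight enough to sketch in a few lines. Below is a hypothetical, self-contained version of the pattern, a plain dictionary plus a registration decorator, not Scenic's actual registry code:

```python
# A minimal registry in the style described above: a dict mapping string
# names to objects, filled by a decorator. Names are hypothetical.
from typing import Any, Callable, Dict

MODEL_REGISTRY: Dict[str, Any] = {}


def register_model(name: str) -> Callable:
  """Decorator that registers a model class under `name`."""
  def decorator(cls):
    if name in MODEL_REGISTRY:
      raise ValueError(f'Model {name!r} is already registered.')
    MODEL_REGISTRY[name] = cls
    return cls
  return decorator


def get_model(name: str):
  """Looks up a model by the name given in the config file."""
  if name not in MODEL_REGISTRY:
    raise ValueError(f'Unknown model: {name!r}. '
                     f'Registered: {sorted(MODEL_REGISTRY)}')
  return MODEL_REGISTRY[name]


# A project registers its model in (for instance) its main.py ...
@register_model('my_project_vit')
class MyProjectViT:
  pass


# ... and the binary later selects it via the name in the config.
model_cls = get_model('my_project_vit')
```

Trainer and dataset registries follow the same shape; the dataset registry additionally lazy-loads the input-pipeline code on first lookup via ad-hoc importing.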