List of projects hosted in Scenic
- AdaTape is an adaptive computation transformer with an elastic input sequence.
- Adversarial training is an implementation of modern forms of adversarial training that achieved state-of-the-art robustness results on image classification. This includes AdvProp and [Pyramid Adversarial Training Improves ViT Performance](https://arxiv.org/abs/2111.15121).
- AVATAR is a sequence-to-sequence AudioVisual ASR TrAnsformeR trained end-to-end from spectrograms and full-frame RGB for the task of audiovisual speech recognition (AV-ASR).
- Audiovisual Masked Autoencoders performs self-supervised learning on multiple modalities (audio and video) to improve representation learning for both unimodal and multimodal downstream tasks. Details can be found in the paper.
- Boundary Attention is a differentiable bottom-up model for detecting boundaries in high noise at any resolution. It uses a form of local attention to infer boundaries that include contours, corners, and junctions, all without rasterization. Details and a link to the paper can be found on its website.
- ViViT is a family of pure-transformer-based models for video classification that achieved state-of-the-art results. Details can be found in the paper.
- Tasseo is a project that uses transformer-based models for aberration detection from chromosome karyotype images.
- TokenLearner proposes dynamic tokenization of images and videos for faster and more accurate video/image processing tasks. More can be found in the paper.
- Token Turing Machines are a sequential, autoregressive transformer architecture with external memory. More can be found in the paper.
- FastViT is a project that explores ideas for making ViT faster via efficient transformers, in particular on higher-resolution inputs (more tokens and thus longer sequences).
- Omninet is a transformer model with omni-directional representations.
- CLAY is a Transformer-based pipeline for mobile UI layout denoising. Read more about this project in the CLAY paper.
- LOCA (paper) is a self-supervised method to train spatially-aware vision transformer features.
- MatViT is a MatFormer-based (paper) nested ViT architecture designed to offer elasticity under a variety of deployment constraints, where each Feed-Forward Network (FFN) block of a MatViT model is jointly optimized with a few nested smaller FFN blocks.
- MBT presents a transformer-based architecture that uses "fusion bottlenecks" for modality fusion at multiple layers. Details can be found in the paper.
- MTV presents a state-of-the-art transformer-based architecture for video classification. MTV consists of separate encoders to represent different views of the input video, with lateral connections and a global encoder to fuse information across views. More details are in the paper.
- OWL-ViT is an open-vocabulary object detector: given an image and a free-text query, it finds objects matching that query in the image. It can also do one-shot object detection, i.e. detect objects based on a single example image. More details are in the paper.
- NCR is a regularization method that encourages the network to make similar predictions for similar vectors in the feature space. Details can be found in the paper, where we used this method to learn with noisy labels.
- Point Cloud Transformer (PCT) is a Transformer-based model for performing inference (classification/segmentation) on point cloud data. Details can be found in the paper.
- PolyViT is a simple and effective model for co-training a single transformer backbone on multiple modalities and tasks, resulting in a parameter-efficient model that performs as well as or better than models trained on single modalities or tasks. Details can be found in the paper.
- Wrappers of T5 models in t5x.
- Vid2Seq is a single-stage dense video captioning model, pre-trained on unlabelled narrated videos. Details can be found in the paper.
- ObjectViViT uses object detection results from external object detectors to help action recognition. Details can be found in the paper.
- Verbs in Action (paper) uses LLMs to create hard negative pairs for contrastive learning, in order to improve the verb understanding of CLIP-based video-text models.
- UniVRD is a bottom-up visual relationship detector built upon pre-trained vision and language models. Details can be found in the paper.
- UnLoc proposes a unified architecture for video localization tasks, e.g. Temporal Action Localization, Moment Retrieval, and Action Segmentation. More details can be found in the paper.
- REVEAL is a Retrieval-Augmented Visual Language Model that learns to retrieve world knowledge from a diverse set of multimodal knowledge sources through end-to-end pre-training. Details can be found in the paper.
- GER-ALD is a novel generative framework for web-scale visual entity recognition. We represent each entity by a compact, discriminative, and semantic code that a generative model learns to auto-regressively decode. Details can be found in the paper.
- Streaming Dense Video Captioning (Streaming DVC) is a framework for dense captioning of long videos. Details can be found in the paper.
- Dense VOC is an end-to-end model for joint object detection, tracking, and captioning in videos. Details can be found in the paper.
Scenic projects
A typical project consists of models, trainers, configs, a runner, and some utility functions developed for the project.
Models
Models are entities that define the network architecture, loss function, and
metrics. Network architectures are built using Flax nn.Modules. Common loss
functions and metrics can be included via a Base Model, or defined within the
project itself for more specific use-cases.
To be accessible by the trainer, a model newly defined by a project needs to be
registered within that project. As an exception, the baseline models
are registered directly in model_lib.models.
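The contract described above can be sketched without any framework: a model bundles an architecture, a loss, and metrics behind one interface. The class and method names below are hypothetical, framework-free illustrations, not Scenic's actual base-model API (which is built on Flax nn.Modules):

```python
# Hypothetical outline of the model contract: architecture + loss + metrics.
# Names are illustrative; Scenic's real base class is Flax-based.
import abc
from typing import Callable, Dict, Sequence


class BaseModel(abc.ABC):
  """A project model: network architecture, loss function, and metrics."""

  @abc.abstractmethod
  def build_model(self) -> Callable[[Sequence[float]], Sequence[float]]:
    """Returns the network's forward function (a Flax nn.Module in practice)."""

  @abc.abstractmethod
  def loss_function(self, preds: Sequence[float],
                    targets: Sequence[float]) -> float:
    """Computes the training loss from model outputs and targets."""

  def get_metrics(self, preds, targets) -> Dict[str, float]:
    """Common metrics inherited from the base; projects may override."""
    return {'loss': self.loss_function(preds, targets)}


class ScaleRegressionModel(BaseModel):
  """Minimal concrete model used only to exercise the interface."""

  def build_model(self):
    return lambda x: [2.0 * v for v in x]  # Toy stand-in 'architecture'.

  def loss_function(self, preds, targets):
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


model = ScaleRegressionModel()
forward = model.build_model()
preds = forward([1.0, 2.0])                      # -> [2.0, 4.0]
metrics = model.get_metrics(preds, [2.0, 4.0])   # perfect fit: loss == 0.0
```

A concrete model only has to fill in the abstract pieces; shared metrics come along for free from the base class, mirroring how common losses and metrics can be inherited from a Base Model.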
Trainers
Trainers implement the training and evaluation loops of the model. Scenic
already provides standard trainers for classification, segmentation, and
adaptation (located in the train_lib module).
These trainers are registered directly in train_lib_deprecated/trainers.
Since they are carefully optimized for fast and efficient training on
accelerators (in particular TPUs), projects can fork them for further
customization. Projects need to register any new trainers they define within
their project, or they can simply use the standard Scenic trainers when no
modification is needed.
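Schematically, a trainer interleaves training steps with periodic evaluation. The sketch below is a deliberately framework-free outline of that loop; all names are made up for illustration and do not match Scenic's trainer signatures:

```python
# Schematic trainer loop: run train steps, evaluate every `eval_every` steps.
# Hypothetical names; not Scenic's actual trainer API.
from typing import Callable, Dict, List


def train(
    train_step: Callable[[int, Dict[str, float]], Dict[str, float]],
    evaluate: Callable[[Dict[str, float]], float],
    init_state: Dict[str, float],
    total_steps: int,
    eval_every: int,
) -> List[float]:
  """Runs the training loop, recording an eval result every `eval_every` steps."""
  state = init_state
  eval_results = []
  for step in range(1, total_steps + 1):
    state = train_step(step, state)      # One optimization step.
    if step % eval_every == 0:           # Periodic evaluation loop.
      eval_results.append(evaluate(state))
  return eval_results


# Toy usage: "training" decays a scalar loss; evaluation reads it back.
def toy_train_step(step, state):
  return {'loss': state['loss'] * 0.9}

def toy_evaluate(state):
  return state['loss']

results = train(toy_train_step, toy_evaluate, {'loss': 1.0},
                total_steps=10, eval_every=5)
```

A real trainer adds checkpointing, logging, and accelerator-specific optimizations around this same skeleton, which is why forking a standard trainer is usually easier than writing one from scratch.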
Configs
Config files are used to configure experiments. They define (hyper-)parameters for the selected model, trainer, and dataset (e.g. number of layers, frequency of logging, etc.).
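As an illustration, such a config might look like the following. This is a hypothetical sketch: Scenic configs are typically ml_collections.ConfigDict objects returned from a get_config() function, but every field name and value here is made up, not an actual schema:

```python
# Hypothetical Scenic-style config sketch; field names are illustrative.
import ml_collections


def get_config() -> ml_collections.ConfigDict:
  """Returns a (made-up) experiment configuration."""
  config = ml_collections.ConfigDict()
  config.experiment_name = 'my_project_classification'

  # Model hyper-parameters, selected by name from the project's registry.
  config.model_name = 'my_project_vit'
  config.model = ml_collections.ConfigDict()
  config.model.num_layers = 12
  config.model.hidden_size = 768

  # Dataset and trainer selection.
  config.dataset_name = 'my_project_dataset'
  config.trainer_name = 'classification_trainer'

  # Training schedule and logging frequency.
  config.batch_size = 256
  config.num_training_steps = 10_000
  config.log_summary_steps = 100
  return config
```

The string-valued `*_name` fields are what the registries (described below) resolve into actual model, trainer, and dataset objects.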
Binaries
Binaries bind models, trainers, and datasets together based on the config and
start the training. Usually, this is a main.py within the project that also
contains the registry for the project-specific models and trainers. Note that
baselines use Scenic's default binary main.py.
Registries
There are three types of objects that can be registered in Scenic:
model, trainer, and dataset. A registry could be any simple data structure
that maps a string name to an object, for instance, a python dictionary.
Scenic defines a dataset registry that uses ad-hoc importing to lazy-load
the code for the input pipeline of a requested dataset. This registry lives in
dataset_lib/datasets.py. There are common trainers and models that are
registered in train_lib_deprecated/trainers.py and model_lib/models.py. However,
a project can define its own dataset, model, and trainer and make a small
registry for these objects within the project, e.g. in the project's main.py,
so that the right model, trainer, and dataset can be selected via the names
specified in the config file.
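The registry idea is lightweight enough to sketch in a few lines. Below is a hypothetical, self-contained version of the pattern, a plain dictionary plus a registration decorator, not Scenic's actual registry code:

```python
# A minimal registry in the style described above: a dict mapping string
# names to objects, filled by a decorator. Names are hypothetical.
from typing import Any, Callable, Dict

MODEL_REGISTRY: Dict[str, Any] = {}


def register_model(name: str) -> Callable:
  """Decorator that registers a model class under `name`."""
  def decorator(cls):
    if name in MODEL_REGISTRY:
      raise ValueError(f'Model {name!r} is already registered.')
    MODEL_REGISTRY[name] = cls
    return cls
  return decorator


def get_model(name: str):
  """Looks up a model by the name given in the config file."""
  if name not in MODEL_REGISTRY:
    raise ValueError(f'Unknown model: {name!r}. '
                     f'Registered: {sorted(MODEL_REGISTRY)}')
  return MODEL_REGISTRY[name]


# A project registers its model in (for instance) its main.py ...
@register_model('my_project_vit')
class MyProjectViT:
  pass


# ... and the binary later selects it via the name in the config.
model_cls = get_model('my_project_vit')
```

Trainer and dataset registries follow the same shape; the dataset registry additionally lazy-loads the input-pipeline code on first lookup via ad-hoc importing.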