## Contents

* [List of projects hosted in Scenic](#list-of-projects-hosted-in-scenic)
* [Scenic projects](#scenic-projects)

## List of projects hosted in Scenic
* [AdaTape](adatape)
> AdaTape is an adaptive computation transformer with an elastic input sequence.
* [AdversarialTraining](adversarialtraining)
> Adversarial training is an implementation of modern forms of adversarial
> training that achieved state-of-the-art robustness results on image
> classification. This includes [AdvProp](https://arxiv.org/abs/1911.09665)
> and [Pyramid Adversarial Training Improves ViT Performance](https://arxiv.org/abs/2111.15121).
* [AVATAR](avatar)
> [AVATAR](https://gabeur.github.io/avatar-visspeech) is a
> sequence-to-sequence AudioVisual ASR TrAnsformeR which is
> trained end-to-end from spectrograms and full-frame RGB for the task of
> audiovisual speech recognition (AV-ASR).
* [Audiovisual Masked Autoencoders](av-mae)
> Audiovisual Masked Autoencoders perform self-supervised learning on
> multiple modalities (audio and video) to improve representation learning
> for both unimodal and multimodal downstream tasks. Details can be found
> in the [paper](https://arxiv.org/abs/2212.05922).
* [Boundary Attention](boundary_attention)
> Boundary Attention is a differentiable bottom-up model for detecting
> boundaries in high noise at any resolution. It uses a form of local
> attention to infer boundaries that include contours, corners and
> junctions, all without rasterization. Details and a link to
> the paper can be found on its [website](https://boundaryattention.github.io/).
* [ViViT](vivit)
> ViViT is a family of pure-transformer based models for video
> classification that achieved state-of-the-art results.
> Details can be found in the [paper](https://arxiv.org/abs/2103.15691).
* [Tasseo](tasseo)
> Tasseo is a project that uses transformer based models for aberration
> detection from chromosome karyotype images.
* [TokenLearner](token_learner)
> TokenLearner proposes dynamic tokenization of images and videos for faster
> and more accurate video/image processing tasks. More can be found in
> the [paper](https://arxiv.org/abs/2106.11297).
* [Token Turing Machines](token_turing)
> Token Turing Machines are a sequential, autoregressive transformer
> architecture with external memory. More can be found in the
> [paper](https://arxiv.org/abs/2211.09119).
* [FastViT](fast_vit)
> FastViT is a project that aims at exploring ideas around making ViT faster
> by using [efficient transformers](https://arxiv.org/abs/2009.06732), in
> particular on higher-resolution inputs (more tokens and thus longer
> sequences).
* [Omninet](omninet)
> Omninet is a transformer model with
> [omni-directional representations](https://arxiv.org/abs/2103.01075).
* [CLAY](layout_denoise)
> CLAY is a Transformer-based pipeline for mobile UI layout denoising. Read
> more about this project in the CLAY [paper](https://arxiv.org/abs/2201.04100).
* [LOCA](loca)
> LOCA ([paper](https://arxiv.org/abs/2212.02400)) is a self-supervised
> method to train spatially-aware vision transformer features.
* [MatViT](matvit)
> MatViT is a MatFormer ([paper](https://arxiv.org/abs/2310.07707)) based
> nested ViT architecture designed to offer elasticity under a variety of
> deployment constraints, where each Feed Forward Network (FFN) block of a
> MatViT model is jointly optimized with a few nested smaller FFN blocks.
* [MBT](mbt)
> MBT presents a transformer based architecture that uses "fusion
> bottlenecks" for modality fusion at multiple layers.
> Details can be found in the [paper](https://arxiv.org/abs/2107.00135).
* [MTV](mtv)
> MTV presents a state-of-the-art transformer based architecture for video
> classification. MTV consists of separate encoders to represent different
> views of the input video with lateral connections and a global encoder to
> fuse information across views. More details are in the
> [paper](https://arxiv.org/abs/2201.04288).
* [OWL-ViT](owl_vit)
> OWL-ViT is an open-vocabulary object detector: given an image and a
> free-text query, it finds objects matching that query in the image. It can
> also do one-shot object detection, i.e. detect objects based on a single
> example image. More details are in the
> [paper](https://arxiv.org/abs/2205.06230).
* [NCR](ncr)
> NCR is a regularization method which encourages the network to make
> similar predictions for similar vectors in the feature space.
> Details can be found in the [paper](https://arxiv.org/abs/2202.02200),
> where we used this method to learn with noisy labels.
* [PCT](pointcloud)
> Point Cloud Transformer (PCT) is a Transformer-based model for
> performing inference (classification/segmentation) for point cloud data.
> Details can be found in the [paper](https://arxiv.org/abs/2012.09688).
* [PolyViT](polyvit)
> PolyViT is a simple and effective model for co-training a single
> transformer backbone on multiple modalities and tasks, resulting in a
> parameter-efficient model that performs as well as, or better than, models
> trained on single modalities or tasks.
> Details can be found in the [paper](https://arxiv.org/abs/2111.12993).
* [T5](t5)
> Wrappers of T5 models in [t5x](https://github.com/google-research/t5x).
* [Vid2Seq](vid2seq)
> Vid2Seq is a single-stage dense video captioning model, pre-trained on
> unlabelled narrated videos.
> Details can be found in the [paper](https://arxiv.org/abs/2302.14115).
* [ObjectViViT](objectvivit)
> ObjectViViT uses object detection results from external object detectors
> to help action recognition.
> Details can be found in the [paper](https://openaccess.thecvf.com/content/CVPR2023/html/Zhou_How_Can_Objects_Help_Action_Recognition_CVPR_2023_paper.html).
* [Verbs in action](verbs_in_action)
> Verbs in action ([paper](https://arxiv.org/abs/2304.06708)) uses LLMs to
> create hard negative pairs for contrastive learning, in order to improve
> the verb understanding of video-text models based on CLIP.
* [UniVRD](univrd)
> UniVRD is a bottom-up visual relationship detector built upon pre-trained
> vision and language models.
> Details can be found in the [paper](https://arxiv.org/abs/2303.08998).
* [UnLoc](unloc)
> UnLoc proposes a unified architecture for video localization tasks,
> e.g., Temporal Action Localization, Moment Retrieval, and Action
> Segmentation. More details can be found in the [paper](https://openaccess.thecvf.com/content/ICCV2023/papers/Yan_UnLoc_A_Unified_Framework_for_Video_Localization_Tasks_ICCV_2023_paper.pdf).
* [REVEAL](knowledge_visual_language)
> REVEAL is a Retrieval-Augmented Visual Language Model that
> learns to retrieve world knowledge from a diverse set of multimodal
> knowledge sources, through end-to-end pre-training.
> Details can be found in the [paper](https://arxiv.org/abs/2212.05221).
* [PixelLLM](pixel_llm)
> PixelLLM equips large language models with localization capability.
> Details can be found in the [paper](https://arxiv.org/abs/2312.09237).
* [GER-ALD](gerald)
> GER-ALD is a novel generative framework for web-scale visual entity
> recognition. We represent each entity by a compact, discriminative and
> semantic code that a generative model learns to auto-regressively decode.
> Details can be found in the [paper](https://arxiv.org/abs/2403.02041).
* [Streaming Dense Video Captioning](streaming_dvc)
> Streaming DVC is a framework for dense captioning of long videos.
> Details can be found in the [paper](https://arxiv.org/abs/2404.01297).
* [Dense Video Object Captioning](densevoc)
> Dense VOC is an end-to-end model for joint object detection, tracking,
> and captioning in videos.
> Details can be found in the [paper](https://arxiv.org/abs/2306.11729).

<a name="projects"></a>

## Scenic projects

A typical project consists of models, trainers, configs, a runner, and some
utility functions developed for the project.

### Models

Models are entities that define the network architecture, loss function, and
metrics. Network architectures are built using Flax `nn.Modules`. Common loss
functions and metrics can be included via a
[Base Model](../model_lib/README.md#base_model), or within the project
itself for more specific use cases.

To be accessible to the trainer, a model newly defined by a project needs to be
registered *within that project*. As an exception, the baseline models
are registered directly in `model_lib.models`.
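
To make this concrete, here is a minimal, hypothetical sketch of what a
project-defined model boils down to: a Flax `nn.Module` for the architecture
plus a loss function. The names `ToyClassifier` and `cross_entropy_loss` are
illustrative; this is not Scenic's actual `BaseModel` interface.

```python
# Minimal sketch of a project-defined model: a Flax nn.Module plus a loss.
# `ToyClassifier` and `cross_entropy_loss` are hypothetical names, not part
# of Scenic's BaseModel interface.
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax


class ToyClassifier(nn.Module):
  """A tiny MLP classifier, used here only to illustrate the structure."""
  num_classes: int
  hidden_dim: int = 128

  @nn.compact
  def __call__(self, x: jnp.ndarray) -> jnp.ndarray:
    x = x.reshape((x.shape[0], -1))          # Flatten the input images.
    x = nn.relu(nn.Dense(self.hidden_dim)(x))
    return nn.Dense(self.num_classes)(x)     # Unnormalized logits.


def cross_entropy_loss(logits: jnp.ndarray, labels: jnp.ndarray) -> jnp.ndarray:
  """Softmax cross-entropy, the kind of loss a base model would provide."""
  return optax.softmax_cross_entropy_with_integer_labels(logits, labels).mean()


# Initialize parameters and run a forward pass on dummy data.
model = ToyClassifier(num_classes=10)
params = model.init(jax.random.PRNGKey(0), jnp.ones((2, 32, 32, 3)))
logits = model.apply(params, jnp.ones((2, 32, 32, 3)))
loss = cross_entropy_loss(logits, jnp.array([1, 3]))
```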

### Trainers

Trainers implement the training and evaluation loops of the model. Standard
trainers for classification, segmentation, and adaptation are already provided
in Scenic (located in the `train_lib` module).
These trainers are registered directly in `train_lib_deprecated/trainers`.
Because they are carefully optimized for fast and efficient training on
accelerators (in particular TPUs), projects can fork them for further
customization. Projects need to register the new trainers they define within
their project, or they can simply use the standard Scenic trainers when no
modification is needed.
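
As a rough sketch of what a trainer's core amounts to, the snippet below builds
a jitted train step and runs a few iterations. It reuses the hypothetical
`ToyClassifier`, `model`, `params`, and `cross_entropy_loss` from the model
sketch above and is not Scenic's actual trainer interface, which also handles
checkpointing, metrics, and logging.

```python
# Sketch of a trainer's core: a jitted train step plus a loop over batches.
import jax
import jax.numpy as jnp
import optax


def make_train_step(apply_fn, tx):
  """Builds a jitted train step for a given model apply_fn and optimizer."""

  @jax.jit
  def train_step(params, opt_state, batch):
    def loss_fn(p):
      logits = apply_fn(p, batch['inputs'])
      return cross_entropy_loss(logits, batch['labels'])

    loss, grads = jax.value_and_grad(loss_fn)(params)
    updates, opt_state = tx.update(grads, opt_state, params)
    params = optax.apply_updates(params, updates)
    return params, opt_state, loss

  return train_step


tx = optax.adam(1e-3)                   # In practice, chosen via the config.
opt_state = tx.init(params)             # `params` comes from the model sketch.
train_step = make_train_step(model.apply, tx)
batch = {'inputs': jnp.ones((2, 32, 32, 3)), 'labels': jnp.array([1, 3])}
for _ in range(3):                      # Stand-in for the real training loop.
  params, opt_state, loss = train_step(params, opt_state, batch)
```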

### Configs

Config files are used to configure experiments. They define (hyper-)parameters
for the selected model, trainer, and dataset (e.g., the number of layers, the
frequency of logging, etc.).
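
Config files are typically Python modules that return an
`ml_collections.ConfigDict` from a `get_config()` function. The specific fields
below are only illustrative, since each project defines its own:

```python
# Illustrative config file; field names are hypothetical and project-specific.
import ml_collections


def get_config() -> ml_collections.ConfigDict:
  config = ml_collections.ConfigDict()
  config.experiment_name = 'toy_classification'

  # Model selection and hyper-parameters.
  config.model_name = 'toy_classifier'
  config.model = ml_collections.ConfigDict()
  config.model.hidden_dim = 128

  # Trainer and dataset selection plus their hyper-parameters.
  config.trainer_name = 'classification_trainer'
  config.dataset_name = 'cifar10'
  config.batch_size = 256
  config.num_training_steps = 10_000

  # Logging frequency.
  config.log_summary_steps = 100
  return config
```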

### Binaries

Binaries bind models, trainers, and datasets together based on the config and
start the training. Usually, this is a `main.py` within the project that also
contains the registry for the project-specific models and trainers. Note that
baselines make use of Scenic's default binary `main.py`.
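
Below is a hypothetical sketch of the binding a binary performs, reusing
`ToyClassifier` and `get_config()` from the earlier sketches. All names are
illustrative; consult an existing project's `main.py` for the actual wiring in
Scenic.

```python
# Hypothetical sketch of a project binary: look up the model and trainer named
# in the config from small project-local registries and start training.
import ml_collections


def toy_trainer(*, model_cls, config, workdir):
  """Stand-in trainer; a real one would run the loop sketched under Trainers."""
  print(f'Training {model_cls.__name__} in {workdir} '
        f'for {config.num_training_steps} steps.')


# Project-local registries: plain dictionaries from string names to objects.
MODELS = {'toy_classifier': ToyClassifier}
TRAINERS = {'classification_trainer': toy_trainer}


def main(config: ml_collections.ConfigDict, workdir: str):
  model_cls = MODELS[config.model_name]     # Both choices are driven by the
  trainer = TRAINERS[config.trainer_name]   # config file.
  trainer(model_cls=model_cls, config=config, workdir=workdir)


main(get_config(), workdir='/tmp/toy_experiment')
```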

### Registries

There are three types of objects that can be registered in Scenic:
`model`, `trainer`, and `dataset`. A registry can be any simple data structure
that maps a string name to an object, for instance a Python dictionary.
Scenic defines a dataset registry that uses ad-hoc importing to lazy-load
the code for the input pipeline of a requested dataset. This registry lives in
`dataset_lib/datasets.py`. Common trainers and models are registered in
`train_lib_deprecated/trainers.py` and `model_lib/models.py`. However,
a project can define its own dataset, model, and trainer and keep a small
registry for these objects within the project, e.g. in the project's `main.py`,
so that the right model, trainer, and dataset can be selected through the
config file.
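
As a sketch of the two registry styles described above, the snippet below shows
a plain dictionary registry for eagerly imported objects (reusing the
hypothetical `ToyClassifier`) and an ad-hoc-importing lookup that lazy-loads a
dataset module only when requested. The module paths and the `get_dataset`
convention are hypothetical, not Scenic's actual layout.

```python
# Sketch of an eager dictionary registry and a lazy-loading dataset registry.
import importlib

# Eager registry: a plain dictionary from string names to imported objects.
MODELS = {'toy_classifier': ToyClassifier}

# Lazy registry: map names to module paths and import on first request, so
# unused input pipelines (and their dependencies) are never loaded.
_DATASET_MODULES = {
    'cifar10': 'myproject.datasets.cifar10',      # Hypothetical module paths.
    'imagenet': 'myproject.datasets.imagenet',
}


def get_dataset_builder(name: str):
  """Imports the dataset module on demand and returns its builder function."""
  module = importlib.import_module(_DATASET_MODULES[name])
  return module.get_dataset  # Assumed convention: each module exposes this.
```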