Title: BootsTAP: Bootstrapped Training for Tracking-Any-Point

URL Source: https://arxiv.org/html/2402.00847

Pauline Luc¹, Yi Yang¹, Dilara Gokay¹, Skanda Koppula¹, Ankush Gupta¹, Joseph Heyward¹, Ignacio Rocco¹, Ross Goroshin¹, João Carreira¹, Andrew Zisserman¹²

¹Google DeepMind   ²VGG, Department of Engineering Science, University of Oxford

Abstract

To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/

Keywords:

Tracking-Any-Point, Self-Supervised Learning, Semi-Supervised Learning

1 Introduction

Despite impressive achievements in the vision and language capability of generalist AI systems, physical and spatial reasoning remain notable weaknesses of state-of-the-art vision models[47, 59]. This limits their application in many domains like robotics, video generation, and 3D asset creation – all of which require an understanding of the complex motions and physical interactions in a scene. Tracking-Any-Point (TAP)[12] is a promising approach to represent precise motions in videos, and recent work has demonstrated compelling usage of TAP in robotics[62, 69, 2], 3D reconstruction[64], video generation[13], and video editing[71]. In TAP, algorithms are fed a video and a set of query points—potentially densely across the video—and must output the tracked location of these query points in the video’s other frames. If the point is not visible in a frame, the point is marked as occluded in that frame. This approach has many advantages: it is a highly general task, as correspondences for surface points are well-defined for opaque, solid surfaces, and it provides rich information about the deformation and motion of objects across long time periods.


Figure 1: Bootstrapped training for tracking-any-point. After initializing a TAPIR model with standard supervised training, we bootstrap the model on real data by adding an additional self-supervised loss. We apply a teacher model (a simple EMA of the student model) to get pseudo-ground-truth labels for a video. We then apply spatial transformations and corruptions to the video to make the task harder for the student, and train the student to reproduce the teacher’s predictions from any query point along the teacher’s trajectory.

The main challenge for building TAP models, however, is the lack of training data: in the real world, we must rely on manual labeling, which is arduous and imprecise[12], or on 3D sensing[1], which is only available in limited scenarios and quantity. Thus, state-of-the-art methods have relied on synthetic data[19, 74]. In this work, however, we overcome this limitation and demonstrate that unlabeled real-world videos can be used to improve point tracking, using self-consistency as a supervisory signal. In particular, we know that when tracks are correct for a given video, then 1) spatial transformations of the video should result in an equivalent spatial transformation of the trajectories, 2) different query points along the same trajectory should produce the same track, and 3) non-spatial data augmentation (e.g. image compression) should not affect results. Deviations from these properties can be treated as an error signal for learning.

Our architecture is outlined in Figure 1. We begin with a strong model pre-trained using supervised learning on synthetic data (in our case, a TAPIR[13] model), which serves as initialization for both a “teacher” and a “student” model. Given an unlabeled input video, we make a prediction using the teacher model, which serves as pseudo-ground-truth for the student. We then generate a second “view” of the video by applying affine transformations that vary smoothly in time, re-sampling frames to a lower resolution, adding JPEG corruption, and padding back to the original size. We input the second view to the “student” network and use a query point sampled from the teacher’s prediction (transformed consistently with the transformation applied to the video). The student’s prediction is then transformed back into the original coordinate space. We then use a self-supervised loss (SSL) to update the student’s weights: that is, we apply TAPIR’s original loss function to the student predictions, using the teacher’s predictions as pseudo-ground-truth. We take steps to ensure that the teacher’s predictions used for training are more likely to be accurate than the student’s: (i) the corruptions that degrade and downsample the video are applied only to the student’s inputs, and (ii) the teacher’s weights are an exponential moving average (EMA) of the student’s weights, a common trick for stabilizing student-teacher learning[20, 58]. Co-training using this formulation on real-world videos, in addition to training on synthetic data, provides a substantial boost over prior state-of-the-art across the entire TAP-Vid benchmark.
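The EMA teacher update can be sketched in a few lines; the decay value below is illustrative, not the paper's reported hyperparameter:

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Update teacher weights as an exponential moving average of the
    student weights. A decay close to 1 makes the teacher change slowly,
    which stabilizes the pseudo-labels it produces. (The 0.99 default is
    a placeholder, not the paper's value.)"""
    return {k: decay * teacher_params[k] + (1.0 - decay) * student_params[k]
            for k in teacher_params}
```

Because the teacher is a pure function of past student weights, no gradients ever flow into it, which is one of the safeguards against the student-teacher pair collapsing to trivial solutions.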

In summary, our contributions are as follows:

    1. We demonstrate the first large-scale pipeline for improving video point tracking using a large dataset of unannotated videos, based on straightforward properties of real trajectories: (i) predictions should vary consistently with spatial transformations of the video, and (ii) predictions should be invariant to the choice of query point along a given trajectory.
    2. We analyze the importance of varying model components, and show that a surprisingly simple formulation is sufficient to achieve good results.
    3. We show that the resulting formulation achieves new SOTA results on point tracking benchmarks, while requiring minimal architectural changes.
    4. We will release a model and checkpoint on GitHub, including model implementations in both JAX and PyTorch for the community to use.

2 Related Work

Tracking-Any-Point.

The ability to track densely-sampled points over long video sequences is a generic visual capability [52, 53]. Because this visual task provides a rich output that is well-defined independent of semantic or linguistic categories (unlike classification, detection, and semantic segmentation), it is more generically useful and can support other visual capabilities like video editing[71], 3D estimation[65], object segmentation [46, 50], camera tracking[8] and even robotics [62, 69]. Point tracking has recently seen a flurry of work, including new datasets [12, 74, 1] and algorithms [22, 29, 13, 65, 44, 3, 43]. Current state-of-the-art methods mainly train in a supervised manner, relying heavily on synthetic data[19, 74], which has a large domain gap with the real world.

Self-supervised correspondence via photometric loss.

Tracking has long been a target of self-supervised learning due to the lack of reliable supervised data, especially at the point level. A wide variety of proxy supervisory signals have been proposed, all with their own limitations. Photometric losses use reconstruction and are particularly popular in optical flow, but occlusions, lighting changes, and repeated (or constant) textures typically result in multiple or false appearance matches. To compensate for this, these methods typically rely on complicated priors such as multi-frame estimation [26], explicit occlusion handling [56, 68], improved data augmentation [36], additional loss terms [37, 38, 42], and robust loss functions which avoid degenerate solutions [72, 51, 40]. Methods that combine feature learning with appearance reconstruction, such as [63, 33, 32], have demonstrated long-range tracking. Matches based on local appearance are more likely to correspond to motion in high-resolution videos because they are able to resolve detailed textures [27]; we make use of this observation in our work.

Temporal continuity and cycle-consistency.

Other works use images or videos to perform more general feature learning, with the aim that features in correspondence should be more similar than those which are not. Temporal continuity in videos has long been used to obtain such correspondences [15, 70, 16, 66, 25], resulting in features which have proven to be effective for object tracking [10, 17]. Temporal cycle-consistency[67, 4] can also result in features useful for tracking; however, this learning method fails to provide useful supervision in challenging situations such as occlusions.

Semi-supervised correspondence.

A final self-supervised approach is to create pseudo-ground-truth correspondences for semi-supervised training[23, 54]. Such approaches have a long history in optical flow [37, 38, 24, 45], although with mixed results, typically requiring complex training setups such as GANs[31] or connecting the student to the teacher[39] to prevent trivial solutions. They have only recently been applied to longer-term point tracking[65, 57]. OmniMotion computes initial point tracks using RAFT[60] or TAP-Net[12] and infers a full pseudo-3D interpretation of the scene in the form of a neural network. Although this method improves point tracks compared to their initialization, it never retrains a general TAP model on the self-labeled data. Perhaps most related is Li et al.[34], which proposes a self-supervised loss based on reconstruction, in addition to a supervised point tracking loss and an adversarial domain adaptation loss. The final algorithm is complex and performs far below our work (59.8 $<\delta^{x}_{avg}$ on TAP-Vid-DAVIS, versus 78.1 for ours), with the self-supervised loss providing a relatively small boost. Concurrent work[57], on the other hand, saves a dataset of point tracks and retrains the underlying model on them, using data augmentations similar to ours. We discuss the differences in detail in the following section, after presenting our approach.

3 Method

When developing a self-supervised training method for TAP, it is important to note that TAP has a precise, correct answer for almost every query point. This is different from typical visual self-supervised learning, where the representation can be arbitrary, as long as semantically similar images have similar representations. Supervised learning on synthetic data provides a strong initial guess in many situations, but care must be taken to ensure that the self-supervised algorithm does not find “trivial shortcuts”[11] that become self-reinforcing and harm the initialization.


Figure 2: Bootstrapped training for Tracking-Any-Point. The teacher TAPIR produces a pseudo-label trajectory from query point $q_1$ at time $t_1$. Video frames undergo affine transformations $\Phi$ that vary smoothly in time and are augmented with JPEG artifacts, then fed to the student TAPIR, which predicts a trajectory from query point $\Phi(q_2)$ at time $t_2$ (sampled from the teacher’s prediction, then transformed to the student video space using $\Phi$). The student trajectory is transformed back, and the loss is computed against the teacher’s trajectory. To maximize the chances that we train on accurate trajectories, we remove trajectories where the student’s prediction at time $t_1$ is too far from the teacher query point $q_1$ (i.e. not cycle-consistent; light-orange disk).

Our formulation relies on two facts about point tracks that are true for points on any solid, opaque surface. First, spatial transformations (e.g. affine transformations) which are applied to the video will result in equivalent spatial transformations of the point tracks (i.e. the tracks are “equivariant” under spatial transformation), while the tracks are invariant to many other factors of variation that do not move the image content (e.g. color changes, noise). Second, the algorithm should output the same track regardless of which point along the track is used as a query; mathematically, this means that each trajectory forms an equivalence class. One could imagine enforcing the desired equivariance and invariance properties using a simple Siamese-network formulation[21], where a single network is trained to output consistent predictions on two different ‘views’ of the data (i.e., augmented and transformed versions of the video and tracks). However, we find that minimizing the difference between the two outputs—and backpropagating through both—results in predictions degrading toward trivial solutions (e.g. over-smoothing of tracks, or tracking the image boundary instead of the image content). In fact, the model can learn to distinguish between synthetic and real data, resulting in trivial solutions on the real, unlabeled data only. To prevent this, we adopt a student-teacher framework, where the student’s view of the data is made more challenging by augmentations, and the teacher does not receive gradients that may corrupt its predictions. Figure 2 shows the overall pipeline.

Loss functions.

We start with a baseline TAPIR network pre-trained on Kubric following[13]. Let $\hat{y}=\{\hat{p},\hat{o},\hat{u}\}$ be the predictions: $\hat{p}\in\mathbb{R}^{T\times 2}$ is the position, $\hat{o}\in\mathbb{R}^{T}$ is an occlusion logit, and $\hat{u}\in\mathbb{R}^{T}$ is an uncertainty logit, where $T$ is the number of frames. Calling $p[t]$ and $o[t]$ the ground-truth targets for frame $t$, recall that the standard TAPIR loss for a single trajectory is defined as:

$$
\mathcal{L}_{tapir}(\hat{p}[t],\hat{o}[t],\hat{u}[t]) = \underbrace{\text{Huber}(\hat{p}[t],p[t])\,(1-o[t])}_{\text{Position loss}} + \underbrace{\text{BCE}(\hat{o}[t],o[t])}_{\text{Occlusion loss}} + \underbrace{\text{BCE}(\hat{u}[t],u[t])\,(1-o[t])}_{\text{Uncertainty loss}} \qquad (1)
$$

where Huber is the Huber loss and BCE is the sigmoid binary cross-entropy. The target for the uncertainty logit is defined as $u[t]=\mathbbm{1}(d(p[t],\hat{p}[t])>\delta)$, where $d$ is the $L_2$ distance, $\delta$ is a threshold on the distance, set to 6 pixels, and $\mathbbm{1}$ is an indicator function. That is, the uncertainty loss trains the model to predict whether its own prediction is likely to be within a threshold of the ground truth.
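As a rough illustration, Eq. (1) can be computed per frame as follows. This is a NumPy sketch: the Huber transition point and the absence of any reduction are placeholder choices; only the 6-pixel uncertainty threshold comes from the text.

```python
import numpy as np

def huber(x, y, delta=1.0):
    """Huber loss on the L2 distance between predicted and target
    positions (the transition point `delta` is illustrative)."""
    d = np.linalg.norm(x - y, axis=-1)
    return np.where(d <= delta, 0.5 * d ** 2, delta * (d - 0.5 * delta))

def bce_with_logits(logit, target):
    """Numerically stable sigmoid binary cross-entropy from logits."""
    return np.maximum(logit, 0) - logit * target + np.log1p(np.exp(-np.abs(logit)))

def tapir_loss(p_hat, o_hat, u_hat, p, o, dist_thresh=6.0):
    """Per-frame TAPIR loss of Eq. (1): position (Huber, masked by
    visibility), occlusion (BCE), and uncertainty (BCE against whether
    the prediction landed within `dist_thresh` pixels of ground truth).
    Shapes: p_hat, p are (T, 2); o_hat, u_hat, o are (T,)."""
    u = (np.linalg.norm(p - p_hat, axis=-1) > dist_thresh).astype(np.float32)
    pos = huber(p_hat, p) * (1.0 - o)
    occ = bce_with_logits(o_hat, o)
    unc = bce_with_logits(u_hat, u) * (1.0 - o)
    return pos + occ + unc  # (T,) per-frame loss
```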

After pre-training, we add extra capacity to the model to absorb the extra training data: 5 layers of 2D conv-residual layers in the backbone with a channel multiplier of 4, which roughly doubles the number of parameters in the backbone (see Appendix 0.C for details). These are initialized to the identity following “zero init”[18]. Let $\hat{y}_{\mathcal{S}}=\{\hat{p}_{\mathcal{S}},\hat{o}_{\mathcal{S}},\hat{u}_{\mathcal{S}}\}$ now refer to the student predictions. We derive pseudo-labels $y_{\mathcal{T}}=\{p_{\mathcal{T}},o_{\mathcal{T}},u_{\mathcal{T}}\}$ from the teacher’s predictions $\hat{y}_{\mathcal{T}}=\{\hat{p}_{\mathcal{T}},\hat{o}_{\mathcal{T}},\hat{u}_{\mathcal{T}}\}$ as follows:

$$
p_{\mathcal{T}}[t]=\hat{p}_{\mathcal{T}}[t];\qquad o_{\mathcal{T}}[t]=\mathbbm{1}(\hat{o}_{\mathcal{T}}[t]>0);\qquad u_{\mathcal{T}}[t]=\mathbbm{1}\left(d(\hat{p}_{\mathcal{T}}[t],\hat{p}_{\mathcal{S}}[t])>\delta\right) \qquad (2)
$$

where $t$ indexes time. The loss $\ell_{ssl}(\hat{p}_{\mathcal{S}}[t],\hat{o}_{\mathcal{S}}[t],\hat{u}_{\mathcal{S}}[t])$ for a given video frame $t$ is derived from the TAPIR loss, treating the pseudo-labels as ground truth, and is defined as:

$$
\ell_{ssl}(\hat{p}_{\mathcal{S}}[t],\hat{o}_{\mathcal{S}}[t],\hat{u}_{\mathcal{S}}[t]) = \text{Huber}(\hat{p}_{\mathcal{S}}[t],p_{\mathcal{T}}[t])\,(1-o_{\mathcal{T}}[t]) + \text{BCE}(\hat{o}_{\mathcal{S}}[t],o_{\mathcal{T}}[t]) + \text{BCE}(\hat{u}_{\mathcal{S}}[t],u_{\mathcal{T}}[t])\,(1-o_{\mathcal{T}}[t]) \qquad (3)
$$

Note that TAPIR’s loss uses multiple refinement iterations, but we always use the teacher’s final prediction to derive pseudo-ground-truth; therefore, refined predictions serve as supervision for unrefined ones, encouraging stronger features that enable faster convergence.
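The pseudo-label construction of Eq. (2) is compact in code; a NumPy sketch, with positions of shape (T, 2) following the notation above:

```python
import numpy as np

def make_pseudo_labels(p_teacher, o_teacher_logit, p_student, dist_thresh=6.0):
    """Derive pseudo-ground-truth from teacher predictions (Eq. 2):
    positions are taken directly from the teacher, occlusion logits are
    thresholded at zero, and the uncertainty target marks frames where
    the student's prediction is farther than `dist_thresh` pixels from
    the teacher's (the 6-pixel threshold matches the supervised loss)."""
    p = p_teacher                                               # (T, 2)
    o = (o_teacher_logit > 0).astype(np.float32)                # (T,)
    u = (np.linalg.norm(p_teacher - p_student, axis=-1)
         > dist_thresh).astype(np.float32)                      # (T,)
    return p, o, u
```

The resulting `(p, o, u)` would simply be substituted for the ground-truth targets in the TAPIR loss to obtain Eq. (3).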

Video degradations.

While the above formulation is well-defined, if the student and teacher both receive the same video and query point, we expect the loss to be trivially close to zero; therefore, we apply transformations and corruptions to the student’s view of the video. Given an input video, we create a second view by resizing each frame to a smaller resolution $r$ and superimposing it onto a black background at a random position $(h,w)$ within this background. $r$ varies linearly over time, meaning that the frames gradually become larger or smaller within the fixed-size black background; the decreased resolution degrades the student view and increases task difficulty for the student. The position of these frames also moves with time: $(h,w)$ follows a linear trajectory within the black background. Formally, this is a frame-wise, axis-aligned affine transformation $\Phi$ on coordinates, applied to the pixels. We also apply $\Phi$ to the student query coordinates. We further degrade this view by applying a random JPEG corruption before pasting it onto the black background. Both operations lose texture information; therefore, the network must learn higher-level—and possibly semantic—cues (e.g. the tip of the top-left ear of the cat), rather than lower-level texture matching, in order to track points correctly. We apply the inverse affine transformation $\Phi^{-1}$ to map the student’s predictions back to the original input coordinate space before feeding them to the loss. We describe these transformations and corruptions in more detail in Appendix 0.B.1.
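A minimal sketch of the frame-wise affine transformation $\Phi$ and its inverse, assuming a purely linear schedule for the scale $r$ and offset $(h,w)$ (the JPEG corruption, the random sampling of endpoints, and the exact parameter ranges are omitted):

```python
import numpy as np

def make_affine_schedule(num_frames, r0, r1, hw0, hw1):
    """Per-frame, axis-aligned affine Phi: coordinates are scaled by a
    factor varying linearly from r0 to r1, and offset by a position
    moving linearly from hw0 to hw1 (endpoint values are illustrative)."""
    t = np.linspace(0.0, 1.0, num_frames)[:, None]
    scale = (1 - t) * r0 + t * r1                             # (T, 1)
    offset = (1 - t) * np.asarray(hw0) + t * np.asarray(hw1)  # (T, 2)
    return scale, offset

def apply_phi(coords, scale_t, offset_t):
    """Map coordinates from the original video into the student view."""
    return coords * scale_t + offset_t

def apply_phi_inverse(coords, scale_t, offset_t):
    """Map student-view predictions back to original coordinates."""
    return (coords - offset_t) / scale_t
```

In training, `apply_phi` would be used both to warp the pixels and to transform the student's query point, while `apply_phi_inverse` maps the student's predicted track back before the loss.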

Choosing the sample point.

We enforce that each trajectory forms an equivalence class by training the model to produce the same track regardless of which point is used as a query. While we do not have access to the ground-truth trajectories to sample different query points from, we can use the teacher model’s predictions to form pairs of query points. First, we sample a query point $Q_1=(q_1,t_1)$, where $q_1$ is an $(x,y)$ coordinate and $t_1$ is a frame index, both sampled uniformly. Then the student’s query is sampled randomly from the teacher’s trajectory, i.e. $Q_2=(q_2,t_2)\in\{(p_{\mathcal{T}}[t],t)\ \text{s.t.}\ o_{\mathcal{T}}[t]=0\}$.
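The student-query sampling step can be sketched as follows (a NumPy illustration of picking $Q_2$ from the visible portion of the teacher's trajectory):

```python
import numpy as np

def sample_student_query(p_teacher, o_teacher, rng):
    """Sample the student's query from the teacher's trajectory: pick a
    frame t2 uniformly among frames the teacher marks visible (o == 0),
    and use the teacher's predicted position there."""
    visible = np.flatnonzero(o_teacher == 0)
    t2 = int(rng.choice(visible))
    return p_teacher[t2], t2
```

The sampled point would then be mapped into the student view with $\Phi$ before being used as the student's query.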

Note, however, that if the teacher has not tracked the point correctly, the student’s query might correspond to a different real-world point than the teacher’s, leading to an erroneous training signal. To prevent this, we use cycle-consistency of the student and teacher trajectories, and ignore the loss for trajectories that do not form a valid cycle, as depicted by the orange circle in Figure 2. Formally, we implement this as a mask defined as:

$$
m_{cycle}=\mathbbm{1}\left(d(\hat{p}_{\mathcal{S}}[t_1],q_1)<\delta_{cycle}\right)\;\mathbbm{1}\left(\hat{o}_{\mathcal{S}}[t_1]\leq 0\right) \qquad (4)
$$

Here, $\delta_{cycle}$ is a distance threshold hyperparameter, which we set to 4 pixels.
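Eq. (4) amounts to a two-part check on the student's prediction back at the teacher's query frame; a NumPy sketch (the 4-pixel threshold is the value stated above):

```python
import numpy as np

def cycle_mask(p_student_t1, o_student_logit_t1, q1, delta_cycle=4.0):
    """Cycle-consistency mask of Eq. (4): keep the trajectory only if
    the student's prediction at the teacher's query frame t1 lands
    within `delta_cycle` pixels of the original query q1 AND the
    student predicts the point visible there (occlusion logit <= 0)."""
    close = np.linalg.norm(p_student_t1 - q1) < delta_cycle
    visible = o_student_logit_t1 <= 0
    return float(close and visible)
```

Trajectories with a zero mask contribute nothing to the loss, so the model is never trained on pairs of queries that likely lie on different real-world points.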

Note that there is a special case when the student and teacher have the same query point: there is no longer any uncertainty regarding whether the point is on the same trajectory. These points are reliable while also being less challenging. We compromise between extremes, and sample $Q_1=Q_2$ with probability 0.5, and otherwise sample with equal probability from the remaining visible points in the teacher prediction. The final self-supervised loss for a single trajectory is then:

$$
\mathcal{L}_{SSL}=\sum_{t} m_{cycle}^{t}\,\ell_{ssl}^{t} \qquad (5)
$$

In practice, we sample 128 query points per input video and average the loss over all of them. We provide pseudocode for the algorithm in Appendix 0.A.

To avoid catastrophic forgetting, we continue training on the Kubric dataset with the regular supervised TAPIR loss. Our training setup follows prior work on multi-task self-supervised learning[14]: we maintain separate Adam optimizer parameters to compute separate updates for both tasks, and then apply the gradients with their own learning rates. As the self-supervised task is more expensive due to the extra forward pass, we use half the batch size for self-supervised updates, and therefore we halve the learning rate for these updates. See Appendix 0.B.2 for more details.
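The separate-optimizer scheme can be sketched as below: each task keeps its own Adam moment estimates, and the self-supervised update uses half the learning rate. This is a minimal NumPy Adam for illustration; the learning rates and hyperparameters here are placeholders, and actual training would use a framework optimizer.

```python
import numpy as np

def adam_step(param, grad, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using task-specific state dict {m, v, t}."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

def multitask_step(param, grad_sup, grad_ssl, state_sup, state_ssl, lr=1e-3):
    """Apply the supervised and self-supervised gradients with separate
    optimizer states; the SSL update uses half the learning rate,
    matching its halved batch size."""
    param = adam_step(param, grad_sup, state_sup, lr)
    param = adam_step(param, grad_ssl, state_ssl, lr * 0.5)
    return param
```

Keeping the moment estimates separate prevents the two tasks' gradient statistics from interfering with each other's adaptive step sizes.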

Differences between our approach and [57]. Concurrent work reproduces some of these decisions, including using cycle-consistency as a method of filtering and using affine transformations when augmenting the student view. However, there are a few key differences. First, rather than a student-teacher setup, they compute trajectories only once and freeze the training data, meaning that the model is permanently trained to reproduce errors in the original labeling. Furthermore, that work fine-tunes on the target dataset, meaning that transfer to a new domain may require a large training set in that domain on which to fine-tune; in contrast, our work demonstrates that it is possible to train on a single large dataset that covers many domains, making fine-tuning unnecessary.

4 Experiments

We train our model on over 15 million 24-frame clips from publicly-available online videos, in conjunction with standard training on Kubric. The resulting model is essentially a drop-in replacement for TAPIR (albeit with slightly larger computational requirements due to the extra layers). We evaluate on the TAP-Vid benchmark using the standard protocol.

4.1 Training datasets

We collected a video dataset from publicly accessible videos selected from categories that typically contain high-quality and realistic motion (such as lifestyle and one-shot videos). Conversely, we omitted videos from categories with low visual complexity or unrealistic motions, such as tutorial videos, lyrics videos, and animations. To maintain consistency, we exclusively obtained videos shot at 60fps. Additionally, we applied a quality metric by only considering videos with over 200 views. We removed the first and last 2 seconds of each video, as these often contain intros and outros with text or other overlays. From each video, we randomly sampled five clips, excluding those with overlay/watermarked frames, which were identified by checking the horizontal and vertical gradients and computing the pixel-wise median (similar to[9]). Furthermore, we expect the teacher signal will be more reliable on continuous shots due to temporal continuity; therefore, clips with shot boundary changes are detected and removed based on[5, 73, 61, 41] with additional accuracy improvements based on full-frame geometric alignment. In total, we generated 15 million clips for training.

4.2 Evaluation datasets

We rely on the TAP-Vid [12] and RoboTAP [62] benchmarks for quantitative evaluation; in all cases, we evaluate zero-shot on the entire benchmark, resizing to 256×256 before evaluating according to the standard procedure [12]. These benchmarks comprise: TAP-Vid-Kinetics, which contains online videos of human actions and may include cuts [7]; TAP-Vid-DAVIS, which is based on the DAVIS object tracking benchmark [49]; TAP-Vid-RGB-Stacking, which contains synthetic tracks for videos of robotic manipulation with little texture; and RoboTAP, which contains real-world robotic manipulation videos [62]; all include ground truth. Evaluation is performed by measuring occlusion accuracy (OA); $<\delta^{x}_{avg}$, which measures the fraction of point estimates within a specified distance of the ground-truth location, averaged across 5 thresholds; and Average Jaccard (AJ), which combines the two. There are two dataset querying “modes”: query first (q_first) uses the first visible point on each trajectory as a query, while strided uses every fifth point along the trajectory as a separate query. We also include qualitative evaluations on two robotics datasets without ground truth: RoboCAT-NIST, a subset of the data collected for RoboCat [6], and Libero [35], a dataset where point tracking has already proven useful for robotic manipulation [69]. See Appendix 0.D for details on these datasets and metrics.
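
For concreteness, these metrics can be sketched as follows for a single video. The {1, 2, 4, 8, 16}-pixel thresholds follow the benchmark at 256×256 resolution; the function name and exact tie-breaking details are ours, simplified from the benchmark code:

```python
import numpy as np

def tapvid_metrics(gt_xy, gt_occ, pred_xy, pred_occ):
    """Sketch of TAP-Vid metrics for one video.

    gt_xy, pred_xy: (N, T, 2) point locations in pixels.
    gt_occ, pred_occ: (N, T) bool, True where the point is occluded.
    """
    vis = ~gt_occ
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)
    oa = float((pred_occ == gt_occ).mean())  # occlusion accuracy
    pos, jac = [], []
    for thr in (1, 2, 4, 8, 16):
        within = dist <= thr
        pos.append(within[vis].mean())  # position accuracy at this threshold
        # Jaccard: visible+accurate predictions vs. all errors.
        tp = (vis & ~pred_occ & within).sum()
        fn = (vis & (pred_occ | ~within)).sum()
        fp = (~pred_occ & (gt_occ | ~within)).sum()
        jac.append(tp / max(tp + fn + fp, 1))
    return {"OA": oa,
            "delta_avg": float(np.mean(pos)),
            "AJ": float(np.mean(jac))}
```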

4.3 Results

Table 1: Comparison of performance on the TAP-Vid datasets. AJ (Average Jaccard; higher is better) measures both occlusion and position accuracy. $<\delta^{x}_{avg}$ (higher is better) measures only localization performance, ignoring occlusion accuracy. OA (Occlusion Accuracy; higher is better) measures only accuracy in predicting occlusion.

Our results are shown in Table 1. Note that all of our numbers come from a single checkpoint, which has not seen the relevant datasets. Relative to our base architecture, our bootstrapping approach provides a substantial gain across all metrics. We also outperform CoTracker on DAVIS, though this is due more to improvements in occlusion accuracy than position accuracy. This is despite TAPIR's simpler architecture: CoTracker requires cross-attention to other points, which must be chosen with a hand-tuned distribution, whereas TAPIR tracks points independently. CoTracker results are also obtained by upsampling videos to 384×512, which further increases compute time, whereas ours are computed directly on 256×256 videos.

Table 2: Comparison of performance under query-first metrics for Kinetics, TAP-Vid-DAVIS, and RoboTAP (query-first is the standard mode for RoboTAP).

Table 2 shows performance under q_first mode. Here, bootstrapping outperforms prior works by a wide margin on Kinetics; this is likely because TAPIR's global search is more robust to large occlusions and cuts, which are more prominent in Kinetics. This search might harm performance on datasets like DAVIS, which have a stronger temporal-continuity bias. Perhaps most impressive is the strong improvement on RoboTAP (over 5% absolute), despite RoboTAP looking very different from typical online videos. We see similar results for RGB-Stacking in Table 1. These two datasets have large textureless regions; such regions are challenging to track without object-aware priors, which are difficult to obtain from synthetic datasets.

Figure 3 shows qualitative examples of cases where BootsTAPIR improves performance. We see improvements on examples where texture cues are ambiguous (e.g. the dark jacket and trousers), where prior knowledge of common object shapes can improve performance, as well as on points near object boundaries (e.g. the dog's ears), where a model trained on synthetic data with different appearance may struggle to estimate the correct segmentation. We also note that BootsTAPIR improves on many cases where TAPIR marks a point as occluded even though it is still visible, such as the person's arm. On RoboTAP, the model improves occlusion estimation for the textureless gripper. It also deals well with large changes in scale as the gripper approaches the shoe, as well as with shiny objects, both of which are less common in Kubric. Our project webpage https://bootstap.github.io/ includes video examples, which make the improvements more apparent.


Figure 3: Comparison between TAPIR (■), CoTracker (◆), and BootsTAPIR (●), and the ground-truth points (+) on the TAP-Vid-DAVIS and RoboTAP benchmarks. We show the initial query frame and a closeup of four later frames.


Figure 4: Comparison between TAPIR and BootsTAPIR on the real RoboCAT-NIST dataset. Due to the lack of ground truth, we show the TAPIR and BootsTAPIR predictions side-by-side in rainbow-tail style. On NIST, BootsTAPIR predicts locations more consistently; in particular, points that were originally predicted as occluded are now correctly marked visible.

Figure 4 further illustrates improvements on RoboCAT-NIST. Due to the lack of ground truth, we sample a grid of points on the red pixels and display a few examples comparing the predicted tracks between the two models. As these are rigid objects, we expect the points to move consistently within each gear; deviations from this are errors. Due to the lack of texture on the gears and the nontrivial domain gap, the original TAPIR trained on Kubric works poorly here, with many jittery tracks and severe tracking failures, particularly for points that are close to occlusion or move outside the image boundary. The bootstrapped model fixes many of these failures: the tracks are much smoother and the occlusion predictions become much more accurate. Results are comparable on Libero, although the motions there are more complicated and unsuitable for a static figure; see our project webpage for video visualizations.

4.4 Ablations

We focus on four main areas of ablation: data transformations, pseudolabel filtering approaches, training setup, and training data. To arrive at our final model, we performed ablations in a smaller-scale base setting with our best guesses at the optimal hyperparameter settings. This setting includes two components that we found could be removed without harming performance. It uses an additional mask on the occlusion loss, inspired by FixMatch [55], where any occlusion estimate that the teacher is uncertain about, i.e. $\max(\sigma(\hat{o}_{\mathcal{T}}[t]), 1-\sigma(\hat{o}_{\mathcal{T}}[t])) < 0.6$, is ignored in the loss. It uses a 3D-ConvNet backbone, which we find provides a slight improvement on DAVIS while harming performance on Kinetics (see Appendix 0.C), so we remove it for future compatibility with causal TAPIR models. Finally, base also halves the batch sizes (and proportionally halves the learning rate), and also halves the number of training steps. We report Average Jaccard on DAVIS using the strided mode and on Kinetics using the q_first mode.
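
The confidence mask follows directly from the inequality above; a minimal sketch (function names are ours, not the released code's):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def occlusion_loss_mask(teacher_occ_logits, confidence=0.6):
    """FixMatch-style mask over occlusion loss terms.

    Keeps a term only where the teacher is confident, i.e. where
    max(sigma(o), 1 - sigma(o)) >= confidence; uncertain estimates
    are dropped from the loss.
    """
    p = sigmoid(teacher_occ_logits)
    return np.maximum(p, 1.0 - p) >= confidence
```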

Data transformations.

We first investigate the effect of the transformations we apply to inputs and outputs in this setting. We respectively ablate: the use of random JPEG augmentations to enforce invariance to various factors of variation (denoted base-no-augm); and the use of framewise affine transformations on inputs and outputs to enforce equivariance to spatial transformations (denoted base-no-affine). We also investigate sampling the student queries: recall that in our typical setup, we sample the student query from a distribution which places probability 0.5 on the original teacher query point, and 0.5 on a uniform distribution across visible points. In base-same-queries, we always use the teacher's query for the student, and in base-uniform, we sample from a purely uniform distribution. We report the results for each ablation in Table 3(a). We observe that removing JPEG augmentation somewhat harms metrics, especially on Kinetics. In contrast, when ablating affine transformations, performance drops massively across metrics, suggesting overfitting. Finally, we find that using different query points improves performance compared with base-same-queries, leading to more accurate position predictions in particular ($<\delta^{x}_{avg}$ increases from 77.5 to 77.9 on DAVIS strided and from 66.8 to 67.7 on Kinetics q_first). Note, however, that on DAVIS in particular, this improvement depends on sampling the original teacher query point more often than the others.
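
The query-sampling scheme described above can be sketched as follows (a hypothetical helper; `p_teacher=0.5` matches the distribution in the text):

```python
import numpy as np

def sample_student_query(teacher_query, visible_points, rng, p_teacher=0.5):
    """Sample the student's query point.

    With probability p_teacher, reuse the teacher's query; otherwise
    draw uniformly from the points the teacher predicts as visible.
    """
    if rng.random() < p_teacher:
        return teacher_query
    idx = rng.integers(len(visible_points))
    return visible_points[idx]
```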

Table 3: Ablations of model hyperparameters, including (a) ablations of the data transformations and query point strategies, (b) comparisons of the pseudolabel filtering approaches, (c) ablations of the training setup, including the stop gradient, and (d) ablations of the dataset. We report Average Jaccard (AJ) across all experiments.

(a) Data transformations.

(b) Pseudolabel filtering. base filters occlusion loss terms based on teacher confidence.

(c) Training setup.

(d) Training data.

Pseudolabel filtering.

We next consider the effectiveness of filtering possibly-incorrect teacher tracks and points, with results in Table 3(b). base-no-filtering removes the filtering that base applies to the occlusion confidence score, which makes little difference on DAVIS but degrades performance on Kinetics. base+cycle instead uses the cycle-consistency criterion from our full model and performs slightly better on DAVIS. These results suggest that reliably removing bad teacher tracks remains an open problem.
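
A minimal sketch of cycle-consistency filtering, assuming the teacher's predicted point has already been re-tracked back to the query frame; the threshold value here is a placeholder, not our actual $\delta_{cycle}$:

```python
import numpy as np

def cycle_consistency_mask(query_xy, back_tracked_xy, delta_cycle=4.0):
    """Keep a pseudo-label only if re-tracking the teacher's prediction
    back to the query frame lands within delta_cycle pixels of the
    original query point.

    query_xy: (2,) original query location; back_tracked_xy: (N, 2)
    back-tracked locations, one per candidate pseudo-label.
    """
    err = np.linalg.norm(back_tracked_xy - query_xy, axis=-1)
    return err <= delta_cycle
```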

Training setup.

Table 3(c) shows ablations of the overall training setup. In particular, training for longer with a higher-capacity model can improve results, so full-kubric-only uses an identical training setup to our full model but removes self-supervised training, instead simply training on Kubric for longer. We see competitive performance, although self-supervised training still improves results by 1.2% on DAVIS and almost 2% on Kinetics. siamese shows the effect of removing the EMA and stop-gradient and instead backpropagating to both student and teacher models (in this case, using the base setting): performance on real-world datasets collapses as the model finds trivial shortcuts.
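
The mean-teacher update that siamese removes can be sketched as follows; gradients never flow into the teacher (the stop-gradient), and the decay value here is a placeholder, not our training hyperparameter:

```python
def ema_update(teacher_params, student_params, decay=0.99):
    """Exponential-moving-average teacher update.

    The teacher tracks a slowly-moving average of the student's
    parameters; it is never updated by backpropagation.
    """
    return {k: decay * teacher_params[k] + (1.0 - decay) * student_params[k]
            for k in teacher_params}
```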

Training data.

We ablate two questions regarding the dataset. Prior work has argued that simple semi-supervised learning for optical flow performs poorly [31, 39]; we hypothesize that more temporal context may be the key ingredient that changes this story. To validate this, we re-ran our algorithm using 2- and 6-frame clips from our full dataset. In Table 3(d), we see that this indeed performs poorly, possibly because the extra frames allow the teacher model to correct more errors. Interestingly, we also tried training on a 1% subset of the data, and found that this harms performance on Kinetics but actually improves it on DAVIS. It is possible that the algorithm begins overfitting to the data, but this may be useful for clean data like DAVIS. Regardless, it suggests this algorithm can be effective even in situations where less data is available.

5 Higher Resolution and Public Release

The publicly released version of BootsTAPIR contains a fix for a minor bug that was present in previous versions of TAPIR. Specifically, recall that the data augmentations used for the Kubric dataset include a random axis-aligned crop. The image cropping mechanism was not pixel-aligned with the transforms used for points, leading to an almost imperceptible error in the track locations. Fixing this bug improves performance for the original TAPIR model, but surprisingly has relatively little effect on BootsTAP performance. However, we find that the reason is not that BootsTAP compensates for the bug, but rather that the bug creates a bias toward tracking foreground objects (the tracks tend to be slightly expanded relative to the underlying objects). We can replicate this bias by altering query points that are very near occlusion edges (1 pixel away) to track the foreground object rather than the background, which we call the “snap to occluder” technique. See Appendix 0.B.4 for details.
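
The idea behind “snap to occluder” can be sketched as follows, assuming a per-frame foreground mask is available (as in Kubric's ground-truth segmentations); the real procedure differs in its details (see Appendix 0.B.4), and this helper is purely illustrative:

```python
import numpy as np

def snap_to_occluder(query_yx, fg_mask):
    """If a query lands on background within 1 pixel of a foreground
    boundary, move it onto the foreground so the track follows the
    occluder rather than the background.

    query_yx: (row, col) query coordinates; fg_mask: (H, W) bool.
    """
    y, x = int(round(query_yx[0])), int(round(query_yx[1]))
    if fg_mask[y, x]:
        return query_yx  # already on the foreground object
    H, W = fg_mask.shape
    # Search the 8-neighbourhood for foreground within 1 pixel.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W and fg_mask[ny, nx]:
                return (float(ny), float(nx))
    return query_yx  # no occlusion edge nearby; leave unchanged
```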

To further tune performance, we also trained on higher-resolution and longer clips, as we find these improve generalization for real-world applications with longer or higher-resolution videos. To implement this, we add more ‘tasks’ with different data shapes, using the same multi-task framework (i.e., separate optimizers) as described above. Specifically, one extra task uses 512×512 Kubric clips (24 frames), trained using the same losses. We use the hierarchical refinement approach described in the original TAPIR paper, wherein the initialization and one refinement pass are performed at 256×256, and a further refinement pass is performed at 512×512. We also add an analogous high-resolution self-supervised task, which uses 24-frame, 512×512 videos from the same real-world dataset. Finally, we add 150-frame, 256×256 videos, this time at 30 frames per second.

Table 4: Comparison of performance on the TAP-Vid datasets for the released version of BootsTAPIR. Fix refers to the bugfix to coordinates. Snap refers to the snap-to-occluder bias in the training data. Data refers to extra training data which has longer clips and higher resolution.

Table 5: Comparison of performance under query-first metrics for Kinetics, TAP-Vid-DAVIS, and RoboTAP (query-first is the standard mode for RoboTAP).

Tables 4 and 5 show our results. Note that CoTracker implemented its own data augmentation algorithms and is not affected by the same bug. We see that “snap to occluder” harms TAPIR performance but improves BootsTAP performance. One possible interpretation is that the snapping compensates for a bias in the bootstrapping toward tracking background, perhaps because background is easier to track, especially relative to thin objects. In a bootstrapping framework, the model's reliable predictions that follow background become self-reinforcing, whereas unreliable predictions for thin foreground objects are not, and therefore tend to get lost over time. Finding more principled solutions to this issue is an interesting area for future work.

The extra training data, however, leads to a non-trivial boost in performance. Note that, for Tables 4 and 5, all evaluation videos are still at 256×256, and unlike many prior methods we do not upsample them before creating the feature representation. To assess the impact of increased evaluation resolution, we also performed evaluation on 512×512 videos; Table 6 shows the results. Performance improves by 1.2% on Kinetics and 2.8% on DAVIS, the best reported performance on this dataset by a wide margin. Surprisingly, we found that further increasing resolution at test time did not improve results, suggesting another interesting area for future work.

Table 6: Comparison of performance for high-resolution setting.

RoboTAP [62] pointed out that point tracking can be very useful in an online setting, e.g. when used as a signal to control agents in real time. It is straightforward to extend BootsTAPIR to the online setting: the only temporal dependency of the model is in the 1D convolutions in the iterative refinements, so these can be directly converted into causal convolutions to create a causal model. We trained this model using the full training setup for the release model, including the extra high-resolution, long-clip data. Table 7 shows results. We see an overall 4.6% improvement on Kinetics and a 3.0% improvement on DAVIS, in both cases using the query-first evaluation procedure.
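
Converting a temporal convolution to a causal one amounts to left-padding the sequence, so each output depends only on the current and past frames. A single-channel sketch (not the model's actual convolution code):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1D correlation: output[t] depends only on x[:t + 1].

    Left-pads with (k - 1) zeros, so the kernel never reads future
    frames, unlike a standard 'same'-padded temporal convolution.
    """
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    return np.array([np.dot(padded[t:t + k], kernel) for t in range(len(x))])
```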

Table 7: Causal model performance.

We also perform experiments on the point tracks in the Perception Test [48] validation set, a challenging dataset of point tracks annotated on videos of unusual situations filmed by participants. Results are shown in Table 8; we see a similar magnitude of gap over prior results.

Table 8: Performance on Perception Test relative to TAPIR.

As a final note, we performed informal benchmarking of our model using an A100 and the latest JAX compiler. We found that after compilation, BootsTAPIR can perform inference for 10,000 points on a 256×256, 50-frame video in 5.6 seconds. Furthermore, the causal model can track 400 points on a 256×256 video at 30.1 frames per second.

6 Conclusion

In this work we presented an effective method for leveraging large-scale, unlabeled data to improve TAP performance. We have demonstrated that a straightforward application of consistency principles, namely invariance to query points and non-spatial corruptions, and equivariance to affine transformations, enables the model to continue to improve on unlabeled data. Our formulation avoids more complex priors, such as spatial smoothness of motion or temporal smoothness of tracks, that are used in many prior works. In fact, our formulation bears similarities to baselines for two-frame, self-supervised optical flow that are considered too “unstable” to be effective (cf. Fig. 2(a) in “Flow Supervisor” [24]). Yet with our multi-frame approach, we ultimately surpass state-of-the-art performance by a large margin. We find little evidence of the model ‘overfitting’ to its own biases in ways that cause performance to degrade with long training, as in other work [57]. Instead, we find that performance continues to improve for as long as we train the model. Our work does have some limitations: training remains computationally expensive. Furthermore, our estimated correspondence is a single point estimate throughout the entire video, which means we cannot elegantly handle duplicated or rotationally-symmetric objects where the actual correspondence is ambiguous. Nevertheless, our approach demonstrates that it is possible to better bridge the sim-to-real gap using self-supervised learning.

References

  • [1] Balasingam, A., Chandler, J., Li, C., Zhang, Z., Balakrishnan, H.: Drivetrack: A benchmark for long-range point tracking in real-world videos. arXiv preprint arXiv:2312.09523 (2023)
  • [2] Bharadhwaj, H., Mottaghi, R., Gupta, A., Tulsiani, S.: Track2Act: Predicting point tracks from internet videos enables diverse zero-shot robot manipulation. arXiv preprint arXiv:2405.01527 (2024)
  • [3] Bian, W., Huang, Z., Shi, X., Dong, Y., Li, Y., Li, H.: Context-pips: Persistent independent particles demands context features. NeurIPS (2024)
  • [4] Bian, Z., Jabri, A., Efros, A.A., Owens, A.: Learning pixel trajectories with multiscale contrastive random walks. In: Proc. CVPR (2022)
  • [5] Boreczky, J.S., Rowe, L.A.: Comparison of video shot boundary detection techniques. Journal of Electronic Imaging 5(2), 122–128 (1996)
  • [6] Bousmalis, K., Vezzani, G., Rao, D., Devin, C., Lee, A.X., Bauza, M., Davchev, T., Zhou, Y., Gupta, A., Raju, A., et al.: Robocat: A self-improving foundation agent for robotic manipulation. arXiv preprint arXiv:2306.11706 (2023)
  • [7] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: Proc. CVPR. pp. 6299–6308 (2017)
  • [8] Chen, W., Chen, L., Wang, R., Pollefeys, M.: Leap-vo: Long-term effective any point tracking for visual odometry. arXiv preprint arXiv:2401.01887 (2024)
  • [9] Dekel, T., Rubinstein, M., Liu, C., Freeman, W.T.: On the effectiveness of visible watermarks. In: Proc. CVPR (2017)
  • [10] Denil, M., Bazzani, L., Larochelle, H., de Freitas, N.: Learning where to attend with deep architectures for image tracking. Neural computation 24(8), 2151–2184 (2012)
  • [11] Doersch, C., Gupta, A., Efros, A.A.: Unsupervised visual representation learning by context prediction. In: Proc. ICCV (2015)
  • [12] Doersch, C., Gupta, A., Markeeva, L., Recasens, A., Smaira, L., Aytar, Y., Carreira, J., Zisserman, A., Yang, Y.: TAP-Vid: A benchmark for tracking any point in a video. NeurIPS (2022)
  • [13] Doersch, C., Yang, Y., Vecerik, M., Gokay, D., Gupta, A., Aytar, Y., Carreira, J., Zisserman, A.: TAPIR: Tracking any point with per-frame initialization and temporal refinement. arXiv preprint arXiv:2306.08637 (2023)
  • [14] Doersch, C., Zisserman, A.: Multi-task self-supervised visual learning. In: Proc. ICCV (2017)
  • [15] Földiák, P.: Learning invariance from transformation sequences. Neural computation 3(2), 194–200 (1991)
  • [16] Goroshin, R., Bruna, J., Tompson, J., Eigen, D., LeCun, Y.: Unsupervised learning of spatiotemporally coherent metrics. In: Proc. ICCV (2015)
  • [17] Goroshin, R., Mathieu, M.F., LeCun, Y.: Learning to linearize under uncertainty. NeurIPS (2015)
  • [18] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
  • [19] Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: Proc. CVPR (2022)
  • [20] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent - a new approach to self-supervised learning. In: NeurIPS (2020)
  • [21] Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: Proc. CVPR (2006)
  • [22] Harley, A.W., Fang, Z., Fragkiadaki, K.: Particle video revisited: Tracking through occlusions using point trajectories. In: Proc. ECCV (2022)
  • [23] Huang, H.P., Herrmann, C., Hur, J., Lu, E., Sargent, K., Stone, A., Yang, M.H., Sun, D.: Self-supervised autoflow. In: Proc. CVPR (2023)
  • [24] Im, W., Lee, S., Yoon, S.E.: Semi-supervised learning of optical flow by flow supervisor. In: Proc. ECCV (2022)
  • [25] Jabri, A., Owens, A., Efros, A.: Space-time correspondence as a contrastive random walk. NeurIPS 33, 19545–19560 (2020)
  • [26] Janai, J., Guney, F., Ranjan, A., Black, M., Geiger, A.: Unsupervised learning of multi-frame optical flow with occlusions. In: Proc. ECCV (2018)
  • [27] Janai, J., Guney, F., Wulff, J., Black, M.J., Geiger, A.: Slow flow: Exploiting high-speed cameras for accurate and diverse optical flow reference data. In: Proc. CVPR (2017)
  • [28] Jiang, W., Trulls, E., Hosang, J., Tagliasacchi, A., Yi, K.M.: COTR: Correspondence transformer for matching across images. In: Proc. ICCV (2021)
  • [29] Karaev, N., Rocco, I., Graham, B., Neverova, N., Vedaldi, A., Rupprecht, C.: CoTracker: It is better to track together. arXiv preprint arXiv:2307.07635 (2023)
  • [30] Kimble, K., Van Wyk, K., Falco, J., Messina, E., Sun, Y., Shibata, M., Uemura, W., Yokokohji, Y.: Benchmarking protocols for evaluating small parts robotic assembly systems. Proc. Intl. Conf. on Robotics and Automation 5(2), 883–889 (2020)
  • [31] Lai, W.S., Huang, J.B., Yang, M.H.: Semi-supervised learning for optical flow with generative adversarial networks (2017)
  • [32] Lai, Z., Lu, E., Xie, W.: MAST: A memory-augmented self-supervised tracker. In: Proc. CVPR (2020)
  • [33] Lai, Z., Xie, W.: Self-supervised learning for video correspondence flow. arXiv preprint arXiv:1905.00875 (2019)
  • [34] Li, R., Zhou, S., Liu, D.: Learning fine-grained features for pixel-wise video correspondences. In: Proc. ICCV (2023)
  • [35] Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., Stone, P.: Libero: Benchmarking knowledge transfer for lifelong robot learning. NeurIPS 36 (2024)
  • [36] Liu, L., Zhang, J., He, R., Liu, Y., Wang, Y., Tai, Y., Luo, D., Wang, C., Li, J., Huang, F.: Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. In: Proc. CVPR (2020)
  • [37] Liu, P., King, I., Lyu, M.R., Xu, J.: Ddflow: Learning optical flow with unlabeled data distillation. In: Proceedings of the AAAI conference on artificial intelligence. vol.33, pp. 8770–8777 (2019)
  • [38] Liu, P., Lyu, M., King, I., Xu, J.: Selflow: Self-supervised learning of optical flow. In: Proc. CVPR (2019)
  • [39] Liu, P., Lyu, M.R., King, I., Xu, J.: Learning by distillation: a self-supervised learning framework for optical flow estimation. IEEE PAMI 44(9), 5026–5041 (2021)
  • [40] Marsal, R., Chabot, F., Loesch, A., Sahbi, H.: Brightflow: Brightness-change-aware unsupervised learning of optical flow. In: Proc. WACV (2023)
  • [41] Mas, J., Fernandez, G.: Video shot boundary detection based on color histogram. In: TRECVID (2003)
  • [42] Meister, S., Hur, J., Roth, S.: Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In: Proceedings of the AAAI conference on artificial intelligence. vol.32 (2018)
  • [43] Moing, G.L., Ponce, J., Schmid, C.: Dense optical tracking: Connecting the dots. In: Proc. CVPR (2024)
  • [44] Neoral, M., Šerỳch, J., Matas, J.: MFT: Long-term tracking of every pixel. In: Proc. WACV (2024)
  • [45] Novák, T., Šochman, J., Matas, J.: A new semi-supervised method improving optical flow on distant domains. In: Computer Vision Winter Workshop. vol.3 (2020)
  • [46] Ochs, P., Malik, J., Brox, T.: Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence 36(6), 1187–1200 (2013)
  • [47] OpenAI: GPT-4V(ision) system card (September 25, 2023)
  • [48] Patraucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula, S., Malinowski, M., Yang, Y., Doersch, C., Matejovicova, T., Sulsky, Y., Miech, A., Frechette, A., Klimczak, H., Koster, R., Zhang, J., Winkler, S., Aytar, Y., Osindero, S., Damen, D., Zisserman, A., Carreira, J.: Perception test: A diagnostic benchmark for multimodal video models. NeurIPS
  • [49] Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: Proc. CVPR (2016)
  • [50] Rajič, F., Ke, L., Tai, Y.W., Tang, C.K., Danelljan, M., Yu, F.: Segment anything meets point tracking. arXiv preprint arXiv:2307.01197 (2023)
  • [51] Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., Zha, H.: Unsupervised deep learning for optical flow estimation. In: Proceedings of the AAAI conference on artificial intelligence. vol.31 (2017)
  • [52] Rubinstein, M., Liu, C., Freeman, W.T.: Towards longer long-range motion trajectories. In: Proc. BMVC (2012)
  • [53] Sand, P., Teller, S.: Particle video: Long-range motion estimation using point trajectories. Proc. ICCV (2008)
  • [54] Shen, Y., Hui, L., Xie, J., Yang, J.: Self-supervised 3d scene flow estimation guided by superpoints. In: Proc. CVPR (2023)
  • [55] Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L.: Fixmatch: Simplifying semi-supervised learning with consistency and confidence (2020)
  • [56] Stone, A., Maurer, D., Ayvaci, A., Angelova, A., Jonschkowski, R.: Smurf: Self-teaching multi-frame unsupervised raft with full-image warping. In: Proc. CVPR (2021)
  • [57] Sun, X., Harley, A.W., Guibas, L.J.: Refining pre-trained motion models. arXiv preprint arXiv:2401.00850 (2024)
  • [58] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NeurIPS (2017)
  • [59] Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
  • [60] Teed, Z., Deng, J.: RAFT: Recurrent all-pairs field transforms for optical flow. In: Proc. ECCV (2020)
  • [61] Truong, B.T., Dorai, C., Venkatesh, S.: New enhancements to cut, fade, and dissolve detection processes in video segmentation. In: Proceedings of the eighth ACM international conference on Multimedia. pp. 219–227 (2000)
  • [62] Vecerik, M., Doersch, C., Yang, Y., Davchev, T., Aytar, Y., Zhou, G., Hadsell, R., Agapito, L., Scholz, J.: RoboTAP: Tracking arbitrary points for few-shot visual imitation. In: Proc. Intl. Conf. on Robotics and Automation (2024)
  • [63] Vondrick, C., Shrivastava, A., Fathi, A., Guadarrama, S., Murphy, K.: Tracking emerges by colorizing videos. In: Proc. ECCV (2018)
  • [64] Wang, J., Karaev, N., Rupprecht, C., Novotny, D.: Visual geometry grounded deep structure from motion. Proc. CVPR (2024)
  • [65] Wang, Q., Chang, Y.Y., Cai, R., Li, Z., Hariharan, B., Holynski, A., Snavely, N.: Tracking everything everywhere all at once. In: Proc. ICCV (2023)
  • [66] Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: Proc. ICCV (2015)
  • [67] Wang, X., Jabri, A., Efros, A.A.: Learning correspondence from the cycle-consistency of time. In: Proc. CVPR (2019)
  • [68] Wang, Y., Yang, Y., Yang, Z., Zhao, L., Wang, P., Xu, W.: Occlusion aware unsupervised learning of optical flow. In: Proc. CVPR (2018)
  • [69] Wen, C., Lin, X., So, J., Chen, K., Dou, Q., Gao, Y., Abbeel, P.: Any-point trajectory modeling for policy learning. arXiv preprint arXiv:2401.00025 (2023)
  • [70] Wiskott, L., Sejnowski, T.J.: Slow feature analysis: Unsupervised learning of invariances. Neural computation 14(4), 715–770 (2002)
  • [71] Yu, E., Blackburn-Matzen, K., Nguyen, C., Wang, O., Habib Kazi, R., Bousseau, A.: VideoDoodles: Hand-drawn animations on videos with scene-aware canvases. ACM Transactions on Graphics 42(4), 1–12 (2023)
  • [72] Yu, J.J., Harley, A.W., Derpanis, K.G.: Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In: ECCV 2016 Workshops (2016)
  • [73] Yusoff, Y., Christmas, W.J., Kittler, J.: Video shot cut detection using adaptive thresholding. In: Proc. BMVC (2000)
  • [74] Zheng, Y., Harley, A.W., Shen, B., Wetzstein, G., Guibas, L.J.: PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In: Proc. CVPR (2023)

Appendix 0.A Summary of the approach

We summarize the notation and the computation of our self-supervised loss in Algorithm 1.

Algorithm 1: BootsTAP self-supervised loss.

Notation:

  • $\mathcal{U}(D)$ refers to the uniform distribution over domain $D$.
  • We denote queries as $Q = (q, t)$, where $q$ is x/y coordinates and $t$ is a frame index. In a slight abuse of notation, we write $\Phi_t$ for both the transformation and the mapping that transforms coordinates and leaves other model outputs unchanged.
  • $X$ – video of shape $T \times H \times W \times C$.
  • $f$ – model.
  • $\Theta$, $\xi$ – student parameters, teacher parameters.
  • $\mathcal{A}$, $\mathcal{D}_{\Phi}$ – distribution over augmentations, distribution over transformations.
  • $\mathcal{V} \mapsto \mathcal{D}_{\mathcal{V}}$ – mapping that maps a set of points $\mathcal{V}$ to a distribution $\mathcal{D}_{\mathcal{V}}$ over $\mathcal{V}$.
  • $\delta$, $\delta_{cycle}$ – threshold values for the uncertainty target definition and the cycle-consistency filtering criterion.
  • $d(\cdot, \cdot)$ – distance function.

Steps:

  1. Uniformly sample teacher query points $Q_1 \sim \mathcal{U}([0, H) \times [0, W) \times \llbracket 0, T-1 \rrbracket)$.
  2. Sample an augmentation $a \sim \mathcal{A}$ and a frame-wise affine transformation $\Phi = \{\Phi_t\}_t \sim \mathcal{D}_{\Phi}$.
  3. Augment and transform each frame to form $X'$: $\forall t,\; X'_t \leftarrow \mathrm{resample}(a(X_t), \Phi_t)$.
  4. For each query point $Q_1 = (q_1, t_1)$:
     1. Predict tracks and occlusions with the teacher model: $\{\hat{p}_{\mathcal{T}}[t], \hat{o}_{\mathcal{T}}[t]\}_t \leftarrow f(X, Q_1; \xi)$.
     2. Derive pseudo-labels from the teacher predictions with: $p_{\mathcal{T}}[t] = \hat{p}_{\mathcal{T}}[t]$; $\; o_{\mathcal{T}}[t] = \mathbb{1}(\hat{o}_{\mathcal{T}}[t] > 0)$; $\; u_{\mathcal{T}}[t] = \mathbb{1}(d(p_{\mathcal{T}}[t], \hat{p}_{\mathcal{S}}[t]) > \delta)$.
     3. Calling $\mathcal{V}$ the set of visible points along the teacher trajectory, sample $Q_2 = (q_2, t_2) \sim \mathcal{D}_{\mathcal{V}}$.
     4. Transform the query point: $Q'_2 \leftarrow (\Phi_{t_2}(q_2), t_2)$.
     5. Predict tracks with the student model and transform the predicted coordinates with the inverse of $\Phi_t$: $\{\hat{p}_{\mathcal{S}}[t], \hat{o}_{\mathcal{S}}[t], \hat{u}_{\mathcal{S}}[t]\}_t \leftarrow \Phi_t^{-1}(f(X', Q'_2; \Theta))$.
     6. Compute the mask used to filter out loss terms (when $t_1$ and $t_2$ differ): $m_{cycle} = \mathbb{1}\left(d(\hat{p}_{\mathcal{S}}[t_1], q_1) < \delta_{cycle}\right) \cdot \mathbb{1}\left(\hat{o}_{\mathcal{S}}[t_1] \leq 0\right)$.
     7. Compute the loss: $\mathcal{L}_{SSL} = m_{cycle} \cdot \frac{1}{T} \sum_t \ell_{ssl}^t$, where $\ell_{ssl}^t$ is the self-supervised TAPIR loss term for frame $t$.
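The steps of Algorithm 1 can be sketched in plain NumPy as follows. This is an illustrative rendering, not the training code: the model `f`, the photometric augmentation `a`, and the coordinate/frame transforms `phi`/`phi_inv` are assumed to be supplied by the caller; the uncertainty head is omitted, and the per-frame loss is a squared-error placeholder for the full TAPIR loss.

```python
import numpy as np

def bootstap_ssl_loss(f, a, phi, phi_inv, X, theta, xi,
                      delta_cycle=4.0, rng=np.random):
    """Illustrative sketch of Algorithm 1 (names are ours, not the paper's code).

    f(video, (q, t), params) -> (positions [T, 2], occlusion logits [T]);
    phi(x, t) / phi_inv(x, t) apply the frame-wise affine transform (and its
    inverse) to either a frame or a coordinate pair; a(frame) augments a frame.
    """
    T, H, W, _ = X.shape

    # Step 1: uniformly sample a teacher query point Q1 = (q1, t1).
    q1 = np.array([rng.uniform(0, W), rng.uniform(0, H)])
    t1 = rng.randint(T)

    # Teacher predictions on the clean video (no gradients flow here).
    p_T, o_T_logit = f(X, (q1, t1), xi)
    o_T = o_T_logit > 0                               # occlusion pseudo-labels

    # Sample the student query Q2 among frames where the teacher track is visible.
    t2 = rng.choice(np.flatnonzero(~o_T))
    q2 = p_T[t2]

    # Student sees the augmented, transformed video and the transformed query.
    X_aug = np.stack([phi(a(X[t]), t) for t in range(T)])
    p_S, o_S_logit = f(X_aug, (phi(q2, t2), t2), theta)
    p_S = np.stack([phi_inv(p_S[t], t) for t in range(T)])  # back to original coords

    # Cycle-consistency mask: student must recover q1 at t1 and call it visible.
    m_cycle = float(np.linalg.norm(p_S[t1] - q1) < delta_cycle
                    and o_S_logit[t1] <= 0)

    # Placeholder per-frame loss against the position pseudo-labels; the paper
    # uses the full TAPIR loss (position, occlusion and uncertainty terms).
    per_frame = np.sum((p_S - p_T) ** 2, axis=-1)
    return m_cycle * per_frame.mean()
```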

Appendix 0.B Implementation details

0.B.1 Distribution over transformations

We design transformations of the inputs to enforce equivariance of the predictions with realistic spatial transformations. At a high level, we intend for the transformations to mimic the effects of additional, simple and plausible camera motion and zooming on the video. Hence, our transformations should vary smoothly in time; they should cover a reasonable ratio of the original video content; and aspect ratio should be roughly preserved.

We define a family of frame-wise affine transformations with these properties, along with a procedure for sampling them randomly. Essentially, we sample top-left crop coordinates and crop dimensions for each frame of the video, where coordinates and dimensions are computed by interpolating between values sampled for the start and end frames from a distribution that achieves the desired coverage and aspect ratios.

More formally, we first sample a pair of spatial dimensions $(H_0, W_0)$ for the start frame as follows. We sample an area $A$ uniformly over $[0.6, 1.0]$. Next, we sample values $a^1, a^2 \sim \mathcal{U}([A, 1])$, derive a random height value by averaging them, $h = \frac{a^1 + a^2}{2}$, and a width value $w = \frac{A}{h}$; finally, we multiply these values by the input's original shape $(H, W)$. This gives us a pair of spatial dimensions biased towards aspect ratios close to 1, and covering an area between 60% and 100% of the original input. We proceed the same way to sample a pair of spatial dimensions $(H_{T-1}, W_{T-1})$ for the end frame.
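Concretely, the crop-dimension sampling for one frame might look as follows (a minimal sketch; the function name and RNG handling are ours):

```python
import numpy as np

def sample_crop_dims(H, W, rng=np.random):
    """Sample crop dimensions (H0, W0): the crop covers a fraction
    A ~ U([0.6, 1.0]) of the frame area, with aspect ratio biased towards 1."""
    A = rng.uniform(0.6, 1.0)              # target fraction of the frame area
    a1, a2 = rng.uniform(A, 1.0, size=2)
    h = (a1 + a2) / 2.0                    # relative height, in [A, 1]
    w = A / h                              # relative width, so that h * w == A
    return h * H, w * W
```

Since $h, w \le 1$, the crop always fits inside the frame, and $h\,w = A$ guarantees the sampled area fraction exactly.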

Next, we uniformly sample a pair of top-left corner coordinates $(C_0^x, C_0^y)$ for the start frame, such that a crop of dimensions $(H_0, W_0)$ can be extracted within the frame. We proceed the same way to sample a pair of coordinates $(C_{T-1}^x, C_{T-1}^y)$ for the end frame, given $(H_{T-1}, W_{T-1})$.

We then interpolate linearly between the start and end spatial dimensions on one hand, and between the start and end top-left corner coordinates on the other. Let $t \in \{0, \dots, T-1\}$ be a frame index. Calling $\alpha_t = \frac{t}{T-1}$, we define:

$$h_t = (1 - \alpha_t)\, H_0 + \alpha_t\, H_{T-1} \qquad (6)$$
$$w_t = (1 - \alpha_t)\, W_0 + \alpha_t\, W_{T-1} \qquad (7)$$
$$c_t^x = (1 - \alpha_t)\, C_0^x + \alpha_t\, C_{T-1}^x \qquad (8)$$
$$c_t^y = (1 - \alpha_t)\, C_0^y + \alpha_t\, C_{T-1}^y. \qquad (9)$$

This gives us scaling parameters $(h_t, w_t)$ and translation parameters $(c_t^x, c_t^y)$ that vary linearly over time. Finally, our frame-wise affine transformations $\Phi = \{\Phi_t\}_t$ are defined as follows:

$$\forall t, \quad \Phi_t : (x, y) \mapsto \Big(\frac{w_t}{W}\, x + c_t^x,\; \frac{h_t}{H}\, y + c_t^y\Big). \qquad (10)$$

We refer to the distribution resulting from our sampling procedure as $\mathcal{D}_{\Phi}$. Given query point coordinates $Q = (q, t)$ and input frames $\{X_t\}_t$, the corresponding transformation is applied with:

$$Q' = (\Phi_t(q), t); \qquad \forall t,\; X'_t = \mathrm{resample}(X_t, \Phi_t), \qquad (11)$$

where $\mathrm{resample}(\cdot, \Phi_t)$ consists in scaling its input frame to resolution $(h_t, w_t)$ using bilinear interpolation and placing it within a zero-valued array of shape $(H, W)$ such that its top-left corner lies at coordinates $(c_t^x, c_t^y)$. We note that in our approach, this transformation is performed after augmenting each frame, i.e. on $a(X_t)$.
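Putting Eqs. (6)–(10) together, the per-frame parameters and the coordinate mapping $\Phi_t$ can be written as follows (a sketch with our own function names):

```python
import numpy as np

def interpolate_params(H0, W0, HT, WT, c0, cT, T):
    """Linearly interpolate crop dimensions and top-left corners over the
    T frames of the video (Eqs. 6-9)."""
    alpha = np.arange(T) / (T - 1)                 # alpha_t = t / (T - 1)
    h = (1 - alpha) * H0 + alpha * HT
    w = (1 - alpha) * W0 + alpha * WT
    cx = (1 - alpha) * c0[0] + alpha * cT[0]
    cy = (1 - alpha) * c0[1] + alpha * cT[1]
    return h, w, cx, cy

def phi_t(x, y, t, h, w, cx, cy, H, W):
    """Frame-wise affine transform of Eq. 10: map coordinates in the
    original frame to coordinates in the transformed frame."""
    return w[t] / W * x + cx[t], h[t] / H * y + cy[t]
```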

0.B.2 Training details

We train for 200,000 iterations on 256 NVIDIA A100 GPUs, with a batch size of 4 Kubric videos and 2 real videos per device. The extra layers consist of 5 residual blocks on top of the backbone (which has stride 8 and 256 channels); each block consists of 2 sequential $3\times 3$ convolutions with a channel expansion factor of 4, whose output is added to the block's input. We use a cosine learning rate schedule with 1000 warmup steps and a peak learning rate of 2e-4. We found that reducing the learning rate for the PIPs mixer steps by a factor of 5 relative to the backbone improved stability. We keep all other hyperparameters the same as TAPIR.
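The learning-rate schedule described above can be sketched as follows (an illustration; the exact warmup and decay formulas in our training code may differ in minor details):

```python
import math

def learning_rate(step, total_steps=200_000, warmup_steps=1000, peak=2e-4):
    """Cosine decay with linear warmup, peaking at 2e-4 after 1000 steps."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For the PIPs mixer parameters, the same schedule would be scaled down by a factor of 5.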

0.B.3 Libero finetuning

We compare results on Libero, using the gripper view, which contains large and difficult motions. Qualitative results show that BootsTAP trained on internet videos as described improves results substantially. However, since there is a large domain gap between Libero data and internet videos, it is natural to ask whether performance can be improved by further self-supervised training on the Libero dataset.

We use the full set of demonstrated trajectories in the dataset for all tasks, again using only the gripper view. We begin with the model trained as described in the main paper, and then further train it for another 50K steps using three tasks jointly: Kubric, internet videos, and Libero videos, again using separate optimizers for each and summing updates across tasks. We use an update weight of 0.2 for both self-supervised tasks, and keep all other parameters the same between Libero and the internet video tasks. We see that this approach further improves results despite having no labels: the model can track with surprisingly high fidelity over large changes in scale and viewpoint. See the attached html file for visualizations.

0.B.4 Snap-to-occluder

We aim to slightly modify the training objective to bias TAPIR towards tracking foreground objects rather than background, counteracting the tendency of bootstrapped models to track background. The Kubric data loader works by sampling query pixels randomly (biased toward objects), and then computing the full track by back-projecting into the relevant object's local coordinate system. We first modify this procedure to prevent sampling query pixels on the 'back side' of an occlusion boundary, defined as any pixel with a neighboring pixel (within a 3×3 square) whose depth is less than 95% of the pixel's own depth. After tracking points, we identify query points on the 'front side' of an occlusion boundary, i.e., those with a neighboring pixel whose depth is more than 105% of the query point's depth. If such pixels exist, then with 50% probability we randomly choose one of them and replace the query point with it. Therefore, in a small fraction of cases, the model receives a query point on the background but must track the foreground object instead.
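The two depth tests can be sketched as follows (function and variable names are ours; the Kubric loader's actual implementation may differ):

```python
import numpy as np

def occlusion_boundary_sides(depth, y, x):
    """Classify pixel (y, x) with respect to occlusion boundaries.

    'back side': some neighbor in the 3x3 window is closer than 95% of this
    pixel's depth (the pixel is background right behind an edge).
    'front side': some neighbor is farther than 105% of this pixel's depth
    (the pixel is foreground right in front of an edge).
    """
    window = depth[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
    d = depth[y, x]
    back_side = bool((window < 0.95 * d).any())
    front_side = bool((window > 1.05 * d).any())
    return back_side, front_side
```

Back-side pixels are rejected as queries outright; front-side query points are, with 50% probability, snapped onto one of their deeper neighbors.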

Appendix 0.C Comparison with and without a 3D ConvNet Backbone

Recall that TAPIR extracts features using a ResNet, with a final feature map of dimension 256 at stride 8 (although it also uses an earlier feature map at stride 4). The architecture is similar to a ResNet-18, and therefore has relatively little capacity to learn about the full diversity of objects in the world. Therefore, we add extra capacity: 5 more ResNet layers, each consisting of a LayerNorm, a $3\times 3$ convolution, a GeLU, and another $3\times 3$ convolution whose output is added to the input of the layer. As with TAPIR, our full model applies the feature extractor independently on every frame, meaning that the model cannot use temporal cues for feature extraction. Is this choice optimal? Intuitively, we might expect motion to provide segmentation cues that could enable better matching. Therefore, we develop an alternative model which adds a simple 3D ConvNet into the backbone: specifically, we convert the first convolution of each residual block layer into a $3\times 3\times 3$ convolution, giving the features a temporal receptive field of 21 frames.

We report results in Table 9. We observe that this yields a slight performance increase on TAP-Vid-DAVIS (strided evaluation) and, in particular, slightly improves the position accuracy, although it harms occlusion accuracy. However, it significantly degrades performance on Kinetics (query_first). It is possible that the model struggles more with the cuts or camera shake present in Kinetics. Hence, we keep a 2D backbone for the final model, although the optimal choice may depend on the desired downstream application.

Table 9: Architectures. We compare a 2D backbone and a 3D backbone using Temporal Shift Modules (TSM) to aggregate information locally over time.

Appendix 0.D Evaluation Datasets

TAP-Vid-Kinetics contains videos collected from the Kinetics-700-2020 validation set [7], originally focused on video action recognition. This benchmark contains 1K internet videos of diverse action categories, approximately 10 seconds long, including many challenging elements such as shot boundaries, multiple moving objects, dynamic camera motion, cluttered backgrounds, and dark lighting conditions. Each video contains ∼26 tracked points on average, obtained from careful human annotation.

TAP-Vid-DAVIS contains 30 real-world videos from the DAVIS 2017 validation set [49], a standard benchmark for video object segmentation, which was extended to TAP. Each video contains ∼22 point tracks, obtained using the same human annotation process as TAP-Vid-Kinetics.

TAP-Vid-RGB-Stacking contains 50 synthetic videos generated with Kubric[19] which simulate a robotic stacking environment. Each video contains 30 annotated point tracks and has a duration of 250 frames.

RoboTAP contains 265 real-world robotics manipulation videos with on average ∼272 frames and ∼44 annotated point tracks per video [62]. These videos are even longer, with textureless and symmetric objects that are far out-of-domain for both Kubric and the online lifestyle videos that we use for self-supervised learning.

RoboCAT-NIST is a subset of the data collected for RoboCat [6]. Inspired by the NIST benchmark for robotic manipulation [30], it includes gears of varying sizes (small, medium, large) and a 3-peg base, introduced for a systematic study of insertion affordance. All videos are collected by human teleoperation. It includes robot arms operating and inserting gears, which are a particularly challenging case due to rotational symmetry and lack of texture. In this work, we processed the videos to 64 frames at 222 × 296 resolution. This dataset is mainly for demonstration purposes; there are no human ground-truth point tracks.

Libero [35] is a dataset where point tracking has already proven useful for robotic manipulation [69]. It includes demonstrations of a human-driven robot arm performing a wide variety of tasks in a synthetic environment, intended for use in imitation learning. Sequences are of variable length at 128×128 resolution and have no ground-truth tracks.

0.D.1 Evaluation metrics

We use the same three evaluation metrics as proposed in [12]. (1) $<\delta^{x}_{avg}$ is the average position accuracy across 5 thresholds $\delta$: 1, 2, 4, 8, 16 pixels. For a given threshold $\delta$, it computes the proportion of visible (not occluded) points that are closer to the ground truth than the threshold. (2) Occlusion Accuracy (OA) is the average binary classification accuracy of the point occlusion prediction at each frame. (3) Average Jaccard (AJ) combines the two above metrics and is typically considered the target metric for this benchmark. It is the average Jaccard score across the same thresholds as $<\delta^{x}_{avg}$. Jaccard at $\delta$ measures both occlusion and position accuracy: it is the number of 'true positives', i.e., points within the threshold of a visible ground-truth point, divided by 'true positives' plus 'false positives' (points predicted visible where the ground truth is either occluded or farther than the threshold) plus 'false negatives' (ground-truth visible points that are predicted as occluded, or where the prediction is farther than the threshold).
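A minimal reading of these definitions for a single track (an illustrative sketch, not the official TAP-Vid evaluation code; the boolean convention and names are ours):

```python
import numpy as np

def tap_metrics(gt_pos, gt_occ, pred_pos, pred_occ,
                thresholds=(1, 2, 4, 8, 16)):
    """Compute <delta^x_avg, OA and AJ for one track of T frames.

    gt_pos/pred_pos: [T, 2] positions; gt_occ/pred_occ: [T] booleans
    (True = occluded).
    """
    dist = np.linalg.norm(gt_pos - pred_pos, axis=-1)
    visible = ~gt_occ
    oa = float((gt_occ == pred_occ).mean())          # occlusion accuracy
    pos_acc, jaccard = [], []
    for thr in thresholds:
        within = dist < thr
        pos_acc.append(within[visible].mean())       # visible points within thr
        tp = (visible & ~pred_occ & within).sum()    # correct visible prediction
        fp = (~pred_occ & ~(visible & within)).sum() # predicted visible, wrong
        fn = (visible & (pred_occ | ~within)).sum()  # missed visible point
        jaccard.append(tp / max(tp + fp + fn, 1))
    return float(np.mean(pos_acc)), oa, float(np.mean(jaccard))
```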

For TAP-Vid datasets, evaluation is split into strided mode and query-first mode. Strided mode samples query points every 5 frames along the ground-truth tracks, wherever they are visible. Query points can come from any time in the video, so this mode tests the model's ability to predict both forward and backward in time. Query-first mode samples each query point at the first frame where it is visible, and evaluation only measures tracking accuracy in subsequent frames.
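The two modes differ only in how queries are drawn from a ground-truth track; a sketch (function and parameter names are ours):

```python
import numpy as np

def sample_query_frames(gt_occ, stride=5, mode="strided"):
    """Return the query frame indices for one ground-truth track.

    gt_occ: [T] boolean, True where the ground-truth point is occluded.
    """
    if mode == "query_first":
        # single query: the first frame where the point is visible
        return np.flatnonzero(~gt_occ)[:1]
    # strided: every `stride`-th frame, kept only where the point is visible
    frames = np.arange(0, len(gt_occ), stride)
    return frames[~gt_occ[frames]]
```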
