Title: DeepOIS: Gyroscope-Guided Deep Optical Image Stabilizer Compensation
URL Source: https://arxiv.org/html/2101.11183
Haipeng Li¹  Shuaicheng Liu²,¹  Jue Wang¹
¹Megvii Technology  ²University of Electronic Science and Technology of China
Abstract
Images captured on mobile devices can be aligned using their gyroscope readings. An optical image stabilizer (OIS), however, eliminates this possibility by adjusting the image during capture. In this work, we propose a deep network that compensates for the motions introduced by the OIS, so that gyroscopes can again be used for image alignment on OIS cameras (code will be available at https://github.com/lhaippp/DeepOIS). To achieve this, we first record both videos and gyroscope readings with an OIS camera as training data, and convert the gyroscope readings into motion fields. Second, we propose a Fundamental Mixtures motion model for rolling shutter cameras, from which an array of rotations within a frame is extracted as the ground-truth guidance. Third, we train a convolutional neural network that takes gyroscope motions as input and compensates for the OIS motion. Once trained, the compensation network generalizes to other scenes, where image alignment is based purely on gyroscopes without any image content, delivering strong robustness. Experiments show that our results are comparable with those of non-OIS cameras, and outperform image-based alignment results by a relatively large margin.
1 Introduction
Image alignment is a fundamental research problem that has been studied for decades and applied in various applications[4, 45, 12, 41, 24]. Commonly adopted registration methods include homography[10], mesh-based deformation[45, 22], and optical flow[8, 29]. These methods rely on image content for registration, and often require rich textures[21, 46] and similar illumination[37] to produce good results.
In contrast, gyroscopes can be used to align images without relying on image content[17]. The gyroscope in a mobile phone provides the camera's 3D rotations, which can be converted into homographies given the camera intrinsic parameters[17, 14]. In this way, the rotational motions can be compensated; we refer to this as gyro image alignment. One drawback is that translations cannot be handled by the gyro. Fortunately, rotational motions dominate translational motions[34], especially when filming scenes or objects that are not close to the camera[23].
Compared with image-based methods, gyro-based methods are attractive for two reasons. First, they are independent of image content, which largely improves robustness. Second, gyros are widely available and easily accessed on everyday mobile phones. Many methods have built applications on top of them[15, 13, 44].
Figure 1: (a) inputs without the alignment, (b) gyroscope alignment on a non-OIS camera, (c) gyroscope alignment on an OIS camera, and (d) our method on an OIS camera. We replace the red channel of one image with that of the other image, where misaligned pixels are visualized as colored ghosts. The same visualization is applied for the rest of the paper.
On the other hand, smartphone cameras keep evolving, and the optical image stabilizer (OIS) has become increasingly popular, promising less blurry images and smoother videos. It compensates for 2D pan and tilt motions of the imaging device through lens mechanics[5, 42]. However, OIS eliminates the possibility of image registration by gyros: the homography derived from the gyros no longer corresponds to the captured images, which have been adjusted by the OIS with unknown magnitudes and directions. One might try to read the pans and tilts from the camera module, but this is not easy, as the OIS is bound to the camera sensor and requires assistance from the manufacturer's engineers[19].
In this work, we propose a deep learning method that compensates for the OIS motion without knowing its readings, such that the gyro can be used for image alignment on OIS-equipped cell-phones. Fig.1 shows an alignment example. Fig.1 (a) shows two input images. Fig.1 (b) is the gyro alignment produced by a non-OIS camera; the images are well aligned without OIS interference. Fig.1 (c) is the gyro alignment produced by an OIS camera; misalignments can be observed due to the OIS motion. Fig.1 (d) is our OIS-compensated result.
Let two frames be denoted as $I_a$ and $I_b$, and the gyro motion between them as $G_{ab}$. The real motion between the two frames (after the OIS adjustment) is $G'_{ab}$. We want to find a mapping function $f$ that transforms $G_{ab}$ into $G'_{ab}$:
$$G'_{ab} = f(G_{ab}). \tag{1}$$
We propose to train a supervised convolutional neural network for this mapping. To this end, we record videos and their gyro readings as training data. The input motion $G_{ab}$ can be obtained directly from the gyro readings. However, obtaining the ground-truth labels $G'_{ab}$ is non-trivial. We propose to estimate the real motion from the captured images. If we estimated a homography between them, translations would be included, which is inappropriate for rotation-only gyros. The ground truth should contain only the rotations between $I_a$ and $I_b$. Therefore, we estimate a fundamental matrix and decompose it to obtain the rotation matrix[14]. However, cell-phone cameras are rolling shutter (RS) cameras, where different rows of pixels have slightly different rotation matrices. In this work, we propose a Fundamental Mixtures model that estimates an array of fundamental matrices for the RS camera, such that rotational motions can be extracted as the ground truth. In this way, we can learn the mapping function.
For evaluations, we capture a testing dataset with various scenes, where we manually mark point correspondences for quantitative metrics. According to our experiments, our network can accurately recover the mapping, achieving gyro alignments comparable to non-OIS cameras. In summary, our contributions are:
• We propose a new problem: compensating OIS motions for gyro image alignment on cell-phones. To the best of our knowledge, this problem has not been explored, yet it is important to many image and video applications.
• We propose a solution that learns the mapping function between gyro motions and real motions, where a Fundamental Mixtures model is proposed under the RS setting for the real motions.
• We propose a dataset for evaluation. Experiments show that our method performs on par with non-OIS cameras, and outperforms image-based alternatives in challenging cases.
2 Related Work
2.1 Image Alignments
Homography[10], mesh-based[45], and optical flow[37] methods are the most commonly adopted motion models, aligning images at the global, middle, and pixel levels, respectively. They are often estimated by matching image features or optimizing a photometric loss[21]. Apart from classical features such as SIFT[25], SURF[3], and ORB[32], deep features have been proposed to improve robustness, e.g., LIFT[43] and SOSNet[38]. Registration can also be performed by deep learning directly, such as deep homography estimation[20, 46]. In general, without extra sensors, these methods align images based on image content.
2.2 Gyroscopes
Gyroscopes are important for estimating camera rotations during mobile capture. The fusion of gyroscope and visual measurements has been widely applied in various applications, including but not limited to image alignment and video stabilization[17], image deblurring[26], simultaneous localization and mapping (SLAM)[15], gesture-based user authentication on mobile devices[13], and human gait recognition[44]. On mobiles, one important issue is the synchronization between the timestamps of gyros and video frames, which requires gyro calibration[16]. In this work, we access the gyro data at the Hardware Abstraction Layer (HAL) of the Android architecture[36] to achieve accurate synchronization.
2.3 Optical Image Stabilizer
The Optical Image Stabilizer (OIS) has been commercially available since the mid-90s[33] and is increasingly common in our daily cell-phones. Both image capturing and video recording benefit from OIS, producing results with less blur and improved stability[19]. It works by controlling the path of the light through the lens and onto the image sensor: camera shake is measured by sensors such as the gyroscope, and electromagnetic motors move the lens horizontally or vertically to counteract it[5, 42]. Once a mobile is equipped with OIS, it cannot be easily turned off[27]. On one hand, OIS is good for daily users. On the other hand, it is unfriendly to mobile developers who need gyros to align images. In this work, we enable gyro image alignment on OIS cameras.
3 Algorithm
Figure 2: Overview of our algorithm, which includes (a) the gyro-based flow estimator, (d) the fundamental-based flow estimator, and (b) the neural network predicting an output flow. For each pair of frames $I_a$ and $I_b$, a homography array is computed from the gyroscope readings between $t_{I_a}$ and $t_{I_b}$, which is converted into the source motion $G_{ab}$ as the network input. On the other side, we estimate a Fundamental Mixtures model to produce the target flow $F_{ab}$ as the guidance. The network is then trained to produce the output $G'_{ab}$.
Our method is built upon convolutional neural networks. It takes a gyro-based flow $G_{ab}$ from the source frame $I_a$ to the target frame $I_b$ as input, and produces an OIS-compensated flow $G'_{ab}$ as output. Our pipeline consists of three modules: a gyro-based flow estimator, a Fundamental Mixtures flow estimator, and a fully convolutional network that compensates for the OIS motion. Fig.2 illustrates the pipeline. First, the gyro-based flows are generated from the gyro readings (Fig.2(a) and Sec.3.1); they are then fed into a network to produce the OIS-compensated flows $G'_{ab}$ (Fig.2(b) and Sec.3.3). To obtain the ground-truth rotations, we propose a Fundamental Mixtures model, which produces the Fundamental Mixtures flows $F_{ab}$ (Fig.2(d) and Sec.3.2) as guidance for the network (Fig.2(c)). During inference, the Fundamental Mixtures model is not required: the gyro readings are converted into gyro-based flows and fed to the network for compensation.
Figure 3: Illustration of rolling shutter frames. $t_{I_a}$ and $t_{I_b}$ are the frame starting times. $t_s$ is the camera readout time and $t_f$ denotes the frame period ($t_f > t_s$). $t_a(i)$ and $t_b(i)$ represent the starting times of patch $i$ in $I_a$ and $I_b$.
3.1 Gyro-Based Flow
We compute rotations by compounding gyro readings, which consist of angular velocities and timestamps. In particular, we read them from the HAL of the Android architecture for synchronization. The rotation vector $n = (\omega_x, \omega_y, \omega_z) \in \mathbb{R}^3$ is computed from the gyro readings between frames $I_a$ and $I_b$[17]. The rotation matrix $R(t) \in SO(3)$ can then be produced according to the Rodrigues formula[6].
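The step above can be sketched as follows: integrate the angular-velocity samples over the inter-frame interval and convert the accumulated rotation vector into a matrix with the Rodrigues formula. This is a minimal sketch; the `(timestamp, angular_velocity)` sample format is an assumption, not the paper's actual data layout.

```python
import numpy as np

def rodrigues(rvec):
    """Convert a rotation vector to a 3x3 rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rvec / theta                      # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])    # cross-product matrix of k
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def gyro_rotation(samples, t0, t1):
    """Compound gyro samples (t, (wx, wy, wz)) into the rotation from t0 to t1.

    Assumes samples are sorted by timestamp and cover [t0, t1]."""
    R = np.eye(3)
    for (ta, w), (tb, _) in zip(samples, samples[1:]):
        dt = min(tb, t1) - max(ta, t0)    # overlap of this sample with [t0, t1]
        if dt > 0:
            R = rodrigues(np.asarray(w) * dt) @ R
    return R
```

Each sample's angular velocity is held constant over its interval, which matches the usual piecewise integration of discrete gyro readings.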
If the camera is global shutter, the homography is modeled as:
$$\mathbf{H}(t) = \mathbf{K}\,\mathbf{R}(t)\,\mathbf{K}^{-1}, \tag{2}$$
where $\mathbf{K}$ is the intrinsic camera matrix and $\mathbf{R}(t)$ denotes the camera rotation from $I_a$ to $I_b$.
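Eq.(2) is a one-liner in code. The sketch below uses hypothetical intrinsics (focal length 1000 px, principal point at (640, 360)); a pure rotation about the optical axis leaves the principal point fixed, which is a handy sanity check.

```python
import numpy as np

def rotation_homography(K, R):
    """Rotation-only homography of Eq. (2): H = K R K^{-1}."""
    return K @ R @ np.linalg.inv(K)

# Hypothetical intrinsics: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
H = rotation_homography(K, np.eye(3))   # no rotation -> identity homography
```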
In an RS camera, every row of the image is exposed at a slightly different time, so Eq.(2) is not applicable: each row has a slightly different rotation matrix. In practice, assigning every row of pixels its own rotation matrix is unnecessary; instead, we group several consecutive rows into a row patch and assign each patch a rotation matrix. Fig.3 shows an example. Let $t_s$ denote the camera readout time, i.e., the duration between the exposure of the first and the last row of pixels. Then
$$t_a(i) = t_I + t_s \frac{i}{N}, \tag{3}$$
where $t_a(i)$ denotes the start of the exposure of the $i$-th patch in $I_a$, as shown in Fig.3, $t_I$ denotes the starting timestamp of the corresponding frame, and $N$ denotes the number of patches per frame. The corresponding patch in $I_b$ starts its exposure at:
$$t_b(i) = t_a(i) + t_f, \tag{4}$$
where $t_f = 1/\mathrm{FPS}$ is the frame period. The homography between the $i$-th patch of frames $I_a$ and $I_b$ can then be modeled as:
$$\mathbf{H} = \mathbf{K}\,\mathbf{R}(t_b)\,\mathbf{R}^{\top}(t_a)\,\mathbf{K}^{-1}, \tag{5}$$
where $\mathbf{R}(t_b)\,\mathbf{R}^{\top}(t_a)$ can be computed by accumulating rotation matrices from $t_a$ to $t_b$.
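Eqs.(3)-(5) combine into a per-patch homography array. A minimal sketch, assuming a callable `R_of_t` that returns the accumulated camera rotation at time `t` (in practice this would wrap the gyro integration above):

```python
import numpy as np

def patch_times(t_I, t_s, N):
    """Eq. (3): start-of-exposure time of each of the N row patches of a frame."""
    return [t_I + t_s * i / N for i in range(N)]

def rs_homography_array(K, R_of_t, t_Ia, t_s, t_f, N):
    """Eqs. (3)-(5): one homography per row patch between frames I_a and I_b.

    R_of_t is an assumed callable returning the camera rotation at time t."""
    Hs = []
    for t_a in patch_times(t_Ia, t_s, N):
        t_b = t_a + t_f                                           # Eq. (4)
        H = K @ R_of_t(t_b) @ R_of_t(t_a).T @ np.linalg.inv(K)    # Eq. (5)
        Hs.append(H)
    return Hs
```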
In our implementation, we divide the image into 6 patches, producing a homography array of 6 horizontal-strip homographies between two consecutive frames. We convert the homography array into a flow field[26] so that it can be fed as input to a convolutional neural network. For every pixel $\mathbf{p}$ in $I_a$, we have:
$$\mathbf{p}' = \mathbf{H}(t)\,\mathbf{p}, \quad (\mathbf{u}, \mathbf{v}) = \mathbf{p}' - \mathbf{p}, \tag{6}$$
and computing this offset for every pixel produces the gyro-based flow $G_{ab}$.
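The homography-to-flow conversion of Eq.(6) can be sketched as follows, applying each patch's homography to the pixels of its horizontal strip:

```python
import numpy as np

def homography_to_flow(Hs, height, width):
    """Eq. (6): turn an array of per-patch homographies into a dense flow field.

    Hs: list of N 3x3 homographies, one per horizontal row patch."""
    N = len(Hs)
    flow = np.zeros((height, width, 2))
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(float)  # homogeneous p
    for i, H in enumerate(Hs):
        rows = slice(i * height // N, (i + 1) * height // N)
        p = pts[rows].reshape(-1, 3) @ H.T            # p' = H p
        p = p[:, :2] / p[:, 2:3]                      # dehomogenize
        flow[rows] = (p - pts[rows].reshape(-1, 3)[:, :2]).reshape(flow[rows].shape)
    return flow
```

The two-channel `(u, v)` output is exactly the tensor form a convolutional network expects as input.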
3.2 Fundamental Mixtures
Before introducing our Fundamental Mixtures model, we briefly review the estimation of the fundamental matrix. If the camera is global shutter, every row of the frame is imaged simultaneously. Let $p_1$ and $p_2$ be the projections of a 3D point $X$ in the first and second frames, $p_1 = P_1 X$ and $p_2 = P_2 X$, where $P_1$ and $P_2$ represent the projection matrices. The fundamental matrix satisfies the equation[14]:
$$\mathbf{p}_1^{T}\,\mathbf{F}\,\mathbf{p}_2 = 0, \tag{7}$$
where $p_1 = (x_1, y_1, 1)^T$ and $p_2 = (x_1', y_1', 1)^T$. Let $\mathbf{f}$ be the 9-element vector made up of the entries of $F$; then Eq.(7) can be written as:
$$\left(x_1' x_1,\; x_1' y_1,\; x_1',\; y_1' x_1,\; y_1' y_1,\; y_1',\; x_1,\; y_1,\; 1\right)\mathbf{f} = 0, \tag{8}$$
and given $n$ correspondences, this yields a set of linear equations:
$$A\mathbf{f} = \begin{bmatrix} x_1' p_1^T & y_1' p_1^T & p_1^T \\ \vdots & \vdots & \vdots \\ x_n' p_n^T & y_n' p_n^T & p_n^T \end{bmatrix}\mathbf{f} = 0. \tag{9}$$
With at least 8 matching points, this is a homogeneous linear system that can be solved under the constraint $\|\mathbf{f}\|_2 = 1$ using the singular value decomposition (SVD) $A = UDV^{\top}$, where the last column of $V$ is the solution[14].
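The linear step of Eqs.(8)-(9) can be sketched as below. This is the bare 8-point estimate only: the classic algorithm additionally normalizes the coordinates and enforces the rank-2 constraint on $F$, both of which are omitted here for brevity.

```python
import numpy as np

def fundamental_8point(p1, p2):
    """Linear estimate of F from matched points via Eqs. (8)-(9).

    p1, p2: (n, 2) arrays of matched coordinates, n >= 8; the constraint
    encoded is p2^T F p1 = 0, with f taken row-major from F."""
    x1, y1 = p1[:, 0], p1[:, 1]
    x2, y2 = p2[:, 0], p2[:, 1]
    A = np.stack([x2 * x1, x2 * y1, x2,
                  y2 * x1, y2 * y1, y2,
                  x1, y1, np.ones_like(x1)], axis=1)   # Eq. (8), one row per match
    _, _, Vt = np.linalg.svd(A)
    f = Vt[-1]                     # unit-norm minimizer of ||A f||, Eq. (9)
    return f.reshape(3, 3)
```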
For an RS camera, the projection matrices $P_1$ and $P_2$ vary across rows instead of being frame-global, so Eq.(7) does not hold. Therefore, we introduce Fundamental Mixtures, assigning each row patch its own fundamental matrix.
We detect FAST features[40] and track them between frames by KLT[35]. We adapt the detection threshold to obtain a uniform feature distribution[11, 12].
To model RS effects, we divide a frame into $N$ patches, resulting in $N$ unknown fundamental matrices $F_i$ to be estimated per frame. If each fundamental matrix were estimated independently, discontinuities between patches would be unavoidable. We therefore smooth neighboring matrices during the estimation, as shown in Fig.2(d): a point $p_1$ not only contributes to its own patch but also influences nearby patches, weighted by distance. The fundamental matrix for point $p_1$ is the mixture:
$$F(p_1) = \sum_{i=1}^{N} F_i\, w_i(p_1), \tag{10}$$
where $w_i(p_1)$ is a Gaussian weight whose mean is at the middle of each patch, with $\sigma = 0.001h$, where $h$ represents the frame height.
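The weights of Eq.(10) depend only on the row coordinate of $p_1$. A minimal sketch; the normalization to unit sum is an assumption the paper leaves implicit, and note that with $\sigma = 0.001h$ the Gaussians are very narrow, so a point mostly influences its own patch:

```python
import numpy as np

def patch_weights(y, height, N, sigma_scale=0.001):
    """Eq. (10) weights: a Gaussian in the row coordinate y, centered at each
    patch middle, with sigma = sigma_scale * frame height.

    Normalizing the weights to sum to one is an assumption; with the paper's
    tiny sigma, points far from every patch center can underflow to zero."""
    centers = (np.arange(N) + 0.5) * height / N     # middle row of each patch
    w = np.exp(-0.5 * ((y - centers) / (sigma_scale * height)) ** 2)
    return w / w.sum()
```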
To fit the Fundamental Mixtures given a pair of matching points $(p_1, p_2)$, we rewrite Eq.(7) as:
$$0 = \mathbf{p}_1^{T}\,\mathbf{F}(p_1)\,\mathbf{p}_2 = \sum_{i=1}^{N} w_i(p_1) \cdot \mathbf{p}_1^{T}\,\mathbf{F}_i\,\mathbf{p}_2, \tag{11}$$
where each term $\mathbf{p}_1^{T}\,\mathbf{F}_i\,\mathbf{p}_2$ can be written as:
$$A_{p_1}^{i} f_i = \left(\begin{array}{ccc} x_1' p_1^T & y_1' p_1^T & p_1^T \end{array}\right) f_i, \tag{12}$$
where $f_i$ denotes the vector formed by concatenating the columns of $F_i$. Combining Eq.(11) and Eq.(12) yields a $1 \times 9N$ linear constraint:
$$\underbrace{\left(w_1(p_1) A_{p_1}^{1} \;\ldots\; w_N(p_1) A_{p_1}^{N}\right)}_{A_{p_1}} \underbrace{\begin{pmatrix} f_1 \\ \vdots \\ f_N \end{pmatrix}}_{f} = A_{p_1} f = 0. \tag{13}$$
Aggregating the linear constraints $A_{p_j}$ from every matched point pair $(p_j, p_{j+1})$ yields a homogeneous linear system $A\mathbf{f}=0$, which can be solved under the constraint $\|\mathbf{f}\|_2=1$ via SVD.
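The SVD step can be sketched in a few lines of numpy (a minimal illustration, not the paper's implementation; the toy matrix `A` is made up):

```python
import numpy as np

def solve_homogeneous(A):
    """Solve A f = 0 subject to ||f||_2 = 1.

    The minimizer of ||A f||_2 under the unit-norm constraint is the
    right singular vector of A associated with its smallest singular value.
    """
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1]          # rows of Vt are the right singular vectors

# Toy usage: A is rank-deficient, so an exact null vector exists.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
f = solve_homogeneous(A)   # A @ f is numerically zero, ||f||_2 = 1
```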
For robustness: if the number of feature points in a patch is fewer than $8$, Eq.(13) is under-constrained. We therefore add a regularizer $\lambda\|A_p^i - A_p^{i-1}\|_2 = 0$ to the homogeneous system, with $\lambda = 1$.
3.2.1 Rotation-Only Homography
Given the fundamental matrix $F_i$ and the camera intrinsic matrix $K$, we can compute the essential matrix $E_i$ of the $i$-th patch: $\mathbf{E_i} = \mathbf{K}^T \mathbf{F_i} \mathbf{K}$. The essential matrix $E_i$[14] can be decomposed into a camera rotation and translation, of which only the rotation $R_i$ is retained. We use $R_i$ to form a rotation-only homography as in Eq.(2) and convert the homography array into a flow field as in Eq.(6). We call this flow field the Fundamental Mixtures flow $F_{ab}$. Note that $R_i$ is spatially smooth because $F_i$ is smooth, and so $F_{ab}$ is smooth as well.
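The rotation extraction follows the standard essential-matrix decomposition of [14]. As an illustrative numpy sketch (not the paper's code; it returns only one of the two rotation hypotheses):

```python
import numpy as np

def rotation_from_fundamental(F, K):
    """Compute E = K^T F K and extract a rotation hypothesis from E.

    Standard decomposition: E = U diag(1,1,0) V^T and R = U W V^T with the
    fixed matrix W below; the translation is discarded, as in the text.
    """
    E = K.T @ F @ K
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:     # enforce proper rotations (det = +1)
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    return U @ W @ Vt            # the other hypothesis is U @ W.T @ Vt

# Toy usage: with K = I, F equals E; build E = [t]_x R from a known motion.
Rz = np.array([[np.cos(0.1), -np.sin(0.1), 0.],
               [np.sin(0.1),  np.cos(0.1), 0.],
               [0., 0., 1.]])
tx = np.array([[0., 0., 0.],
               [0., 0., -1.],
               [0., 1., 0.]])    # skew-symmetric matrix of t = (1, 0, 0)
R = rotation_from_fundamental(tx @ Rz, np.eye(3))
```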
3.3 Network Structure
The architecture of the network is shown in Fig.2. It adopts a UNet[30] backbone consisting of a series of convolutional and downsampling layers with skip connections. The input to the network is the gyro-based flow $G'_{ab}$ and the ground-truth target is the Fundamental Mixtures flow $F_{ab}$. The network produces an optical flow of size $H \times W \times 2$ that compensates the OIS-induced motion between $G'_{ab}$ and $F_{ab}$. Moreover, the network is fully convolutional, so it accepts inputs of arbitrary size.
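The fully convolutional property can be illustrated with a toy encoder-decoder in PyTorch (a minimal sketch, not the paper's actual UNet; the layer widths and the residual connection are our assumptions):

```python
import torch
import torch.nn as nn

class TinyCompensationNet(nn.Module):
    """Illustrative fully convolutional encoder-decoder: maps a gyro
    flow field (B, 2, H, W) to a flow field of the same spatial size."""

    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),  # downsample
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(16, 16, 2, stride=2), nn.ReLU(),    # upsample
            nn.Conv2d(16, 2, 3, padding=1),                        # 2-channel flow
        )

    def forward(self, x):
        return self.dec(self.enc(x)) + x   # skip connection from the input

net = TinyCompensationNet()
out = net(torch.randn(1, 2, 64, 96))       # any (even) H x W works
```

Because every layer is convolutional, the same weights apply to frames of any (stride-compatible) resolution.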
Our network is trained on 9k rich-texture frames at a resolution of $360 \times 270$ pixels for 1k iterations using the Adam optimizer[18] with $l_r = 1.0\times10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$. The batch size is 8, and the learning rate is reduced by 20% every 50 epochs. The entire training process takes about 50 hours. The implementation is in PyTorch and the network is trained on one NVIDIA RTX 2080 Ti.
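The step decay described above (a 20% reduction every 50 epochs) amounts to the following quick sketch (epoch indexing is our assumption):

```python
def learning_rate(epoch, base_lr=1.0e-4, drop=0.8, every=50):
    """Step-decay schedule: multiply the base rate by 0.8
    once every 50 epochs (epochs are 0-indexed here)."""
    return base_lr * drop ** (epoch // every)

# epochs 0-49 -> 1e-4, epochs 50-99 -> 8e-5, epochs 100-149 -> 6.4e-5
```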
4 Experimental Results
4.1 Dataset
Previously, there have been dedicated datasets designed to evaluate homography estimation[46] or image deblurring with artificially generated gyroscope-frame pairs[26], but none of them combine real gyroscope readings with corresponding video frames. We therefore propose a new dataset and benchmark, GF4.
Figure 4: A glance at our evaluation dataset. Our dataset contains 4 categories: regular (RE), low-texture (LT), low-light (LL), and moving-foreground (MF). Each category contains 350 pairs, a total of 1,400 pairs, with synchronized gyroscope readings.
Training Set To train our network, we record a set of videos with gyroscope readings using a hand-held cellphone. We choose scenes with rich textures so that sufficient feature points can be detected to calculate the Fundamental Mixtures model. The videos last 300 seconds, yielding 9,000 frames in total. Note that the scene type is not important, as long as it provides enough features for Fundamental Mixtures estimation.
Figure 5: We mark the correspondences manually in our evaluation set for quantitative metrics. For each pair, we mark 6~8 point matches.
Evaluation Set For the evaluation, we capture scenes of different types to compare with image-based registration methods. Our dataset contains 4 categories of frame-gyroscope pairs: regular (RE), low-texture (LT), low-light (LL), and moving-foreground (MF). Each category contains 350 pairs, for a total of 1,400 pairs. We show some examples in Fig.4. For quantitative evaluation, we manually mark 6~8 point correspondences per pair, distributed uniformly over the frames. Fig.5 shows some examples.
4.2 Comparisons with non-OIS camera
Our purpose is to enable gyro image alignment on OIS cameras. Therefore, we compare our method against non-OIS cameras: in general, our method should perform as well as a non-OIS camera if the OIS motion is compensated successfully. Ideally, we would use one camera with OIS turned on and off; however, the OIS cannot be turned off easily. Therefore, we use two cellphones with similar camera intrinsics, one with OIS and one without, and capture the same scene twice with similar motions. Fig.6 shows some examples. Fig.6 (a) shows the input frames. Fig.6 (b) shows gyro alignment on a non-OIS camera; as seen, the images are well aligned. Fig.6 (c) shows gyro alignment on an OIS camera; due to OIS interference, the images cannot be aligned directly using the gyro. Fig.6 (d) shows our results: with OIS compensation, the images are well aligned on OIS cameras.
Table 1: Comparisons with non-OIS camera.
We also report quantitative results. Similarly, we mark the ground truth for evaluation. The average geometric distance between the warped points and the manually labeled GT points is computed as the error metric (the lower the better). Table1 shows the results. Our result of 0.709 is comparable with the non-OIS camera's 0.688 (slightly worse), while no compensation yields 1.038, which is much higher.
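The error metric can be sketched in numpy as follows (an illustrative version; function and variable names are ours):

```python
import numpy as np

def avg_geometric_error(H, pts_src, pts_gt):
    """Average L2 distance between source points warped by the 3x3
    homography H and the manually labeled ground-truth points
    (lower is better). pts_src, pts_gt: (N, 2) pixel coordinates."""
    ones = np.ones((len(pts_src), 1))
    warped = (H @ np.hstack([pts_src, ones]).T).T    # homogeneous warp
    warped = warped[:, :2] / warped[:, 2:3]          # dehomogenize
    return np.linalg.norm(warped - pts_gt, axis=1).mean()

# Sanity check: a pure (3, 4) translation misaligns every point by 5 px.
pts = np.array([[10., 20.], [30., 40.]])
H_shift = np.array([[1., 0., 3.], [0., 1., 4.], [0., 0., 1.]])
err = avg_geometric_error(H_shift, pts, pts)         # -> 5.0
```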
4.3 Comparisons with Image-based Methods
Figure 6: Comparisons with non-OIS cameras. (a) input two frames. (b) gyro alignment results on the non-OIS camera. (c) gyro alignment results on the OIS camera. (d) our OIS compensation results. Without OIS compensation, clear misalignment can be observed in (c) whereas our method can solve this problem and be comparable with non-OIS results in (b).
Figure 7: Comparisons with image-based methods. We compare with SIFT[25] + RANSAC[9], Meshflow[22], and the recent deep homography[46] method. We show examples covering all scenes in our evaluation dataset. Our method can align images robustly while image-based methods contain some misaligned regions.
Although it is somewhat unfair to compare with image-based methods, since we adopt additional hardware, we wish to demonstrate the robustness of gyro-based alignment and highlight the importance of enabling this capability on OIS cameras.
Table 2: Quantitative comparisons on the evaluation dataset. The best performance is marked in red and the second-best is in blue.
4.3.1 Qualitative Comparisons
First, we compare our method with a frequently used traditional feature-based algorithm, SIFT[25] + RANSAC[9], which computes a global homography, and with another feature-based algorithm, Meshflow[22], which deforms a mesh for non-linear motion representation. Moreover, we compare our method with the recent deep homography method[46].
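The model-fitting step inside such a SIFT + RANSAC pipeline, estimating a global homography from (inlier) point matches, can be sketched with the Direct Linear Transform in numpy (feature detection and outlier rejection are omitted; names are illustrative):

```python
import numpy as np

def dlt_homography(src, dst):
    """Fit a 3x3 homography H with dst ~ H src from >= 4 point
    correspondences via the Direct Linear Transform: each match
    contributes two rows of a homogeneous system A h = 0, solved by SVD.
    src, dst: (N, 2) arrays."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)       # null vector holds the 9 entries of H
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                # fix the scale ambiguity

# Toy usage: four matches related by a pure translation of (2, 3).
src = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
dst = src + np.array([2., 3.])
H = dlt_homography(src, dst)          # ~ [[1,0,2],[0,1,3],[0,0,1]]
```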
Fig.7 (a) shows a regular example where all methods work well. In Fig.7 (b), SIFT+RANSAC fails to find a good solution, as does deep homography, while Meshflow works well; one possible reason is that a single homography cannot cover the large depth variations. Fig.7 (c) illustrates a moving-foreground example where SIFT+RANSAC and Meshflow do not work well, as few features are detected on the background, whereas deep homography and our method align the background successfully. A similar example is shown in Fig.7 (d): SIFT+RANSAC and deep homography fail, while Meshflow works as sufficient features are detected in the background. In contrast, our method aligns the background without any difficulty, because we do not need the image contents for registration. Fig.7 (e) is a low-light example, and Fig.7 (f) is a low-texture scene. All image-based methods fail as no high-quality features can be extracted, whereas our method remains robust.
4.3.2 Quantitative Comparisons
We also compare our method with other feature-based methods quantitatively using the geometric distance. For the feature descriptors, we choose SIFT[25], ORB[31], SOSNet[38], and SURF[3]. For the outlier rejection algorithms, we choose RANSAC[9] and MAGSAC[2]. The errors for each category are shown in Table2, followed by the overall averaged error, where $\mathcal{I}_{3\times3}$ refers to a $3\times3$ identity matrix used as a reference. In particular, feature-based methods sometimes crash; when the error is larger than the $\mathcal{I}_{3\times3}$ error, we set the error equal to the $\mathcal{I}_{3\times3}$ error. Regarding the motion model, methods 3) to 10) and 12) use a single homography, 11) is mesh-based, and 13) is a homography array. In Table2, we mark the best performance in red and the second-best in blue.
As shown, except against feature-based methods in the RE scenes, our method outperforms the others in all categories. This is reasonable: in regular (RE) scenes, a set of high-quality features is detected, which allows these methods to produce good solutions, whereas gyroscopes can only compensate for rotational motions, which lowers our score to some extent. For the remaining scenes, our method beats the others, with an average error 56.07% lower than the second best. In particular, for low-light (LL) scenes, our method achieves an error at least 43.9% lower than the second best.
4.4 Ablation Studies
4.4.1 Fully Connected Neural Network
Figure 8: Regression of the homography array using the fully connected network. For each pair of frames $I_a$ and $I_b$, a homography array is computed using gyro readings and fed to the network. On the other side, a Fundamental Mixtures model is produced as the target to guide the training process.
Our network is fully convolutional: we convert gyroscope data into homography arrays, and then into flow fields, which serve as image-like input to the network. However, another option is to feed the homography arrays directly as input. Similarly, on the target side, the Fundamental Mixtures are converted into rotation-only homography arrays and used as guidance. Fig.8 shows the pipeline, where we test two homography representations: the $3\times3$ homography matrix elements, and the $H_{4pt}$ representation from[7], which represents a homography by 4 motion vectors. The network is fully connected and an L2 loss is adopted for the regression. We use the same training data as described above. The result is that neither representation converges, although $H_{4pt}$ is slightly better than directly regressing the matrix elements.
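For concreteness, the $H_{4pt}$ representation can be obtained from a $3\times3$ homography by recording the displacement of the four image corners (an illustrative numpy sketch; the corner ordering is our assumption):

```python
import numpy as np

def h4pt_from_homography(H, width, height):
    """Convert a 3x3 homography into an H_4pt-style representation:
    the displacement of the four image corners under H, i.e. 4 motion
    vectors (8 numbers)."""
    corners = np.array([[0, 0], [width - 1, 0],
                        [0, height - 1], [width - 1, height - 1]], float)
    ones = np.ones((4, 1))
    warped = (H @ np.hstack([corners, ones]).T).T
    warped = warped[:, :2] / warped[:, 2:3]   # dehomogenize
    return warped - corners                   # (4, 2) corner offsets

# A pure translation moves all four corners by the same vector.
H = np.array([[1., 0., 5.], [0., 1., -2.], [0., 0., 1.]])
d = h4pt_from_homography(H, 360, 270)         # four copies of (5, -2)
```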
There may exist other representations or network structures that work as well as or better than our current proposal. Here, as a first attempt, we have proposed a working pipeline and leave such improvements as future work.
4.4.2 Global Fundamental vs. Mixtures
To verify the effectiveness of our Fundamental Mixtures model, we compare it with a global fundamental matrix. Here, we choose the regular scenes of the evaluation dataset to alleviate the feature problem. We estimate a global fundamental matrix and the Fundamental Mixtures, convert each into rotation-only homographies, and align the images with those homographies accordingly. The array of homographies from the Fundamental Mixtures produces an error of 0.451, better than the error of 0.580 produced by the single homography from the global fundamental matrix. This indicates that the Fundamental Mixtures model is effective for rolling shutter (RS) cameras.
Moreover, we generate GT with the two methods and train our network on each. As shown in Table3, the network trained on Fundamental Mixtures-based GT outperforms the one trained on the global fundamental matrix, demonstrating the effectiveness of our Fundamental Mixtures.
Table 3: The performance of networks trained on two different GT.
4.4.3 Backbone
Table 4: The performance of networks with different backbones.
We choose UNet[30] as our network backbone, and we also test several other variants[1, 28]. Except for AttUNet[28], the performances are similar, as shown in Table4.
5 Conclusion
We have presented the DeepOIS pipeline for compensating OIS motions for gyroscope-based image registration. We captured the training data, video frames together with their gyro readings, using an OIS camera, and then calculated the ground-truth motions with our proposed Fundamental Mixtures model under the rolling shutter camera setting. For the evaluation, we manually marked point correspondences on our captured dataset for quantitative metrics. The results show that our compensation network performs well compared with non-OIS cameras and outperforms image-based methods. In summary, we propose a new problem and show that it is solvable by learning the OIS motions, such that gyroscopes can be used for image registration on OIS cameras. We hope our work inspires more research in this direction.
References
- [1] Md Zahangir Alom, Mahmudul Hasan, Chris Yakopcic, Tarek M Taha, and Vijayan K Asari. Recurrent residual convolutional neural network based on u-net (r2u-net) for medical image segmentation. arXiv preprint arXiv:1802.06955, 2018.
- [2] Daniel Barath, Jiri Matas, and Jana Noskova. Magsac: marginalizing sample consensus. In Proc. CVPR, pages 10197–10205, 2019.
- [3] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. SURF: speeded up robust features. In Proc. ECCV, volume 3951, pages 404–417, 2006.
- [4] Matthew Brown and David G. Lowe. Recognising panoramas. In Proc. ICCV, pages 1218–1227, 2003.
- [5] Chi-Wei Chiu, Paul C-P Chao, and Din-Yuan Wu. Optimal design of magnetically actuated optical image stabilizer mechanism for cameras in mobile phones via genetic algorithm. IEEE Trans. on Magnetics, 43(6):2582–2584, 2007.
- [6] Jian S Dai. Euler–rodrigues formula variations, quaternion conjugation and intrinsic connections. Mechanism and Machine Theory, 92:144–152, 2015.
- [7] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Deep image homography estimation. arXiv preprint arXiv:1606.03798, 2016.
- [8] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Häusser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proc. ICCV, 2015.
- [9] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
- [10] Junhong Gao, Seon Joo Kim, and Michael S Brown. Constructing image panoramas using dual-homography warping. In Proc. CVPR, pages 49–56, 2011.
- [11] Matthias Grundmann, Vivek Kwatra, Daniel Castro, and Irfan Essa. Calibration-free rolling shutter removal. In IEEE international conference on computational photography (ICCP), pages 1–8, 2012.
- [12] Heng Guo, Shuaicheng Liu, Tong He, Shuyuan Zhu, Bing Zeng, and Moncef Gabbouj. Joint video stitching and stabilization from moving cameras. IEEE Trans. on Image Processing, 25(11):5491–5503, 2016.
- [13] Dennis Guse and Benjamin Müller. Gesture-based user authentication for mobile devicesusing accelerometer and gyroscope. In Informatiktage, pages 243–246, 2012.
- [14] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
- [15] Weibo Huang and Hong Liu. Online initialization and automatic camera-imu extrinsic calibration for monocular visual-inertial slam. In IEEE International Conference on Robotics and Automation (ICRA), pages 5182–5189, 2018.
- [16] Chao Jia and Brian L Evans. Online calibration and synchronization of cellphone camera and gyroscope. In IEEE Global Conference on Signal and Information Processing, pages 731–734, 2013.
- [17] Alexandre Karpenko, David Jacobs, Jongmin Baek, and Marc Levoy. Digital video stabilization and rolling shutter correction using gyroscopes. CSTR, 1(2011):2, 2011.
- [18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [19] Jun-Mo Koo, Myoung-Won Kim, and Byung-Kwon Kang. Optical image stabilizer for camera lens assembly, Feb.10 2009. US Patent 7,489,340.
- [20] Hoang Le, Feng Liu, Shu Zhang, and Aseem Agarwala. Deep homography estimation for dynamic scenes. In Proc. CVPR, pages 7652–7661, 2020.
- [21] Kaimo Lin, Nianjuan Jiang, Shuaicheng Liu, Loong-Fah Cheong, Minh N Do, and Jiangbo Lu. Direct photometric alignment by mesh deformation. In Proc. CVPR, pages 2701–2709, 2017.
- [22] Shuaicheng Liu, Ping Tan, Lu Yuan, Jian Sun, and Bing Zeng. Meshflow: Minimum latency online video stabilization. In Proc. ECCV, volume 9910, pages 800–815, 2016.
- [23] Shuaicheng Liu, Binhan Xu, Chuang Deng, Shuyuan Zhu, Bing Zeng, and Moncef Gabbouj. A hybrid approach for near-range video stabilization. IEEE Trans. on Circuits and Systems for Video Technology, 27(9):1922–1933, 2017.
- [24] Shuaicheng Liu, Lu Yuan, Ping Tan, and Jian Sun. Bundled camera paths for video stabilization. ACM Trans. Graphics, 32(4), 2013.
- [25] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004.
- [26] Janne Mustaniemi, Juho Kannala, Simo Särkkä, Jiri Matas, and Janne Heikkila. Gyroscope-aided motion deblurring with deep networks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1914–1922. IEEE, 2019.
- [27] Steven S Nasiri, Mansur Kiadeh, Yuan Zheng, Shang-Hung Lin, and SHI Sheena. Optical image stabilization in a digital still camera or handset, May 1 2012. US Patent 8,170,408.
- [28] Ozan Oktay, Jo Schlemper, Loic Le Folgoc, Matthew Lee, Mattias Heinrich, Kazunari Misawa, Kensaku Mori, Steven McDonagh, Nils Y Hammerla, Bernhard Kainz, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
- [29] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In Proc. CVPR, pages 1164–1172, 2015.
- [30] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- [31] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. Orb: An efficient alternative to sift or surf. In Proc. ICCV, pages 2564–2571, 2011.
- [32] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R. Bradski. ORB: an efficient alternative to SIFT or SURF. In Proc. ICCV, pages 2564–2571, 2011.
- [33] Koichi Sato, Shigeki Ishizuka, Akira Nikami, and Mitsuru Sato. Control techniques for optical image stabilizing system. IEEE Trans. on Consumer Electronics, 39(3):461–466, 1993.
- [34] Qi Shan, Wei Xiong, and Jiaya Jia. Rotational motion deblurring of a rigid object from a single image. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.
- [35] Jianbo Shi et al. Good features to track. In 1994 Proceedings of IEEE conference on computer vision and pattern recognition, pages 593–600. IEEE, 1994.
- [36] Vividh Siddha, Kunihiro Ishiguro, and Guillermo A Hernandez. Hardware abstraction layer, Aug.28 2012. US Patent 8,254,285.
- [37] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proc. CVPR, pages 8934–8943, 2018.
- [38] Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. Sosnet: Second order similarity regularization for local descriptor learning. In Proc. CVPR, pages 11016–11025, 2019.
- [39] Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. Sosnet: Second order similarity regularization for local descriptor learning. In Proc. CVPR, pages 11016–11025, 2019.
- [40] Miroslav Trajković and Mark Hedley. Fast corner detection. Image and vision computing, 16(2):75–87, 1998.
- [41] Bartlomiej Wronski, Ignacio Garcia-Dorado, Manfred Ernst, Damien Kelly, Michael Krainin, Chia-Kai Liang, Marc Levoy, and Peyman Milanfar. Handheld multi-frame super-resolution. ACM Trans. Graphics, 38(4):28:1–28:18, 2019.
- [42] DH Yeom. Optical image stabilizer for digital photographing apparatus. IEEE Trans. on Consumer Electronics, 55(3):1028–1031, 2009.
- [43] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: learned invariant feature transform. In Proc. ECCV, volume 9910, pages 467–483, 2016.
- [44] Tirra Hanin Mohd Zaki, Musab Sahrim, Juliza Jamaludin, Sharma Rao Balakrishnan, Lily Hanefarezan Asbulah, and Filzah Syairah Hussin. The study of drunken abnormal human gait recognition using accelerometer and gyroscope sensors in mobile application. In 2020 16th IEEE International Colloquium on Signal Processing & Its Applications (CSPA), pages 151–156, 2020.
- [45] Julio Zaragoza, Tat-Jun Chin, Michael S Brown, and David Suter. As-projective-as-possible image stitching with moving dlt. In Proc. CVPR, pages 2339–2346, 2013.
- [46] Jirong Zhang, Chuan Wang, Shuaicheng Liu, Lanpeng Jia, Jue Wang, Ji Zhou, and Jian Sun. Content-aware unsupervised deep homography estimation. In Proc. ECCV, 2020.