23.5. Motion Segmentation

The theory of structure from motion and motion segmentation has evolved over a set of papers [14, 24, 39, 53, 66, 73, 74]. In this section, we review the essential ideas from this series of work.

A typical image sequence (from a single camera shot) may contain multiple objects moving independently of each other. In the simplest model, we can assume that images in a sequence are views of a single moving object observed by a stationary camera or a stationary object observed by a moving camera. Only rigid motions are considered. In either case, the object is moving with respect to the camera.

The structure from motion problem focuses on recovering the (3D) shape and motion information of the moving object. In the general case, there are multiple objects moving independently. Thus, we also need to perform a motion segmentation such that motions of different objects can be separated and (either after or simultaneously) shape and motion of each object can be inferred.

This problem is typically solved in two stages. In the first stage, a frame-to-frame correspondence problem is solved, which identifies a set of feature points whose coordinates can be tracked as each point moves from one position to another over the sequence. We obtain a set of trajectories for these points over the frames in the video. If there is a single moving object, or the scene is static and the observer is moving, then all the feature points belong to the same object. Otherwise, we need to cluster these feature points into different objects moving in different directions. In the second stage, these trajectories are analyzed to group the feature points into separate objects and recover the shape and motion of each object. In this section we assume that the feature trajectories have been obtained by an appropriate method. Our focus is to identify the moving objects and obtain the shape and motion information for each object from the trajectories.

23.5.1. Modeling Structure from Motion for Single Object

We start with the simple model of a static camera and a moving object. All feature point trajectories belong to the moving object. Our objective is to demonstrate that the subspace spanned by feature trajectories of a single moving object is a low dimensional subspace.

Let the image sequence consist of $F$ frames indexed by $1 \leq f \leq F$. Let us assume that $S$ feature points of the moving object have been tracked over this image sequence. Let $(u_{f s}, v_{f s})$ be the image coordinates of the $s$-th point in the $f$-th frame. We form the feature trajectory vector for the $s$-th point by stacking its coordinates for the $F$ frames vertically as

y_{s} = {[\begin{matrix} u_{1 s} & v_{1 s} & u_{2 s} & v_{2 s} & \dots & u_{F s} & v_{F s} \end{matrix}]}^{T} .

Putting together the feature trajectory vectors of $S$ points in a single feature trajectory matrix, we obtain

Y = [\begin{matrix} y_{1} & y_{2} & \dots & y_{S} \end{matrix}] .

This is the data matrix under consideration from which the shape and motion of the object need to be inferred.
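The construction of $Y$ can be sketched in NumPy as follows. This is only an illustration: the function name `trajectory_matrix` and the assumption that the tracker output arrives as two arrays `u` and `v` of shape $(F, S)$ are not from the text, and indices are zero-based.

```python
import numpy as np

def trajectory_matrix(u, v):
    """Stack tracked image coordinates into the 2F x S trajectory matrix Y.

    u, v: array-likes of shape (F, S); u[f][s], v[f][s] are the image
    coordinates of point s in frame f (hypothetical input layout).
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    F, S = u.shape
    Y = np.empty((2 * F, S))
    Y[0::2] = u  # rows 0, 2, 4, ... hold the u coordinates of frames 1..F
    Y[1::2] = v  # rows 1, 3, 5, ... hold the v coordinates of frames 1..F
    return Y

# Toy data: F = 2 frames, S = 3 points.
u = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5]]
v = [[4.0, 5.0, 6.0], [4.5, 5.5, 6.5]]
Y = trajectory_matrix(u, v)
print(Y.shape)
```

Each column of `Y` is one trajectory vector $y_s$, with the $u$ and $v$ coordinates interleaved frame by frame.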

We need two coordinate systems. We use the camera coordinate system as the world coordinate system with the $Z$-axis along the optical axis. The coordinates of different points in the object are changing from frame to frame in the world coordinate system as the object is moving. We also establish a coordinate system within the object with origin at the centroid of the feature points such that the coordinates of individual points do not change from frame to frame in the object coordinate system. The (rigid) motion of the object is then modeled by the translation (of the centroid) and rotation of its coordinate system with respect to the world coordinate system. Let $(a_{s}, b_{s}, c_{s})$ be the coordinates of the $s$-th point in the object coordinate system. Then, the matrix

\begin{bmatrix} a_{1} & a_{2} & \dots & a_{S} \\ b_{1} & b_{2} & \dots & b_{S} \\ c_{1} & c_{2} & \dots & c_{S} \end{bmatrix}

represents the shape of the object (w.r.t. its centroid).

Let us choose an orthonormal basis in the object coordinate system. Let $d_{f}$ be the position of the centroid and $(i_{f}, j_{f}, k_{f})$ be the (orthonormal) basis vectors of the object coordinate system in the $f$-th frame. Then, the position of the $s$-th point in the world coordinate system in the $f$-th frame is given by

h_{f s} = d_{f} + a_{s} i_{f} + b_{s} j_{f} + c_{s} k_{f} .

Assuming orthographic projection and letting $h_{f s} = (u_{f s}, v_{f s}, w_{f s})$, the image coordinates are obtained by dropping the third component $w_{f s}$. We define the rotation matrix for the $f$-th frame as

R_{f} \triangleq \begin{bmatrix} i_{f} & j_{f} & k_{f} \end{bmatrix} = \begin{bmatrix} \underline{i}_{f} \\ \underline{j}_{f} \\ \underline{k}_{f} \end{bmatrix}

where $\underline{i}_{f}$, $\underline{j}_{f}$, $\underline{k}_{f}$ are the row vectors of $R_{f}$. Let $x_{s} = (a_{s}, b_{s}, c_{s}, 1)$ be the homogeneous coordinates of the $s$-th point in the object coordinate system. We can write the homogeneous coordinates in the camera coordinate system as

\begin{bmatrix} h_{f s} \\ 1 \end{bmatrix} = \begin{bmatrix} R_{f} & d_{f} \\ 0_{1 \times 3} & 1 \end{bmatrix} x_{s} .

If we write $d_{f} = (d_{f i}, d_{f j}, d_{f k})$ , then, the data matrix $Y$ can be factorized as

Y = \begin{bmatrix} u_{11} & \dots & u_{1 S} \\ v_{11} & \dots & v_{1 S} \\ \vdots & & \vdots \\ u_{F 1} & \dots & u_{F S} \\ v_{F 1} & \dots & v_{F S} \end{bmatrix} = \begin{bmatrix} \underline{i}_{1} & d_{1 i} \\ \underline{j}_{1} & d_{1 j} \\ \vdots & \vdots \\ \underline{i}_{F} & d_{F i} \\ \underline{j}_{F} & d_{F j} \end{bmatrix} \begin{bmatrix} x_{1} & \dots & x_{S} \end{bmatrix} .

We rewrite this as

Y = M S

where $M$ represents the motion information of the object and $S$ represents the shape information of the object. This factorization is known as the Tomasi-Kanade factorization of shape and motion information of a moving object. Note that $M \in \mathbb{R}^{2 F \times 4}$ and $S \in \mathbb{R}^{4 \times S}$ (the symbol $S$ is overloaded here: on the left it is the shape matrix, in the exponent it is the number of feature points). Thus the rank of $Y$ is at most 4, and the feature trajectories of a rigidly moving object span a subspace of dimension at most 4 in the trajectory space $\mathbb{R}^{2 F}$.
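The rank bound can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions: the rotations, translations, and object points are drawn at random, the helper `random_rotation` is hypothetical, and the shape matrix is called `X` in the code to avoid the clash with the point count.

```python
import numpy as np

rng = np.random.default_rng(0)
F, num_points = 10, 30  # number of frames and of feature points (S in the text)

def random_rotation(rng):
    """Draw a 3x3 rotation matrix (orthonormal, determinant +1) via QR."""
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]  # flip one column to turn a reflection into a rotation
    return Q

# Shape matrix in homogeneous object coordinates, origin at the centroid.
P = rng.standard_normal((3, num_points))
P -= P.mean(axis=1, keepdims=True)
X = np.vstack([P, np.ones((1, num_points))])  # 4 x S

# Motion matrix: for each frame, the first two rows of [R_f  d_f].
M = np.empty((2 * F, 4))
for f in range(F):
    R_f = random_rotation(rng)
    d_f = rng.standard_normal(3)
    M[2 * f : 2 * f + 2, :3] = R_f[:2, :]
    M[2 * f : 2 * f + 2, 3] = d_f[:2]

Y = M @ X  # orthographic projections of all points in all frames, 2F x S
rank_Y = np.linalg.matrix_rank(Y)
print(Y.shape, rank_Y)
```

Although `Y` has 20 rows and 30 columns, its numerical rank cannot exceed 4 because it is a product of a $2F \times 4$ and a $4 \times S$ matrix.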

23.5.2. Solving the Structure From Motion Problem

We digress a bit to understand how to perform the factorization of $Y$ into $M$ and $S$ . Using SVD, $Y$ can be decomposed as

Y = U \Sigma V^{T} .

Since $Y$ is at most rank $4$, we keep only the first 4 singular values as

\Sigma = \operatorname{diag}(\sigma_{1}, \sigma_{2}, \sigma_{3}, \sigma_{4}) .

Matrices $U \in \mathbb{R}^{2 F \times 4}$ and $V \in \mathbb{R}^{S \times 4}$ are the left and right singular matrices respectively.

There is no unique factorization of $Y$ in general. One simple factorization can be obtained by defining:

\hat{M} = U \Sigma^{\frac{1}{2}}, \quad \hat{S} = \Sigma^{\frac{1}{2}} V^{T} .
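A minimal NumPy sketch of this truncated-SVD factorization is given below. The function name `rank4_factorization` is hypothetical, and the test matrix is a random rank-4 product rather than real tracking data.

```python
import numpy as np

def rank4_factorization(Y, r=4):
    """Return M_hat = U Sigma^(1/2) and S_hat = Sigma^(1/2) V^T from the top r singular triplets."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    M_hat = U[:, :r] * sqrt_s           # scales column k of U by sqrt(sigma_k)
    S_hat = sqrt_s[:, None] * Vt[:r]    # scales row k of V^T by sqrt(sigma_k)
    return M_hat, S_hat

# Synthetic rank-4 data matrix with 2F = 12 rows and S = 20 columns (assumed sizes).
rng = np.random.default_rng(1)
Y = rng.standard_normal((12, 4)) @ rng.standard_normal((4, 20))

M_hat, S_hat = rank4_factorization(Y)
residual = np.linalg.norm(Y - M_hat @ S_hat)
print(M_hat.shape, S_hat.shape, residual)
```

For an exactly rank-4 `Y` the residual is at the level of floating point round-off. With noisy data the same code returns the best rank-4 approximation in the Frobenius norm.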

But for any $4 \times 4$ invertible matrix $A$ ,

M = \hat{M} A, S = A^{- 1} \hat{S}

is also a possible solution since $M S = \hat{M} \hat{S} = Y$ . Remember that $M$ is not an arbitrary matrix but represents the rigid motion of an object. There is considerable structure inside the motion matrix. These structural constraints can be used to compute an appropriate $A$ and thus obtain $M$ from $\hat{M}$ . To proceed further, let us break $A$ into two parts

A = [\begin{array}{cc} A_{R} & a_{t} \end{array}]

where $A_{R} \in \mathbb{R}^{4 \times 3}$ is the rotational component and $a_{t} \in \mathbb{R}^{4}$ is related to translation. We can now write:

M = [\begin{array}{cc} \hat{M} A_{R} & \hat{M} a_{t} \end{array}]

Rotational constraints

Recall that $R_{f}$ is a rotation matrix; hence its rows have unit norm and are orthogonal to each other. Thus every row of $\hat{M} A_{R}$ has unit norm and the two rows belonging to a given frame are orthogonal. This yields the following constraints for each frame $1 \leq f \leq F$:

\begin{aligned} \hat{m}_{2 f - 1} A_{R} A_{R}^{T} \hat{m}_{2 f - 1}^{T} &= 1 \\ \hat{m}_{2 f} A_{R} A_{R}^{T} \hat{m}_{2 f}^{T} &= 1 \\ \hat{m}_{2 f - 1} A_{R} A_{R}^{T} \hat{m}_{2 f}^{T} &= 0 \end{aligned}

where $\hat{m}_{k}$ denotes the $k$-th row of the matrix $\hat{M}$. This over-constrained system of $3F$ equations can be solved for the entries of $A_{R}$ using least squares techniques; note that the equations are linear in the entries of the symmetric matrix $A_{R} A_{R}^{T}$.
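One possible way to carry out this step is sketched below. The text does not prescribe a procedure, so this is an assumption: it treats the symmetric matrix $Q = A_{R} A_{R}^{T}$ as the unknown (10 free entries), solves the linear system by least squares, and then takes a rank-3 square root of $Q$. The result is determined only up to a $3 \times 3$ orthogonal factor. The helper names and the synthetic test data are hypothetical.

```python
import numpy as np

def sym_coeffs(a, b):
    """Coefficients of a Q b^T in the 10 upper-triangular entries of a symmetric 4x4 Q."""
    outer = np.outer(a, b)
    outer = outer + outer.T       # off-diagonal entry q_ij multiplies a_i b_j + a_j b_i
    rows, cols = np.triu_indices(4)
    c = outer[rows, cols]
    c[rows == cols] *= 0.5        # diagonal entry q_ii multiplies a_i b_i only
    return c

def solve_rotation_part(M_hat):
    """Estimate A_R (4 x 3), up to a 3x3 orthogonal factor, from the 3F constraints."""
    F = M_hat.shape[0] // 2
    lhs, rhs = [], []
    for f in range(F):
        mi, mj = M_hat[2 * f], M_hat[2 * f + 1]
        lhs += [sym_coeffs(mi, mi), sym_coeffs(mj, mj), sym_coeffs(mi, mj)]
        rhs += [1.0, 1.0, 0.0]
    q, *_ = np.linalg.lstsq(np.array(lhs), np.array(rhs), rcond=None)

    rows, cols = np.triu_indices(4)
    Q = np.zeros((4, 4))
    Q[rows, cols] = q
    Q = Q + Q.T - np.diag(np.diag(Q))     # rebuild the full symmetric matrix

    w, E = np.linalg.eigh(Q)              # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:3]         # keep the three largest
    return E[:, top] * np.sqrt(np.clip(w[top], 0.0, None))

# Synthetic check: a motion matrix with orthonormal row pairs, mixed by a matrix B.
rng = np.random.default_rng(2)
F = 10
M = np.empty((2 * F, 4))
for f in range(F):
    R_f, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # orthogonal, so rows are orthonormal
    M[2 * f : 2 * f + 2, :3] = R_f[:2]
    M[2 * f : 2 * f + 2, 3] = rng.standard_normal(2)
B = np.eye(4) + 0.1 * rng.standard_normal((4, 4))       # plays the role of A^{-1}
M_hat = M @ B

A_R = solve_rotation_part(M_hat)
G = M_hat @ A_R
row_norms = np.linalg.norm(G, axis=1)
cross = np.sum(G[0::2] * G[1::2], axis=1)
print(row_norms.round(6), cross.round(6))
```

On noiseless data the rows of `M_hat @ A_R` come out with unit norm and the two rows of each frame are orthogonal. With noisy data, $Q$ may fail to be positive semidefinite, which is why the sketch clips negative eigenvalues before taking square roots.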

Translational constraints

Recall that the image of the centroid of a set of points under an isometry (rigid motion) is the centroid of the images of the points under the same isometry. The homogeneous coordinates of the centroid in the object coordinate system are $(0, 0, 0, 1)$. The coordinates of the centroid in the $f$-th image are

(\frac{1}{S} \sum_{s} u_{f s}, \frac{1}{S} \sum_{s} v_{f s}) .

Putting back, we obtain

\frac{1}{S} \begin{bmatrix} \sum_{s} u_{1 s} \\ \sum_{s} v_{1 s} \\ \vdots \\ \sum_{s} u_{F s} \\ \sum_{s} v_{F s} \end{bmatrix} = \begin{bmatrix} \hat{M} A_{R} & \hat{M} a_{t} \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} = \hat{M} a_{t} .

A least squares solution for $a_{t}$ is straightforward.
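The least squares step for $a_{t}$ can be sketched as follows. The function name `solve_translation_part` is hypothetical, and the test data uses a random stand-in for the motion matrix, which is enough here because the centroid identity only relies on the object points being centered.

```python
import numpy as np

def solve_translation_part(M_hat, Y):
    """Least squares estimate of a_t from (1/S) * sum_s y_s = M_hat a_t."""
    centroid = Y.mean(axis=1)  # per-row mean over the S feature points
    a_t, *_ = np.linalg.lstsq(M_hat, centroid, rcond=None)
    return a_t

rng = np.random.default_rng(3)
F, num_points = 6, 15

# Assumption: a random 2F x 4 matrix stands in for the true motion matrix M.
M = rng.standard_normal((2 * F, 4))

# Homogeneous object coordinates with the origin at the centroid.
P = rng.standard_normal((3, num_points))
P -= P.mean(axis=1, keepdims=True)
X = np.vstack([P, np.ones((1, num_points))])
Y = M @ X

# M_hat differs from M by an invertible mixing matrix B (playing the role of A^{-1}).
B = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
M_hat = M @ B

a_t = solve_translation_part(M_hat, Y)
print(np.linalg.norm(M_hat @ a_t - M[:, 3]))
```

The product `M_hat @ a_t` reproduces the translation column of `M`, and `a_t` agrees with the last column of the inverse of `B`.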

23.5.3. Modeling Motion for Multiple Objects

The generalization from one object to multiple objects is straightforward. Let there be $K$ objects in the scene moving independently (see footnote 1). Let $S_{1}, S_{2}, \dots, S_{K}$ feature points be tracked for objects $1, 2, \dots, K$ respectively over $F$ frames, with $S = \sum_{k} S_{k}$. Let these feature trajectories be put in a data matrix $Y \in \mathbb{R}^{2 F \times S}$. In general, we don't know which feature point belongs to which object or how many feature points there are for each object. There is at least one feature point for each object (otherwise the object isn't being tracked at all). We could permute the columns of $Y$ via an (unknown) permutation $\Gamma$ so that the feature points of each object are placed contiguously, giving us

Y^{*} = Y \Gamma = \begin{bmatrix} Y_{1} & Y_{2} & \dots & Y_{K} \end{bmatrix} .

Clearly, each submatrix $Y_{k}$ ($1 \leq k \leq K$), which consists of the feature trajectories of one object, spans a subspace of dimension at most 4. The problem of motion segmentation is thus essentially that of separating $Y$ into the $Y_{k}$, which reduces to a standard subspace clustering problem.
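The role of the permutation can be illustrated with a small synthetic example. The sketch assumes the segmentation labels are already known (in practice they are the output of a subspace clustering algorithm), and it uses random rank-4 blocks as stand-ins for the per-object trajectory matrices; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
F = 8
sizes = [12, 9]  # S_1, S_2: feature points per object (assumed values)

# Stand-ins for Y_1, Y_2: random products of a 2F x 4 and a 4 x S_k matrix.
blocks = [rng.standard_normal((2 * F, 4)) @ rng.standard_normal((4, n)) for n in sizes]
labels_sorted = np.repeat(np.arange(len(sizes)), sizes)

# Observed data: columns arrive in an unknown, shuffled order.
shuffle = rng.permutation(sum(sizes))
Y = np.hstack(blocks)[:, shuffle]
labels = labels_sorted[shuffle]

# Given the labels, build the permutation matrix Gamma that groups columns by object.
order = np.argsort(labels, kind="stable")
num_points = Y.shape[1]
Gamma = np.eye(num_points)[:, order]  # column j of Gamma picks column order[j] of Y
Y_star = Y @ Gamma

rank_block_1 = np.linalg.matrix_rank(Y_star[:, : sizes[0]])
rank_block_2 = np.linalg.matrix_rank(Y_star[:, sizes[0] :])
rank_total = np.linalg.matrix_rank(Y)
print(rank_block_1, rank_block_2, rank_total)
```

Each contiguous block of `Y_star` has rank at most 4, while the full matrix has rank at most $4K$, which is the structure that subspace clustering methods exploit.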

Let us dig a bit deeper to see how the motion shape factorization identity changes for the multi-object formulation. Each data submatrix $Y_{k}$ can be factorized as

Y_{k} = U_{k} \Sigma_{k} V_{k}^{T} = M_{k} S_{k} = \hat{M}_{k} A_{k} A_{k}^{-1} \hat{S}_{k} .

$Y^{*}$ now has the canonical factorization:

Y^{*} = \begin{bmatrix} M_{1} & \dots & M_{K} \end{bmatrix} \begin{bmatrix} S_{1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & S_{K} \end{bmatrix} .

If we further denote

\begin{aligned}
M &= \begin{bmatrix} M_{1} & \dots & M_{K} \end{bmatrix} \\
\hat{M} &= \begin{bmatrix} \hat{M}_{1} & \dots & \hat{M}_{K} \end{bmatrix} \\
S &= \begin{bmatrix} S_{1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & S_{K} \end{bmatrix} \\
\hat{S} &= \begin{bmatrix} \hat{S}_{1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \hat{S}_{K} \end{bmatrix} \\
A &= \begin{bmatrix} A_{1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & A_{K} \end{bmatrix} \\
U &= \begin{bmatrix} U_{1} & \dots & U_{K} \end{bmatrix} \\
\Sigma &= \begin{bmatrix} \Sigma_{1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \Sigma_{K} \end{bmatrix} \\
V &= \begin{bmatrix} V_{1} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & V_{K} \end{bmatrix},
\end{aligned}

then we obtain a factorization similar to the single object case given by

\begin{aligned} Y^{*} &= M S = \hat{M} A A^{-1} \hat{S} \\ S &= A^{-1} \hat{S} = A^{-1} \Sigma^{\frac{1}{2}} V^{T} \\ M &= \hat{M} A = U \Sigma^{\frac{1}{2}} A . \end{aligned}

Thus, once the segmentation of $Y$ in terms of the unknown permutation $\Gamma$ has been obtained, the (sorted) data matrix $Y^{*}$ can be factorized into shape and motion components as appropriate.

23.5.4. Limitations

Our discussion so far has established that the feature trajectories of each moving object span a subspace of dimension at most 4. There are a number of reasons why this holds only approximately in practice: perspective distortion of the camera, tracking errors, and pixel quantization. Thus, a subspace clustering algorithm should allow for the presence of noise or corruption of data in real-life applications.
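The effect of noise on the rank-4 model can be illustrated numerically. In the sketch below the noise is i.i.d. Gaussian with an assumed standard deviation `sigma`, and the threshold used to count significant singular values is a heuristic chosen for this example, not a recommendation from the text.

```python
import numpy as np

rng = np.random.default_rng(5)
F, num_points = 10, 40

# Exactly rank-4 synthetic trajectory matrix (random factors as stand-ins).
Y_clean = rng.standard_normal((2 * F, 4)) @ rng.standard_normal((4, num_points))

# Additive noise standing in for tracking errors and pixel quantization.
sigma = 0.01
Y_noisy = Y_clean + sigma * rng.standard_normal(Y_clean.shape)

s_clean = np.linalg.svd(Y_clean, compute_uv=False)
s_noisy = np.linalg.svd(Y_noisy, compute_uv=False)

# Heuristic threshold: a few times the typical spectral norm of the noise matrix.
tol = 3.0 * sigma * (np.sqrt(2 * F) + np.sqrt(num_points))
effective_rank = int(np.sum(s_noisy > tol))
strict_rank = np.linalg.matrix_rank(Y_noisy)
print(effective_rank, strict_rank)
```

The noisy matrix is of full rank in the strict sense, yet only four singular values stand clearly above the noise level. This is the sense in which the trajectories lie only approximately in a 4-dimensional subspace.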

1: Our realization of an object is a set of feature points undergoing the same rotation and translation over a sequence of images. The notions of locality, color, connectivity etc. play no role in this definition. It is possible that two visually distinct objects undergo the same rotation and translation within a given image sequence. For the purposes of inferring an object from its motion, these two visually distinct objects are treated as one.
