Notes on Correlation Filter-based Tracking

I have recently been reading about feature tracking. This post collects my notes on the paper below.

Chen Z, Hong Z, Tao D. An experimental survey on correlation filter-based tracking[J]. arXiv preprint arXiv:1509.05520, 2015.

1. Introduction

Challenges in tracking:

illumination variations
occlusions
deformations
rotations

Tracking algorithms fall into two categories:

generative models
- Generative trackers perform tracking by searching for the best-matching windows
discriminative models
- discriminative methods learn to distinguish the target from backgrounds
- background information is advantageous for effective tracking, which suggests that discriminative methods are more competitive.
- especially correlation filter-based discriminative trackers

About correlation filters

Correlation filters are designed to produce correlation peaks for each target of interest in the scene while yielding low responses to the background; they are usually used as detectors of expected patterns.
The training they require used to make them unsuitable for online tracking.
The introduction of MOSSE changed this by using an adaptive training scheme.
Building on MOSSE, the three current state-of-the-art methods are SAMF [18], DSST [19] and improved KCF [20].
In general, training schemes of filters are extremely crucial in correlation filter-based tracking, and CFTs can be further improved by introducing better training schemes, extracting powerful features, relieving the scaling issue, applying part-based tracking strategies and cooperating with long-term tracking.
- Training is key to the performance of a correlation filter tracker.
The table below lists the milestone algorithms in the development of correlation filter trackers.

[Table: milestone algorithms in the development of correlation filter trackers]

2. Correlation filter-based tracking framework

General framework of correlation filter-based tracking methods

Initially, the correlation filter is trained with an image patch cropped from the given position of the target in the first frame.
Then, in each subsequent time step, the patch at the previously predicted position is cropped for detection.
Afterwards, as shown in Figure 1, various features can be extracted from the raw input data, and a cosine window is usually applied to smooth the boundary effects.
Subsequently, efficient correlation operations are performed by replacing the exhaustive convolutions with element-wise multiplications using the Discrete Fourier Transform (DFT).
Following the correlation procedure, a spatial confidence map, or response map, is obtained using the inverse FFT. The position with the maximum value in this map is then predicted as the new state of the target.
Next, the appearance at the estimated position is extracted for training and updating the correlation filter.
- Because only the DFT of the correlation filter is needed, both the training and the updating procedures are carried out in the frequency domain.
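
Two of the steps above, the cosine window and reading the new position off the response map, are easy to sketch in NumPy. This is only an illustration under my own assumptions: the helper names `cosine_window` and `locate_peak` are hypothetical, the patch is random data, and the response map is synthetic rather than produced by a trained filter.

```python
import numpy as np

def cosine_window(height, width):
    """2-D Hann window built as the outer product of two 1-D Hann windows."""
    return np.outer(np.hanning(height), np.hanning(width))

def locate_peak(response):
    """Row and column of the maximum of a response map."""
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)

rng = np.random.default_rng(0)
patch = rng.random((32, 32))                  # stand-in for a cropped feature patch
windowed = patch * cosine_window(*patch.shape)

# The taper drives the patch to zero along its border.
print(np.allclose(windowed[0, :], 0.0), np.allclose(windowed[:, -1], 0.0))

response = np.zeros((32, 32))
response[20, 7] = 1.0                         # synthetic response map
print(locate_peak(response))
```

Tapering matters because the DFT treats the patch as periodic, so without it the jump between opposite borders would show up as a spurious edge.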

The flowchart is shown below:

[Figure 1: flowchart of the general correlation filter-based tracking framework]

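FFT acceleration

$$x \otimes h = \mathcal{F}^{-1}(\hat{x} \odot \hat{h}^*) \tag{1}$$

x can be a raw image patch or extracted features, and h is the correlation filter. The hat denotes the DFT, $\odot$ element-wise multiplication and $*$ the complex conjugate. Formula (1) says that correlation in the spatial domain equals an element-wise product in the frequency domain followed by an inverse transform.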
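
To convince myself of formula (1), here is a minimal 1-D NumPy check. It assumes circular (wrap-around) correlation, real-valued signals and NumPy's DFT sign convention; the two function names are made up for this note.

```python
import numpy as np

def circular_correlation_direct(x, h):
    """c[k] = sum_n x[(n + k) mod N] * h[n], computed with explicit loops."""
    n = len(x)
    return np.array([sum(x[(i + k) % n] * h[i] for i in range(n)) for k in range(n)])

def circular_correlation_fft(x, h):
    """The same quantity through the DFT: F^{-1}(x_hat * conj(h_hat))."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(h))))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
h = rng.standard_normal(8)
print(np.allclose(circular_correlation_direct(x, h), circular_correlation_fft(x, h)))
```

The direct version costs O(N^2) operations and the FFT version O(N log N), which is where the speed of correlation filter trackers comes from.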

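How is the filter trained?

$$y = \mathcal{F}^{-1}(\hat{x} \odot \hat{h}^*) \tag{2}$$

$$\hat{h}^* = \frac{\hat{y}}{\hat{x}} \tag{3}$$

Formula (3) follows from formula (2), with the division taken element-wise. During training, y is the desired correlation output, which should be given as a label, and x is the given sample. $\hat{x}$ is obtained with the Fourier transform, so formula (3) is how the correlation filter is computed.

But isn't this for just one sample? What about N samples?
- Not quite. A correlation filter apparently starts from a single labeled sample and is then updated from the detected target; see the updating scheme section below for details.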
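
A minimal 1-D sketch of the single-sample training in formula (3), followed by detection on a shifted copy. The signal is random stand-in data, and the Gaussian desired output with its width of 2 samples is my own choice, not something taken from the paper.

```python
import numpy as np

n = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(n)                    # stand-in for the training patch

# Desired output: Gaussian peak at index 0, wrapped circularly.
dist = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (dist / 2.0) ** 2)

# Formula (3): element-wise division in the frequency domain.
h_conj_hat = np.fft.fft(y) / np.fft.fft(x)

# Detection on a circularly shifted copy of the patch.
shift = 5
z = np.roll(x, shift)
response = np.real(np.fft.ifft(np.fft.fft(z) * h_conj_hat))
print(int(np.argmax(response)))               # 5: the peak follows the shift
```

The plain division is fragile whenever some entry of $\hat{x}$ is close to zero, which is one reason the later schemes add a regularization term.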

Some issues when using the correlation filter-based tracking framework

First, training schemes are extremely crucial for CFTs. Since the target may change its appearance continuously, correlation filters should be adaptively trained and updated on-the-fly to adapt to the new appearance of the target.
- The target's appearance changes, so the filter must be able to adapt to it.
Second, feature representation methods also greatly influence the performance. Although raw pixels can be directly used for detection, the tracker may be affected by various kinds of noise such as illumination changes and motion blur. More powerful features are supposed to be helpful.
- A CFT is essentially feature tracking, so features matter a great deal and must be robust to all kinds of disturbance and change. Swapping in different features is in fact a common way to churn out papers, or, put more kindly, a form of domain adaptation.
Moreover, how to adapt to the scale of the target is another challenging problem for CFTs. Since the sizes of correlation filters are usually fixed in tracking, scale variations of the target cannot be handled well in these trackers. As a result, an effective scale estimation approach is supposed to remedy this shortcoming of correlation filter-based tracking.
- At first I thought this overlapped with the previous point: why not just extract scale-invariant features? On reflection it is different. The emphasis here is on handling scale at the framework level, not only at the feature level.
Furthermore, long-term tracking is believed to be the weakness of many CFTs since they commonly lack the ability to re-locate the target after drifting. By cooperating with long-term tracking methods, CFTs can be much more robust in tracking.
- Long-term tracking deals with the case where the target is lost.

3. Training schemes for correlation filters

There are a great many ways to train correlation filters, and they come in different kinds.

3.1 Traditional Training Methods

The simplest case is of course to crop a template directly from the image.
- But a template cropped this way also responds strongly to the background.
Many methods try to suppress responses to negative training samples while maintaining a high response to the target; they differ in how the correlation filter is constructed.
Synthetic Discriminant Functions (SDF) [27], [32], Optimal Tradeoff Filters (OTF) [28] and Minimum Average Correlation Energy (MACE) [29] are trained with enforced hard constraints so that peaks would always be produced at the same height.
On the contrary, hard constraints are believed to be unnecessary in other filters, such as Maximum Average Correlation Height (MACH) [30] and Unconstrained MACE (UMACE) [31]. These filters are trained by relaxing the hard constraints.
- This class seems to be the exact opposite of the one above.
Recently, a correlation filter named Average of Synthetic Exact Filters (ASEF) [34] averages all the trained exact filters to obtain a general one.

3.2 Adaptive Correlation Filters

The goal is to train correlation filters more efficiently.

3.2.1 Minimum Output Sum of Squared Error (MOSSE)

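The motivation of MOSSE is that formulas (2) and (3) use only one sample; involving more samples would further improve the robustness of the correlation filter.
MOSSE seeks an h that minimizes the squared error between the actual correlation output and the desired correlation output. In the frequency domain this minimization problem reads:

$$\min_{\hat{h}^*} \sum_{i} \left| \hat{x}_i \odot \hat{h}^* - \hat{y}_i \right|^2 \tag{4}$$

The solution is then

$$\hat{h}^* = \frac{\sum_{i} \hat{y}_i \odot \hat{x}_i^*}{\sum_{i} \hat{x}_i \odot \hat{x}_i^*} \tag{5}$$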
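
As a sketch of the multi-sample closed form, the snippet below sums the cross-spectra and the energy spectra over all samples before dividing, which is the standard MOSSE solution as I understand it. The training set (shifted, slightly noisy copies of one random signal), the noise level and the function name `mosse_filter` are assumptions made for illustration.

```python
import numpy as np

def mosse_filter(samples, targets):
    """Closed form: sum_i(y_hat_i * conj(x_hat_i)) / sum_i(x_hat_i * conj(x_hat_i))."""
    x_hat = np.fft.fft(samples, axis=1)
    y_hat = np.fft.fft(targets, axis=1)
    numerator = np.sum(y_hat * np.conj(x_hat), axis=0)
    denominator = np.sum(np.abs(x_hat) ** 2, axis=0)
    return numerator / denominator

n = 64
rng = np.random.default_rng(0)
base = rng.standard_normal(n)
dist = np.minimum(np.arange(n), n - np.arange(n))
peak = np.exp(-0.5 * (dist / 2.0) ** 2)        # desired output, peak at index 0

# Training set: shifted, noisy copies of the base signal with matching targets.
shifts = [0, 3, -4, 7]
samples = np.stack([np.roll(base, s) + 0.05 * rng.standard_normal(n) for s in shifts])
targets = np.stack([np.roll(peak, s) for s in shifts])

h_conj_hat = mosse_filter(samples, targets)
response = np.real(np.fft.ifft(np.fft.fft(np.roll(base, 10)) * h_conj_hat))
print(int(np.argmax(response)))                # location of the response peak
```

With a single sample the expression collapses to formula (3), so this is a strict generalization of the one-sample case.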

3.2.2 Regularized ASEF (Average of Synthetic Exact Filters)

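Unlike MOSSE, ASEF solves formula (4) for one sample at a time

$$\hat{h}_i^* = \frac{\hat{y}_i}{\hat{x}_i} = \frac{\hat{y}_i \odot \hat{x}_i^*}{\hat{x}_i \odot \hat{x}_i^*} \tag{6}$$

and then accumulates and averages the quotients from formula (6)

$$\hat{h}^* = \frac{1}{N} \sum_{i} \frac{\hat{y}_i \odot \hat{x}_i^*}{\hat{x}_i \odot \hat{x}_i^*} \tag{7}$$

The so-called regularized ASEF adds a regularization parameter $\epsilon$ to the denominators of formulas (6) and (7) to keep them away from 0, which helps stabilize the filter.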
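
A small sketch of the regularized ASEF average, assuming the per-sample exact filter is written as $\hat{y}_i \odot \hat{x}_i^*$ over $\hat{x}_i \odot \hat{x}_i^* + \epsilon$. The data, the value of `eps` and the function name are illustrative choices of mine.

```python
import numpy as np

def regularized_asef(samples, targets, eps):
    """Average of per-sample exact filters, with eps added to each denominator."""
    x_hat = np.fft.fft(samples, axis=1)
    y_hat = np.fft.fft(targets, axis=1)
    exact = (y_hat * np.conj(x_hat)) / (np.abs(x_hat) ** 2 + eps)
    return exact.mean(axis=0)

n = 64
rng = np.random.default_rng(0)
base = rng.standard_normal(n)
t = np.arange(n)
peak = np.exp(-0.5 * (np.minimum(t, n - t) / 2.0) ** 2)

shifts = [0, 2, -3]
samples = np.stack([np.roll(base, s) + 0.05 * rng.standard_normal(n) for s in shifts])
targets = np.stack([np.roll(peak, s) for s in shifts])

h_conj_hat = regularized_asef(samples, targets, eps=1e-2)
# Each term is bounded by |y_hat| / (2 * sqrt(eps)), so eps caps the filter gain.
print(np.abs(h_conj_hat).max())
```

The difference from MOSSE is only the order of operations: ASEF divides per sample and then averages, while MOSSE averages numerator and denominator first and divides once.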

3.3 Kernelized Correlation Filters

The success of ASEF and MOSSE sparked the success of correlation filter tracking, but their performance is still limited, because in essence ASEF and MOSSE filters can be viewed as simple linear classifiers.
The motivation of KCF is to take advantage of the kernel trick to make the correlation filter more powerful.
The contribution of Henriques (author of the KCF TPAMI paper) is to show that correlation filters can be effectively kernelized by introducing ridge regression and circulant matrices.

3.3.1 Ridge Regression Problem

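Although it is called Kernelized Correlation Filters, the contribution of KCF is not the kernel trick itself, which is a natural step. In my view the breakthrough is recasting the original problem, correlating two image patches to obtain a confidence map of the same size, as a ridge regression problem. Because it is regression, each sample yields only a single value; to still end up with a matrix-valued confidence map, the authors then need to introduce circulant matrices.

Regression means $f(x_i)=y_i$. KCF uses Regularized Least Squares (RLS), i.e. ridge regression, and the training problem can be written as: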

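$$\min_{w} \sum_{i} L\big(y_i, f(x_i)\big) + \lambda \|w\|^2 \tag{8}$$

KCF uses the quadratic loss as its loss function, i.e.

$$L\big(y_i, f(x_i)\big) = \big(y_i - f(x_i)\big)^2$$

For the function $f(x_i)$, KCF uses the linear operation $f(x_i) = \langle w, x_i \rangle + b$.

The closed-form solution of formula (8) is then:

$$w = (X^T X + \lambda I)^{-1} X^T y \tag{9}$$

X is the matrix whose rows are the training samples.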
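
The closed-form ridge solution is short enough to check numerically. The sketch below drops the bias term b (an assumption on my part, to keep the formula in its usual matrix form) and verifies that the gradient of the regularized objective vanishes at the solution; the data are synthetic.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))               # one training sample per row
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.01 * rng.standard_normal(50)

lam = 1e-3
w = ridge_fit(X, y, lam)

# Up to a factor of 2, the gradient of sum((Xw - y)^2) + lam * ||w||^2.
gradient = X.T @ (X @ w - y) + lam * w
print(np.round(w, 2), np.abs(gradient).max())
```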

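A kernel function is introduced to improve performance: the input data x is mapped to a non-linear feature space $\psi(x)$, so that

$$w = \sum_{i} \alpha_i \psi(x_i)$$

With the help of the kernel function

$$\kappa(x, x') = \langle \psi(x), \psi(x') \rangle$$

the solution of formula (8) can be expressed as:

$$\alpha = (K + \lambda I)^{-1} y$$

where K is the kernel matrix with entries $K_{ij} = \kappa(x_i, x_j)$. The original linear function $f(x_i) = \langle w, x_i \rangle + b$ can then be written in non-linear form:

$$f(x) = \sum_{i} \alpha_i \kappa(x_i, x) \tag{10}$$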
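
A sketch of the dual (kernelized) solution, again without a bias term. With a linear kernel the dual predictions should coincide with primal ridge regression, which makes a convenient sanity check; the Gaussian kernel then reuses the same code path. Function names, data and parameter values are my own assumptions.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / sigma^2)."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma ** 2)

def linear_kernel_matrix(A, B):
    return A @ B.T

def kernel_ridge_fit(K, y, lam):
    """alpha = (K + lam * I)^{-1} y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
X_new = rng.standard_normal((5, 4))
lam = 0.1

# Linear kernel: dual predictions match primal ridge regression.
alpha_lin = kernel_ridge_fit(linear_kernel_matrix(X, X), y, lam)
pred_dual = linear_kernel_matrix(X_new, X) @ alpha_lin
w = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
print(np.allclose(pred_dual, X_new @ w))

# Gaussian kernel: f(x) = sum_i alpha_i * kappa(x_i, x) is now non-linear in x.
alpha = kernel_ridge_fit(gaussian_kernel_matrix(X, X, sigma=2.0), y, lam)
pred = gaussian_kernel_matrix(X_new, X, sigma=2.0) @ alpha
print(pred.shape)
```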

3.3.2 Circulant Matrix

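My own understanding is that the circulant matrix is used so that there are enough samples to support the kernel matrix. Otherwise, what could be done with a single sample? Hence the circulant construction.

$$X = C(x) = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ x_n & x_1 & \cdots & x_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ x_2 & x_3 & \cdots & x_1 \end{bmatrix}$$

Circulant matrices have many nice properties. 1) Their sums, products and inverses are also circulant, so in formula (9) everything except the rightmost y is circulant. 2) A circulant matrix can be diagonalized by the DFT of its base vector x, as shown below

$$X = F \,\mathrm{diag}(\hat{x})\, F^H$$

F is the DFT matrix. Combining these two properties, formula (9) can be written compactly as

$$w = F \,\mathrm{diag}\!\left( \frac{\hat{x}^*}{\hat{x}^* \odot \hat{x} + \lambda} \right) F^H y$$

Its even simpler frequency-domain equivalent is

$$\hat{w} = \frac{\hat{x}^* \odot \hat{y}}{\hat{x}^* \odot \hat{x} + \lambda} \tag{15}$$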
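
The two circulant properties can be checked numerically. The sketch below builds the data matrix from cyclic shifts of a random base vector, verifies the DFT diagonalization, and compares the dense ridge solution with the element-wise one. One caveat: which factor carries the complex conjugate depends on the DFT sign convention and on the direction of the shifts. With NumPy's `fft` and rows built by `np.roll(x, i)`, the element-wise solution comes out as $\hat{x} \odot \hat{y} / (|\hat{x}|^2 + \lambda)$, so that is what the code uses.

```python
import numpy as np

n = 16
rng = np.random.default_rng(0)
x = rng.standard_normal(n)            # base sample
y = rng.standard_normal(n)            # one regression target per shifted sample
lam = 0.1

# Data matrix whose i-th row is x cyclically shifted by i positions.
X = np.stack([np.roll(x, i) for i in range(n)])

# Unitary DFT matrix F, so that X = F diag(x_hat) F^H.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
x_hat = np.fft.fft(x)
X_diag = F @ np.diag(x_hat) @ F.conj().T

# Ridge regression solved densely, O(n^3).
w_dense = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# The same solution element-wise in the frequency domain, O(n log n).
w_hat = x_hat * np.fft.fft(y) / (np.abs(x_hat) ** 2 + lam)
w_fft = np.real(np.fft.ifft(w_hat))

print(np.allclose(X, X_diag), np.allclose(w_dense, w_fft))
```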

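Likewise, $\alpha$ can be computed just as quickly as in formula (15), provided the kernel matrix K is circulant. Is it? It is indeed, which takes only a few lines on paper to prove. So, letting k be the base vector of the circulant matrix K, we have:

$$\hat{\alpha} = \frac{\hat{y}}{\hat{k}^{xx} + \lambda}$$

The kernel k is computed between $x$ and $x'$. So far the kernel k has been abstract; below are its concrete forms.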

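polynomial kernel

$$\kappa(x, x') = \left( \langle x, x' \rangle + a \right)^b$$

The formula above is the inner product of the samples $x$ and $x'$ in the polynomial kernel space, which is a single number. In KCF specifically, the vector of inner products, in the polynomial kernel space, between a new sample $x'$ and the circulant matrix X (the n samples formed by shifting) can be written as

$$k^{xx'} = \left( \mathcal{F}^{-1}(\hat{x}^* \odot \hat{x}') + a \right)^b$$

Gaussian kernel

$$\kappa(x, x') = \exp\left( -\frac{1}{\sigma^2} \|x - x'\|^2 \right)$$

Likewise, the vector of inner products, in the Gaussian kernel space, between a new sample $x'$ and the circulant matrix X (the n samples formed by shifting) can be written as

$$k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \|x\|^2 + \|x'\|^2 - 2 \mathcal{F}^{-1}(\hat{x}^* \odot \hat{x}') \right) \right)$$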
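
Both kernel correlations can be checked against a brute-force loop over all cyclic shifts. The sketch assumes 1-D real signals, shifts implemented with `np.roll(x, i)` and NumPy's DFT convention; the function names and the parameter values are hypothetical.

```python
import numpy as np

def cross_correlation(x, z):
    """Entry i is the inner product <z, np.roll(x, i)>, computed with FFTs."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(z)))

def polynomial_kernel_correlation(x, z, a, b):
    return (cross_correlation(x, z) + a) ** b

def gaussian_kernel_correlation(x, z, sigma):
    sq_dist = np.dot(x, x) + np.dot(z, z) - 2.0 * cross_correlation(x, z)
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma ** 2)

n = 32
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
z = rng.standard_normal(n)

# Brute force: evaluate the kernel against every shifted copy of x.
poly_direct = np.array([(np.dot(z, np.roll(x, i)) + 1.0) ** 2 for i in range(n)])
gauss_direct = np.array([np.exp(-np.sum((z - np.roll(x, i)) ** 2) / 8.0 ** 2)
                         for i in range(n)])

print(np.allclose(polynomial_kernel_correlation(x, z, a=1.0, b=2), poly_direct),
      np.allclose(gaussian_kernel_correlation(x, z, sigma=8.0), gauss_direct))
```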

3.3.3 Detection

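In fact, formula (10) yields only a scalar. Stacking the n values $f(x_i)$ into a column vector turns formula (10) into $y = K\alpha$.

In the detection stage we already have a trained $\alpha$ and a maintained base sample $x$. Given a new sample $z$, a confidence map y can be obtained by:

$$y = \mathcal{F}^{-1}(\hat{k}^{xz} \odot \hat{\alpha})$$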
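
Putting training and detection together in 1-D gives a compact end-to-end sketch. Assumptions on my side: a Gaussian kernel, a unit-norm random base sample, a symmetric Gaussian label vector peaked at index 0, and illustrative values for `sigma` and `lam`. The function names are hypothetical.

```python
import numpy as np

def kernel_correlation(x, z, sigma):
    """Gaussian kernel between z and every cyclic shift of x."""
    cross = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(z)))
    sq_dist = np.dot(x, x) + np.dot(z, z) - 2.0 * cross
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma ** 2)

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^{xx} + lam)."""
    k_xx = kernel_correlation(x, x, sigma)
    return np.fft.fft(y) / (np.fft.fft(k_xx) + lam)

def detect(alpha_hat, x, z, sigma):
    """Confidence map: F^{-1}(k_hat^{xz} * alpha_hat)."""
    k_xz = kernel_correlation(x, z, sigma)
    return np.real(np.fft.ifft(np.fft.fft(k_xz) * alpha_hat))

n = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                          # unit-norm base sample
dist = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (dist / 2.0) ** 2)            # symmetric labels, peak at index 0

alpha_hat = train(x, y, sigma=0.5, lam=1e-4)
z = np.roll(x, 5)                               # the target moved by 5 samples
response = detect(alpha_hat, x, z, sigma=0.5)
print(int(np.argmax(response)))                 # index of the strongest response
```

Because the labels are symmetric around index 0, $\hat{\alpha}$ is real here and the conjugation question that shows up in the linear case does not arise.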

3.4 Dense Spatio-Temporal Context Tracker

3.5 Updating Scheme

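The training schemes above produce a correlation filter at every frame, so during tracking, how the current filter is combined with the previously trained ones is very important for building a robust appearance model. Most CFTs update by averaging; they differ only in how the average is taken.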

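For regularized ASEF, with $\eta$ denoting the learning rate and t the frame index:

$$\hat{h}_t^* = \eta \frac{\hat{y}_t \odot \hat{x}_t^*}{\hat{x}_t \odot \hat{x}_t^* + \epsilon} + (1 - \eta)\, \hat{h}_{t-1}^*$$

For MOSSE, the numerator and the denominator of formula (15) are averaged separately:

$$\hat{h}_t^* = \frac{A_t}{B_t}, \qquad A_t = \eta\, \hat{y}_t \odot \hat{x}_t^* + (1 - \eta) A_{t-1}, \qquad B_t = \eta\, \hat{x}_t \odot \hat{x}_t^* + (1 - \eta) B_{t-1}$$

For KCF, the update is done on $\alpha$ in the frequency domain:

$$\hat{\alpha}_t = \eta \frac{\hat{y}}{\hat{k}^{zz} + \lambda} + (1 - \eta)\, \hat{\alpha}_{t-1}$$

where z is the new sample extracted from the currently predicted position.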
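
The MOSSE-style update, where numerator and denominator are averaged separately, can be sketched as a small class. The class name, the default learning rate of 0.125 and the small `eps` added to the denominator are my own assumptions for this illustration, not values taken from the survey.

```python
import numpy as np

class RunningMosse:
    """Keeps running averages of the filter's numerator and denominator."""

    def __init__(self, learning_rate=0.125, eps=1e-3):
        self.learning_rate = learning_rate
        self.eps = eps
        self.numerator = None
        self.denominator = None

    def update(self, x, y):
        x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
        new_num = y_hat * np.conj(x_hat)
        new_den = np.abs(x_hat) ** 2
        if self.numerator is None:              # first frame: plain initialization
            self.numerator, self.denominator = new_num, new_den
        else:                                   # later frames: linear interpolation
            eta = self.learning_rate
            self.numerator = eta * new_num + (1 - eta) * self.numerator
            self.denominator = eta * new_den + (1 - eta) * self.denominator

    def filter_conj_hat(self):
        return self.numerator / (self.denominator + self.eps)

n = 64
rng = np.random.default_rng(0)
dist = np.minimum(np.arange(n), n - np.arange(n))
y = np.exp(-0.5 * (dist / 2.0) ** 2)

model = RunningMosse()
appearance = rng.standard_normal(n)
for _ in range(10):                             # slowly drifting appearance
    appearance = appearance + 0.05 * rng.standard_normal(n)
    model.update(appearance, y)

response = np.real(np.fft.ifft(np.fft.fft(appearance) * model.filter_conj_hat()))
print(int(np.argmax(response)))
```

A small learning rate lets old frames fade out exponentially, which also explains why such a model has no long-term memory of the target.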

4. Further improvements

Efforts to improve the robustness of CFTs focus mainly on the following aspects:

representing features
handling scale variations
applying part-based strategy
cooperating with long-term tracking

4.1 Feature Representation

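Early CFTs such as MOSSE and CSK use raw pixels, and noise severely limits their performance.

Apparently, features with multiple channels can be more representative and informative. Integrating multi-channel features into KCF is straightforward. Taking the Gaussian kernel function as an example, it takes the following form:

$$k^{xx'} = \exp\left( -\frac{1}{\sigma^2} \left( \|x\|^2 + \|x'\|^2 - 2 \mathcal{F}^{-1}\Big( \sum_{c} \hat{x}_c^* \odot \hat{x}'_c \Big) \right) \right)$$

c is the channel index. HOG features have been used very successfully in KCF. Besides HOG, there are also color names.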
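
For multi-channel features the only change is that the cross-spectra are summed over channels before the inverse transform. A sketch with 1-D signals of shape `(channels, n)`, checked against a brute-force loop; the function name and all parameter values are assumptions for illustration.

```python
import numpy as np

def multichannel_gaussian_correlation(x, z, sigma):
    """x, z have shape (channels, n). Entry i is kappa(z, x shifted by i samples)."""
    x_hat = np.fft.fft(x, axis=1)
    z_hat = np.fft.fft(z, axis=1)
    # Sum the per-channel cross-spectra, then a single inverse FFT.
    cross = np.real(np.fft.ifft(np.sum(np.conj(x_hat) * z_hat, axis=0)))
    sq_dist = np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma ** 2)

channels, n = 3, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((channels, n))
z = rng.standard_normal((channels, n))

# Brute force: shift all channels together and measure the full distance.
direct = np.array([np.exp(-np.sum((z - np.roll(x, i, axis=1)) ** 2) / 10.0 ** 2)
                   for i in range(n)])
print(np.allclose(multichannel_gaussian_correlation(x, z, sigma=10.0), direct))
```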

4.2 Handling Scale Variations

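Conventional CFTs such as MOSSE and KCF use a fixed-size window and cannot handle changes in target size.
SAMF and DSST use a searching strategy to estimate the target scale. At each step, windows of different sizes are sampled and correlated with the learned filter; the window with the highest correlation is selected.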
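
The searching strategy can be sketched as a loop over candidate scales. This is a simplified stand-in, not the actual SAMF or DSST procedure: the learned filter is replaced by a plain normalized template correlation, windows are resized with nearest-neighbour sampling, and the scale set, the helper names and the synthetic frame are all hypothetical.

```python
import numpy as np

def crop_and_resize(image, center, size, scale, out_size):
    """Nearest-neighbour sample of a (scale * size)-wide window into out_size pixels."""
    half = scale * size / 2.0
    rows = np.linspace(center[0] - half, center[0] + half, out_size, endpoint=False)
    cols = np.linspace(center[1] - half, center[1] + half, out_size, endpoint=False)
    rows = np.clip(np.round(rows).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.round(cols).astype(int), 0, image.shape[1] - 1)
    return image[np.ix_(rows, cols)]

def normalise(patch):
    """Zero mean, unit norm, so correlation scores are comparable across windows."""
    patch = patch - patch.mean()
    return patch / (np.linalg.norm(patch) + 1e-12)

def peak_correlation(template, candidate):
    """Peak of the circular cross-correlation of two normalized patches."""
    spectrum = np.fft.fft2(normalise(candidate)) * np.conj(np.fft.fft2(normalise(template)))
    return float(np.real(np.fft.ifft2(spectrum)).max())

def best_scale(image, center, size, template, scales):
    """Score every candidate scale and return the one with the highest peak."""
    scores = [peak_correlation(template,
                               crop_and_resize(image, center, size, s, template.shape[0]))
              for s in scales]
    return scales[int(np.argmax(scores))], scores

rng = np.random.default_rng(0)
frame = rng.random((128, 128))
template = crop_and_resize(frame, (64, 64), 32, 1.0, 32)

# Same frame, so the unscaled window should match the template best.
scale, scores = best_scale(frame, (64, 64), 32, template, [0.8, 1.0, 1.25])
print(scale)
```

Resizing every candidate to the template size is what lets a fixed-size filter be reused across scales.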

4.3 Part-based Tracking

Part-based tracking algorithms do not learn a holistic appearance model but track the target by its local appearance. The motivation is that if the target is partially occluded, its remaining visible parts can still represent the target and thus the tracker is able to continue tracking.

4.4 Long-term Tracking

Besides partial occlusion, another vital challenge in visual tracking is the absence of the target, i.e. the target partly or completely disappears from view. In this case a CFT will easily start tracking something other than the target, because CFT algorithms are designed without a long-term component, which is obvious from the updating scheme in 3.5. As a consequence, introducing long-term tracking methods is believed to be favorable for improving correlation filter-based tracking methods.

There are currently two main approaches to long-term tracking:

One introduces a re-detection module that detects the target again once it is lost (TLD is the representative).
The other conservatively learns the target appearance from reliable frames with a self-paced learning scheme (MUSTer is the representative).

5. Experiments

Judging from the experimental results, MUSTer is the most promising.
