Notes on CVPR-16-Weakly Supervised Deep Detection Networks

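This is a CVPR 2016 paper that gives a simple and elegant end-to-end architecture for weakly supervised object detection, and it has been quite influential. Some follow-up work, such as ContextLocNet, is built within this paper's framework. The papers that cite it are generally of high quality and well worth reading too.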

Code: https://github.com/hbilen/WSDDN

One sentence in the paper strikes me as especially well put: the models in weakly supervised object detection are designed for detection, yet what they are ultimately trained on (the cost function) is image classification.

While WSDDN is primarily designed for weakly supervised object detection, ultimately it is trained to perform image classification.

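Ha, isn't that earning a cabbage seller's wages while carrying a drug dealer's worries? But that is exactly what weakly supervised learning is: you feed it grass (image-level annotation) and milk it for object-level annotation. How to raise such a cow and actually get milk out of it is the job of us milkers (the algorithm designers). Machine learning has four parts: data, model, cost function and optimization. For typical CV work, the data is ImageNet + VOC; because the network is end-to-end, the optimization is just BP; that leaves the model and the cost function as the parts to focus on. The same goes when reading other people's papers: look at how they design these two parts.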

1. Contribution

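The biggest contribution of this paper is probably the end-to-end paradigm itself. Earlier methods were mostly multi-stage, and end-to-end training is much more convenient, which is why so many high-quality later papers cite or build on this one.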

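In fact, although the paper never says so explicitly, the method appears to borrow from the Fast R-CNN architecture, or at least to be heavily inspired by it. Below are the network architectures of WSDDN and Fast R-CNN: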


Figure: WSDDN network architecture

Figure: Fast R-CNN network architecture

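Similarities:

Like Fast R-CNN, WSDDN uses a VGG-16 style backbone. How can you tell? Look at pool5, followed by fc6, fc7 and finally fc8; that is the VGG-16 layout. See the VGG-16 architecture on Netscope for the details.
Both use ROI pooling (a special case of SPP).
In both, fc8 after fc7 splits into two branches, one for classification and one for detection.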

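Differences:

Fast R-CNN is supervised and has object-level supervision, so each branch has its own loss and the final cost function is simply the sum of the two. WSDDN is weakly supervised and only has image-level labels, so there can be only one loss at the end: whether the image contains an object of a given class. The classification and detection branches therefore have to merge into an image-level output on which the loss is computed (in the authors' words, to inject image-level supervision in learning).
In Fast R-CNN, the softmax and everything else after ROI pooling is computed separately for each ROI. In WSDDN the ROIs are processed together and are not independent: the result is a C × |R| matrix, where C is the number of classes and |R| is the number of region proposals.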

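Compared with Fast R-CNN, the two data streams of WSDDN may look unremarkable. Compared with earlier work on WSL object detection, however, the two-stream architecture provides a dedicated parallel detection branch that is independent of the recognition branch, breaking with the earlier paradigm of detection by region classification.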

2. Prior Works

Current weakly supervised detection methods fall roughly into two groups: those that formulate the task as MIL, and those based on the idea of identifying the similarity between image parts.

2.1 MIL strategy

This family is clearly the mainstream: the majority of existing approaches to WSD formulate this task as MIL.

The idea, or paradigm, is roughly as follows:

In this formulation an image is interpreted as a bag of regions.
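- This is a very common view; both sliding windows and region proposals treat an image this way.
- The more important assumption is probably that a region contains one complete object.
If the image is labeled as positive, then one of the regions is assumed to tightly contain the object of interest. If the image is labeled as negative, then no region contains the object.
- This effectively specifies the cost function: in the end it has to be expressed in terms of whether the image contains the class recognized in its regions. If it does, that prediction is encouraged; if not, its probability is pushed down. MIL embeds the supervision carried by the labels into the model by rewarding and penalizing according to the positive and negative image labels.
- Supervised learning rewards or penalizes based on whether this region contains an object of the class; weakly supervised learning does so based on whether the image containing this region contains an object of the class. That is the whole difference.
- What weakly supervised learning lacks, compared with supervised learning, is the information about which region the object is in.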
Learning alternates between estimating a model of the object appearance and selecting which regions in the positive bags correspond to the object using the appearance model.
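- The MIL paradigm performs detection through the appearance model, that is, by recognizing the class of each region. This is the detection by region classification paradigm, as opposed to outputting coordinates directly as bounding box regression does.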
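
To make the alternation concrete, here is a minimal toy sketch in NumPy. It is not the method of any of the cited papers: the logistic-regression appearance model, the random initialization, and the names `mil_alternate`, `pos_bags` and `neg_bags` are all my own assumptions for illustration.

```python
import numpy as np

def mil_alternate(pos_bags, neg_bags, n_rounds=5, lr=0.1, n_steps=200, seed=0):
    """Toy MIL loop (illustrative only).

    pos_bags / neg_bags: lists of (n_regions, dim) feature arrays, one per image.
    Alternates between (a) picking one region per positive bag and
    (b) refitting a linear appearance model on the picked regions.
    """
    rng = np.random.default_rng(seed)
    dim = pos_bags[0].shape[1]
    w, b = np.zeros(dim), 0.0
    # Initialization (hypothetical choice): a random region per positive bag.
    picks = [int(rng.integers(len(bag))) for bag in pos_bags]
    neg = np.vstack(neg_bags)  # every region of a negative bag is a negative
    for _ in range(n_rounds):
        # (b) Fit logistic regression on the current picks vs. all negatives.
        pos = np.stack([bag[i] for bag, i in zip(pos_bags, picks)])
        X = np.vstack([pos, neg])
        y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
        for _ in range(n_steps):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
            w -= lr * X.T @ (p - y) / len(y)
            b -= lr * np.mean(p - y)
        # (a) Re-select the highest-scoring region in each positive bag.
        picks = [int(np.argmax(bag @ w + b)) for bag in pos_bags]
    return w, b, picks
```

Because the selection step depends on the current model and the model depends on the current selection, a poor initial pick can lock the loop into a poor solution, which is the local-optimum issue discussed next.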

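The drawback of the MIL strategy is also clear: it is inherently a non-convex optimization problem, so solvers tend to get stuck in local optima. Deep networks have the same problem, since they are also non-convex, and just as with deep networks the quality of the final solution depends heavily on initialization and regularization. A large body of follow-up work therefore concentrates on developing various initialization strategies and on regularizing the optimization problem. When reading MIL-based weakly supervised learning papers, these are the two things to look at.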

Work on initialization strategies includes:

M. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In NIPS, pages 1189–1197, 2010.
- propose a self-paced learning strategy that progressively includes harder samples to a small set of initial ones at training.
T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In ECCV, pages 452–466. 2010.
- initialize object locations based on the objectness score.
H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly supervised discovery of visual pattern configurations. In NIPS, pages 1637–1645, 2014.
R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. arXiv preprint arXiv:1503.00949, 2015.

Work on regularization includes:

H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with posterior regularization. In BMVC, 2014.
- propose a smoothed version of MIL that softly labels object instances instead of choosing the highest scoring ones.
H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, pages 1611–1619, 2014.

2.2 Identifying the similarity between image parts

The idea of identifying the similarity between image parts is another line of research in weakly supervised detection.

H. O. Song, R. Girshick, S. Jegelka, J. Mairal, Z. Harchaoui, and T. Darrell. On learning to localize objects with minimal supervision. In ICML, pages 1611–1619, 2014.
H. O. Song, Y. J. Lee, S. Jegelka, and T. Darrell. Weakly supervised discovery of visual pattern configurations. In NIPS, pages 1637–1645, 2014.
C. Wang, W. Ren, K. Huang, and T. Tan. Weakly supervised object localization with latent category learning. In ECCV 2014, volume 8694, pages 431–445, 2014.
H. Bilen, M. Pedersoli, and T. Tuytelaars. Weakly supervised object detection with convex clustering. In CVPR, 2015.

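Beyond that, I picked up a small fact from cognitive science: the human visual system has two streams, ventral and dorsal, one focusing on recognition and the other one on localization. This is one of the motivations for the paper's two-stream design. The recognition and localization streams have a strong multi-task flavor. Fast R-CNN uses a multi-task loss; WSDDN cannot, because it only has image-level labels, but that does not prevent the intermediate computation from being multi-task. This may also help explain why the method works well, since it has been shown repeatedly in holistic scene understanding that multiple tasks can reinforce one another.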

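The authors regard Spatial Transformer Networks as a piece of weakly supervised detection work as well, which is a fresh point of view; their reason is quoted below. Given only image-level labels, the network ends up finding the object and aligning it, so seen this way it does count as WSL.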

This “transformer network”, which is trained in an end-to-end fashion from image-level labels, is shown to align objects to a common reference frame, which is a proxy to detection.

3. Algorithm

3.1 Model / Network Architecture

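How should we read the sentence "modify the SPP layer to take as input not a single region, but rather the full list R"? Again, go back to Fast R-CNN. Conceptually, Fast R-CNN takes one region, pools it, and then applies softmax and bounding box regression, one region at a time. This paper does not: because of Eq. (1) and Eq. (2), it first pools all regions in one go and then passes them through the fully connected layers together. In practice there is not much difference. The point is simply that the scores cannot be finalized until all regions are available (strictly speaking it is Eq. (2), which normalizes over regions, that needs them all), so this is the natural way to do it once those two equations are in place.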

Classification stream

For a given region, compute the probability that it belongs to each of the C classes.

The first stream associates a class score $\phi^c(x; R)$ to each region individually, performing recognition.

Detection stream

For a given class, compute the probability (belief) of each region among all regions; this does have a detection flavor.

The second stream, instead, compares regions by computing a probability distribution $\phi^d(x; R)$ over them; the latter represents the belief that, among all the candidate regions in the image, R is the one that contains the most salient image structure, and is therefore a proxy to detection.

In the first case, in fact, the softmax operator compares, for each region independently, class scores, whereas in the second case the softmax operator compares, for each class independently, the scores of different regions. Hence, the first branch predicts which class to associate to a region, whereas the second branch selects which regions are more likely to contain an informative image fragment.
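
The difference between the two streams comes down to the axis along which the softmax is taken. A minimal NumPy sketch, assuming both branches output a C × |R| matrix of raw scores (the names `x_c`, `x_d` and `two_stream_scores` are mine, not the authors' code):

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_scores(x_c, x_d):
    """x_c, x_d: (C, R) raw scores from the two fc8 branches (assumed layout)."""
    # Classification stream: for each region (column), a distribution over classes.
    sigma_class = softmax(x_c, axis=0)
    # Detection stream: for each class (row), a distribution over regions.
    sigma_det = softmax(x_d, axis=1)
    return sigma_class, sigma_det
```

Each column of `sigma_class` sums to 1 and each row of `sigma_det` sums to 1, which is exactly the "for each region independently" versus "for each class independently" distinction in the quote.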

Combined region scores and detection

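The two streams above each produce a C × |R| matrix. Their element-wise (Hadamard) product is the final C × |R| score matrix. You can think of the detection stream's matrix as modulating the classification matrix, or the other way around.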

Image-level classification scores

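Because we only have image-level labels, the model's output should be image-level too. The image-level score of a class is therefore the sum of that class's scores over all regions. In Eq. (2) alone, summing over all regions gives exactly 1. After modulation by the Hadamard product, the sum becomes a weighted average of the Eq. (1) class probabilities, with the Eq. (2) values as weights. Since every softmax output lies strictly between 0 and 1 (given more than one class), the score always lies in (0, 1).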
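
A quick way to convince yourself of the (0, 1) bound is to compute it. The sketch below assumes the same C × |R| layout as above and uses hypothetical names (`image_level_scores`, `x_c`, `x_d`); it is my reading of the description, not the released implementation.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def image_level_scores(x_c, x_d):
    """Return the (C, R) region scores and the (C,) image-level scores."""
    sigma_class = softmax(x_c, axis=0)        # over classes, per region
    sigma_det = softmax(x_d, axis=1)          # over regions, per class
    region_scores = sigma_class * sigma_det   # Hadamard product
    y = region_scores.sum(axis=1)             # sum over regions
    return region_scores, y
```

Since each row of the detection softmax sums to 1, `y[c]` is a convex combination of the classification probabilities in row `c`, so it can never leave the range spanned by them.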

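So the output is not a binary hard decision but a soft value in (0, 1). The soft value can be used directly in the cost function, but at the end we still need a binary answer on whether an object is present. The authors do not mention a threshold; since the score can be read as a probability, 0.5 is probably a reasonable choice.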

3.2 Cost function

The cost function is still an image-level cost function.
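
As a rough illustration, an image-level cost of this kind can be written as a sum of per-class binary log-losses on the image-level scores. This is my reading, with labels assumed to be in {-1, +1}; the weight-decay term is left out and the function name `image_level_log_loss` is hypothetical.

```python
import numpy as np

def image_level_log_loss(y_scores, labels, eps=1e-12):
    """Sum of per-class binary log-losses for one image (a sketch).

    y_scores: (C,) image-level scores in (0, 1).
    labels:   (C,) image-level labels in {-1, +1} (assumed convention).
    """
    y_scores = np.asarray(y_scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # labels * (y - 0.5) + 0.5 equals y for a positive label
    # and 1 - y for a negative one.
    p_correct = labels * (y_scores - 0.5) + 0.5
    return -np.sum(np.log(np.clip(p_correct, eps, 1.0)))
```

Note that nothing in this loss refers to regions or boxes: localization only emerges because the image-level score is built from region scores.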

Spatial Regularizer

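The authors' key observation, or motivation, here is a good one: overlapping regions should exhibit spatial smoothness. Since they overlap spatially, their object scores should vary continuously, all the more so because the scores here are soft values between 0 and 1.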

As WSDDN is optimized for image-level class labels, it does not guarantee any spatial smoothness such that if a region obtains a high score for an object class, the neighboring regions with high overlap will also have high scores.

a soft regularization strategy that penalizes the feature map discrepancies between the highest scoring region and the regions with at least 60% IoU during training:
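
The regularizer described in the quote can be sketched for a single image and a single class as follows. The squared-score weighting and the 0.5 factor are how I recall the paper's formula and should be treated as assumptions, as should the helper names `iou_one_to_many` and `spatial_regularizer`; `feats` stands in for the per-region fc7 features.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (R, 4) array; boxes are [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def spatial_regularizer(scores, feats, boxes, iou_thresh=0.6):
    """Penalty for one class in one positive image (a sketch).

    scores: (R,) region scores for the class; feats: (R, D); boxes: (R, 4).
    """
    top = int(np.argmax(scores))                       # highest-scoring region
    near = iou_one_to_many(boxes[top], boxes) >= iou_thresh
    diff = feats[near] - feats[top]                    # feature discrepancy
    sq_dist = np.sum(diff * diff, axis=1)
    # Assumed weighting: squared score of each neighbouring region.
    return 0.5 * np.sum(scores[near] ** 2 * sq_dist)
```

The top region always matches itself with IoU 1 but contributes zero, so only its overlapping neighbours are pulled towards it.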

3.3 Optimization

Since the network is end-to-end, it can of course be trained with plain BP.
