Notes on CVPR-14-R-CNN

Change Logs

Updated on 2018-05-02: Added the Bounding Box Regression section.
First Commit on 2018-01-27.

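CVPR-14 "Rich feature hierarchies for accurate object detection and semantic segmentation" is the paper that introduced the famous R-CNN. The first author is Ross Girshick, a.k.a. rbg. rbg was a PhD student of Pedro Felzenszwalb, and in fact his first one. Felzenszwalb's best-known work is probably DPM, the 2010 TPAMI paper, with Felzenszwalb as first author and rbg as second. So a master is a master: deep learning or not, the work is always excellent. Felzenszwalb has also done a lot of very nice work on Grammar Models, which is well worth following.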

Code for the paper: https://github.com/rbgirshick/rcnn

1. Problem & Background

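Because this paper has been so influential, and so many of its ideas have become the foundation of current work, it can feel rather plain when read today. To appreciate it properly, we have to go back to the context of its time. It is a CVPR 2014 paper, but it was posted to arXiv in November 2013, right at the CVPR submission deadline.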

At that time, AlexNet had already achieved huge success on ImageNet 2012 Single-label Image Classification. The natural next question was how to apply CNNs to other settings such as Multi-label Image Classification and Object Detection. These are different tasks, though, and there is a real gap between Single-label Image Classification on one side and Multi-label Image Classification or Object Detection on the other. I covered one attempt at Multi-label Image Classification yesterday; see Notes on TPAMI-16-HCP A Flexible CNN Framework for Multi-label Image Classification. This paper is about how to use CNNs for Object Detection, which makes R-CNN one of the early adopters of pre-training + fine-tuning.

1.1 Motivation

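Given the huge success of CNNs on classification, the natural idea is to see whether they can be extended to other tasks as well. What is the motivation behind this natural idea? Features.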

Right at the start of the introduction, rbg writes a passage that I find particularly beautiful. I quote it here:

Features matter. The last decade of progress on various visual recognition tasks has been based considerably on the use of SIFT [27] and HOG [7]. But if we look at performance on the canonical visual recognition task, PASCAL VOC object detection [13], it is generally acknowledged that progress has been slow during 2010-2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

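The authors attribute progress on concrete tasks to progress in feature design and extraction. When features stop improving, as in 2010-2012, performance on those tasks nearly stalls. CNNs succeeded so dramatically on classification because they extract better features. If those better CNN features can be used elsewhere, existing methods on other tasks should improve substantially too.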

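Hehe, this is a very typical CV person's view. Unlike ML people, who worry all day about generalization and bounds, CV people think the classifier hardly matters: data and features are the key. With plenty of data and good features, performance follows. The authors have contributed a fine piece of ammunition for pushing back against the disdain of our ML friends, haha.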

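So much for the motivation that better features give better results. The authors also offer a neuroscience explanation for why CNN features should be used: a CNN extracts hierarchical features, and human recognition is also based on hierarchical features.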

recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.

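The quote above argues that CNN features are in harmony with nature. The next one explains why SIFT and HOG are not good enough: from the hierarchical-features point of view, they are only first-layer features: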

SIFT and HOG are block-wise orientation histograms, a representation we could associate roughly with complex cells in V1, the first cortical area in the primate visual pathway.

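This motivation also shows that, in this paper, the authors treat the CNN purely as a feature extractor. So rather than reading R-CNN as Region-based Convolutional Neural Network, I think it is better read as Regions with CNN features. I say this because, before reading the paper, I assumed R-CNN was simply Region Proposal + CNN, with the CNN handling both feature extraction and classification. It is not: R-CNN is Region Proposal + CNN + Category-specific Linear SVMs, and the CNN is only used to extract hierarchical features.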

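This also explains why the title ends with semantic segmentation. Once you can train a CNN to learn good features on the small dataset of the task at hand, you can swap those features in for SIFT or HOG in existing semantic segmentation methods. That is indeed how the segmentation part of this paper is done.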

1.2 Challenges

At the time, the gap between using a CNN for Object Detection and for Single-label Image Classification, in other words what made the problem challenging, came down to two things:

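First, nobody had used a CNN for Object Detection before. Unlike Image Classification, Object Detection requires localizing objects within an image. How should this location information be produced? The CNN of the day, AlexNet, could only output the class of an image.
Second, labeled data for detection is scarce. AlexNet succeeded on Single-label Image Classification because ImageNet happens to provide labeled classification images by the million, but for Object Detection even the largest VOC dataset of the time is nowhere near enough to train a network with that many parameters.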

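Haha, I know that by this point, even if you have never read this paper, or have barely read any deep learning papers at all, deep learning is the dominant field of the day and you are bombarded with it constantly, so the answers pop into your head right away: region proposals and pre-training + fine-tuning. These two are exactly the contributions of this paper.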

1.3 Contributions

The paper makes two contributions:

Combining region proposals with CNNs in order to localize objects with a deep network.
The paradigm of supervised pre-training for an auxiliary task + domain-specific fine-tuning, which solves the problem of training a large CNN with insufficient data.

Both contributions are major. For Object Detection, along the line of R-CNN, Fast R-CNN and Faster R-CNN, R-CNN is the founding work and cannot be skipped. Pre-training + fine-tuning is even more significant: its influence reaches far beyond Object Detection.

Note that quite a few other papers posted to arXiv in 2013 also used pre-training, so competition in this area was clearly fierce. For a list of these related papers, see Notes on TPAMI-16-HCP A Flexible CNN Framework for Multi-label Image Classification.

2. Method

2.1 Data

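For pre-training: discriminatively pre-trained the CNN on a large auxiliary dataset (ILSVRC 2012) with image-level annotations, i.e. a pretrained AlexNet
For fine-tuning: Domain-specific fine-tuning on VOC
- This is VOC Detection, not VOC Classification.
- Note that Object Regions and Background Regions are fed to the CNN, not whole images, and every Region Proposal is resized to $227 \times 227$ to match the input size AlexNet requires (a constraint imposed by the fully-connected layers). So it is enough to change AlexNet's final 1000-way classification layer to the 21 VOC classes (20 object classes plus background); the rest of the architecture stays the same. The idea behind R-CNN is Object Detection by Region Proposal Classification, so a classification network can be used directly.
- Each mini-batch has size 128: 32 positive windows, sampled uniformly over all classes, and 96 background windows. Background windows outnumber object windows because that is how the data really looks. In fact the paper notes that this 1:3 ratio is already biased toward positives, since they are extremely rare compared to background.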
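For Classification: each category needs its own positive and negative sets
- For a proposal that is entirely background, or that tightly encloses an object, deciding between positive and negative is easy.
- For a proposal that only partially overlaps an object, the IoU is computed and compared with a threshold of 0.3; regions below the threshold are treated as negatives. The authors selected this value by grid search on a validation set, and later work seems to have largely kept 0.3.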
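The IoU test and the 32/96 mini-batch split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(x1, y1, x2, y2)` box format and the helper names `iou`, `is_svm_negative` and `sample_minibatch` are assumptions made for this example, and the sampler draws positives from one pooled list rather than balancing classes.

```python
import random

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_svm_negative(proposal, gt_boxes, thresh=0.3):
    """True if the proposal overlaps every ground-truth box of the class
    by less than `thresh` (0.3 is the value quoted in the notes)."""
    return all(iou(proposal, g) < thresh for g in gt_boxes)

def sample_minibatch(positives, backgrounds, n_pos=32, n_bg=96, rng=random):
    """Draw 32 positive and 96 background windows (128 in total).
    Simplification: positives come from one pooled list."""
    return rng.sample(positives, n_pos) + rng.sample(backgrounds, n_bg)
```

For instance, a proposal shifted by half its width against a ground-truth box of the same size has IoU 1/3, so with the 0.3 threshold it is not used as a negative.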

2.2 Model

R-CNN = Region Proposal + CNN + Category-specific Linear SVMs + Non-Maximum Suppression + Bounding Box Regression

The first generates category-independent region proposals.
The second module is a large convolutional neural network that extracts a fixed-length feature vector from each region.
The third module is a set of class-specific linear SVMs.

Module 1: Region Proposal

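Region proposals are generated with Selective Search (SS). The paper does not claim that SS has any particular advantage; the authors chose it so that they could compare against existing methods. Any other region proposal method, such as BING, could therefore be used instead.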

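As I currently understand it, detection is really just region classification: once you know which regions contain which classes of object, detection is done. The regions themselves can come from a sliding window or from a region proposal method. To my mind, a sliding window is just a particularly simplified form of region proposal.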
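To make the "sliding window as a simple region proposal" view concrete, here is a small sketch that enumerates fixed-size windows as candidate boxes. The function name `sliding_windows` and the `(x1, y1, x2, y2)` box format are assumptions for this illustration; a real sliding-window detector would also vary scale and aspect ratio.

```python
def sliding_windows(img_w, img_h, win_w, win_h, stride):
    """Enumerate fixed-size windows as (x1, y1, x2, y2) candidate regions."""
    boxes = []
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            boxes.append((x, y, x + win_w, y + win_h))
    return boxes
```

Each returned box could be classified exactly like a Selective Search proposal; the only difference is how the candidate set is produced.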

Module 2: Feature Extraction

The CNN is there to extract features. Its output layer is a softmax, but that is only used while fine-tuning the CNN; the final classification is still done by SVMs.

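There is one technical detail here. At the time, a CNN required inputs of a fixed size, yet the regions produced by a region proposal method all differ in size and shape. How do you feed such irregular regions into a CNN that needs a fixed-size input? The authors handle it in a simple, brute-force way: Regardless of the size or aspect ratio of the candidate region, we warp all pixels in a tight bounding box around it to the required size.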
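A minimal sketch of this warping step, assuming the image is a plain list of pixel rows and using nearest-neighbor sampling; the function name `warp_region` and the exclusive `(x1, y1, x2, y2)` box convention are choices made for this example, not the paper's implementation.

```python
def warp_region(image, box, out_size=227):
    """Crop `box` = (x1, y1, x2, y2), with x2 and y2 exclusive, from `image`
    (a list of rows) and stretch it to out_size x out_size pixels,
    ignoring the original aspect ratio (nearest-neighbor sampling)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    out = []
    for r in range(out_size):
        src_row = image[y1 + r * h // out_size]   # source row for output row r
        out.append([src_row[x1 + c * w // out_size] for c in range(out_size)])
    return out
```

The aspect ratio is deliberately not preserved: a wide region and a tall region both end up as the same square input.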

Module 3: Classification

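The SVMs do the classification. Which raises the question: why bolt SVMs on at the end instead of just using the softmax? The authors address this in the supplement: the results drop noticeably (the reported mAP on VOC 2007 falls from 54.2% to 50.9%, a bit over 3 points). They give two reasons: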

the definition of positive examples used in fine-tuning does not emphasize precise localization
the softmax classifier was trained on randomly sampled negative examples rather than on the subset of “hard negatives” used for SVM training.

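As we now know, though, once end-to-end training took over, hardly anyone attached an SVM like this any more, and softmax is used instead. The supplement also discusses two related questions: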

why the positive and negative examples are defined differently in fine-tuning versus SVM training
why it’s necessary to train detection classifiers rather than simply use outputs from the final layer (fc8) of the fine-tuned CNN

I never fully worked these out, but I no longer care, because it has become moot: later models all use softmax for the sake of end-to-end training.

Module 4: Non-Maximum Suppression

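This step removes duplicate proposals. R-CNN therefore does a great deal of redundant computation: it first generates a large number of proposals and then throws most of them away at the end. It would be much better to generate a small number of high-quality proposals from the start, and that is the motivation for the later improvement (Faster R-CNN).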

Given all scored regions in an image, we apply a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-over-union (IoU) overlap with a higher scoring selected region larger than a learned threshold.
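The greedy procedure in the quote can be sketched like this for a single class. The paper speaks of a learned threshold; the default `iou_thresh=0.3` below is only a placeholder assumed for this example, and the `(x1, y1, x2, y2)` box format and function names are likewise illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression for one class.
    Returns indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Reject box i if it overlaps an already selected, higher-scoring
        # box by more than the threshold.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep
```

Running it once per class, as the quote specifies, means a region suppressed for one class can still survive as a detection for another.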

Module 5: Bounding Box Regression

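I skipped this part when I first wrote these notes. But one-stage methods such as YOLO and SSD are now increasingly common and work very well in practice, so Bounding Box Regression deserves a careful second look. The authors say in the paper that the idea is borrowed from DPM.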

Inspired by the bounding box regression employed in DPM [15], we train a linear regression model to predict a new detection window given the pool5 features for a selective search region proposal.

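The four steps above only pick out proposals that contain an object. They give no guarantee that a proposal encloses the object tightly, that is, no guarantee about the quality of its bounding box. Bounding Box Regression takes the proposals we are already fairly confident about and refines their bounding boxes further.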

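Note that Bounding Box Regression does not directly regress the coordinates and size of the bounding box. It regresses the offset from the center of the current proposal to the center of the ground-truth box (normalized by the proposal's width or height), and the log of the ratio between the ground-truth width or height and the proposal's width or height.

Given N training pairs $\{(P^i, G^i)\}_{i = 1, \ldots, N}$, $P^i = (P^i_x, P^i_y, P^i_w, P^i_h)$ holds the x and y pixel coordinates of the proposal's center together with its width w and height h. These values are not inputs to the regression model: its only input is the pool5 feature that the CNN computes for the proposal, written $\phi_5(P)$ in the paper. Likewise, the ground-truth box $G^i = (G^i_x, G^i_y, G^i_w, G^i_h)$ is not the regression output; the actual output is the offset and log-ratio described above.

Offsets and ratios are used for a reason: they give a scale-invariant translation of the center of the bounding box, and log-space translations of its width and height. With one linear function $d_\star(P) = \mathbf{w}_\star^{\top} \phi_5(P)$ for each $\star \in \{x, y, w, h\}$, the linear model maps a proposal $P$ to a predicted box $\hat{G}$ as follows (notation as in the paper's bounding-box regression appendix):

$$\hat{G}_x = P_w d_x(P) + P_x, \qquad \hat{G}_y = P_h d_y(P) + P_y, \qquad \hat{G}_w = P_w \exp(d_w(P)), \qquad \hat{G}_h = P_h \exp(d_h(P))$$

The weights are learned by Ridge Regression:

$$\mathbf{w}_\star = \arg\min_{\hat{\mathbf{w}}_\star} \sum_{i=1}^{N} \left( t^i_\star - \hat{\mathbf{w}}_\star^{\top} \phi_5(P^i) \right)^2 + \lambda \lVert \hat{\mathbf{w}}_\star \rVert^2$$

where the regression targets for a pair $(P, G)$ are

$$t_x = (G_x - P_x) / P_w, \qquad t_y = (G_y - P_y) / P_h, \qquad t_w = \log(G_w / P_w), \qquad t_h = \log(G_h / P_h)$$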

Once the offsets and ratios have been predicted, the actual center coordinates, width and height can be computed from them.
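The conversion between boxes and regression targets can be sketched as below, assuming the usual R-CNN parameterization: center offsets normalized by the proposal's width and height, and log ratios for width and height. The function names `encode` and `decode` and the `(cx, cy, w, h)` tuple layout are choices made for this example.

```python
import math

def encode(P, G):
    """Regression targets for proposal P and ground truth G,
    both given as (center_x, center_y, width, height)."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw,        # scale-invariant x offset
            (gy - py) / ph,        # scale-invariant y offset
            math.log(gw / pw),     # log-space width ratio
            math.log(gh / ph))     # log-space height ratio

def decode(P, d):
    """Apply predicted (dx, dy, dw, dh) to proposal P to get a refined box."""
    px, py, pw, ph = P
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))
```

`decode(P, encode(P, G))` recovers `G` up to floating-point error, which is a quick way to check that the two directions are consistent. A prediction of all zeros leaves the proposal unchanged.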

Note that R-CNN has no notion of an Anchor Box. Fast R-CNN does not have it either; the concept only appears in Faster R-CNN.

2.3 Cost function

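For the CNN, the cost function should just be the ordinary one for a classification CNN, i.e. the softmax cross-entropy loss used during fine-tuning.
For the SVMs, it is of course the hinge loss.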
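For reference, the two losses named above can be written out as follows. This is a generic textbook sketch, not taken from the paper's code; the function names and the convention of SVM labels in {-1, +1} are assumptions.

```python
import math

def hinge_loss(score, label):
    """Hinge loss for a linear SVM; label is -1 or +1, score is w.x + b."""
    return max(0.0, 1.0 - label * score)

def softmax_cross_entropy(logits, target):
    """Cross-entropy of a softmax over `logits` against class index `target`,
    computed with the max-shift trick for numerical stability."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]
```

The hinge loss is exactly zero once an example is on the correct side with margin at least 1, whereas the cross-entropy keeps shrinking but never reaches zero.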

2.4 Optimization

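Because feature extraction and classification are learned separately, the CNN is trained with the usual deep learning optimizers such as SGD, and the SVMs with a standard SVM solver, trained on the "hard negatives" mentioned in the quote above.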

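In essence this is still a pipeline, and each stage of the pipeline is only optimized to its own local optimum rather than a joint optimum, so there must be plenty of room for improvement.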

3. Side Notes

R-CNN clearly belongs to the recognition using regions paradigm. Interestingly, the reference the authors cite for this paradigm is from 2009. Did the paradigm really appear that late? Hard to believe.

C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In CVPR, 2009.

The authors also cite a few more papers that follow the same paradigm, all from after 2009:

J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. IJCV, 2013.
J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. TPAMI, 2012.
