## [1] Learning Scene Gist with Convolutional Neural Networks to Improve Object Recognition

A CISS 2018 (IEEE Conference on Information Sciences and Systems) paper; all the authors are from Harvard.

A prominent feature of the primate visual system is eccentricity-dependent sampling, with a high-resolution foveal region and a lower-resolution periphery.

1. The periphery has a decaying density of cells as a function of distance from the fovea, and allows for faster, approximate perception.
2. Low-resolution peripheral information provides an initial approximation of the scene gist.

During scene understanding, peripheral information can be used to propose regions of interest for active sampling, and the eyes can then quickly foveate on these regions for high-resolution interpretation. The interplay between foveal and peripheral information may enable faster recognition of objects within a scene with a significantly reduced number of cells. In practice, however, the GistNet authors do not use peripheral information to propose regions of interest for active sampling; instead, the peripheral features are combined with the foveated-vision features for the final classification.

1. Advantage of region proposals over sliding windows: the region proposals cut down on the cost of having to perform classification over the entire image.
2. Disadvantage of region proposals compared to gist: these models lack critical components of the contextual information provided by interactions between the fovea and the periphery, which are characteristic of human vision:
   1. a low-resolution, rapid peripheral system;
   2. interactions between peripheral and foveal information;
   3. global sharing of information learned across foveations.
3. Using global features from the scene gist may reduce the need for additional region proposals, aiding recognition of all objects within the same scene and forcing all objects in a scene to be influenced by the same prior during inference.

### To read

1. Zhu, C., et al. CMS-RCNN: Contextual multi-scale region-based CNN for unconstrained face detection. Deep Learning for Biometrics, 2017, pp. 57-79.
2. Chen, X. and Gupta, A. Spatial memory for context reasoning in object detection. arXiv:1704.04224, 2017.

## [3] Multi-Channel CNN-based Object Detection for Enhanced Situation Awareness

207 GB of MWIR imagery, 106 GB of visible imagery

## [4] CNN-based thermal infrared person detection by domain adaptation

The KAIST dataset can be considered the current state-of-the-art dataset for thermal person detection under challenging conditions.

1. In the first step, transform the thermal image data so that it is shifted closer to the visible domain.
2. In the second step, adjust the detector model to further reduce the remaining gap.

## [5] Image Captioning with Semantic Attention

A CVPR 2016 paper.

Image captioning is a field at the intersection of computer vision and natural language processing.

Top-down approach: start from a gist of an image and convert it into words. More specifically, top-down approaches are the “modern” ones, which formulate image captioning as a machine translation problem. Instead of translating between different languages, these approaches translate from a visual representation to its language counterpart. The visual representation comes from a convolutional neural network, often pretrained for image classification on large-scale datasets [18]. Translation is accomplished through language models based on recurrent neural networks.

Bottom-up approach: come up with words describing various aspects of an image and then combine them. More specifically, bottom-up approaches are the “classical” ones, which start with visual concepts, objects, attributes, words and phrases, and combine them into sentences using language models.

The model attends to semantic concept proposals and fuses them into the hidden states and outputs of recurrent neural networks.

The definition of image captioning: automatically generating a natural language description of an image.

One of the limitations of the top-down paradigm is that it is hard to attend to fine details, which may be important for describing the image. (This tension between details and semantics also exists in semantic segmentation and object detection.)

Feedback is the key to combining top-down and bottom-up information.

1. able to attend to a semantically important concept or region of interest in an image,
2. able to weight the relative strength of attention paid on multiple concepts,
3. able to switch attention among concepts dynamically according to task status

## [7] Priming Neural Networks

CVPRW 2018; an oral at the MBCC workshop.

### PrimingNN

1. Here the top-down feedback influences every layer.
2. In DES the cue comes from weakly supervised semantic segmentation, whereas here the cue representation seems to be given directly (e.g., which categories are present; the cue is information beyond the label).
3. DES generates a pixel-wise weight map, whereas PrimingNN generates a channel-wise weight vector.

#### How is top-down attention modeled?

1. A cue about some target in the image is given by an external source or some form of feedback.
2. The process of priming involves affecting each layer of computation of the network by modulating representations along the path.

##### Where the cue comes from

The cue is the source of the top-down signal, so the first step is deciding how to represent it. Here the cue is given, e.g., a binary encoding of the presence of some target(s) (e.g., objects); note that the cue is information beyond the label. The authors appear to supply it directly, whereas in DES it is derived from weakly supervised semantic segmentation. Weak supervision still exploits only the label information, so there is no information gain; at most, relative to the object detection branch, the semantic segmentation branch may recover some information the detection branch missed.

##### How the cue modulates the network

In this paper, top-down feedback takes the form of modulating the neural network layers, especially the lower ones.

Let $L_i$ be a layer of the network with activations $x_i \in \mathbb{R}^{c_i \times h_i \times w_i}$, so that $x_{ij}$, the $j$-th feature plane of $x_i$, satisfies $x_{ij} \in \mathbb{R}^{h_i \times w_i}$:

$$\hat{x}_{ij}=\alpha_{ij} \cdot x_{ij}+x_{ij}$$

$$\alpha_{i}=W_{i} * h$$
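A minimal numpy sketch of this channel-wise modulation, assuming a dense cue-to-channel map (the function name and interface are mine, not the paper's):

```python
import numpy as np

def prime_layer(x_i, h, W_i):
    """Channel-wise priming of one layer: alpha_i = W_i @ h gives one
    scalar per feature plane, then x_hat_ij = alpha_ij * x_ij + x_ij.
    x_i: (c_i, h_i, w_i) activations; h: cue vector; W_i: (c_i, len(h))."""
    alpha = W_i @ h                            # one weight per channel
    return x_i * (1.0 + alpha[:, None, None])  # residual modulation
```

The residual form means a zero `W_i` leaves the layer untouched, so priming can be trained as a perturbation on top of a pretrained network.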

1. Free viewing: viewing without a cue.
2. Priming: a modification to the computation performed when viewing the scene with the cue in mind (note: this acts at the computation stage). Priming often greatly increases the chance of detecting the cued object.
3. Pruning: a modification to the decision process after all the computation is finished (note: this acts at the decision stage). When the task is to detect objects, this can mean retaining all detections that match the cue, even very low-confidence ones, and discarding all others.

The difference between priming and pruning is that pruning only trims the decisions produced by the forward pass, whereas priming allows the cue to affect the visual process from the early layers. Precisely because the early layers are modulated, priming can enable detections that would have been unlikely to occur under free-viewing conditions.
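Pruning is simple enough to state as code. This filter, with a hypothetical detection record format of my own choosing, keeps cue-matching detections regardless of confidence:

```python
def prune_detections(detections, cue_classes):
    """Pruning: after the forward pass is finished, keep only detections
    whose class matches the cue -- even very low-confidence ones -- and
    discard all others. No early-layer computation is affected."""
    return [d for d in detections if d["cls"] in cue_classes]
```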

## Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks

CVPR 2015; the journal version is the TPAMI paper “Feedback Convolutional Neural Network for Visual Localization and Segmentation”. This is an excellent paper: it answers a question I have had ever since I started reading attention papers, namely where the cue in top-down attention comes from. The cue is the source of the information flow in top-down attention; it can be a hint such as “there is a cat here”, or a contextual prior such as “boats are on water”.

A cue differs from a label: both carry information about the task, but a label is available only at training time, whereas a cue is also available at inference time.

1. Resize the image to 224×224, run the CNN model, and predict the top-5 class labels.
2. For each of the top-5 class labels, compute an object localization box with the feedback model.
3. Crop the image patch for each of the 5 bounding boxes from the original image and resize it to 224×224. Predict the top-5 labels again.
4. Given the total of 25 labels and their corresponding confidences, rank them and pick the top 5 as the final solution.
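The four-step procedure can be sketched as a loop, with the classifier, feedback localizer, and crop routine injected as callables. All names and the de-duplication detail are my own assumptions:

```python
def look_and_think_twice(image, predict_top5, localize, crop):
    """Two-pass inference sketch: predict top-5 globally, localize each
    label with the feedback model, re-predict top-5 on each crop, then
    rank the resulting 25 (label, confidence) pairs and keep the best 5."""
    pool = []
    for label, _ in predict_top5(image):        # step 1: 5 global labels
        box = localize(image, label)            # step 2: feedback box
        pool += predict_top5(crop(image, box))  # step 3: 5 labels per crop
    pool.sort(key=lambda lc: -lc[1])            # step 4: rank all 25
    final, seen = [], set()
    for label, conf in pool:                    # keep best score per label
        if label not in seen:
            seen.add(label)
            final.append((label, conf))
    return final[:5]
```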

z can be solved from Eq. (6) via SGD.

The “biased competition model”: at any given time, a large number of sensory or cognitive representations are active in the brain, but the brain's computational resources can process only a limited number of them, so the representations constantly compete for neural resources. In this competition, attention acts as a selection mechanism that preferentially selects certain information for more elaborate processing. (Source?) (What, then, determines which representations win the competition for neural resources and which lose? Feedback “passes the high-level semantic information down to the low-level perception, controls the selectivity of neuron activations in an extra loop in addition to the feedforward process. This results in the ‘Top-Down’ attention in human cognition.” This is the top-down mechanism, but what exactly it is remains unclear to me.)

### Questions

1. During the feedforward stage, the proposed network performs inference from input images in a bottom-up manner, like traditional convolutional networks.
2. In the feedback loops, it sets high-level semantic labels (e.g., outputs of class nodes) as the “goal” of visual search to infer the activation status of hidden-layer neurons.

Inspired by Deformable Part-Based Models (DPMs) [8] that characterize middle level part locations as latent variables and search for them during object detection, we utilize a simple yet efficient method to optimize image compositions and assign neuron activations given “goals” in visual search.

How does feedback, or “top-down” attention, work in biased competition theory? It passes the high-level semantic information down to the low-level perception and controls the selectivity of neuron activations in an extra loop in addition to the feedforward process.

Visualization of CNNs shows semantically meaningful salient object regions and helps us understand the working mechanism of CNNs. (This is indeed close to localization, since you must first decide which regions to display.)

Object detection and localization 可以被认作是 a searching process with clear “goals.”

The behaviors of ReLU and max-pooling can be formulated as y = z ∘ x, where ∘ is the element-wise (Hadamard) product.

Similarly, y = z ∗ x, where ∗ is the convolution operator and z is a set of convolutional filters, except that they are location-variant.
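The gating view of ReLU can be written out directly; the `feedback_gate` helper is my own illustration of an extra top-down gate on the same activations, not the paper's exact formulation:

```python
import numpy as np

def relu_as_gate(x):
    """ReLU rewritten as y = z ∘ x, where z is a binary gate that is 1
    exactly where the input is positive (bottom-up selectivity)."""
    z = (x > 0).astype(x.dtype)
    return z * x, z

def feedback_gate(x, z_feedback):
    """Top-down control sketch: a second gate, inferred from a high-level
    goal, multiplies the ReLU output element-wise, so only activations
    related to the target survive (hypothetical helper)."""
    return z_feedback * np.maximum(x, 0)
```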

Bottom-up: inherits the selectivity from the ReLU layers, and the dominant features are passed to the upper layers.

Top-down: controlled by the feedback layers, which propagate the high-level semantics and global information back to the image representations. Only those gates related to particular target neurons are activated.

1. Zhang2018SingleShotOD: Single-Shot Object Detection with Enriched Semantics
2. Shrivastava2016ContextualPA: Contextual Priming and Feedback for Faster R-CNN
3. Rosenfeld2018PrimingNN: Priming Neural Networks

## Exemplar-Driven Top-Down Saliency Detection via Deep Association

2016 CVPR

Bottom-up visual saliency is stimulus-driven, and thus sensitive to the most interesting and conspicuous regions in the scene.

Top-down visual saliency, on the other hand, is knowledge-driven and involves high-level visual tasks, such as intentionally looking for a specific object.

Bottom-up saliency detection is task-free in nature, and can only capture the most salient object(s) in the scene.

Top-down saliency aims to locate all the intended objects in the scene, which can help reduce the search space for object detection. (Only top-down saliency can reduce the search space.)

The goal is to learn the “knowledge” that guides top-down saliency detection from a set of categorized training data.

The knowledge can come from memory, e.g., locating salient objects in the scene using knowledge from training data, or from object association, e.g., locating objects in the scene using known or unknown exemplars.

## Comparison of Infrared and Visible Imagery for Object Tracking: Toward Trackers with Superior IR Performance

2015 CVPRW

1. Visible-light images are formed by light reflected from objects, whereas (non-NIR) infrared images are formed by the objects' own blackbody radiation, so temperature and the atmospheric window are what determine the radiated intensity. Because the signal depends on the object's own temperature, IR images usually have far less texture than visible images, and gray levels vary relatively smoothly (for targets with evenly distributed heat, such as people; not for targets like aircraft, where the engine region is very bright).

in the visible image, the contours of the target can be easily segmented

## Meeting

https://zhuanlan.zhihu.com/p/51514687

## Single-Shot Refinement Neural Network for Object Detection

The advantage of the two-stage approach (e.g., Faster R-CNN) is that it achieves the highest accuracy.

The advantage of the one-stage approach (e.g., SSD) is its high efficiency.

RefineDet consists of two modules:

1. The anchor refinement module, whose goals are to:
   1. filter out negative anchors to reduce the search space for the classifier;
   2. coarsely adjust the locations and sizes of anchors to provide better initialization for the subsequent regressor.
2. The object detection module, which takes the refined anchors from the former as input to further improve regression accuracy and predict multi-class labels.
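A rough numpy sketch of what the anchor refinement module computes. The threshold value, the (cx, cy, w, h) box encoding, and the names are assumptions on my part, not RefineDet's exact implementation:

```python
import numpy as np

def refine_anchors(anchors, neg_scores, deltas, theta=0.99):
    """ARM sketch: discard anchors whose background (negative) confidence
    exceeds theta, then coarsely adjust the survivors' centers and sizes.
    anchors: (n, 4) as (cx, cy, w, h); deltas: (n, 4) as (dx, dy, dw, dh)."""
    keep = neg_scores <= theta        # filter out easy negative anchors
    a, d = anchors[keep], deltas[keep]
    cx = a[:, 0] + d[:, 0] * a[:, 2]  # shift center by a relative offset
    cy = a[:, 1] + d[:, 1] * a[:, 3]
    w = a[:, 2] * np.exp(d[:, 2])     # rescale width and height
    h = a[:, 3] * np.exp(d[:, 3])
    return np.stack([cx, cy, w, h], axis=1)
```

The surviving, adjusted anchors are then what the object detection module regresses and classifies.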

1. Kong et al. [23] use the objectness prior constraint on convolutional feature maps to significantly reduce the search space of objects.

Advantages of two-stage methods over one-stage methods:

1. using two-stage structure with sampling heuristics to handle class imbalance
2. using two-step cascade to regress the object box parameters
3. using two-stage features to describe the objects
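The sampling heuristic in point 1 is typically some form of hard negative mining; a minimal sketch under my own naming (the ratio and interface are assumptions, not a specific detector's code):

```python
def sample_hard_negatives(pos_idx, neg_idx, neg_loss, ratio=3):
    """Sampling heuristic sketch: keep every positive example and only
    the hardest negatives, capped at ratio x the number of positives,
    so the background class cannot overwhelm the loss."""
    hardest = sorted(neg_idx, key=lambda i: -neg_loss[i])
    return list(pos_idx) + hardest[:ratio * len(pos_idx)]
```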