Notes on GluonCV's VOCMApMetric

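Mean Average Precision (mAP) is the most commonly used evaluation metric in object detection. I have been using MXNet/Gluon recently and noticed that the latest GluonCV ships an API for computing mAP. The documentation describes it but gives no code example, so as a beginner I still could not figure out how to call it. This post records my notes from reading the source of GluonCV's VOCMApMetric.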

1. Notes on mAP

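Understanding what mAP is beforehand makes everything far easier. I paid for being fuzzy about mAP myself: it took me a full day to read through the VOCMApMetric code. The following two are good resources for learning about mAP:

To evaluate an object detector on an image or a test set, we need two things: the bounding boxes (BBoxes) that our trained model predicts on that test set, and the Groundtruth BBoxes of the same test set.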

1.1 meaning of “precision”

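The name is "mean average precision", so start with precision. How is it computed? Precision = # True Positive / (# True Positive + # False Positive). Put plainly, # True Positive is the number of predictions the model made that are correct, # False Positive is the number of predictions it made that are wrong, and their sum is the total number of predictions the model made.

How do we decide whether a given Predicted BBox is correct? The criterion is the IoU between the Predicted BBox and a Groundtruth BBox. The procedure is:

- Among all Groundtruth BBoxes, find the one with the largest IoU with the current Predicted BBox.
- If that IoU is below a threshold, the Predicted BBox is wrong.
- If that IoU reaches the threshold and the corresponding Groundtruth BBox has not already been assigned to an earlier Predicted BBox, the prediction counts as correct. If the Groundtruth BBox has already been assigned, the prediction counts as wrong.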

This raises a question: "has not already been assigned to an earlier Predicted BBox" means earlier in what order? The order is given by each Predicted BBox's Prediction Score, usually the probability that the box belongs to some class, where that class is already the one with the highest predicted probability for this box. So when deciding whether each Predicted BBox is correct, we must process the boxes in descending order of Prediction Score.

Walking down the sorted list, at each Prediction Score we can count the # True Positive and # False Positive among the Predicted BBoxes seen so far. This yields a Precision Array ordered by descending Prediction Score, with one element per Predicted BBox.

Likewise, since the number of Groundtruth BBoxes is known, Recall = # True Positive / # Groundtruth BBox gives a Recall Array in the same order, again with one element per Predicted BBox.
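To make the two arrays concrete, here is a minimal NumPy sketch. The scores, the match flags, and the ground-truth count are made-up toy values, not output from any model.

```python
import numpy as np

# Hypothetical toy data: 5 predictions of one class, 3 ground-truth boxes.
scores = np.array([0.6, 0.9, 0.3, 0.8, 0.5])
is_tp = np.array([0, 1, 0, 1, 1])   # 1 = matched a ground truth, 0 = false positive
n_gt = 3

order = np.argsort(scores)[::-1]    # indices in descending-score order: [1, 3, 0, 4, 2]
is_tp_sorted = is_tp[order]         # [1, 1, 0, 1, 0]

tp = np.cumsum(is_tp_sorted == 1)   # running true positives:  [1, 2, 2, 3, 3]
fp = np.cumsum(is_tp_sorted == 0)   # running false positives: [0, 0, 1, 1, 2]

precision = tp / (tp + fp)          # [1.0, 1.0, 0.667, 0.75, 0.6]
recall = tp / n_gt                  # [0.333, 0.667, 0.667, 1.0, 1.0]
```

Both arrays have one entry per prediction. Recall never decreases as we walk down the list, while precision can go up or down.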

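One more aside on the Prediction Score used for sorting: in a neural network it can be the output probability of the most likely class; for an SVM it can be the score $y = w^T x + b$; for RPCA-based infrared small-target separation it can be the (gray-level) value of the target in the separated target image.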

1.2 meaning of “average precision”

We now have the Precision Array and the Recall Array, both ordered by descending Prediction Score. Note that average precision is not simply the mean of that Precision Array: the average is taken over recall.

In a Precision-Recall plot, Precision is the vertical axis and Recall is the horizontal axis, and the area under the Precision-Recall curve is the average precision. In other words, average precision is the integral of Precision with respect to Recall. The result is a single number, and each class has its own average precision.

1.3 meaning of “mean average precision”

The final "mean" is simple: it averages the per-class average precision values, i.e. it is a mean among all classes.

2. Notes on Code

VOCMApMetric has only three methods without a leading underscore: reset, get, and update. Their roles are clearly separated:

reset: Clear the internal statistics to initial state.

get: Get the current evaluation result.

update: Update internal buffer with latest prediction and gt pairs.

2.1 `__init__`

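update maintains the internal statistics self._n_pos, self._score, and self._match, which are the raw material for computing Precision and Recall
get produces the per-class average precision and the final mean average precision (by calling _update, which in turn calls _recall_prec)

A note on axes: NumPy stores arrays in row-major order by default, but that is separate from what the axis argument means. In NumPy, axis = 0 runs down the rows, so a reduction gives one result per column; axis = 1 runs across the columns, so a reduction gives one result per row. MATLAB orders its dimensions the same way (its dim = 1 corresponds to NumPy's axis = 0), and MXNet's NDArray follows the NumPy convention. In short: axis = 0 operates on columns, axis = 1 operates on rows.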
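A tiny NumPy check of the axis convention, using a made-up 2 × 3 array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3): 2 rows, 3 columns

col_sums = a.sum(axis=0)         # collapses the row dimension: one value per column
row_sums = a.sum(axis=1)         # collapses the column dimension: one value per row
row_argmax = a.argmax(axis=1)    # position of the largest entry within each row
```

Here col_sums is [5, 7, 9], row_sums is [6, 15], and row_argmax is [2, 2]. The axis=1 form is the one that matters later, when argmax and max are applied to each row of the IoU matrix.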

self._n_pos: the number of Groundtruth BBoxes that belong to class l
self._score: the score of every predicted BBox
self._match: whether each predicted BBox is good enough to count as a True Positive

2.2 `update`

def update(self, pred_bboxes, pred_labels, pred_scores,
           gt_bboxes, gt_labels, gt_difficults=None):
    """Update internal buffer with latest prediction and gt pairs.

    Parameters
    ----------
    pred_bboxes : mxnet.NDArray or numpy.ndarray
        Prediction bounding boxes with shape `B, N, 4`.
        Where B is the size of mini-batch, N is the number of bboxes.
    pred_labels : mxnet.NDArray or numpy.ndarray
        Prediction bounding boxes labels with shape `B, N`.
    pred_scores : mxnet.NDArray or numpy.ndarray
        Prediction bounding boxes scores with shape `B, N`.
    gt_bboxes : mxnet.NDArray or numpy.ndarray
        Ground-truth bounding boxes with shape `B, M, 4`.
        Where B is the size of mini-batch, M is the number of ground-truths.
    gt_labels : mxnet.NDArray or numpy.ndarray
        Ground-truth bounding boxes labels with shape `B, M`.
    gt_difficults : mxnet.NDArray or numpy.ndarray, optional, default is None
        Ground-truth bounding boxes difficulty labels with shape `B, M`.

    """

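~~OK, I only added this walkthrough of update's inputs when editing this post for the second time. Some points were still fuzzy to me in the first draft, and I hope this pass fills them in.~~

OK, in the fourth revision I filled in more points that this walkthrough of update's inputs had previously missed.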

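When we measure the difference between two things, they are usually of the same kind. A loss function measures the difference between Prediction and Groundtruth during training; mAP measures the difference between Prediction and Groundtruth during testing. Do the Prediction and Groundtruth differ between the two stages? They do: in training the unit is the Anchor, while in testing the unit is the Object. That gives four quantities: Predicted Anchor Label, Groundtruth Anchor Label, Predicted Object Label, and Groundtruth Object Label.

In practice we run a single forward pass of the net to get the Prediction, so the Predicted Object Label is just the Predicted Anchor Label after Decoding of the bounding-box coordinates and predicted class labels, followed by NMS (non-maximum suppression).

In the GluonCV code, Predicted Anchor Label, Groundtruth Anchor Label, and Groundtruth Object Label (which I also call Groundtruth Human Label) are clearly distinguished by variable names and comments. For example: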

- Predicted Anchor Label: variable names such as pred_labels, or cls_preds and box_preds; the keyword is preds. The shape is (B, N), where N usually means the total number of Anchors in a layer. For SSD at test time, however, the final output may already have gone through NMS, in which case N is 100. Note that Predicted Anchor labels include a Background class, with label 0.
- Groundtruth Anchor Label: variable names cls_target and box_target; the keyword is target, because they are produced by SSDTargetGenerator. The shape is (B, N), where N is the total number of Anchors in a layer. This N never goes through NMS, because Groundtruth Anchor Labels are used to compute the loss during training and are not fed to metrics such as mAP at test time; mAP uses the Groundtruth Human Label. Groundtruth Anchor Labels also include a Background class with label 0, and with Hard Negative Mining the ignored Anchors get label -1.
- Predicted Object Label: variable names ids, scores, bboxes, or pred_bbox, pred_label, pred_score. These look much like the Predicted Anchor Label, and they essentially are the same thing, except that label 0 (Background) has been removed and NMS has trimmed the number of boxes.
- Groundtruth Object Label: variable names gt_labels and gt_bboxes; the keyword is gt (for Groundtruth) plus bbox. Mind the difference between Anchor and BBox: a variable that holds Anchors never has bbox in its name. Anchors come in a few fixed aspect ratios, while a BBox can have any shape, and a matched Anchor and BBox merely overlap substantially. The Groundtruth Human Label has no Background class, so label 0 is not Background. This semantic mismatch between Predicted Anchor Labels and Groundtruth Human Labels needs attention. The shape is (B, M), where M is the number of Positive Objects (not Anchors).

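A Predicted Anchor Label takes fg_class + 1 values, where fg_class is the number of Positive Object classes (excluding Background) and the + 1 accounts for Background. A Groundtruth Human Label takes only fg_class values because it has no Background. The two label spaces are therefore misaligned, and they must be aligned before being matched inside VOCMApMetric. There is no need to worry: at test time the SSD network runs a MultiPerClassDecoder stage before its output, after which the labels move from the Anchor Label Space to the Human Label Space. In short, results meant for people (the test stage) carry no Background label; results meant for the model, i.e. those used to compute the training loss, do.

At test (non-training) time, GluonCV's SSD returns ids, scores, bboxes, all MXNet NDArrays, with ids.shape: (batch_size, 100, 1), scores.shape: (batch_size, 100, 1), and bboxes.shape: (batch_size, 100, 4). The 100 is there because the code by default keeps 100 boxes after NMS; we can of course change this setting.

The code below converts the MXNet NDArrays output by the net into NumPy arrays. For one batch the net outputs a single NDArray, but some people (me, for instance) collect the ids NDArrays of several batches into a list. So as_numpy converts a single NDArray into a NumPy array, and for a list of NDArrays it converts each element and then concatenates them along axis 0 into one NumPy array.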

def as_numpy(a):
    """Convert a (list of) mx.NDArray into numpy.ndarray"""
    if isinstance(a, (list, tuple)):
        out = [x.asnumpy() if isinstance(x, mx.nd.NDArray) else x for x in a]
        return np.concatenate(out, axis=0)
    elif isinstance(a, mx.nd.NDArray):
        a = a.asnumpy()
    return a

The next line looks complicated but does something I do all the time. After as_numpy, each of pred_bboxes, pred_labels, pred_scores, gt_bboxes, gt_labels, gt_difficults is an array whose first axis indexes the images, and on each iteration we want the n-th entry of every one of them. zip delivers exactly that: the n-th NumPy array from each.

for pred_bbox, pred_label, pred_score, gt_bbox, gt_label, gt_difficult in zip(
        *[as_numpy(x) for x in [pred_bboxes, pred_labels, pred_scores,
                                gt_bboxes, gt_labels, gt_difficults]]):

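In the code below, pred_label.flat flattens an array such as the (100, 1) labels of one image into one dimension. np.where returns the indices of the True elements; the trailing [0] is needed because np.where returns a tuple of the form ([indices of True],), so [0] just extracts the index array. As the name suggests, valid_pred holds the indices of the valid predictions, where valid means the entries that are not ignored (label >= 0)
pred_bbox is then the valid predicted BBoxes,
pred_label the valid predicted labels,
pred_score the valid predicted scores
Likewise, valid_gt holds the indices of the valid ground truths. From here on, pred_bbox, pred_label, pred_score, gt_bbox, gt_label contain only valid entries while keeping their old names, which saves some trouble
You may wonder how a Groundtruth can have a label of -1 when every object's cls_id starts from 0. The reason is that images contain different numbers of objects. To stack them into one batch, a Pad operation extends every image to the largest object count in the batch, and the nonexistent objects are filled with pad_val = -1. This step strips those padded entries again
Semantically this operation is trivial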

valid_pred = np.where(pred_label.flat >= 0)[0]
pred_bbox = pred_bbox[valid_pred, :]
pred_label = pred_label.flat[valid_pred].astype(int)
pred_score = pred_score.flat[valid_pred]
valid_gt = np.where(gt_label.flat >= 0)[0]
gt_bbox = gt_bbox[valid_gt, :]
gt_label = gt_label.flat[valid_gt].astype(int)
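A small self-contained sketch of the padding removal, with made-up ground truth for one image (two real objects and two padded slots filled with -1):

```python
import numpy as np

# Hypothetical padded ground truth of one image; -1 marks padded slots.
gt_label = np.array([[3.], [0.], [-1.], [-1.]])     # shape (M, 1)
gt_bbox = np.array([[10., 10., 50., 50.],
                    [20., 30., 80., 90.],
                    [-1., -1., -1., -1.],
                    [-1., -1., -1., -1.]])          # shape (M, 4)

valid_gt = np.where(gt_label.flat >= 0)[0]          # indices of real objects: [0, 1]
gt_bbox_valid = gt_bbox[valid_gt, :]                # shape (2, 4)
gt_label_valid = gt_label.flat[valid_gt].astype(int)  # [3, 0]
```

Label 0 survives the filter because in the Human Label Space it is a real class, not Background.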

np.unique(np.concatenate((pred_label, gt_label)).astype(int)) finds the classes that appear in pred_label or gt_label. For each class l:
pred_mask_l is the boolean mask of class l within pred_label
pred_bbox_l holds the BBoxes of class l
pred_score_l holds the scores of class l
np.argsort(x) returns indices such that x[index] is in ascending order; [::-1] reverses them, turning ascending into descending, since computing mAP requires pred_score_l sorted from high to low
After pred_score_l is sorted from high to low, pred_bbox_l is reordered with the same indices
The ground truth has no score, so it needs no such sorting; we only extract gt_mask_l and gt_bbox_l for class l

for l in np.unique(np.concatenate((pred_label, gt_label)).astype(int)):
    pred_mask_l = pred_label == l
    pred_bbox_l = pred_bbox[pred_mask_l]
    pred_score_l = pred_score[pred_mask_l]
    # sort by score
    order = pred_score_l.argsort()[::-1]
    pred_bbox_l = pred_bbox_l[order]
    pred_score_l = pred_score_l[order]

    gt_mask_l = gt_label == l
    gt_bbox_l = gt_bbox[gt_mask_l]
    gt_difficult_l = gt_difficult[gt_mask_l]
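The per-class masking and sorting can be tried on toy data. The labels, scores, and boxes below are made up for illustration:

```python
import numpy as np

# Hypothetical predictions of one image, after the padding filter.
pred_label = np.array([1, 0, 1, 1])
pred_score = np.array([0.4, 0.7, 0.9, 0.6])
pred_bbox = np.array([[0, 0, 10, 10],
                      [5, 5, 20, 20],
                      [1, 1, 11, 11],
                      [30, 30, 40, 40]], dtype=float)

l = 1
pred_mask_l = pred_label == l              # [True, False, True, True]
pred_bbox_l = pred_bbox[pred_mask_l]
pred_score_l = pred_score[pred_mask_l]     # [0.4, 0.9, 0.6]

order = pred_score_l.argsort()[::-1]       # [1, 2, 0]
pred_bbox_l = pred_bbox_l[order]           # boxes follow their scores
pred_score_l = pred_score_l[order]         # [0.9, 0.6, 0.4]
```

After sorting, the first row of pred_bbox_l is the box that had score 0.9, so boxes and scores stay aligned.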

self._score is the internal store for Prediction Scores: a dictionary whose key is the class id l and whose value is the list of pred_score values predicted as that class
self._score[l].extend is used because the scores of each incoming batch are appended to the store
self._n_pos is another internal store, counting the non-difficult samples (note the np.logical_not). In my task nothing is marked difficult, so it is simply the number of Groundtruth BBoxes of class l, which is needed later to compute recall

self._n_pos[l] += np.logical_not(gt_difficult_l).sum()
self._score[l].extend(pred_score_l)

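When pred_bbox_l is empty, we predicted no sample of class l in this image. mAP measures the precision of the model's predictions, so it is driven by the Prediction results; with no Prediction for this class, we continue to the next class
When the ground truth contains no element of this class, the predictions of this class cannot match anything, however many there are. So none of pred_bbox_l is a match: (0,) * pred_bbox_l.shape[0] builds a tuple with that many zeros, and self._match[l].extend((0,) * pred_bbox_l.shape[0]) stores them. This also shows that self._match is an internal statistic too: a dictionary whose key is the class id l and whose value holds the match flags of the samples predicted as class l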

if len(pred_bbox_l) == 0:
    continue
if len(gt_bbox_l) == 0:
    self._match[l].extend((0,) * pred_bbox_l.shape[0])
    continue

pred_bbox_l[:, 2:] += 1 adds 1 to x_max and y_max, so that widths and heights come out as end_pos - beg_pos + 1 without having to set the offset of gluoncv.utils.bbox_iou
If pred_bbox_l has length N and gt_bbox_l has length M, iou is an N × M two-dimensional array
Row i of the iou matrix holds the IoU of the i-th sample predicted as label = l with every ground-truth bbox of that class
axis=1 operates along each row: iou.argmax(axis=1) returns, for each row, the index of its largest element, i.e. the index of the ground-truth bbox that has the largest IoU with that predicted bbox
iou.max(axis=1) returns that largest IoU value itself. iou.max(axis=1) < self.iou_thresh says that even the best IoU does not count as a match if it is below self.iou_thresh; otherwise every Predicted BBox would have a match
Note that gt_index has length pred_bbox_l.shape[0], not gt_bbox_l.shape[0]; do not be misled by the name. For each bbox in pred_bbox_l, gt_index gives the index in gt_bbox_l of its best Groundtruth BBox, or -1 if none reaches the threshold

# VOC evaluation follows integer typed bounding boxes.
pred_bbox_l = pred_bbox_l.copy()
pred_bbox_l[:, 2:] += 1
gt_bbox_l = gt_bbox_l.copy()
gt_bbox_l[:, 2:] += 1

iou = bbox_iou(pred_bbox_l, gt_bbox_l)
gt_index = iou.argmax(axis=1)
# set -1 if there is no matching ground truth
gt_index[iou.max(axis=1) < self.iou_thresh] = -1
del iou
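To see what gt_index looks like, here is a sketch on toy boxes. iou_matrix is a hypothetical stand-in I wrote for illustration, not the library's bbox_iou, and the + 1 adjustment from above is left out to keep the arithmetic easy to follow. The threshold 0.5 is an assumed value.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU for (xmin, ymin, xmax, ymax) boxes; returns shape (N, M).

    Hypothetical helper for illustration only.
    """
    tl = np.maximum(boxes_a[:, None, :2], boxes_b[None, :, :2])  # intersection top-left
    br = np.minimum(boxes_a[:, None, 2:], boxes_b[None, :, 2:])  # intersection bottom-right
    wh = np.clip(br - tl, 0, None)                               # zero when boxes do not overlap
    inter = wh[..., 0] * wh[..., 1]
    area_a = np.prod(boxes_a[:, 2:] - boxes_a[:, :2], axis=1)
    area_b = np.prod(boxes_b[:, 2:] - boxes_b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

pred = np.array([[0., 0., 10., 10.],      # identical to ground truth 0
                 [0., 0., 10., 5.],       # covers half of ground truth 0
                 [50., 50., 60., 60.]])   # overlaps nothing
gt = np.array([[0., 0., 10., 10.],
               [100., 100., 120., 120.]])

iou = iou_matrix(pred, gt)                # shape (3, 2)
gt_index = iou.argmax(axis=1)             # best ground truth per prediction
gt_index[iou.max(axis=1) < 0.5] = -1      # below the threshold: no candidate
```

The result is gt_index = [0, 0, -1]: two predictions point at the same ground truth, and the third has no candidate. Resolving the duplicate is the job of the next step.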

In MATLAB, zeros(5) returns a 5 × 5 matrix. NumPy differs: np.zeros(gt_bbox_l.shape[0], dtype=bool) returns a one-dimensional array of length gt_bbox_l.shape[0]
gt_index records, for each predicted bbox, which ground-truth bbox has the largest IoU with it, provided that IoU reaches the threshold. An element of -1 means the Predicted BBox met the requirement with no Groundtruth BBox; any other value is the index of the Groundtruth BBox
When gt_idx >= 0 is False, i.e. gt_idx is -1, the Predicted BBox matched no Groundtruth BBox
When gt_idx >= 0 is True, the Predicted BBox has a candidate Groundtruth BBox:
- If gt_difficult_l[gt_idx] is True, -1 is appended to self._match[l] for the current sample. My current task has no Difficult tag, so this condition is always False for me
- If gt_difficult_l[gt_idx] is False:
  - If not selec[gt_idx] is True, i.e. selec[gt_idx] is False, then the gt_idx-th BBox in gt_bbox_l has not yet been claimed by an earlier Predicted BBox. The current Predicted BBox passes this further check, and 1 is appended to self._match[l]
  - If not selec[gt_idx] is False, i.e. selec[gt_idx] is True, the Groundtruth BBox has already been used by an earlier Predicted BBox, so the current one cannot count as a match and 0 is appended. This is reasonable: the list is sorted by pred_score, from most to least confident, and if a Groundtruth BBox can go to only one Predicted BBox, it should go to the more confident one
- selec[gt_idx] = True marks the Groundtruth BBox gt_idx as used as soon as any prediction has claimed it
From this, a match requires two conditions:
- The IoU between the Predicted BBox and its best-matching Groundtruth BBox must reach the threshold
- That Groundtruth BBox must not already have been assigned to a Predicted BBox with a higher Prediction Score

selec = np.zeros(gt_bbox_l.shape[0], dtype=bool)
for gt_idx in gt_index:
    if gt_idx >= 0:
        if gt_difficult_l[gt_idx]:
            self._match[l].append(-1)
        else:
            if not selec[gt_idx]:
                self._match[l].append(1)
            else:
                self._match[l].append(0)
        selec[gt_idx] = True
    else:
        self._match[l].append(0)
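The matching rule can be isolated into a small function and run on toy input. greedy_match is a name I made up for this sketch; it mirrors the loop above but returns the flags instead of storing them in self._match.

```python
import numpy as np

def greedy_match(gt_index, gt_difficult):
    """Match flags for score-sorted predictions: 1 = TP, 0 = FP, -1 = ignored.

    Hypothetical helper for illustration only.
    """
    selected = np.zeros(len(gt_difficult), dtype=bool)
    match = []
    for gt_idx in gt_index:
        if gt_idx < 0:
            match.append(0)           # no ground truth reached the IoU threshold
            continue
        if gt_difficult[gt_idx]:
            match.append(-1)          # difficult ground truth: neither TP nor FP
        elif not selected[gt_idx]:
            match.append(1)           # first, most confident claim on this ground truth
        else:
            match.append(0)           # ground truth already claimed: duplicate detection
        selected[gt_idx] = True
    return match

# Predictions are assumed to be in descending-score order already.
flags = greedy_match(gt_index=np.array([0, 0, -1, 1]),
                     gt_difficult=np.array([False, True]))
```

Here flags is [1, 0, 0, -1]: the first prediction claims ground truth 0, the second is a duplicate of it, the third matched nothing, and the fourth hit a difficult ground truth.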

2.3 `_recall_prec`

As the name suggests, this helper computes recall and precision.

self._n_pos.keys() returns all keys of self._n_pos, which are labels such as 0, 1, 2, 3. Because labels start at 0, max(self._n_pos.keys()) + 1 gives the number of classes. Note that if some class is absent, say label = 2, while label = 3 exists, the count is still taken as 3 + 1
prec and rec are initialized with one slot per class to hold that class's precision and recall
order = score_l.argsort()[::-1] sorts by descending score_l because the np.cumsum that follows relies on it: mAP requires all predictions ordered by score from high to low
The rest computes TP and FP by definition. prec[l] is a NumPy array, since tp / (fp + tp) is an element-wise division of NumPy arrays; this agrees with how mAP is defined
So the returned rec and prec are lists of NumPy arrays, one precision array and one recall array per class, each in descending-score order

def _recall_prec(self):
    """ get recall and precision from internal records """
    n_fg_class = max(self._n_pos.keys()) + 1
    prec = [None] * n_fg_class
    rec = [None] * n_fg_class

    for l in self._n_pos.keys():
        score_l = np.array(self._score[l])
        match_l = np.array(self._match[l], dtype=np.int32)

        order = score_l.argsort()[::-1]
        match_l = match_l[order]

        tp = np.cumsum(match_l == 1)
        fp = np.cumsum(match_l == 0)

        # If an element of fp + tp is 0,
        # the corresponding element of prec[l] is nan.
        with np.errstate(divide='ignore', invalid='ignore'):
            prec[l] = tp / (fp + tp)
        # If n_pos[l] is 0, rec[l] is None.
        if self._n_pos[l] > 0:
            rec[l] = tp / self._n_pos[l]

    return rec, prec
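One detail is easy to miss: a flag of -1 (difficult) is counted neither in tp nor in fp. A sketch with made-up flags for a single class shows the effect, including the nan mentioned in the code comment:

```python
import numpy as np

# Hypothetical flags of one class, already in descending-score order.
match_l = np.array([1, -1, 0, 1], dtype=np.int32)
n_pos = 2                                  # number of non-difficult ground truths

tp = np.cumsum(match_l == 1)               # [1, 1, 1, 2]
fp = np.cumsum(match_l == 0)               # [0, 0, 1, 1]

with np.errstate(divide='ignore', invalid='ignore'):
    prec = tp / (fp + tp)                  # [1.0, 1.0, 0.5, 0.667]
rec = tp / n_pos                           # [0.5, 0.5, 0.5, 1.0]

# If the top-scoring prediction is "difficult", fp + tp is 0 there and prec is nan.
leading = np.array([-1, 1])
tp2 = np.cumsum(leading == 1)
fp2 = np.cumsum(leading == 0)
with np.errstate(divide='ignore', invalid='ignore'):
    prec2 = tp2 / (fp2 + tp2)              # [nan, 1.0]
```

The -1 entry leaves both running counts unchanged, so precision and recall simply repeat their previous values at that position.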

2.4 `_average_precision`

_recall_prec above computes the recall and precision arrays; _average_precision goes one step further and computes the average precision. The rec and prec passed in are rec[l] and prec[l] from _recall_prec, i.e. the precision and recall arrays of one specific class (computed in descending-score order)
np.concatenate(([0.], rec, [1.])) makes mrec start at 0 and end at 1
mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i]) is there because

“we replace the precision value with the maximum precision for any recall ≥ $\hat{r}$”
This statement, and the figure and formula behind it, come from mAP (mean Average Precision) for Object Detection
i holds the indices where recall changes
mrec[i + 1] - mrec[i] is the width of each recall step; multiplied by the corresponding precision height it gives an area, which is what the code comment sum (\delta recall) * prec means
The returned ap is a single number, the average precision of this class

def _average_precision(self, rec, prec):
    """
    calculate average precision

    Params:
    ----------
    rec : numpy.array
        cumulated recall
    prec : numpy.array
        cumulated precision
    Returns:
    ----------
    ap as float
    """
    if rec is None or prec is None:
        return np.nan

    # append sentinel values at both ends
    mrec = np.concatenate(([0.], rec, [1.]))
    mpre = np.concatenate(([0.], np.nan_to_num(prec), [0.]))

    # compute precision integration ladder
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # look for recall value changes
    i = np.where(mrec[1:] != mrec[:-1])[0]

    # sum (\delta recall) * prec
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap
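A worked example helps here. The function below repeats the same steps as a standalone function (no self), and the rec and prec arrays are made-up values for one class with 3 ground truths and 5 predictions.

```python
import numpy as np

def average_precision(rec, prec):
    """Standalone copy of the steps above, for experimenting outside the class."""
    mrec = np.concatenate(([0.], rec, [1.]))
    mpre = np.concatenate(([0.], np.nan_to_num(prec), [0.]))
    # Make precision non-increasing from left to right (the "ladder").
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])
    # Indices where recall changes, then sum the step areas.
    i = np.where(mrec[1:] != mrec[:-1])[0]
    return np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])

rec = np.array([1 / 3, 2 / 3, 2 / 3, 1.0, 1.0])
prec = np.array([1.0, 1.0, 2 / 3, 0.75, 0.6])
ap = average_precision(rec, prec)
```

After the ladder step mpre is [1, 1, 1, 0.75, 0.75, 0.6, 0], and recall changes at three places, each a step of width 1/3. The area is 1/3 × 1 + 1/3 × 1 + 1/3 × 0.75 = 11/12 ≈ 0.917. Note how the dip to 2/3 in prec is replaced by 0.75, the maximum precision at any higher recall.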

2.5 `_update`

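First self._recall_prec() returns the recall and precision arrays of all classes
Then self._average_precision(rec, prec) computes the AP of each class
if self.num is not None and l < (self.num - 1) being True should be the most common case: self.sum_metric stores the AP of each class and the matching self.num_inst entry is set to 1
The if self.num is None: branch can also run, but in our case it is False, so self.sum_metric[-1] = np.nanmean(aps) runs, which is where mAP is actually computed from the per-class APs
So self.sum_metric is a list holding the per-class APs, with the mAP as its last element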

def _update(self):
    """ update num_inst and sum_metric """
    aps = []
    recall, precs = self._recall_prec()
    for l, rec, prec in zip(range(len(precs)), recall, precs):
        ap = self._average_precision(rec, prec)
        aps.append(ap)
        if self.num is not None and l < (self.num - 1):
            self.sum_metric[l] = ap
            self.num_inst[l] = 1
    if self.num is None:
        self.num_inst = 1
        self.sum_metric = np.nanmean(aps)
    else:
        self.num_inst[-1] = 1
        self.sum_metric[-1] = np.nanmean(aps)
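The choice of np.nanmean over a plain mean matters because _average_precision returns nan for a class whose rec or prec is None. A sketch with made-up AP values:

```python
import numpy as np

# Hypothetical per-class APs; the middle class has an undefined (nan) AP.
aps = [0.8, np.nan, 0.6]

mean_ap = np.nanmean(aps)      # averages only the classes with a defined AP
plain_mean = np.mean(aps)      # nan: one undefined class spoils the plain mean
```

Here mean_ap is 0.7, computed over two classes, while plain_mean is nan.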

2.6 `get`

self.sum_metric stores the per-class APs and the mAP; get simply reads them out and returns them so they can be printed

def get(self):
    """Get the current evaluation result.

    Returns
    -------
    name : str
       Name of the metric.
    value : float
       Value of the evaluation.
    """
    self._update()  # update metric at this time
    if self.num is None:
        if self.num_inst == 0:
            return (self.name, float('nan'))
        else:
            return (self.name, self.sum_metric / self.num_inst)
    else:
        names = ['%s'%(self.name[i]) for i in range(self.num)]
        values = [x / y if y != 0 else float('nan') \
            for x, y in zip(self.sum_metric, self.num_inst)]
        return (names, values)

3. Calling the Metric

The above explains the code, but actually calling it still caused trouble. When I used metric.update, I ran into

ValueError: zero-dimensional arrays cannot be concatenated

I traced the cause to passing gt_difficults=None to metric.update (None is the default anyway), which triggers the code below. For one batch, gt_labels is an MXNet NDArray, so gt_difficults becomes a list containing only None; if gt_labels is a list of several batches, the result is [None, None, …, None]. The outcome is the same, so I will describe it as [None].

if gt_difficults is None:
    gt_difficults = [None for _ in gt_labels]

In as_numpy, isinstance(a, (list, tuple)) is True, and isinstance(x, mx.nd.NDArray) is False for every element, so out is still [None]. The failure happens at return np.concatenate(out, axis=0), which raises ValueError: zero-dimensional arrays cannot be concatenated

def as_numpy(a):
    """Convert a (list of) mx.NDArray into numpy.ndarray"""
    if isinstance(a, (list, tuple)):
        out = [x.asnumpy() if isinstance(x, mx.nd.NDArray) else x for x in a]
        return np.concatenate(out, axis=0)
    elif isinstance(a, mx.nd.NDArray):
        a = a.asnumpy()
    return a
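The failure can be reproduced with NumPy alone. The second half of the sketch shows one possible workaround, which is my assumption and not something the documentation confirms: build an explicit all-zeros difficulty array shaped like gt_labels, so that as_numpy never sees None. The gt_labels values are made up, and whether update accepts such an array unchanged may depend on the GluonCV version.

```python
import numpy as np

# Reproduce the failure: a list that holds only None cannot be concatenated.
try:
    np.concatenate([None], axis=0)
    failed = False
except ValueError:
    failed = True

# Possible workaround (assumption): mark every object as "not difficult".
gt_labels = np.array([[3., 0., -1.]])        # hypothetical batch, shape (B, M)
gt_difficults = np.zeros_like(gt_labels)     # same shape, all zeros

# The array would then be passed explicitly, e.g.
# metric.update(pred_bboxes, pred_labels, pred_scores,
#               gt_bboxes, gt_labels, gt_difficults)
```

With all zeros, np.logical_not(gt_difficult_l).sum() counts every ground truth, which matches the case where the dataset has no Difficult tag.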

Logs

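2018-07-29: first draft
2018-08-02: added the walkthrough of update's inputs, i.e. Human Label Space vs Anchor Label Space.
2018-08-02: added the reason VOCMApMetric computes valid_gt (removing the effect of the Pad operation at test time); added that the Anchor is the unit in training and the Object is the unit in testing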
