R-CNN
R-CNN
Papers
Posts
Note
Because I am not going to implement them… so I just write some notes here…
History: R-CNN -> Fast R-CNN -> Faster R-CNN -> Mask R-CNN
R-CNN
The first paper using CNN: image classification -> object detection
4 steps:
- Region proposals (images -> selective search -> top 2000 proposals)
- Feature extraction of each propsals (proposals -> AlexNet -> 4096-dim feature vector)
- Object category classifiers (feature vector -> k SVM -> k-dim vector -> overlap filter)
- Bounding-box regression (linear function of feature vector of P)
Proposal overlap filter:
- a greedy non-maximum suppression (for each class independently) that rejects a region if it has an intersection-overunion (IoU) overlap with a higher scoring selected region larger than a learned threshold.
Trainning:
- Positive and negative exmaples
- AlexNet
- SVM
- Bounding-box regression
Fast R-CNN
Improvements:
- The RoI pooling layer
- The softmax classifier
- bbox fully connected layer
4 steps:
- Region proposals (images -> selective search -> top 2000 proposals)
- Feature extraction of each propsals (proposals -> VGG + RoI pooling layer -> $H \times W$ feature map)
- Object category classifiers (feature map -> FC + (k + background) softmax -> (k+1)-dim vector -> overlap filter)
- Bounding-box regression (feature map -> FC -> bbox)
Faster R-CNN
Improvments:
- Region Proposal Networks: make the whole model become a single unified network
5 steps:
- Feature map (images -> VGG -> feature map)
- Region proposals (feature map -> Region Proposal Networks -> $(W \times H \times k)$ proposals with scores -> top 2000 + overlap filter)
- Feature extraction of each propsals (feature map + proposals -> RoI pooling layer -> $(H \times W)$ feature map)
- Object category classifiers (feature map -> FC + (k + background) softmax -> (k+1)-dim vector)
- Bounding-box regression (feature map -> FC -> bbox)
RPN:
- input: $W \times H$ feature map
- k anchor boxes
- each boxes has two scores: object or non-object.
- output: $W \times H \times(4k)$ coordinates, $W \times H \times (2k)$ scores Training:
- Alternating training
Mask R-CNN
Improvments:
- backbone: ResNeXt-101+FPN
- RoIAlign: use bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average)
- Mask branch: FCN (Fully Convolutional Networks for Semantic Segmentation)
FCN:
- Convnet
- Upsampling
- DAG Net
6 steps:
- Feature map (images -> backbone -> feature map)
- Region proposals (feature map -> Region Proposal Networks -> $(W \times H \times k)$ proposals with scores -> top 2000 + overlap filter)
- Feature extraction of each propsals (feature map + proposals -> RoIAlign -> $(H \times W)$ feature map)
- Object category classifiers (feature map -> FC + (k + background) softmax -> (k+1)-dim vector)
- Bounding-box regression (feature map -> FC -> bbox)
- Mask branch (feature map -> head -> FCN -> k mask proposals)