GitHub: https://github.com/liuyuemaicha/PyTorch-YOLOv3
Note: this code is forked from eriklindernoren/PyTorch-YOLOv3. Compared with the master branch, the main changes are:
- Fixed bugs found in the original code. For example, in dataset.py, boxes[:, 3] *= w_factor / padded_w and boxes[:, 4] *= h_factor / padded_h were changed to boxes[:, 3] = boxes[:, 3] * w_factor / padded_w and boxes[:, 4] = boxes[:, 4] * h_factor / padded_h. Because of evaluation order, the original in-place form computes w_factor / padded_w (and h_factor / padded_h) first, and that quotient comes out as 0.0 (most likely integer division truncating to zero under Python 2.7, since both operands are image dimensions; or, if they are floats, a loss of precision in the tiny quotient). Doing the multiplication first avoids the problem; see the short demo after this list.
- Removed the terminaltables dependency so the code can be run directly.
- Changed the Python environment from 3.x to 2.7.
- Added extensive Chinese comments throughout the code.
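A minimal demo of that fix (the numbers are hypothetical; only the evaluation order matters). Under Python 2.7, / between two ints is integer division, so the quotient computed first truncates to 0:

w_factor, padded_w = 640, 672           # hypothetical original / padded widths (Python ints)
box_w = 0.5                             # a normalized box width

broken = box_w * (w_factor / padded_w)  # what the in-place form does: 640 / 672 == 0 in Python 2, so the result is 0.0
fixed = box_w * w_factor / padded_w     # multiply first, then divide: about 0.476 in both Python 2 and 3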
Object detection algorithms can be roughly divided into two major directions:
- Two-Stage: a proposal-based approach. Region proposals are first generated by a heuristic method (selective search) or by a CNN (RPN), and classification and regression are then performed on those proposals.
  - R-CNN
  - Fast R-CNN
  - Faster R-CNN
  - Mask R-CNN
- One-Stage: a proposal-free approach. No proposals are computed beforehand; a single CNN directly predicts the classes and locations of the targets.
  - The YOLO family
  - SSD
  - RetinaNet
For an introduction to the two-stage Faster R-CNN and its code, see:
KevinCK: Object Detection - Faster RCNN Introduction and Code Annotations (with GitHub code, verified to run) (zhuanlan.zhihu.com)
This article assumes the reader already has a basic understanding of YOLO. Beginners are advised to read the following articles first:
小小将: Object Detection | YOLO Principles and Implementation (zhuanlan.zhihu.com)
小小将: Object Detection | YOLOv2 Principles and Implementation (with YOLOv3) (zhuanlan.zhihu.com)
YOLOv3 highlights:
- DarkNet-53 backbone. Its first feature is the use of ResNet-style residual connections, which prevent the loss of useful information and avoid vanishing gradients when training a deep network. Its second feature is that it has no pooling layers and uses strided convolutions for downsampling instead, further preventing information loss, which is particularly beneficial for small targets (see the residual-block sketch after this list).
- Multi-scale strategy, covering both multi-scale training and multi-scale prediction. Multi-scale training feeds the model images of different sizes so that it adapts to inputs of different resolutions. Multi-scale prediction detects on feature maps at different downsampling rates, i.e. an FPN (Feature Pyramid Networks) style design: YOLOv3 detects at sub_sample = 32, 16, and 8, which lets it predict targets of different sizes (the short example after this list shows what this means for a 416×416 input).
- For class prediction, single-label classification is replaced by multi-label classification, so the softmax layer used for single-label multi-class classification is replaced by logistic regression layers for multi-label classification. The reason for this change is that a softmax assumes each image or object belongs to exactly one class, but in complex scenes one object may belong to several classes. For example, if your label set contains both woman and person, then a detected woman should carry both the woman and person labels. This is multi-label classification, and logistic regression performs a binary classification for each class independently (a one-line code illustration follows the list).
- The initial bounding-box sizes (anchors) are still obtained with the k-means clustering used in YOLOv2, only their number changes. This prior knowledge helps a lot with bounding-box initialization: more boxes may help accuracy, but they noticeably slow the algorithm down. The 9 clusters the author obtained on COCO are (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326), presumably computed for a 416×416 input size.
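For illustration, here is a minimal sketch of a Darknet-53-style unit (not the repo's code; the layer sizes are made up): downsampling is done by a stride-2 convolution instead of pooling, and a residual block adds a skip connection.

import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel, stride):
    # Standard Darknet conv unit: Conv -> BatchNorm -> LeakyReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, stride, padding=kernel // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1),
    )

class ResidualBlock(nn.Module):
    # 1x1 bottleneck followed by a 3x3 conv, with a skip connection; no pooling anywhere.
    def __init__(self, channels):
        super(ResidualBlock, self).__init__()
        self.conv1 = conv_bn_leaky(channels, channels // 2, 1, 1)
        self.conv2 = conv_bn_leaky(channels // 2, channels, 3, 1)

    def forward(self, x):
        return x + self.conv2(self.conv1(x))   # residual connection

downsample = conv_bn_leaky(64, 128, 3, 2)      # stride-2 conv halves the spatial resolution
block = ResidualBlock(128)
out = block(downsample(torch.randn(1, 64, 208, 208)))   # -> (1, 128, 104, 104)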
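And a quick back-of-the-envelope example (my own arithmetic) of what the three detection scales mean for a 416×416 input:

img_size = 416
strides = [32, 16, 8]                         # the three YOLO detection layers
grids = [img_size // s for s in strides]      # grid sizes: [13, 26, 52]
total = sum(3 * g * g for g in grids)         # 3 anchors per grid cell
print(grids, total)                           # [13, 26, 52] 10647 predicted boxes in total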
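The multi-label point in code form (a toy example, not from the repo): each class gets its own sigmoid score, so several classes can be active at once.

scores = torch.sigmoid(torch.tensor([2.0, 1.5, -3.0]))   # e.g. logits for person, woman, car
labels = scores > 0.5                                     # person and woman can both be positive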
YOLOv3 architecture diagram:
Adapted from: https://blog.csdn.net/leviopku/article/details/82660381
Code structure corresponding to the diagram above:
Feature computation steps inside the YOLO layer:
Step 1: split the 255 channels of the feature map into anchors (3), box coordinates pred_boxes (4), objectness pred_conf (1), and classes pred_cls (80): 255 = 3 × (4 + 1 + 80)
x.view(num_samples, self.num_anchors, 5 + self.num_classes, grid_size, grid_size).permute(0, 1, 3, 4, 2).contiguous()
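A minimal sketch of this reshape on a made-up batch at the 13×13 scale, including the channel slicing that typically follows it (names mirror the repo, but the tensor is random):

import torch

num_samples, num_anchors, num_classes, grid_size = 2, 3, 80, 13
x = torch.randn(num_samples, num_anchors * (5 + num_classes), grid_size, grid_size)   # (2, 255, 13, 13)

prediction = x.view(num_samples, num_anchors, 5 + num_classes, grid_size, grid_size) \
              .permute(0, 1, 3, 4, 2).contiguous()                                    # (2, 3, 13, 13, 85)

pred_x    = torch.sigmoid(prediction[..., 0])    # box center x (cell-relative after sigmoid)
pred_y    = torch.sigmoid(prediction[..., 1])    # box center y
pred_w    = prediction[..., 2]                   # raw width
pred_h    = prediction[..., 3]                   # raw height
pred_conf = torch.sigmoid(prediction[..., 4])    # objectness
pred_cls  = torch.sigmoid(prediction[..., 5:])   # 80 class probabilities (multi-label)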
Step 2: transform the predicted coordinates pred_boxes (x, y, w, h) into the feature-map coordinate system as output, which serves as this YOLO layer's prediction output.
The box coordinates are expressed relative to the original image: the center coordinates x, y are normalized by the original image's w and h, i.e. x = center_x / w, y = center_y / h.
The anchor boxes (hyperparameters) are also defined on the original image, so when computing w and h on the feature map the anchors must be downsampled by the same ratio (the stride), as in the sketch below.
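A sketch of this transform, continuing the tensors from Step 1 (the grid offsets and the anchor set chosen for this scale are my own assumptions; the decoding equations are the standard YOLOv3 ones):

stride = 416.0 / grid_size                                   # 32 for the 13x13 scale
anchors = [(116, 90), (156, 198), (373, 326)]                # this scale's anchors, in input-image pixels
scaled_anchors = torch.tensor([(aw / stride, ah / stride) for aw, ah in anchors])

grid_x = torch.arange(grid_size).repeat(grid_size, 1).view(1, 1, grid_size, grid_size).float()
grid_y = torch.arange(grid_size).repeat(grid_size, 1).t().view(1, 1, grid_size, grid_size).float()
anchor_w = scaled_anchors[:, 0].view(1, num_anchors, 1, 1)
anchor_h = scaled_anchors[:, 1].view(1, num_anchors, 1, 1)

# Decode into grid (feature-map) coordinates: sigmoid keeps the center inside its cell,
# exp scales the downsampled anchor.
pred_boxes = torch.zeros(num_samples, num_anchors, grid_size, grid_size, 4)
pred_boxes[..., 0] = pred_x + grid_x
pred_boxes[..., 1] = pred_y + grid_y
pred_boxes[..., 2] = torch.exp(pred_w) * anchor_w
pred_boxes[..., 3] = torch.exp(pred_h) * anchor_h

# Multiplying by the stride maps the boxes back to input-image pixels for the final output.
output = torch.cat((pred_boxes.view(num_samples, -1, 4) * stride,
                    pred_conf.view(num_samples, -1, 1),
                    pred_cls.view(num_samples, -1, num_classes)), -1)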
Step 3: convert the ground-truth coordinates target_boxes into feature-map coordinates (the original target coordinates are relative to the original image).
Step 4: for each target, use IoU to select the most suitable anchor among the 3 candidate anchors (best_ious, best_n). A short sketch of Steps 3 and 4 follows.
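A minimal sketch of Steps 3 and 4 (the target layout (batch_idx, class, cx, cy, w, h), normalized to [0, 1], is an assumption based on the usual YOLO label format; scaled_anchors comes from the Step 2 sketch):

def wh_iou(anchor, gwh):
    # IoU computed from widths and heights only, ignoring the box centers.
    aw, ah = anchor[0], anchor[1]
    gw, gh = gwh[:, 0], gwh[:, 1]
    inter = torch.min(aw, gw) * torch.min(ah, gh)
    union = aw * ah + gw * gh - inter
    return inter / union

targets = torch.tensor([[0, 16, 0.5, 0.5, 0.2, 0.3]])    # one hypothetical box: batch 0, class 16 (dog)
target_boxes = targets[:, 2:6] * grid_size               # Step 3: normalized coords -> grid coords
gxy, gwh = target_boxes[:, :2], target_boxes[:, 2:]

# Step 4: IoU of each target's (w, h) against each of the 3 scaled anchors
ious = torch.stack([wh_iou(anchor, gwh) for anchor in scaled_anchors])
best_ious, best_n = ious.max(0)                          # best_n: index of the best anchor per target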
Step 5: build the object mask obj_mask and the no-object mask noobj_mask.
1. The object mask is set in only one case:
For the grid cell containing a target's center point, its best anchor is set to 1: obj_mask[b, best_n, gj, gi] = 1
2. The no-object mask is cleared in two cases:
(1) For the grid cell containing a target's center point, its best anchor is set to 0 (the opposite of obj_mask):
noobj_mask[b, best_n, gj, gi] = 0
(2) Wherever a target's IoU with an anchor exceeds the ignore threshold, the corresponding position is set to 0:
noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0
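Putting the two masks together in a short sketch (continuing the variables above; gi, gj are the integer cell indices of each target center, b the batch index, and ignore_thres a hypothetical threshold). obj_mask starts at all zeros and noobj_mask at all ones:

b = targets[:, 0].long()
gi, gj = gxy[:, 0].long(), gxy[:, 1].long()       # cell indices of each target center

obj_mask   = torch.zeros(num_samples, num_anchors, grid_size, grid_size, dtype=torch.uint8)
noobj_mask = torch.ones(num_samples, num_anchors, grid_size, grid_size, dtype=torch.uint8)

obj_mask[b, best_n, gj, gi] = 1                   # case 1: the responsible anchor of each target
noobj_mask[b, best_n, gj, gi] = 0                 # case 2 (1): exclude it from the no-object loss

ignore_thres = 0.5                                # hypothetical ignore threshold
for i, anchor_ious in enumerate(ious.t()):        # IoUs of target i against all 3 anchors
    noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0   # case 2 (2): ignore near-misses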
Step 6: compute the targets' encoded coordinates (tx, ty, tw, th), relative to their cell and matched anchor; a sketch of the encoding follows.
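A sketch of the encoding, continuing the variables above: the center targets are the fractional offsets within the cell, and the width/height targets are log-ratios against the matched anchor.

gx, gy = gxy[:, 0], gxy[:, 1]
gw, gh = gwh[:, 0], gwh[:, 1]

tx = torch.zeros(num_samples, num_anchors, grid_size, grid_size)
ty, tw, th = torch.zeros_like(tx), torch.zeros_like(tx), torch.zeros_like(tx)

tx[b, best_n, gj, gi] = gx - gx.floor()           # fractional x offset inside the cell, in [0, 1)
ty[b, best_n, gj, gi] = gy - gy.floor()
tw[b, best_n, gj, gi] = torch.log(gw / scaled_anchors[best_n][:, 0] + 1e-16)   # log width ratio
th[b, best_n, gj, gi] = torch.log(gh / scaled_anchors[best_n][:, 1] + 1e-16)   # log height ratio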
Step 7: compute the loss from the coordinates (x, y, w, h), objectness (conf), and classification (cls).
1. Coordinate and size loss:
loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
2. Anchor objectness (confidence) loss, computed separately for object and no-object entries:
loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask]) # tconf[obj_mask]: target confidence is 1 for object entries
loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask]) # tconf[noobj_mask]: target confidence is 0 for no-object entries
loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
3. Classification loss:
loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])
4. Total loss:
total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls
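For completeness, a sketch of the loss functions and weights these lines assume (the scale values are the defaults commonly used in this implementation family and should be treated as an assumption):

import torch.nn as nn

mse_loss = nn.MSELoss()            # for the x, y, w, h regression terms
bce_loss = nn.BCELoss()            # for objectness and the per-class (multi-label) probabilities
obj_scale, noobj_scale = 1, 100    # assumed weights for the object / no-object confidence terms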
References:
https://blog.csdn.net/u014380165/article/details/80202337
https://blog.csdn.net/leviopku/article/details/82660381
https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b
Appendix:
Test-set results are as follows: mAP is 51.5 at img_size 416 and 53.5 at img_size 608, which does not reach the 55.5 and 57.5 reported on GitHub. This may be related to using Python 2.7, or to the floating-point precision of the system environment.
If anyone can find the cause of the discrepancy, please point it out.
Evaluation details for img_size 416:
python test.py --weights_path weights/yolov3.weights
Namespace(batch_size=8, class_path='data/coco.names', conf_thres=0.001, data_config='config/coco.data', img_size=416, iou_thres=0.5, model_def='config/yolov3.cfg', n_cpu=8, nms_thres=0.5, weights_path='weights/yolov3.weights')
Compute mAP...
Detecting objects: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 625/625 [08:05<00:00, 1.29it/s]
Computing AP: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:01<00:00, 48.64it/s]
Average Precisions:
+ Class '0' (person) - AP: 0.690715786814
+ Class '1' (bicycle) - AP: 0.468696263699
+ Class '2' (car) - AP: 0.584785409049
+ Class '3' (motorbike) - AP: 0.617342547155
+ Class '4' (aeroplane) - AP: 0.736821607109
+ Class '5' (bus) - AP: 0.752159880981
+ Class '6' (train) - AP: 0.75436613555
+ Class '7' (truck) - AP: 0.418845415814
+ Class '8' (boat) - AP: 0.405536741846
+ Class '9' (traffic light) - AP: 0.444352489056
+ Class '10' (fire hydrant) - AP: 0.780323613332
+ Class '11' (stop sign) - AP: 0.720325098041
+ Class '12' (parking meter) - AP: 0.531870851371
+ Class '13' (bench) - AP: 0.333477122909
+ Class '14' (bird) - AP: 0.44403753686
+ Class '15' (cat) - AP: 0.730350406736
+ Class '16' (dog) - AP: 0.731988734812
+ Class '17' (horse) - AP: 0.775121552363
+ Class '18' (sheep) - AP: 0.598467814833
+ Class '19' (cow) - AP: 0.523387458122
+ Class '20' (elephant) - AP: 0.856378839961
+ Class '21' (bear) - AP: 0.746202492129
+ Class '22' (zebra) - AP: 0.787078306605
+ Class '23' (giraffe) - AP: 0.822787313475
+ Class '24' (backpack) - AP: 0.324516366247
+ Class '25' (umbrella) - AP: 0.527123866383
+ Class '26' (handbag) - AP: 0.204464062772
+ Class '27' (tie) - AP: 0.495962300433
+ Class '28' (suitcase) - AP: 0.569835653931
+ Class '29' (frisbee) - AP: 0.635626602247
+ Class '30' (skis) - AP: 0.40624013442
+ Class '31' (snowboard) - AP: 0.454860015814
+ Class '32' (sports ball) - AP: 0.543138370312
+ Class '33' (kite) - AP: 0.409971165338
+ Class '34' (baseball bat) - AP: 0.503833906346
+ Class '35' (baseball glove) - AP: 0.477819691368
+ Class '36' (skateboard) - AP: 0.684912073091
+ Class '37' (surfboard) - AP: 0.622125284525
+ Class '38' (tennis racket) - AP: 0.687646089588
+ Class '39' (bottle) - AP: 0.422858294504
+ Class '40' (wine glass) - AP: 0.510764916053
+ Class '41' (cup) - AP: 0.470900058275
+ Class '42' (fork) - AP: 0.441071681355
+ Class '43' (knife) - AP: 0.288951139356
+ Class '44' (spoon) - AP: 0.212644605589
+ Class '45' (bowl) - AP: 0.488293672102
+ Class '46' (banana) - AP: 0.274810213987
+ Class '47' (apple) - AP: 0.176945733903
+ Class '48' (sandwich) - AP: 0.459509805447
+ Class '49' (orange) - AP: 0.286156884797
+ Class '50' (broccoli) - AP: 0.349783550871
+ Class '51' (carrot) - AP: 0.223717764721
+ Class '52' (hot dog) - AP: 0.3702692587
+ Class '53' (pizza) - AP: 0.529775775173
+ Class '54' (donut) - AP: 0.506838476713
+ Class '55' (cake) - AP: 0.476632708388
+ Class '56' (chair) - AP: 0.398044915995
+ Class '57' (sofa) - AP: 0.521408653907
+ Class '58' (pottedplant) - AP: 0.423973845076
+ Class '59' (bed) - AP: 0.633835173775
+ Class '60' (diningtable) - AP: 0.413801242547
+ Class '61' (toilet) - AP: 0.737728403797
+ Class '62' (tvmonitor) - AP: 0.699158857175
+ Class '63' (laptop) - AP: 0.687128516643
+ Class '64' (mouse) - AP: 0.721448041651
+ Class '65' (remote) - AP: 0.478972941695
+ Class '66' (keyboard) - AP: 0.664482993427
+ Class '67' (cell phone) - AP: 0.397435785484
+ Class '68' (microwave) - AP: 0.642376309562
+ Class '69' (oven) - AP: 0.483132993049
+ Class '70' (toaster) - AP: 0.162337662338
+ Class '71' (sink) - AP: 0.507507409808
+ Class '72' (refrigerator) - AP: 0.68628967803
+ Class '73' (book) - AP: 0.17111818793
+ Class '74' (clock) - AP: 0.688645968288
+ Class '75' (vase) - AP: 0.441558495019
+ Class '76' (scissors) - AP: 0.34379878322
+ Class '77' (teddy bear) - AP: 0.58595909793
+ Class '78' (hair drier) - AP: 0.113636363636
+ Class '79' (toothbrush) - AP: 0.264372243744
mAP: 0.514519651314