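SSD Training and Testing

The previous post covered how to build, install, and configure SSD. This post covers how to train a model on the dataset and test how well it performs.

Dataset

First, download the dataset. This walkthrough uses VOC 2007 and VOC 2012: 2.7 GB in total as downloaded, 2.9 GB once extracted.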

cd $HOME/data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
# Extract the data.
tar -xvf VOCtrainval_11-May-2012.tar
tar -xvf VOCtrainval_06-Nov-2007.tar
tar -xvf VOCtest_06-Nov-2007.tar

Note the download location: it must be the data directory under your home directory, because the scripts in the next step expect that path.
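Before moving on, it can help to confirm that extraction produced the layout the later scripts expect. The sketch below is only a convenience check, not part of SSD; the helper name missing_voc_dirs is hypothetical, and it assumes the three tarballs unpack into VOCdevkit/VOC2007 and VOCdevkit/VOC2012.

```python
import os

# Assumption: the VOC tarballs unpack into these two directories.
EXPECTED_DIRS = ("VOCdevkit/VOC2007", "VOCdevkit/VOC2012")

def missing_voc_dirs(data_root):
    """Return the expected VOC directories that are absent under data_root."""
    return [d for d in EXPECTED_DIRS
            if not os.path.isdir(os.path.join(data_root, d))]

if __name__ == "__main__":
    # $HOME/data is the location used by the download commands above.
    root = os.path.join(os.path.expanduser("~"), "data")
    missing = missing_voc_dirs(root)
    print("all present" if not missing else "missing: " + ", ".join(missing))
```

An empty result means both directories are in place under $HOME/data.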

LMDB

Create the LMDB files:

cd $CAFFE_ROOT
# Create the trainval.txt, test.txt, and test_name_size.txt in data/VOC0712/
./data/VOC0712/create_list.sh
# You can modify the parameters in create_data.sh if needed.
# It will create lmdb files for trainval and test with encoded original image:
# - $HOME/data/VOCdevkit/VOC0712/lmdb/VOC0712_trainval_lmdb
# - $HOME/data/VOCdevkit/VOC0712/lmdb/VOC0712_test_lmdb
# and make soft links at examples/VOC0712/
./data/VOC0712/create_data.sh

Creating the lists produces this output:

hyper372@hyper372-ai:~/Documents/caffe$ ./data/VOC0712/create_list.sh
Create list for VOC2007 trainval...
Create list for VOC2012 trainval...
Create list for VOC2007 test...
I0605 10:07:47.896740 5859 get_image_size.cpp:61] A total of 4952 images.
I0605 10:07:49.007894 5859 get_image_size.cpp:100] Processed 1000 files.
I0605 10:07:50.128299 5859 get_image_size.cpp:100] Processed 2000 files.
I0605 10:07:51.242344 5859 get_image_size.cpp:100] Processed 3000 files.
I0605 10:07:52.354455 5859 get_image_size.cpp:100] Processed 4000 files.
I0605 10:07:53.439630 5859 get_image_size.cpp:105] Processed 4952 files.

Creating the databases produces this output:

hyper372@hyper372-ai:~/Documents/caffe$ ./data/VOC0712/create_data.sh
/home/hyper372/Documents/caffe/build/tools/convert_annoset --anno_type=detection --label_type=xml --label_map_file=/home/hyper372/Documents/caffe/data/VOC0712/../../data/VOC0712/labelmap_voc.prototxt --check_label=True --min_dim=0 --max_dim=0 --resize_height=0 --resize_width=0 --backend=lmdb --shuffle=False --check_size=False --encode_type=jpg --encoded=True --gray=False /home/hyper372/data/VOCdevkit/ /home/hyper372/Documents/caffe/data/VOC0712/../../data/VOC0712/test.txt /home/hyper372/data/VOCdevkit/VOC0712/lmdb/VOC0712_test_lmdb
I0605 10:21:20.892621 6031 convert_annoset.cpp:122] A total of 4952 images.
I0605 10:21:20.893379 6031 db_lmdb.cpp:35] Opened lmdb /home/hyper372/data/VOCdevkit/VOC0712/lmdb/VOC0712_test_lmdb
I0605 10:21:23.375008 6031 convert_annoset.cpp:195] Processed 1000 files.
I0605 10:21:25.777346 6031 convert_annoset.cpp:195] Processed 2000 files.
I0605 10:21:28.468402 6031 convert_annoset.cpp:195] Processed 3000 files.
I0605 10:21:31.168674 6031 convert_annoset.cpp:195] Processed 4000 files.
I0605 10:21:33.670279 6031 convert_annoset.cpp:201] Processed 4952 files.
/home/hyper372/Documents/caffe/build/tools/convert_annoset --anno_type=detection --label_type=xml --label_map_file=/home/hyper372/Documents/caffe/data/VOC0712/../../data/VOC0712/labelmap_voc.prototxt --check_label=True --min_dim=0 --max_dim=0 --resize_height=0 --resize_width=0 --backend=lmdb --shuffle=False --check_size=False --encode_type=jpg --encoded=True --gray=False /home/hyper372/data/VOCdevkit/ /home/hyper372/Documents/caffe/data/VOC0712/../../data/VOC0712/trainval.txt /home/hyper372/data/VOCdevkit/VOC0712/lmdb/VOC0712_trainval_lmdb
I0605 10:21:34.663084 6100 convert_annoset.cpp:122] A total of 16551 images.
I0605 10:21:34.663497 6100 db_lmdb.cpp:35] Opened lmdb /home/hyper372/data/VOCdevkit/VOC0712/lmdb/VOC0712_trainval_lmdb
I0605 10:21:37.976790 6100 convert_annoset.cpp:195] Processed 1000 files.
I0605 10:21:41.071249 6100 convert_annoset.cpp:195] Processed 2000 files.
I0605 10:21:44.191231 6100 convert_annoset.cpp:195] Processed 3000 files.
I0605 10:21:47.320384 6100 convert_annoset.cpp:195] Processed 4000 files.
I0605 10:21:50.551687 6100 convert_annoset.cpp:195] Processed 5000 files.
I0605 10:21:53.697355 6100 convert_annoset.cpp:195] Processed 6000 files.
I0605 10:21:56.773370 6100 convert_annoset.cpp:195] Processed 7000 files.
I0605 10:21:59.869189 6100 convert_annoset.cpp:195] Processed 8000 files.
I0605 10:22:02.992766 6100 convert_annoset.cpp:195] Processed 9000 files.
I0605 10:22:06.083061 6100 convert_annoset.cpp:195] Processed 10000 files.
I0605 10:22:09.214797 6100 convert_annoset.cpp:195] Processed 11000 files.
I0605 10:22:12.303718 6100 convert_annoset.cpp:195] Processed 12000 files.
I0605 10:22:15.529724 6100 convert_annoset.cpp:195] Processed 13000 files.
I0605 10:22:18.636446 6100 convert_annoset.cpp:195] Processed 14000 files.
I0605 10:22:21.745776 6100 convert_annoset.cpp:195] Processed 15000 files.
I0605 10:22:24.911562 6100 convert_annoset.cpp:195] Processed 16000 files.
I0605 10:22:26.647538 6100 convert_annoset.cpp:201] Processed 16551 files.
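The logs report 4952 test images and 16551 trainval images. One quick sanity check is to count the entries in the generated list files and compare them with those totals. This is a minimal sketch; the helper name count_entries is hypothetical, and it assumes each non-empty line of test.txt and trainval.txt describes one image.

```python
def count_entries(list_path):
    """Count non-empty lines in a list file such as trainval.txt or test.txt."""
    with open(list_path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

Run from the caffe directory, count_entries("data/VOC0712/test.txt") would be expected to match the 4952 in the log, and count_entries("data/VOC0712/trainval.txt") the 16551.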

Base model

Download the base model file from https://github.com/conner99/VGGNet/ and place it in the .../caffe/models/VGGNet directory.

Training

From the caffe directory, run:

python3 examples/ssd/ssd_pascal.py

Training starts and prints output like this:

...
02:18.929926 9181 solver.cpp:259] Train net output #0: mbox_loss = 16.1801 (* 1 = 16.1801 loss)
I0606 10:02:18.929934 9181 sgd_solver.cpp:138] Iteration 30, lr = 0.0001
I0606 10:02:19.643611 9181 solver.cpp:243] Iteration 40, loss = 15.662
I0606 10:02:19.643646 9181 solver.cpp:259] Train net output #0: mbox_loss = 14.884 (* 1 = 14.884 loss)
I0606 10:02:19.643652 9181 sgd_solver.cpp:138] Iteration 40, lr = 0.0001
I0606 10:02:20.352501 9181 solver.cpp:243] Iteration 50, loss = 15.9099
I0606 10:02:20.352522 9181 solver.cpp:259] Train net output #0: mbox_loss = 14.0294 (* 1 = 14.0294 loss)
I0606 10:02:20.352526 9181 sgd_solver.cpp:138] Iteration 50, lr = 0.0001
I0606 10:02:21.057381 9181 solver.cpp:243] Iteration 60, loss = 12.4434
I0606 10:02:21.057401 9181 solver.cpp:259] Train net output #0: mbox_loss = 14.9566 (* 1 = 14.9566 loss)
I0606 10:02:21.057406 9181 sgd_solver.cpp:138] Iteration 60, lr = 0.0001
I0606 10:02:21.766348 9181 solver.cpp:243] Iteration 70, loss = 10.3261
I0606 10:02:21.766368 9181 solver.cpp:259] Train net output #0: mbox_loss = 14.9214 (* 1 = 14.9214 loss)
I0606 10:02:21.766372 9181 sgd_solver.cpp:138] Iteration 70, lr = 0.0001
I0606 10:02:22.482571 9181 solver.cpp:243] Iteration 80, loss = 14.9099
I0606 10:02:22.482591 9181 solver.cpp:259] Train net output #0: mbox_loss = 14.1905 (* 1 = 14.1905 loss)
I0606 10:02:22.482596 9181 sgd_solver.cpp:138] Iteration 80, lr = 0.0001
I0606 10:02:23.193933 9181 solver.cpp:243] Iteration 90, loss = 13.2082
I0606 10:02:23.193953 9181 solver.cpp:259] Train net output #0: mbox_loss = 14.931 (* 1 = 14.931 loss)
I0606 10:02:23.193957 9181 sgd_solver.cpp:138] Iteration 90, lr = 0.0001

Overall, mbox_loss and loss trend downward.
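Because the per-iteration loss is noisy, the trend is easier to judge after pulling the values out of the log. The following is a rough sketch rather than an official Caffe tool: the regular expression assumes the "Iteration N, loss = X" wording seen in the solver.cpp lines above, and the function names are hypothetical.

```python
import re

# Matches lines such as "... solver.cpp:243] Iteration 40, loss = 15.662".
# Lines reporting "lr = ..." do not match because they lack "loss =".
LOSS_RE = re.compile(r"Iteration (\d+), loss = ([0-9.eE+-]+)")

def parse_losses(log_text):
    """Return (iteration, loss) tuples found in Caffe solver output."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in LOSS_RE.finditer(log_text)]

def mean_loss(pairs):
    """Average loss over a non-empty list of (iteration, loss) tuples."""
    return sum(loss for _, loss in pairs) / len(pairs)

# Two loss lines and one lr line copied from the training output above.
sample = """\
I0606 10:02:19.643611 9181 solver.cpp:243] Iteration 40, loss = 15.662
I0606 10:02:19.643652 9181 sgd_solver.cpp:138] Iteration 40, lr = 0.0001
I0606 10:02:21.766348 9181 solver.cpp:243] Iteration 70, loss = 10.3261
"""
print(parse_losses(sample))  # [(40, 15.662), (70, 10.3261)]
```

Comparing mean_loss over the first few hundred iterations with mean_loss over the last few hundred gives a steadier picture than any single line.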

Training completes:

...
I0606 12:50:29.691890 9181 sgd_solver.cpp:138] Iteration 119950, lr = 1e-06
I0606 12:50:30.420228 9181 solver.cpp:243] Iteration 119960, loss = 2.89373
I0606 12:50:30.420271 9181 solver.cpp:259] Train net output #0: mbox_loss = 2.66499 (* 1 = 2.66499 loss)
I0606 12:50:30.420276 9181 sgd_solver.cpp:138] Iteration 119960, lr = 1e-06
I0606 12:50:31.145910 9181 solver.cpp:243] Iteration 119970, loss = 3.60434
I0606 12:50:31.145931 9181 solver.cpp:259] Train net output #0: mbox_loss = 5.0304 (* 1 = 5.0304 loss)
I0606 12:50:31.145933 9181 sgd_solver.cpp:138] Iteration 119970, lr = 1e-06
I0606 12:50:31.867653 9181 solver.cpp:243] Iteration 119980, loss = 2.38721
I0606 12:50:31.867677 9181 solver.cpp:259] Train net output #0: mbox_loss = 5.5256 (* 1 = 5.5256 loss)
I0606 12:50:31.867681 9181 sgd_solver.cpp:138] Iteration 119980, lr = 1e-06
I0606 12:50:32.584725 9181 solver.cpp:243] Iteration 119990, loss = 4.15097
I0606 12:50:32.584744 9181 solver.cpp:259] Train net output #0: mbox_loss = 3.42382 (* 1 = 3.42382 loss)
I0606 12:50:32.584748 9181 sgd_solver.cpp:138] Iteration 119990, lr = 1e-06
I0606 12:50:33.244800 9181 solver.cpp:596] Snapshotting to binary proto file models/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300_iter_120000.caffemodel
I0606 12:50:33.442319 9181 sgd_solver.cpp:307] Snapshotting solver state to binary proto file models/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300_iter_120000.solverstate
I0606 12:50:33.523787 9181 solver.cpp:332] Iteration 120000, loss = 3.751
I0606 12:50:33.523806 9181 solver.cpp:433] Iteration 120000, Testing net (#0)
I0606 12:50:33.523845 9181 net.cpp:693] Ignoring source layer mbox_loss
I0606 12:52:32.019611 9181 solver.cpp:546] Test net output #0: detection_eval = 0.576414
I0606 12:52:32.019716 9181 solver.cpp:337] Optimization Done.
I0606 12:52:32.019721 9181 caffe.cpp:254] Optimization Done.

Training results

The training results are in the .../caffe/models/VGGNet/VOC0712/SSD_300x300/ directory.

hyper372@hyper372-ai:~/Documents/caffe/models/VGGNet/VOC0712/SSD_300x300$ ll
total 410856
drwxrwxr-x 2 hyper372 hyper372 4096 6月 6 12:50 ./
drwxrwxr-x 3 hyper372 hyper372 4096 6月 5 10:25 ../
-rw-rw-r-- 1 hyper372 hyper372 26298 6月 6 10:02 deploy.prototxt
-rw-rw-r-- 1 hyper372 hyper372 669 6月 6 10:02 solver.prototxt
-rw-rw-r-- 1 hyper372 hyper372 27125 6月 6 10:02 test.prototxt
-rw-rw-r-- 1 hyper372 hyper372 28593 6月 6 10:02 train.prototxt
-rw-rw-r-- 1 hyper372 hyper372 105154337 6月 6 12:50 VGG_VOC0712_SSD_300x300_iter_120000.caffemodel
-rw-rw-r-- 1 hyper372 hyper372 105143086 6月 6 12:50 VGG_VOC0712_SSD_300x300_iter_120000.solverstate
-rw-rw-r-- 1 hyper372 hyper372 105154337 6月 6 11:53 VGG_VOC0712_SSD_300x300_iter_80000.caffemodel
-rw-rw-r-- 1 hyper372 hyper372 105143085 6月 6 11:53 VGG_VOC0712_SSD_300x300_iter_80000.solverstate

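On a Mechrevo Deep Sea Ghost Z3 Air-S 2060, training took 3 hours with a batch size of 1 and 18 hours 20 minutes with a batch size of 8. Any larger batch ran out of memory.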
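The 3-hour figure can be roughly cross-checked against the log timestamps. The sketch below takes the times of the "Iteration 119960" and "Iteration 119990" loss lines from the training output and extrapolates to 120000 iterations; the function name is hypothetical, and the estimate assumes the iteration rate stayed constant throughout the run.

```python
from datetime import datetime

def seconds_per_iteration(t_start, it_start, t_end, it_end):
    """Average wall-clock seconds per iteration between two glog timestamps."""
    fmt = "%H:%M:%S.%f"
    elapsed = datetime.strptime(t_end, fmt) - datetime.strptime(t_start, fmt)
    return elapsed.total_seconds() / (it_end - it_start)

# Timestamps of the loss lines for iterations 119960 and 119990 in the log.
rate = seconds_per_iteration("12:50:30.420228", 119960,
                             "12:50:32.584725", 119990)
print(round(rate, 4), "s/iteration")
print(round(rate * 120000 / 3600, 1), "hours for 120000 iterations")
```

This works out to about 0.07 seconds per iteration, or about 2.4 hours of pure training. The remainder of the 3 hours is plausibly spent on snapshots and test passes, such as the roughly two-minute test at iteration 120000.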

Model testing

Object detection

Run the command:

python3 examples/ssd/ssd_detect.py

The output is:

...
I0606 13:52:13.823376 10354 net.cpp:228] pool1 does not need backward computation.
I0606 13:52:13.823379 10354 net.cpp:228] relu1_2 does not need backward computation.
I0606 13:52:13.823401 10354 net.cpp:228] conv1_2 does not need backward computation.
I0606 13:52:13.823403 10354 net.cpp:228] relu1_1 does not need backward computation.
I0606 13:52:13.823405 10354 net.cpp:228] conv1_1 does not need backward computation.
I0606 13:52:13.823410 10354 net.cpp:228] data_input_0_split does not need backward computation.
I0606 13:52:13.823411 10354 net.cpp:228] input does not need backward computation.
I0606 13:52:13.823413 10354 net.cpp:270] This network produces output detection_out
I0606 13:52:13.823448 10354 net.cpp:283] Network initialization done.
I0606 13:52:13.887385 10354 net.cpp:761] Ignoring source layer data
I0606 13:52:13.887403 10354 net.cpp:761] Ignoring source layer data_data_0_split
I0606 13:52:13.907215 10354 net.cpp:761] Ignoring source layer mbox_loss
[[0.43371916, 0.041391477, 0.72588027, 0.50166625, 15, 0.642155, 'person']]
481 323
[0.43371916, 0.041391477, 0.72588027, 0.50166625, 15, 0.642155, 'person']
[209, 13, 349, 162]
[209, 13] person
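The last lines of the output show how a detection is turned into pixel coordinates: the first four values are normalized box corners, and "481 323" appears to be the image width and height. A minimal sketch of that conversion follows; the function name is hypothetical, and rounding to the nearest integer is an assumption that happens to reproduce the printed box.

```python
def to_pixel_box(norm_box, width, height):
    """Scale a normalized [xmin, ymin, xmax, ymax] box to integer pixels."""
    xmin, ymin, xmax, ymax = norm_box
    return [int(round(xmin * width)), int(round(ymin * height)),
            int(round(xmax * width)), int(round(ymax * height))]

# Detection row copied from the output above:
# [xmin, ymin, xmax, ymax, label id, confidence, label name]
detection = [0.43371916, 0.041391477, 0.72588027, 0.50166625,
             15, 0.642155, 'person']
print(to_pixel_box(detection[:4], 481, 323))  # [209, 13, 349, 162]
```

The first two values of the result, [209, 13], are the top-left corner printed next to the label in the final output line.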

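The image to be checked is in .../caffe/examples/images/; the path can be changed in the source code.

[Figure: image to be detected]

The detection result is written to the .../caffe directory.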

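[Figure: detection result]

With a batch size of 8, the model detects more objects.

[Figure: detection result with a batch size of 8]

Video detection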

Run the command:

python3 examples/ssd/ssd_pascal_webcam.py

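The webcam opens automatically, and the detection result is:

[Figure: webcam detection result]

With a batch size of 8, the detection confidence is 0.95.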

Model scoring

Run the command:

python3 examples/ssd/score_ssd_pascal.py

Part of the output is:

...
I0606 13:58:26.945226 11671 net.cpp:228] relu2_1 does not need backward computation.
I0606 13:58:26.945227 11671 net.cpp:228] conv2_1 does not need backward computation.
I0606 13:58:26.945230 11671 net.cpp:228] pool1 does not need backward computation.
I0606 13:58:26.945230 11671 net.cpp:228] relu1_2 does not need backward computation.
I0606 13:58:26.945232 11671 net.cpp:228] conv1_2 does not need backward computation.
I0606 13:58:26.945233 11671 net.cpp:228] relu1_1 does not need backward computation.
I0606 13:58:26.945235 11671 net.cpp:228] conv1_1 does not need backward computation.
I0606 13:58:26.945237 11671 net.cpp:228] data_data_0_split does not need backward computation.
I0606 13:58:26.945238 11671 net.cpp:228] data does not need backward computation.
I0606 13:58:26.945240 11671 net.cpp:270] This network produces output detection_eval
I0606 13:58:26.945272 11671 net.cpp:283] Network initialization done.
I0606 13:58:26.945401 11671 solver.cpp:75] Solver scaffolding done.
I0606 13:58:26.946391 11671 caffe.cpp:155] Finetuning from models/VGGNet/VOC0712/SSD_300x300/VGG_VOC0712_SSD_300x300_iter_120000.caffemodel
I0606 13:58:27.101177 11671 net.cpp:761] Ignoring source layer mbox_loss
I0606 13:58:27.103570 11671 caffe.cpp:251] Starting Optimization
I0606 13:58:27.103576 11671 solver.cpp:294] Solving VGG_VOC0712_SSD_300x300_train
I0606 13:58:27.103578 11671 solver.cpp:295] Learning Rate Policy: multistep
I0606 13:58:27.276463 11671 solver.cpp:332] Iteration 0, loss = 5.70725
I0606 13:58:27.276504 11671 solver.cpp:433] Iteration 0, Testing net (#0)
I0606 13:58:27.285435 11671 net.cpp:693] Ignoring source layer mbox_loss
I0606 14:00:27.669929 11671 solver.cpp:546] Test net output #0: detection_eval = 0.576414
I0606 14:00:27.670023 11671 solver.cpp:337] Optimization Done.
I0606 14:00:27.670027 11671 caffe.cpp:254] Optimization Done.

The score is 0.576414, a failing grade. With a batch size of 8 the score is 0.715659, which is acceptable.

Errors

These are the errors I ran into.

NameError: name 'xrange' is not defined. Did you mean: 'range'?

xrange is a Python 2 function; change it to range.
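If editing every call site is inconvenient, an alternative is to define the missing name once near the top of the affected script. This is a common compatibility idiom rather than the fix described above, and it is offered only as a sketch.

```python
try:
    xrange  # Python 2: the name already exists
except NameError:
    xrange = range  # Python 3: range is already lazy, so alias it

print(list(xrange(3)))  # [0, 1, 2]
```

The alias must be defined before the first call to xrange in the file.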

TypeError: '>' not supported between instances of 'builtin_function_or_method' and 'int'

Open .../caffe/python/caffe/model_libs.py

Comment out line 16 as follows:

#assert len > 0
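The reason this line fails is that len here is the built-in function itself, not a number. Python 2 allowed ordering comparisons between unrelated types, while Python 3 rejects them. The sketch below reproduces the failure in isolation; the function name is hypothetical and the snippet is not taken from model_libs.py.

```python
def compare_len_to_zero():
    """Evaluate the failing expression and return the error message."""
    try:
        return len > 0  # compares the built-in function with an int
    except TypeError as exc:
        return str(exc)

print(compare_len_to_zero())
```

On Python 3 this prints the same message as the TypeError above, which is why commenting out the assertion makes the error go away.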

TypeError: 1.0 has type float, but expected one of: int, long

Open .../caffe/python/caffe/model_libs.py

Change line 156 to:

pad = int((3+(dilation-1) *2)-1) // 2

Line 375:

pad = int((kernel_size + (dilation -1)*(kernel_size-1))-1)//2

Line 417:

pad = int((kernel_size+(dilation-1)*(kernel_size-1))-1) //2

All three changes replace / with //.
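The underlying cause is that / always returns a float in Python 3, while the protobuf field for pad expects an integer. The sketch below wraps the padding expression from lines 375 and 417 in a hypothetical helper, conv_pad, to show the difference.

```python
def conv_pad(kernel_size, dilation):
    """Padding expression from the post, using floor division."""
    return int((kernel_size + (dilation - 1) * (kernel_size - 1)) - 1) // 2

print(conv_pad(3, 1), conv_pad(3, 6))  # 1 6
print(type((3 - 1) / 2).__name__)      # float
print(type((3 - 1) // 2).__name__)     # int
```

With // the result stays an int, so it can be assigned to the pad field without the type error.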

Check failed: status == CUDNN_STATUS_SUCCESS (1 vs. 0) CUDNN_STATUS_NOT_INITIALIZED

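This error has many possible causes, such as a mismatch between the GPU and its driver or insufficient GPU memory; deleting the ~/.nv directory is another commonly suggested fix. What worked for me was adding engine: CAFFE to every convolution_param in train.prototxt.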
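Adding the field by hand to every layer is tedious, so a small text substitution can do it. The following is a rough sketch, not the method used in the post: the function name is hypothetical, the sample layer is made up for illustration, and it assumes no convolution_param block already sets an engine.

```python
import re

def force_caffe_engine(prototxt_text):
    """Insert 'engine: CAFFE' right after each 'convolution_param {' opener.

    Assumes no block already sets an engine; running this twice would
    duplicate the field.
    """
    return re.sub(r"(convolution_param\s*\{)",
                  r"\1\n    engine: CAFFE", prototxt_text)

# Hypothetical layer definition, used only to illustrate the substitution.
sample = """layer {
  name: "conv1_1"
  type: "Convolution"
  convolution_param {
    num_output: 64
    kernel_size: 3
  }
}
"""
print(force_caffe_engine(sample))
```

To apply it, read train.prototxt into a string, pass it through the function, and write the result back, keeping a backup of the original file.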

Check failed: a <= b (0 vs. -1.19209e-007)

Open .../caffe/src/caffe/util/math_functions.cpp

Comment out line 250:

//CHECK_LE(a,b);

Data layer prefetch queue empty

This error is a side effect of the fix for the previous one.

Open .../caffe/src/caffe/util/sampler.cpp

At line 109, add:

if (bbox_width >= 1.0) { bbox_width = 1.0; }
if (bbox_height >= 1.0) { bbox_height = 1.0; }

//Figure out top left coordinates

This keeps the box dimensions from going out of bounds.

2 vs. 0 Out of memory

There is not enough memory. Reduce batch_size in .../caffe/examples/ssd/ssd_pascal.py.

(10 vs. 0) invalid device ordinal

The GPU ordinal is wrong. The official script uses four GPUs, but I have only one, so I changed line 332 of .../caffe/examples/ssd/ssd_pascal.py to use a single GPU.

(4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR

In my case memory had run out; I killed the processes that nvidia-smi showed using the most memory.
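To pick which processes to stop, it helps to rank them by GPU memory use. The sketch below parses text in the shape that nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits is expected to print; the function name and the sample rows are hypothetical, so check the real output on your machine first.

```python
def heaviest_processes(csv_text, top=3):
    """Parse 'pid, used_memory_MiB' rows and return the top consumers."""
    rows = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        # Skip anything that is not a plain "pid, MiB" pair.
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1])))
    return sorted(rows, key=lambda r: r[1], reverse=True)[:top]

# Hypothetical sample rows: process id, used GPU memory in MiB.
sample = "1201, 310\n9181, 4521\n2044, 95\n"
print(heaviest_processes(sample, top=2))  # [(9181, 4521), (1201, 310)]
```

The process ids at the head of the list are the candidates to stop, after confirming that nothing important is running in them.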

TypeError: JPEGFormat.Reader._open() got an unexpected keyword argument 'as_grey'

The failing function is called by caffe.io.load_image().

In .../caffe/examples/ssd/ssd_detect.py, change line 68 from

image = caffe.io.load_image(image_file)

to

image = cv2.imread(image_file)
image = cv2.cvtColor(image,cv2.COLOR_BGR2RGB)
image = image/255

The cv2 library must be imported for this to work.

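mbox_loss = nan (* 1 = nan loss) or loss = nan

The loss value has overflowed.

Change the value of base_lr at line 232 of .../caffe/examples/ssd/ssd_pascal.py to 1/10 of its current value. If that does not help, reduce it to 1/10 again.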

Couldn't find any detections

Same fix as above.

Source code

My modified, working source code is in caffe-ssd.


SSD Training and Testing
https://feater.top/ssd/how-to-train-and-test-ssd-model/
Author
JackeyLea
Published
June 10, 2024
License