
  1. (Dept. of Mechatronics Engineering, Kyungnam University, Korea. E-mail: aobangqian@163.com)



Keywords: Object detection, Accuracy, Tracking system, Monocular camera

1. Introduction

Traditional moving object detection methods generally use the background subtraction technique, which separates objects from the background by comparing each frame with a background model. Frame-difference variants of background subtraction take the previous frames as the background and treat the newest frame as containing the object. Other methods use Haar cascade classifiers (1), which implement the AdaBoost algorithm. The latter is organized as a screening cascade, where each node is a classifier comprising multiple trees. At any level, the calculation terminates once the conclusion "not in the category" is reached. This algorithm proved to be fast, but it is not sensitive to stationary or slowly moving objects, and it cannot detect the inner pixels of moving objects that are not uniformly colored.
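To make the frame-difference idea above concrete, the following minimal sketch (a rough illustration using OpenCV; the video file name and the threshold value are placeholders rather than settings from this work) marks the pixels that changed relative to the previous frame:

import cv2

cap = cv2.VideoCapture("video.mp4")              # hypothetical input video
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)          # previous frame serves as the background
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)   # changed (moving) pixels
    prev_gray = gray                             # the newest frame becomes the next background
cap.release()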

Other object detection methods are also available. For instance, the mean shift algorithm (2) can quickly find targets using few iterations and generally achieves good detection results. However, it cannot handle occlusion of the target and cannot adapt to changes in the shape and size of a moving target. The cam-shift algorithm (3) is an improved version that adapts to changes in the size and shape of the moving target, leading to good detection results. However, the target area can easily grow too large, which may eventually lead to losing the target when the colors of the background and the target are close. In addition, the Kalman filter (4) assumes that the object's motion obeys a Gaussian model. It predicts the target's motion state, compares it with that of the observation model, and updates the state of the moving target according to the error. However, this method has a low accuracy. The particle filtering algorithm (5) re-samples the particle distribution based on the current detection results and diffuses the particles according to their distribution. The algorithm re-observes the target state through the diffusion result and finally normalizes and updates the target state. However, a large number of samples is needed to approximate the posterior probability of the system, and the re-sampling stage causes a loss of sample validity and diversity, which can lead to sample depletion. These algorithms perform well in several applications. However, they are time consuming and do not attain real-time performance because of their high computational cost. In addition, they are not sufficiently robust when the environment changes or when the object features are locally distorted by sudden changes in light intensity (6), (7). Occlusion and background noise also make the object detector much more complex (8), (9). Because of all these negative factors, designing a detection system becomes more challenging when both real-time requirements and computing power limitations must be considered.
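The predict/correct cycle of the Kalman filter described above can be sketched as follows, using OpenCV's KalmanFilter with a constant-velocity state model; the noise covariance is an illustrative assumption, not a value from this work:

import numpy as np
import cv2

kf = cv2.KalmanFilter(4, 2)                       # state [x, y, vx, vy], measurement [x, y]
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3          # assumed process noise

def track_step(measured_x, measured_y):
    # Predict the target state from the motion model, then correct it with the observation.
    predicted = kf.predict()
    corrected = kf.correct(np.array([[measured_x], [measured_y]], dtype=np.float32))
    return predicted, corrected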

The remainder of this paper is organized as follows. Section 2 summarizes the existing algorithms, provides some of their applications and analyzes their advantages and disadvantages. Section 3 presents the elaborately designed M-SSD model structure while setting its key parameters. It also shows the training and validation of the proposed algorithm, with its performance evaluation. The monocular camera algorithm is implemented and compared with other instruments in Section 4. Finally, conclusions and perspectives are drawn in Section 5.

2. Related works

Deep learning (DL) (10) and convolutional neural networks (CNNs) (11) have developed rapidly and demonstrated high efficiency in classification and image recognition. They have become crucial solutions in several application domains, especially computer vision and object detection. A CNN is a basic network structure composed of a feature extractor and a classifier. In the past decades, great progress has been made on CNN-based systems. Object detection (or tracking) is a fundamental problem in computer vision.

Several classical and efficient object detectors have been proposed recently. For instance, the Region-based CNN (R-CNN) (12) applies a high-capacity CNN to bottom-up candidate regions in order to localize and segment objects. Fast R-CNN (13) and Faster R-CNN (14) speed up the R-CNN approach and improve its accuracy. The Region-based Fully Convolutional Network (R-FCN) (15) uses a special convolutional layer to construct position-sensitive score maps. It introduces translation variance into the Fully Convolutional Network (FCN): each position-sensitive map encodes the relative spatial position information of the region of interest, and a position-sensitive region-of-interest pooling layer is added on top of the FCN to supervise these score maps.

The You Only Look Once (YOLO) (16) method divides a single image into multiple grid cells and then performs localization and classification in each cell, in order to predict the confidence and location for multiple categories. The Single Shot MultiBox Detector (SSD) (17) obtains actual bounding boxes and scores for each feature map by creating bounding box candidates on the feature map. It is based on proposals with multi-scale features and achieves a balance between efficiency and effectiveness. From the perspective of computing speed, the SSD can almost achieve real-time performance. However, real-time operation remains a significant challenge on platforms with limited computational resources in several applications.

In all these object detectors, two criteria are used to judge whether the system is practical: (a) computational complexity and (b) detection accuracy. Generally, higher accuracy requires more CNN layers to extract more features, which inevitably increases the computational complexity, and vice versa. It is hard to balance accuracy and speed. Hence, designing a real-time object detection or tracking system on a limited computing platform is still a challenging problem.

3. Object detection algorithm

3.1 The M-SSD model

In this paper, an improved SSD model is designed. The SSD approach produces a fixed-size collection of bounding boxes and scores for the presence of object class instances using a feed-forward convolutional network, followed by a non-maximum suppression step to perform the object detection. The model utilizes the visual geometry group network (VGG-16) as its basic structure. However, it discards the last fully connected layers, adds a set of auxiliary convolutional layers to extract features at multiple scales, and decreases the input size of each subsequent layer. Compared with other existing algorithms, it can improve the detection accuracy of small objects. For this kind of model structure, however, the number of network weights is large, much disk space is required, and the detection speed is slow. Therefore, it is not suitable for limited computing platforms and small-storage real-time detection systems.

Wei Liu (17) analyzed the SSD model structure and pointed out that the forward pass time is spent mainly in the base network (nearly 80%). Therefore, for real-time applications, using a faster base network can reduce the amount of computation and greatly improve the speed. ResNet (18), first proposed by Kaiming He, has proven to be an efficient network. Lili Chen (19) replaced the basic feature extraction network with ResNet-34 and obtained fast detection speed on a vehicle counting task. Note that, in our single former USV object detection system, it is unnecessary to use too many network layers for feature extraction. We therefore choose ResNet-18 as the basic feature extraction network, in order to obtain real-time detection performance.

The whole model structure of ResNet-18 comprises a convolutional layer, four basic block layers and a final fully connected layer, as shown in detail in Fig. 1. This structure avoids the gradient vanishing problem caused by deepening the neural network layers, and its efficiency is simultaneously improved thanks to the introduced basic blocks.

Fig. 1. The flowchart of ResNet-18

../../Resources/kiee/KIEE.2021.70.10.1488/fig1.png
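For illustration, a ResNet basic block of the kind shown in Fig. 1 can be sketched with tf.keras as below; this is an assumed reconstruction for readability, not the authors' implementation (which was built on TensorFlow 1.8):

import tensorflow as tf
from tensorflow.keras import layers

def basic_block(x, filters, stride=1):
    # Two 3x3 convolutions plus a shortcut connection (identity or 1x1 projection).
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, strides=1, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    if stride != 1 or shortcut.shape[-1] != filters:
        # Project the shortcut so that it matches the block output shape.
        shortcut = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    # The skip connection is what mitigates vanishing gradients in deep stacks.
    return layers.ReLU()(layers.add([y, shortcut]))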

In real-time object detection tasks, large kernels and an excessive number of convolution kernels increase the computational cost, dilute the effective features and reduce the real-time control accuracy. The authors of (20), (21) show that kernel sizes of 1×1 and 3×3 have fewer parameters but stronger feature generalization abilities than 5×5 and 7×7 kernels. In addition, a block of two convolutional layers with a 3×3 kernel plays the same role as one 5×5 convolutional layer as the convolutional window scans the input, so the same receptive field is covered. However, it results in fewer parameters, and the stacked convolutional layers yield a better result.
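For example, with $C$ input channels and $C$ output channels, two stacked 3×3 convolutional layers require $2×(3×3)×C^{2}=18C^{2}$ weights, whereas a single 5×5 layer requires $25C^{2}$ weights, while both cover the same 5×5 receptive field.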

Table 1. Parameters of M-SSD from FC6 to conv9_2 layer

Layer     Input size   Output size   Kernel size   Input channel   Output channel
FC6       38×38        19×19         3×3           256             512
FC7       19×19        19×19         1×1           512             512
conv6_1   19×19        10×10         1×1           512             256
conv6_2   19×19        10×10         3×3           128             256
conv7_1   10×10        5×5           1×1           256             128
conv7_2   10×10        5×5           3×3           64              128
conv8_1   5×5          3×3           1×1           128             128
conv8_2   5×5          3×3           3×3           64              64
conv9_1   3×3          1×1           3×3           128             128
conv9_2   3×3          1×1           1×1           64              64

Inspired by these methods from the literature, two modifications are made to the original SSD model: (a) we retain the SSD structure but discard VGG-16 and use ResNet-18 as the basic feature extraction network, followed by several convolutional layers to detect the object; (b) we replace the convolutional kernels from the FC6 to conv9_2 layers and use 1×1 convolutional kernels to classify the object. The layer specifications from FC6 to conv9_2 are presented in detail in Table 1, while the M-SSD model structure is presented in Fig. 2.

Fig. 2. The overall structure of the M-SSD

../../Resources/kiee/KIEE.2021.70.10.1488/fig2.png

In contrast to the SSD model, we choose the res3d, fc6, fc7, conv6_1, conv7_1, conv8_1 and conv9_1 layers as the regression feature map layers to classify the object. In each feature map layer, 1×1 represents the size of the convolutional kernel, 3 or 6 represents the number of prior boxes, and 4 represents the number of bounding box offset values.

Afterwards, the M-SSD model parameters are set for the proposed real-time detection system as follows:

Ⅰ. Select default box parameters: in a CNN, the feature maps located in different layers have receptive fields of different sizes. To correctly detect targets of different scales as they move, some algorithms convert the input image to different scales, process each converted image and fuse the detection results (22), (23). The strategy proposed in (24) is based on the fact that the default frame does not need to be mapped one-to-one onto the receptive field of the feature map.

The default frame at different positions corresponds to different regions and target sizes. Assuming that $m$ feature maps should be predicted, the default frame size in each feature map is calculated as:

(1)
$S_{i}=S_{\min}+\dfrac{S_{\max}-S_{\min}}{m-1}(i-1)$, $i\in[1,\: m]$

where $S_{\min}$ is the default frame size of the lowest layer having a value of 0.1 and $S_{\max}$ is the default frame size of the highest layer having a value of 0.96 in the network structure.

The scales of the different layers are spaced at regular intervals. The width-to-height ratio of the default frame is $a_{r}\in\{1,\: 2,\: 3,\: 1/2,\: 1/3\}$. The width and height of each default frame are respectively given by:

(2)
$w^{a}_{i}=S_{i}\sqrt{a_{r}}$, $h^{a}_{i}=S_{i}/\sqrt{a_{r}}$
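As a small numerical illustration of Eq. (1) and Eq. (2), the scales and default box dimensions can be computed as follows; the code is only a sketch, with the number of regression layers taken from the seven feature map layers listed above:

import math

S_MIN, S_MAX = 0.1, 0.96          # lowest- and highest-layer default frame sizes
M = 7                             # number of regression feature map layers (Section 3.1)

def default_boxes(i, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
    s_i = S_MIN + (S_MAX - S_MIN) / (M - 1) * (i - 1)        # Eq. (1)
    return [(s_i * math.sqrt(a_r), s_i / math.sqrt(a_r))     # Eq. (2): (width, height)
            for a_r in aspect_ratios]

for i in range(1, M + 1):
    widths_heights = default_boxes(i)
    print(f"layer {i}: first box (w, h) = {widths_heights[0]}")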

Ⅱ. Choose the matching strategy: when generating the M-SSD detection model, this strategy selects a default box to match each true label box. For each true label, it then finds the default box with the highest Jaccard overlap among all the candidate default boxes, according to the Jaccard overlap coefficient.

Ⅲ. Select the loss function: the softmax loss $l_{i}= -\log(e^{S_{y_i}}/\sum_{j} e^{S_{j}})$ is selected, where $S_{j}$ is the score of class $j$ and $y_{i}$ is the true label of the real object. The total loss function $L$ is then given by:

(3)
$L=\sum_{i=1}^{N}l_{i}$

where $N$ is the total number of images.

An objective function must be optimized during model training: the loss function is minimized until its value reaches the lowest point. The M-SSD training model is developed based on the TensorFlow deep learning framework.
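For clarity, the softmax loss of Eq. (3) can be written out explicitly; the NumPy sketch below only restates the formula, while the actual training is carried out in TensorFlow:

import numpy as np

def total_softmax_loss(scores, labels):
    # scores: (N, num_classes) raw class scores S_j; labels: (N,) true class indices y_i.
    shifted = scores - scores.max(axis=1, keepdims=True)              # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    l_i = -log_probs[np.arange(len(labels)), labels]                  # per-image loss l_i
    return l_i.sum()                                                  # L = sum_i l_i, Eq. (3)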

Based on this design, the algorithm complexity is reduced. The advantage of the proposed design will be shown in the following comparative analysis.

We use SSD for object detection because the SSD framework is designed to be independent of the base network and is used to accurately classify and locate targets. It can run on any base network (such as VGG, ResNet or MobileNet). Therefore, we can use different base networks for neural network learning and different numbers of regression layers (from 6 to 8) to estimate their accuracy. It is a very useful neural network framework for improving detection accuracy and speed. YOLO and its improved versions, YOLO v3 and YOLO v5, have been proposed for multiple-object detection. However, for real-time detection tasks performed on mobile terminals, the SSD framework is still a better choice, since its performance in terms of the combined consideration of accuracy and speed is particularly outstanding when it is used with a light network structure to detect objects.

3.2 M-SSD model training/testing

The next step consists in training/testing the proposed M-SSD model for object detection. The hardware specifications of the experimental environment are shown in Table 2. A CPU with 16 GB of RAM is used to train the M-SSD model, and the GPU greatly improves the training speed. Note that library functions from CUDA 10.0/CUDNN 8.0.0 and platforms such as Python 3.6/TensorFlow 1.8 are used to train the model quickly and effectively. The trained model runs on the Ubuntu 18.04 operating system, using a camera to capture real-time objects at a resolution of 1024×768.

Table 2. Hardware specification

Hardware device      Parameter
CPU                  Intel(R) Core(TM) i7-8750H
RAM                  16GB
GPU                  NVIDIA GeForce GTX1060
Operating system     Ubuntu 18.04
CUDA/CUDNN           CUDA 10.0/CUDNN 8.0.0
Platform             Python, TensorFlow
Camera               USB HD, resolution 1024×768

Table 3. The parameters initialization

Parameter       Value
base_lr         0.0001
max_iter        50000
lr_policy       step
gamma           0.1
momentum        0.9
weight_decay    0.0005
image_size      300×300
type            SGD
BN              32

An image database containing 2000 images was built. These images were collected under different external environments and illumination intensities, with a 3:1 ratio between positive images (containing the USV) and negative images (without the object). Part of the images was flipped, stretched or compressed to improve the generality of the data set. Accordingly, 80% of the images were used for training, and the remaining 20% were used for testing the network. In the base network, the images captured by the camera were resized to 300×300 before being input to the network model. The model is trained using stochastic gradient descent (SGD) with an initial learning rate (base_lr) of 0.0001, a momentum of 0.9, a weight decay of 0.0005 and a batch normalization (BN) batch size of 32. The network was trained for 50,000 iterations and successfully converged. The other parameters are detailed in Table 3.
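The optimizer settings of Table 3 can be expressed, for instance, with tf.keras as in the sketch below; the step size of the learning-rate policy is not reported here and is therefore an assumed placeholder, and the weight decay would be applied through per-layer regularizers:

import tensorflow as tf

BASE_LR, GAMMA = 1e-4, 0.1            # base_lr and gamma from Table 3
STEP_SIZE = 20000                     # decay step of the "step" lr_policy (assumed value)

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=BASE_LR,
    decay_steps=STEP_SIZE,
    decay_rate=GAMMA,
    staircase=True)                   # step-wise learning-rate decay

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
# weight_decay = 0.0005 can be added via tf.keras.regularizers.l2 on each convolutional layer.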

Part of the labeled images used for training/validation is illustrated in Fig. 3. The experiments were carried out in a pool area of Kyungnam University in South Korea. The training/validation accuracy of the proposed model is presented in Fig. 4. It can be seen that the classification accuracy reaches 96.75%. Some classification and accuracy results for successful detections are shown in Fig. 5.

To evaluate the performance of the proposed detection system, the following four evaluation criteria are used:

(4)
$precision=\dfrac{TP}{TP+FP}×100\%$

(5)
$recall=\dfrac{TP}{TP+FN}×100\%$

(6)
$accuracy=\left(1-\dfrac{a}{n}\right)×100\%$

(7)
$F1=\dfrac{2×precision×recall}{precision+recall}×100\%$

where $a$ and $n$ respectively represent the number of misclassified samples and the total number of samples, TP (true positive) refers to a positive sample which is predicted to be a correct result, FP (false positive) refers to a negative sample which is predicted to be a false alarm, FN (false negative) refers to a positive sample which is predicted to be a missed detection, and TN (true negative) refers to a negative sample which is predicted to be negative.
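A small helper that evaluates Eqs. (4)-(7) from the confusion counts is sketched below; it only restates the formulas, and the numbers reported in this section come from the experiments:

def detection_metrics(tp, fp, fn, misclassified, total):
    # tp, fp, fn: confusion counts; misclassified = a, total = n in Eq. (6).
    precision = tp / (tp + fp)                          # Eq. (4)
    recall = tp / (tp + fn)                             # Eq. (5)
    accuracy = 1 - misclassified / total                # Eq. (6)
    f1 = 2 * precision * recall / (precision + recall)  # Eq. (7)
    return tuple(100 * v for v in (precision, recall, accuracy, f1))   # as percentages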

Fig. 3. Part of the images used for training

../../Resources/kiee/KIEE.2021.70.10.1488/fig3.png

The proposed M-SSD model is compared with SSD (10), R-SSD (24) and F-SSD (18), using the four previously mentioned parameters: precision, recall, accuracy and F1. The results are shown in Fig. 6. It can be observed that the proposed M-SSD model yields a higher detection performance than SSD and can reach an accuracy of 96.75%. This is because ResNet-18, which has a stronger feature-extracting residual structure, is used to extract the basic feature information. However, M-SSD has a lower detection performance than R-SSD and F-SSD, because the proposed model has fewer layers than R-SSD with ResNet-50 and F-SSD with ResNet-34. This conversely confirms that higher accuracy requires deeper network layers. However, this does not mean that higher accuracy implies a better detection system. The computation time, given in Table 4, is another parameter for performance estimation. It can be seen from Table 4 that the computation time of the proposed M-SSD model is 424.36 s, which is 26.35% less than that of the SSD model and much less than those of R-SSD and F-SSD. The proposed design thus improves both the detection performance and the detection speed. It can also be implemented on mobile terminals, such as the Raspberry Pi and the Jetson Nano.

Fig. 4. Accuracy results of the proposed model

../../Resources/kiee/KIEE.2021.70.10.1488/fig4.png

Fig. 5. Output of the M-SSD testing

../../Resources/kiee/KIEE.2021.70.10.1488/fig5.png

For our collected USV data set, the FPS of SSD is about 67 at an input resolution of 300×300, and the FPS of our proposed M-SSD model is about 86 at the same input resolution. When we deploy the trained model on the mobile terminal Jetson Nano, the FPS of our proposed model is about 32, which achieves real-time former USV detection.

Fig. 6. Performance comparison of different models

../../Resources/kiee/KIEE.2021.70.10.1488/fig6.png

Table 4. Computation time of the methods (s)

Method      Basic network   Time
SSD (10)    VGG-16          576.25
R-SSD (24)  ResNet-50       824.36
F-SSD (18)  ResNet-34       720.64
M-SSD       ResNet-18       424.36

Part of the failure detection images is shown in Fig. 7. It can be observed that indistinct object characteristics and sharp changes of the ambient light around the detected object may cause detection failures. Another labeled image data set is used to verify our conjecture and to train a higher accuracy model for further studies. This collected data set mainly comprises images that we previously failed to detect, as well as images collected in similar environments. Part of the new data set is shown in Fig. 8. The re-training loss for the new data set is presented in Fig. 9. It can be seen from Fig. 9 that the training loss remains slightly high during the re-training process; we are not able to obtain a better training loss after 50,000 iterations. This is because the basic network structure cannot extract enough features of the USV object from an unclear-feature data set to train the model, which leads to poor object classification.

In summary, for blurred or unclear images, the network cannot learn enough features and the loss function cannot converge to zero. It is concluded that images with clear features are required to train the model; the network model can then achieve good accuracy. For former object detection, this paper obtains higher detection accuracy and faster speed than the original SSD model by replacing the basic network VGG-16 with ResNet-18 and utilizing 1×1 convolutional kernels to return 6 feature maps. Although there is no significant improvement in accuracy, the computation time is reduced by 26.35% compared with the original SSD structure. In addition, it has the advantage that it can be used on platforms with limited computing capability.

Fig. 7. Part of the failure detection images

../../Resources/kiee/KIEE.2021.70.10.1488/fig7.png

Fig. 8. New data set for training images

../../Resources/kiee/KIEE.2021.70.10.1488/fig8.png

Fig. 9. The re-train loss for the new data set

../../Resources/kiee/KIEE.2021.70.10.1488/fig9.png

4. Former USV measurement

In the proposed detection system, the former USV is first detected in the image. Afterwards, we need to measure the distance and orientation of the former object. Once the object is detected in an image (cf. Fig. 10), the network returns a bounding box over the object that contains the object position.

The four coordinate pairs of the object position, namely ($x_{\min}$, $y_{\min}$), ($x_{\min}$, $y_{\max}$), ($x_{\max}$, $y_{\min}$), and ($x_{\max}$, $y_{\max}$), are shown in Fig. 10. The object center is obtained from these values as:

(8)
$x=\left(\dfrac{x_{\min}+x_{\max}}{2}\right)×\mathrm{IM\_WIDTH}$

(9)
$y=\left(\dfrac{y_{\min}+y_{\max}}{2}\right)×\mathrm{IM\_HEIGHT}$

where IM_WIDTH and IM_HEIGHT represent the image width and height, respectively.
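In code, Eqs. (8)-(9) amount to the following; the detector outputs normalized corner coordinates, and the camera resolution is the 1024×768 used in this work:

IM_WIDTH, IM_HEIGHT = 1024, 768                   # camera resolution used in this work

def object_center(x_min, x_max, y_min, y_max):
    # The coordinates returned by the detector are normalized to [0, 1].
    x = (x_min + x_max) / 2 * IM_WIDTH            # Eq. (8)
    y = (y_min + y_max) / 2 * IM_HEIGHT           # Eq. (9)
    return x, y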

As for the monocular camera (25), we assume that the optical axis of the camera intersects the image (phase) plane at the principal point. Furthermore, the pixel dimensions along the $x$ and $y$ axes of the phase plane are $dx$ and $dy$, respectively. $(u,\: v)$ is the pixel coordinate of the target on the image plane, while $(u_{0},\: v_{0})$ is the coordinate of the image center. The distance $d$ between the object and the camera plane is calculated as:

(10)
$d=h/\tan\left(\arctan\left((y-y_{0})/f\right)+\alpha\right)$

(11)
$u=u_{0}+x/dx$, $v=v_{0}+y/dy$

For simplicity, we assume that the center of the image plane $(x_{0},\: y_{0})$ satisfies $x_{0}=y_{0}=0$. Using Eqs. (10) and (11) and some simple calculations, we conclude that:

(12)
$d=h/\tan\left(\arctan\left((v-v_{0})/a_{y}\right)+\alpha\right)$

(13)
$a_{y}=f/dy$

where $f,\:\alpha$ and $h$ represent the focal length, tilt angle and optical center height of the camera, respectively. The distance is then obtained.
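A numerical sketch of the distance computation in Eqs. (12)-(13) is given below; the focal length, pixel height, tilt angle, optical center height and principal point row are placeholder calibration values, not the values used in the experiments:

import math

F, DY = 4e-3, 4.65e-6            # focal length (m) and pixel height (m), assumed
ALPHA = math.radians(5.0)        # camera tilt angle, assumed
H = 0.5                          # optical center height of the camera (m), assumed
V0 = 768 / 2                     # principal point row, assumed at the image center

def distance_to_target(v):
    # v: vertical pixel coordinate of the target center in the image.
    a_y = F / DY                                               # Eq. (13)
    return H / math.tan(math.atan((v - V0) / a_y) + ALPHA)     # Eq. (12)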

The principle of the camera is shown in Fig. 11. Using orientation measurement (26), we obtain:

(14)
$\beta =(F_{c}/\omega)\beta^{I}\cos\phi$

where $F_{c}$ and $\omega$ represent the camera wide angle and the camera image width, respectively. $\beta^{I}$ and $\phi$ denote the pixel distance from the target to the centerline of the image plane and the inclination of the phase plane in the horizontal direction, respectively.

The relative deflection orientation of the target ($\beta$), in the direction perpendicular to the camera plane, can be obtained using Eq. (14). The flowchart of the proposed M-SSD-based control system is presented in Fig. 13. According to the above design and calculations, the detection results obtained by the proposed algorithm are shown in Fig. 12, with the detected class accuracy as well as the distance and azimuth information.
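Similarly, the relative bearing of Eq. (14) follows directly from the pixel offset of the target; the wide angle and phase-plane inclination below are placeholder values, not the calibration used in the experiments:

import math

F_C = math.radians(60.0)         # camera wide (field-of-view) angle, assumed
OMEGA = 1024                     # camera image width in pixels
PHI = 0.0                        # inclination of the phase plane, assumed level

def relative_bearing(beta_i):
    # beta_i: pixel distance from the target to the centerline of the image plane.
    return (F_C / OMEGA) * beta_i * math.cos(PHI)              # Eq. (14), in radians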

Fig. 10. Object location with a boundary box

../../Resources/kiee/KIEE.2021.70.10.1488/fig10.png

Fig. 11. The principle of the monocular camera

../../Resources/kiee/KIEE.2021.70.10.1488/fig11.png

High-precision instruments (a Lidar and a gyroscope) are used to evaluate the accuracy of the distance and angle measured by the proposed algorithm. The values measured separately by these instruments are compared with the information detected by the monocular camera.

It can be seen from Table 5 that the error is less than 1.6% compared with the true values. Table 6 shows a comparison of the orientation values between the gyroscope (Gyc) and the camera (Cam). It can be observed that the error remains below 1.2%, which is acceptable in real-time applications. It can also be seen that the error increases as the distance increases.

Fig. 12. Image of real-time distance and angle

../../Resources/kiee/KIEE.2021.70.10.1488/fig12.png

Fig. 13. M-SSD-based control system flowchart

../../Resources/kiee/KIEE.2021.70.10.1488/fig13.png

Table 5. Distance value (m) compared with Lidar

Order    1      2      3      4      5      6      7      8
Lidar    4.0    4.5    4.8    5.0    5.2    5.5    5.8    6.0
Cam      4.02   4.51   4.78   4.97   5.16   5.44   5.75   5.90

Table 6. Orientation value compared with gyroscope

Order    1        2        3        4        5        6        7        8
Gyc      15°18′   18°42′   20°24′   25°12′   30°24′   33°48′   37°24′   40°12′
Cam      15°08′   18°27′   20°04′   24°40′   30°02′   33°12′   36°50′   39°26′

5. Conclusion

In this paper, an improved Single Shot MultiBox Detector for former object detection, together with distance and orientation measurement, is proposed. After training and validation, the modified model detects the former USV with a faster detection rate and a higher detection performance than the original SSD model. The average accuracy of USV object detection reaches 96.75%, which is sufficiently robust for later tracking tasks. Simultaneously, the monocular camera installed on the tracking system calculates the distance and orientation information of the former USV in real time. As a result, the relative location information error of the former object is less than 3% compared with the true values obtained through experiments, which fully meets the requirements of the tracking system design. In future work, an analysis of the tracking system based on the theoretical results obtained herein is required. This would be beneficial, since the design of target detection and tracking systems is conducive to the future development of the marine environment.

Acknowledgements

This work was supported by Kyungnam University Foundation Grant, 2021.

References

1. P. Goel, S. Agarwal, 2012, Hybrid Approach of Haar Cascade Classifiers and Geometrical Properties of Facial Features Applied to Illumination Invariant Gender Classification System, 2012 International Conference on Computing Sciences, pp. 132-136.
2. G. B. Li, H. F. Wu, 2011, Weighted fragments-based mean shift tracking using color-texture histogram, Journal of Computer-Aided Design and Computer Graphics, Vol. 12, No. 12, pp. 2059-2066.
3. D. Exner, E. Bruns, D. Kurz, A. Grundhofer, O. Bimber, 2010, Fast and robust CAMShift tracking, Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 9-16.
4. G. Bishop, G. Welch, 2010, An introduction to the Kalman filter, Proc. of SIGGRAPH, Course, Vol. 8, No. 41, pp. 27599-23175.
5. K. Nummiaro, E. Koller-Meier, G. L. Van, 2003, An adaptive color-based particle filter, Image and Vision Computing, Vol. 21, No. 1, pp. 99-110.
6. J. Fan, W. Xu, Y. Wu, Y. Gong, 2010, Human Tracking Using Convolutional Neural Networks, IEEE Transactions on Neural Networks, Vol. 21, No. 10, pp. 1610-1623.
7. J. Zhu, Y. Lao, Y. F. Zheng, 2010, Object Tracking in Structured Environments for Video Surveillance Applications, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, No. 2, pp. 223-235.
8. D. Koller, J. Weber, J. Malik, 1994, Robust multiple car tracking with occlusion reasoning, Proc. Third European Conference on Computer Vision, pp. 189-196.
9. L. Vasu, D. M. Chandler, 2010, Vehicle tracking using a human-vision-based model of visual similarity, 2010 IEEE Southwest Symposium on Image Analysis & Interpretation (SSIAI), pp. 37-40.
10. P. N. Druzhkov, V. D. Kustikova, 2016, A survey of deep learning methods and software tools for image classification and object detection, Pattern Recognition and Image Analysis, Vol. 26, No. 1, pp. 9-15.
11. A. Krizhevsky, I. Sutskever, G. E. Hinton, 2012, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems, Vol. 25, pp. 1097-1105.
12. R. Girshick, J. Donahue, T. Darrell, J. Malik, 2014, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580-587.
13. R. Girshick, 2015, Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision, pp. 1440-1448.
14. S. Ren, K. He, R. Girshick, J. Sun, 2017, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell., Vol. 39, No. 6, pp. 1137-1149.
15. J. Dai, Y. Li, K. He, J. Sun, 2016, R-FCN: Object detection via region-based fully convolutional networks, in Proc. NIPS, pp. 379-387.
16. J. Redmon, S. Divvala, R. Girshick, A. Farhadi, 2016, You only look once: Unified, real-time object detection, in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 779-788.
17. W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, 2016, SSD: Single Shot MultiBox Detector, in Proc. ECCV, pp. 21-37.
18. K. He, X. Zhang, S. Ren, J. Sun, 2016, Deep residual learning for image recognition, IEEE Conf. on Computer Vision and Pattern Recognition, pp. 770-778.
19. L. Chen, Z. Zhang, L. Peng, 2018, Fast single shot multibox detector and its application on vehicle counting system, IET Intelligent Transport Systems, Vol. 12, No. 10, pp. 1406-1413.
20. L. Gao, P. Chen, S. Yu, 2016, Demonstration of convolution kernel operation on resistive cross-point array, IEEE Electron Device Letters, Vol. 37, No. 7, pp. 870-873.
21. S. Ozturk, U. Ozkaya, B. Akdemir, L. Seyfi, 2018, Convolution kernel size effect on convolutional neural networks in histopathological image processing applications, 2018 International Symposium on Fundamentals of Electrical Engineering (ISFEE), pp. 1-5.
22. P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, 2014, OverFeat: Integrated recognition, localization and detection using convolutional networks, in Proc. ICLR.
23. K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun, 2015, Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 37, No. 9, pp. 1904-1916.
24. W. Pei, Y. M. Xu, Y. Y. Zhu, P. Q. Wang, M. Y. Lu, F. Li, 2019, The target detection method of aerial photography images with improved SSD, Ruan Jian Xue Bao/Journal of Software, Vol. 30, No. 3, pp. 738-758.
25. J. W. Chu, L. S. Ji, L. Guo, B. B. Li, R. B. Wang, 2004, Study on method of detecting preceding vehicle based on monocular camera, IEEE Intelligent Vehicles Symposium, pp. 750-755.
26. J. Park, Y. Cho, B. Yoo, J. Kim, 2015, Autonomous collision avoidance for unmanned surface ships using onboard monocular vision, OCEANS 2015 MTS/IEEE Washington, pp. 1-6.

Authors

Bangqian Ao
../../Resources/kiee/KIEE.2021.70.10.1488/au1.png

He received his B.S. degree in Department of Mathematics and Computational Science from Xiangtan University, Xiangtan, Hunan, China, in 2008, and his M.S. degree in Department of Electronic Technology from Central South University, Changsha, Hunan, China, in 2011.

He is currently studying for a PhD degree at Kyungnam University.

His research interests include computer vision and intelligent control technology.

김동헌(Dong Hun Kim)
../../Resources/kiee/KIEE.2021.70.10.1488/au2.png

He received his BS, MS and PhD degrees from the Department of Electrical Engineering, Hanyang University, Korea, in 1995, 1997 and 2001, respectively.

From 2001 to 2003, he was a Research Associate under several grants in the Department of Electrical and Computer Engineering, Duke University, NC, USA.

In 2003, he joined Boston University, MA, USA as Visiting Assistant Professor under several grants in the Department of Aerospace and Mechanical Engineering.

In 2004, he was engaged in Post-doctoral Research at the School of Information Science and Technology, University of Tokyo, Japan.

Currently, he is a Professor with the Division of Electronic and Electrical Engineering, Kyungnam University, South Korea.

His research interests include swarm intelligence, self-organization of swarm system, mobile robot path planning, decentralized control of autonomous vehicles, intelligent control and adaptive non-linear control.