Jarhinbek RASOL, Yuelei XU, Qing ZHOU, Tian HUI, Zhaoxiang ZHANG
Unmanned System Research Institute, Northwestern Polytechnical University, Xi’an 710072, China
KEYWORDS Autonomous aerial refueling; N-fold Bernoulli probability theorem; Object detection; Object tracking; YOLOv4
Abstract Recently, deep learning has been widely utilized for object tracking tasks. However, deep learning encounters limits in tasks such as Autonomous Aerial Refueling (AAR), where the target object can vary substantially in size and high-precision, real-time performance is required on embedded systems. This paper presents a novel embedded adaptive single-object tracking framework based on an improved YOLOv4 detection approach and the n-fold Bernoulli probability theorem. First, an Asymmetric Convolutional Network (ACNet) and dense blocks are combined with the YOLOv4 architecture to detect small objects with high precision when similar objects are in the background. Prior object information, such as its location in the previous frame and its speed, is utilized to adaptively track objects of various sizes. Moreover, based on the n-fold Bernoulli probability theorem, we develop a filter that uses statistical laws to reduce the false positive rate of object tracking. To evaluate the efficiency of our algorithm, a new AAR dataset is collected, and extensive AAR detection and tracking experiments are performed. The results demonstrate that our improved detection algorithm outperforms the original YOLOv4 algorithm on small and similar object detection tasks, and that the object tracking algorithm outperforms state-of-the-art object tracking algorithms on refueling drogue tracking tasks.
The use of Autonomous Aerial Refueling (AAR) can increase the range of an Unmanned Aerial Vehicle (UAV) [1]. It has high application value in military and civilian fields. There are two main types of aerial refueling: Flying Boom Refueling (FBR) and Probe-Drogue Refueling (PDR). Compared with FBR, PDR meets the requirements of UAVs, such as high flexibility, high safety, and simplicity. Thus, PDR is more suitable than FBR for unmanned aerial systems, and this paper is oriented to the PDR-based AAR task.
For the AAR task, the refueling drogue must first be detected correctly. An object tracking algorithm is then required to raise the processing speed to real-time rates. Generally, object tracking tasks involve image processing, machine learning, and optimization, and they are the premise and foundation of the AAR task [2,3].
Methods for refueling drogue detection and tracking can be divided into two types: active vision methods and passive vision methods [4]. The former use artificial features such as painted markers [5-8] and light emitting diode (LED) beacons [9,10] to detect and track refueling drogues. Although the active vision method has the advantages of high speed and high reliability, its disadvantages are also obvious. It requires modifying the tanker, for example by installing LEDs, which need additional power, and such modifications are apt to malfunction in real-world situations. The passive vision method mainly relies on intrinsic visual features to detect refueling drogues. Passive vision methods can be divided into two types: non-deep learning based tracking algorithms [11-13] and deep learning based tracking algorithms [11,14-16]. Although the non-deep learning based algorithms are fast and can satisfy the real-time requirement, their robustness is not sufficient for real-world tasks.
Deep learning based object tracking algorithms use the strong representational ability of deep networks to track objects, and many such algorithms have been proposed to improve tracking performance. The MDNet [17] tracking algorithm designs a small network to learn features and uses softmax [18] to classify samples; it achieves high performance, but its speed is only one Frame Per Second (FPS). The SiamFC algorithm [19] uses a Siamese network to train a similarity function offline, which can be used to select the most similar candidate in each frame of a video. The Siamese instance search network algorithm [20] directly uses the Siamese network to learn the matching function between object and candidate templates. It uses only the object in the first frame as a template during the tracking step. Building on the Siamese network, a subsequent approach uses a region proposal network to directly estimate object scales [21], which improves tracking performance and efficiency. The DaSiamRPN [22] tracker introduces a distractor-aware model to increase accuracy further. Furthermore, SiamRPN++ [23] uses ResNet and Inception to replace the baseline network, which increases the robustness of the tracking algorithm.
Although all of these object tracking algorithms can be used to track a refueling drogue, they cannot simultaneously handle the conditions encountered in the refueling drogue tracking task, in which (A) the target object has a large range of scales, (B) the target object is sometimes too small, and (C) high precision with real-time operating speed is needed in an embedded system. To address these problems, we develop an adaptive tracking algorithm, as shown in Fig. 1. We improve YOLOv4 [24] by introducing the dense block and ACNet [25], which enhances the ability to detect small objects and increases the precision with no computational complexity added in the inference stage. We also use the n-fold Bernoulli probability theorem to design a filter that can screen out false positive samples when tracking an object. When this filter is combined with the improved YOLOv4, the detection precision is high enough to satisfy the AAR task. Finally, we develop a tracking-by-detection algorithm based on this improved YOLOv4 and the filter mentioned above for visually tracking refueling drogues in AAR tasks. The main contributions of this article are as follows:
(1) We improve YOLOv4 by (A) replacing the residual block with a dense block to increase the feature extraction capability, which enhances the detection ability on small objects, and (B) using the ACNet [25] theory to enrich the network’s receptive field by placing different-sized kernels parallel to the 3 × 3 kernel. This increases the precision when similar objects are in the background, but does not increase the computational burden in the inference stage.
(2) We propose an adaptive tracking-by-detection algorithm based on the improved YOLOv4 for the AAR task. The proposed algorithm utilizes prior knowledge of the object, such as its location, moving speed, and direction, to adaptively track the object, which may vary in size, at the real-time operating speed needed in an embedded system.
(3) Based on the n-fold Bernoulli probability theorem, we develop a filter that synthesizes the previous information of the object to decrease the false positive rate, which can increase precision.
In Fig. 1, the dashed rectangles correspond to our contributions. The dashed arrow line represents the improvement that we apply to the YOLOv4 algorithm.
The remainder of this paper is organized as follows: Section 2 introduces the improvement applied to YOLOv4. The object tracking algorithm that we design is described in Section 3. Section 4 presents the experimental results and analysis, and finally, Section 5 presents concluding remarks.
YOLOv4 is an object detection algorithm based on the YOLO algorithm. It uses the cross-stage partial 53-layer Darknet (CSPDarknet53) as the backbone network for feature extraction. CSPDarknet53 enriches the network gradient combination by introducing Cross-Stage Partial connections (CSPs) while reducing the amount of calculation. Therefore, YOLOv4 has better performance and is faster than other object detection algorithms.
CSPDarknet53 in YOLOv4 is constructed from several CSP-n blocks, and a CSP-n block mainly consists of n residual blocks that connect earlier and later layers by simply adding their feature maps together. Although this structure can make the training process easier, it limits the feature extraction capability of the network by assigning the same weights to all feature maps added together in one block, which consequently restricts the network’s detection capability on small objects. On the other hand, the dense block [26] assigns different weights to the different feature maps concatenated together in one block, enhancing the network’s feature extraction ability.
Fig. 1 Aerial refueling drogue tracking algorithm.
Therefore, in this paper, we replace every residual block in all CSP-n blocks with dense blocks to improve the information flow between layers, as shown in Fig. 2 (in Fig. 2, we take the CSP-2 block for illustration). The dense block consists of a 3 × 3 convolutional layer followed by a 1 × 1 convolutional layer. In each dense block (×n), every 3 × 3 convolutional layer directly connects to every 1 × 1 convolutional layer in the following blocks by concatenation, as shown in the solid-line box in Fig. 2. The 1 × 1 convolutional layers are used to merge the feature maps produced by each 3 × 3 convolutional layer of the preceding blocks.
Fig. 2 Application of dense block.
The CSPDarknet53 structure has proven to have powerful feature extraction capabilities. Using dense blocks to replace the residual blocks does not change the overall structure; only the pointwise addition operator needs to be replaced with the concatenation operator. The dense block has a more powerful feature representation ability than the residual block. Therefore, using the dense block to replace the residual block can improve the feature extraction ability of CSPDarknet53.
In Fig. 2, CBA represents convolutional, normalization, and activation layers connected in sequence. The “-n” is the number of dense blocks or residual blocks, e.g., in Fig. 2, n is set to 2. “s1” and “s2” represent the stride values of the convolutional layer.
For an object detection algorithm, the receptive field heavily influences the semantic features that the algorithm extracts, which directly affects the detection accuracy when there is a similar object in the background and the object has a wide range of scales. The YOLOv4 network consists of only 3 × 3 and 1 × 1 convolutional layers, which limits the receptive field to a square shape. Consequently, this restricts the representation of semantic features. To address this problem, the receptive field must cover more scales. Most methods do this by introducing architectural changes to the network, which introduce an additional computational burden at the inference stage. Although our strategy also changes the network architecture, by introducing the theory of ACNet, the additional computational burden exists only in the training process, not in the inference process. Therefore, there is no extra computational burden on the deployed network.
The main theory of ACNet is as follows: if several 2D kernels of compatible sizes operate on the same input with the same stride to produce outputs of the same size, and their outputs are summed, then the same output can be obtained by adding these kernels at the corresponding positions and applying the single resulting kernel. For convenience, we refer to this property as the Additivity Property of Convolution (APC). Compatibility means that the smaller kernel can be patched into the larger kernel, e.g., the 1 × 3, 1 × 1, and 3 × 1 kernels are compatible with 3 × 3 kernels. Formally, if condition (1) is satisfied on layers q and p, transformation (2) can be applied to layers q and p.
where M^(q) and M^(p) are the inputs of the q-th and p-th layers, respectively; Y and X are the dimensions of the kernel; D is the filter number of the convolutional layer; E is a matrix; K^(1) and K^(2) are two compatible 2D kernels; and * and ⊕ are the convolution operation and the elementwise addition of the kernel parameters at the corresponding positions, respectively.
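The additivity property can be checked numerically. The following is a minimal sketch in plain Python, assuming stride 1, zero padding, and a single channel; the helper names (`conv2d_same`, `embed_in_square`, `add_kernels`) and the kernel values are hypothetical and chosen only for illustration.

```python
def conv2d_same(image, kernel):
    """Cross-correlate a 2D list with an odd-sized kernel (stride 1, zero padding)."""
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def embed_in_square(kernel, size=3):
    """Patch a smaller odd-sized kernel into the centre of a size x size kernel."""
    kh, kw = len(kernel), len(kernel[0])
    top, left = (size - kh) // 2, (size - kw) // 2
    square = [[0.0] * size for _ in range(size)]
    for i in range(kh):
        for j in range(kw):
            square[top + i][left + j] = kernel[i][j]
    return square


def add_kernels(*kernels):
    """Elementwise sum of equally sized kernels (the ⊕ operation)."""
    size = len(kernels[0])
    return [[sum(k[i][j] for k in kernels) for j in range(size)] for i in range(size)]


image = [[1, 2, 0, 3], [4, 1, 1, 0], [2, 5, 3, 1], [0, 1, 2, 4]]
k_square = [[1, 0, -1], [2, 1, 0], [0, -1, 1]]   # 3 x 3 branch
k_horizontal = [[1, 2, 3]]                       # 1 x 3 branch
k_vertical = [[-1], [0], [2]]                    # 3 x 1 branch

# Three parallel branches whose outputs are summed.
branches = [conv2d_same(image, k) for k in (k_square, k_horizontal, k_vertical)]
summed = [[sum(b[r][c] for b in branches) for c in range(4)] for r in range(4)]

# One fused 3 x 3 kernel applied once.
fused_kernel = add_kernels(
    k_square, embed_in_square(k_horizontal), embed_in_square(k_vertical)
)
fused = conv2d_same(image, fused_kernel)

max_difference = max(
    abs(summed[r][c] - fused[r][c]) for r in range(4) for c in range(4)
)
```

In this sketch `max_difference` is zero up to floating-point error, so the three-branch output and the single fused-kernel output coincide.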
Usually, networks adopt a batch normalization layer to accelerate the training process and enhance the feature representational power. To apply the APC to networks that use batch normalization after the convolutional layer, ACNet fuses Batch Normalization (BN) into the convolutional kernel parameters. Afterward, the kernels fused with BN can be directly fused into one square kernel using the APC.
To widen the range of receptive field scales of the network, during the training stage we change each 3 × 3 convolutional layer and the corresponding BN layer in YOLOv4 into 3 × 3, 3 × 1, and 1 × 3 convolutional layers with corresponding BN layers, as shown in the purple dashed box on the left-hand side of Fig. 3. The three branches have identical numbers of filters, output resolutions, and numbers of output channels. Then, we fuse the BN in each branch into its convolutional kernel using the ACNet BN fusion theory. Next, the APC is applied to fuse the branches into one standard convolutional layer by adding the three kernels at the corresponding positions of the square kernel. Finally, we use the fused weights to initialize the original network, i.e., the YOLOv4 in which the residual blocks have been replaced by dense blocks as described in the previous section, as shown in the right part of Fig. 3. Note that this original network has no BN after each 3 × 3 convolutional layer, because the BN has been fused into the parameters of each 3 × 3 convolutional kernel.
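The BN fusion step can be illustrated for a single output channel. This is a sketch under the usual inference-time BN definition, y = γ(x − μ)/√(σ² + ε) + β; the function names and the numeric values are hypothetical, and a convolution at one spatial position is written as a dot product between a patch and the kernel.

```python
import math


def fuse_bn(kernel, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into the kernel weights and a bias term."""
    scale = gamma / math.sqrt(var + eps)
    fused_kernel = [[w * scale for w in row] for row in kernel]
    fused_bias = beta - mean * scale
    return fused_kernel, fused_bias


def apply_kernel(patch, kernel, bias=0.0):
    """Convolution output at one position: dot product of patch and kernel, plus bias."""
    total = sum(p * w for p_row, k_row in zip(patch, kernel) for p, w in zip(p_row, k_row))
    return total + bias


patch = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[0.5, -1.0, 0.0], [1.0, 2.0, -0.5], [0.0, 1.0, 0.25]]
gamma, beta, mean, var, eps = 1.5, 0.2, 3.0, 4.0, 1e-5

# Reference: convolution followed by batch normalization.
conv_out = apply_kernel(patch, kernel)
reference = gamma * (conv_out - mean) / math.sqrt(var + eps) + beta

# Fused: a single convolution with rescaled weights and a bias.
fused_kernel, fused_bias = fuse_bn(kernel, gamma, beta, mean, var, eps)
fused_output = apply_kernel(patch, fused_kernel, fused_bias)
```

When several BN-fused branches are subsequently merged with the APC, their biases are simply summed, since each branch contributes an additive constant.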
By using ACNet, we expand the scale of the receptive field to a certain extent, thereby improving the detection accuracy of the model. Compared to other methods, this method does not introduce additional operations during the inference stage.
If the object to be tracked is known in advance, we can use a modified object detection algorithm to track it. Using an object detection algorithm to track a specific object is more reliable than using a generic object tracking algorithm because the detection algorithm has prior knowledge about the object. However, deep learning based object detection algorithms have very high complexity, which limits their use for object tracking on an embedded device. To address this issue, we develop a tracking-by-detection algorithm based on YOLOv4, which has higher precision and more robustness than an arbitrary-object tracking algorithm. Moreover, it has a high inference speed that makes it usable on an embedded device.
In YOLOv4, the input can have different sizes during online use. Decreasing the input size can significantly accelerate the inference speed of the algorithm. For example, given that the original input size is H×W and the decreased input size is H'×W', the computational complexity decreases as follows:
where R denotes the ratio by which the complexity decreases. Usually, the input shape of a network is square. Let I denote the original input size of a network, and I' denote the decreased input size of the network. Eq. (3) is then transformed into Eq. (4).
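As a rough illustration of this scaling, the sketch below assumes that the cost of a fully convolutional network grows in proportion to the number of input pixels, so that for square inputs the ratio is (I / I')². Both this form of R and the function name are assumptions made for illustration; the sizes 608 and 128 are the full-frame and tracking input sizes used elsewhere in this paper.

```python
def complexity_ratio(original_size, reduced_size):
    """Assumed reduction factor in computation for square inputs: (I / I')**2."""
    return (original_size / reduced_size) ** 2


# Going from a 608 x 608 input to a 128 x 128 input.
ratio = complexity_ratio(608, 128)
print(f"approximate reduction in computation: {ratio:.2f}x")
```

Under this assumption, the 128 × 128 tracking input costs roughly 1/22 of a full 608 × 608 detection pass.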
Fig. 3 Application of ACNet.
To benefit from this property in real-time applications, we need to greatly reduce the input size. However, if we carelessly reduce the input size, an object that is not large enough in the original image becomes too small to be detected. If we know the object’s approximate location in an image, we can instead crop the image around the object so that the cropped area is much smaller than the original image. The object detection algorithm itself can provide the object location in the first frame. In subsequent frames, the object’s approximate location is given by the previous frame, because the object moves little between two adjacent frames. Based on this, we crop a larger area around the bounding box from the previous frame and feed it into the network to detect the object in the cropped image. Note that the object tracking algorithm and the object detection algorithm are identical except for the input size. Thus, the network with the smaller input size has lower computational complexity, as shown in Eq. (4).
Here, the input size is not equal to the cropping size. The input size of YOLOv4 should be a multiple of 32, and when used to track the refueling drogue, the input size of YOLOv4 needs to be sufficiently small for real-time application. The cropping size needs to be chosen to ensure that (A) the size of the object is sufficiently large for the network to detect, and that (B) the object is in the cropped image. Condition (A) is called the upper bound condition because, for an object of a given size, a larger cropped image corresponds to a smaller object in the input image (because cropped images are resized to the same input size). Condition (B) is called the lower bound condition because a smaller cropping size corresponds to a narrower field of view. Therefore, we need to determine the upper and lower bounds of the cropping size. For clarity, we illustrate some of the variables in Fig. 4.
(1) Upper bound calculation
The object detection network cannot detect objects that are too small in the image. If the object’s size in the image is smaller than a limit size, the network is blind to that object. Let S_L denote the limit size, S_ob the object’s original size in an image, and W_C and H_C the cropped width and height, respectively. Then, to ensure that the object size in the input image is larger than the limit size, W_C and H_C must satisfy the upper bound condition of Eq. (6).
By satisfying Eq. (6), the object size in the input image becomes larger than S_L, which ensures that the detection algorithm can detect the tracked object. Usually, the cropped image is larger than the input size of the network, so the cropped image must be resized to match the input size of the network. Therefore, if S_C is larger than this upper bound, the object size in the input image after resizing will be smaller than the limit size.
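The reasoning behind the upper bound can be sketched numerically. The snippet below assumes a square crop of side S_C and linear scaling when the crop is resized to the input size I', so the object appears with size S_ob × I' / S_C in the network input. The closed form derived from that assumption, and the function names, are illustrative and are not a reproduction of Eq. (6).

```python
def resized_object_size(object_size, crop_size, input_size):
    """Object size seen by the network after the crop is resized to the input size."""
    return object_size * input_size / crop_size


def max_crop_size(object_size, input_size, limit_size):
    """Largest crop side for which the resized object still reaches the limit size."""
    return object_size * input_size / limit_size


# Example: a 52-pixel object, a 128-pixel network input, a 26-pixel limit size.
upper = max_crop_size(52, 128, 26)
at_bound = resized_object_size(52, upper, 128)
beyond_bound = resized_object_size(52, 2 * upper, 128)
```

Cropping at the bound leaves the object exactly at the limit size; doubling the crop beyond the bound halves the object size in the input and pushes it below the limit.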
(2) Lower bound calculation
Algorithm 1. Adaptive tracking
1. Initialize r to 1
2. for every frame
3.     for index, ξ_index in ξ
4.         if ξ_{index+1} > S_ob × r > ξ_index
5.             e = index + 1
6.             break
7.         else
8.             continue
9.         end
10.     end
11.     S_C = I × 2^e
12.     if S_C does not satisfy Eq. (6)
13.         return S_C to the previous value
14.     end
15.     if this is not the first frame
16.         let d ≈ L_i - L_{i+1}
17.         v_cam = d × f
18.     end
19.     if S_C does not satisfy Eq. (9)
20.         let S_C = 2 × v_cam / f
21.     end
22.     r = I' / S_C
23.     resize cropped image to I' × I'
24.     execute detection
25.     update S_ob, L_i
26. end
Fig. 4 Illustration of variables.
Given that the size of the object to be tracked is S^R_ob in the moving direction and that its moving speed is v (relative to the camera), we can calculate the required field of view l; the resulting lower bound on the cropping size is the condition referred to as Eq. (9).
In Algorithm 1, the threshold list ξ is sorted from smallest to largest, and r is the ratio used to resize cropped images to the input size. The larger the cropped size, the larger the field of view and the more robust the algorithm. Therefore, on line 11, we enlarge the cropped size to obtain as large a field of view as possible. In this paper, we modify the crop size by doubling or halving it each time it needs to be changed, as shown on line 11. The scaling parameter e on line 11 is set to index + 1, as shown on line 5. However, the crop size must satisfy Eq. (6) to ensure that the object’s size in the input image is larger than the limit size; otherwise, the algorithm is blind to the object. This check is performed on line 12. In some cases, Eq. (6) and Eq. (9) cannot both be satisfied. In that case, we choose to compromise Eq. (6), because if Eq. (6) is violated, there is still a certain probability that the target can be detected, whereas if Eq. (9) is violated, it is impossible to detect the target object because it is no longer in the input image.
The tracking algorithm may also fail to track the object in a certain frame. The algorithm used to track the object is based on YOLOv4, whose input size can be changed during online use. Therefore, if the tracking algorithm loses the object, we can feed the whole image to the detection network with an input size of 608 × 608 to find the object. After finding the object, we can switch back to the tracking algorithm.
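One possible reading of the crop-size selection in lines 3 to 21 of Algorithm 1 is sketched below. The function names, the convention of keeping the previous crop when no threshold interval matches, and the representation of the Eq. (6) and Eq. (9) conditions as plain numeric bounds (`upper_bound`, `lower_bound`) are assumptions made for illustration, not details taken from the paper.

```python
THRESHOLDS = (60, 120, 200, 350, 400)  # the list ξ used in the experiments


def select_exponent(scaled_size, thresholds):
    """Lines 3-10: return index + 1 for the interval that contains scaled_size."""
    for index in range(len(thresholds) - 1):
        if thresholds[index] < scaled_size < thresholds[index + 1]:
            return index + 1
    return None  # assumption: no interval matched, caller keeps the previous crop


def next_crop_size(object_size, ratio, thresholds, base_size,
                   previous_crop, upper_bound, lower_bound):
    """Lines 11-21: propose a crop size, then enforce the two bounds.

    upper_bound stands in for Eq. (6); lower_bound stands in for Eq. (9),
    e.g. 2 * v_cam / f as on line 20. The lower bound is applied last, so it
    wins whenever both bounds cannot be met.
    """
    exponent = select_exponent(object_size * ratio, thresholds)
    crop = previous_crop if exponent is None else base_size * 2 ** exponent
    if crop > upper_bound:   # object would fall below the limit size: revert
        crop = previous_crop
    if crop < lower_bound:   # object could leave the crop: widen the view
        crop = lower_bound
    return crop


def resize_ratio(input_size, crop_size):
    """Line 22: ratio used to map the crop back to the network input."""
    return input_size / crop_size


crop = next_crop_size(150, 1.0, THRESHOLDS, 128, 256, upper_bound=600, lower_bound=100)
ratio = resize_ratio(128, crop)
```

Applying the lower bound after the upper bound mirrors the stated preference for compromising Eq. (6) rather than Eq. (9) when the two conflict.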
For a binary classification task, we define the events A_k = “k samples are predicted as positive,” k = 0, 1, ···, n, and B = “the samples predicted as positive are more than 50% of the total predicted samples”. We assume that the events that individual samples are predicted as positive or negative are independent and identically distributed. Thus, if we denote the detection precision by p, then according to the n-fold Bernoulli probability theorem, P(B) is given by Eq. (14).
To analyze the relationship among P(B), p, and n, we take p from 0 to 1 with intervals of 0.05. Then, for each p, we calculate P(B) for n taken from 2 to 50 with intervals of 1, and plot the results in Fig. 5. In Fig. 5, the vertical and horizontal coordinates are the dependent variable P(B) and the independent variable n in Eq. (14), respectively.
From Fig. 5(a), when p is greater than 0.5, P(B) rises in a zigzag shape with increasing n; when p is less than 0.5, it falls in a zigzag shape. From Fig. 5(c), when n is an odd number, P(B) monotonically increases with p and becomes greater than p when p is larger than 0.5; this trend reverses when p is smaller than 0.5. This is not true for even n, as shown in Fig. 5(b).
In a single-object detection task, we can regard the detection of an object from the background as a binary classification task. The challenge is to ensure that the object in one frame is identical to the object in the previous frame. For this purpose, we first use the previous frames to estimate the speed and direction of the object, and if there is only one object near the estimated location, we simply regard it as the tracked object. If there is more than one object around the estimated location, we use the Structural Similarity Index Measure (SSIM) [27] to calculate the similarity between the image in each object bounding box and the previous one, and choose the bounding box with the highest score as the final object bounding box.
Moreover, according to Fig. 5, this process raises probabilities above 0.5 and lowers probabilities below 0.5. Therefore, after confirming the target object, we can regard the rest of the image as negative (non-object), which filters out noisy detection results in some frames. Based on the above analysis, we develop a filter to reduce false positive results, which consequently improves the tracking precision. The algorithm is shown in Algorithm 2.
In Algorithm 2, n represents n in Eq. (14), which is usually set to an odd number according to the above analysis. The update on line 24 pops the first element from the queue Z_j and appends the new result. If the queue Z_j holds fewer than n results, it must first be filled. Line 16 corresponds to P(B) in Eq. (14), and the real object on line 17 means that there is a possible object in the cropped image. Because the AAR task is a single-object tracking task, we need to select only one object in the end. Therefore, if more than one object is in the cropped image, we use the confidence to select the tracked object on line 27.
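The trends described for Fig. 5 can be reproduced with a few lines of Python. The sketch below assumes that P(B) is the binomial tail probability that strictly more than half of n independent predictions, each positive with probability p, are positive; the function name is hypothetical.

```python
from math import comb


def majority_probability(p, n):
    """Probability that strictly more than half of n i.i.d. predictions are positive."""
    first_majority = n // 2 + 1
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(first_majority, n + 1)
    )


# Odd n amplifies p above 0.5 and suppresses p below 0.5.
high = majority_probability(0.8, 9)
low = majority_probability(0.3, 9)

# Even n requires a strict majority, so n = 2 needs both predictions positive.
even_case = majority_probability(0.8, 2)

for n in (3, 5, 9):
    print(n, round(majority_probability(0.8, n), 4))
```

With n = 2 the result is p², which is lower than p, which illustrates why an odd n is preferred.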
Fig. 5 Relationship among P(B), p, and n in Eq. (14).
This tracking algorithm can also be used for other single-object tracking tasks. However, some modifications are needed to obtain better performance. The threshold list ξ should be chosen experimentally according to the limit size in Eq. (6) and the object’s size. Given a certain limit size, we need to choose a proper threshold list to ensure that the object exists in the cropped image. Intuitively, the smaller the thresholds are, the smaller the probability that the object will exceed the cropped image. However, if the object is not large enough when the cropping size is enlarged, its size in the input image will fall below the limit size, leading to missed detections when tracking the object. On the other hand, if the values in the threshold list are too large, the probability of the object being outside the cropped image increases.
There are other limits to using the algorithm in other fields. The algorithm proposed in this paper is based on an object detection algorithm, which cannot distinguish objects belonging to the same class. Therefore, the proposed algorithm cannot be applied in scenarios with many objects of the same class in the background, such as tracking a person in the street or tracking a bird in a flock. However, it can be used in the AAR task because the probability that another refueling drogue will appear in the receiver’s field of view is remote.
Algorithm 2. N-fold Bernoulli probability theorem based filter
1. Maintain a list that contains the detected objects Q = [Q_0, Q_1, ···]
2. For every object Q_j, maintain a queue of size n that holds the results Z_j = [Z_1, Z_2, ···, Z_n]
3. for every frame
4.     if the algorithm detects a new object
5.         add that object to the list Q
6.     end
7.     for every object Q_j
8.         estimate the object location L_i = v_cam / f + L_{i-1}
9.         if the location L_i is outside the cropping range
10.             delete Q_j from the object list
11.             return to line 7
12.         end
13.         if there is more than one object near the estimated location
14.             use SSIM to select the most similar one as object Q_j
15.         end
16.         if the positive results are more than half of the queue Z_j
17.             regard it as the real object
18.         else
19.             regard it as background
20.         end
21.         if there are n negative results in the queue Z_j
22.             delete Q_j
23.         else
24.             update Z_j
25.         end
26.     if there is more than one real object
27.         select the one with the highest confidence as the tracking object
28.     else
29.         regard the real object as the tracking object
30.     end
31. end
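The per-object queue logic of Algorithm 2 (lines 16 to 25) can be sketched as follows. This is a minimal illustration that omits location estimation and SSIM matching; the class name and the choice to compare the number of positives against n / 2 even while the queue is still filling are assumptions, not details from the paper.

```python
from collections import deque


class MajorityFilter:
    """Fixed-length history of detection results for one candidate object."""

    def __init__(self, n=9):
        self.n = n
        # A bounded deque drops the oldest result when a new one is appended,
        # which matches the pop-and-append update on line 24.
        self.results = deque(maxlen=n)

    def update(self, detected):
        self.results.append(bool(detected))

    def is_real_object(self):
        # Line 16: positives must be more than half of the queue size n.
        return sum(self.results) > self.n / 2

    def should_delete(self):
        # Line 21: the queue is full and every result is negative.
        return len(self.results) == self.n and not any(self.results)


track = MajorityFilter(n=5)
for detected in (True, True, False, True):
    track.update(detected)
accepted = track.is_real_object()   # 3 positives out of n = 5

for _ in range(5):
    track.update(False)             # the object disappears for five frames
expired = track.should_delete()
```

A single spurious detection therefore never becomes the tracked object on its own, and a confirmed object survives a few missed frames before it is dropped.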
In this section, we first evaluate our improvement on YOLOv4 and compare our algorithm with other state-of-the-art algorithms: YOLOv5L [28] (YV5L) and EfficientDet [29] (EFD). Then, we test the influence of the input size on the tracking algorithm and verify the effectiveness of the n-fold Bernoulli probability theorem based filter. Finally, we compare our tracking algorithm with other tracking algorithms: Efficient Convolution Operators [30] (ECO), DaSiamRPN [22], MDNet [17], and SiamRPN++ [23].
For the dataset, we build a refueling drogue at a 1:1 scale and record many videos of it. These videos are mainly taken from distances of 2 to 40 m, so they cover the apparent sizes of the refueling drogue over that range of distances. The object size distribution in the dataset is shown in Fig. 6. To simulate practical situations, we also download many refueling videos from the Internet. Afterward, we choose 33 of the 37 videos, split them into frames, randomly select 14700 images from these frames, and use them as the training set. For the test set, because we need to verify both the detection algorithm and the tracking algorithm, we label every frame of the remaining videos and store the frames of each video sequentially in separate folders.
During the training process, we train our algorithm on the training dataset from scratch. We use mosaic, random rotation, Gaussian noise, and gamma transform augmentation. The Adam optimizer with weight decay is used to optimize the network; the learning rate is set to 10^-3; the batch size and minibatch size are set to 130 and 26, respectively. We train the network for 250 epochs on the dataset.
During the tracking process, we set the input size of our tracking algorithm to 128 × 128 pixels, n in Eq. (14) to 9, and ξ in Algorithm 1 to the list (60, 120, 200, 350, 400). The cropping area changes adaptively according to the size of the object, as described in Algorithm 1.
The algorithm is implemented in PyTorch 1.6 and run on an NVIDIA TITAN RTX GPU, and we also deploy it on a Jetson AGX Xavier to verify its efficiency on an embedded device.
After setting up the operating environment, we compare our improved YOLOv4 with the original YOLOv4 on the test dataset by drawing the precision-recall (PR) curve. When calculating the precision and recall, we regard the samples that box only the refueling drogue as True Positive (TP) instances; the samples that box something else are regarded as False Positive (FP) instances; and the samples that miss the refueling drogue are considered False Negative (FN) instances. Precision and recall are then computed from these counts.
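For reference, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN). A small sketch follows; the function name and the counts in the example are made up for illustration and are not results from this paper.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative counts only: 90 correct boxes, 10 wrong boxes, 30 missed drogues.
precision, recall = precision_recall(tp=90, fp=10, fn=30)
```

Sweeping the detection confidence threshold and recomputing the pair at each value yields the PR curve.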
Fig. 6 Object size distribution in training dataset.
In the test, we use Dense-CSPDarknet (DC) to refer to the strategy that uses a dense block to replace the residual block. ACNet means that 1 × 3 and 3 × 1 layers are placed parallel to the 3 × 3 layers at the training stage and fused back into 3 × 3 layers at the inference stage using ACNet theory. The resulting configurations are listed in Table 1.
To ensure that our strategy positively affects object detection, we separately test every strategy proposed in this paper. We list the tested algorithms in Table 1 and plot the testing results in Fig. 7. To better examine the results, we enlarge the upper right corner of Fig. 7(a). The experimental results show that ACNet increases the precision, and the DC strategy increases the recall ratio. The reason is that the ACNet strategy widens the range of receptive field scales to represent more semantic information, and dense blocks have a more powerful feature extraction capability, which helps the network extract detailed information to distinguish a small object from the background. Thus, YV4_D_A has higher precision and recall than the other methods.
We also visualize some of the samples for which these algorithms give different prediction results. The visualized samples are shown in Fig. 8, in which the three columns correspond to the results of the three algorithms. For clarity, we enlarge the content in the yellow box area. In the first row of Fig. 8, YV4 did not find the target, while YV4_D found the target object but also recognized something else as the target object. Only YV4_D_A gave the correct result. In the second row, although YV4 found the object, it also regarded the engine as the drogue. YV4_D and YV4_D_A correctly recognized the drogue. In the last row, although all three found the correct object, YV4 and YV4_D also produced false detections. Only YV4_D_A again found just the drogue. From the above description, YV4_D has a stronger detection capability for small objects than YV4, but its accuracy is not as high as that of YV4_D_A. Additionally, YV4_D_A has a higher confidence in predicting the target object.
For a more detailed comparison, we separate the test dataset into three scales: small, normal, and large. A small object is smaller than 50 pixels, a normal object is greater than 50 pixels and less than 300 pixels, and a large object is greater than 300 pixels. YV5L and our methods are tested on three different input sizes: 416, 544, and 640. For the EFD, we implement its three fastest versions, i.e., D0, D1, and D2. All algorithms are tested with batch size 8. We plot the testing results in Fig. 9. As seen from Fig. 9, although YV5L is faster than our algorithm and has similar performance on normal and large objects, its performance on small targets is not as good as that of our algorithm. Good performance on small objects enables the aircraft to accurately identify the refueling drogue from farther away, which buys more time for the aircraft to adjust. EFD cannot satisfy the real-time requirement of the AAR task: its fastest version achieves only 62 FPS on a TITAN RTX, and its Average Precision (AP) is below that of our algorithm.
Table 1 Algorithms to evaluate efficacy of our algorithm.
Fig. 9 Comparison of object detection algorithms.
To demonstrate the advantage of our tracking algorithm in the autonomous aerial refueling task, we compare our algorithm with some state-of-the-art algorithms on the test dataset. In this paper, we use the precision plot and the success plot to evaluate the algorithms, which are common benchmarks for tracking algorithms. For the tracking tests, we also add the time-reversed version of each video sequence to the test dataset.
Fig. 8 Visualized samples with different results on different algorithms.
(1) Precision plot: Percentages of frames whose estimated locations lie in a given threshold distance to the ground truth centers.
(2) Success plot: A frame whose overlap is larger than the threshold is termed a successful frame, and the ratios of successful frames at thresholds ranging from 0 to 1 are plotted in the success plots.
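The two metrics defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the evaluation code used in the paper: the function names, the (x, y, w, h) box format, and the choice of thresholds are assumptions made for the example.

```python
from math import hypot


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two (x, y, w, h) boxes."""
    ax, ay = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx, by = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    return hypot(ax - bx, ay - by)


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(box_a[0] + box_a[2], box_b[0] + box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[1] + box_a[3], box_b[1] + box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def precision_curve(preds, gts, thresholds):
    """Fraction of frames whose center error is within each distance threshold."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds]


def success_curve(preds, gts, thresholds):
    """Fraction of frames whose overlap is larger than each overlap threshold."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    return [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]


# Toy example: the first prediction is perfect, the second misses the target.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (10, 0, 10, 10)]
print(precision_curve(preds, gts, [5, 20]))
print(success_curve(preds, gts, [0.0, 0.5]))
```

Plotting the first list against distance thresholds gives a precision plot, and plotting the second against overlap thresholds from 0 to 1 gives a success plot.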
Fig. 10 Object size distribution and recall and precision of different object-sized datasets.
First, we test the influence of the cropping size on the tracking performance. To accomplish this, we first need to determine the limit size of our detection algorithm. To find the limit size, we set multiple continuous size intervals and divide the dataset into multiple sets according to the interval that contains the object size. Fig. 10(a) and Fig. 10(b) show the distributions of widths and heights. The different colors in Fig. 10(a) and Fig. 10(b) represent the different object-sized datasets. The algorithm is tested separately on these sets. When there is a considerable drop in the recall at a certain interval, we consider the upper bound of the interval to be the limit size. We record the precision and recall of the algorithm on each interval set, and the results are shown in Fig. 10(c). In Fig. 10(c), we omit the results achieved on the datasets in which the object sizes are larger than 70 pixels, because the recall rate is already stable once the object size reaches 70 pixels. It can be seen from Fig. 10(c) that there is a great loss in recall rate when the size falls below 26 pixels. Accordingly, we choose 26 as the limit size of our algorithm, which means that when the object in the input image is smaller than 26 pixels, the algorithm will have difficulty finding it. Afterward, we decrease the input size of the detection algorithm through every size divisible by 32 from 224 pixels to 64 pixels and record the missing frames and operating time on the embedded system. The results are shown in Table 2.
Table 2 shows that the input size of the network greatly affects the operating speed because the input size changes the computational complexity of the network, as shown in Eq. (4). From a speed perspective, the smaller the input size, the faster the network. The effect of the input size on accuracy is the opposite: the smaller the input size, the lower the accuracy. This is because when the cropping size is not sufficiently large to contain the object in the cropped image, we need to enlarge the cropping size to ensure that the object exists in the cropped image. Afterward, we must resize the cropped image to fit the input size. However, if the cropping size is much larger than the input size, many object details are lost when resizing the image back to the input size, making the network extract blurred features. Thus, the result of the network is based on blurred features, which decreases the performance. In general, the cropping size cannot be smaller than a certain size, although a smaller size corresponds to a higher speed. Therefore, we need to make a tradeoff between precision and operating speed. In this paper, we choose a size of 128 pixels for the AAR drogue tracking task.
Second, we test the validity of the n-fold Bernoulli probability theorem based filter. We test the algorithms (with and without the n-fold Bernoulli probability theorem based filter) on all testing datasets. In Eq. (14), n can be set to different numbers. According to our analysis, if n is set to an odd number, it can increase the result when p is larger than 0.5 and decrease it when p is smaller than 0.5. The larger n is, the better the results are. In our test, n is set to odd numbers from 3 to 11, and "-n" is used to indicate the chosen number. The results are listed in Table 3. In Table 3, "Proposed-n" indicates the YV4_D_A algorithm with an n-fold Bernoulli event algorithm. Table 3 shows that the precision increases with increasing n, while the recall remains the same, because we can only weed out the FP results using the former information of the object but cannot decrease the FN results.
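The limit-size search described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used to produce Fig. 10(c): the function names, the interval edges, and the 0.2 drop threshold that stands in for "a considerable drop" are all hypothetical.

```python
def recall_per_interval(samples, edges):
    """Group (object_size, detected) samples into [lo, hi) intervals and
    return (lo, hi, recall) per interval; recall is None for empty intervals."""
    stats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hits = [detected for size, detected in samples if lo <= size < hi]
        recall = sum(hits) / len(hits) if hits else None
        stats.append((lo, hi, recall))
    return stats


def find_limit_size(stats, drop=0.2):
    """Walk from the largest interval to the smallest and return the upper
    bound of the first interval whose recall falls by at least `drop`
    (an assumed threshold) relative to the next larger interval."""
    previous = None
    for lo, hi, recall in reversed(stats):
        if recall is None:
            continue
        if previous is not None and previous - recall >= drop:
            return hi
        previous = recall
    return None


# Toy data: objects below 26 pixels are mostly missed.
samples = [(20, False), (22, False), (24, True),
           (30, True), (32, True), (40, True), (45, True)]
stats = recall_per_interval(samples, edges=[14, 26, 38, 50])
print(find_limit_size(stats))

# Input sizes divisible by 32, from 224 pixels down to 64 pixels.
input_sizes = list(range(224, 63, -32))
print(input_sizes)
```

The drop threshold would in practice be chosen by inspecting the recall curve, as is done with Fig. 10(c).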
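The stated effect of an odd n can be checked numerically. Eq. (14) is not reproduced in this section, so the sketch below assumes that it has the standard majority form of the n-fold Bernoulli (binomial) probability, i.e., the probability that more than half of n independent trials with per-trial probability p succeed. The function name is hypothetical.

```python
from math import comb


def majority_probability(p, n):
    """Probability of at least (n + 1) / 2 successes in n independent
    Bernoulli trials with success probability p (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("n must be odd")
    k_min = (n + 1) // 2
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Compare the amplified probability for p above and below 0.5.
for n in (3, 5, 7, 9, 11):
    print(n, round(majority_probability(0.8, n), 4), round(majority_probability(0.3, n), 4))
```

Under this assumption, a per-frame probability above 0.5 is pushed toward 1 and a probability below 0.5 is pushed toward 0 as n grows, which is consistent with the statement that larger n gives better results.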
Table 2 Performance of network with different input sizes.
Finally, we compare our algorithm on an embedded system with other state-of-the-art object tracking algorithms. We first use the improved YV4_D_A to detect refueling drogues and submit the drogue locations to the tracking algorithms. We use the same strategy to test all algorithms on the test data. In this test, we use the MobileNetv2 backbone version of SiamRPN++ [23], and the template and input sizes are set to 127 and 255, respectively. For SiamAPN++ [31], we set the input size and template size to 127 × 127 and 287 × 287, respectively. In ECO [30], we set the maximum number of components to 50, the filter update interval to 6, and the number of conjugate gradient iterations to 5. In this paper, ECO is implemented based on Histogram of Oriented Gradients (HOG) features. For MDNet [17], we set the input size to 107 × 107, and the Intersection over Union (IoU) thresholds used to select positive and negative samples in training videos are set to 0.7 and 0.5, respectively. When using MDNet online to track the refueling drogue, the IoU thresholds used to select positive and negative samples in the first frame are set to 0.7 and 0.3, respectively. For DaSiamRPN, the initial input size is set to 255 × 255 and gradually increases to 767 × 767 when the target is lost. The detection score thresholds for entering and leaving failure cases are set to 0.8 and 0.95, respectively. We record the precision plot and success plot and show them in Fig. 11. Additionally, we record the operating speeds on the embedded system, as shown in Table 4 (in Table 4, all algorithms are run on Xavier AGX).
Fig. 11 shows that our algorithm has advantages over the other tracking algorithms: both the success plot and the precision plot of One Pass Evaluation (OPE) of our algorithm lie above those of the other object tracking algorithms. Because our tracking algorithm tracks objects by detection, it has some prior knowledge about the tracked object, and it uses former information such as location and class to reduce false positives. Although our algorithm does not match SiamAPN++ on speed, it has higher precision than SiamAPN++ and can run at up to 20.04 FPS on the embedded device, which guarantees its real-time application in the AAR task.
We also test our algorithm under different weather and lighting conditions. We simulate three different weather conditions: rainy, snowy, and foggy weather. When simulating a weather condition, we apply a smooth random change to it, meaning that the state of the weather condition varies smoothly and randomly. For example, when we add rain to a clean image, every attribute of the rain, such as its density and tilt angle, is changed by a random amount in the interval (-ε, ε) in each frame, where ε is small enough to ensure that the weather state changes smoothly. After adding the weather condition to each testing video, we test all the algorithms on these datasets and record the results in Table 5. Among these weather conditions, foggy weather affects the tracking precision the most. Snowy weather also affects the precision, but not as much as foggy weather. Rainy weather has minor effects. During the test, we separately test our algorithm with and without random Gaussian noise augmentation on the snowy and rainy weather datasets. Table 5 shows that Gaussian noise augmentation can improve our tracking algorithm when dealing with snowy or rainy weather conditions. Therefore, for snowy and rainy weather conditions, our algorithm is affected the least. For foggy weather, although the performance of our algorithm drops, it still maintains the highest performance.
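The smooth random change of a weather attribute can be sketched as a bounded random walk. The snippet below is only an illustration of the idea: the function name, the clamping bounds, and the example values for rain density are assumptions, and the rendering of the rain itself is not shown.

```python
import random


def smooth_random_walk(initial, epsilon, num_frames, lower, upper, seed=None):
    """Per-frame values of one weather attribute (e.g., rain density or tilt
    angle). Each frame adds a random step drawn from [-epsilon, epsilon], and
    the value is clamped to [lower, upper] (the bounds are an assumption)."""
    rng = random.Random(seed)
    values = [initial]
    for _ in range(num_frames - 1):
        step = rng.uniform(-epsilon, epsilon)
        values.append(min(upper, max(lower, values[-1] + step)))
    return values


# Hypothetical example: rain density in [0, 1] over 100 frames.
density = smooth_random_walk(0.5, 0.05, 100, 0.0, 1.0, seed=0)
print(len(density), min(density) >= 0.0, max(density) <= 1.0)
```

Because each step is bounded by ε, consecutive frames differ by at most ε, so a small ε yields a weather state that drifts smoothly instead of flickering.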
Table 3 Test results of n-fold Bernoulli probability theorem based filter.
Fig. 11 Comparison of tracking algorithms.
Table 4 Performance comparison of tracking algorithms.
Table 5 Test under different weather and lighting conditions.
For the lighting condition test, we take several videos under different lighting conditions at night. We separately test all the algorithms on these video datasets and record the results in Table 5. In this paper, we use gamma transfer augmentation to simulate different lighting conditions in the training dataset; therefore, our algorithm is capable of handling different lighting conditions. The results in Table 5 demonstrate that the performance of all the algorithms decreases when they are tested under different lighting conditions, but our algorithm still achieves the best performance among them.
In this paper, we first present an object detection algorithm based on YOLOv4 with the aim of increasing the reliability of the detection algorithm in situations that are intolerant of mistakes and involve small objects. To improve small object detection, we use dense blocks to replace the residual blocks, which improves the feature extraction capability. We also use ACNet to enrich the scale of the receptive field, which consequently increases the accuracy of the network. Additionally, we develop an n-fold Bernoulli probability theorem based filter that decreases false positives and thereby increases precision when tracking an object. Based on this filter and the aforementioned object detection algorithm, we develop an adaptive object tracking algorithm for the AAR task, which proves to have high precision and strong robustness with real-time operating ability in an embedded system. The experimental results demonstrate the effectiveness of the proposed tracking algorithm.
In this paper, we consider single-object tracking tasks, and the proposed algorithm can also be used in other single-object tracking tasks. In future work, we will focus on a multi-object tracking algorithm based on an object detection algorithm, which can enhance the practicality of the algorithm.
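The gamma-based lighting augmentation can be illustrated with a standard power-law intensity transform. The paper does not give the exact formulation or the gamma range, so the sketch below assumes the common form out = 255 · (in / 255)^γ on 8-bit intensities; the function name and the example values are hypothetical.

```python
def gamma_transform(pixels, gamma):
    """Apply a power-law (gamma) transform to rows of 8-bit intensities.
    gamma > 1 darkens the image and gamma < 1 brightens it."""
    # Precompute a lookup table for all 256 possible intensity values.
    lut = [round(255 * (v / 255) ** gamma) for v in range(256)]
    return [[lut[v] for v in row] for row in pixels]


# A tiny 2 x 3 grayscale "image" darkened with gamma = 2.0.
image = [[0, 128, 255], [64, 192, 32]]
print(gamma_transform(image, 2.0))
```

During training, one would typically draw γ at random for each image so that the detector sees both darker and brighter versions of the same scene.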
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Chinese Journal of Aeronautics, 2023, Issue 1