Tianhao Zhang, Jiuhong Xiao, Liang Li, Chen Wang, and Guangming Xie
Abstract—Controlling multiple multi-joint fish-like robots has long captivated the attention of engineers and biologists, for which a fundamental but challenging topic is to robustly track the postures of the individuals in real time. This requires detecting multiple robots, estimating multi-joint postures, and tracking identities, all with real-time processing speed. To the best of our knowledge, this challenge has not been tackled in previous studies. In this paper, to precisely track the planar postures of multiple swimming multi-joint fish-like robots in real time, we propose a novel deep neural network-based method, named TAB-IOL. Its TAB part fuses the top-down and bottom-up approaches for vision-based pose estimation, while its IOL part with long short-term memory considers the motion constraints among joints for precise pose tracking. The satisfying performance of our TAB-IOL is verified by testing on a group of freely swimming fish-like robots in various scenarios with strong disturbances and by a detailed comparison of accuracy, speed, and robustness with state-of-the-art algorithms. Further, based on the precise pose estimation and tracking realized by our TAB-IOL, several formation control experiments are conducted for the group of fish-like robots. The results clearly demonstrate that our TAB-IOL lays a solid foundation for the coordination control of multiple fish-like robots in a real working environment. We believe our proposed method will facilitate the growth and development of related fields.
IN the past decade, fish-like robots, as a typical representative of biomimetic underwater robots, have attracted increasing attention from engineers and biologists, since such robots replicate the outstanding skills of fish in nature [1], [2] and, in turn, become powerful tools for understanding fish behaviors [3]–[6]. One key fundamental research topic is to robustly track the postures of multiple fish-like robots in real time, including multiple-robot detection, multi-joint posture estimation, and identity tracking. This is because multi-individual pose tracking is the basis of individual motion evaluation [7], group motion analysis [8], and real-time feedback motion control [2]. However, multi-individual pose tracking for fish-like robots is dramatically challenging. For one thing, the hydrodynamics around an underwater robot may bring high uncertainty and noise to the robot's motion; thus, pose estimation and tracking face particular difficulties compared to those for ground mobile robots [9]. For another, unlike traditional underwater vehicles, a fish-like robot usually has a multi-joint, flexible, deformable body to generate biology-inspired swimming modes [10], so the body deformation during swimming, together with the hydrodynamic factors, makes it difficult to distinguish and track multiple fish-like robots when they get close to each other [11].
Due to these unique challenges, non-visual sensor-based posture measurement methods no longer work well. For instance, the inertial measurement unit (IMU) in the robot reveals obvious drift error [12]; the Global Positioning System (GPS) cannot be used indoors, and its measuring accuracy is not high enough compared to the body length of the fish-like robots [13]. Therefore, vision-based systems are widely used for fish-like robots in experimental environments due to their convenience, accuracy, and stable performance. According to where the cameras are installed, there are two categories of vision-based systems: on-board vision systems and global vision systems. In on-board vision systems, the cameras are carried by the fish-like robots [14]; thus the cameras move together with the robots while capturing images, which seems convenient. However, since the cameras move with the robots and have a limited field of vision, realizing multi-robot tracking using only the on-board vision system is difficult and hence requires more devices, such as communication devices. In contrast, cameras in global vision systems are carried by unmanned aerial vehicles (UAVs) [15], [16], or installed above or around the pool [17], [18], so their precise positions can be obtained. Therefore, global vision systems are suitable for posture tracking in indoor experimental environments and natural waters and are applied by most researchers [7].
In general, global vision systems can be divided into two categories according to whether markers are mounted on the robots: marker-based systems and marker-less systems. Marker-based systems estimate and track postures by detecting the markers, such as light-emitting diodes [19] and color markers [7]. For instance, in [7], bilateral filtering and K-means are utilized to accurately segment out color markers for posture tracking of a fish-like robot. Although such an approach is precise when there is one robot, it is troublesome to tune its parameters, and it is limited in many situations, such as under uneven illumination. Besides, it is hardly possible to preset markers on real fish for biology research. To break through the restriction of using markers, marker-less systems that identify targets by matching them with image features have been studied in recent years. In previous works, marker-less systems tackled the pose estimation of fish-like objects using artificial experience, such as geometric shapes [20]. For instance, one typical marker-less planar pose tracking method for multiple fish-like robots is the double-template matching algorithm [21]. It estimates each robot's rigid head by double-template matching and tracks the robot with a Kalman filter. Although such a traditional approach is widely utilized in research on coordination control of multiple fish-like robots [2], [17], it requires manually acquiring a rough scope of the robotic fish initially, and it estimates the robotic fish's pose as a template with a pre-designed fixed shape, ignoring the deformation of the body, which happens to be the key characteristic of fish-like robots. Besides, it fails when multiple fish-like robots gather. Therefore, to meet the needs of further research on the coordination control of multiple fish-like robots, the global vision system for multiple fish-like robots needs to be upgraded. To this end, this paper aims to propose a marker-less vision-based pose estimation and tracking method suited to challenging extreme situations, including uneven illumination, water ripples, and collisions, to advance robotic fish research.
Fortunately, deep learning has witnessed great success in computer vision, promoting vision systems for multiple robots. For example, toward collaborative robotics from a top view, multiple-object tracking was realized by deep learning-based detection in [22]. However, how to estimate each object's pose was not solved in that work. Pose estimation is a popular but challenging topic in computer vision, requiring both detecting objects and estimating their poses. In the field of two-dimensional multi-person pose estimation, considerable progress based on deep learning has been made, and the existing methods can be classified into top-down approaches and bottom-up approaches. Top-down approaches, such as the Simple Baseline [23], detect each human first and then estimate the pose of each person. In contrast, bottom-up approaches identify all the key points first and then assign them to each person. OpenPose [24] is a typical and widely used bottom-up multi-person pose estimation method. It proposes part affinity fields (PAF) to establish a directional relationship between key points. A greedy parsing algorithm is used to assign the key points to individuals based on PAF, requiring some prior information to speed up the algorithm. DeeperCut [25] is another representative bottom-up method. It uses an integer linear program, instead of PAF, to analyze multi-person poses. More fortunately, human pose estimation methods have recently been applied to animal posture tracking. For instance, DeepLabCut [26] shows that DeeperCut can be validated on biometrics with a small dataset. LEAP [27] is another animal posture tracking method, consisting of a top-down model like the Simple Baseline and a bottom-up model like OpenPose. Considering that both humans and animals can be regarded as having multi-joint bodies, these deep neural network-based advances encourage marker-less posture tracking of multiple multi-joint fish-like robots. However, these methods cannot meet the online, real-time, precise requirements of feedback control: the human pose estimation methods must trade off accuracy against speed, and the animal pose tracking methods are not precise enough. Thus, for the needs of coordination control, precisely tracking the postures of multiple multi-joint fish-like robots in real time is still a challenge.
This paper proposes a novel global-visual marker-less method via deep neural networks for the real-time, precise planar pose estimation and tracking of multiple multi-joint fish-like robots, called TAB-IOL. First, for pose estimation, we design a novel CNN-based network with a parallel structure to combine the top-down and bottom-up approaches, named TAB. TAB outputs the target region and the pixel information of key points, i.e., head, body, and tail, simultaneously. In this way, the limitation of top-down approaches when multiple individuals lie in one region is corrected by the key point pixel information. In turn, the target region provides the overall individual information for the bottom-up part to reduce false-positive pairwise connections. TAB takes advantage of both top-down and bottom-up approaches; thus, it has accuracy and speed advantages over either one alone. Second, for pose tracking, we construct a novel network that combines the independent movement characteristic of each type of key point and the overall constraints between key points using long short-term memory (LSTM), named IOL. This tracking network first separately extracts the motion feature of each type of key point and then fuses all the separate features to account for the impact of motion constraints between key points, yielding precise tracking performance.
Fig. 1. The experimental platform used to conduct our proposed method to track postures of multiple multi-joint fish-like robots. (a) The prototype of the carp robot. (b) The schematic of the experimental platform. (c) Typical scenarios of single or multiple swimming fish-like robots.
To demonstrate the satisfying accuracy and speed performance of our proposed method, a pose dataset is built, including one image training set, one image test set, one video training set, and one video test set. Next, TAB, trained on the image training set, is compared with state-of-the-art human and animal pose estimation methods on the image test set. The test images contain different numbers of fish-like robots in various scenarios with strong disturbances. IOL, trained on the video training set, is compared with the Kalman filter approach on the video test set. Experimental results show that TAB and IOL have accuracy and speed advantages over the existing methods. Then, ablation studies are conducted on TAB and IOL to illustrate their novelty and contribution. The effectiveness of our proposed pose tracking system is verified on an offline video by comparing it with the traditional fish-like robot pose tracking system. Tracking results show that our proposed system is more accurate in pose estimation than the existing methods, with less identity loss and fewer identity exchanges in tracking. At last, formation control experiments for a group of fish-like robots are conducted, which clearly demonstrate that our proposed method lays a key foundation for coordination research on multiple multi-joint fish-like robots in a real working environment.
The main contributions of this paper are threefold.
1) As far as the authors are aware, this is the first time that a deep neural network-based approach has been proposed for online real-time pose estimation and tracking of multiple fish-like robots. Comparison experiments show that our method performs better than the existing fish-like robot pose tracking methods.
2) A fusion pose estimation method, TAB, is proposed, which simultaneously outputs detection and key point information through a parallel structure. As a result, it takes advantage of both the top-down and bottom-up approaches, achieving higher accuracy than existing state-of-the-art methods while satisfying the real-time requirement.
3) A novel pose prediction method, IOL, is proposed, which is the first to track pose by considering both the motion characteristics of each type of key point and the motion constraints among different key points. IOL shows excellent performance on key point prediction in the next frame.
The rest of this paper is organized as follows. In Section II, the basic information about the multi-joint fish-like robot and the experimental platform is described, and the unique challenges of tracking the postures of multiple fish-like robots are emphasized. Section III introduces our proposed pose estimation and tracking method. Section IV elaborates on the dataset collection and training results. Section V presents the ablation studies and formation control experiments. The supplementary movie (available online at https://ibdl.pku.edu.cn/research/video/926374.htm) shows the satisfying performance of our method by comparing it with baselines. Finally, Section VI concludes this paper.
The fish-like robot focused on in this paper, as shown in Fig. 1(a), mimics the morphology and kinematics of the carp in nature. The carp robot consists of a rigid head, a flexible body, and a caudal fin, with a total length of 44.3 cm and a width of 10.0 cm. Inside its head, there are an STM32 control board, a set of lithium batteries, and a wireless module. The body of the fish-like robot contains three revolute joints, and the caudal fin is attached to the third joint. The density of the robot is just slightly less than that of the water, so it swims near the water surface. The propulsion of the carp robot is achieved by generating a traveling wave that traverses the body toward the tail by swinging its three joints; that is, its body deforms during swimming. Thus, it is critical for robot control to identify and track the head, body, and tail.
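As an illustration of this propulsion principle, a parameterization commonly used for multi-joint robotic fish (our illustration, not necessarily the exact controller of this platform) drives joint i with a phase-lagged sinusoid:

$$\theta_i(t) = A_i \sin\big(\omega t - (i-1)\varphi\big), \qquad i = 1, 2, 3,$$

where A_i is the swing amplitude of joint i, ω is the oscillation frequency, and the phase lag φ > 0 makes the generated body wave travel toward the tail.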
Fig. 2. Pipeline of TAB for the pose estimation of multiple multi-joint fish-like robots and its detailed network architecture. TAB simultaneously outputs the detection and key point information of multiple fish-like robots using the parallel structure.
The experimental platform is shown in Fig. 1(b), consisting of a 3 m × 2 m pool, a global camera 3 m vertically above the pool, a computer, and a wireless transmitter. The camera captures images at 25 frames per second (fps). It then sends them to the computer, where our proposed method processes the real-time images to realize pose estimation and tracking of multiple fish-like robots swimming in the pool. Last, the computer sends control instructions to the robots through the wireless transmitter. Note that the global camera here can be replaced by a UAV carrying a camera, which makes the platform more flexible [15], [16].
Four typical extreme scenarios of single or multiple swimming multi-joint fish-like robots are shown in Fig. 1(c), including uneven illumination, water ripples, body deformation, and collisions. These situations illustrate the unique challenges of pose estimation and tracking for multiple multi-joint fish-like robots. First, considering that the area ratio of the robot to the pool, 0.74%, is very small, key point localization requires high accuracy. Second, unlike a human, who has a rich skeleton, when the deformable bodies of the robots are intertwined, assigning key points to fish-like robots without other skeletal assistance is challenging. Last, the total processing speed should be fast enough for real-time operation while ensuring accuracy, to meet online feedback control requirements. For more details on this widely studied fish-like robot and platform, refer to [28], [29].
This section introduces in detail the pose estimation method, TAB, and the pose tracking method, IOL. Fig. 2 depicts the pipeline of TAB, and Fig. 3 illustrates the network structure of IOL.
Fig. 3. The network architecture of IOL for the pose tracking of multiple multi-joint fish-like robots. IOL outputs the coordinates of key points at time step t+1 for each robot, based on the coordinates at time step t.
In this paper, the pose estimation of multiple multi-joint fish-like robots requires identifying each robot's key points, i.e., head, body, and tail. One approach is to detect the robot region with a bounding box first and then localize its key points in the region, called top-down. Top-down can decrease false-positive connections between different robots using overall detection information, provided the assumption that one detection region contains one robot is valid. However, when the assumption is broken, the identification and assignment of key points by the top-down approach may be ineffective. For example, when fish-like robots are intertwined, the top-down approach performs unsatisfactorily (see experimental results in Fig. 4(b)). Another approach is to identify the key points of all robots first and then assign them to individuals, called bottom-up. Although the bottom-up approach is not limited by the number of robots in one region, it faces heavy computation for checking connections and high assignment error due to the lack of overall detection information (see experimental results in Fig. 4(c)).
To address these challenges, we present a novel network-based pose estimation method, named TAB, which combines top-down and bottom-up in a parallel structure to simultaneously output detection and key point information. The detector generates bounding boxes {B_1, ..., B_M} for all N individuals. Assuming that there is no false-positive bounding box and all individuals are bounded, each box contains one or more individuals, i.e., M ≤ N. At the same time, the estimator generates predicted heatmaps {H_1, ..., H_k} for all k key points and vectormaps {V_1, ..., V_p} for all p joints. The heatmaps represent the confidence of key points at each pixel, and the vectormaps indicate the confidence of the vector between key points.
The detailed architecture of TAB is shown in Fig. 2, which combines Yolo-v3 [30] as the detector and OpenPose [24] as the pose estimator. EfficientNet [31] is applied as the network backbone since it has accuracy and speed advantages through depth-wise convolution and the squeeze-and-excitation (SE) structure. The backbone outputs feature maps at three resolutions, 1/32, 1/16, and 1/8 of the input image size. The detector uses all three maps as inputs to output the bounding box information, while the estimator uses only the 1/32 feature map as input and outputs key point feature maps at 1/4 resolution through three deconvolution layers. Besides, the mobile inverted bottleneck convolution (MBConv) unit [32] is applied in the network for feature extraction, since it has advantages in convolution speed and accuracy over the traditional convolution unit. Specifically, in experiments, balancing accuracy and speed, EfficientNet-B4 is chosen as the backbone. All the MBConv filters are 3×3 kernels without upsampling, numbering 256, 128, and 64 for the 1/32, 1/16, and 1/8 channels, respectively. The three deconvolution layers have 256, 128, and 64 filters with 4×4 kernels in sequence, and the stride is 2. A minimal sketch of this parallel structure is given below.
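The following Keras sketch illustrates the parallel structure under the stated configuration. It is an illustration, not the authors' released implementation: the backbone tap points, the single-anchor 5-channel detection head, and all layer names are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_tab(input_shape=(288, 416, 3), num_keypoints=3, num_limbs=2):
    inputs = tf.keras.Input(shape=input_shape)  # 416x288 preprocessed images
    # Stand-in backbone; the paper uses EfficientNet-B4 with MBConv blocks.
    backbone = tf.keras.applications.EfficientNetB4(
        include_top=False, weights=None, input_tensor=inputs)
    f32 = backbone.get_layer('top_activation').output             # 1/32 scale
    f16 = backbone.get_layer('block6a_expand_activation').output  # 1/16 scale
    f8 = backbone.get_layer('block4a_expand_activation').output   # 1/8 scale

    # Detector branch (Yolo-v3 style): one prediction head per scale; each
    # cell predicts box center, size, and confidence (x, y, w, h, conf).
    det_heads = []
    for fmap, ch in zip((f32, f16, f8), (256, 128, 64)):
        x = layers.Conv2D(ch, 3, padding='same', activation='relu')(fmap)
        det_heads.append(layers.Conv2D(5, 1, name=f'det_{ch}')(x))

    # Estimator branch (OpenPose style): three stride-2 deconvolutions
    # upsample the 1/32 map to 1/4 resolution before the output heads.
    x = f32
    for ch in (256, 128, 64):
        x = layers.Conv2DTranspose(ch, 4, strides=2, padding='same',
                                   activation='relu')(x)
    heatmaps = layers.Conv2D(num_keypoints, 1, name='heatmaps')(x)
    vectormaps = layers.Conv2D(2 * num_limbs, 1, name='vectormaps')(x)

    return Model(inputs, det_heads + [heatmaps, vectormaps])
```

The detection head is deliberately simplified to a single class and a single anchor per cell; Yolo-v3 in full predicts several anchors per cell.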
The loss function of TAB combines cross-entropy and mean squared error terms: cross-entropy is used as the loss for the predicted x, y coordinates of the bounding box and for the confidence of each predicted bounding box, while mean squared error is used for the length and width of the bounding box and for the confidences of the heatmaps and vectormaps.
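Written out under these definitions, a plausible composite loss (the weights λ1 to λ3 are our assumptions; the paper's exact weighting is not reproduced here) is

$$\mathcal{L}_{\mathrm{TAB}} = \mathcal{L}_{\mathrm{CE}}(x, y) + \mathcal{L}_{\mathrm{CE}}(c) + \lambda_{1}\,\mathcal{L}_{\mathrm{MSE}}(w, h) + \lambda_{2}\sum_{j=1}^{k}\lVert H_{j}-H_{j}^{*}\rVert_{2}^{2} + \lambda_{3}\sum_{j=1}^{p}\lVert V_{j}-V_{j}^{*}\rVert_{2}^{2},$$

where c is the box confidence, (w, h) is the box size, and H_j^*, V_j^* are the ground-truth heatmaps and vectormaps.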
The pose tracking of multiple multi-joint fish-like robots requires assigning a unique identity to each robot. In previous works, clustering [7] or Kalman filter [21] algorithms were used to achieve pose tracking of fish-like robots. However, the clustering-based approach fails when collisions occur, since all identities gather, and the Kalman filter-based approach causes identity exchanges during collisions due to insufficient accuracy. Some recent deep neural network-based works combine feature maps of frames for precise pose tracking in an end-to-end network [33], [34], where the recurrent neural network (RNN) [35], especially the LSTM network [36], is the key unit for object matching in time series. Compared to a fully connected network, the hidden state of the LSTM at the previous moment affects the one at the next moment. However, these pose tracking approaches based on time-series images are too slow, with overhead too heavy to satisfy online real-time requirements. Considering that the coordinates of key points estimated by TAB can serve as lightweight inputs for an LSTM to meet the real-time requirement, predicting key points in the next frame based on an LSTM is a feasible approach to pose tracking.
Because the propulsion of the fish-like robot is achieved by generating a traveling wave that traverses the body toward the tail, the individual movement is clearly related to the key points being constrained by each other, which we call the overall motion constraints. Besides, each type of key point has its own independent movement characteristic due to the different physical structures, such as the rigid head, deformable body, and soft tail. Thus, the prediction of key points should consider both the independent movement characteristics of the key points and the overall motion constraints between them, which is exactly the core of our proposed LSTM-based network for pose tracking, named IOL.
The structure of IOL is shown in Fig. 3; its inputs are the coordinates of the key points. First, the independent motion feature of each key point is sequentially extracted by an FC (fully connected) network and an LSTM network. Then, all independent motion features are fed into a common LSTM network to account for the overall constraints among the key points. Last, an FC network converts the overall information into the parameters of a bivariate Gaussian distribution over the coordinates of each key point in the next frame. For key point i, denote its coordinate prediction at the next time step t+1 as (x̂, ŷ)_i^{t+1}. According to [37], prediction based on a bivariate Gaussian distribution is more precise than that by direct distance minimization. Thus, similarly to [38], we assume a bivariate Gaussian distribution parameterized by the mean μ_i^t, standard deviation σ_i^t, and correlation coefficient ρ_i^t, so that the predicted coordinates at time step t are given by (x̂, ŷ)_i^t ~ N(μ_i^t, σ_i^t, ρ_i^t). In this way, the parameters of IOL are learned by minimizing the negative log-likelihood loss

$$L_i = -\sum_{t=T_{obs}+1}^{T_{pred}} \log P\big(x_i^t, y_i^t \mid \mu_i^t, \sigma_i^t, \rho_i^t\big),$$

which predicts pose coordinates from time step T_obs+1 to T_pred. In practice, T_pred = T_obs + 1. The independent FC network includes one hidden layer with 128 neurons, and the independent LSTM network is 256-dimensional. The overall LSTM network is 512-dimensional.
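Under these stated dimensions, a minimal Keras sketch of IOL follows. It is illustrative only: the concatenation of the three independent streams before the common LSTM and the exact shape of the output head are our assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_iol(time_steps=30, num_kp_types=3):
    # One (x, y) sequence per key-point type: head, body, tail.
    inputs = [tf.keras.Input(shape=(time_steps, 2)) for _ in range(num_kp_types)]

    # Independent branch: FC (128) then LSTM (256) per key-point type,
    # capturing each type's own movement characteristic.
    streams = []
    for inp in inputs:
        x = layers.TimeDistributed(layers.Dense(128, activation='relu'))(inp)
        x = layers.LSTM(256, return_sequences=True)(x)
        streams.append(x)

    # Overall branch: concatenate the independent features and feed a
    # common LSTM (512) to model motion constraints among key points.
    fused = layers.Concatenate(axis=-1)(streams)
    fused = layers.LSTM(512)(fused)

    # FC head: five bivariate-Gaussian parameters (mu_x, mu_y, sigma_x,
    # sigma_y, rho) per key-point type for the next frame.
    params = layers.Dense(5 * num_kp_types)(fused)
    return Model(inputs, params)
```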
Identities Assignment: After the coordinate prediction, identity assignment is realized by comparing the coordinates between each pair of two continuous frames. Take the frame at time step t as an example. The pose vector of robot k predicted by IOL is denoted as P̂_k^t, which sequentially contains the coordinates of head, body, and tail. The pose vectors of all N robots estimated by TAB are denoted as {P_1^t, ..., P_N^t}. Then, the Euclidean distance between the prediction and each estimation is calculated. The id k is assigned to the estimated robot with the minimum distance. As long as the prediction error of IOL does not exceed the size of the robot itself, such a matching approach is accurate and fast.
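A minimal sketch of this matching step is given below; the greedy one-to-one resolution of conflicting matches is our interpretation of "assigned to the robot with the minimum distance".

```python
import numpy as np

def assign_identities(predicted, estimated):
    """predicted: (N, 6) IOL pose vectors, row k belongs to id k.
    estimated: (N, 6) TAB pose vectors in the current frame.
    Returns ids, where ids[j] is the identity assigned to estimation j."""
    # Pairwise Euclidean distances between all predictions and estimations.
    dists = np.linalg.norm(predicted[:, None, :] - estimated[None, :, :],
                           axis=-1)
    ids = np.full(len(estimated), -1)
    for _ in range(len(estimated)):
        k, j = np.unravel_index(np.argmin(dists), dists.shape)
        ids[j] = k               # id k goes to the closest estimation j
        dists[k, :] = np.inf     # each id and each estimation used once
        dists[:, j] = np.inf
    return ids
```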
Image Cropping: In fact, when there is no missing or new robot in the current frame, it is unnecessary to estimate poses using the entire image. Mark the input of TAB as the region of interest (ROI). Obviously, if the next frame image is cropped into subgraphs based on the coordinate predictions, which are set as the ROI, pose estimation can be sped up. Specifically, suppose that n fish-like robots are estimated by TAB in the current frame k, and the predicted coordinate of the body of robot i in the next frame k+1 is denoted as (x_2, y_2)_i = (b_{i,x}, b_{i,y}). Then, the ROI S = {S_1, S_2, ..., S_n} is a subset of the captured next frame image, where S_i represents a subgraph centered at the pixel coordinate (b_{i,x}, b_{i,y}) of robot i. The region of S_i is b_{i,x} − l ≤ x ≤ b_{i,x} + l, b_{i,y} − w ≤ y ≤ b_{i,y} + w, whose size can be adjusted by choosing the length l and the width w. In a word, the next frame image is cropped into n subgraphs of size 2l × 2w based on the coordinate predictions (a minimal sketch of this step is given at the end of this paragraph). Moreover, GPU parallelization can be applied to process the n subgraphs at the same time. In this way, the ROI is 2l × 2w rather than the entire image, which greatly reduces the processing time. If TAB detects the robot in the related subgraph, the prediction matches the estimation; that is, no robot is missing. If not, in the next frame, the entire image should be set as the ROI for pose estimation. Besides, the entire image should also be processed once in a while to deal with the appearance of new robots in the image.

In summary, the diagram of the pose tracking scheme is illustrated in Fig. 5. First, the entire frame is set as the ROI, and TAB takes the ROI as input to estimate each robot's pose, i.e., the coordinates of each robot's key points. Then, a unique id is randomly assigned to each robot if it is the first frame. If not, the identity validation is executed. When missing or newly appearing robots occur, the entire image is set as the ROI for pose estimation and matched against the latest frame to assign identities. Last, if the validation result is normal, IOL predicts the coordinates of the key points, which are used to crop the entire image into subgraphs as the ROI and to assign identities.
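A minimal sketch of the cropping step, with illustrative half-sizes l and w (the paper does not fix their values here):

```python
import numpy as np

def crop_rois(frame, body_preds, l=60, w=60):
    """frame: HxWx3 next-frame image; body_preds: (n, 2) predicted
    body-joint pixel coordinates (b_ix, b_iy); l, w: assumed half-sizes."""
    h_img, w_img = frame.shape[:2]
    rois = []
    for bx, by in body_preds.astype(int):
        x0, x1 = max(bx - l, 0), min(bx + l, w_img)   # clip at image border
        y0, y1 = max(by - w, 0), min(by + w, h_img)
        rois.append(frame[y0:y1, x0:x1])              # subgraph S_i, 2l x 2w
    return rois  # the n subgraphs can be batched for parallel GPU inference
```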
Fig. 5. Diagram of the pose tracking scheme.
There is currently no public dataset of multiple multi-joint fish-like robots for pose estimation and tracking. Thus, we built a dataset (publicly released at https://github.com/xjh19971/Robotic-Fish-Pose-Dataset) using the typical kind of fish-like robots and the typical experimental platform (illustrated in Section II). The computer of the platform controls multiple robots to swim freely in the pool and saves the motion videos recorded by the global camera. To build a strong dataset, many factors were considered, including brightness, background color, robot location, robot direction, the number of robots, and collisions. Since a small dataset with diverse images is suitable for pose estimation of a specific problem [26], [27], representative images from different videos were chosen to make sure the images are varied enough to avoid overfitting. The dataset consists of 1100 images containing over 4000 different poses, covering different numbers of fish-like robots, uneven illumination, water ripples, and collisions. These images are divided into an image training set with 800 images and an image test set with 300 images. As for pose tracking, a video of five swimming fish-like robots with 300 continuous frames is selected as the video training set. A video of five swimming fish-like robots with 200 continuous frames is chosen as the video test set. In a word, the dataset consists of one image training set, one image test set, one video training set, and one video test set. Each image has 752×480 pixels. No duplicate images are shared between the training and test sets. The image training set and video training set come with labels, i.e., the coordinates of each robot's head, body, and tail, while the test sets do not.
Training 1: The ground truth of the robot detection bounding box is made from the key point location labels. Specifically, the minimum box that contains all the key points is set as the detection ground truth. Ground-truth heatmaps are placed as 2D Gaussian probability peaks (see the sketch after this paragraph). The entire image with 752×480 pixels is preprocessed to 416×288 pixels by resizing and padding. Further, data augmentation including the H channel (±10%), S channel (±20%), V channel (±30%), and flipping is utilized to take into account the effects of uneven illumination. Since a randomly initialized model with a small dataset does not lead to overfitting [39], a pretrained model is unnecessary. The TAB and IOL networks are randomly initialized at the beginning of training. The base learning rate is 8×10^−3. The learning rate drops to half of its previous value when the loss stops falling. TAB is trained in a TensorFlow environment with two RTX 2080Ti GPUs on a single i9-CPU computer. The training time is 5 hours. For comparison, representative human pose estimation methods are chosen as baselines, including the typical top-down method, the Simple Baseline [23] with ResNet50 backbone, and the typical bottom-up method, OpenPose [24] with VGG-16 backbone. The training settings of these baselines are the same as those of TAB.
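Minimal sketches of the two ground-truth constructions described above (the Gaussian spread sigma is an assumed value; the paper does not state it):

```python
import numpy as np

def bbox_from_keypoints(kps):
    """kps: (K, 2) labeled key points -> minimum enclosing box (x0, y0, x1, y1)."""
    x0, y0 = kps.min(axis=0)
    x1, y1 = kps.max(axis=0)
    return x0, y0, x1, y1

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """2D Gaussian probability peak centered at a labeled key point (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```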
Test 1: The object keypoint similarity (OKS) is utilized to measure the accuracy of pose estimation:

$$\mathrm{OKS} = \frac{\sum_i \exp\!\big(-d_i^2 / (2 S^2 \tau^2)\big)\, \delta(v_i = 1)}{\sum_i \delta(v_i = 1)}, \tag{7}$$

where d_i is the Euclidean distance between the estimate of key point i and the related ground truth, and S represents the scale factor, i.e., the square root of the robot region. v_i = 1 means that key point i has been detected, so δ(v_i = 1) implies that only detected key points are chosen for evaluation. τ is a factor representing the standard deviation of the labels, set to 4.0 for all key points according to the statistics of the dataset. The mean average precision (AP) is utilized to compare the performance of different algorithms. The threshold is taken every 0.05 from 0.5 to 0.95. The TAB model trained on the image training set is tested on the image test set using a computer with a single i9 CPU and a single RTX 2080Ti GPU. Test results are shown in Table I. The comparison of Methods (a, c) shows that, compared to the Simple Baseline with ResNet50 backbone, our TAB is 47.3% higher in average accuracy and 12 fps faster. The results of (b, c) show that, compared to OpenPose with VGG-16 backbone, our TAB is 38.5% higher in average accuracy but 16 fps slower. Considering that the online real-time control period of the fish-like robots is 40 ms, the processing time of TAB, 25 ms per frame, is sufficient. To gain insight into the performance of our proposed pose estimation method, some snapshots are shown in Fig. 4. It is obvious that the top-down approach may fail when one detection region contains more than one robot, and the bottom-up approach may assign key points ineffectively when one key point is next to the same type of key point of another robot. In contrast, our TAB, combining the advantages of top-down and bottom-up to obtain the detection and key point information at the same time, can accurately analyze the pose of each robot in the case of multi-robot collisions.
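For concreteness, a minimal sketch of computing the OKS in (7) as reconstructed above:

```python
import numpy as np

def oks(pred, gt, detected, area, tau=4.0):
    """pred, gt: (K, 2) estimated and ground-truth key points;
    detected: (K,) boolean array, True where v_i = 1;
    area: robot region in pixels, so S = sqrt(area) and S^2 = area."""
    d2 = np.sum((pred - gt) ** 2, axis=-1)       # squared distances d_i^2
    e = np.exp(-d2 / (2.0 * area * tau ** 2))    # per-key-point similarity
    return e[detected].sum() / max(detected.sum(), 1)
```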
Fig. 4. The snapshots of pose estimation test results.
Training 2: The input of IOL is sequential data of size 5×300×6, where 5 is the number of robots, 300 is the sequence length, and 6 is the number of coordinates for the three types of key points. All coordinates are normalized by the height and width of the images. Coordinate flipping and randomly dropping some coordinates within a sequence are applied for data augmentation. The time step of the LSTM is set to 30, which means that the 30 latest previous frames are utilized to predict the next frame. The learning rate is 4×10^−3 and is halved when the loss stops falling. There are 4000 training epochs in total. The mini-batch size is 128. The Adam optimizer is utilized. IOL is trained in a TensorFlow environment with a single RTX 2080Ti GPU on a single i9-CPU computer. The training costs 3 hours.
TABLE I POSE ESTIMATION RESULTS ON MULTIPLE MULTI-JOINT FISH-LIKE ROBOT IMAGE TEST SET
TABLE II POSE TRACKING RESULTS ON MULTIPLE MULTI-JOINT FISH-LIKE ROBOT VIDEO TEST SET
TABLE III POSE ESTIMATION ABLATION STUDIES RESULTS ON MULTIPLE MULTI-JOINT FISH-LIKE ROBOT IMAGE TEST SET
Test 2: The tracking accuracy is evaluated by the mean squared error (MSE) between the predicted coordinates (x̂, ŷ) and the ground truth (x, y):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\big[(\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2\big],$$

where n is the number of each type of key point. The IOL model trained on the video training set is tested on the video test set. Test results are shown in Table II. When the time step is 30, the average distance errors between the prediction and the ground truth are 0.74, 0.85, and 0.96 pixels, respectively, and the processing speed is 32 fps. Further, to evaluate the influence of the time step on performance, the network is retrained from scratch with a time step of 10; the average distance errors are then 1.15, 1.26, and 1.50 pixels, respectively, and the processing speed is 105 fps. It is exciting that IOL achieves pixel-level prediction in real time. A first-order Kalman filter [21] is chosen for comparison. The average errors of the Kalman filter approach are 4.75, 4.50, and 5.98 pixels for the head, body, and tail, respectively, about 4 times those of IOL. Although IOL's processing speed is slower than that of the Kalman filter approach, considering that IOL already meets the real-time requirement, our IOL is the better choice for pose tracking of fish-like robots.
Pose estimation ablation studies are conducted to demonstrate the novelty and contribution of our TAB. First, to validate the performance of different methods under the same backbone, TAB with ResNet50 backbone, OpenPose with EfficientNet-B4 (ENB4) backbone, and the Simple Baseline with EfficientNet-B4 backbone are trained and tested. Second, to validate the significance of combining top-down and bottom-up, four TAB variant networks are designed. The network without the Detector (respectively, Estimator) is denoted as TAB-w/o-D (respectively, TAB-w/o-E). The network whose Detector (respectively, Estimator) is frozen during testing is denoted as TAB-f-D (respectively, TAB-f-E). These four TAB variant networks with EfficientNet-B4 backbone are trained and tested with the same settings as TAB. Ablation results are illustrated in Table III. Comparing Table III with Table I, the results of methods (a, c, d, e, f) show that the backbone is not the key to the excellent performance of TAB. Under the same backbone, TAB has both accuracy and speed advantages over the typical human pose estimation methods. Results (c, h, j) show that the performance of TAB decreases considerably without the parallel structure during training. Results (g, h) and (i, j) illustrate that the parallel structure utilized in the training process benefits performance even if it is frozen during testing. We believe this is because the parallel structure implicitly combines the detection and estimation features so that each improves the other. These ablation studies clearly demonstrate the novelty and contribution of our proposed pose estimation method.
Two IOL variant networks are proposed to evaluate the novelty and contribution of the IOL structure. The first network extracts the feature of each type of key point with independent LSTM units, named IL. A common fully connected layer at the end of the network converts the independent information into coordinate predictions. That is, there is no common LSTM unit for explicitly fusing the independent information to extract the overall constraints between key points. The other network fuses the features of all key points at the beginning of the network using a common fully connected layer. There is no independent LSTM unit for explicitly extracting the independent movement characteristic of each type of key point. Such a network is named OL. The structures of IL and OL are shown in Fig. 6. Note that the network units are the same as those in IOL. IL and OL are trained on the video training set with the same settings as IOL. Ablation results, shown in Table IV, illustrate that the difference in structure brings a huge difference in performance. OL has an accuracy advantage over IL, while IL has a speed advantage over OL. Moreover, IOL, considering both the independent movement characteristics and the overall constraints, improves accuracy considerably while maintaining a high processing speed. To analyze why IOL performs so well, we represent each layer of the network as a function and compare the expressions of the different structures. Let x_k be the coordinates of the k-th type of key point, F_l^w(·) the w-th fully connected unit in layer l, and L_l^w(·) the w-th LSTM unit in layer l. Assume that there are m types of key points. Then, with ⊕ denoting feature concatenation, the outputs of IOL, IL, and OL can be written as

IOL: ŷ = F_3(L_2(L_1^1(F_1^1(x_1)) ⊕ ··· ⊕ L_1^m(F_1^m(x_m)))),
IL:  ŷ = F_3(L_1^1(F_1^1(x_1)) ⊕ ··· ⊕ L_1^m(F_1^m(x_m))),
OL:  ŷ = F_3(L_2(F_1(x_1 ⊕ ··· ⊕ x_m))).

IOL is the only variant that both extracts per-key-point features through the independent units F_1^k and L_1^k and fuses them through the common LSTM L_2; IL lacks the common LSTM, and OL lacks the independent units.
Fig. 6. The structures of the pose prediction ablation studies, including: (a) the IL network and (b) the OL network.
TABLE IV POSE TRACKING ABLATION RESULTS ON MULTIPLE MULTI-JOINT FISH-LIKE ROBOT VIDEO TEST SET
The effectiveness of our proposed pose tracking system is evaluated on the video test set. The pose estimation and tracking results are shown in Fig. 7. Intuitively, in extreme situations, including uneven lighting, water ripples, boundary effects, and collisions, our system can accurately recognize the pose of each robot and track it. To quantify the performance of our system, representative network-based animal pose tracking systems are chosen as baselines, including DeepLabCut [26], LEAP with top-down structure (LEAP-TD), and LEAP [27] with bottom-up structure (LEAP-BU). The animal pose systems are trained on the video training set under the same settings as our system. The pose estimation results on the video test set are shown in Table V. Results of methods (k, l, m, n) show that our proposed system has both accuracy and speed advantages over the existing network-based animal pose tracking systems. The average accuracy of TAB is 41.3%, 25.7%, and 23.7% higher than theirs, respectively, and the processing speed is 18 fps, 30 fps, and 26 fps faster, respectively. Fig. 8(a) shows the comparison results between our TAB and the animal systems when collisions occur. Obviously, the animal pose tracking systems fail when the robots are intertwined.
Further, to illustrate the contribution of this paper to the field of robotic fish, our system is compared with the typical marker-less fish-like robot pose tracking system based on template matching [21]. Such a traditional vision-based method does not require training but requires manually identifying the robots at the beginning. Fig. 8(b) shows the comparison results under collisions, boundary effects, uneven illumination, and water ripples. Fig. 9 shows the tracking trajectories of the five fish-like robots under our proposed method and under the template matching method, respectively. Obviously, the traditional system loses the robots when collisions occur and estimates poses with non-negligible error in extreme situations. In contrast, our method is more accurate than the template matching method, with less failed tracking and fewer identity switches.
TABLE V POSE ESTIMATION RESULTS ON MULTIPLE MULTI-JOINT FISH-LIKE ROBOT VIDEO TEST SET
Fig. 8. The comparison snapshots of the pose tracking on the multiple multi-joint fish-like robot video test set.
Moreover, the tracking errors with respect to the ground truth are shown in Fig. 10, ignoring the situations where a robot is not detected (since the tracking error is then difficult to define). Although the tracking errors of the animal systems seem to be the same as those of our system, according to Table V and the OKS in (7), the situations where key points are not detected happen more frequently for them than for our system. Besides, the template matching system has a larger tracking error than the network-based systems, which reveals the bottleneck of fish-like robot coordination research. This is also the motivation of this paper.
Until now, the superior performance of our proposed method for pose estimation and tracking of multiple multi-joint fish-like robots has been demonstrated. On this basis, in this subsection, experiments of controlling a group of fish-like robots to perform a coordination task are carried out to further verify the effectiveness and practicability of our method. The formation control task is chosen since it is one of the most actively studied topics in the coordination control of multi-robot systems [40]–[44]. In particular, circle formation experiments were conducted, where a group of three fish-like robots was randomly placed in the tank initially, and their goal was to form a uniform circle formation while rotating clockwise around a preset target. For this purpose, we adopt a reinforcement learning (RL) approach working in a decentralized manner to obtain the controller for circle formation control [45]. The real-time position and orientation of each robot are obtained by our precise pose tracking system. Then the RL-based controller, with the pose information as its input, steers the three fish-like robots to swim clockwise on the expected circle and to form an equilateral triangle, an isosceles right triangle, and a right triangle with [1/2, 1/3, 1/6]π angles in turn. Snapshots of the formation control experiments are shown in Fig. 11. These excellent results clearly demonstrate that our proposed tracking system lays a key foundation for coordination research on multiple multi-joint fish-like robots in a real working environment.
Fig. 11. Snapshots of the circle formation task using our real-time precise pose tracking system. A group of three fish-like robots starts the circle formation task at t = 0 s. At around t = 75 s, they swam clockwise on the expected circle and presented an equilateral triangle distribution. At around t = 90 s, the formation was changed to an isosceles right triangle. At around t = 180 s, the formation was changed to a right triangle with [1/2, 1/3, 1/6]π angles.
Fig. 9. Tracking trajectories of the five fish-like robots. (a) Our proposed method. (b) The template matching method.
Fig. 10. The comparison results of the pose tracking on the multiple multi-joint fish-like robot video test set.
This paper addresses the online real-time precise planar pose estimation and tracking problem for multiple multi-joint fish-like robots. A novel network-based pose estimation method, combining the top-down and bottom-up approaches, has been proposed. Then, pose tracking has been conducted by a novel network-based pose coordinate prediction method, considering the independent movement characteristic of each type of key point and the overall constraints among the key points. Experiments show that our method achieves better accuracy and real-time performance than existing network-based methods and the traditional robotic fish pose tracking method. The formation control experiments for a group of fish-like robots demonstrate that our method lays a key foundation for the coordination control of multiple multi-joint fish-like robots in a real working environment. We believe that our method can inspire the tracking of fish-like objects in real working environments and promote robotic fish coordination research. In the future, based on this paper, we will carry out 3D pose tracking research on multiple fish-like robots.