LU Ping, SHENG Bin, SHI Wenzhe
(1. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518055, China; 2. ZTE Corporation, Shenzhen 518057, China; 3. Shanghai Jiao Tong University, Shanghai 200240, China)
Abstract: With the rapid popularization of mobile devices and the wide application of various sensors, scene perception methods applied to mobile devices occupy an important position in location-based services such as navigation and augmented reality (AR). The development of deep learning technologies has greatly improved the visual perception ability of machines to scenes. The basic framework of scene visual perception, related technologies and the specific process applied to AR navigation are introduced, and directions for future technology development are proposed. An application (APP) is designed to improve the application effect of AR navigation. The APP includes three modules: navigation map generation, cloud navigation algorithm, and client design. The navigation map generation tool works offline. The cloud saves the navigation map and provides navigation algorithms for the terminal. The terminal realizes local real-time positioning and AR path rendering.
Keywords: 3D reconstruction; image matching; visual localization; AR navigation; deep learning
Navigation services on mobile devices are an indispensable part of modern society. Outdoor positioning and navigation technology is now mature: the Global Positioning System (GPS) provides relatively accurate position information and supporting navigation services for outdoor pedestrians. For example, the navigation products of Baidu, Amap, Tencent and other companies meet the location and navigation needs of outdoor pedestrians. However, once pedestrians go indoors, e.g., into shopping malls, airports, underground parking lots and other sheltered places, the positioning signal is greatly attenuated by obstacles such as walls, and GPS-based outdoor navigation becomes insufficient. Existing indoor localization methods are constrained in localization accuracy, deployment overhead, and resource consumption, which limits their adoption in real-world navigation applications.
In recent years, researchers have designed a variety of indoor and outdoor positioning solutions based on various types of information, such as visible light communication (VLC), built-in sensors, QR codes, and Wi-Fi. However, these solutions have many shortcomings in terms of localization accuracy, deployment difficulty, and equipment overhead. For example, VLC-based methods require indoor LED lights to be upgraded on a large scale, which greatly increases deployment costs. Meanwhile, Wi-Fi-based methods cannot provide accurate direction information, which makes it difficult for them to meet the needs of precise localization.
In contrast, visual scene perception methods perform target recognition and position calculation by means of image processing. They provide relatively high positioning precision without requiring additional devices to be deployed, and have therefore been widely researched and applied in recent years.
The main application of scene perception is visual localization, which determines the 6-degree-of-freedom (6-DoF) pose of a camera from an image. Visual localization usually requires a sparse model of the scene and an estimated pose of the query image as initialization conditions. Augmented reality (AR) navigation is an important application scenario of visual localization technologies, in which localization allows virtual content to interact with the real world. AR navigation technologies have great prospects. Shopping malls have the greatest demand for localization and navigation technologies, and users are very interested in store discount information, personalized advertisements, store ratings, store locations, and indoor route guidance. Scene visual perception and AR navigation can meet most of these needs well, and have vast potential for creating added value in the future.
This paper introduces the design and implementation of an AR navigation application (APP) and its cloud algorithms in detail, covering three aspects: navigation map generation, the cloud navigation algorithm, and the client design. Using specific cases, it describes the process of panoramic data acquisition and processing and the alignment of the point cloud map [1] with the computer aided design (CAD) map in the navigation map generation tool, as well as the path planning and path correction algorithms in the cloud navigation algorithm. The client design is then described in terms of localization and AR path rendering, and finally a running example of the AR navigation APP is given.
Similar to humans, machines perceive and understand the environment mostly through visual information. In recent years, the development of 3D visual perception methods has provided great help for building models of the real physical world. For various application scenarios, there are currently some vision algorithms with commercial application capabilities, including face recognition, liveness detection, 3D reconstruction, simultaneous localization and mapping (SLAM), gesture recognition, behavior analysis, augmented reality, and virtual reality.
Scene visual perception applied to navigation mainly includes 3D reconstruction and SLAM, both of which can be regarded as processes for building a visual map. Visual map-based localization usually includes steps such as visual map construction and update, image retrieval, and fine localization, among which the visual map is the core of the method. Depending on whether the image frames have accurate prior pose information, the methods for constructing a visual map can be divided into prior pose-based methods and non-prior pose methods. In the prior pose-based methods, the prior pose of an image frame can be derived from high-precision LiDAR data synchronized and calibrated with the camera, which is common in high-precision acquisition vehicles in the field of autonomous driving. In small-scale scenes, especially indoors, the prior pose can also be obtained from visual motion capture systems such as Vicon and OptiTrack. The non-prior pose methods extract feature points offline and optimize the poses and scene structure offline, which is similar to structure-from-motion (SfM). The constructed geometric visual map generally includes image frames, feature points and descriptors, 3D points, the correspondences between image frames, and the correspondences between 2D points and 3D points. Because the real scene changes over time, the constructed visual map also needs to be updated synchronously: new and expired content must be detected in time and the corresponding changes applied to the map. Once the prior visual map is available, image retrieval and fine localization can be performed on each newly acquired image frame to complete localization. In the visual map-based localization framework, sensor information from an inertial measurement unit (IMU), GPS, and a wheel odometer can also be fused.
Accurate and robust 3D reconstruction methods are crucial to visual localization. The purpose of 3D reconstruction is to obtain the geometry and structure of an object or a scene from a set of images. SfM is one way to achieve 3D reconstruction and is mainly used to build the sparse point cloud, that is, to map and restore the structure of the scene. A complete 3D reconstruction process usually also includes a multi-view stereo (MVS) step to achieve dense reconstruction. According to differences in the image data processing flow, SfM methods usually fall into four categories: incremental SfM, global SfM, distributed SfM, and hybrid SfM. Among them, distributed SfM and hybrid SfM are usually used for large-scale reconstruction and are built on incremental SfM and global SfM. Incremental SfM mainly includes two steps. The first step is to find the initial correspondences, and the second step is to perform incremental reconstruction. The former aims to extract robust and well-distributed features to match image pairs, and the latter estimates the image poses and 3D structure through image registration, triangulation, bundle adjustment (BA), and outlier removal. Outliers among the initial correspondences usually need to be removed by geometric verification. Generally, global BA is required once the number of newly recovered image frames reaches a certain proportion. Owing to this repeated BA processing, incremental SfM usually has higher accuracy and better robustness. However, as the number of images increases, the scale of BA processing becomes larger, leading to disadvantages such as low efficiency and large memory usage. Additionally, incremental SfM suffers from cumulative drift as images are incrementally added. Typical SfM frameworks include Bundler and COLMAP.
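As a minimal illustration of the quantity that BA minimizes, the following sketch computes the reprojection error of one 3D point in one image under a pinhole camera model. The function name, the intrinsic matrix and the sample values are assumptions made for illustration and are not taken from any particular SfM framework.

```python
import numpy as np

def reprojection_error(K, R, t, X, x_obs):
    """Pixel residual between an observed 2D point and the projection of 3D point X.

    K: 3x3 intrinsics, R: 3x3 rotation, t: translation (world -> camera),
    X: 3D point in world coordinates, x_obs: observed pixel (u, v).
    """
    X_cam = R @ X + t              # world frame -> camera frame
    x_hom = K @ X_cam              # camera frame -> homogeneous pixel coordinates
    x_proj = x_hom[:2] / x_hom[2]  # perspective division
    return x_proj - x_obs

# Hypothetical camera and point, chosen only for illustration.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.zeros(3)
X = np.array([0.2, -0.1, 2.0])
residual = reprojection_error(K, R, t, X, np.array([370.0, 215.0]))
print(residual)
```

BA stacks such residuals over all observations and adjusts the camera poses and 3D points jointly to minimize their squared sum.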
CAO et al. [2] proposed a fast and robust feature tracking method for 3D reconstruction using SfM. First, to save computational costs, a feature clustering method was used to cluster a large image set into small ones and thereby avoid some wrong feature matches. Second, a union-find set method was used to achieve fast feature matching, which further reduced the computational time of feature tracking. Third, a geometric constraint method was proposed to remove outliers from the trajectories produced by feature tracking. The method could cope with the effects of image distortion, scale changes, and illumination changes. LINDENBERGER et al. [3] directly aligned low-level image information from multiple views: feature point locations were optimized using a deep feature metric after feature matching, and BA was performed with a similar deep feature metric during incremental reconstruction. In this process, a convolutional network was used to extract a dense feature map from each image, the positions of the feature points in the images were then adjusted according to the sparse feature matches to obtain the 2D observations of the same 3D point in different images, and the SfM reconstruction was completed with the adjusted observations. The residual optimized by BA during reconstruction thus changes from the reprojection error to a feature metric error. This improvement is robust to large detection noise and appearance changes, as it optimizes feature metric errors based on dense features predicted by neural networks.
The cumulative drift problem can be solved by global SfM. The fundamental and essential matrices between images obtained in the image matching process can be decomposed into relative rotations and relative translations. Using the relative rotations as constraints, the global rotations can be recovered, and the global translations can then be recovered from the global rotations and the relative translation constraints. Since global BA is run only once rather than repeatedly, global SfM is more efficient. However, because the relative translation constraints only constrain the translation direction and the scale is unknown, translation averaging is difficult to solve. In addition, translation averaging is sensitive to outliers, so global SfM is limited in practical applications.
How to extract robust, accurate, and sufficient image correspondences is a key issue in 3D reconstruction. With the development of deep learning, learning-based image matching methods have achieved excellent performance. A typical image matching process usually includes three steps: feature extraction, feature description, and feature matching.
Detection methods based on deep convolutional networks search for interest points by constructing response maps, and include supervised methods [4–5], self-supervised methods [6–7], and unsupervised methods [8–9]. Supervised methods use anchors to guide the training process of the model, but the performance of the model is likely to be limited by the anchor construction method. Self-supervised and unsupervised methods do not require human-annotated data and instead focus on geometric constraints between image pairs. Feature descriptors use local information around interest points to establish the correct correspondence of image features. Owing to their information extraction and representation capabilities, deep learning techniques have also achieved good performance in feature description. Deep learning-based feature description is usually posed as a supervised learning problem, that is, learning a representation in which matched features are as close as possible in the measurement space and unmatched features are as far apart as possible [10]. Learning-based descriptors largely avoid the need for human experience and prior knowledge. Existing learning-based feature description methods fall into two categories, namely metric learning [11–12] and descriptor learning [13–14], which differ in the output of the descriptor. Metric learning methods learn metric discriminants for similarity measurement, while descriptor learning generates descriptor representations from raw images or image patches.
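The objective of pulling matched features together and pushing unmatched features apart can be illustrated with a triplet margin loss, which is one common choice and is used here only as an assumed example; the cited methods may use other losses. The function name and the margin value are hypothetical.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Loss is zero once the matched pair is closer than the unmatched pair by `margin`."""
    d_pos = np.linalg.norm(anchor - positive)  # distance to the matching descriptor
    d_neg = np.linalg.norm(anchor - negative)  # distance to a non-matching descriptor
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([0.0, 0.0])
match = np.array([0.0, 0.1])
non_match = np.array([3.0, 4.0])

# Well-separated descriptors incur no loss.
print(triplet_margin_loss(anchor, match, non_match))
# Swapping the roles simulates a badly trained descriptor and yields a large loss.
print(triplet_margin_loss(anchor, non_match, match))
```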
Among these methods, SuperGlue [14] proposed a network capable of matching features and filtering outliers simultaneously, in which feature matching was achieved by solving a differentiable optimal transport problem. The matching costs were predicted by a graph neural network, and a flexible context aggregation mechanism was proposed based on the attention mechanism, which enabled SuperGlue to reason about the underlying 3D scene and the feature assignments jointly. LoFTR [15] used a transformer module with self-attention and cross-attention layers to process dense local features extracted from convolutional networks. Dense matches were first extracted at a low feature resolution (1/8 of the image dimension), from which high-confidence matches were selected and refined to a high-resolution sub-pixel level using correlation-based methods. In this way, the large receptive field of the model enabled the transformed features to reflect context and location information, and the prior matching was achieved through multiple self-attention and cross-attention layers. Many methods integrate feature detection, feature description, and feature matching into end-to-end matching pipelines, which is beneficial for improving matching performance.
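The optimal transport view of matching can be sketched with a few Sinkhorn iterations, which alternately normalize the rows and columns of a score matrix until it approximates a soft assignment. This is a simplified sketch under stated assumptions: the score matrix is made up, the function names are hypothetical, and the extra "dustbin" row and column that SuperGlue uses for unmatched points are omitted.

```python
import numpy as np

def log_sum_exp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis, keeping dimensions."""
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))

def sinkhorn(scores, n_iters=50):
    """Turn a square score matrix into an approximately doubly stochastic matrix."""
    log_p = np.asarray(scores, dtype=float).copy()
    for _ in range(n_iters):
        log_p = log_p - log_sum_exp(log_p, axis=1)  # normalize rows
        log_p = log_p - log_sum_exp(log_p, axis=0)  # normalize columns
    return np.exp(log_p)

# Hypothetical similarity scores between 3 keypoints in image A and 3 in image B.
scores = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.5, 0.5],
                   [0.5, 0.0, 2.5]])
assignment = sinkhorn(scores)
matches = assignment.argmax(axis=1)  # best partner in B for each keypoint in A
print(matches)
```

Because every step is differentiable, such a layer can be trained end to end together with the network that produces the scores.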
Visual localization is the problem of estimating the 6-DoF pose of the camera from which a given image was taken, relative to a reference scene representation. Classical approaches to visual localization are structure-based, which means that they rely on 3D reconstructions of the environment (e.g., point clouds) and use local feature matching to establish correspondences between query images and 3D maps. Image retrieval can be used to reduce the search space by considering only the most similar reference images instead of all possibilities. Another approach is to directly interpolate the pose from the reference images or to estimate the relative pose between the query and the retrieved reference image, which does not rely on 3D reconstruction results. Scene point regression methods directly obtain the correspondences between 2D pixel positions and 3D points using a deep neural network (DNN), and then compute camera poses in a way similar to structure-based methods. Modern scene point regression methods benefit from 3D reconstruction during training but do not rely on it. Absolute pose regression methods use a DNN to estimate poses end-to-end. These methods differ in generalization ability and localization accuracy. Furthermore, some methods rely on 3D reconstruction, while others only require pose-labeled reference images. The advantage of using 3D reconstructions is that the estimated poses can be very accurate, while the disadvantage is that these 3D reconstructions are sometimes difficult to obtain and even more difficult to maintain. For example, if the environment changes, they need to be updated.
A typical example of the structure-based approach is the general visual localization pipeline proposed in Ref. [17]. The pipeline follows a hierarchical, coarse-to-fine localization paradigm: it first performs global retrieval to obtain location hypotheses and then matches local features within these candidate locations to obtain an accurate 6-DoF pose. This hierarchical approach saves runtime for real-time operation. The work also proposes a hierarchical feature network (HF-Net) that jointly estimates local and global features, thereby maximizing shared computation, and compresses the model through multi-task distillation.
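The coarse retrieval stage of such a pipeline can be sketched as a nearest-neighbor search over global image descriptors. The descriptors below are made-up three-dimensional vectors and the function name is hypothetical; a real system would use high-dimensional learned descriptors and would follow this step with local feature matching and pose estimation.

```python
import numpy as np

def retrieve_top_k(query_desc, db_descs, k=2):
    """Return indices of the k database images most similar to the query (cosine similarity)."""
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    sims = db @ q                     # cosine similarity to every reference image
    return np.argsort(-sims)[:k].tolist()

# Hypothetical global descriptors of four reference images.
db_descs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 0.0, 1.0]])
query_desc = np.array([1.0, 0.05, 0.0])

candidates = retrieve_top_k(query_desc, db_descs, k=2)
print(candidates)  # only these candidates go on to local feature matching
```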
AR navigation usually works as follows: 1) the real-world view is captured from the user’s point of view; 2) location information is obtained and used to track the user; 3) virtual-world information is generated based on the real-world view and the location information; 4) the generated virtual-world information is registered into the real-world view and displayed to the user, creating augmented reality. The main challenge of AR navigation is how to integrate the virtual and real worlds and how to design and present the navigation interface. Registration is the process of correctly aligning virtual information with the real world, which gives the user the illusion that the virtual and the real coexist. For AR navigation, accurate registration is critical: registration errors can cause confusion when the orientation changes rapidly, so even small offsets in registering virtual information can be harmful. In an AR navigation system, the display should also not interfere with the user’s movement. A common augmented reality display technology is video see-through, in which a digital screen is placed between the real world and the user. A camera captures the real-world view, the augmented information is superimposed on it, and the combined result is displayed on the screen, so that the user sees both the real world and the augmented information. Typical examples include head-mounted displays with cameras and smartphone displays.
On the basis of scene visual perception, this paper designs an AR navigation APP developed with Unity and ARCore. Its overall framework is shown in Fig. 1. The system consists of three parts, namely, the navigation map generation tool, the cloud navigation algorithm, and the terminal navigation APP design.
The navigation map generation tool works offline and provides functions including scene panoramic video capture, dense point cloud generation, alignment of the point cloud with the plane CAD map, and navigation map management. The map generated by the tool is stored in the cloud. In addition, the cloud is responsible for providing navigation algorithms to the terminal, including visual localization methods, path planning algorithms, path correction algorithms, floor judgment algorithms and cross-layer guidance algorithms. When users request a navigation activity with the terminal APP, they first select the map of their current location, and the cloud issues the corresponding navigation map according to the selection. After selecting the starting point and ending point, the user requests the navigation service from the cloud, and the local APP realizes local real-time localization, display of the global path and current position, and AR path rendering.
▲Figure 1. Overall framework of an AR navigation application (APP)
This paper uses a panoramic camera to capture video for collecting mapping data. Instead of being rotated around its optical center, the panoramic camera is moved through the scene to capture multiple images from different viewpoints, from which stereoscopic information about the scene can be calculated. The stereo information is then used to create a 3D model of the scene, from which arbitrary views can be computed. This approach is beneficial for 3D reconstruction of large-scale scenes. The dense reconstruction results of the proposed approach on the building dataset are shown in Fig. 2.
Taking a large shopping mall as an example, the processing and 3D reconstruction of the data collected from the panoramic video go through the following steps:
1) Shoot a panoramic video of the scene, covering the shooting area as completely as possible;
2) Split the panoramic video into frames to obtain panoramic images, and segment each panoramic image according to the field of view (FOV);
3) Perform sparse point cloud reconstruction for each floor and output all camera parameters and the sparse 3D point cloud;
4) Complete the dense point cloud reconstruction of each single floor;
5) Integrate the dense point clouds of multiple floors to obtain a complete 3D structure of the scene.
The point cloud obtained in Section 3.1 is expressed in the camera coordinate system and must be aligned with the world coordinate system before it can be used for navigation tasks. This paper takes the CAD map as the world coordinate system, because CAD provides accurate position and scale information. The problem is thus transformed into aligning the point cloud map with the plane CAD map. The specific process is as follows:
1) The point cloud is dimensionally reduced and projected onto the XoY plane to form a plane point cloud map, as shown in Fig. 3.
2) Marker points (such as walls and other easily distinguished points) and their corresponding points are found on the plane point cloud map and the CAD map, respectively.
3) Alignment is completed using the scale information provided by the CAD map, and the rotation and displacement matrices are output.
▲Figure 2. Result of dense reconstruction: (a) photometric depth map, (b) photometric normal map, (c) geometric depth map, (d) geometric normal map, and (e) dense reconstruction effect
Once the point cloud X is sampled, it can be mapped to a 2D plane by simply removing the z coordinates. The problem is transformed into finding the mapping between (Xx, Xy) and pixels (u, v), where (Xx, Xy) is the set of 2D coordinates (x, y) extracted from the point cloud X. It is worth noting that (x, y) are usually floating-point values, while pixel coordinates (u, v) are usually positive integer values. Therefore, (x, y) must undergo scaling, rotation, and rounding.
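The projection described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name and the `resolution` parameter (meters per pixel) are hypothetical, the map origin is taken as the minimum (x, y) of the cloud, and the rotation step is omitted for brevity.

```python
import numpy as np

def project_to_plane_map(points, resolution):
    """Drop z, scale (x, y) by `resolution` and round down to integer pixels (u, v).

    points: (N, 3) array of 3D points; resolution: assumed meters per pixel.
    Returns a binary occupancy image and the pixel coordinates of each point.
    """
    xy = points[:, :2]                      # remove the z coordinate
    origin = xy.min(axis=0)                 # shift so that pixel indices start at 0
    uv = np.floor((xy - origin) / resolution).astype(int)
    height, width = uv[:, 1].max() + 1, uv[:, 0].max() + 1
    image = np.zeros((height, width), dtype=np.uint8)
    image[uv[:, 1], uv[:, 0]] = 255         # row index is v, column index is u
    return image, uv

# Three made-up points, 0.5 m per pixel.
points = np.array([[0.0, 0.0, 1.2],
                   [1.0, 0.5, 0.3],
                   [2.0, 1.0, 2.5]])
image, uv = project_to_plane_map(points, resolution=0.5)
print(image.shape, uv.tolist())
```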
Once the plane point cloud map is obtained, it can be aligned with the CAD map through an affine transformation. Determining the affine matrix requires at least three pairs of non-collinear corresponding points. To reduce errors, this paper selects more than three pairs of corresponding points in the point cloud map and the CAD map, and uses the least squares method to achieve alignment. The corresponding points should be chosen from parts that are easy to identify, such as walls and other fixed objects with clear structural characteristics. Fig. 3 shows the process of aligning a point cloud map with a CAD map. After the alignment, the position coordinates of the point cloud in the world coordinate system can be obtained, which benefits the subsequent localization and navigation tasks. The results can be saved separately for each scene; the saved content includes the scene pose, corresponding geographic information, camera model, and other information, which together form a navigation digital map.
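The least squares estimation of the affine transformation can be sketched with NumPy as below. The function names and the sample correspondences are assumptions made for illustration; in practice the point pairs would be the manually selected markers on the point cloud map and the CAD map.

```python
import numpy as np

def fit_affine_2d(src, dst):
    """Least squares 2D affine transform mapping src points to dst points.

    src, dst: (N, 2) arrays with N >= 3 non-collinear correspondences.
    Returns a (2, 3) matrix [M | t] such that dst ~= src @ M.T + t.
    """
    n = src.shape[0]
    A = np.hstack([src, np.ones((n, 1))])             # homogeneous source points, (N, 3)
    params, *_ = np.linalg.lstsq(A, dst, rcond=None)  # solves A @ params ~= dst
    return params.T

def apply_affine_2d(affine, pts):
    """Apply a (2, 3) affine matrix to (N, 2) points."""
    return pts @ affine[:, :2].T + affine[:, 2]

# Hypothetical correspondences: point cloud map pixels -> CAD coordinates.
map_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_affine = np.array([[2.0, 0.0, 1.0],
                        [0.0, 2.0, -3.0]])
cad_pts = apply_affine_2d(true_affine, map_pts)

estimated = fit_affine_2d(map_pts, cad_pts)
print(np.round(estimated, 6))
```

Using more than three pairs makes the system overdetermined, so errors in individual manually selected points are averaged out.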
▲Figure 3. An example of 2D point cloud map generation
When users request a navigation activity with the terminal APP, they first select the map corresponding to their current location, and the cloud issues the corresponding navigation map according to the selection. After selecting the destination, the user requests the navigation service from the cloud and at the same time uploads an image of the current scene. The cloud then invokes the visual localization algorithm to determine the user’s current position, which serves as the starting point. After obtaining the coordinates of the starting point and the ending point, the cloud calls the path planning algorithm to obtain the sequence of navigation path points and sends it to the terminal APP for AR rendering. While traveling, the user is positioned locally through ARCore. However, this method accumulates errors after a certain travel distance, and the user may also deviate from the recommended path, so a path correction algorithm needs to be provided by the cloud to direct the user back to the correct path.
According to common practice in the industry, the path planning algorithm designed in this paper does not need to provide a path between two arbitrary points. It only needs to provide a path from any point (the user location or a user-selected location) to a specific point (one of a specified set of end-points). Therefore, the path planning problem in this paper can be regarded as a shortest path problem between the vertices of a directed graph. The basic flow of the path planning algorithm proposed in this paper is as follows:
1) The passable area is determined from the point cloud map, and waypoints are selected in the passable area.
2) The waypoints and the destination points (the selectable end-points) form a graph structure.
3) The shortest paths among all vertices in the graph are found through a search algorithm.
Building the waypoints and destination points into a graph structure forms a road network. In this process, it is necessary to determine the world coordinates of the waypoints and destination points and to mark the connections between points, which yields a graph structure of the road network stored as an adjacency list. Since the goal is to find the shortest paths among all vertices in the graph, this constitutes an all pairs shortest paths (APSP) problem. The general solution to the APSP problem is the Floyd-Warshall algorithm. After the shortest paths among all points are obtained, the results are saved in the cloud for each scene. In practical applications, there is therefore no need to compute the planned path online; only a lookup is required, which saves time.
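A minimal sketch of the Floyd-Warshall algorithm with path reconstruction is given below. The small directed road network, its edge weights and the function names are made up for illustration; the precomputed `dist` and `nxt` tables correspond to the results that would be stored in the cloud and queried at navigation time.

```python
import math

def floyd_warshall(n, edges):
    """All pairs shortest paths on a directed graph with n vertices.

    edges: iterable of (u, v, weight). Returns (dist, nxt), where nxt[u][v]
    is the next vertex after u on a shortest path from u to v.
    """
    dist = [[math.inf] * n for _ in range(n)]
    nxt = [[None] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        nxt[i][i] = i
    for u, v, w in edges:
        if w < dist[u][v]:
            dist[u][v] = w
            nxt[u][v] = v
    for k in range(n):                      # allow vertex k as an intermediate stop
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
                    nxt[i][j] = nxt[i][k]
    return dist, nxt

def reconstruct_path(nxt, u, v):
    """Look up the stored shortest path from u to v (empty list if unreachable)."""
    if nxt[u][v] is None:
        return []
    path = [u]
    while u != v:
        u = nxt[u][v]
        path.append(u)
    return path

# Hypothetical road network: vertices 0-2 are waypoints, vertex 3 is a destination.
edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0), (2, 3, 5.0)]
dist, nxt = floyd_warshall(4, edges)
print(dist[0][3], reconstruct_path(nxt, 0, 3))
```

The cubic cost of the triple loop is paid once offline; each online query is then a table lookup whose cost is proportional to the path length.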
During the user’s journey, the error of the local positioning provided by ARCore gradually grows with the distance traveled. At the same time, the user may deviate from the recommended navigation path for internal or external reasons. Therefore, the cloud needs to provide a path correction algorithm to guide the user back to the navigation path (the correct path). The specific workflow of the path correction algorithm is as follows:
1) The user uploads the current scene image while traveling.
2) The cloud uses the positioning algorithm to determine whether the user has deviated from the navigation path recommended by the algorithm.
3) If the user’s deviation is small, the user is guided back to the recommended navigation path through the navigation arrows of the terminal APP. If the deviation is too large, the path is re-planned from the user’s current position.
The path correction process is in effect a verification of the real-time local positioning information fed back by the terminal. When the error exceeds the distance threshold τ, the path correction function is activated. In practical applications, the distance threshold τ is usually chosen between 50 cm and 200 cm. If the threshold is too small, the influence of visual positioning errors increases. If the threshold is too large, navigation accuracy is lost and users are inconvenienced.
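The threshold test can be sketched as a point-to-polyline distance check. Only the threshold τ comes from the text; the second threshold `replan_distance`, the action labels and the function names are hypothetical and serve only to illustrate the distinction between a small and a large deviation.

```python
import math

def point_to_segment_distance(p, a, b):
    """Shortest distance from point p to the segment from a to b (2D, meters)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def correction_action(position, path, tau=1.0, replan_distance=5.0):
    """Decide what to do given the user's position and the planned path points.

    tau is the distance threshold from the text; replan_distance is an assumed
    second threshold separating "guide back" from "re-plan".
    """
    deviation = min(point_to_segment_distance(position, path[i], path[i + 1])
                    for i in range(len(path) - 1))
    if deviation <= tau:
        return "on_path"
    if deviation <= replan_distance:
        return "guide_back"
    return "replan"

path = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
print(correction_action((5.0, 0.5), path))   # small error, below tau
print(correction_action((5.0, 3.0), path))   # moderate deviation
print(correction_action((3.0, 8.0), path))   # large deviation
```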
AR systems have three basic features: the combination of real and virtual worlds, real-time interaction, and accurate 3D registration of virtual and real objects. In this way, AR changes people’s continuous perception of the real environment and provides an immersive experience by integrating elements of the virtual world into that perception. In an AR navigation APP specifically, users obtain real-world information through the smartphone camera, and AR technology adds virtual navigation paths to the smartphone’s interface, enhancing the user’s perception of the real environment for a better navigation experience. From the user’s point of view, a complete AR navigation session includes the following steps: 1) the user selects the current scene and obtains the navigation map delivered by the cloud; 2) the user selects the destination on the navigation map and requests the cloud navigation service; 3) the user follows the AR path rendered on the terminal interface to the destination. Due to network bandwidth limitations, users cannot obtain real-time localization by continuously sending the current scene image to the cloud. Therefore, the ARCore-based method is used to provide real-time localization. However, this method accumulates errors after a certain travel distance, and users may also deviate from the recommended path, so a correction algorithm is needed to guide users back to the correct path. Fig. 4 shows the flow of the AR navigation APP and AR rendering.
ARCore is an AR application platform provided by Google, which can be easily combined with 3D engines such as Unreal and Unity. ARCore provides three main capabilities: motion tracking, environment understanding, and light estimation. Motion tracking enables the phone to know and track its position relative to the world; environment understanding enables the phone to perceive the environment, such as the size and location of detectable surfaces; and light estimation allows the phone to obtain the current lighting conditions of the environment. Localization can be achieved using ARCore’s motion-tracking capability.
The motion-tracking function of ARCore is realized by visual inertial odometry (VIO). VIO includes two parts: a visual tracking system and an inertial navigation system. The visual tracking system tracks the user’s pose by matching pixels between camera frames. The inertial navigation system tracks position and attitude through an IMU, which usually consists of an accelerometer and a gyroscope. The outputs of the two systems are combined through a Kalman filter to determine the final pose of the user. The local positioning function provided by ARCore can track the user’s position in real time, but the error in the inertial navigation system accumulates over time. As the travel distance and elapsed time increase, the tracked position drifts. In practice, we find that after a user travels about 50 m, the localization provided by ARCore begins to deviate. At this point, it is necessary to relocalize through the visual localization algorithm and correct the path.
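The role of the Kalman filter in combining the two sources can be illustrated with a scalar update step. This is a toy one-dimensional sketch, not ARCore’s actual implementation: the function name is hypothetical, the inertial prediction and the visual measurement are made-up numbers, and a real VIO filter estimates a full pose state.

```python
def kalman_update(x_pred, p_pred, z, r):
    """Fuse a predicted state with a measurement (scalar Kalman update).

    x_pred, p_pred: predicted position and its variance (e.g., from inertial data).
    z, r: measured position and its variance (e.g., from visual tracking).
    Returns the fused estimate and its reduced variance.
    """
    k = p_pred / (p_pred + r)          # Kalman gain: how much to trust the measurement
    x = x_pred + k * (z - x_pred)      # corrected estimate
    p = (1.0 - k) * p_pred             # uncertainty shrinks after fusion
    return x, p

# Equal confidence in both sources gives the midpoint and halves the variance.
x, p = kalman_update(x_pred=10.0, p_pred=4.0, z=12.0, r=4.0)
print(x, p)
```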
▲Figure 4. Augmented reality (AR) navigation application (APP) and AR rendering result
On the basis of the previous work, the AR navigation APP obtains the current position of the user and the planned path point sequence from the cloud. The next question is how to render the path point sequence in AR on the mobile phone interface. From the perspective of user experience, the AR markers must not block the user’s line of sight and must provide clear guidance. Therefore, this paper renders the AR markers close to the ground. The environment understanding component of ARCore provides plane detection capabilities; in fact, ARCore requires all virtual objects to be anchored to planes for rendering. After ARCore has detected the ground plane, the AR markers can be placed on it. The placement of AR markers is achieved by ray casting. The principle of ray casting is to determine whether a ray emitted from the camera position toward a position in the 3D world collides with an object, so that the collided object and the collision position can be detected. By performing collision detection against the planes in the scene, the planes can be identified and the AR markers placed on them. This paper adopts two kinds of AR markers: the navigation guidance arrow, which indicates the forward direction, and the end prompt sign, which informs the user that the end-point has been reached. Fig. 4 shows the actual workflow of the AR navigation APP and the rendering effect of the AR markers. In the figure, from left to right, the user selects the destination (elevator entrance), the navigation guidance arrow is rendered, the user follows the arrow, and the navigation ends at the end prompt sign.
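The geometric core of ray casting against a detected plane is a ray-plane intersection, sketched below. This is a standalone illustration and does not use the ARCore or Unity APIs; the function name, the camera height and the ground plane are assumed values.

```python
import numpy as np

def ray_plane_hit(origin, direction, plane_point, plane_normal):
    """Intersection of a ray with an infinite plane, or None if there is no hit.

    origin, direction: ray start (camera position) and direction.
    plane_point, plane_normal: any point on the plane and the plane normal.
    """
    denom = np.dot(plane_normal, direction)
    if abs(denom) < 1e-9:
        return None                      # ray is parallel to the plane
    t = np.dot(plane_normal, plane_point - origin) / denom
    if t < 0:
        return None                      # plane lies behind the camera
    return origin + t * direction

# Camera 1.5 m above a ground plane y = 0, looking forward and down.
camera = np.array([0.0, 1.5, 0.0])
ground_point = np.array([0.0, 0.0, 0.0])
ground_normal = np.array([0.0, 1.0, 0.0])

hit = ray_plane_hit(camera, np.array([0.0, -1.0, 1.0]), ground_point, ground_normal)
print(hit)  # position on the ground where a marker could be placed
```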
▲Figure 5. An example of a 2D point cloud map aligned with a CAD map
This paper analyzes and introduces related technologies in the field of scene visual perception, based on which we implement AR navigation. In practical applications, there are still some problems to be solved [18–19]. For example, this paper adopts a structure-based localization framework, whose advantage is that it can effectively handle large-scale scenes with high localization accuracy. However, if the environment changes, the 3D structure needs to be re-adjusted to re-register the point clouds. The alignment of the point cloud map and the plane CAD map shown in Fig. 5 still requires manual selection of corresponding points, which is not conducive to large-scale applications, so automating this process remains a topic for follow-up work. The localization method proposed in this paper is a pure vision solution. In the future, other sensor data, such as IMU, depth camera or LiDAR data, could be combined to further improve localization and navigation performance. In addition, most current visual localization algorithms are not independent of the scene and usually require different models to be trained on different datasets (such as separate models for indoor and outdoor datasets), which makes practical applications difficult. For example, in the AR navigation process, image feature matching is usually performed in the cloud. Because users’ scenes are diverse, a scene-dependent localization algorithm will generalize insufficiently, which leads to poor localization performance. Therefore, for AR navigation, it is particularly important to enhance the generalization performance of localization algorithms and achieve scene-independent visual localization.