Neural Metric-Semantic Understanding
ZHU Fang
(1. State Key Laboratory of Mobile Network and Mobile Multimedia Technology, Shenzhen 518057, China; 2. ZTE Corporation, Shenzhen 518057, China)
Abstract: Efficient perception of the real world is a long-standing effort of computer vision. Modern visual computing techniques have succeeded in attaching semantic labels to thousands of daily objects and reconstructing dense depth maps of complex scenes. However, simultaneous semantic and spatial joint perception, the so-called dense 3D semantic mapping, which estimates the 3D geometry of a scene and attaches semantic labels to that geometry, remains a challenging problem that, if solved, would make structured vision understanding and editing more widely accessible. Concurrently, progress in computer vision and machine learning has motivated us to pursue the capability of understanding and digitally reconstructing the surrounding world. Neural metric-semantic understanding is a new and rapidly emerging field that combines differentiable machine learning techniques with physical knowledge from computer vision, e.g., the integration of visual-inertial simultaneous localization and mapping (SLAM), mesh reconstruction, and semantic understanding. In this paper, we attempt to summarize the recent trends and applications of neural metric-semantic understanding. Starting with an overview of the underlying computer vision and machine learning concepts, we discuss critical aspects of such perception approaches. Specifically, our emphasis is on fully leveraging the joint semantic and 3D information. Later on, many important applications of the perception capability, such as novel view synthesis and semantic augmented reality (AR) contents manipulation, are also presented. Finally, we conclude with a discussion of the technical implications of the technology under a 5G edge computing scenario.
Keywords: visual computing; semantic and spatial joint perception; dense 3D semantic mapping; neural metric-semantic understanding
The perception of the real world in a meaningful, reconstructive way has been one of the primary driving forces for the development of sophisticated computer vision techniques. The semantic and spatial joint perception of a variety of scenes is shown in Fig. 1. Computer vision approaches span a range from real-time mapping, which enables the latest generation of robots, to sophisticated semantic identification for meaningfully structured information in various big data applications. In both cases, one of the main bottlenecks is exact and consistent context understanding in terms of occlusion, view angle, and illumination conditions; i.e., despite the noticeable progress in fine-grained semantic scene understanding tasks like detection and instance segmentation, computers still perform unsatisfactorily on visually understanding humans in crowded scenes. Concurrently, powerful consistent context understanding models have emerged in the computer vision and machine learning communities. The seminal works related to semantic and spatial joint perception, the so-called dense 3D semantic mapping framework by HERMANS et al. [1], have evolved in recent years into joint volumetric 3D reconstruction and semantic segmentation formulations for both unmanned systems and human-involved virtual/augmented reality (VR/AR) immersive experiences. Here, the synthesis of more plausible depth in parts of the scene or more reliable semantic image classification can be achieved by jointly optimizing geometry and semantics in 3D. Very recently, this area has been explored as “metric-semantic understanding”. One of the first publications that used the term metric-semantic understanding is Kimera [2]. It enables machines to learn to perceive their surroundings by combining state-of-the-art geometric and semantic understanding into a modern perception pipeline. Furthermore, the authors also argue that semantic information built on top of geometric information provides the ideal level of abstraction for giving humans models of the environment that are easy to understand. Instead of implicitly combining the geometry and semantic segmentation in 3D, a variety of other methods follow this notion of collaboration more explicitly, exploiting components of the perception pipeline.
While classical computer vision starts from the affine imaging of the physical world and addresses geometrical consistency by modeling, for example, the camera's viewpoint, odometry, and depth map properties, machine learning comes from an end-to-end trainable (differentiable) and statistical perspective. It is a well-known fact that differentiable machine learning techniques can capture more complex dependencies and achieve a high level of expressiveness; used alone, however, they cannot be metric or explicitly follow the strict consistency of the physical world. The quality of mainly traditional computer vision-based dense 3D semantic mapping, in turn, relies on the physical correctness of the employed models. Direct joint estimation of geometry and semantics in a multi-view 3D reconstruction setting, which implicitly combines the geometry and semantic information in the scenes, is hard and error-prone and leads to artifacts in the reconstructed map. Thus, classic computer vision-based geometry reconstructions suffer not only from classical issues, such as poorly textured areas, repetitive patterns, and occlusions, but also from several additional challenges, such as higher noise levels and, often, the presence of shake and motion blur. To this end, traditional metric-semantic understanding methods try to overcome these issues by using heuristic regularization, like convex anisotropic regularizers, to combine captured imagery. But in complex scenery, these methods require thousands of iterations for convergence or are unable to fully capture the complex semantic and geometric dependencies behind them. Neural metric-semantic understanding brings the promise of addressing both geometry reconstruction and the fusion of geometry and semantic information by using deep networks to learn complex mappings from captured images to 3D semantic maps. The underlying principle is to combine differentiable machine learning techniques with physical knowledge from computer vision to yield new and powerful algorithms for semantic and spatial collaborative perception.
Neural metric-semantic understanding does not yet have a clear definition in the literature. Here, we define neural metric-semantic understanding as: deep image or video semantic and spatial collaborative perception approaches, as well as sub-modules, that enable the explicit or implicit fusion of semantic and geometric context properties of the scene, such as deep convolutional neural networks in volumetric space for 3D semantic segmentation, the incorporation of conventional multi-view stereo concepts within a deep learning framework, the fine-tuning of a deep network by using extracted geometric constraints, and the use of semantics as an invariant scene representation for medium-term continuous tracking in large-scale 3D scanning.
▲Figure 1. Semantic and spatial joint perception of a variety of scenes [2–3]
This paper defines the components of the semantic and spatial collaborative perception pipeline and explores the different directions of neural metric-semantic understanding formulations embedded in the corresponding components. One central scheme around which we structure this paper is the combination of computer vision imaging principles and learning-based primitives to yield new and powerful algorithms for the consistent understanding of visual content, since consistency in real-world understanding is essential for many media editing and structural data indexing applications. We start by discussing the fundamental concepts and components of metric-semantic understanding from previous explorations, which are prerequisites for the semantic and spatial collaborative perception pipeline. Afterwards, we discuss critical aspects of emerging neural-based metric-semantic understanding approaches, i.e., fusions of learning-based primitives and affine imaging principles, such as the type of fusion, how the fusion is provided, which components of the metric-semantic understanding pipeline are learned, and explicit vs. implicit fusion. Following this, we discuss the panorama of applications that is enabled by semantic and spatial collaborative perception. The applications range from relighting and novel view synthesis to the manipulation of semantic contents for augmented reality (AR). The semantic manipulation of AR contents, which achieves natural interaction between the virtual and real world and finally facilitates natural interaction between “digital twins” and the real world, has many technical implications for the evolving storage-computing network, especially when instant-response computing and privacy-preserving strategies can be carried out with the help of edge computing based on 5G. We then conclude with these implications.
Metric-semantic understanding, sometimes called “dense 3D semantic mapping”, has been continuously studied in the literature, such as Ref. [2] and Refs. [4–8], with applications including robot perception and mixed reality. Perceptional understanding using classic computer vision, or with convolutional neural networks (CNNs) as classification assistance, has been studied extensively. A thorough survey [9] of such classical computer vision methods, which implicitly combine the geometry and semantic segmentation of 3D, focuses on specific heuristic regularization, such as surface normal directions [10] and special treatment for highly reflective objects [11]. Recent explorations regarding explicitly semantic and spatial collaborative perception through the components of the perception pipeline, with emerging machine learning techniques, have also been discussed in Refs. [12–15]. Recent reports, like Refs. [16–19], discuss various applications enabled by metric-semantic understanding techniques, such as novel view synthesis, relighting, and semantic AR contents manipulation. However, none of the above reports or literature provides a structured or comprehensive look into the new and rapidly emerging field of neural metric-semantic understanding, which combines differentiable machine learning techniques with physical knowledge. Such a comprehensive view, especially one linking clues from classic computer vision to the “new” neural assistance, is critical, since the “next generation” of semantic and spatial collaborative perception can reach new heights in the performance of these tasks, which motivates us to pursue the modern computer vision capability of understanding and digitally reconstructing the surrounding world.
In this section, we discuss the theoretical fundamentals of working in the semantic and spatial collaborative perception space. First, we discuss dense depth map formation models in computer vision, followed by the classic methods of high-quality 3D scanning of large-scale scenes. Next, we discuss approaches to semantic generative models in deep learning. In the end, we discuss the core principles of volumetric semantic 3D reconstruction.
Classical computer vision methods approximate the reverse prediction process of image formation in the real world. Light sources emit photons that interact with the objects in the scene, as a function of their geometry and material properties, before being recorded by multiple cameras with overlapping views. Recovering geometry from such recordings is known as dense depth estimation. Early passive stereo methods, analyzed in depth in Ref. [20], relied on at least two recorded frames with known camera geometry to extract stereo correspondence, the so-called dense disparity map. Among them, some multi-view stereo methods use multi-valued, voxel-based, or layer-based representations, while most stereo correspondence methods compute a univalued disparity function d(x, y) with respect to a reference image. The central element of methods that produce a univalued disparity map d(x, y) is the concept of a disparity space (x, y, d). The term disparity describes the difference in the location of corresponding features seen by the left and right eyes. The correspondence between a pixel (x, y) in reference image r and a pixel (x', y') in matching image m is then given by Eq. (1). The common steps in stereo algorithms generally include matching cost computation, support aggregation, disparity computation, and disparity optimization. The actual sequence of steps taken depends on the specific algorithm.
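The body of Eq. (1) is not reproduced in the extracted text. Assuming rectified images, the correspondence it describes plausibly takes the following standard form; this is a hedged reconstruction from the surrounding definitions, not necessarily the exact expression of the original layout.

```latex
% Plausible form of Eq. (1), assuming rectified stereo images:
% a pixel (x, y) in the reference image r corresponds to the pixel
% (x', y') in the matching image m shifted by the disparity d(x, y).
\begin{equation}
  x' = x + s\, d(x, y), \qquad y' = y ,
\end{equation}
% where s = +1 or -1 depending on the chosen sign convention for disparity.
```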
Passive stereo matching algorithms work well on textured scenes but require demanding computation. Later on, active stereo methods (e.g., Kinect), which triangulate correspondences between a structured active illumination pattern and a camera, have raised a lot of interest. While unstructured surfaces are no longer a problem, the lateral resolution of active-stereo-only methods is limited by the resolution of the projection system under size or power constraints. Currently, accurate real-time dense depth estimation is mostly fulfilled by the fusion of sensors, which ultimately improves speed, robustness, and quality. A thorough re-inspection of the classical paradigm and of the fusion between time of flight (ToF) and stereo can be found in Ref. [21]. To exploit their complementary strengths, accurate but sparse active range measurements and ambiguous but dense passive stereo information must be fused under the principle described in Eq. (2).
In Eq. (2), w_stereo and w_ToF represent confidence weights, E represents the objective energy to be minimized, and R represents the regularizer.
Different optimization strategies correspond to different concrete formulations of the principle described in Eq. (2), such as the local method in Eq. (3) and the variational framework in Eq. (4).
In Eq. (4), ρ represents the local term penalizing the deviation from the ToF or stereo data, and X represents the spatial indicator functions for valid/trusted ToF/stereo measurements.
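The bodies of Eqs. (2)–(4) are not reproduced in the extracted text. As a hedged reconstruction based only on the variable descriptions above (w_stereo, w_ToF, E, R, ρ and X), the fusion principle of Eq. (2) and the variational framework of Eq. (4) plausibly take forms along the following lines; the local method of Eq. (3) is omitted because its exact form cannot be recovered from the surrounding text, and the original expressions may differ in notation and detail.

```latex
% Hedged reconstruction of the ToF/stereo fusion energy.
\begin{align}
  % Eq. (2): weighted combination of ToF and stereo data terms plus a regularizer
  E(d) &= w_{\mathrm{ToF}}\, E_{\mathrm{ToF}}(d)
        + w_{\mathrm{stereo}}\, E_{\mathrm{stereo}}(d) + R(d), \\
  % Eq. (4): variational form, with indicator functions X selecting pixels
  % where the ToF or stereo measurement is valid/trusted, local penalties rho,
  % and a smoothness regularizer on the fused disparity/depth field d
  E(d) &= \int_{\Omega} \Big[ X_{\mathrm{ToF}}\,
            \rho_{\mathrm{ToF}}\big(d - d_{\mathrm{ToF}}\big)
        + X_{\mathrm{stereo}}\,
            \rho_{\mathrm{stereo}}\big(d - d_{\mathrm{stereo}}\big) \Big]\,\mathrm{d}x
        + \lambda \int_{\Omega} \lVert \nabla d \rVert \,\mathrm{d}x .
\end{align}
```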
Given an accurate dense depth map of the observed view, high-quality consistent 3D scanning of large-scale scenes is the next key step towards geometric and photometric registration between the virtual and real world. The most important tasks under this objective are estimating globally optimized poses, robust tracking with recovery from gross tracking failures, and re-estimating the 3D model to ensure global consistency, as mentioned by DAI et al. [22]. The core of the above tasks is a robust pose estimation strategy, which globally optimizes the camera trajectory per frame, considering the complete history of the single-view depth and image input in an efficient local-to-global hierarchical optimization framework, as described in Refs. [22–24]. While each has trade-offs, global optimization methods based on implicit bundle adjustment (BA) are the de facto methods for the highest-quality reconstructions. Finally, the optimization for both dense photometric and geometric alignment is based on the energy illustrated in Eq. (5).
In Eq. (5), v_k represents the back-projection of the k-th vertex and n_k is the corresponding normal; D represents the live depth map and C represents the live color image; ξ is the motion parameter and exp(ξ) is the matrix exponential that maps a member of the Lie algebra se(3) to a member of the corresponding Lie group SE(3); T is the current estimate of the transformation from the previous camera pose to the current one; E represents the cost function that needs to be minimized and w represents manually defined weights.
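The body of Eq. (5) is likewise missing from the extracted text. Based on the variable descriptions above and on standard dense RGB-D tracking formulations, a plausible (hedged) reconstruction combines a point-to-plane geometric term with a photometric term; the exact weighting and parameterization of the cited works may differ.

```latex
% Hedged reconstruction of the joint geometric/photometric alignment energy (Eq. (5)).
\begin{equation}
  E = E_{\mathrm{geo}} + w\, E_{\mathrm{photo}},
\end{equation}
% Point-to-plane term over model vertices v_k with normals n_k,
% and a photometric term comparing the live color image C against
% the model color at pixels u, with pi() the pinhole projection and
% p(u, D) the back-projection of pixel u using the live depth map D.
\begin{equation}
  E_{\mathrm{geo}} = \sum_{k}
    \Big( \big( \mathbf{v}^{k} - \exp(\hat{\xi})\, T\, \mathbf{v}_{k} \big)
          \cdot \mathbf{n}^{k} \Big)^{2}, \qquad
  E_{\mathrm{photo}} = \sum_{\mathbf{u}}
    \Big( C\big( \pi( \exp(\hat{\xi})\, T\, \mathbf{p}(\mathbf{u}, D) ) \big)
        - C_{\mathrm{model}}(\mathbf{u}) \Big)^{2}.
\end{equation}
```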
Besides the geometric and photometric registration following the above methods, semantic generative models assist in the semantic content registration of the corresponding large-scale scenes. Such scene comprehension, which necessitates recognizing instances of scene participants along with general scene semantics, can be addressed by the panoptic segmentation task with corresponding semantic generative models such as those in Refs. [25–26]. Such semantic generative models generally need a deep neural network (e.g., a feature pyramid network) as a backbone to efficiently encode and fuse semantically rich multi-scale features, followed by a panoptic head network to extract a coherently understandable visual scene at both the fundamental pixel level and the distinctive object instance level, as shown in Fig. 2. The model predicts four outputs: the semantic prediction from the semantic head, and the class, bounding box, and mask predictions from the instance head. All the aforementioned predictions are then fused in the panoptic fusion module to yield the final panoptic segmentation output. Moreover, advances in state-of-the-art deep learning methods continually boost the performance of these tasks to new heights.
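To make the fusion step concrete, the sketch below shows in plain Python how a panoptic fusion module can combine the outputs of the two heads: scored instance masks are pasted first and the remaining pixels are filled with the per-pixel semantic prediction. The function name, thresholds, greedy overlap rule and the assumption that “stuff” classes occupy the first class ids are illustrative, not the exact fusion logic of Ref. [26].

```python
# Minimal sketch of a panoptic fusion module (illustrative assumptions only).
import numpy as np

def panoptic_fusion(sem_logits, inst_masks, inst_classes, inst_scores,
                    score_thresh=0.5, overlap_thresh=0.5, num_stuff=11):
    """sem_logits: (C, H, W) float array; inst_masks: (N, H, W) bool array;
    inst_classes, inst_scores: (N,) arrays. Returns an (H, W) panoptic id map,
    encoding instances as segment_id * 1000 + class_id (a common convention)."""
    H, W = sem_logits.shape[1:]
    panoptic = np.zeros((H, W), dtype=np.int32)        # 0 = unassigned
    occupied = np.zeros((H, W), dtype=bool)
    next_id = 1

    # 1) Paste high-scoring instance ("thing") masks first, resolving overlaps greedily.
    order = np.argsort(-np.asarray(inst_scores))
    for i in order:
        if inst_scores[i] < score_thresh:
            continue
        mask = inst_masks[i] & ~occupied
        if mask.sum() < overlap_thresh * max(inst_masks[i].sum(), 1):
            continue                                    # mostly hidden by better instances
        panoptic[mask] = next_id * 1000 + int(inst_classes[i])
        occupied |= mask
        next_id += 1

    # 2) Fill remaining pixels with the argmax "stuff" class from the semantic head
    #    (assuming stuff classes occupy ids 0 .. num_stuff - 1).
    sem_pred = sem_logits.argmax(axis=0)
    stuff = (~occupied) & (sem_pred < num_stuff)
    panoptic[stuff] = sem_pred[stuff]
    return panoptic
```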
With the above procedures, depth maps and pixel-wise semantic classification scores are obtained as inputs to the final objective, the semantic understanding of 3D environments. The core processing is carried out by volumetric semantic reconstruction, which is cast as a volumetric fusion of depth maps and pixel-wise semantic classification scores. In practical applications, 3D reconstruction systems or semantic segmentation algorithms are not robust enough and often produce erroneous results for surfaces observed under certain viewing angles. Many of these limitations of such fusion processes can be overcome by casting dense 3D reconstruction and semantic segmentation as a joint optimization formulation, shown in Fig. 3. The general idea of the formulation is that each voxel gets assigned one out of L+1 labels, where label i = 0 denotes the free space label and the L labels with i > 0 indicate occupied space, which is segmented into several semantic classes. Such a formulation, the so-called objective function of the volumetric multi-label approaches, can be resolved with the objective function of the convex multi-label energy extended from the volumetric 3D reconstruction energy, as described in Eq. (6). The energy E(x) consists of two parts: the former data term is a function of a given label and is parameterized by the internal probability distribution of the voxel/surfel; the subsequent pairwise smoothness term is a function of the labeling of two connected voxels/surfels in the graph and is parameterized by the geometry of the map.
▲Figure 2. Overview of the overall architecture for classical panoptic segmentation (pictures taken from Ref. [26])
▲Figure 3. Dense semantic 3D reconstruction [9]
In Eq. (6), E represents the objective function of the convex multi-label energy, x represents the label assignment to the voxels, the data term represents the cost of assigning label i to a voxel, and the pairwise term represents a transition-specific, direction- and location-dependent penalizer of the surface area formed as the interface between labels i and j.
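The body of Eq. (6) is also missing from the extracted text. A hedged reconstruction consistent with the description above (a per-voxel data term plus a pairwise, transition-specific smoothness term over the L+1 labels) is given below; the notation x_s^i for the relaxed indicator that voxel s carries label i is our assumption, and the anisotropic details of the original pairwise term are simplified here.

```latex
% Hedged reconstruction of the volumetric multi-label energy (Eq. (6)).
\begin{equation}
  E(x) \;=\; \sum_{s} \sum_{i=0}^{L} \rho_{s}^{\,i}\, x_{s}^{\,i}
  \;+\; \sum_{(s,t)\in\mathcal{N}} \sum_{i<j}
        \phi_{st}^{\,ij}\, \big| x_{s}^{\,i} - x_{t}^{\,i} \big| ,
\end{equation}
% rho_s^i: cost of assigning label i to voxel s (from the fused depth and
% semantic observations); phi_st^ij: transition-specific, direction- and
% location-dependent penalty on the interface area between labels i and j
% across neighboring voxels s, t in the neighborhood system N.
```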
This type of formulation describes a convex relaxation procedure, which is closely related to linear programming (LP) relaxations for approximate maximum a posteriori (MAP) inference in Markov random fields (MRFs). The classical solutions to this procedure include the Bayesian, conditional random field (CRF), MRF and variational frameworks. The work of HÄNE et al. [9] can be referred to for a thorough exploration of such formulations and approaches. HAN et al. [15] also address some of the latest emerging technical problems, inspired by continually improving deep learning achievements.
Following the above overview of the underlying computer vision and machine learning concepts, we will discuss the new explorations regarding fully leveraging joint semantic and 3D information, namely neural metric-semantic understanding. Given a high-quality geometric and semantic scene understanding specification, classic semantic and spatial collaborative perception methods can reconstruct global 3D semantic dense maps for a variety of real-world scenes. Moreover, such dense 3D semantic mapping techniques give us explicit editing control over all the elements of the perception pipeline, and strictly follow physical knowledge from computer vision (camera viewpoint, lighting, geometry and materials). However, building a high-quality semantic and 3D reconstruction, especially directly from poorly textured areas, under a higher noise level, in dynamic surrounding environments, requires significant manual effort, and automated, highly consistent context understanding from images is an open research problem. On the other hand, the emerging learning-based techniques are now starting to produce plausible dense depth maps or even 3D scans of scenes, either from random noise or conditioned on certain user specifications. However, they do not yet guarantee geometrical consistency and cannot always recover the true depth up to a single scale factor. In contrast, neural metric-semantic understanding brings the promise of combining these approaches to enable high-quality co-consistency under both semantic and geometric scenarios. Neural metric-semantic understanding techniques are diverse, differing in the fusion that they provide over the perception pipeline, the type of fusion, and the network structures they utilize. A typical neural metric-semantic understanding approach takes red-green-blue-depth (RGBD) sequences corresponding to certain scenes as input, builds a dense 3D reconstruction from them, and adopts volumetric 3D convolution for point cloud segmentation to extract the final semantic 3D understanding. The dense 3D reconstruction is not restricted to directly using classical computer vision methods for geometric modeling of the environment and can be optimized in combination with differentiable machine learning techniques for high-quality consistent understanding. At the same time, neural metric-semantic understanding approaches incorporate ideas from classical computer vision, in the form of orthogonal approaches to reduce drift, traditionally obtained geometric constraints, and network architectures, to make the learning task easier and the output more consistent.
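The following skeleton sketches this typical pipeline at a high level. All component names (track_pose, fuse_depth, segment_volume, VolumetricMap) are hypothetical placeholders introduced for illustration, not the API of any particular system; each stands in for a module discussed in the remainder of this section.

```python
# High-level sketch of a typical neural metric-semantic pipeline:
# RGB-D frames are tracked and fused into a volumetric reconstruction,
# then a (sparse) 3D network assigns semantic labels to the volume.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class VolumetricMap:
    voxel_size: float = 0.05
    tsdf: dict = field(default_factory=dict)      # (i, j, k) -> signed distance
    labels: dict = field(default_factory=dict)    # (i, j, k) -> semantic label id

def track_pose(prev_pose: np.ndarray, depth: np.ndarray, color: np.ndarray) -> np.ndarray:
    """Placeholder for frame-to-model tracking (e.g., point-to-plane ICP plus a
    photometric term, cf. Eq. (5)). Here it simply returns the previous pose."""
    return prev_pose

def fuse_depth(vmap: VolumetricMap, depth: np.ndarray, pose: np.ndarray) -> None:
    """Placeholder for volumetric (TSDF-style) fusion of one depth map."""
    pass

def segment_volume(vmap: VolumetricMap) -> None:
    """Placeholder for sparse 3D CNN semantic segmentation of the fused volume;
    here every active voxel is trivially assigned label 0 (free space)."""
    for key in vmap.tsdf:
        vmap.labels[key] = 0

def run_pipeline(rgbd_frames):
    vmap, pose = VolumetricMap(), np.eye(4)
    for color, depth in rgbd_frames:
        pose = track_pose(pose, depth, color)   # geometry: camera trajectory
        fuse_depth(vmap, depth, pose)           # geometry: dense reconstruction
    segment_volume(vmap)                        # semantics: per-voxel labels
    return vmap
```

In a real system, each placeholder would be replaced by the corresponding neural or classical module discussed below.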
We propose a taxonomy of neural metric-semantic understanding approaches along the axes that we consider the most important:
·Joint volumetric multi-label formulation
·Semantically geometric and photometric registration
·Semantical depth map regulation
In the following, we will discuss current state-of-the-art methods along these axes.
According to the general pipeline of metric-semantic understanding, depth maps and pixel-wise semantic classification scores are obtained as inputs to the final objective, the “semantic understanding of 3D environments”. Various approaches have been proposed to tackle the joint optimization formulation. The authors in Refs. [15, 27–28] directly use 3D convolutional neural network approaches on voxels (the representation of 3D scenes), analogous to 2D convolution on pixels, while the methods in Ref. [13], such as variational methods for convex relaxation, incorporate physical knowledge into an emerging differentiable learning network.
3D convolutional neural network methods rely on generic 3D convolutional neural network architectures and take the three-dimensional representation of 3D scenes as input. The curse of dimensionality applies, in particular, to data that lives on grids with three or more dimensions: the number of points on the grid grows exponentially with its dimensionality. In such scenarios, as the counterpart of 2D convolutional processing for two-dimensional pictures, it becomes increasingly important to reduce the computational resources needed for 3D convolutional processing, for example by exploiting sparsity and reducing the number of global memory accesses. Prior work in Ref. [28] implements sparse convolutions (SCs) and introduces a novel convolution operator termed submanifold sparse convolution (SSC) that restricts computation and storage to “active” sites. The utilization of the sparse nature of points in the 3D volumetric space forms the basis for a new mainstream solution, submanifold sparse convolutional networks (SSCNs), which are optimized for efficient semantic segmentation of 3D representations of scenes. A later trial in Ref. [15] extends the SSCN with explorations addressing the efficiency bottleneck of sparse 3D CNNs, which lies in the unorganized memory access of the sparse convolution steps, to meet the demand for online computation.
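The toy example below illustrates the core idea of submanifold sparse convolution, i.e., outputs are computed only at “active” voxels and inactive neighbors contribute nothing, so the active set does not dilate from layer to layer. It is a naive dictionary-based illustration for intuition only, not the optimized SSCN implementation of Ref. [28].

```python
# Toy submanifold sparse 3D convolution: compute outputs only at active sites.
import numpy as np
from itertools import product

def submanifold_sparse_conv3d(features, weights, bias):
    """features: dict mapping (x, y, z) -> np.ndarray of shape (C_in,);
    weights: np.ndarray of shape (3, 3, 3, C_in, C_out); bias: (C_out,).
    Returns a dict with the same active sites and C_out output features."""
    out = {}
    offsets = list(product((-1, 0, 1), repeat=3))
    for site in features:                            # outputs only at active sites
        acc = bias.copy()
        for dx, dy, dz in offsets:
            nbr = (site[0] + dx, site[1] + dy, site[2] + dz)
            f = features.get(nbr)                    # inactive neighbors are skipped
            if f is not None:
                acc = acc + f @ weights[dx + 1, dy + 1, dz + 1]
        out[site] = acc
    return out

# Example: 2 active voxels on an (implicitly huge) grid, C_in = 4, C_out = 8.
rng = np.random.default_rng(0)
feats = {(10, 3, 7): rng.normal(size=4), (10, 4, 7): rng.normal(size=4)}
W, b = rng.normal(size=(3, 3, 3, 4, 8)), np.zeros(8)
out = submanifold_sparse_conv3d(feats, W, b)
print(sorted(out.keys()))   # same two active sites; the active set does not grow
```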
Directly applying 3D convolutional neural networks to voxels, like 2D convolution on pixels, introduces some limitations, such as the insufficient capacity of deep learning techniques to delineate visual objects. This, for instance, can result in non-sharp boundaries and blob-like shapes in semantic segmentation tasks. In the classical perception pipeline, in contrast, probabilistic graphical models have been developed as effective methods to enhance the accuracy of the above task, as illustrated in Section 3.4. To this end, compared with the classic convex relaxation procedure, which always requires regularizers with hand-designed priors, a new differentiable learning network method [13] combines the advantages of classical variational approaches with recent advances in deep learning, and turns the inference/optimization formulation from a hand-tuned, hard-to-converge procedure into a simple, generic, and substantially more scalable one. A reason for the improvement is that the previously employed priors are not rich enough to capture the complex relationships of our 3D world, while learning-based differentiable networks overcome this automatically in an end-to-end trainable model. Furthermore, such an explicitly reused concept of variational energy minimization has led to great advances when dealing with noise and missing information.
On a separate track from the progress of joint optimization with deep learning techniques, some novel frameworks in Ref. [29] aggregate inputs from the initial stage of the previous pipeline together with the information of multiple 2D observations from different view angles, and directly reconstruct the final 3D semantic results with a full deep learning framework. Rather than using the above methods, which project color data into a volumetric grid and operate solely in 3D, an end-to-end network architecture that directly extracts feature maps from the associated RGB images and then maps them into the volumetric feature grid of a 3D network, using a differentiable back-projection layer, can recover finer details.
Despite the full exploration of the joint optimization formulation with geometry and semantic maps as the input, emerging neural network techniques have also tried to leverage the combination of differentiable machine learning techniques with physical knowledge from computer vision in the sub-modules of the perception pipeline, to bring classic metric-semantic understanding performance to complex scenes. The seminal methods in Refs. [14] and [31] aim to address the underlying key challenges of such scenarios, namely globally consistent geometric and photometric registration, with some revolutionary thinking, such as fine-tuning a deep network by using extracted geometric constraints and representing semantics as an invariant scene representation for medium-term continuous tracking in large-scale 3D scanning.
Robust data association is a core problem of visual odometry and the cornerstone of large-scale geometry reconstruction. Currently, the state-of-the-art classic metric-semantic understanding methods use short-term tracking to obtain continuous frame-to-frame constraints, while long-term constraints are established using loop closures, as illustrated in Ref. [14]. Although these two approaches are orthogonal and greatly reduce drift through collaboration, a representation of scenes that is invariant to viewpoint and illumination changes cannot always be guaranteed, because of the gap between action interval spans. The authors originally propose using semantics for medium-term continuous tracking of points to improve the first drift correction strategy. The underlying intuition is that changes in viewpoint, scale, illumination, etc., only affect the low-level appearance of objects but not their semantic meaning. By integrating semantic reprojection errors into existing visual odometry (VO) approaches and combining differentiable machine learning techniques with physical knowledge from computer vision, translational drift in fast or complex scenes is reduced significantly, as reported in the literature.
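The sketch below captures this intuition as a concrete residual: a tracked 3D point with a semantic label should project onto image pixels of the same class, so its cost is the pixel distance (via a distance transform of the class mask) from the projection to the nearest pixel of that class. This is only a minimal illustration of the idea; the exact robustifier and weighting used in Ref. [14] may differ.

```python
# Minimal sketch of a semantic reprojection residual for visual odometry.
import numpy as np
from scipy.ndimage import distance_transform_edt

def project(K, T_cw, X_w):
    """Pinhole projection of a 3D world point into image coordinates (u, v)."""
    X_c = T_cw[:3, :3] @ X_w + T_cw[:3, 3]
    u = K @ (X_c / X_c[2])
    return u[:2]

def semantic_reprojection_residual(K, T_cw, X_w, label, semantic_map):
    """semantic_map: (H, W) int array of per-pixel class ids.
    Returns the pixel distance from the projection of X_w to the nearest
    pixel carrying `label` (0 if it already lands on its own class)."""
    dist_to_class = distance_transform_edt(semantic_map != label)
    u, v = project(K, T_cw, X_w)
    iu, iv = int(round(u)), int(round(v))
    H, W = semantic_map.shape
    if not (0 <= iv < H and 0 <= iu < W):
        return None                      # projection falls outside the image
    return dist_to_class[iv, iu]

# This scalar residual can be appended (with a robust weight) to the usual
# geometric/photometric terms of an existing VO energy.
```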
The reverse of the above method, emerging as another direction for optimizing deep learning in computer vision, is reflected in the method proposed by LUO et al. [31]. The method leverages a convolutional neural network trained for single-image depth estimation along with conventional structure-from-motion reconstruction to establish geometric constraints on pixels in the image sequence. The authors first train a single-image depth estimation network to synthesize plausible depth for general color images, and then fine-tune the network at test time by using the geometric constraints extracted via traditional reconstruction methods. This novel formulation, which combines the strengths of traditional techniques and learning-based techniques, maintains the geometrical consistency of the reconstruction over time even under a moderate amount of dynamic scene motion.
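A minimal sketch of this test-time fine-tuning loop is given below. The depth-network interface, the correspondence format and the simple relative-depth consistency loss are our illustrative assumptions; the actual losses of LUO et al. [31] are more elaborate.

```python
# Test-time fine-tuning of a single-image depth network with SfM-derived
# geometric consistency constraints (illustrative sketch).
import torch

def geometric_consistency_loss(depth_i, depth_j, corr_ij, K, T_ji):
    """depth_i/depth_j: (H, W) predicted depths of frames i and j.
    corr_ij: (N, 4) long tensor of matched pixels (ui, vi, uj, vj).
    Penalizes disagreement between frame i's points, reprojected into frame j
    with the SfM relative pose T_ji, and frame j's own predicted depth."""
    ui, vi, uj, vj = corr_ij.unbind(dim=1)
    z_i = depth_i[vi, ui]                                   # (N,)
    ones = torch.ones_like(ui, dtype=z_i.dtype)
    pix_i = torch.stack([ui.to(z_i.dtype), vi.to(z_i.dtype), ones], dim=0)
    X_i = torch.linalg.inv(K) @ (pix_i * z_i)               # back-project in frame i
    X_j = T_ji[:3, :3] @ X_i + T_ji[:3, 3:4]                # transform into frame j
    z_j_pred = X_j[2]                                       # reprojected depth
    z_j_net = depth_j[vj, uj]                               # network's depth at the match
    return torch.mean(torch.abs(z_j_pred - z_j_net) / (z_j_net + 1e-6))

def finetune(depth_net, frame_pairs, optimizer, K):
    for frame_i, frame_j, corr_ij, T_ji in frame_pairs:     # from SfM preprocessing
        d_i, d_j = depth_net(frame_i), depth_net(frame_j)
        loss = geometric_consistency_loss(d_i, d_j, corr_ij, K, T_ji)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                    # adapt the weights to this video
```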
As the basic inputs of the semantic understanding of 3D environments, the input geometry and semantic maps, recorded by overlapping views or “active” sensing, always suffer from inaccuracy and incompatible resolutions because of the different sensing schemes. Plenty of progress, as shown in Refs. [30, 32–33], has been made to reduce the noise and boost geometric details, especially after consumer depth sensors came into our daily lives, marked by their recent integration in the latest iPhone. In many classic metric-semantic understanding approaches, volumetric depth map “fusion” has become a standard method, boosting geometric details with sparse depth and dense RGB information based on truncated signed distance functions. Due to the disadvantages of the related classic methods and the real-time requirement, neural-based novel depth map regulation approaches emerge in multiple ways to reach new heights of performance: 1) semantic information, which enriches the scene representation, is incorporated into the fusion process; 2) the multi-frame fused geometry and the accompanying high-quality color image are leveraged through a joint training strategy; 3) a depth upsampling method is made tolerant to outlier factors (such as mismeasured depth points, flipping points, and disocclusion) and adapts spontaneously to each scene through a self-learning framework in an online-update manner.
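As a reference point for these neural extensions, the snippet below sketches the classic per-voxel update behind volumetric depth map “fusion”: a running weighted average of truncated signed distances. The accompanying accumulation of per-class semantic evidence is shown as an illustrative assumption of how semantic information can be folded into the same update, not as the formulation of any specific cited work.

```python
# Classic truncated-signed-distance fusion update for one voxel, with an
# illustrative per-class semantic accumulator.
import numpy as np

def update_voxel(tsdf, weight, class_hist, d_obs, cls_obs, trunc=0.1, w_obs=1.0):
    """tsdf, weight, class_hist: current voxel state; d_obs: signed distance of
    the voxel to the surface seen in the new depth map; cls_obs: per-class
    probabilities (C,) for the pixel the voxel projects to."""
    d_obs = np.clip(d_obs, -trunc, trunc)            # truncate far-from-surface values
    new_weight = weight + w_obs
    new_tsdf = (tsdf * weight + d_obs * w_obs) / new_weight
    new_hist = class_hist + w_obs * cls_obs          # accumulate semantic evidence
    return new_tsdf, new_weight, new_hist

# Example: one voxel observed three times near the surface.
tsdf, w, hist = 0.0, 0.0, np.zeros(4)
for d, p in [(0.04, [0.1, 0.7, 0.1, 0.1]),
             (0.02, [0.0, 0.9, 0.05, 0.05]),
             (0.05, [0.2, 0.6, 0.1, 0.1])]:
    tsdf, w, hist = update_voxel(tsdf, w, hist, d, np.array(p))
print(round(tsdf, 3), int(hist.argmax()))            # fused distance and dominant class
```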
Instead of explicitly combining the geometry and semantic segmentation of 3D as in the former approaches, others follow this notion of collaboration more implicitly. However, efficiently encoding and fusing “semantically” rich multi-scale features in an end-to-end trainable (differentiable) way is far from trivial. Furthermore, there has recently been immense progress on learning-based methods that operate on single images. These methods offer the pleasing ability to synthesize plausible depth, in particular in dynamic scenes and beyond the limitations of the sensing range. In order to construct fine-grained depth sensing, one of the seminal works, by TULSIANI et al. [3], specializes the representation of objects in scenes to particular instances, signaling that both top-down and bottom-up cues influence the perception, and deforms well into shapes even slightly different from those seen in training. Fig. 1 illustrates the pleasing semantic object reconstruction result, which reflects the impressive influence introduced by neural semantic depth map regulation.
Semantic and spatial collaborative perception has many important use cases, including but not limited to relighting, novel view synthesis, and semantic AR contents manipulation. The following is a detailed discussion of various applications.
Relighting is known as a procedure for photo-realistically rendering a scene under novel illumination. It is a fundamental component for a number of media editing applications, including AR and visual effects. Previously challenging settings like large-scale outdoor scene relighting can be addressed with the help of multi-view-based semantic and spatial collaborative perception. Relighting in the wild [18] casts the problem as a multi-modal image synthesis problem, which takes a rendered deep buffer as input, containing depth and color channels together with a semantic label (also known as an “appearance code”), and outputs realistic views of tourist landmarks under various lighting conditions, as shown in Fig. 4. Fig. 4a shows that the model is rendered into a deep buffer of depth, color and semantic labels, and Fig. 4b shows that a relighting method translates these buffers into realistic renderings under multiple appearances. The input views, including depth and color channels, are used to reconstruct the 3D geometry of the scene; the semantic labels are also taken as input to indicate the location of transient objects like pedestrians. Using the corresponding rendered deep buffers and pairs of real photos, a multi-modal image synthesis pipeline learns an implicit model of appearance, which represents the time of day, weather conditions and other properties not present in the 3D model. A similar principle is also adopted by the multi-view relighting method [34]. Furthermore, the authors consider that such geometry is coarse and erroneous, and directly relighting it would produce poor results. Instead, the geometry is used to construct intermediate buffers (normals, reflection features, and RGB shadow maps) as auxiliary inputs to guide a neural network-based relighting method. The above methods all generalize to real scenes, producing high-quality results for applications like the creation of time-lapse effects from multiple images.
▲Figure 4. Relighting in the wild [18] reconstructs a proxy 3D model from a large-scale Internet photo collection
Rendering a scene under novel camera perspectives given a fixed set of images, a procedure known as “novel view synthesis” or “free viewpoint video”, is a critical component of emerging media entertainment applications such as 360° VR. The topic has gained a lot of interest in the research community and reached compelling quality with the work of COLLET et al. [35] and its real-time counterpart by DOU et al. [36–37]. Key challenges of such applications are inferring the scene's 3D structure from sparse observations, for example, the in-painting of unseen parts of the scene. Recently, reconstructing a learned representation of the scene from the observations, and learning priors on geometry, appearance and other scene properties in a learned feature space with a differentiable renderer, has become a hot topic and made significant progress on previously open challenges such as learning from extremely sparse observations, as shown in Fig. 5. Such semantic and spatial collaborative perception-based approaches range from the explicit 3D disentanglement of multi-plane images [38] to 3D-structured representations such as voxel grids of features in Refs. [16] and [17]. Among them, HoloGAN [16] implements an explicit affine transformation layer that directly applies view manipulations to learned 3D features to build an unconditional generative model that allows explicit viewpoint changes. Scene representation networks (SRNs) [17] encode both scene geometry and appearance in a single fully connected neural network, parameterizing surface geometry via an implicit function. Although such approaches show better results compared with previous ones, they still have limitations, i.e., they are restricted to specific use cases and limited by the training data.
▲Figure 5. Scene representation networks [17] allow full 3D reconstruction from a single image (bottom row: surface normals and color render) by learning strong priors via a continuous, 3D-structure-aware neural scene representation
Semantic AR contents manipulation is the key procedure of, but is not limited to, the emerging AR experience paradigm of so-called “retargetable AR” [19]. As the authors illustrate, retargetable AR is a novel AR framework that yields an AR experience aware of the scene contexts of various real environments, achieving natural interaction between the virtual and real world, as shown in Fig. 6, in which the images are taken from Ref. [19]. The AR experience is expressed as an abstract AR scene graph based on the relationships among objects. Such a retargetable correspondence between the realistic scene and the constructed graph provides a semantically registered content arrangement, and finally facilitates natural interaction between “digital twins” and the real world. The key procedure, semantic AR contents manipulation, extends the original solution, which was only a geometric and photometric registration between the virtual and real world [39], to the accurate and natural integration of virtual objects into real environments. It is achieved by integrating the advanced abstraction (a 3D scene graph) with the underlying accurate semantic and spatial collaborative perception, i.e., the fusion of geometric and semantic information densely reconstructed and labeled in the scene. A similar idea is also proposed by ROSINOL et al. [2], stating that such an ideal level of abstraction will be more practical and crucial for later augmented reality/mixed reality (AR/MR) systems. Even more, linked by such a mechanism, the massive knowledge map combined with natural language expressions, as well as the above deep understanding of physical environments, can be collaboratively learned and managed.
In the above sections, we present a multitude of applications with various target domains enabled by semantic and spatial collaborative perception. While some applications are mostly insensitive to processing time and response time, others, with legitimate and extremely useful use cases, should work in an instant-reaction manner (e.g., semantic AR contents manipulation). Methods for image and video manipulation are as old as the media themselves, and understanding-based structured visual editing is currently common, for example, in the Internet industry. Neural metric-semantic understanding approaches have the potential to lower the barrier to entry, making manipulation technology accessible to non-experts with limited resources. Although we believe that all the methods discussed in this paper have the potential to positively influence the world via better content creation and storytelling, we must not be complacent. It is important to proactively discuss and devise a plan to systematically arrange the sub-modules of the above methods under the 5G edge computing scenario, for instant reaction and also for privacy protection purposes. It is also critical to note that understanding-based synthesis of images and videos is extremely resource- and power-consuming. We further believe that it is essential to address the significant privacy concerns raised by directly uploading visual raw data to cloud-based semantic and spatial collaborative perception systems, for example for localization purposes, even if only derived image features are uploaded.
▲Figure 6. Illustration of semantic AR contents manipulation: (a) retargetable AR; (b) a framework that retargets the AR scene to various real scenes by comparing the AR scene graph with the 3D scene graphs constructed in each of the scenes [19]
Related topics regarding “to cloud or not to cloud” were first explored by NAQVI et al. [40], and then extended to edge computing architectures, even with 5G, by BARESI et al. [41–43]. In evaluating the added value of cloud computing as a key enabler for AR applications on mobile devices [40], the authors disclose an important principle: the latency due to the connectivity type and the amount of data to be communicated is a major trade-off, and the dynamic deployment and reconfiguration of framework components between the mobile and cloud ends are really important. Furthermore, with respect to the final quality-of-experience requirements, context-awareness-based resource allocation at the wireless network edge [40, 42] and the adoption of serverless edge computing architectures [41–43] have become the consensus. With the deployment of services to the cloud, the initially widely ignored privacy concerns become an emerging key challenge. The possibility of privacy leakage was strikingly demonstrated in Ref. [44], even when only the extracted features are uploaded.
The importance of developing corresponding safe disclosure technologies and building corresponding communities has become urgent. Such safeguarding measures would reduce the potential for misuse while allowing creative uses of semantic and spatial collaborative perception technologies. In one recent example in the field of image-based localization [45], the authors adopted a cloud-based “obfuscate upload” approach, refraining from uploading the full 3D points of structure-from-motion maps and instead uploading random line features lifted from the 2D/3D feature points.
Learning from this example, we believe researchers and related business operators must make privacy-preserving strategies a key part of all edge-based semantic and spatial collaborative perception systems with a potential for misuse, not an afterthought. It is also important that we, as a community, continue to develop responsible neural metric-semantic understanding techniques that enable cloud-based semantic and spatial collaborative perception solutions without sacrificing the privacy of users, by hiding the privacy-sensitive contents of the uploaded media information.
Metric-semantic understanding, and especially its new neural extension, has raised a lot of interest in the past few years. This paper investigates the linkage between the classical and concurrent explorations and a variety of directions related to the topic, which reflects the immense increase of research in this field. The target application is not bound to a specific one but spans a variety of use cases that range from novel view synthesis and relighting to the manipulation of semantic contents for AR. We believe that metric-semantic understanding will have a profound impact on making complex structured vision understanding and editing tasks accessible to a much broader audience. We hope that this article, which especially focuses on neural metric-semantic understanding, can introduce such modern perception capability to a large research community, which in turn will help to develop the next generation of computer vision applications in this direction.