(School of Data and Computer Science,Sun Yat-sen University,Guangzhou,Guangdong 510006,China)
Abstract:Edge computation offloading allows mobile end devices to execute compute-intensive tasks on edge servers.End devices can decide whether the tasks are offloaded to edge servers,cloud servers or executed locally according to current network condition and devices'profiles in an online manner.In this paper,we propose an edge computation offloading framework based on deep imitation learning (DIL) and knowledge distillation (KD),which assists end devices to quickly make fine-grained decisions to optimize the delay of computation tasks online.We formalize a computation offloading problem into a multi-label classification problem.Training samples for our DIL model are generated in an offline manner.After the model is trained,we leverage KD to obtain a lightweight DIL model,by which we further reduce the model's inference delay.Numerical experiment shows that the offloading decisions made by our model not only outperform those made by other related policies in latency metric,but also have the shortest inference delay among all policies.
Keywords:mobile edge computation offloading;deep imitation learning;knowledge distillation
Nowadays more and more end devices are running compute-intensive tasks,such as landmarks recognition apps in smartphones[1],vehicles detection apps used for traffic monitoring in cameras[2],and augmented reality apps in Google Glass.The advantages of executing compute-intensive tasks on end devices are twofold.On the one hand,most data,such as images,audios and videos,are generated at end devices.Compared with sending these data to the cloud server,processing data locally on end devices can avoid time-consuming data transmission and reduce heavy bandwidth consumption.On the other hand,some tasks are sensitive to latency and the execution result can be out of date if being late.In some cases (e.g.,face recognition applications),high latency can result in poor user experience.If computation tasks are offloaded to the cloud,the unreliable and delaysignificant wide-area connection can be problematic.Hence,executing compute-intensive tasks on end devices is a potential solution to lower end-to-end latency.
However,compared with cloud servers,the computing resources of end devices are very limited.Even a smartphone's computing capability is far weaker than a cloud server,not to mention the Google Glass and cameras.It turns out that executing compute-intensive tasks on end devices may result in high computation latency.In addition,end devices often have energy consumption restrictions;for example,most smartphone users do not want a single app to consume too much power.Thus,it is unwise to execute tasks on end devices indiscriminately.
Recently,edge computing has emerged as a new paradigm different from local execution and cloud computing,and has attracted more and more attention.The European Telecommunications Standards Institute provided a concept of multi-access edge computing (MEC)[3].In the MEC architecture,distributed edge servers are located at the network edge to provide computing capabilities and IT services with high bandwidth and real-time processing.Edge servers become the third offloading location of compute-intensive tasks in addition to end devices and cloud.However,due to edge servers'restricted computing capability,they cannot completely take place of cloud servers.Many factors,including available computation and communication resources,should be taken into consideration when making offloading decisions.To tackle this challenge,in this paper,we design a computation offloading framework which jointly considers computation and communication and dynamically makes optimal offloading decisions to minimize the end-to-end execution latency.
Recent advances in deciding offloading strategies focus on learning-based methods.YU et al.[4]propose to“imitate”the optimal decisions of traditional methods by deep imitation learning (DIL),where DIL[5]uses instances generated from human's behaviors to learn the decision strategies in specific environments.DIL enjoys two advantages compared with traditional methods[6]and deep reinforcement learning methods[7].First,inference delay of DIL is much shorter than that of traditional methods especially when the amount of input data is large (as shown in our experiment in Section 5).Second,DIL has higher accuracy in imitating optimal offloading decisions compared with approaches based on deep-reinforcement-learning(DRL).
However,the DIL model is built upon deep neural network(DNN),which is compute-intensive and typically requires high inference latency.On this issue,model compression[8]is proposed,and knowledge distillation (KD) is one of the solutions[9].The idea behind KD is similar to transfer learning.KD not only effectively reduces the size of the neural network and improves the inference efficiency,but also improves the accuracy in the case where training samples are insufficient and unbalanced,which may appear in DIL training phase.Hence,we believe that applying KD can benefit the deployment of DIL model.
In this article,we leverage the emerging edge computing paradigm and propose a framework based on DIL and KD,which jointly considers available computation and communication resources and makes fine-grained offloading decisions for end devices.The objective of the proposed framework is to minimize the end-to-end latency of compute-intensive tasks on end devices.We use offloading decision instances to train our DIL model offline and compress the model to a lightweight one by KD online for quickly making near-optimal offloading decisions.
The rest of this article is organized as follows.We briefly review related works in Section 2.We explain how to build a DIL model and use it in computation offloading decisions in Section 3.Then we describe how to use KD to further optimize the performance of the DIL model in Section 4.Numerical experiment results are shown in Section 5.At last we discuss some future directions and conclude in Section 6.
To achieve lower latency or energy,mobile end devices usually choose to offload tasks to the cloud or edge servers.However,due to the complexity of network conditions in practice,for different devices at different times,the optimal computation offloading decisions are different.It is difficult to find this optimal decision in real time.Traditional computation offloading strategies are mostly based on mathematical modeling.Researchers in Ref.[6]study the computation offloading problem in multi-user MEC environment.They firstly prove that finding the best offloading strategies in multi-channel and multiuser condition is NP-hard.Then they model this problem as an offloading game and design a distributed approach to reach the Nash equilibrium.The authors in Ref.[10]study offloading video objects detection tasks to cloud server.In Ref.[10],a big YOLO model is deployed in cloud while a lite YOLO model is deployed at end devices.Many factors such as bit rate,resolution and bandwidth are considered and the offloading problem is formulated into a multi-label classification problem.A near-optimal solution is found by an iteration approach and it successfully achieves higher accuracy in video objects detection.The main disadvantage of mathematical modeling methods is that they are so complicated that they may cause non-negligible inference delays and are difficult to be deployed in MEC network.
One of the typical compute-intensive tasks is DNN inference,on which many researchers study specialized computation offloading strategies.KANG et al.[11]propose Neurosurgeon framework for DNN offloading.Neurosurgeon divides DNN into two parts.One part runs at end devices and the other runs at the cloud.This method reduces the calculation at end devices,and tries to find a balanced point between computation and transmission.Neurosurgeon evaluates the latency of each DNN layer by regression models offline,and uses these models to calculate the best divided point online tailored to end devices'performance and bandwidth.
Recently,some researchers introduce DRL to find computation offloading strategies.In this case,the latency or energy consumption serves as agents'reward.The authors in Ref.[7]consider a condition of vehicular networks based on software defined network and jointly optimize networking,caching,and computer resource by a double-dueling Deep-Q-Network.The main drawback of DRL-based approaches in computation offloading is that the offline training and online inference takes many overheads.To tackle this challenge,we propose to utilize DIL for computation offloading,the training cost and inference latency of which are significantly lower than those of DRL.
DIL refers to training agents to imitate human's behaviors by a number of demos.Compared with DRL,training and inference time of DIL is much shorter.The authors in Ref.[4]build an edge computation offloading framework based on DIL.However,since DIL is based on DNN,if the size of DNN grows too large,it may still result in high inference delay.On this issue,we use Knowledge Distillation to compress the DIL model.
KD is firstly proposed in Ref.[9],where the authors show that small DNNs can achieve approximately high accuracy as large DNNs with relatively less inference latency.This motivates us to compress the models to reduce inference delay with tiny accuracy loss.In KD,a large DNN is trained on a large training set and a lite DNN is trained on a small training set whose labels are the output of large DNN after“softened”.
In our work,we compress our DIL model through KD to further reduce the inference delay,and improve the model's performance when training samples are missing and unbalanced.
We study the problem of making fine-grained offloading decisions for a single end device user.A compute-intensive taskAon end device needs to be executed.We firstly split taskAinto some subtasks,following Ref.[12].Each subtask can be denoted by a tupleat=(t,εt,dt,dt+1).TaskAcan be seen as a set of all subtasksat.Andεtrepresents the computation complexity of thet-th subtask (usually in central processing unit (CPU) cycles).All of the computation complexity forms a setE={εt|t?[0,|A|)}.Thedtdenotes the size of input data of thet-th subtask (usually in bytes).Whent=0,d0represents the size of input data of taskA.dt+1denotes the size of output data of thetth subtask,and is also the input data size of the (t+1)-th subtask.Whent=|A|,d|A|represents output size data of taskA.Sizes of all data flow jointly form the setD={dt|t∈[0,|A|+1)}.
▲Figure 1.Subtasks are offloaded to end device,edge server and cloud server respectively.
As shown inFig.1,during the runtime of the mobile end device,a wireless connection with an edge server is established,and the edge server maintains a connection with the cloud server through the Internet.When a computation task in the end device needs to be executed,it will be divided into some subtasks.Each subtask can choose to be executed locally on end device or sent to the edge server.When the edge server receives a requirement of execution of a subtask,it can decide whether to execute it locally on edge server or further send it to cloud server.Execution of a subtask leads to computation latency,which depends on the profile of end device and edge server and the computation complexity of subtasksE.If two adjacent subtasks are offloaded to different locations,transmission latency will also occur,which mainly depends on the bandwidth between end device,edge server and cloud server and transmission data sizeD.In this paper,due to the strong computing capability of the cloud server,cloud computation latency is far less than the transmission latency.Hence,when the subtask is offloaded to cloud server,the computation latency can be ignored and only the transmission latency is concerned.
When a computation task needs to be executed,end devices split it into some subtasks and evaluate computation complexityEand transmission data sizesDof all subtasks.We can leverage the method introduced in Ref.[12]to evaluateEandD.Then all subtasks,E,Dand the computing capability of the end device (denoted byp1) are sent to edge server;p1can be measured in CPU frequency (in Hz).The edge server measures the bandwidth between the end device and edge server (denoted byb1) and the bandwidth between the edge server and cloud server (denoted byb2).Factors mentioned above and the computing capability of edge server (denoted byp2) jointly form the description of current offloading requirementS=(E,D,p1,p2,b1,b2).The edge server is responsible for making offloading decisions of each subtask according toS.
For each subtaskat,its offloading decision is represented byIt∈{0,1,2}.It=0,1,2 indicates that subtaskatis executed at end device,edge server or cloud server respectively.Offloading decision of the whole taskAis given byI={It|t∈[0,|A|)}.Obviously,|I|=3|A|.The offloading problem turns into finding the offloading decisionIwith the shortest end-to-end latency according to givenS.
Now we compute the end-to-end latency of a specificI.As we have discussed,end-to-end latency can be divided into computation latency and transmission latency.Letdenote the computation latency oft-th subtask.WhenIt=0,1,the subtask is executed at end device or edge server,hence=εt/p2,respectively.WhenIt=2,as mentioned in Section 3.1,computation latency at cloud server is ignored,hence=0.GivenSand offloading decisionI,computation latency of the whole taskAis:
Our goal is to find the offloading decisionI*with the shortest end-to-end latency,which is:
So far,we have formulated computation offloading problem to an end-to-end latency minimization problem.By changing the parameter of argmin to energy,we can switch optimization objective to the energy consumption.LetSrepresent the description of offloading requirement,Irepresent the offloading decision,Rexec(S,I)be the energy consumption of computation andRtrans(S,I) be the energy consumption of transmission.Then the best offloading decisionI*is:I*=argminI(Rexec(S,I)+Rtrans(S,I)).If it is required to optimize latency and energy simultaneously,we can set the parameter of argmin to a weighted sum of latency and energy.
The above minimization problem can be considered as a combinatorial optimization problem.Existing technologies such as traditional offloading algorithms or reinforcement learning are difficult to solve such problems efficiently.Hence,we apply DIL to deal with it.Finding the best offloading decisionI*can be formulated to a multi-label classification problem[13].DecisionIis a set of|A|labels and the three values ofItcorresponding to three classes.The idea of DIL is to use a DNN to learn the mapping fromSto the best offloading decisionI*.To this end,offloading requirementScan serve as features of input samples andI*serves as the real labels of samples,as shown inFig.2.
DIL for offloading consists of three phases described as follows:
1) Generate training samples offline.DIL is supervised learning and it needs a number of features labels pair(S,I*).The featureScan be obtained by collecting the actual offloading task requirement,or randomly generating features based on the distribution of various parameters in the actual offloading task requirement.Since labelsI*are generated offline,some expensive non-real-time algorithms can be applied.In addition,performance of our DIL model is limited by the quality of labels,and only the labels with high accuracy can ensure highly accurate DIL model.Note that the size of decision space is 3|A|.In summary,when |A| is small,we can use an exhaustive approach to obtain the optimal offloading decision by searching the whole decision space.When |A| is large,we solve this problem as integer programming problem by existing efficient solvers such as CPLEX.
▲Figure 2.Deep imitation learning model for edge computing offloading.Given the offloading requirement S=(E,D,p1,b1,p2,b2) as the input,the deep imitation learning model can output the offloading decision I*=(a1,a2,a3,a4,a5,a6).
2) Train DIL model offline.We train a DNN model to learn the mapping fromStoI*.In this multi-label classification problem,the output of DNN consists of predictions of |A| labels.Each prediction has three possibilities corresponding to three values ofIt.Hence the output layer of DNN has 3× |A|neurons and the activation function is SoftMax.All hidden layers are full connected layers.
3) Make offloading decisions online.After our DIL model is trained,it is deployed to edge server to make offloading decisions online.Experiment shows that the efficiency of DIL model inference is higher than baseline models.
DIL is based on learning.DIL's performance is closely related to the training samples.If the training samples are diverse,DIL model can deal with more conditions,i.e.,it becomes more robust.If training samples contain offloading requirement under the conditions with fluctuation of wireless channels,DIL model can learn how to make a good decision under these conditions.In practice,training samples are from actual offloading requirement.The fluctuation of the wireless channels is also covered.
After the DIL model is trained,we should consider where the DIL model is deployed for online inference.Same as the computation tasks,DIL model can be deployed on end devices,edge server or cloud server.However,if DIL model is deployed on the cloud server,the wide-area connection will become an unstable factor.To ensure model's performance,we expect that the inference result of DIL model can be obtained with a low and predictable delay.Hence,even though the computing capability of cloud server is much stronger,it is not recommended to deploy DIL model on cloud server.In addition,since having all model inference workload on end device may lead to high energy consumption,we believe that edge server is a better place for DIL model deployment.
Since our DIL model is based on compute-intensive DNN execution,the inference latency could be high due to the limited computing capability of edge servers.We hope that the DIL model running on the edge server is lightweight and the model inference delay is minimized.Towards that,a potential solution is to put the three phases mentioned above into edge server to train a DIL model based on small DNN locally on edge server.However,it raises two problems.First,limited by the number of parameters,the learning capability of a small DNN is insufficient.Compared with large DNN,it may cause loss of accuracy and make performance worse.Second,in the phase of generated demo offline,training samples are obtained by collecting the actual offloading task requirement or randomly generated based on distribution of various parameters in the actual offloading task.However,the service area of an edge server is highly limited.Compared with the samples collected by cloud server,samples collected by edge server may be not enough and unbalanced.This further incurs the accuracy and performance of small DNN.To this end,directly training a lightweight DIL model on edge server is not practical[15].
The authors in Ref.[9]proposed KD,which can be used for DNN compression.This technology helps us transfer the knowledge from a large DNN to a small DNN.When the training samples are inadequate and unbalanced,accuracy of the DNN trained by KD is higher than that of the DNN directly trained on samples.Large DNN is called the“teacher”and small DNN is called the“student”.Back to our offloading problem,we can leverage the strong computing capability of cloud server and a number of samples to train a large DNN with high accuracy to serve as the teacher,and then transfer the knowledge learned by large DNN to small DNN which is deployed to edge server by KD,achieving low inference delay and small scale with tiny loss of accuracy,as shown inFig.3.
KD can be applied to any neural networks whose output layer is activated by SoftMax;in other words,the networks are used for solving classification problem.In KD,we train two networks,the teacher network and the student network.Training the teacher network is the same as training conventional network,and training the student network is also similar.The only difference is that the initial labels of student network before training are from the teacher network's trained labels,rather than from the training dataset.
In some cases,teacher network's trained labels may be very small and close to zero (e.g.,<10-3),which is nearly the same as the original one-hot encoded labels and remains difficult for student network to learn the differences between labels.To alleviate this problem,we amplify the differences by further“softening”the labels.Letpibe the probability of thei-thclass predicted by the teacher,andqiis the softened probability corresponding topi.We slightly change the form of the softening formula in Ref.[9]to computeqi:
whereCis the total number of classes,in our offloading problemC=3.Tis a tunable hyper-parameter with the constraintT≥1.IfT=1,qi=pi.The labels will be softer with higherT.For instance,if original label is (0.999,2 × 10-4,3× 10-6),whenT=5,the softened label will be (0.71,0.20,0.09);whenT=10,the softened label will be(0.53,0.28,0.19).In the following experiment we setT=5.Back to the offloading problem,we use a teacher network trained at cloud server to predict labels of the training set obtained by edge server.Then soften these labels by the formula mentioned above and train student network by softened labels at edge server.
We show the complete flowchart of our DIL offloading framework with KD inFig.4.
▲Figure 3.Compress model by knowledge distillation to get a lightweight model deployed to edge server.
▲Figure 4.Complete flowchart of our edge offloading framework based on DIL and KD.
In this section,we set up a numerical experiment to evaluate the performance of DIL model described in Section 3.We consider that an MEC network consists of an end device user and an edge server connected by wireless connection,meanwhile the edge server connects to cloud server via the Internet[16].We assume that the compute-intensive taskAon end device is divided into 6 subtasks,which is |A|=6.If the number of subtasks of some computation tasks is not 6,we can merge some subtasks or insert empty subtasks to make the number of subtasks 6.The computation complexity of each subtasksεt(measured in CPU cycles) is in the interval of[0,2 000]× 106,following uniform distribution.Sizes of data transmission between subtasks follow uniform distribution withdt∈[0,10]MB,like the setting in Ref.[14].In addition,we assume that the computing capability of end device and edge server (both measured by CPU frequency in Hz) is in the intervals of [100,1000]MHz and [500,5 000]MHz respectively,both following the uniform distribution.The bandwidth between end device and edge server and the bandwidth between edge server and cloud server are uniformly distributed inb1∈[0,2]MB/s andb2∈[0,3]MB/s respectively.We randomly generate 100 000 samples offline to train DIL model and 10 000 testing samples for testing.
Our DIL model is based on a DNN with 5 hidden layers.All hidden layers are full connected layers and consist of 256 neurons.The number of parameters in the whole DNN is 1.6 million.Activation function of hidden layers is RELU and output layer is activated by SoftMax.To evaluate the performance of our DIL based offloading framework,we consider some baseline frameworks listed follow:
1) Optimal.Exhaustive method:For each sample,search the whole 3|A|decision space,compute the latency described in Section 3.2 and choose the offloading decision with minimal latency.Note that this minimal latency is the lower bound in the decision space.Hence,this decision is bound to be optimal.
2)Greedy.For each sample,find the offloading location one by one for each subtask to minimize the computation and transmission latency of current subtask.
3) DRL.Offload framework based on deep reinforcement learning.Features of samples serve as environment and offloading decisions serve as actions.The opposite number of latency acts as reward.The deep Q network is similar to that in Ref.[7].
4) Others.Local:The whole task is executed on end device,which is for anyt,It=0;Edge:All subtasks are executed on edge server,which meansIt=1;Cloud:All subtasks are offloaded to cloud server,which isIt=2;Random:Randomly choose offloading location for each subtask,that is to say,Itare randomly chosen from{0,1,2}.
Fig.5shows the normalized latency of the DIL model and baseline frameworks with the latency of optimal decision are normalized to 1.0,and then the latency of decision made by our DIL model is 1.095,with an increase less than 10%.Experiment results show that our model outperforms other baseline frameworks.Note that latency of“Edge”is less than“Local”and“Cloud”,which indicates that edge server can certainly improve the compute-intensive tasks in end-to-end latency.At last,latency of“Random”is far higher than others,this is because randomly choosing offloading location will cause high transmission latency,which is expectable.
As mentioned in Section 4,we should compress our DIL model before deploying it to edge server and deal with the situation in which training samples on edge server are insufficient and unbalanced.We call our compressed model“KD-DIL”for short.In this section,we assume the CPU cycles of subtasks are uniformly distributed inεt∈[500,1500]× 106.Sizes of transmission data between subtasks are indt∈[3,8]MB,following uniform distribution.The distribution range ofεtanddtis reduced by half compared with that in Section 5.1.Distributions of other parameters remain the same.In order to simulate the case in which training samples are insufficient,we only generate 1 000 samples for training in this section,reduced by 99%compared with that in Section 5.1.Testing samples remain the same as that in Section 5.1.
Our KD-DIL model is still based on DNN consisting of full connect layers.There are only 2 hidden layers in DNN with 32 neurons in each layer.The number of parameters of the whole DNN is about 10 000,reduced by 99.375% compared with that in Section 5.1.The following baseline models are used for evaluating the performance of our KD-DIL model.
1)Baseline DIL:This DIL model is based on the DNN which is same as that in KD-DIL.The difference is that Baseline DIL is directly trained on the training set described above without applying KD described in Section 4.
▲Figure 5.Normalized end-to-end latency of offloading decisions made by our DIL model and baselines.
2) DRL:Deep reinforcement learning based on DQN.The difference between this and DRL model in Section 5.1 is that it is trained on training set with 1 000 samples described above instead of that with 100 000 samples described in Section 5.1.
3) Greedy:Same as Greedy in Section 5.1.
Fig.6shows the normalized latency of KD-DIL models and baseline models.Again,the latency of optimal decision is normalized to 1.0.It shows that our KD-DIL model still outperforms baseline models.Note that the performance of DRL has a sharp decreasing compared that in Section 5.1 because of the change of training set.It is further shown that when the number and distribution of training samples are changed,the accuracy loss of our KD-DIL model is relatively small.
At last,Table 1shows the normalized inference delay of all models with delay of“Greedy”being normalized to 1.00,since the greedy method is the typical method for computation offloading.We measure the delay of making 100 000 decisions of all the models,and divide this delay by 100 000 to get the average delay of each decision.As shown in Table 1,compared with the large DIL model,the inference delay of KD-DIL model decrease by 63% (0.17/0.51).Table 1 shows that the inference delay of the Greedy approach is slightly higher than DIL model.As described in Section 5.1,the Greedy approach finds deployment place for each subtask by iterations.The number of iterations equals to that of subtasks.In practice,the number of subtasks may be much higher than 6,so the inference delay of the Greedy approach may become correspondingly higher.
Lastly,the inference of the optimal approach and DRL is hundreds of times that of our DIL models.Because optimal apply exhaustive method,high inference delay is expectable.While making decisions by DRL,we treat each strategy as an action and end-to-end latency as reward.We calculate each action's reward to find the highest reward,which needs many times of DNN inference.Hence,the delay of DRL inference is much higher than DIL.
▲Figure 6.Normalized KD-DIL model and baselines when using a small training set.
Flowcharts of subtasks can be represented by directed acy-clic graph (DAG) known as computation graph.In computation graph,nodes denote subtasks,edges denote data flow and directions of edge represent data transmission directions.DNN can also be regarded as a computation graph.In many programming frameworks dedicated to deep learning,such as TensorFlow,the concept of computation graph is applied.Offloading a computation graph in MEC network to optimize end-to-end latency is a difficult problem.The subtasks flowchart studied in this article has a list structure.In our future work we will focus on how to modify our work to adapt to DAG.
▼Table 1.Inference delay of all models
In this article,we have studied fine-grained edge computing offloading framework.In the situation in which an end device wirelessly connects to an edge server,compute-intensive tasks can choose to be executed at end device,edge server or cloud server.We first review existing edge offloading framework including mathematic model method (game theory) and reinforcement learning.Then we provide model of computing task and describe the execution process of a task.Offloading problem is formulated into a multi-label classification problem and is solved by a deep imitation learning model.Next,in order to deal with the insufficient and unbalanced training sample,we apply knowledge distillation to get a lightweight model with tiny accuracy loss,making it easier to be deployed to edge server.Numerical experiment shows that the offloading decisions made by our model have the lowest end-to-end latency and the inference delay of our model is the shortest,and after knowledge distillation we successfully reduce the inference delay by 63% with tiny accuracy loss.At last we briefly discuss some future directions of edge computation offloading.