Yanni Wan, Jiahu Qin, Xinghuo Yu, Tao Yang, and Yu Kang
Abstract—This paper studies price-based residential demand response management (PB-RDRM) in smart grids, in which both non-dispatchable and dispatchable loads (including general loads and plug-in electric vehicles (PEVs)) are involved. The PB-RDRM is composed of a bi-level optimization problem, in which the upper-level dynamic retail pricing problem aims to maximize the profit of a utility company (UC) by selecting optimal retail prices (RPs), while the lower-level demand response (DR) problem expects to minimize the comprehensive cost of loads by coordinating their energy consumption behavior. The challenges here are mainly two-fold: 1) the uncertainty of energy consumption and RPs; 2) the flexible PEVs’ temporally coupled constraints, which make it impossible to directly develop a model-based optimization algorithm to solve the PB-RDRM. To address these challenges, we first model the dynamic retail pricing problem as a Markovian decision process (MDP), and then employ a model-free reinforcement learning (RL) algorithm to learn the optimal dynamic RPs of the UC according to the loads’ responses. Our proposed RL-based DR algorithm is benchmarked against two model-based optimization approaches (i.e., the distributed dual decomposition-based (DDB) method and the distributed primal-dual interior (PDI)-based method), which require exact load and electricity price models. The comparison results show that, compared with the benchmark solutions, our proposed algorithm can not only adaptively decide the RPs through an on-line learning process, but also achieve larger social welfare within an unknown electricity market environment.
The rapid development of information and communication technologies (ICTs) in power systems, especially the introduction of two-way information and energy flow, has led to a revolutionary transition from the traditional power grid to the smart grid [1]. The smart grid, a typical cyber-physical system (CPS), integrates advanced monitoring, control, and communication techniques into the physical power system to provide reliable energy supply, promote the active participation of loads, and ensure the stable operation of the system [2]. Due to the cyber-physical fusion characteristics of the smart grid, demand response management (DRM) has become a research hotspot in the field of energy management [3], [4]. The purpose of DRM is to utilize the changes in energy usage of loads to cope with time-varying electricity prices or reward/punishment incentives, so as to achieve cost reduction or other interests [5].
The existing literature mainly focuses on two branches of DRM, namely price-based DRM (PBDRM) and incentive-based DRM (IBDRM) [6]. The PBDRM encourages loads to adjust their energy usage patterns in accordance with time-based pricing mechanisms, such as real-time pricing [7] and time-of-use (TOU) pricing [8]. The IBDRM instead provides loads with rewards/punishments for their contribution/failure in demand reduction during peak periods [3]. Although both DRMs can promote the active participation of loads, as mentioned in [6], PBDRM is more common than IBDRM, so this study mainly focuses on the PBDRM.
Up to now, many efforts have been devoted to investigating the PBDRM [9]–[16], mainly from social and individual perspectives. From the social perspective, one expects to maximize the social benefit, including the interests of both the utility company (UC) and users. For example, the work in [9] studies the distributed real-time demand response (DR) in a multiseller-multibuyer environment and proposes a distributed dual decomposition-based (DDB) method to maximize social welfare, i.e., the comfort of users minus the energy cost of the UC. Another work in [10] proposes a distributed fast-DDB DR algorithm to obtain the consumption/generation behavior of the end-user/energy-supplier that yields the optimal social welfare. In addition to common residential loads, the authors in [11]–[14] further consider a holistic framework which, under a dynamic retail pricing strategy, optimizes and controls the building heating, ventilation and air conditioning (HVAC) system and the residential energy management of smart homes while achieving DR goals. From the individuals’ point of view, one prefers to reduce the electricity bills of users or to maximize the revenue of the UC by selecting appropriate pricing mechanisms. For example, the work in [15] studies a deterministic DR with day-ahead electricity prices, aiming to minimize the energy cost for customers. Other works focus on the benefit of the UC; for instance, the objective in [16] is to minimize the energy cost of the UC.
Nevertheless, most of these works are either based on a given pricing mechanism (e.g., TOU pricing in [8] and day-ahead pricing in [15]) or a predetermined abstract pricing model (e.g., the linear pricing strategy in [17]). That is to say, the existing PBDRM depends largely on deterministic pricing models which cannot reflect the uncertainty and flexibility of the dynamic electricity market. Additionally, in the long run, the UC expects to estimate/predict the impact of its current retail pricing strategy on the immediate and all subsequent responses of loads. However, in the existing works the UC is myopic and only focuses on the immediate response of loads to the current pricing strategy. In view of this, it is urgent to design a truly dynamic pricing mechanism that can adapt to flexible load changes and the dynamic electricity market environment. Moreover, it is necessary to develop an effective methodology to solve the dynamic PBDRM under an unknown electricity market environment.
The development of artificial intelligence (AI) has prompted many experts and scholars to adopt learning-based methods (a powerful tool for sequential decisions within unknown environments [18]) to solve decision-making problems arising in the smart grid, such as the PEV charging problem [19], the energy management problem [20]–[24], and the demand-side management (DSM) problem [25], [26]. Specifically, in [19], the authors use a reinforcement learning (RL) algorithm to determine the optimal charging behavior of an electric vehicle (EV) fleet without prior knowledge about the exact model of each EV. The work in [20] develops an RL-based algorithm to solve the problems of dynamic pricing and energy consumption scheduling without requiring a priori system information. Some other works, see [21], [23]–[26], focus more on energy management and DSM (usually referring to the DR of load units). For instance, in [21], the authors study the distributed energy management problem by means of RL, in which the uncertainties caused by renewable energies and continuous fluctuations in energy consumption can be effectively addressed. Moreover, to further improve the scalability and reliability of learning-based approaches [22] and reduce the power loss during the energy trading between energy generation and consumption, the authors in [23] propose a Bayesian RL-based approach with coalition formation, which effectively addresses the uncertainty in generation and demand. A model-free RL-based approach is employed in [24] to train the action-value function that determines the optimal energy management strategy in a hybrid electric powertrain. When considering the DSM, the authors in [25] propose a multiagent RL approach based on Q-learning, which not only enhances the possibility of dedicating separate DR programs to different load devices, but also accelerates the calculation process. Another work in [26] investigates a building DR control framework and formulates the DR control problem as an MDP. On this basis, a cost-effective RL-based edge-cloud integrated solution is proposed, which shows good control efficiency and learning efficiency in different-sized buildings. However, the considered energy management and DSM problems only involve general dispatchable loads (i.e., loads whose energy usage changes with the RP, such as air conditioning and lighting) while ignoring non-dispatchable loads (i.e., loads whose energy usage cannot change at any time, such as a refrigerator) and a class of more flexible loads, such as plug-in electric vehicles (PEVs) [27].
Inspired by the application of RL in energy scheduling and trading, this paper adopts a model-free RL to learn the optimal RPs within an unknown electricity market environment. The main contributions are shown below:
1) This paper studies the price-based residential demand response management (PB-RDRM) in a smart grid, in which both the non-dispatchable and dispatchable loads are considered. Compared with the existing PBDRM with general dispatchable loads such as household appliances [9] and commercial buildings [11], this work innovatively considers a more flexible PEV load with two working modes of charging and discharging.
2) Unlike the existing works that focus on individual interests [15], [16], the considered PB-RDRM is modeled from a social perspective. Specifically, the PB-RDRM is composed of a bi-level optimization problem, where the upper level aims to maximize the profit of the UC and the lower level expects to minimize the comprehensive cost of loads. Therefore, the goal of PB-RDRM is to coordinate the energy consumption of all loads to maximize the social welfare (i.e., the weighted sum of UC’s profit and loads’ comprehensive cost) under the dynamic RPs.
3) Considering the uncertainty induced by energy consumption and RPs, as well as the temporally coupled constraints of PEVs, a model-free RL-based DR algorithm is proposed. The comparison results between the proposed model-free algorithm and two benchmarked model-based optimization approaches (i.e., distributed DDB [9] and PDI methods [10]) show that our proposed algorithm can not only adaptively decide the dynamic RPs by an on-line learning process, but also achieve the optimal PB-RDRM within an unknown electricity market environment.
The organization of this paper is as follows. Section II presents the problem statement. Section III provides the RL-based DR algorithm and Section IV conducts the simulation results to verify the performance of the proposed algorithm. Finally, the conclusions are drawn in Section V.
Consider a retail electricity market in a residential area as shown in Fig. 1, including the load (lower) and UC (upper) levels. Note that there is a two-way information flow between the UC and load levels. Specifically, the UC (a form of distribution system operator (DSO)) releases the dynamic RP information to the loads, while the loads deliver their energy demand information to the UC. The PB-RDRM aims to coordinate the energy consumption of a finite set N = {1, 2, ..., N} of residential loads within a time period T = {1, 2, ..., T} in response to the dynamic RPs, thereby maximizing social welfare. Since the problem model involves the information and energy interactions between the UC and the loads, their mathematical models are introduced first, after which the system objective is given.
Fig. 1. Retail electricity market model in residential area.
According to the users’ preferences and the loads’ energy consumption characteristics, the loads are usually classified into two categories [28], namely dispatchable loads N_d and non-dispatchable loads N_n. In this paper, in addition to the general dispatchable loads G, we consider a more flexible type of load, i.e., PEVs V. That is, N_d = G ∪ V.
1) General Dispatchable Loads: The consumed energy of a general dispatchable load n ∈ G is described as [28], [29]
where the two terms denote the energy consumption (kWh) and the energy demand (kWh) of general dispatchable load n at time slot t, respectively. Here the energy demand refers to the expected energy requirement of a load before it receives the RP from the UC, while the energy consumption is the energy actually consumed after the RP signal is received. ξ_t is the price elasticity coefficient, indicating the ratio of the energy demand change to the RP variation at time slot t. Note that ξ_t is usually negative, reflecting the reciprocal relationship between energy demand and electricity price [28]. η_t and λ_t respectively represent the RP ($/kWh) and the wholesale price ($/kWh) at time slot t, with η_t ≥ λ_t. The intuition behind (1) is that the current energy consumption of general dispatchable load n depends on the current energy demand information and the demand reduction resulting from the change in RP. Note that when general dispatchable load n consumes less energy than its demand at time slot t, the remaining required energy cannot be satisfied, thus causing load n to experience dissatisfaction. To characterize such dissatisfaction, a dissatisfaction function is defined as follows [30]:
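Equations (1) and (2) themselves are not reproduced above, but the following minimal Python sketch illustrates an elasticity-based consumption model and a dissatisfaction cost consistent with this description; the exact functional forms, the quadratic dissatisfaction shape, and the names demand_kwh, xi_t, and alpha_n are illustrative assumptions rather than the paper’s equations.

```python
def dispatchable_consumption(demand_kwh, xi_t, retail_price, wholesale_price):
    """Energy actually consumed by a general dispatchable load (assumed form of (1)).

    xi_t is the (negative) price elasticity: the ratio of the relative demand
    change to the relative RP deviation from the wholesale price.
    """
    relative_price_change = (retail_price - wholesale_price) / wholesale_price
    consumption = demand_kwh * (1.0 + xi_t * relative_price_change)
    return max(consumption, 0.0)  # consumption cannot be negative


def dissatisfaction(demand_kwh, consumption_kwh, alpha_n):
    """Dissatisfaction from unmet demand (assumed quadratic shape, cf. (2));
    a larger coefficient alpha_n means the load is more averse to curtailment."""
    unmet = max(demand_kwh - consumption_kwh, 0.0)
    return alpha_n * unmet ** 2
```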
where the two bounds denote the lower and upper limits of the battery capacity (kWh) of PEV n, respectively. Considering the overall interests of the electricity market, it is impossible to completely obey the PEV owners’ charging willingness, leading to the dissatisfaction of PEV owners. Thus, the following dissatisfaction function is defined [30]:
where κ ($/kWh) is the degradation coefficient.
3) Non-Dispatchable Loads: Since the energy consumption of non-dispatchable loads cannot be shifted or curtailed, these energy demands must be fully met at all times. Therefore, for any n ∈ N_n, one has
where the two quantities denote the energy consumption and the energy demand of non-dispatchable load n at time slot t, respectively.
From the point of view of the loads, one expects to decide the optimal energy consumption of all loads so as to minimize the comprehensive cost, which is described below:
For the UC, since it first purchases electrical energy from the grid operator at predetermined wholesale prices and then sells the purchased energy to various types of loads at RPs set by itself, its goal is to select optimal RPs so as to maximize its profit, i.e.,
Recall that the aim of the PB-RDRM is to adjust the energy usage patterns of loads to cope with the time-varying RPs, so as to maximize social welfare (including both the UC’s profit ($) and the loads’ comprehensive cost ($)) from a social perspective. Therefore, the considered PB-RDRM can be formulated as the following optimization problem:
where ρ ∈ [0,1] is a weighting parameter reflecting the relative social value of the UC’s profit and the loads’ comprehensive cost from the social perspective [29], [32]. It is worth mentioning that many optimization methods, such as two-stage stochastic programming, Lyapunov optimization techniques, model predictive control, and robust optimization [13], have been used to solve DRM problems similar to the optimization problem (14). Although these methods are relatively mature, they still have the following limitations: 1) They require prior knowledge of the exact load model. However, the model of some loads, like the PEVs considered in this paper, is affected by many factors, such as the temporally coupled SoC constraints [31] and the randomness of EVs’ commuting behavior [33], so an exact load model is often difficult to obtain or even unavailable. 2) They depend on accurate prediction of uncertain parameters (e.g., the RP in the current work). However, in most cases, the prediction error cannot be guaranteed to be small enough, thus degrading the performance of the optimization approaches. 3) Almost all of the above methods are off-line; the whole calculation process must be completed before the best result can be chosen, which is time-consuming when the problem size is large. To tackle the above limitations, we next adopt an RL-based approach which can adaptively determine the optimal policy through an on-line learning process without requiring the exact load model.
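For concreteness, the per-slot quantities entering the social welfare can be sketched in Python as below. The reward definition r_t = ρU_t − (1−ρ)C_t follows the MDP formulation in Section III; the decomposition of the loads’ cost into electricity bills plus dissatisfaction terms, and the function and variable names, are illustrative assumptions.

```python
def uc_profit(retail_prices, wholesale_price, consumptions):
    """Per-slot UC profit U_t: revenue at retail prices minus the wholesale
    purchase cost for the energy actually consumed by each load."""
    return sum((eta - wholesale_price) * p
               for eta, p in zip(retail_prices, consumptions))


def loads_cost(retail_prices, consumptions, dissatisfactions):
    """Per-slot comprehensive cost C_t of all loads: electricity bills plus
    dissatisfaction terms (illustrative decomposition)."""
    bills = sum(eta * p for eta, p in zip(retail_prices, consumptions))
    return bills + sum(dissatisfactions)


def social_welfare(rho, u_t, c_t):
    """Weighted per-slot social welfare, used as the RL reward:
    r_t = rho * U_t - (1 - rho) * C_t."""
    return rho * u_t - (1.0 - rho) * c_t
```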
This section discusses how to employ the RL method for UC to decide the optimal retail pricing policy so as to solve the PB-RDRM.
RL is a type of machine learning (ML) approach that evolved from behaviorist psychology, focusing on how an agent can find, within a stochastic/unknown environment, an optimal policy that maximizes cumulative rewards [18]. Different from supervised ML, RL explores the unknown environment through continuous actions, continuously optimizes its behavioral strategy according to the reward provided by the environment, and finally finds the optimal policy (i.e., a sequence of actions) that yields the maximum cumulative reward. Fig. 2 depicts the basic principle of RL. Specifically, at the initial moment the agent does not know what reward and next state the environment will produce when the current action is taken, and thus has no knowledge of how to choose actions to maximize the cumulative reward. To tackle this issue, at the initial state s_0, the agent randomly takes an action a_0 from the action set and acts on the environment, causing the state of the environment to move from s_0 to s_1 (purple arrows). At the same time, the agent receives an immediate reward r_0 from the environment (orange arrows). The process repeats until the end of one episode (the completion of a task or the end of a period of time). Moreover, the current action affects both the immediate reward and the next state as well as all future rewards. Therefore, RL has two significant characteristics, namely the delayed reward and the trial-and-error search.
Fig. 2. Basic principle of RL.
To determine the optimal RPs, we first use the RL framework to illustrate the retail electricity market model (see Fig. 3). Specifically, the UC acts as the agent, all the loads serve as the environment, the retail prices are the actions that the agent applies to the environment, the energy demand and energy consumption of the loads together with the time index represent the state, and the social welfare (i.e., the weighted sum of UC’s profit and loads’ comprehensive cost) is the reward. Then, we further adopt a discrete-time Markovian decision process (MDP) to model the dynamic retail pricing problem, as this is usually the first step when using the RL method [34], [35]. The MDP is represented by a quintuple (S, A, R, P, γ), wherein each component is described as follows:
Fig. 3. Illustration of RL framework for retail electricity market model.
Fig. 4. Energy demand/consumption of non-dispatchable loads.
Fig. 5. Energy demand of dispatchable loads. (a) General dispatchable loads. (b) PEVs.
1) State Set: S = {s_1, s_2, ..., s_T}, where s_t = (e_t, p_t, t). The environment state at time slot t is represented by three kinds of information, i.e., the energy demand e_t and energy consumption p_t of all loads, and the time step t;
2) Action Set: A = {a_1, a_2, ..., a_T}, where a_t = η_t. To be specific, the action at time slot t is the set of RPs η_t that the UC sets for all loads at that time;
3) Reward Set: R = {r_1, r_2, ..., r_T}, where r_t = ρU_t − (1−ρ)C_t. That is to say, the reward at time slot t is the social welfare received by the system at that time;
4) State Transition Matrix: P = {P_{ss′}^a}, where P_{ss′}^a = P{s_{t+1} = s′ | s_t = s, a_t = a} is the probability of the environment moving to the next state s′ when action a is adopted at state s. (Since the energy demand and consumption of loads are affected by many factors, the state transition probabilities are rather difficult to obtain; therefore, we employ a model-free Q-learning method to solve the dynamic retail pricing problem.)
5) Discount Factor: γ ∈ [0,1], indicating the relative importance of subsequent rewards with respect to the current reward.
One episode of the MDP is denoted by (s_1, a_1, s_2, r_1; a_2, s_3, r_2; ...; a_{T−1}, s_T, r_{T−1}). The total return of one episode, i.e., the cumulative reward, is the sum of the rewards collected over the episode. Due to the delayed reward feature of RL, the discounted future return from time slot t is usually expressed as r_t + γ r_{t+1} + γ² r_{t+2} + ···, where γ ∈ [0,1] is the discount factor: γ = 0 implies that the system is totally myopic and only focuses on the immediate reward, while γ = 1 means the system treats all rewards equally. Thus, to reflect the foresight of the system, one usually chooses an appropriate discount factor. Note that since we focus on the social welfare over the entire time horizon, the reward at each time slot is equally important; as a result, the discount factor is set to 1 in our problem formulation. In addition, denote the policy π as a mapping from states to actions, i.e., π: S → A. Then the retail pricing problem aims to seek the optimal policy π* that maximizes the cumulative return, i.e.,
After mapping the retail pricing problem to the MDP framework, the RL method can be used to seek optimal retail pricing policies. Here we adopt Q-learning, one of the model-free RL methods, to analyze how the UC chooses RPs while interacting with all loads to achieve the system objective (14). Almost all RL methods rely on the estimation of value functions, i.e., functions of states (or state-action pairs) that quantify the performance of the agent in a given state (or state-action pair). Thus, the basic principle of Q-learning is to assign an action-value function Q(s,a) to each state-action pair (s,a) and update it at each iteration so as to acquire the optimal Q(s,a). The optimal action-value function Q*(s,a) is defined as the maximum cumulative discounted future return starting from state s, taking action a, and thereafter following the optimal policy π*; it obeys the Bellman optimality equation [18], i.e.,
where E is the expectation operator capturing the randomness of the next state and reward, s′ ∈ S represents the state at the next time slot, and a′ ∈ A is the action adopted at state s′. Therefore, by acquiring Q*(s,a), one can immediately obtain the optimal policy by
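The display equations (16) and (17) referenced here did not survive extraction; written in the standard Q-learning form implied by the surrounding description, they would read as follows (the expectation, discounting, and argmax extraction below are the usual textbook forms and are assumed to match the paper’s equations):

```latex
% Assumed standard forms of the Bellman optimality equation (16)
% and the greedy policy extraction (17).
\begin{align}
Q^{*}(s,a) &= \mathbb{E}\left[\, r + \gamma \max_{a' \in \mathcal{A}} Q^{*}(s',a') \,\middle|\, s_t = s,\ a_t = a \right], \\
\pi^{*}(s)  &= \arg\max_{a \in \mathcal{A}} Q^{*}(s,a).
\end{align}
```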
The overall implementation of the RL-based DR algorithm is summarized in Algorithm 1. Specifically, to begin with, input a set of predefined parameters including the loads’ energy demands, dissatisfaction coefficients, price elasticity coefficients, wholesale prices, and weighting parameters, etc. Then initialize the action-value function Q_0(s,a) to zeros, and the UC learns the optimal RP policy by the following steps:
S1) Observe the initial state s_t and select an action a_t with an ε-greedy policy within the RP boundaries.
S2) After performing a_t, the UC obtains an immediate reward r_t = ρU_t − (1−ρ)C_t and observes the next state s_{t+1}.
S3) Update Q_k(s_t,a_t) by the following mechanism:
where θ ∈ [0,1] is the learning rate, indicating the extent to which newly obtained Q-values override the old ones.
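The update mechanism (18) is not rendered in this extraction; the standard one-step Q-learning update with learning rate θ, which the description above suggests, is (assumed form):

```latex
% Assumed standard form of the update in (18), with learning rate \theta
% and discount factor \gamma (set to 1 in this work).
\begin{equation}
Q_{k+1}(s_t,a_t) = (1-\theta)\,Q_k(s_t,a_t)
   + \theta\left[\, r_t + \gamma \max_{a' \in \mathcal{A}} Q_k(s_{t+1},a') \,\right]
\end{equation}
```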
S4) Check whether the end of one episode (i.e., the final time slot T) is reached; if not, go back to S1); otherwise, go to S5).
S5) Check the stopping criterion. Specifically, compare the values of Q_k and Q_{k−1} to see whether the algorithm has converged; if not, go to the next iteration; otherwise, go to S6).
S6) Calculate the optimal retail pricing policy based on (17).
S7) Calculate the optimal energy consumption of dispatchable loads by (1) and (4).
Remark 1: The basic principle of the ε-greedy policy (the most common exploration mechanism in RL) is to either choose a random action from the action set A with probability ε or select the action that corresponds to the maximum action-value function with probability 1−ε. Such an exploration and selection mechanism not only avoids complete randomness of the system but also promotes efficient exploration of the action space, and thus can adaptively decide the optimal policy (i.e., the dynamic RPs) through the on-line learning process. Moreover, the iterative stopping criterion is |Q_k − Q_{k−1}| ≤ δ, where δ is a small positive constant indicating the gap tolerance between the previous Q-value and the current one, ensuring that the Q-value eventually approaches the maximum. That is to say, the proposed RL-based DR algorithm is guaranteed to reach an optimal PB-RDRM solution within an unknown electricity market environment.
Algorithm 1 RL-Based DR Algorithm
1: Input: A set of predefined parameters
2: Initialize: An initial action-value function Q_0(s,a) = 0, k = 0, t = 0
3: Iteration:
4: For each episode do: t ← t+1, k ← k+1
5: Repeat:
6:   Step 1: Observe the state s_t (i.e., energy demand, energy consumption, and time step) and choose an action a_t (i.e., RP) using the ε-greedy policy
7:   Step 2: Calculate the immediate reward r_t = ρU_t − (1−ρ)C_t and observe the next state s_{t+1}
8:   Step 3: Update the action-value function Q_k(s_t,a_t) by (18)
9:   Step 4: Check whether the end of one episode is reached
10:  if t = T
11:    break;
12:  end if
13:  Step 5: Check the stopping criterion
14:  if |Q_k − Q_{k−1}| ≤ δ
15:    break;
16:  end if
17:  Step 6: Compute the optimal retail pricing policy by (17)
18:  Step 7: Compute the optimal energy consumption by (1) and (4)
19: Output: The optimal energy consumption profile
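A minimal Python sketch of the tabular Q-learning loop in Algorithm 1 is given below. The ε-greedy selection, the update with learning rate theta, and the stopping criterion follow the steps described above, while the environment interface (env.reset/env.step), the discretized price grid, and the convergence measure are illustrative assumptions.

```python
import random
from collections import defaultdict


def run_q_learning(env, price_grid, episodes=5000,
                   epsilon=0.1, theta=0.5, gamma=1.0, delta=1e-3):
    """Tabular Q-learning for the retail pricing MDP (illustrative sketch).

    Assumed interface: env.reset() returns a hashable discretized state;
    env.step(price) returns (next_state, reward, done), where the reward is
    rho*U_t - (1-rho)*C_t as in Algorithm 1.
    """
    Q = defaultdict(float)  # Q[(state, action)], initialized to zero

    for _ in range(episodes):
        state = env.reset()
        max_change = 0.0
        done = False
        while not done:
            # epsilon-greedy selection over the discretized RP grid (Step 1)
            if random.random() < epsilon:
                action = random.choice(price_grid)
            else:
                action = max(price_grid, key=lambda a: Q[(state, a)])

            next_state, reward, done = env.step(action)  # Step 2

            # one-step Q-learning update with learning rate theta (Step 3)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in price_grid)
            change = theta * (reward + gamma * best_next - Q[(state, action)])
            Q[(state, action)] += change
            max_change = max(max_change, abs(change))
            state = next_state

        # stopping criterion analogous to |Q_k - Q_{k-1}| <= delta (Step 5)
        if max_change <= delta:
            break

    # greedy retail pricing policy extracted from the learned Q-table (Step 6)
    policy = {s: max(price_grid, key=lambda a: Q[(s, a)])
              for (s, _a) in list(Q.keys())}
    return Q, policy
```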
This section conducts case studies to verify the performance of the proposed RL-based algorithm. In particular, two model-based optimization approaches (i.e., the distributed DDB [9] and PDI [10] methods) are adopted as benchmarks for comparison. The algorithms are implemented in MATLAB R2016a on a desktop PC with an i3-6100 CPU @ 3.70 GHz, 8 GB of RAM, and a 64-bit Windows 10 operating system.
1) Performance Evaluation: In this case, we consider the DRM of a residential area with 5 non-dispatchable loads and 10 dispatchable loads (including 6 general dispatchable loads and 4 PEVs) over a whole day (i.e., 24 hours). The energy demand profiles of non-dispatchable and dispatchable loads are obtained from San Diego Gas & Electric [36] and shown in Figs. 4 and 5. As shown in Fig. 5, the energy demand trends of all six general dispatchable loads are almost the same, resulting in two demand peaks (i.e., 10:00–15:00 and 19:00–22:00). Thus, if the actual energy consumption of these dispatchable loads is not properly coordinated, the burden on the power grid will increase significantly and the economic operation of the electricity market cannot be guaranteed. Additionally, note that since we consider residential PEVs, their arrival and departure times are almost fixed and known in advance. The wholesale prices are determined by the grid operator and are derived from Commonwealth Edison Company [37] (see also Fig. 6). The remaining related parameters of dispatchable loads are listed in Table I. For illustration, the numerical values of the wholesale price and the remaining related parameters (including the elasticity coefficients ξ_t, weighting factor ρ, learning rate θ, and gap tolerance δ) are summarized in Table II. Note that the action space is discretized with a step of 0.1; that is, the RP increases or decreases by a multiple of 0.1 at each iteration (see the sketch below). The numerical results of the proposed RL-based DR algorithm are shown below.
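Regarding the action-space discretization just mentioned, the RP grid used by the learner could be built as in the short sketch below; the specific lower and upper price bounds shown are placeholders, not the values used in the experiments.

```python
def build_price_grid(lower_bound, upper_bound, step=0.1):
    """Discretized retail-price action space: RPs move in multiples of `step`
    between the allowed lower and upper price bounds."""
    n_steps = int(round((upper_bound - lower_bound) / step))
    return [round(lower_bound + i * step, 2) for i in range(n_steps + 1)]


# placeholder bounds for illustration only (e.g., wholesale price to a price cap)
price_grid = build_price_grid(0.5, 1.5)
```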
Fig. 6. Daily optimal retail pricing policies of loads. (a) Non-dispatchable load. (b) General dispatchable load. (c) PEV.
Fig. 6 shows the daily optimal retail pricing policies received by the three types of loads. It can be observed that the trends of the RPs and wholesale prices are similar, which is justified in terms of maximizing social welfare. Moreover, all the RPs fall within the lower and upper price bounds, thus satisfying constraint (13). It is worth noting that, due to the changes of price elasticity from off-peak to mid-peak (on-peak) periods at 12:00 (16:00), sudden decreases appear at these two time slots. From the definition of price elasticity, one knows that a continuous increase in RP may lead to more demand reduction, thus causing great reductions in the UC’s profit. Another observation is that the price difference (i.e., retail price minus wholesale price) for each load unit during the three periods satisfies: off-peak > mid-peak > on-peak. This is because the price elasticity coefficient of the on-peak period is smaller than that of the mid- and off-peak periods. Once the optimal RPs for all loads are obtained, the optimal energy consumption can be directly calculated by (1) and (4), which is shown in Fig. 7. One can see that the PEV discharges at some peak hours to relieve the electricity pressure and increase its own profit. For further analysis, the total demand reduction of each dispatchable load is displayed in Fig. 8. It can be observed that Gen6 reduces less energy compared with the other dispatchable loads, because a load unit with a larger α_n prefers a smaller demand reduction to avoid experiencing more dissatisfaction.
Next, we proceed to verify the convergence of Algorithm 1, that is, to judge whether the Q-values converge to their maxima. For clarity, we choose five Q-values of each type of load as an example, and the numerical results are displayed in Fig. 9. Clearly, at the beginning, the UC has no knowledge of which RP can result in a larger reward. But as the iterations proceed, since the UC learns the dynamic responses of loads through trial and error, the Q-values gradually increase and eventually converge to their maxima.
Now let us move on to the discussion of the impact of ρ and θ. Since the demand reduction of loads is closely related to the time-varying price elasticity, fixed elasticity coefficients are not representative. To tackle this issue, we adopt Monte Carlo simulations (2000 simulations with changing elasticity coefficients) to capture the trends of the average RP, the UC’s total profit, and the loads’ total cost as the weighting parameter ρ changes. As shown in Fig. 10, the average RP (red solid line), the UC’s total profit (black solid line), and the loads’ total cost (blue solid line) increase as ρ varies from 0 to 1 with a step of 0.1. This is because an increase in ρ means that, from a social point of view, maximizing the profit of the UC is more important than minimizing the comprehensive cost of loads, thereby resulting in an increase in the RP determined by the UC. Correspondingly, as the RP increases, the amount of energy consumed by loads is gradually reduced, leading to a slight increase in the total cost of the loads. Fig. 11 shows that as θ increases from 0 to 1, the convergence of the Q-values gradually becomes faster. In particular, θ = 0 means the UC learns nothing, while θ = 1 means the UC focuses only on the latest information.
Fig. 7. Daily energy consumption of loads. (a) General dispatchable load. (b) PEV.
Fig. 8. Total demand reduction of dispatchable loads.
Fig. 9. Convergence of Q-values. (a) Non-dispatchable load. (b) General dispatchable load. (c) PEV.
Fig. 10. Impact of ρ on average retail price, UC's total profit, and loads' total cost.
Fig. 11. Impact of learning rate θ on convergence of Q-values.
TABLE I RELATED PARAMETER SETTINGS OF DISPATCHABLE LOAD UNITS
TABLE II NUMERICAL VALUES OF WHOLESALE PRICE AND OTHER RELATED PARAMETERS
2) Comparison With Benchmarks: To further evaluate the effectiveness of the proposed model, we first adopt a benchmark without PBDRM for comparison. Fig. 12 shows the daily energy consumption of all loads under two different situations, namely with and without PBDRM. Note that, for the sake of illustration, the figure plots the average RP of all loads. It can be seen from Fig. 12(a) that there is no energy demand reduction or shift in the absence of PBDRM, causing a fluctuating energy consumption profile. By contrast, as shown in Fig. 12(b), with the help of PBDRM, the loads reduce energy consumption when the price is high, resulting in less total energy consumption and a smoother profile. Therefore, the proposed PB-RDRM effectively coordinates the energy consumption of residential loads and significantly improves the social welfare of the residential retail electricity market.
Fig. 12. Energy consumption of all loads. (a) without and (b) with PBDRM.
Then, the proposed RL-based DR algorithm is benchmarked against the two model-based optimization algorithms (i.e., the distributed DDB [9] and PDI [10] methods) and a scheme with perfect information of the random parameters (i.e., the RP and the EVs’ commuting behavior). Note that both of the model-based optimization approaches rely on deterministic load and price models. This is because the PB-RDRM is essentially a bi-level optimization problem, so when the problem model is accurately formulated, one can use conventional optimization techniques to solve it directly. Specifically, by means of the DDB method in [9], the optimization problem of all lower-level loads can be decoupled into several sub-optimization problems, each of which can be solved in a distributed manner. In the simulations, the initial Lagrangian multipliers are set to zeros. In addition, another optimization technique, namely the PDI method, has been shown to be effective in dealing with DR in the smart grid [10]. In our comparative simulation study, the initial random dual vector is set to 0.5, α = 0.05, σ = 2.5, β = 0.2, and ε_ν = ε_feas = 10^−2. Note that these parameter settings correspond to the algorithm in [10] and are independent of the parameters presented in this paper. The comparison results are shown in Fig. 13. It can be observed that the trends of the RPs (solid lines) obtained by the four compared algorithms are similar, but the energy consumption profile (bar charts) learned by our proposed RL-based algorithm is much smoother than those of the two model-based optimization approaches. Moreover, the trends of the RPs and energy consumption profiles obtained by our proposed approach are the closest to those of the scheme with perfect information of the random parameters (referred to as “Com-based” in Fig. 13). In addition, Table III lists the numerical comparison results of the UC’s total profit, the loads’ total cost, and the social welfare. It can be observed from Table III that the scheme with complete information of the random parameters provides an upper bound for the social welfare generated by the PB-RDRM considered in this paper. Moreover, the result of our proposed RL-based approach is closer to this upper bound than those of the other two model-based optimization approaches.
Fig. 13. Comparison results of the total energy consumption and RP solutions of the RL-based algorithm and three benchmark algorithms (including DDB, PDI, and Com-based).
TABLE III NUMERICAL COMPARISON RESULTS
Moreover, note that the complexity of RL algorithms applied to goal-directed exploration tasks, especially Q-learning and its variants, has been thoroughly analyzed in an earlier paper [38]. Specifically, according to Theorem 2 and Corollary 3 in [38], the complexity of the one-step Q-learning algorithm (corresponding to the inner loop of the RL-based DR algorithm proposed in the current work) is O(md), where m and d are the total number of state-action pairs and the maximum number of actions that the agent executes to reach the next state, respectively. Therefore, the complexity of our proposed RL-based DR algorithm is O(md), which obviously depends on the size of the state space. As for the two compared model-based approaches (i.e., the distributed DDB [9] and distributed PDI [10] methods), they are shown to have polynomial-time complexity. Therefore, with the increase in the number of residential loads and the time horizon, the proposed algorithm has a complexity comparable with the distributed DDB and PDI methods. In addition, considering the performance advantages of our proposed algorithm in addressing the uncertainty of RPs and energy consumption, as well as yielding larger social welfare, it turns out that the proposed RL-based DR algorithm can effectively solve the PB-RDRM within an unknown electricity market environment.
Next, to verify the scalability of the proposed algorithm, we consider more loads (i.e., the total number of loads changing from 50 to 200) participating in the PB-RDRM. Fig. 14 traces the convergence rate of the Q-values with different numbers of loads. It can be observed that the more loads there are, the more iterations are required for the Q-values to converge. Note that the iterative process for 200 loads takes about 7.17×10³ s, while that for 50 loads takes 3.89×10³ s. The main reason for such an increase in time and number of iterations is that when adding one load, there are W^24 permutations to perform, where W ∈ {1, ..., |A|}. Fortunately, with the advent of a new generation of advanced computing platforms such as grid computing and cloud computing [2], such computing pressure is no longer an obstacle to the development of smart grids, as they synthesize and coordinate various local computing facilities, such as smart meters, to provide the required sub-computing and storage tasks.
Fig. 14. Convergence rate of Q-values with different numbers of loads.
This paper investigates the PB-RDRM problem, in which flexible PEVs are innovatively considered. We first formulate the upper-level dynamic retail pricing problem as an MDP with unknown state transition probabilities from the social perspective. Then a model-free RL-based approach is proposed to obtain the optimal retail pricing policies that coordinate the energy consumption profiles. The proposed approach is shown to address the uncertainty induced by energy consumption and RPs, as well as the temporally coupled constraints of PEVs, without any prior knowledge of the exact models of load and electricity price. The simulation results show that our proposed RL-based algorithm can not only adaptively decide the dynamic RPs through the on-line learning process, but also outperform the model-based optimization approaches in solving the PB-RDRM within an unknown market environment.
Note that the tabular Q-learning algorithm we use is limited by the dimension of the state vector. However, unlike the commercial area, the number of loads in residential areas is almost fixed, so the dimension of the state vector is constant in the considered PB-RDRM. Therefore, the corresponding Q-table does not need to be reconstructed and trained repeatedly. That is to say, the proposed Q-learning-based algorithm is applicable to the investigated PB-RDRM. In the future, we intend to use function approximation or neural networks to replace the Q-table so as to extend the algorithm to larger problems. We will also focus on the pros and cons of both PBDRM and IBDRM to explore the coordination between these two DRMs.