Yiming DING (丁義明)
College of Science, Wuhan University of Science and Technology, Wuhan 430081, China; Hubei Province Key Laboratory of System Science in Metallurgical Process, Wuhan University of Science and Technology, Wuhan 430065, China
E-mail: dingym@wust.edu.cn
Liang WU (吳量)
Center of Statistical Research, School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China
E-mail: wuliang@swufe.edu.cn
Xuyan XIANG (向緒言)
Hunan Province Cooperative Innovation Center for TCDDLEEZ, School of Mathematics and Physics Science, Hunan University of Arts and Science, Changde 415000, China
E-mail: xyxiang2001@126.com
Abstract Long memory is an important phenomenon that sometimes arises in the analysis of time series or spatial data. Most definitions of the long memory of a stationary process are based on the second-order properties of the process. The mutual information between the past and future, $I_{p-f}$, of a stationary process represents the information stored in the history of the process that can be used to predict the future. We suggest that a stationary process be referred to as long memory if its $I_{p-f}$ is infinite. For a stationary process with finite block entropy, $I_{p-f}$ is equal to the excess entropy, which is the summation of redundancies that relate the convergence rate of the conditional (differential) entropy to the entropy rate. Since the definitions of $I_{p-f}$ and the excess entropy of a stationary process require only a very weak moment condition on the distribution of the process, they can be applied to processes whose distributions lack a bounded second moment. A significant property of $I_{p-f}$ is that it is invariant under one-to-one transformation; this enables us to know the $I_{p-f}$ of a stationary process from that of other processes. For a stationary Gaussian process, long memory in the sense of mutual information is stricter than that in the sense of covariance. We demonstrate that the $I_{p-f}$ of fractional Gaussian noise is infinite if and only if the Hurst parameter satisfies $H\in(1/2,1)$.
Key words mutual information between past and future; long memory; stationary process; excess entropy; fractional Gaussian noise
Long memory processes and long range dependence processes are synonymous notions [5,12,22] which play an important role in various fields, such as hydrology, geophysics, physics, finance, biology, medicine, climatology, environmental sciences, economics, telecommunications, etc. As was mentioned in [5], although long memory and related topics date from the late 19th century, the notion only really started to attract the interest of a significant number of mathematical researchers (in particular, probabilists and statisticians) after the work of Mandelbrot and his colleagues, which laid the foundations for the fractional Brownian motion (FBM) model and its increments (such as the fractional Gaussian noise (FGN) model), the classical models in the study of long memory [5,28]. Similar path-breaking roles can be attributed to Hurst [15] for hydrology, Dobrushin (and before him, Kolmogorov [17]) for physics, and Granger [13] for economics.
A stationary process is a sequence of random variables whose probability law is time invariant. A second-order stationary process has long memory when the sum of its autocorrelation function (ACF) diverges, or when its power spectrum has a pole at zero frequency [4,5]. That is to say, the ACF and the power spectrum of a long memory process both follow a power law, while the underlying process has no characteristic timescale of decay. The correlation of a long memory process decays so slowly that it cannot rapidly be distinguished from noise. This is in striking contrast to many standard stationary processes. The long memory phenomenon relates to the rate of decay of the statistical dependence of a stationary process, with the implication that the dependence decays more slowly than exponentially, typically like a power law. Some self-similar processes may exhibit long memory, but not all processes with long memory are self-similar [21]. When definitions of long memory are given, they vary from author to author (the econometric survey [14] mentions 11 different definitions). Different definitions of long memory are used for different applications. Most of the definitions of long memory that appear in the literature are based on the second-order properties of a stochastic process. Such properties include the asymptotic behavior of covariances, spectral density, and variances of partial sums. The reasons for the popularity of second-order properties in this context are both historical and practical: second-order properties are relatively simple concepts and are easy to estimate from data.
Long memory is popularly defined for a covariance stationary series $\{X_n\}$ with a spectral density via the divergence of the sum of autocovariances,
$$\sum_{n=0}^{\infty}|r_n|=\infty,$$
where $r_n=\mathrm{Cov}(X_m,X_{m+n})$. Conversely, $\sum_{n=0}^{\infty}|r_n|<\infty$ for short memory. Note that correlations provide only limited information about the process if the process is “not very close” to being Gaussian, and the rate of decay of correlations may change significantly after instantaneous one-to-one transformations of the process. Moreover, a covariance-based definition is not valid for processes without a bounded second moment [22]. The question then arises as to whether it is possible to develop a new approach that improves the definition of the long memory of a stationary process so that it is invariant under one-to-one transformation, can capture more dependence information, and can be used for processes without a bounded second moment.
For stationary processes without a bounded second moment, some scholars use extreme values to describe long memory [11,19,20,22]. Samorodnitsky suggested using notions from ergodic theory, such as ergodicity and strong mixing, to describe the memory of a stationary process, because these are invariant under one-to-one transformation. The key step in this approach is to look for reasonable strong mixing conditions that clearly distinguish short and long memory stationary processes. Mixing coefficients [1,2] are also invariant under one-to-one transformation, but they are difficult to compute and still lack a reasonable criterion for clearly distinguishing long memory from short memory processes.
It is well known that Shannon entropy is invariant under one-to-one transformation. One can expect to find suitable concepts in information theory for distinguishing short and long memory stationary processes. Mutual information is used to capture the dependence between two random variables, which are independent if and only if the mutual information between them is zero. For a stationary process $X=\{\cdots,X_{-1},X_0,X_1,X_2,\cdots\}$, we can regard $X$ as two random variables: the past $\overleftarrow{X}=\cdots X_{-2}X_{-1}X_0$ and the future $\overrightarrow{X}=X_1X_2\cdots$. The mutual information between $\overleftarrow{X}$ and $\overrightarrow{X}$ is $I_{p-f}(X)$, which represents the information shared between the past and the future of the stationary process $X$ (see Section 3 for the detailed definition of $I_{p-f}$). A stationary process $X$ usually admits a Shannon (differential) entropy rate $h_\mu(X)$ (for a continuous-valued stationary process, the differential entropy rate may be $-\infty$), and the conditional entropy $H(X_n|X_1,\cdots,X_{n-1})\to h_\mu(X)$ (or the conditional differential entropy $h(X_n|X_1,\cdots,X_{n-1})\to h_\mu(X)$) as $n\to\infty$ [6,16]. In what follows, we try to demonstrate that $I_{p-f}$ can be used to distinguish long memory and short memory stationary processes: a stationary process $X$ is long memory if $I_{p-f}(X)=+\infty$, and it is short memory if $I_{p-f}(X)<+\infty$. The mutual information description of long memory is also related to the ten key challenges of the post-Shannon era raised by Huawei [29]. Such an approach is helpful for the following reasons:
1) The definition of the mutual information $I_{p-f}(X)$ requires only a weak moment condition rather than the second moment condition $\mathrm{E}|X_1|^2<+\infty$; therefore it can be used to detect the long memory behavior of a stationary process with a heavy-tailed distribution;
2) $I_{p-f}(X)$ can distinguish short and long memory stationary processes, as $\sum_n|r_n|$ does for a process with a bounded second moment (see Section 3 for details);
3) $I_{p-f}(X)$ is invariant under one-to-one transformation (Theorem 3.8);
4) It is closely related to the second moment characterization if the stationary process is Gaussian (Theorem 3.11). For fractional Gaussian noise, $I_{p-f}(X)$ is infinite if and only if the Hurst parameter satisfies $H\in(1/2,1)$ (Theorem 3.13);
5) For a stationary process with finite block entropy, $I_{p-f}(X)$ is equal to the excess entropy $E(X)$ of $X$, which is an intuitive measure of the information stored in $X$ (Theorem 4.6) [7,27]. Whether $I_{p-f}(X)$ is finite or not depends on the rate at which the conditional (differential) entropy converges to the (differential) entropy rate $h_\mu(X)$.
The rest of this paper is organized as follows: in Section 2, we recall some basic concepts of information theory. In Section 3, we give the definition of long memory using mutual information for stationary processes, and show that the mutual information is invariant under one-to-one transformation. Furthermore, we illustrate how $I_{p-f}$ is related to covariance when the stationary process is Gaussian, and prove that, for fractional Gaussian noise, $I_{p-f}=+\infty$ if and only if the Hurst parameter satisfies $H\in(1/2,1)$. In Section 4, we demonstrate that, for a stationary process with finite block entropy, $I_{p-f}$ is equal to its excess entropy.
We recall some basic concepts and theorems about information theory from the books [6]and [16].
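The Shannon entropy $H(X)$ of a discrete random variable $X$, taking values $x\in S$, is defined as
$$H(X)=-\sum_{x\in S}p(x)\log p(x),$$
where the probability that $X$ takes on the particular value $x$ is written as $p(x)\equiv P(X=x)$.
Shannon entropy measures the uncertainty of a discrete random variable.
The joint Shannon entropy $H(X,Y)$ of two discrete random variables $(X,Y)$ is defined as
$$H(X,Y)=-\sum_{x}\sum_{y}p(x,y)\log p(x,y),$$
where $p(x,y)$ is the joint distribution of $(X,Y)$.
The joint Shannon entropy of $n$ discrete random variables $X_1,X_2,\cdots,X_n$ is
$$H(X_1,X_2,\cdots,X_n)=-\sum_{x_1,\cdots,x_n}p(x_1,x_2,\cdots,x_n)\log p(x_1,x_2,\cdots,x_n),$$
where $p(x_1,x_2,\cdots,x_n)$ is the joint distribution of $(X_1,X_2,\cdots,X_n)$.
The conditional Shannon entropy $H(X|Y)$, which is the entropy of a discrete random variable $X$ conditional on the knowledge of another discrete random variable $Y$, is
$$H(X|Y)=-\sum_{x}\sum_{y}p(x,y)\log p(x|y).$$
When $X$ is a continuous random variable with density $f(x)$, the differential entropy $h(X)$ of $X$ is defined as
$$h(X)=-\int_{S}f(x)\log f(x)\,dx,$$
where $S$ is the support set of the random variable.
The differential entropy $h(X)$ may be negative and does not work as an uncertainty measure, but the difference $h(X)-h(Y)$ indicates the difference between the uncertainties of $X$ and $Y$.
The joint entropy of $n$ continuous random variables $X_1,X_2,\cdots,X_n$ with density $f$ is defined as
$$h(X_1,X_2,\cdots,X_n)=-\int f(x_1,\cdots,x_n)\log f(x_1,\cdots,x_n)\,dx_1\cdots dx_n.$$
Similarly, if $X$ and $Y$ are continuous random variables with a joint density function $f(x,y)$, the joint differential entropy $h(X,Y)$ of $(X,Y)$ is defined as
$$h(X,Y)=-\iint f(x,y)\log f(x,y)\,dx\,dy,$$
and the conditional differential entropy $h(X|Y)$ as
$$h(X|Y)=-\iint f(x,y)\log f(x|y)\,dx\,dy.$$
For both Shannon entropy and differential entropy, we have the following properties [6,16]:
1) $H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)$, $h(X,Y)=h(X)+h(Y|X)=h(Y)+h(X|Y)$;
2) Conditioning reduces entropy: $H(X|Y)\le H(X)$, $h(X|Y)\le h(X)$;
3) Chain rules: $H(X_1,\cdots,X_n)=\sum_{i=1}^{n}H(X_i|X_{i-1},\cdots,X_1)$, $h(X_1,\cdots,X_n)=\sum_{i=1}^{n}h(X_i|X_{i-1},\cdots,X_1)$.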
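The discrete definitions and properties above can be checked numerically on a small example. The following is a minimal sketch, not part of the original paper: the joint distribution `joint_xy` and all function names are hypothetical and chosen only for illustration. It computes $H(X)$, $H(X,Y)$ and $H(X|Y)$ directly from the definitions (in nats), so that the chain rule and "conditioning reduces entropy" are verified rather than assumed.

```python
import math
from collections import defaultdict


def entropy(pmf):
    """Shannon entropy (in nats) of a pmf given as {outcome: probability}."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)


def marginal(joint, axis):
    """Marginal pmf of coordinate `axis` of a joint pmf keyed by tuples."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[axis]] += p
    return dict(out)


def conditional_entropy(joint, given_axis):
    """H(other | given) = -sum p(x, y) log p(other | given), from the definition."""
    given = marginal(joint, given_axis)
    return -sum(
        p * math.log(p / given[outcome[given_axis]])
        for outcome, p in joint.items()
        if p > 0
    )


# Hypothetical joint pmf of (X, Y) on {0, 1} x {0, 1}, for illustration only.
joint_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

h_xy = entropy(joint_xy)                                # H(X, Y)
h_x = entropy(marginal(joint_xy, 0))                    # H(X)
h_y = entropy(marginal(joint_xy, 1))                    # H(Y)
h_x_given_y = conditional_entropy(joint_xy, given_axis=1)  # H(X | Y)
h_y_given_x = conditional_entropy(joint_xy, given_axis=0)  # H(Y | X)

print("chain rule gap:", h_xy - (h_x + h_y_given_x))
print("H(X|Y) <= H(X):", h_x_given_y <= h_x)
```

Because `conditional_entropy` is computed from $p(x|y)$ rather than as a difference of entropies, the printed chain-rule gap is a genuine numerical check of property 1).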
A stochastic process $X=\{X_i,i\in\mathbb{Z}\}$ is an indexed sequence of random variables. A stochastic process is (strictly) stationary if the joint probability distribution does not change when shifted in time, i.e., the distribution of $(X_{i_1+s},X_{i_2+s},\cdots,X_{i_n+s})$ is independent of $s$ for any positive integer $n$ and $i_1,i_2,\cdots,i_n\in\mathbb{Z}^+$.
Let $X=\{X_i,i\in\mathbb{Z}\}$ be a stationary stochastic process. For $n=1,2,\cdots$, the $n$-block entropy of $X$ is $H_X(n):=H(X_1,X_2,\cdots,X_n)$ if $X$ is a discrete-valued process, or $h_X(n):=h(X_1,X_2,\cdots,X_n)$ if $X$ is a continuous-valued process. For convenience, $H_X(n)$ (or $h_X(n)$) is denoted by $H(n)$ (or $h(n)$) if no confusion occurs, and $H(0)=0$ (or $h(0)=0$).
We say that a stationary process $X$ admits finite block entropy if all of the block entropies $H(n)$ (or $h(n)$) are finite, i.e., $0\le H(n)<\infty$ (or $-\infty<h(n)<\infty$) for all $n$.
In this paper, we focus on stationary processes with finite block entropy.
The following lemma collects some properties of the block (differential) entropy sequence $\{H(n)\}$ or $\{h(n)\}$ of a stationary process $X$:
Lemma 2.1 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stationary process that admits finite block entropy. We have the following properties:
1) Nonincreasing entropy gain: $\Delta H(n):=H(n)-H(n-1)=H(X_n|X_{n-1},\cdots,X_1)$ (and $\Delta h(n):=h(n)-h(n-1)=h(X_n|X_{n-1},\cdots,X_1)$) are nonincreasing in $n$;
2) Subadditivity: for all nonnegative integers $m$ and $n$,
$$H(m+n)\le H(m)+H(n),\qquad h(m+n)\le h(m)+h(n);$$
3) Monotonicity of entropy per element: both $\{H(n)/n\}$ and $\{h(n)/n\}$ are nonincreasing.
Proof Without loss of generality, we only prove the results for Shannon entropy; the differential entropy cases are similar.
1) By the chain rules, we have that
$$H(n)=\sum_{i=1}^{n}H(X_i|X_{i-1},\cdots,X_1).$$
It follows that $\Delta H(n):=H(n)-H(n-1)=H(X_n|X_{n-1},\cdots,X_1)$, which is nonincreasing because conditioning reduces entropy: by stationarity, $H(X_{n+1}|X_n,\cdots,X_1)\le H(X_{n+1}|X_n,\cdots,X_2)=H(X_n|X_{n-1},\cdots,X_1)$.
2) Since $\Delta H(n)$ is nonincreasing, by the chain rule, we know that
$$H(m+n)=H(m)+\sum_{i=m+1}^{m+n}\Delta H(i)\le H(m)+\sum_{i=1}^{n}\Delta H(i)=H(m)+H(n).$$
3) By the stationarity of $X$, the chain rule and the nonincreasing entropy gain, we have that $H(n)=\sum_{i=1}^{n}\Delta H(i)\ge n\Delta H(n)\ge n\Delta H(n+1)$, and hence
$$\frac{H(n+1)}{n+1}=\frac{H(n)+\Delta H(n+1)}{n+1}\le\frac{H(n)+H(n)/n}{n+1}=\frac{H(n)}{n}.$$
Remark 2.2 1) Since $H(X_n|X_{n-1},\cdots,X_1)\ge0$, the entropy gain $\Delta H(n)$ is nonnegative, but $\Delta h(n)=h(X_n|X_{n-1},\cdots,X_1)$ is not always nonnegative.
2) If $X$ is a discrete-valued stationary process, $X$ admits finite block entropy if and only if $H(X_1)=H(1)<+\infty$, because the subadditivity of $\{H(n)\}$ implies that $H(n)\le nH(1)<+\infty$. In particular, if $\mathrm{E}|X_1|^{\delta}<+\infty$ for some $\delta>0$, then $H(X_1)<+\infty$ [3,9]. Thus $H(X_1)<+\infty$ is weaker than the second moment condition $\mathrm{E}|X_1|^2<+\infty$. For a continuous-valued process, we should avoid the case of $h(n)=-\infty$, so the condition of finite block entropy is required.
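3) Suppose that $X$ is a continuous-valued stationary process, and that $X$ does not admit finite block entropy. Let $k$ be the minimal positive integer such that $h(k)=-\infty$. Then $h(n)=-\infty$ for $n\ge k$. In fact, by the subadditivity of $\{h(n)\}$, $h(mk)\le mh(k)=-\infty$ indicates that $h(mk)=-\infty$. If $n=mk+s$, $m\ge1$, $0<s<k$, then $h(n)\le h(mk)+h(s)=-\infty$.
3 Long Memory and Mutual Information
In this section, we give the definition of long memory by using mutual information, and discuss the rationale behind this definition.
3.1 Definition of Long Memory
The mutual information between two random variables $X$ and $Y$ is defined as follows [16]:
Definition 3.1 Let $X$ and $Y$ be random variables taking values in $\mathcal{X}$ and $\mathcal{Y}$. Let $\mu_X$ and $\mu_Y$ be the probability measures of $X$ and $Y$ on the measurable spaces $(\mathcal{X},\mathcal{B}(\mathcal{X}))$ and $(\mathcal{Y},\mathcal{B}(\mathcal{Y}))$. $\mu_{XY}$ is the joint probability measure of $X$ and $Y$ on the measurable space $(\mathcal{X}\times\mathcal{Y},\mathcal{B}(\mathcal{X})\times\mathcal{B}(\mathcal{Y}))$, and $\mu_X\times\mu_Y$ is the product probability measure of $X$ and $Y$ on the same measurable space. The mutual information $I(X;Y)$ between $X$ and $Y$ is defined as
$$I(X;Y)=\int_{\mathcal{X}\times\mathcal{Y}}\log\frac{d\mu_{XY}}{d(\mu_X\times\mu_Y)}\,d\mu_{XY}$$
if $\mu_{XY}\ll\mu_X\times\mu_Y$ (i.e., $\mu_{XY}$ is absolutely continuous with respect to $\mu_X\times\mu_Y$); otherwise $I(X;Y)$ is infinite.
The mutual information $I(X;Y)$ is the relative entropy (or Kullback-Leibler divergence) between the joint probability measure $\mu_{XY}$ and the product probability measure $\mu_X\times\mu_Y$.
Suppose that $X$ and $Y$ are two discrete random variables with a joint distribution $p(x,y)$ and marginal distributions $p(x)$ and $p(y)$. Then the Radon-Nikodym derivative is given by
$$\frac{d\mu_{XY}}{d(\mu_X\times\mu_Y)}(x,y)=\frac{p(x,y)}{p(x)p(y)},$$
and $I(X;Y)$ can be written as
$$I(X;Y)=\sum_{x}\sum_{y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}.$$
It follows that
$$I(X;Y)=H(X)+H(Y)-H(X,Y)=H(X)-H(X|Y)=H(Y)-H(Y|X).$$
For continuous random variables $X$ and $Y$ with a joint density function $f(x,y)$ and marginal density functions $f(x)$ and $f(y)$,
$$I(X;Y)=\iint f(x,y)\log\frac{f(x,y)}{f(x)f(y)}\,dx\,dy=h(X)+h(Y)-h(X,Y).$$
Let $P=\{P_1,\cdots,P_k\}$ be a finite partition of the image of the random variable $X$. The quantization of $X$ with partition $P$ is the discrete random variable $[X]_P$ defined by
$$P([X]_P=i)=P(X\in P_i)=\int_{P_i}d\mu_X,\qquad i=1,\cdots,k.$$
Thus the joint distribution of two quantizations $[X]_P$ and $[Y]_Q$ is defined as
$$P([X]_P=i,[Y]_Q=j)=P(X\in P_i,Y\in Q_j)=\int_{P_i\times Q_j}d\mu_{XY},$$
where $Q=\{Q_1,\cdots,Q_s\}$ is a finite partition of the image of the random variable $Y$.
An equivalent definition of mutual information using quantization [6] gives the relationship between the mutual information of discrete random variables and that of continuous random variables.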
Definition 3.2 The mutual information between two random variables $X$ and $Y$ is defined as
$$I(X;Y)=\sup_{P,Q}I([X]_P;[Y]_Q),$$
where the supremum is taken over all finite partitions $P$ and $Q$.
We collect some useful properties of mutual information which immediately follow from Theorem 1.6.3 of [16].
Lemma 3.3 Suppose that $X,Y,Z$ are three random variables. Then the following properties hold:
1) $I(X;Y)=I(Y;X)$;
2) $I(X;Y)\ge0$;
3) $I(X;(Y,Z))\ge I(X;Z)$.
The mutual information of a stationary process between the past and future is defined as follows (see [18] for the stationary Gaussian process):
Definition 3.4 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stationary process with finite entropy $H(n)$ (or $h(n)$) for each $n\in\mathbb{Z}^+$. Let $I_{p-f}(X,n):=I(X_{-(n-1)}\cdots X_0;X_1\cdots X_n)$ be the mutual information between the past and future with length $n$. The mutual information between the past and future, $I_{p-f}(X)$, is defined as
$$I_{p-f}(X):=\lim_{n\to\infty}I_{p-f}(X,n).$$
For convenience, $I_{p-f}(X)$ and $I_{p-f}(X,n)$ are denoted by $I_{p-f}$ and $I_{p-f}(n)$ as long as no confusion occurs.
Remark 3.5 1) For convenience, we set $I_{p-f}(0)=0$.
2) By Lemma 3.3 and the stationarity of $X$, $I_{p-f}(n)=I(X_1\cdots X_n;X_{n+1}\cdots X_{2n})$. This is nonnegative and nondecreasing in $n$, so $I_{p-f}$ is either infinite or a nonnegative constant.
3) The mutual information between the past and future can also be defined as $\lim_{n,m\to\infty}I_{p-f}(X,n,m)$, where $I_{p-f}(X,n,m):=I(X_1\cdots X_n;X_{n+1}\cdots X_{n+m})$. By the monotonicity of the mutual information (see the third claim of Lemma 3.3), this definition is equivalent to Definition 3.4.
4) The definition of the mutual information $I_{p-f}$ requires finite entropy, which only needs a weak moment condition rather than the second moment condition. More precisely, for a random variable $X$, if $\mathrm{E}|X|^{\delta}<+\infty$ for some $\delta>0$, then $H(X)<+\infty$, since $H(X)$ admits an explicit upper bound in terms of $\mathrm{E}|X|^{\delta}$ and the Gamma function $\Gamma$ [3,9,26].
Now we distinguish long memory from short memory stationary processes by using $I_{p-f}$.
Definition 3.6 A stationary process is long memory if $I_{p-f}$ is infinite, and it is short memory if $I_{p-f}$ is finite.
The definition of long memory from the perspective of $I_{p-f}$ was discussed in the case of a stationary Gaussian process by Li [18].
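As a small illustration of Definitions 3.4 and 3.6, the block quantity $I_{p-f}(n)$ can be computed by brute force for a finite-state process. The sketch below is an assumption-laden example rather than anything taken from the paper: it uses a hypothetical two-state stationary Markov chain (transition matrix `P`, stationary distribution `pi`), the standard identity $I(X;Y)=H(X)+H(Y)-H(X,Y)$ for discrete variables, and stationarity, which together give $I_{p-f}(n)=2H(n)-H(2n)$. Block entropies are obtained by enumerating all words of length $n$.

```python
import itertools
import math


def block_entropy(P, pi, n):
    """H(n) = H(X_1, ..., X_n) in nats for a stationary Markov chain.

    P is the transition matrix, pi its stationary distribution.
    All words of length n are enumerated, so n must stay small.
    """
    if n == 0:
        return 0.0
    total = 0.0
    for word in itertools.product(range(len(pi)), repeat=n):
        p = pi[word[0]]
        for a, b in zip(word, word[1:]):
            p *= P[a][b]
        if p > 0:
            total -= p * math.log(p)
    return total


def past_future_mi(P, pi, n):
    """I_{p-f}(n) = I(X_1..X_n; X_{n+1}..X_{2n}) = 2 H(n) - H(2n)."""
    return 2.0 * block_entropy(P, pi, n) - block_entropy(P, pi, 2 * n)


# Hypothetical two-state chain; pi satisfies pi P = pi.
P = [[0.9, 0.1], [0.4, 0.6]]
pi = [0.8, 0.2]

values = [past_future_mi(P, pi, n) for n in range(1, 6)]
for n, v in enumerate(values, start=1):
    print(n, round(v, 6))
```

For a Markov chain the printed values do not depend on $n$, so $I_{p-f}$ is finite and the chain is short memory in the sense of Definition 3.6. The enumeration costs $2^{2n}$ terms for two states, so this approach is only meant for small $n$.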
The information gain $S_n$ is defined as follows: for $n\ge1$, let $S_n:=\Delta I_{p-f}(n)=I_{p-f}(n)-I_{p-f}(n-1)$ be the $n$th information gain. Since $I_{p-f}(n)$ is nondecreasing, $S_n\ge0$. We have that
$$I_{p-f}=\sum_{n=1}^{\infty}S_n.$$
In practice, $I_{p-f}$ is usually approximated by $I_{p-f}(N)$ for some large positive integer $N$. In this case, the acquired information is $\sum_{n=1}^{N}S_n$, and the missed information is $\sum_{n=N+1}^{\infty}S_n$.
If $I_{p-f}<+\infty$, as $N\to+\infty$, the ratio of the acquired information to the missed information is
$$\mathrm{AMR}(N):=\frac{\sum_{n=1}^{N}S_n}{\sum_{n=N+1}^{\infty}S_n}\to+\infty.$$
If $I_{p-f}=+\infty$, no matter what $N$ is, the ratio of the acquired information to the missed information is $\mathrm{AMR}(N)=0$.
One can see that there is a significant difference in the asymptotic behavior of $\{\mathrm{AMR}(N)\}$ between stationary processes with $I_{p-f}<+\infty$ and those with $I_{p-f}=+\infty$.
3.2 Invariance Under One-to-One Transformation
In what follows, we show that the mutual information $I_{p-f}$ is invariant under one-to-one transformation.
Lemma 3.7 Let $g_1$ and $g_2$ be two one-to-one transformations, and let $X$ and $Y$ be two random variables. We have that $I(X;Y)=I(g_1(X);g_2(Y))$.
Proof Let $P=(P_1,\cdots,P_k)$ and $Q=(Q_1,\cdots,Q_m)$ be partitions of the ranges of the random variables $X$ and $Y$, respectively. Since $g_1$ and $g_2$ are one-to-one transformations, $g_1(P)=(g_1(P_1),\cdots,g_1(P_k))$ and $g_2(Q)=(g_2(Q_1),\cdots,g_2(Q_m))$ are partitions of the ranges of the random variables $g_1(X)$ and $g_2(Y)$, respectively.
Observe that, for all $i\in\{1,2,\cdots,k\}$ and $j\in\{1,2,\cdots,m\}$,
$$P(g_1(X)\in g_1(P_i),g_2(Y)\in g_2(Q_j))=P(X\in P_i,Y\in Q_j).$$
By eq. (3.1) and eq. (3.2), we know that
$$I([X]_P;[Y]_Q)=I([g_1(X)]_{g_1(P)};[g_2(Y)]_{g_2(Q)})\le I(g_1(X);g_2(Y)).$$
Taking the supremum over $P$ and $Q$, we obtain $I(X;Y)\le I(g_1(X);g_2(Y))$.
By symmetry (applying the same argument to the inverse transformations), $I(g_1(X);g_2(Y))\le I(X;Y)$.
Lemma 3.7 is proven.
Theorem 3.8 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stationary stochastic process, and let $g$ be a one-to-one transformation. Then the mutual information between the past and future, $I_{p-f}$, of the process $Y:=g(X)=\{g(X_n),n\in\mathbb{Z}\}$ is equal to that of $X$.
Proof Since $g$ is a one-to-one transformation, applying it coordinatewise to a block is again one-to-one, so by Lemma 3.7, we have that
$$I(X_{-(n-1)}\cdots X_0;X_1\cdots X_n)=I(g(X_{-(n-1)})\cdots g(X_0);g(X_1)\cdots g(X_n)),$$
which implies that $I_{p-f}(X,n)=I_{p-f}(Y,n)$ for all $n\ge0$.
As a result, the stationary processes $X$ and $Y$ admit the same $I_{p-f}$.
The significance of this definition of long memory lies in its consistency: if there is a one-to-one correspondence between two stationary processes, then either both of them are long memory or neither of them is.
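The invariance stated in Lemma 3.7 is easy to observe for discrete random variables. The following sketch is only an illustration under stated assumptions: the joint distribution `joint`, the maps `g1`, `g2` and the non-injective map `collapse` are all hypothetical. It computes $I(X;Y)$ from the discrete formula, pushes the joint distribution forward through the maps, and compares the results.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(X;Y) in nats for a joint pmf given as {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * math.log(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def push_forward(joint, g1, g2):
    """Joint pmf of (g1(X), g2(Y)); mass is merged if a map is not injective."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(g1(x), g2(y))] += p
    return dict(out)


# Hypothetical joint pmf of (X, Y) with X in {0, 1, 2} and Y in {0, 1}.
joint = {
    (0, 0): 0.30, (0, 1): 0.05,
    (1, 0): 0.10, (1, 1): 0.25,
    (2, 0): 0.05, (2, 1): 0.25,
}

g1 = lambda x: x ** 3 - 7        # one-to-one on the integers
g2 = lambda y: math.exp(y)       # one-to-one on the reals
collapse = lambda x: min(x, 1)   # NOT one-to-one: merges the values 1 and 2

mi_original = mutual_information(joint)
mi_bijective = mutual_information(push_forward(joint, g1, g2))
mi_collapsed = mutual_information(push_forward(joint, collapse, g2))

print(mi_original, mi_bijective, mi_collapsed)
```

The first two printed values agree, as Lemma 3.7 predicts, while the non-injective map loses information, which shows that the one-to-one assumption cannot simply be dropped.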
3.3 Stationary Gaussian Process
In this subsection, we discuss the relationship between the definitions of long memory in the sense of mutual information and of covariance for the stationary Gaussian process.
Let $X=\{X_n,n\in\mathbb{Z}\}$ be a zero-mean stationary Gaussian stochastic process. $X$ can be completely characterized by its correlation function $r_{k,j}=r_{k-j}=\mathrm{E}[X_kX_j]$, or equivalently by its power spectral density $f(\lambda)$, $\lambda\in[-\pi,\pi]$, which is the Fourier transform of the covariance function.
Set
$$b_k=\frac{1}{2\pi}\int_{-\pi}^{\pi}e^{-ik\lambda}\log f(\lambda)\,d\lambda$$
for any integer $k$ if it is well defined. Here $\{b_k\}$ are referred to as cepstrum coefficients [18].
The following Lemma 3.9 was proven by Li in [18]:
Lemma 3.9 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stationary Gaussian process [18].
1) $I_{p-f}$ is finite if and only if the cepstrum coefficients satisfy the condition that $\sum_{k=1}^{\infty}kb_k^2<\infty$. In this case,
$$I_{p-f}=\frac{1}{2}\sum_{k=1}^{\infty}kb_k^2.$$
2) If the spectral density $f(\lambda)$ is continuous, and $f(\lambda)>0$, then $I_{p-f}$ is finite if and only if the autocovariance functions satisfy the condition that $\sum_{k=1}^{\infty}kr_k^2<\infty$.
Lemma 3.10 Let $\{c(n)\}$ be a decreasing positive series. Then
$$\sum_{n=1}^{\infty}c(n)<\infty\implies\sum_{n=1}^{\infty}nc(n)^2<\infty.$$
Proof Since $\{c(n)\}$ is a decreasing positive series and $\sum_{n}c(n)$ converges, $nc(n)\to0$. Thus $\{nc(n)\}$ is bounded by a constant $M$. Therefore,
$$\sum_{n=1}^{\infty}nc(n)^2\le M\sum_{n=1}^{\infty}c(n)<\infty.$$
Remember that a stationary Gaussian process $X$ with autocovariance $\{r_n\}$ is long memory if $\sum_{n}|r_n|=\infty$. The following result shows the relationship between $\sum_{n}|r_n|$ and $I_{p-f}$ for a stationary Gaussian process:
Theorem 3.11 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stationary Gaussian process with decreasing autocovariance $\{|r_n|\}$. Suppose that the spectral density $f(\lambda)$ is continuous and that $f(\lambda)>0$. Then
$$\sum_{n=1}^{\infty}|r_n|<\infty\implies I_{p-f}<\infty.$$
In other words, the fact that $X$ is not long memory in the sense of covariance implies that it is also not long memory in the sense of mutual information.
The proof of Theorem 3.11 immediately follows from Lemmas 3.9 and 3.10.
Remark 3.12 Setting, for instance, $a(n)=\frac{1}{n\log n}$ for $n\ge2$, we have that $\sum_{n}a(n)=\infty$ but also that $\sum_{n}na(n)^2<\infty$. The converse implication in Lemma 3.10 is therefore not true. Thus, for the processes considered in Theorem 3.11, long memory in the sense of mutual information is stricter than that in the sense of covariance.
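For a Gaussian process the finite-block quantity $I_{p-f}(n)$ can be evaluated directly from the autocovariance, which gives a simple way to watch whether it stays bounded. The sketch below rests on the standard facts that $I(U;V)=h(U)+h(V)-h(U,V)$ and that the differential entropy of a Gaussian vector with covariance $K$ is $\frac12\log((2\pi e)^n\det K)$, so that $I_{p-f}(n)=\frac12\left(2\log\det K_n-\log\det K_{2n}\right)$ with $K_n$ the $n\times n$ Toeplitz covariance matrix. The AR(1) example, the parameter `phi` and the function names are hypothetical choices made for illustration, not something the paper prescribes.

```python
import numpy as np


def toeplitz_cov(acov, n):
    """n x n covariance matrix K with K[i, j] = acov(|i - j|)."""
    return np.array(
        [[acov(abs(i - j)) for j in range(n)] for i in range(n)], dtype=float
    )


def gaussian_past_future_mi(acov, n):
    """I_{p-f}(n) = 0.5 * (2 log det K_n - log det K_{2n}), in nats."""
    _, logdet_n = np.linalg.slogdet(toeplitz_cov(acov, n))
    _, logdet_2n = np.linalg.slogdet(toeplitz_cov(acov, 2 * n))
    return 0.5 * (2.0 * logdet_n - logdet_2n)


# Hypothetical example: AR(1) process X_t = phi * X_{t-1} + e_t, Var(e_t) = 1,
# whose autocovariance is r(k) = phi^|k| / (1 - phi^2).
phi = 0.7


def ar1_acov(k):
    return phi ** k / (1.0 - phi ** 2)


mi_values = [gaussian_past_future_mi(ar1_acov, n) for n in (1, 2, 5, 20)]
closed_form = -0.5 * np.log(1.0 - phi ** 2)

print(mi_values, closed_form)
```

Here $\sum_n|r_n|<\infty$, and the computed $I_{p-f}(n)$ equals $-\frac12\log(1-\phi^2)$ for every $n$, in line with Theorem 3.11. Any other autocovariance function can be passed as `acov`, provided the resulting Toeplitz matrices are positive definite.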
3.4 Fractional Gaussian Noise
Fractional Brownian motion (FBM) has been widely applied to a large number of natural shapes and phenomena. An FBM with the Hurst parameter $H\in(0,1)$ is a centered continuous-time Gaussian process $B_H(\cdot)$ with the covariance function
$$\mathrm{Cov}(B_H(s),B_H(t))=\frac{1}{2}\left(s^{2H}+t^{2H}-|s-t|^{2H}\right)$$
for $s,t\ge0$. $B_H$ reduces to an ordinary Brownian motion for $H=1/2$.
The incremental process of an FBM is a stationary discrete-time process and is called fractional Gaussian noise (FGN). The autocovariance function of FGN $X=\{X_k:k=0,1,\cdots\}$ can be derived as follows:
$$\rho_k=\frac{1}{2}\left(|k+1|^{2H}-2|k|^{2H}+|k-1|^{2H}\right).$$
It is plain to see that
$$\rho_k\sim H(2H-1)|k|^{2H-2}$$
as $|k|\to\infty$. Of course, if $H=1/2$, then $\rho_k=0$ for all $k\ge1$ (a Brownian motion has independent increments). One can conclude that the summability of correlations ($\sum_k|\rho_k|<+\infty$) holds when $0<H\le1/2$ and fails when $1/2<H<1$.
The following result shows that long memory in the sense of mutual information is equivalent to that in the sense of covariance for FGN:
Theorem 3.13 Let $X=\{X_n,n\in\mathbb{Z}\}$ be the (discrete) increment process of a fractional Brownian motion with Hurst parameter $H\in(0,1)$ and $H\neq1/2$. Then $I_{p-f}=+\infty$ if and only if $H\in(1/2,1)$.
Remark 3.14 This theorem shows that, for fractional Gaussian noise, the informatic characterization of long memory is identical to the second moment characterization.
Proof The spectral density $f(\lambda)$, $-\pi\le\lambda\le\pi$, of the increment of fractional Brownian motion $B_H(t)$ (fractional Gaussian noise) was obtained by Sinai [24] in a closed form involving the Gamma function $\Gamma(\cdot)$.
This spectral density can be rewritten in a form with a positive constant $C$, from which it can be seen that the spectral density of FGN is positive.
The spectral density $f(\lambda)$ is proportional to $|\lambda|^{1-2H}$ near $\lambda=0$. Thus, when $H\in(0,1/2)$, $f(\lambda)$ is continuous, but when $H\in(1/2,1)$, it is not continuous at $\lambda=0$.
Notice that if $H=1/2$, the FGN is the increment of classical Brownian motion, and it follows that $I_{p-f}=0$.
The theorem will be proven in two steps.
Step 1: $H\in(0,1/2)\implies I_{p-f}<\infty$.
When $H\in(0,1/2)$, the spectral density of FGN is positive and continuous, and by eq. (3.4), $\sum_k|\rho_k|<\infty$. Following on from Theorem 3.11, we have that $I_{p-f}<\infty$. We have proven that $H\in(0,1/2)\implies I_{p-f}<\infty$.
Step 2: $H\in(1/2,1)\implies I_{p-f}=\infty$.
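Since the spectral density of FGN for $H\in(1/2,1)$ is not continuous at $\lambda=0$, we prove the claim by the first part of Lemma 3.9, which does not require a continuous spectral density.
Now we estimate the mutual information of FGN via the logarithm of the spectral density. We have that $\log f(\lambda)$ is an even function on $[-\pi,\pi]$, i.e., $\log f(\lambda)=\log f(-\lambda)$. For $n\neq0$, the cepstrum coefficient $b(n)$ is decomposed into three terms, $b(n)=b_1(n)+b_2(n)+b_3(n)$.
For $b_1(n)$, a direct computation shows that $b_1(n)$ is of order $1/n$ as $n$ goes to infinity.
For a twice continuously differentiable function $g$, the Fourier coefficient of order $n$ behaves like $O(1/n^2)$ [25]. Observe that $\log[(1+g_1(\lambda))/2]$ is twice differentiable, so there exists a positive constant $M_1<\infty$ such that $|b_2(n)|\le M_1/n^2$.
Now we estimate $b_3(n)$. Since, when $H>1/2$, $|x|^{2H-1}$, $|x|^{2H}$, $|x|^{2H+1}$, $A_H(x)$ and the derivatives $A_H'(x)$, $A_H''(x)$ are continuous functions on $[-\pi,\pi]$, the derivatives of the function defining $b_3(n)$ are also continuous on $[-\pi,\pi]$. We conclude that $b_3(n)$ is of smaller order than $1/n$, with a bound involving some positive constant $M_2<\infty$.
Combining the estimations of $b_1(n)$, $b_2(n)$ and $b_3(n)$, it follows that there exists a positive integer $N_0$ such that $|b(n)|$ is bounded below by a positive constant multiple of $1/n$ for $n>N_0$.
Hence, by Lemma 3.9, the mutual information is
$$I_{p-f}=\frac{1}{2}\sum_{n=1}^{\infty}n\,b(n)^2=\infty.$$
Remark 3.15 From the proof of Theorem 3.13, it can be seen that, for $H>1/2$, $|b(n)|$ is of order $1/n$ as $n\to+\infty$, which implies that the partial sums $\sum_{k=1}^{n}k\,b(k)^2$ grow logarithmically in $n$.
4 Excess Entropy
In this section, we try to relate $I_{p-f}$ to excess entropy, which is an intuitive measure of the memory stored in a stationary stochastic process, so that we can obtain a calculation of $I_{p-f}$.
First we recall the definition of the entropy rate of a stochastic process $X=\{X_n\}$ [16].
Definition 4.1 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stochastic process.
1) Assume that each $X_n$ is a discrete random variable. The Shannon entropy rate of $X$ is defined by
$$h_\mu(X)=\lim_{n\to\infty}\frac{H(X_1,\cdots,X_n)}{n}$$
when the limit exists.
2) Assume that $(X_1,\cdots,X_n)$ has a continuous joint distribution for each $n\in\mathbb{Z}^+$. The differential entropy rate of $X$ is defined by
$$h_\mu(X)=\lim_{n\to\infty}\frac{h(X_1,\cdots,X_n)}{n}$$
when the limit exists.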
It is known that for a stationary process with finite block entropy, the Shannon (differential) entropy rate always exists [6,16]. In what follows, when we mention the entropy rate $h_\mu$ of a stationary process $X$ with finite block entropy, $h_\mu$ is a Shannon entropy rate if $X$ is a discrete-valued process, and it is a differential entropy rate if $X$ is a continuous-valued process.
If $X$ is a discrete-valued stationary process, by Lemma 2.1, $\Delta H(n)=H(n)-H(n-1)$ is nonnegative and nonincreasing. Then the limit of $\Delta H(n)$ exists and is finite, and it is equal to the entropy rate $h_\mu$ because
$$h_\mu=\lim_{n\to\infty}\frac{H(n)}{n}=\lim_{n\to\infty}\frac{1}{n}\sum_{k=1}^{n}\Delta H(k)=\lim_{n\to\infty}\Delta H(n).$$
If $X$ is a continuous-valued stationary process, by Lemma 2.1, $\Delta h(n)=h(n)-h(n-1)$ is nonincreasing and may be negative. Because the entropy rate $h_\mu$ is equal to $\lim_{n\to\infty}\Delta h(n)$, $h_\mu$ exists and may be $-\infty$ [23]. Details of the differential entropy rate can be found in [10,16].
We show an example for different values of $h_\mu$ in the case of a continuous-valued stationary process.
Example 1 Supposing that $\{X_n\}$ is a stationary Gaussian process, we have the joint entropy [6]
$$h(X_1,\cdots,X_n)=\frac{1}{2}\log\left((2\pi e)^n|K^{(n)}|\right),$$
where $K^{(n)}$ is the covariance matrix with elements $K^{(n)}_{ij}=r(i-j)=\mathrm{E}(X_i-\mathrm{E}X_i)(X_j-\mathrm{E}X_j)$. Thus it is Toeplitz with entries $r(0),r(1),\cdots,r(n-1)$ along the top row. The density of eigenvalues of $K^{(n)}$ tends to the spectrum of the process as $n\to\infty$. It has been shown by Kolmogorov that the differential entropy rate of a stationary Gaussian process can be given by
$$h_\mu=\frac{1}{2}\log(2\pi e)+\frac{1}{4\pi}\int_{-\pi}^{\pi}\log S(\lambda)\,d\lambda,$$
where $S(\lambda)$ is the power spectral density of the stationary Gaussian process $X$. On the other hand, $h_\mu=\frac{1}{2}\log(2\pi e\sigma_\infty^2)$, where $\sigma_\infty^2$ is the variance of the error in the best prediction of $X_n$ given the infinite past [6]; in particular, $h_\mu=-\infty$ when $\sigma_\infty^2=0$.
A significant property of the entropy rate is the AEP (asymptotic equipartition property), also known as the Shannon-McMillan-Breiman theorem, which states that if $h_\mu$ is the entropy rate of a finite-valued stationary ergodic process $\{X_n\}$, then
$$-\frac{1}{n}\log p(X_1,\cdots,X_n)\to h_\mu$$
with probability 1. The entropy rate $h_\mu$ quantifies the irreducible randomness in sequences produced by a stationary source: the randomness that remains after the correlations and structures in longer and longer sequence blocks are taken into account.
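The Gaussian block entropy of Example 1 and the convergence of the entropy gain $\Delta h(n)$ to the entropy rate can be reproduced in a few lines. The sketch below is illustrative only and makes its own assumptions: a hypothetical AR(1) process with coefficient `phi` and innovation variance `s2`, for which the one-step prediction error variance equals `s2`, so that the entropy rate is $\frac12\log(2\pi e\,s2)$ by the standard prediction-error form of the Gaussian entropy rate. The function name `gaussian_block_entropy` is likewise hypothetical.

```python
import numpy as np


def gaussian_block_entropy(acov, n):
    """h(n) = 0.5 * log((2*pi*e)^n * det K^(n)) in nats, with h(0) = 0."""
    if n == 0:
        return 0.0
    K = np.array(
        [[acov(abs(i - j)) for j in range(n)] for i in range(n)], dtype=float
    )
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (n * np.log(2.0 * np.pi * np.e) + logdet)


# Hypothetical AR(1) process: X_t = phi * X_{t-1} + e_t with Var(e_t) = s2,
# so that r(k) = s2 * phi^|k| / (1 - phi^2).
phi, s2 = 0.5, 2.0


def acov(k):
    return s2 * phi ** k / (1.0 - phi ** 2)


h = [gaussian_block_entropy(acov, n) for n in range(0, 9)]
gains = [h[n] - h[n - 1] for n in range(1, 9)]       # Delta h(1), ..., Delta h(8)
rate = 0.5 * np.log(2.0 * np.pi * np.e * s2)         # entropy rate for this AR(1)

print("entropy gains:", np.round(gains, 6))
print("entropy rate :", round(float(rate), 6))
print("h(8)/8       :", round(h[8] / 8, 6))
```

For this process the gain drops to the entropy rate already at $n=2$, while $h(n)/n$ approaches the same limit only at rate $1/n$; the gap between the two is exactly what the excess entropy of the next paragraphs accumulates.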
For a discrete-valued stationary process $X$, by the definition of the entropy rate,
$$\lim_{n\to\infty}\frac{H(n)}{n}=h_\mu.$$
However, the value $h_\mu$ indicates nothing about how $H(n)/n$ approaches this limit. Moreover, there may be sublinear terms in $H(n)$. For example, one may have $H(n)\sim nh_\mu+c$ or $H(n)\sim nh_\mu+\log n$. The sublinear terms in $H(n)$ and the manner in which $H(n)$ converges to its asymptotic form may reveal important structural properties of a stationary process.
Definition 4.2 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stationary process with finite block entropy, and let $h_\mu$ be the entropy rate of $X$. Then,
1) if $X$ is a discrete-valued process, the excess entropy of $X$ is
$$E=\sum_{n=1}^{\infty}(\Delta H(n)-h_\mu);$$
2) if $X$ is a continuous-valued process, the excess entropy of $X$ is
$$E=\sum_{n=1}^{\infty}(\Delta h(n)-h_\mu).$$
Remark 4.3 This definition follows from [8]. Note that for a discrete-valued stationary process and $n\in\mathbb{Z}^+$,
$$\Delta H(n)\ge\Delta H(n+1)\ge h_\mu\ge0.$$
Then $\{\Delta H(n)-h_\mu\}_n$ is a monotonically nonincreasing and nonnegative sequence and converges to 0. Moreover, the excess entropy is $E=+\infty$ or a nonnegative constant.
For a continuous-valued stationary process and $n\in\mathbb{Z}^+$,
$$\Delta h(n)\ge\Delta h(n+1)\ge h_\mu.$$
If the entropy rate $h_\mu>-\infty$, $\{\Delta h(n)-h_\mu\}_n$ is a monotonically nonincreasing and nonnegative sequence and converges to 0.
$\Delta H(n)-h_\mu=H(X_n|X_1,\cdots,X_{n-1})-h_\mu$ is referred to as the per-symbol redundancy $r(n)$, because it tells us how much additional information must be gained about the process in order to reveal the actual per-symbol uncertainty $h_\mu$. In other words, the excess entropy $E$ is the summation of the per-symbol redundancies [8]. Note that for a stationary process with finite block entropy and a finite entropy rate, the conditional entropy $H(X_n|X_{n-1},\cdots,X_1)$ (or $h(X_n|X_{n-1},\cdots,X_1)$) converges decreasingly to $h_\mu$; the excess entropy satisfies $E<+\infty$ if the rate of convergence is fast, and $E=+\infty$ if the rate of convergence is slow. Substituting $\Delta H(n)=H(n)-H(n-1)$ into the definition of the excess entropy, we know that
$$E=\lim_{n\to\infty}\left(H(n)-nh_\mu\right).$$
If the excess entropy is finite, we obtain that $H(n)\approx nh_\mu+E$ as $n\to\infty$.
Using the notion of entropy rate, one can see that the past has little to say about the future.
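Proposition 4.4 If $X$ is a stationary process with finite block entropy, and the entropy rate $h_\mu>-\infty$, then
$$\lim_{n\to\infty}\frac{I_{p-f}(n)}{n}=0.$$
Thus, the dependence between adjacent $n$-blocks of a stationary process with bounded entropy rate does not grow linearly with $n$.
Proof Suppose that $X$ is a discrete-valued process. By definition,
$$I_{p-f}(n)=H(X_1,\cdots,X_n)+H(X_{n+1},\cdots,X_{2n})-H(X_1,\cdots,X_{2n}).$$
Since $X$ is stationary, we obtain that $I_{p-f}(n)=2H(n)-H(2n)$. It follows that
$$\lim_{n\to\infty}\frac{I_{p-f}(n)}{n}=2\lim_{n\to\infty}\frac{H(n)}{n}-2\lim_{n\to\infty}\frac{H(2n)}{2n}=2h_\mu-2h_\mu=0.$$
If $X$ is a continuous-valued process with bounded entropy rate $h_\mu>-\infty$, the proof is similar.
We give two examples for values of the excess entropy $E$.
Example 2 For independent and identically distributed discrete-valued processes, the entropy rate is $h_\mu=H(1)$, since $H(n)=nH(1)$, and thus the excess entropy is $E=0$.
Example 3 For an irreducible positive recurrent Markov chain $X$ defined on a countable number of states, given the transition matrix $P_{ij}$, the entropy rate of $X$ is given by
$$h_\mu=-\sum_{i,j}u_iP_{ij}\log P_{ij},$$
where $u_i$ is the stationary distribution of the chain. By the Markovian property, the excess entropy of the Markov chain is
$$E=H(X_1)-h_\mu=-\sum_{i}u_i\log u_i+\sum_{i,j}u_iP_{ij}\log P_{ij}.$$
Finally we discuss the relationship between the excess entropy and the mutual information. One useful lemma is given below.
Lemma 4.5 Let $\{a_n,n\in\mathbb{Z}^+\}$ be a nonincreasing positive series and let $A_n=\sum_{k=1}^{n}a_k$ and $B_n=\sum_{k=1}^{n}(a_k-a_n)$. Suppose that $\lim_{n\to\infty}a_n=0$. Then $A_n$ is convergent if and only if $B_n$ is convergent, and these have the same limit when they are convergent.
Proof Suppose that $A_n$ is convergent. Since $\{a_n\}$ is nonincreasing and positive, $na_n\to0$, so $B_n=A_n-na_n$ converges to the same limit.
Suppose that $B_n$ is convergent. Noticing that
$$B_{n+s}-B_n=\sum_{k=n+1}^{n+s}(a_k-a_{n+s})+n(a_n-a_{n+s})\ge n(a_n-a_{n+s})\ge0,$$
it follows that $\forall\varepsilon>0$, $\exists N>0$, $\forall n,s\in\mathbb{N}$, $n>N$,
$$0\le B_{n+s}-B_n<\varepsilon.$$
As a result, $n(a_n-a_{n+s})<\varepsilon$, and letting $s\to\infty$ gives $na_n\le\varepsilon$. We obtain, for $n>N$, that
$$0\le A_n-B_n=na_n\le\varepsilon.$$
The lemma is proven.
The following result shows that the excess entropy and the mutual information are identical for a stationary process with finite block entropy:
Theorem 4.6 Let $X=\{X_n,n\in\mathbb{Z}\}$ be a stationary process with finite block entropy. Then the excess entropy and the mutual information are identical:
$$E=I_{p-f}.$$
Proof We prove the result for a discrete-valued stationary process and a continuous-valued process, respectively.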
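First, we suppose that $X$ is a discrete-valued process. Denote
$$E_n=\sum_{k=1}^{n}(\Delta H(k)-h_\mu)=H(n)-nh_\mu,\qquad D_n=nH(n-1)-(n-1)H(n).$$
Denote $a_n=\Delta H(n)-h_\mu$; by the definition of the entropy rate, we know that $a_n\ge0$, $a_n\ge a_{n+1}$ for every positive integer $n$, and that $\lim_{n\to\infty}a_n=0$. Furthermore, we have that
$$E_n=\sum_{k=1}^{n}a_k,\qquad D_n=\sum_{k=1}^{n}(a_k-a_n),\qquad I_{p-f}(n)=2H(n)-H(2n)=\sum_{k=1}^{n}(a_k-a_{n+k}).$$
We conclude that
$$E_n\ge I_{p-f}(n)\ge D_n.\qquad(4.2)$$
In fact, the first inequality follows from the fact that $\{a_n\}$ is nonnegative, and the second inequality follows from the fact that $\{a_n\}$ is nonincreasing.
Since $E_n$ is the partial summation of the nonnegative series $\{a_n\}$, $E_n$ is nondecreasing. Notice that
$$D_{n+1}-D_n=n(a_n-a_{n+1})\ge0,$$
and that both $I_{p-f}(n)$ and $D_n$ are nondecreasing. It follows that the following three limits exist (possibly as $+\infty$):
$$E=\lim_{n\to\infty}E_n,\qquad I_{p-f}=\lim_{n\to\infty}I_{p-f}(n),\qquad D=\lim_{n\to\infty}D_n.$$
By (4.2) and Lemma 4.5, $E_n$ and $D_n$ are convergent at the same time, and have the same limit. Furthermore, we have that
$$E=I_{p-f}=D.$$
When $E_n$ and $D_n$ are not convergent, since $E_n$ and $D_n$ are nondecreasing, by (4.2), we have that
$$E=I_{p-f}=D=+\infty.$$
The theorem is proven for a discrete-valued process.
Now we suppose that $X$ is a continuous-valued stationary process.
If the differential entropy rate $h_\mu$ is finite, i.e., $h_\mu>-\infty$, set $\tilde{a}_n=\Delta h(n)-h_\mu$, $n=1,2,\cdots$. Then $\{\tilde{a}_n\}$ is a nonnegative and nonincreasing sequence that converges to 0. By the same argument as for the discrete-valued process, one can show that $I_{p-f}=E$.
If the differential entropy rate $h_\mu$ is infinite, i.e., $h_\mu=-\infty$, we have that $\tilde{a}_n=\Delta h(n)-h_\mu=\infty$ for $n=1,2,\cdots$, because $X$ admits finite block entropy. By the definition of excess entropy,
$$E=\sum_{n=1}^{\infty}(\Delta h(n)-h_\mu)=+\infty.$$
On the other hand, letting $\overleftarrow{X}$ and $\overrightarrow{X}$ be the past and future of $X$, by the third claim in Lemma 3.3,
$$I_{p-f}=I(\overleftarrow{X};\overrightarrow{X})\ge I(\overleftarrow{X};X_1)=h(X_1)-h(X_1|\overleftarrow{X}).$$
Since the differential entropy rate is $h_\mu(X)=h(X_1|\overleftarrow{X})=-\infty$, we conclude that
$$I_{p-f}\ge h(X_1)-h_\mu=+\infty.$$
Hence, $I_{p-f}=E=+\infty$.
The proof for a continuous-valued process is complete.
Remark 4.7 1) The equality $E=I_{p-f}$ is claimed in [8] for a stationary process with a discrete state space, where a heuristic “proof” was also given. The proof is simple so is omitted here.
2) The definition of the excess entropy $E$ depends on the entropy rate $h_\mu$. The equation $E=I_{p-f}=D$ provides two series to approximate $I_{p-f}$ and the excess entropy, $\{2H(n)-H(2n)\}_n$ as well as $\{nH(n-1)-(n-1)H(n)\}_n$, which enables us to obtain lower bounds of the excess entropy $E$ without knowing the entropy rate $h_\mu$.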
3) For a continuous-valued stationary process with finite block entropy, if $h_\mu=-\infty$, then $I_{p-f}=E=+\infty$. Such a process is always long memory.
5 Conclusion
The finiteness or infiniteness of the mutual information between past and future, $I_{p-f}$, can be regarded as a criterion separating short memory and long memory stationary processes. For a stationary process with finite block entropy, $I_{p-f}$ is the same as the excess entropy $E$, which provides a good approximation of $I_{p-f}$. The definitions of $I_{p-f}$ and the excess entropy of a stationary process require only a very weak moment condition on the distribution of the process, and can be applied to processes whose distributions lack a bounded second moment. A significant property of $I_{p-f}$ is that it is invariant under one-to-one transformation. This invariance enables us to know the $I_{p-f}$ of a stationary process from the $I_{p-f}$ of other processes. Since conditional entropy can capture the dependence between random variables well, $I_{p-f}$ and excess entropy are relevant for capturing the dependence of a stationary process whose distribution is far from a Gaussian distribution. For stationary Gaussian processes, long memory in the sense of $I_{p-f}$ is a bit stricter than that in the sense of covariance. For fractional Gaussian noise, $I_{p-f}=\infty$ if and only if $H\in(1/2,1)$. An important problem here is to provide an effective algorithm for approximating $I_{p-f}$ or the excess entropy $E$, which is essential for future applications. It would also be interesting to use an informatic approach to consider the long memory behaviors of harmonizable processes and measure preserving transformations.
Conflict of Interest Yiming Ding is an editorial board member for Acta Mathematica Scientia and was not involved in the editorial review or the decision to publish this article. All authors declare that there are no competing interests.
3 Long Memory and Mutual Information
3.1 Definition of Long Memory
3.2 Invariance Under One-to-One Transformation
3.3 Stationary Gaussian Process
3.4 Fractional Gaussian Noise
4 Excess Entropy
5 Conclusion
Acta Mathematica Scientia (English Series), 2023, Issue 6