时间序列模型,其中为趋势项,为季节项(假设周期为,则),为平稳项(统计特性不随时间变化而改变)
一、对于趋势项的两种处理方式:
-
估计趋势并从原序列中去除,趋势的估计方法主要有以下几种:
-
滑动平均:
可理解为Kernel Regression的一种特殊形式(Nadaraya–Watson estimator)
假设滑动窗口为,则趋势项
-
线性回归:
趋势项,通常取1或2
-
Local Polynomial Regression:
取离目标点最近的几个点进行加权线性回归
假设窗口为,核函数取为,则需求解
趋势项,通常取1或2
-
Splines Regression:
-
Truncated Polynomial
假设选中的knots为,以cubic splines(i.e., degree=3)为例,它的基函数可取为
趋势项
-
B-Spline
和Truncated Polynomial等价但能有效减少计算量以及特征之间的共线性问题,仍以cubic splines(i.e., degree=3, M=degree+1=4)为例
记和为boundary knots(),令以及
记为第i个阶数为m的基函数,有(注:)
根据递推公式,可以得到
趋势项(注:B-spline可以减少计算量的主要原因是生成的基函数最多仅在跨度为个knots范围内为非0值)
-
-
-
将相邻数据相减从而直接去除趋势,即
趋势估计应用(使用的数据集可从这里下载,提取码为16mm)
data = read.table("AvTempAtlanta.txt",header=T)
temp = as.vector(t(data[,-c(1,14)])) #去掉第一列和第十四列,转置并变为向量类型(R语言中的数组为列优先)
temp = ts(temp,start=1879,frequency=12)
ts.plot(temp,ylab="Temperature") #左图
## time
time.pts = c(1:length(temp)) #1,2,...,length(temp)
time.pts = c(time.pts - min(time.pts))/max(time.pts)
## Fit a moving average
mav.fit = ksmooth(time.pts, temp, kernel = "box")
temp.fit.mav = ts(mav.fit$y,start=1879,frequency=12)
## Fit a linear regression(quadraric polynomial)
x1 = time.pts
x2 = time.pts^2
lm.fit = lm(temp~x1+x2)
print(summary(lm.fit))
temp.fit.lm = ts(fitted(lm.fit),start=1879,frequency=12)
## Fit a local polynomial regression
loc.fit = loess(temp~time.pts)
temp.fit.loc = ts(fitted(loc.fit),start=1879,frequency=12)
## Fit a splines regression
library(mgcv)
gam.fit = gam(temp~s(time.pts))
temp.fit.gam = ts(fitted(gam.fit),start=1879,frequency=12)
## Compare all estimated trends
all.val = c(temp.fit.mav,temp.fit.lm,temp.fit.gam,temp.fit.loc)
ylim= c(min(all.val),max(all.val))
ts.plot(temp.fit.lm,lwd=2,col="green",ylim=ylim,ylab="Temperature") #右图
lines(temp.fit.mav,lwd=2,col="purple")
lines(temp.fit.gam,lwd=2,col="red")
lines(temp.fit.loc,lwd=2,col="brown")
legend(x=1900,y=64,legend=c("MAV","LM","GAM","LOESS"),lty = 1, col=c("purple","green","red","brown"))
二、对于季节项的两种处理方式:
-
估计季节项并从原序列中去除,估计方法主要有以下几种:
-
季节平均
-
三角函数,
- 若存在多个周期,则有
-
-
数据相减直接去除季节性,即
季节估计应用
library(TSA)
## Estimate seasonality using seasonal mean model
month = season(temp)
model1 = lm(temp~month-1) #all seasonal mean effects (model without intercept)
print(summary(model1))
## Estimate seasonality using cos-sin model
har2=harmonic(temp,2)
model2=lm(temp~har2)
print(summary(model2))
## Compare Seasonality Estimates
st1 = coef(model1)
st2 = fitted(model2)[1:12]
plot(1:12,st1,lwd=2,type="l",col='green',xlab="Month",ylab="Seasonality") #左图
lines(1:12,st2,lwd=2, col="brown")
趋势和季节估计应用
## Linear Regression
x1 = time.pts
x2 = time.pts^2
har2=harmonic(temp,2)
lm.fit = lm(temp~x1+x2+har2)
print(summary(lm.fit))
dif.fit.lm = ts((temp-fitted(lm.fit)),start=1879,frequency=12)
## Spline Regression for Trend and Linear Regression for Seasonality
gam.fit = gam(temp~s(time.pts)+har2)
print(summary(gam.fit))
dif.fit.gam = ts((temp-fitted(gam.fit)),start=1879,frequency=12)
## Compare approaches
ts.plot(dif.fit.lm,ylab="Residual Process",col="brown") #右图
lines(dif.fit.gam,col="blue")
三、平稳项
Auto-Covariance Function
如果满足,则序列是(弱)平稳序列
针对平稳序列,定义Auto-Covariance Function , 可取任意值,则Auto-Correlation Function(ACF)
Sample Auto-Covariance Function ,其中
Sample ACF可表示为
acf(temp,lag.max=12*4,main="") #左图(用于对比)
acf(dif.fit.lm,lag.max=12*4,main="") #中图
acf(dif.fit.gam,lag.max=12*4,main="") #右图
四、完整案例应用
使用的数据集可从这里下载,提取码为uqq8
- Process Data
edvoldata = read.csv("EGDailyVolume.csv",header=T) ## Process Dates year = edvoldata$Year month = edvoldata$Month day = edvoldata$Day datemat = cbind(as.character(day),as.character(month),as.character(year)) paste.dates = function(date){ day = date[1]; month=date[2]; year = date[3] return(paste(day,month,year,sep="/")) } dates = apply(datemat,1,paste.dates) dates = as.Date(dates, format="%d/%m/%Y") edvoldata = cbind(dates,edvoldata) ## Transform ED Volume data(Stabilize Variance) Volume.tr = sqrt(edvoldata$Volume+3/8) hist(edvoldata$Volume,nclass=20,xlab="ED Volume", main="",col="brown") #左图 hist(Volume.tr,nclass=20,xlab= "Transformed ED Volume", main="",col="blue") #右图
- Trend and Seasonality
library(mgcv) time.pts = c(1:length(Volume.tr)) time.pts = c(time.pts - min(time.pts))/max(time.pts) ## Model Trend + Monthly Seasonality ## Use Splines Trend and Seasonal Mean Model month = as.factor(format(dates,"%b")) gam.fit.seastr.1 = gam(Volume.tr~s(time.pts)+month) print(summary(gam.fit.seastr.1)) vol.fit.gam.seastr.1 = fitted(gam.fit.seastr.1) ## Add day-of-the-week seasonality week = as.factor(weekdays(dates)) gam.fit.seastr.2 = gam(Volume.tr~s(time.pts)+month+week) print(summary(gam.fit.seastr.2)) vol.fit.gam.seastr.2 = fitted(gam.fit.seastr.2) ## Compare the two fits: with & without day-of-the-week seasonality plot(dates,Volume.tr,type="l",ylab="Transformed Daily ED Volume") #下图 lines(dates,vol.fit.gam.seastr.2,lwd=2,col="red") lines(dates,vol.fit.gam.seastr.1,lwd=2,col="green")
- Stationary Residual
## Residual Process: Trend Removal gam.fit = gam(Volume.tr~s(time.pts)) vol.fit.gam = fitted(gam.fit) resid.1 = Volume.tr-vol.fit.gam ## Residual Process: Seasonal Removal lm.fit.seastr.2 = lm(Volume.tr~month+week) vol.fit.lm.seastr.2 = fitted(lm.fit.seastr.2) resid.2 = Volume.tr-vol.fit.lm.seastr.2 ## Residual Process: Trend & Seasonal Removal resid.3 = Volume.tr-vol.fit.gam.seastr.2 ## Compare Residuals y.min = min(c(resid.1,resid.2,resid.3)) y.max = max(c(resid.1,resid.2,resid.3)) plot(dates, resid.1, ylim=c(y.min, y.max), type="l", ylab="Residual Process") #上图 lines(dates,resid.2,col="blue") lines(dates,resid.3,col="brown") legend("bottom",legend=c("Trend","Season","Trend+Season"),lty = 1, col=c("black","blue","brown")) ## ACF acf(resid.1,lag.max=12*4,main="") #左下图 acf(resid.2,lag.max=12*4,main="",col="blue") #中下图 acf(resid.3,lag.max=12*4,main="",col="brown") #右下图