I wan to do a linear regression in R using the lm()
function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would do a 'by' statement and in SQL I would do a 'group by'. What's the R way of doing this?
-
## make fake data > ngroups <- 2 > group <- 1:ngroups > nobs <- 100 > dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups)) > head(dta) group y x 1 1 0.6482007 0.5429575 2 1 -0.4637118 0.7052843 3 1 -0.5129840 0.7312955 4 1 -0.6612649 0.9028034 5 1 -0.5197448 0.1661308 6 1 0.4240346 0.8944253 > > ## function to extract the results of one model > foo <- function(z) { + ## coef and se in a data frame + mr <- data.frame(coef(summary(lm(y~x,data=z)))) + ## put row names (predictors/indep variables) + mr$predictor <- rownames(mr) + mr + } > ## see that it works > foo(subset(dta,group==1)) Estimate Std..Error t.value Pr...t.. predictor (Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept) x -0.3669890 0.3321875 -1.104765 0.2719666 x > ## one option: use command by > res <- by(dta,dta$group,foo) > res dta$group: 1 Estimate Std..Error t.value Pr...t.. predictor (Intercept) 0.2176477 0.1919140 1.134090 0.2595235 (Intercept) x -0.3669890 0.3321875 -1.104765 0.2719666 x ------------------------------------------------------------ dta$group: 2 Estimate Std..Error t.value Pr...t.. predictor (Intercept) -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept) x 0.06286456 0.3020321 0.2081387 0.8355526 x > ## using package plyr is better > library(plyr) > res <- ddply(dta,"group",foo) > res group Estimate Std..Error t.value Pr...t.. predictor 1 1 0.21764767 0.1919140 1.1340897 0.2595235 (Intercept) 2 1 -0.36698898 0.3321875 -1.1047647 0.2719666 x 3 2 -0.04039422 0.1682335 -0.2401081 0.8107480 (Intercept) 4 2 0.06286456 0.3020321 0.2081387 0.8355526 x >
-
Here's one way using the
lme4
package.> library(lme4) > d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)), + year=rep(1:10, 2), + response=c(rnorm(10), rnorm(10))) > xyplot(response ~ year, groups=state, data=d, type='l') > fits <- lmList(response ~ year | state, data=d) > fits Call: lmList(formula = response ~ year | state, data = d) Coefficients: (Intercept) year CA -1.34420990 0.17139963 NY 0.00196176 -0.01852429 Degrees of freedom: 20 total; 16 residual Residual standard error: 0.8201316
-
In my opinion is a mixed linear model a better approach for this kind of data. The code below given in the fixed effect the overall trend. The random effects indicate how the trend for each individual state differ from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).
library(nlme) lme(response ~ year, random = ~year|state, correlation = corAR1(~year))
JD Long : This is a really good general stats theory answer which makes me think about some things I had not considered. The application which caused me to ask the question would not be applicable to this solution, but I'm glad you brought it up. Thank you.hadley : It's not a good idea to start with a mixed model - how do you know that any of the assumptions are warranted?Thierry : One should check the assumption by model validation (and knowledge of the data). BTW you cannot warrant the assumption on the individual lm's either. You would have to validate all models seperately. -
Here's an approach using the plyr package:
d <- data.frame( state = rep(c('NY', 'CA'), 10), year = rep(1:10, 2), response= rnorm(20) ) library(plyr) # Break up d by state, then fit the specified model to each piece and # return a list models <- dlply(d, "state", function(df) lm(response ~ year, data = df)) # Apply coef to each model and return a data frame ldply(models, coef) # Print the summary of each model l_ply(models, summary, .print = TRUE)
0 comments:
Post a Comment