Wednesday, April 6, 2011

Linear Regression and group by in R

I wan to do a linear regression in R using the lm() function. My data is an annual time series with one field for year (22 years) and another for state (50 states). I want to fit a regression for each state so that at the end I have a vector of lm responses. I can imagine doing for loop for each state then doing the regression inside the loop and adding the results of each regression to a vector. That does not seem very R-like, however. In SAS I would do a 'by' statement and in SQL I would do a 'group by'. What's the R way of doing this?

From stackoverflow
  • ## make fake data
    > ngroups <- 2
    > group <- 1:ngroups
    > nobs <- 100
    > dta <- data.frame(group=rep(group,each=nobs),y=rnorm(nobs*ngroups),x=runif(nobs*ngroups))
    > head(dta)
      group          y         x
    1     1  0.6482007 0.5429575
    2     1 -0.4637118 0.7052843
    3     1 -0.5129840 0.7312955
    4     1 -0.6612649 0.9028034
    5     1 -0.5197448 0.1661308
    6     1  0.4240346 0.8944253
    > 
    > ## function to extract the results of one model
    > foo <- function(z) {
    +   ## coef and se in a data frame
    +   mr <- data.frame(coef(summary(lm(y~x,data=z))))
    +   ## put row names (predictors/indep variables)
    +   mr$predictor <- rownames(mr)
    +   mr
    + }
    > ## see that it works
    > foo(subset(dta,group==1))
                  Estimate Std..Error   t.value  Pr...t..   predictor
    (Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
    x           -0.3669890  0.3321875 -1.104765 0.2719666           x
    > ## one option: use command by
    > res <- by(dta,dta$group,foo)
    > res
    dta$group: 1
                  Estimate Std..Error   t.value  Pr...t..   predictor
    (Intercept)  0.2176477  0.1919140  1.134090 0.2595235 (Intercept)
    x           -0.3669890  0.3321875 -1.104765 0.2719666           x
    ------------------------------------------------------------ 
    dta$group: 2
                   Estimate Std..Error    t.value  Pr...t..   predictor
    (Intercept) -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
    x            0.06286456  0.3020321  0.2081387 0.8355526           x
    > ## using package plyr is better
    > library(plyr)
    > res <- ddply(dta,"group",foo)
    > res
      group    Estimate Std..Error    t.value  Pr...t..   predictor
    1     1  0.21764767  0.1919140  1.1340897 0.2595235 (Intercept)
    2     1 -0.36698898  0.3321875 -1.1047647 0.2719666           x
    3     2 -0.04039422  0.1682335 -0.2401081 0.8107480 (Intercept)
    4     2  0.06286456  0.3020321  0.2081387 0.8355526           x
    >
    
  • Here's one way using the lme4 package.

    > library(lme4)
    > d <- data.frame(state=rep(c('NY', 'CA'), c(10, 10)),
    +                 year=rep(1:10, 2),
    +                 response=c(rnorm(10), rnorm(10)))
    
    > xyplot(response ~ year, groups=state, data=d, type='l')
    
    > fits <- lmList(response ~ year | state, data=d)
    > fits
    Call: lmList(formula = response ~ year | state, data = d)
    Coefficients:
       (Intercept)        year
    CA -1.34420990  0.17139963
    NY  0.00196176 -0.01852429
    
    Degrees of freedom: 20 total; 16 residual
    Residual standard error: 0.8201316
    
  • In my opinion is a mixed linear model a better approach for this kind of data. The code below given in the fixed effect the overall trend. The random effects indicate how the trend for each individual state differ from the global trend. The correlation structure takes the temporal autocorrelation into account. Have a look at Pinheiro & Bates (Mixed Effects Models in S and S-Plus).

    library(nlme)
    lme(response ~ year, random = ~year|state, correlation = corAR1(~year))
    
    JD Long : This is a really good general stats theory answer which makes me think about some things I had not considered. The application which caused me to ask the question would not be applicable to this solution, but I'm glad you brought it up. Thank you.
    hadley : It's not a good idea to start with a mixed model - how do you know that any of the assumptions are warranted?
    Thierry : One should check the assumption by model validation (and knowledge of the data). BTW you cannot warrant the assumption on the individual lm's either. You would have to validate all models seperately.
  • Here's an approach using the plyr package:

    d <- data.frame(
      state = rep(c('NY', 'CA'), 10),
      year = rep(1:10, 2),
      response= rnorm(20)
    )
    
    library(plyr)
    # Break up d by state, then fit the specified model to each piece and
    # return a list
    models <- dlply(d, "state", function(df) 
      lm(response ~ year, data = df))
    
    # Apply coef to each model and return a data frame
    ldply(models, coef)
    
    # Print the summary of each model
    l_ply(models, summary, .print = TRUE)
    

0 comments:

Post a Comment