R error which says “Models were not all fitted to the same size of dataset”

The main cause of that error is missing values in one or more of the predictor variables. In recent versions of R the default action is to omit every row that has a missing value in any variable used by the model (the previous default was to produce an error). So, for example, if the data frame has 100 rows and there is one missing value in X3, then your model glm1 will be fit to 99 rows of data (dropping the row where X3 is missing), but the glm2 object will be fit to the full 100 rows of data (since it does not use X3, no rows need to be deleted).

The anova function then gives you an error because the two models were fit to different datasets (it is not clear how the degrees of freedom, etc., should be computed in that case).
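The mismatch described above can be reproduced with a small sketch. The data here are simulated, and the names dat, Y, X1, X2 and X3 are placeholders chosen to mirror the models in the question, not the original data:

```r
# Simulated data with a single missing value in X3 (illustrative only)
set.seed(1)
dat <- data.frame(
  Y  = rbinom(100, 1, 0.5),
  X1 = rnorm(100),
  X2 = rnorm(100),
  X3 = rnorm(100)
)
dat$X3[5] <- NA

glm1 <- glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), data = dat)
glm2 <- glm(Y ~ X1 + X2,      family = binomial(link = logit), data = dat)

# glm1 silently drops the incomplete row, glm2 does not
nobs(glm1)  # 99
nobs(glm2)  # 100

# Capture the error message instead of stopping the script
res <- tryCatch(anova(glm2, glm1), error = function(e) conditionMessage(e))
res
```

Comparing nobs() for the two fits is a quick way to confirm that differing row counts are what triggers the error.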

One solution is to create a new data frame that has only the columns that will be used in at least one of your models and remove all the rows with any missing values (the na.omit or na.exclude function will make this easy), then fit both models to the same data frame that does not have any missing values.
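A minimal sketch of that approach, assuming simulated data; the names dat, dat_cc and the extra column unused are hypothetical and only serve to illustrate the idea:

```r
# Simulated data (illustrative only)
set.seed(1)
dat <- data.frame(
  Y      = rbinom(100, 1, 0.5),
  X1     = rnorm(100),
  X2     = rnorm(100),
  X3     = rnorm(100),
  unused = rnorm(100)
)
dat$X3[5]      <- NA  # missing value in a predictor used by one model
dat$unused[10] <- NA  # missing value in a column that no model uses

# Keep only the columns used by at least one model, then drop incomplete rows
dat_cc <- na.omit(dat[, c("Y", "X1", "X2", "X3")])

glm1 <- glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), data = dat_cc)
glm2 <- glm(Y ~ X1 + X2,      family = binomial(link = logit), data = dat_cc)

# Both models now use the same 99 rows, so the comparison runs
anova(glm2, glm1, test = "Chisq")
```

Selecting the columns before calling na.omit matters: otherwise a missing value in an unrelated column (unused here) would cause a row to be dropped unnecessarily.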

Other options would be to look at tools for multiple imputation or other ways of dealing with missing data.

landroni

To avoid the "models were not all fitted to the same size of dataset" error, you must fit both models on the exact same subset of data. There are two simple ways to do this:

  • either use data=glm1$model in the 2nd model fit
  • or retrieve the correctly subsetted dataset by using data=na.omit(orig.data[ , all.vars(formula(glm1))]) in the 2nd model fit

Here's a reproducible example using lm (for glm the same approach should work) and update:

# 1st approach
# define a convenience wrapper
update_nested <- function(object, formula., ..., evaluate = TRUE){
    update(object = object, formula. = formula., data = object$model, ..., evaluate = evaluate)
}

# prepare data with NAs
data(mtcars)
for(i in 1:ncol(mtcars)) mtcars[i,i] <- NA

xa <- lm(mpg~cyl+disp, mtcars)
xb <- update_nested(xa, .~.-cyl)
anova(xa, xb)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     26 256.91                              
## 2     27 301.32 -1   -44.411 4.4945 0.04371 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# 2nd approach
xc <- update(xa, .~.-cyl, data=na.omit(mtcars[ , all.vars(formula(xa))]))
anova(xa, xc)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp
## Model 2: mpg ~ disp
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     26 256.91                              
## 2     27 301.32 -1   -44.411 4.4945 0.04371 *
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

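Another suggestion is to set na.action = na.exclude in both fits:

glm1 <- glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), na.action = na.exclude)
glm2 <- glm(Y ~ X1 + X2, family = binomial(link = logit), na.action = na.exclude)

anova(glm2, glm1)

Note, however, that na.exclude still drops the rows with missing data (NA) when the model is fitted. It differs from na.omit only in that residuals and fitted values are padded with NA back to the original number of rows. If X3 is missing in rows where X1 and X2 are not, the two models are therefore still fit to different numbers of rows, so it is safer to fit both models to the same complete-case data frame.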
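What na.exclude actually changes can be checked directly. The following is a minimal sketch on simulated data (the names dat, fit_omit and fit_excl are illustrative), comparing the number of rows used in the fit with the length of the extracted residuals:

```r
# Simulated data with one missing predictor value (illustrative only)
set.seed(1)
dat <- data.frame(
  Y  = rbinom(100, 1, 0.5),
  X1 = rnorm(100),
  X3 = rnorm(100)
)
dat$X3[5] <- NA

fit_omit <- glm(Y ~ X1 + X3, family = binomial, data = dat,
                na.action = na.omit)
fit_excl <- glm(Y ~ X1 + X3, family = binomial, data = dat,
                na.action = na.exclude)

# Rows actually used in the fit
nobs(fit_omit)
nobs(fit_excl)

# Length of the residual vector returned to the user
length(residuals(fit_omit))
length(residuals(fit_excl))
```

If the row counts reported by nobs() agree between the two fits while only the residual lengths differ, then na.exclude affects how results are reported rather than which rows are used for fitting.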

I'm guessing that you meant to type:

glm1 <-glm(Y ~ X1+X2+X3, family=binomial(link=logit))

glm2 <-glm(Y ~ X1 + X2, family=binomial(link=logit))

The formula interface for R regression functions does not recognize commas as adding covariates to the RHS of the formula. And don't use attach(); use the data argument to regression functions.
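As a short sketch of the data-argument style, assuming the variables live in a data frame (called dat here, a hypothetical name with simulated contents):

```r
# Simulated data frame standing in for the attached data (illustrative only)
set.seed(1)
dat <- data.frame(
  Y  = rbinom(50, 1, 0.5),
  X1 = rnorm(50),
  X2 = rnorm(50),
  X3 = rnorm(50)
)

# Covariates are joined with +, and variables are looked up in dat
glm1 <- glm(Y ~ X1 + X2 + X3, family = binomial(link = logit), data = dat)
glm2 <- glm(Y ~ X1 + X2,      family = binomial(link = logit), data = dat)

names(coef(glm1))
```

Passing data explicitly also makes it obvious which data frame each model was fit to, which helps when checking that two models used the same rows.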

The cause is well described by Greg Snow. An alternative and very easy solution is to add a new variable that is NA wherever the problematic variable is NA and has the value 1 otherwise. Include it in both models and R will exclude the same rows in both, so the datasets will match. Because the new variable is constant, its own coefficient will be reported as NA.
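A minimal sketch of this trick using lm on mtcars; the copy mt, the indicator name cyl_ok and the choice of rows set to NA are all hypothetical:

```r
data(mtcars)
mt <- mtcars
mt$cyl[c(2, 7)] <- NA  # introduce missing values in the problematic variable

# 1 where cyl is observed, NA where cyl is missing
mt$cyl_ok <- ifelse(is.na(mt$cyl), NA, 1)

# Including cyl_ok in both models forces the same rows to be dropped
full    <- lm(mpg ~ cyl + disp + cyl_ok, data = mt)
reduced <- lm(mpg ~ disp + cyl_ok,       data = mt)

nobs(full)
nobs(reduced)

# cyl_ok is constant, so it is aliased with the intercept (coefficient NA)
coef(full)

anova(reduced, full)
```

The indicator is perfectly collinear with the intercept, so it does not change the fit; it only affects which rows the default na.action removes.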