One of the most common issues in regression analyses of even balanced experimental data is collinearity between main effects and interactions. To avoid this problem, a simple first step is to center all predictors. In my experience folks often fail to do that simply because it’s a bit more work and we’re all lazy. So here’s an attempt at a simple R function that takes single variables as well as entire data frames.
Synopsis: Outputs the centered values of the input variable, which can be a numeric variable, a factor, or a data frame (or matrix).
- If the input is a numeric variable, the output is the centered variable.
- If the input is a factor, the output is a numeric variable with centered factor level values. That is, the factor’s levels are converted into numerical values in their inherent order (if not specified otherwise, R defaults to alphanumerical order). More specifically, this centers any binary factor so that the value below 0 will be the 1st level of the original factor, and the value above 0 will be the 2nd level.
- If the input is a data frame or matrix, the output is a new data frame of the same dimensions with the centered values and column names that correspond to the colnames() of the input preceded by “c” (e.g. “Variable1” will be “cVariable1”).
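To make the factor case concrete, here is a minimal sketch in base R of what centering a balanced two-level factor amounts to. The variable names and data (cond, cond2) are made up for illustration and are not part of the function.

```r
# Hypothetical example data: a balanced two-level factor
cond <- factor(c("high", "low", "high", "low"))

levels(cond)                 # alphanumerical order: "high" comes before "low"
codes <- as.numeric(cond)    # level codes: high = 1, low = 2
centered <- codes - mean(codes)
centered                     # -0.5  0.5 -0.5  0.5 (1st level below 0, 2nd above 0)

# Specifying the level order explicitly flips the signs
cond2 <- factor(cond, levels = c("low", "high"))
centered2 <- as.numeric(cond2) - mean(as.numeric(cond2))
centered2                    #  0.5 -0.5  0.5 -0.5
```

So which level ends up below 0 depends only on the level order of the factor, which you can set yourself via the levels argument of factor().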
I haven’t tested it extensively yet, but I used it for a couple of different problems and it generally works. Nothing big, but maybe useful (updated 5/12/09, based on suggestions by Matthew Carlson, Psychology, U. of Chicago):
myCenter <- function(x) {
  if (is.numeric(x) && !is.matrix(x)) { return(x - mean(x, na.rm = TRUE)) }  # numeric vector
  if (is.factor(x)) {
    x <- as.numeric(x)  # level codes 1, 2, ... in level order
    return(x - mean(x, na.rm = TRUE))
  }
  if (is.data.frame(x) || is.matrix(x)) {
    # center column by column; output columns are named "c" + original name
    m <- matrix(nrow = nrow(x), ncol = ncol(x))
    colnames(m) <- paste("c", colnames(x), sep = "")
    for (i in seq_len(ncol(x))) {
      if (is.factor(x[, i])) {
        y <- as.numeric(x[, i])
        m[, i] <- y - mean(y, na.rm = TRUE)
      }
      if (is.numeric(x[, i])) {
        m[, i] <- x[, i] - mean(x[, i], na.rm = TRUE)
      }
    }
    return(as.data.frame(m))
  }
}
Please let us know whether this works.
5th line: return(x) - mean(x)
Surely that is a typo? Should be: return(x - mean(x))
Ooops. You are right. Fixed above and thanks!
Hi Dr. Jaeger,
I had a question regarding centering of categorical variables. It looks like what you are doing in this script is converting the categories into numbers and then centering, which if you have a balanced 2-level predictor results in values of -.5 and +.5. However, in my (logistic) data, I have two issues. One is that while the design was balanced, missing items have led to there being an imbalance in the dataset. If I recode my variables as -.5 and +.5, it wouldn’t sum to zero. Do I just convert to numbers anyway, find the mean, whatever it is, and subtract from each? It seems like this is what the script does, but I wanted to verify that.
Another issue is that one predictor of mine has three levels, say A, B, and a control group C. How should this be centered? Do I create two predictors from this one (so predictor1 is 1 when level A, 0 else; predictor2 is 1 when level B, 0 else) and then center as above? Or is there a way to do this within one predictor?
Thanks!
Stefanie Kuchinsky
Grad Student
University of Illinois
Hi Stefanie,
regarding your first question: yes. In the regression model all factors are recoded to numerical predictors anyway. Default is treatment coding, but often you want centered predictors (with a mean of 0), e.g. to avoid collinearity with interactions. For cases like that, the centering script does the job.
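As a rough illustration of the unbalanced case (a sketch with simulated data, not output from the script above; all names and group sizes are made up), subtracting the mean still gives a predictor with mean 0, just not with values of -.5 and +.5:

```r
set.seed(1)  # arbitrary seed, only so the simulated numbers are reproducible

# Hypothetical unbalanced data: 60 observations of level "a", 140 of level "b"
f <- factor(rep(c("a", "b"), times = c(60, 140)))
x <- rnorm(200, mean = 10)   # a continuous predictor far from zero

f_raw <- as.numeric(f)       # uncentered codes 1 and 2
f_c <- f_raw - mean(f_raw)   # subtract the mean, whatever it is
x_c <- x - mean(x)

unique(f_c)                  # about -0.7 and 0.3 rather than -0.5 and +0.5
mean(f_c)                    # 0 up to floating-point error

# Correlation of the main effect with the interaction term, before and after
r_raw <- cor(f_raw, f_raw * x)
r_centered <- cor(f_c, f_c * x_c)
c(raw = r_raw, centered = r_centered)
```

With uncentered predictors the main effect and the product term are strongly correlated; after centering the correlation should be much closer to 0, although in any finite sample it will not be exactly 0.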
Regarding your second question: it depends on the precise hypothesis you want to test. But generally, if you want to compare levels against each other, you would use treatment or contrast coding. Contrast (=sum) coding leads to centered predictors for balanced data sets (see help for contr.sum() and contrasts() in R). To get that property for unbalanced data, you can indeed center the two predictors encoding level 1 vs. 2 and level 2 vs. 3.
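A small sketch of both options for a three-level factor, assuming made-up unbalanced data (the names g, isA, isB and the group sizes are hypothetical, and the two indicator predictors follow the definition given in the question):

```r
# Hypothetical unbalanced data with levels A, B and a control group C
g <- factor(rep(c("A", "B", "C"), times = c(3, 5, 4)))

# Sum (contrast) coding: two columns that have mean 0 only if the data are balanced
contr.sum(3)
contrasts(g) <- contr.sum(3)
X_sum <- model.matrix(~ g)[, -1]   # drop the intercept column
colMeans(X_sum)                    # not 0 here, because the group sizes differ

# Alternative: build the two indicator predictors by hand and center each
isA <- as.numeric(g == "A")        # 1 when level A, 0 otherwise
isB <- as.numeric(g == "B")        # 1 when level B, 0 otherwise
cA <- isA - mean(isA)
cB <- isB - mean(isB)
c(mean(cA), mean(cB))              # both 0 up to floating-point error
```

Which of the two codings is appropriate still depends on the comparison you want each coefficient to express.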
Regarding your third question: the best place for theoretical statistical and R questions is the R-lang email list for language researchers (https://ling.ucsd.edu/mailman/listinfo.cgi/r-lang). If you haven’t yet, you should definitely subscribe =).
Florian