Centering several variables

Posted on April 27, 2009 Updated on July 30, 2010

One of the most common issues in regression analyses of even balanced experimental data is collinearity between main effects and interactions. To avoid this problem, a simple first step is to center all predictors. In my experience folks often fail to do that simply because it’s a bit more work and we’re all lazy. So here’s an attempt at a simple R function that takes single variables as well as entire dataframes.

Synopsis: Outputs the centered values of the input variable, which can be a numeric variable, a factor, or a data frame.

If the input is a numeric variable, the output is the centered variable.
If the input is a factor, the output is a numeric variable with centered factor level values. That is, the factor’s levels are converted into numerical values in their inherent order (if not specified otherwise, R defaults to alphanumerical order). More specifically, this centers any binary factor so that the value below 0 will be the 1st level of the original factor, and the value above 0 will be the 2nd level.
If the input is a data frame or matrix, the output is a new data frame of the same dimensions, with each column centered separately and column names that correspond to the colnames() of the input preceded by “c” (e.g. “Variable1” will be “cVariable1”).
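The three cases above can be sketched with base R alone, independently of the function. The variable names (`rt`, `cond`, `d`) and the toy values are made up for illustration.

```r
# Toy data; names and values are illustrative assumptions, not from the post.
rt   <- c(400, 500, 600, NA)              # numeric variable with a missing value
cond <- factor(c("a", "b", "a", "b"))     # binary factor, levels in alphanumeric order

# Case 1: numeric input -> subtract the mean (ignoring NAs).
c_rt <- rt - mean(rt, na.rm = TRUE)       # -100  0  100  NA

# Case 2: factor input -> integer level codes (1, 2), then center.
codes  <- as.numeric(cond)
c_cond <- codes - mean(codes, na.rm = TRUE)   # -0.5  0.5 -0.5  0.5

# Case 3: data frame input -> center each column, prefix the names with "c".
d <- data.frame(rt = rt, cond = cond)
centered <- as.data.frame(lapply(d, function(col) {
	col <- as.numeric(col)                # factors become level codes
	col - mean(col, na.rm = TRUE)
}))
names(centered) <- paste0("c", names(d))  # "crt", "ccond"
```

Note that for the binary factor the first level ends up below 0 and the second level above 0, as described in the synopsis.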

I haven’t tested it extensively yet, but I used it for a couple of different problems and it generally works. Nothing big, but maybe useful (updated 7/30/10 to further simplify the function):

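myCenter <- function(x) {
	# Data frames and matrices are checked first, because is.numeric() is
	# also TRUE for a numeric matrix. Each column is centered separately.
	if (is.data.frame(x) || is.matrix(x)) {
		m <- matrix(nrow = nrow(x), ncol = ncol(x))
		colnames(m) <- paste("c", colnames(x), sep = "")
		for (i in seq_len(ncol(x))) {
			m[, i] <- myCenter(x[, i])
		}
		return(as.data.frame(m))
	}
	# Factors: convert the levels to their integer codes (1, 2, ...).
	if (is.factor(x)) {
		x <- as.numeric(x)
	}
	# Numeric variables (including converted factors): subtract the mean.
	if (is.numeric(x)) {
		return(x - mean(x, na.rm = TRUE))
	}
	stop("myCenter() expects a numeric variable, a factor, a data frame, or a matrix")
}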

Please let us know whether this works.
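To illustrate the collinearity problem that motivates centering, here is a minimal sketch for a balanced 2 x 2 design. The predictor names `A` and `B` and the cell size of 10 are assumptions chosen for illustration. It compares the correlation between a main effect and the interaction term under treatment (0/1) coding and after centering.

```r
# Balanced 2 x 2 design; the names A and B are illustrative assumptions.
d <- expand.grid(A = c(0, 1), B = c(0, 1))   # treatment (0/1) coding
d <- d[rep(1:4, each = 10), ]                # 10 observations per cell

# Treatment coding: the interaction term is correlated with the main effect.
r_treatment <- cor(d$A, d$A * d$B)           # about 0.58

# Centered coding (-0.5 / +0.5): the correlation vanishes in a balanced design.
cA <- d$A - mean(d$A)
cB <- d$B - mean(d$B)
r_centered <- cor(cA, cA * cB)               # 0 (up to floating-point error)
```

With unbalanced data the centered correlation is generally not exactly zero, but it is typically much smaller than under treatment coding.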


9 thoughts on “Centering several variables”

Max Bane said:
April 28, 2009 at 10:38 am

5th line: return(x) - mean(x)

Surely that is a typo? Should be: return(x - mean(x))


tiflo said:
April 28, 2009 at 12:52 pm

Ooops. You are right. Fixed above and thanks!


Stefanie Kuchinsky said:
July 28, 2009 at 2:14 pm

Hi Dr. Jaeger,
I had a question regarding centering of categorical variables. It looks like what you are doing in this script is converting the categories into numbers and then centering, which if you have a balanced 2-level predictor results in values of -.5 and +.5. However, in my (logistic) data, I have two issues. One is that while the design was balanced, missing items have led to an imbalance in the dataset. If I recode my variables as -.5 and +.5, they wouldn’t sum to zero. Do I just convert to numbers anyway, find the mean, whatever it is, and subtract it from each value? It seems like this is what the script does, but I wanted to verify that.

Another issue is that one predictor of mine has three levels, say A, B, and a control group C. How should this be centered? Do I create two predictors from this one (so predictor1 is 1 when level A, 0 else; predictor2 is 1 when level B, 0 else) and then center as above? Or is there a way to do this within one predictor?

Thanks!
Stefanie Kuchinsky
Grad Student
University of Illinois


tiflo said:
July 31, 2009 at 3:12 pm

Hi Stefanie,

regarding your first question: yes. In the regression model all factors are recoded to numerical predictors anyway. Default is treatment coding, but often you want centered predictors (with a mean of 0), e.g. to avoid collinearity with interactions. For cases like that, the centering script does the job.

Regarding your second question: it depends on the precise hypothesis you want to test. But generally, if you want to compare levels against each other, you would use treatment or contrast coding. Contrast (=sum) coding leads to centered predictors for balanced data sets (see help for contr.sum() and contrasts() in R). To get that property for unbalanced data, you can indeed center the two predictors encoding level 1 vs. 2 and level 2 vs. 3.
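As a rough sketch of that last point (the factor `group` and its cell counts are invented for illustration): with unbalanced data, the sum-coded columns no longer have a mean of zero, but centering each column restores that property. Here the two sum-coded columns stand in for the two predictors; other pairs of contrasts could be centered the same way.

```r
# Hypothetical unbalanced three-level factor (3 x A, 2 x B, 1 x C).
group <- factor(c("A", "A", "A", "B", "B", "C"))
contrasts(group) <- contr.sum(3)               # sum coding

# Numeric predictors the regression would actually see (intercept dropped).
X <- model.matrix(~ group)[, -1]
raw_means <- colMeans(X)                       # not 0, because the data are unbalanced

# Center each coded predictor separately.
cX <- scale(X, center = TRUE, scale = FALSE)
centered_means <- colMeans(cX)                 # 0 (up to floating-point error)
```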

Regarding your third question 😉 — the best place for theoretical statistical and R questions is the R-lang email list for language researchers (https://ling.ucsd.edu/mailman/listinfo.cgi/r-lang). If you haven’t you should definitely subscribe =).

Florian


Xiao He said:
August 1, 2010 at 1:03 pm

Hi Dr. Jaeger,

I have a question regarding results obtained from lmer with a model fitted with centered categorical predictors.

I originally fitted a model like the one below. The predictor is a three-level categorical predictor. The p-values I obtained were extremely small. Specifically, it was shown that levels 3 and 2 were significantly different from level 1. Indeed, when I examined the visual representations of the data, it was clear that subjects’ responses to the three levels of items were distinctly different. For level 1, only 30% of responses were positive; for level 2, 90% were positive; and for level 3, 60% were positive.

model1 <- lmer(positive ~ predictor + (1|subject) + (1|item), data=data)

After I read the code and explanation you posted in this post, I decided to center the predictor:

data$cpredictor <- myCenter(data$predictor)

I ran the same model with the new predictor:
model1 <- lmer(positive ~ cpredictor + (1|subject) + (1|item), data=data)

But this time, there was no significance obtained. I wonder why there is such a drastic difference between the two results. I'd be extremely grateful if you could explain this difference a little bit. Thank you 🙂

Best,
Xiao
Grad student
Univ. of Southern California


tiflo said:
August 1, 2010 at 8:33 pm

Hi Xiao,

I would need to know more about your model output to understand what’s going on. Are you saying that your predictor variable “predictor” has three levels? When you feed a three-level (or, for that matter, more than two-level) factor into the script posted here, it will simply convert the factor into a numeric variable (i.e. value of level 1–>1, value of level 2–>2, …. where levels are either explicitly ordered by you or, if not, alphanumerically ordered) and then center it.

This isn’t usually very interpretable! I should probably include a little warning in the script to address this. To see what this does, let’s say you have a three-level factor “predictor” with levels=c('phone','syllable','word'). After putting this variable through the script posted above, you will have a new variable “cpredictor” with three distinct values, which would only add one DF to your model rather than the two DF the original factor contributed. So, that’s already quite different. The values of cpredictor will depend on the distribution of the three levels (number of cases for each level). Assuming a balanced data set, the three levels will be:

phone –> -1
syllable –> 0
word –> 1

So, suddenly you are testing a linear order hypothesis about the three levels rather than whether the three levels have distinct means! That’s probably not what you want. Let me know if this helps. In any case, I also recommend emailing R-lang. I am on sabbatical and will be out of e-contact for a while starting 8/4.
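A minimal sketch of this point in base R, assuming a balanced toy version of the three-level factor from the example (four cases per level is an arbitrary choice):

```r
# Balanced three-level factor with the levels from the example.
predictor <- factor(rep(c("phone", "syllable", "word"), times = 4))

# What the centering script does: level codes 1, 2, 3 minus their mean.
cpredictor <- as.numeric(predictor) - mean(as.numeric(predictor))
unique(cpredictor)                                    # -1  0  1

# Degrees of freedom: columns of the design matrix, excluding the intercept.
df_factor  <- ncol(model.matrix(~ predictor)) - 1     # 2
df_numeric <- ncol(model.matrix(~ cpredictor)) - 1    # 1
```

The factor contributes two columns (two contrasts between levels), whereas the centered numeric version contributes a single slope across the levels.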

Florian


sohee jeon said:
November 10, 2011 at 7:44 pm

Hello Dr. Jaeger,

I have a question regarding centering categorical variables. I am trying to create interaction variables using two 5-point scale variables. I wonder if centering the two categorical variables and multiplying them is a good thing to do in order to create interaction terms.

Thank you very much,
Sohee Jeon.


tiflo said:
November 10, 2011 at 8:58 pm

Dear Sohee,

first let me mention that the email list ling-r-lang is a great place to post questions like this to get answers from many different R-users who are working on language. Now, about your problem, let me make sure I understand you correctly: you have two five-way categorical predictors (i.e. categorical predictors with 5 levels each), right? In that case, there would be 4 contrasts associated with each of the two categorical variables, not just one.

The next important question is whether those levels are ordered or unordered (i.e. do they have an inherent order by hypothesis). Depending on the answer to this, you might want to employ contrast or Helmert coding. For an introduction to different coding schemes for categorical predictors in R, see, for example, Maureen Gillespie’s slides posted at https://hlplab.wordpress.com/2010/05/10/mini-womm/.
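For orientation, here is a short sketch of the contrast matrices that base R provides for a five-level factor. Which one is appropriate depends on the hypothesis, so this only shows what the schemes look like, not a recommendation.

```r
# Built-in contrast matrices for a five-level factor: each has 5 rows
# (levels) and 4 columns (contrasts).
k <- 5
treatment <- contr.treatment(k)   # each level vs. a reference level
sum_coded <- contr.sum(k)         # sum (deviation) coding
helmert   <- contr.helmert(k)     # each level vs. the mean of the preceding levels
poly      <- contr.poly(k)        # orthogonal polynomials, for ordered levels

n_contrasts <- sapply(list(treatment = treatment, sum = sum_coded,
                           helmert = helmert, poly = poly), ncol)
n_contrasts                       # 4 for each scheme
```

A scheme can be assigned to a factor with contrasts(), e.g. contrasts(f) <- contr.helmert(5) for a five-level factor f (a hypothetical name).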

HTH,
Florian



Daniele Panizza said:
March 19, 2017 at 2:32 pm

so sad I only found this now !
Thanks Florian 🙂
