### One slide on developing a regression model with interpretable coefficients


While Victor Kuperman and I are preparing our slides for WOMM, I’ve been thinking about how to visualize the process from input variables to a full model. Many of the steps depend heavily on the type of regression model, which in turn depends on the type of outcome (dependent) variable. Still, there are a number of steps that we always need to go through if we want interpretable coefficient estimates (as well as unbiased standard error estimates for those coefficients).

From left to right, we start with initial data analysis, which leads to the input variables. We should know our data before we analyze it and understand the basic distributions in the data. It may also be good to get an idea of the variance associated with potential random effects (e.g. do subjects’ reading times differ?). The input variables are then transformed (dashed lines merely indicate that the variable changes; the labels above the dashed lines give an example of what type of operation may be applied to the variable), as well as coded and/or centered (maybe even standardized), before we create any higher order terms based on any of the predictors (e.g. interactions or non-linear terms for continuous predictors, such as rcs() or pol()). At several steps during this process, we should check for outliers since they can be overly influential. During the initial data exploration, we may exclude cases that are clearly measurement errors or that are missing variable information. At a later stage, after the predictors have been transformed, we may exclude outliers based on distributional assumptions (e.g. excluding RTs that are more than 3 standard deviations away from the subject’s mean RT).
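To make the last exclusion rule concrete, here is a minimal Python sketch of per-subject trimming. It is only an illustration: the data are made up, the function name `trim_by_subject` is hypothetical, and I am assuming that the cutoff applies in both directions around each subject's mean.

```python
from collections import defaultdict
from statistics import mean, stdev

def trim_by_subject(rows, n_sd=3.0):
    """Keep (subject, rt) rows whose RT lies within n_sd SDs of that subject's mean."""
    by_subject = defaultdict(list)
    for subject, rt in rows:
        by_subject[subject].append(rt)
    # Per-subject mean and sample SD (requires at least 2 RTs per subject).
    stats = {s: (mean(v), stdev(v)) for s, v in by_subject.items()}
    kept = []
    for subject, rt in rows:
        m, sd = stats[subject]
        if abs(rt - m) <= n_sd * sd:  # two-sided cutoff (an assumption)
            kept.append((subject, rt))
    return kept

# Hypothetical reading times in ms: subject s1 has one extreme value.
rows = [("s1", 400.0 + 5 * (i % 5)) for i in range(19)] + [("s1", 5000.0)]
rows += [("s2", 600.0 + 10 * (i % 4)) for i in range(20)]

kept = trim_by_subject(rows)
print(len(rows), "->", len(kept))
```

Note that the mean and SD are computed with the extreme value included, so with very few observations per subject no point can ever exceed 3 SDs. That is one reason to look at the distributions first rather than apply the rule blindly.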

All these steps are partly driven by what hypotheses we want to test. Every coefficient in the model can be thought of as testing a hypothesis, but what hypothesis we are testing depends on the transforms and coding we’ve applied to the variable and the relation of this variable to other variables in the model.
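As a small illustration of how coding changes the hypothesis a coefficient tests, the Python sketch below fits the same made-up two-condition data twice: once with treatment coding (0/1) and once with a centered contrast (-0.5/+0.5). The data and the helper `simple_ols` are hypothetical, and the design is assumed to be balanced.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical balanced data: reading times (ms) in two conditions.
rt_a = [400.0, 420.0, 440.0]  # condition A, mean 420
rt_b = [500.0, 520.0, 540.0]  # condition B, mean 520
y = rt_a + rt_b

treatment = [0.0] * 3 + [1.0] * 3   # A is the reference level
centered = [-0.5] * 3 + [0.5] * 3   # centered contrast

int_t, slope_t = simple_ols(treatment, y)
int_c, slope_c = simple_ols(centered, y)

print("treatment coding:", int_t, slope_t)
print("centered coding: ", int_c, slope_c)
```

The slope is the A-to-B difference under both codings, but the intercept answers a different question: the mean of condition A under treatment coding, and the overall mean under the centered coding.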

For the latter (the relation of a variable to the other variables in the model), it is necessary to assess collinearity, at least if we want unbiased standard error estimates and if we want to be able to reliably interpret individual coefficients. If there is collinearity in the model for the predictor of interest (collinearity only affects the collinear predictors), we may use a variety of strategies to reduce it, e.g.:

• centering: reduces collinearity with
    • the intercept (which we usually don’t care about, but if all variables are centered, the intercept has the nice property of estimating the overall mean)
    • higher order terms of the same predictor (e.g. interactions, higher order non-linear terms as in rcs(), pol(), etc.)
• stratification: assess the effect while holding the correlated predictor constant (easiest for categorical predictors)
• residualization: regress one predictor against the correlated predictor. Here we need to be cautious in order to be conservative! Depending on what the hypothesis is we may regress predictor1 against predictor2 or vice versa (this will answer slightly different questions). It’s best if there is a conceptual reason to prefer one over the other.
• … (PCA; normalizing/dividing predictors by each other; e.g. weight in length in words vs. disfluency per word)
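The centering point in the list above can be checked with a few lines of Python. This is a sketch with a made-up predictor; the helper `pearson` is written out here only to keep the example self-contained.

```python
from math import sqrt
from statistics import mean

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)

# Hypothetical predictor with values 1..10 and its quadratic term.
x = [float(v) for v in range(1, 11)]
x_mean = mean(x)
xc = [v - x_mean for v in x]  # centered predictor

r_raw = pearson(x, [v ** 2 for v in x])
r_centered = pearson(xc, [v ** 2 for v in xc])

print("raw:     ", round(r_raw, 3))
print("centered:", round(r_centered, 3))
```

The uncentered predictor is almost perfectly correlated with its own square. In this example the predictor is symmetric around its mean, so centering removes the correlation entirely; for a skewed predictor it would typically shrink rather than vanish.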

In any case, each of these strategies changes the interpretation of the coefficient, and we have to be careful when we interpret it with regard to our initial hypothesis. For interpretability, it may also be necessary to back-transform a predictor when we talk about the results, and it may be necessary to talk about other predictors when we summarize the effect of one predictor (e.g. when we used residualization).
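Here is a minimal Python sketch of residualization, to show why the interpretation changes. The two predictors (word length and log frequency) and their values are made up, and the helpers `pearson` and `residualize` are hypothetical names.

```python
from math import sqrt
from statistics import mean

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    return cov / sqrt(sum((ai - ma) ** 2 for ai in a) *
                      sum((bi - mb) ** 2 for bi in b))

def residualize(target, against):
    """Residuals of the least-squares regression of target on against."""
    mt, ma = mean(target), mean(against)
    slope = (sum((a - ma) * (t - mt) for a, t in zip(against, target)) /
             sum((a - ma) ** 2 for a in against))
    intercept = mt - slope * ma
    return [t - (intercept + slope * a) for a, t in zip(against, target)]

# Hypothetical correlated predictors: longer words tend to be less frequent.
length = [2.0, 3.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
log_freq = [9.1, 8.0, 8.4, 7.2, 6.5, 6.1, 5.0, 4.2]

freq_resid = residualize(log_freq, length)

print("before:", round(pearson(length, log_freq), 3))
print("after: ", round(pearson(length, freq_resid), 3))
```

The residualized predictor is uncorrelated with length, but it no longer means "frequency". It means "frequency beyond what length predicts", and any variance shared by the two predictors is now credited to length. Regressing length against frequency instead would credit it to frequency, which is why the direction should follow from the hypothesis.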

The resulting model needs to be evaluated (residuals, predicted vs. actual outcomes, etc.) and maybe some of the steps need to be repeated before the final model will be completely interpretable. We also may want to exclude outliers based on distributional assumptions of the model (e.g. when there are only a few cases with really high residuals; although I usually don’t do this).
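A very reduced Python sketch of such a residual check, assuming a simple linear model with one predictor and made-up data; a real evaluation would use the diagnostics that come with the fitted model. The 3 SD cutoff on the residuals is just one possible rule.

```python
from statistics import mean, stdev

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) /
             sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope

# Hypothetical data: a linear trend with small alternating noise.
x = [float(i) for i in range(20)]
y = [2.0 + 3.0 * xi + 0.5 * (-1) ** i for i, xi in enumerate(x)]
y[10] += 50.0  # one case the model cannot account for

intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

# Flag cases whose residual is more than 3 residual SDs from zero.
cutoff = 3 * stdev(residuals)
flagged = [i for i, r in enumerate(residuals) if abs(r) > cutoff]

print("flagged cases:", flagged)
```

Plotting predicted against actual outcomes, or residuals against predicted values, gives the same information in a form that also reveals systematic misfit, not just individual cases.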

Anywho, probably this graph already exists somewhere ;).