Modeling self-paced reading data: Effects of word length, word position, spill-over, etc.


I’ve been using a two-step approach, where in the first step I use all data (including fillers, but not practice items) of an experiment to fit a model of log-transformed raw reading times with:

  • word length (Wlen)
  • position of word in stimulus (Wpos)
  • position of stimulus in list (Lpos)
  • experiment ID: different stimulus types may have an effect. This is really just a bad, but easy, way to account for the possibility that what matters is not the linear position in a stimulus, but rather the position relative to syntactic and prosodic phrasing. By including a variable that captures the construction type, some of the variance associated with the word’s position relative to phrasing can be accounted for.
  • subject differences

For example:

library(lme4)  # lmer()
library(rms)   # rcs(): restricted cubic splines
l <- lmer(logRT ~ EXPTsimple + Wlen + log(Lpos) + rcs(Wpos) +
              (1 | SUBJ), d)
d$logRTresidual <- residuals(l)

The residuals of that model are used as dependent variable for the second model and spill-over variables are entered into the model, along with the experimental manipulations. For example:

l <- lmer(logRTresidual ~ CONDITION +
            (1 | SUBJ) + (1 | ITEM), target, subset = EXPT != "ProdComp3")
ml <- mcmcsamp(l, n = 10000)
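
For completeness, here is one way the spill-over predictors themselves could be constructed: a minimal sketch, assuming `d` contains one row per word and is ordered by word position within each subject-by-item cell (the helper `lag1` is mine, not part of the original code):

```r
# Sketch: create lagged residual-RT predictors for spill-over analysis.
# Assumes d is the data frame from the first-stage model above.
lag1 <- function(x) c(NA, x[-length(x)])  # shift a vector down by one

d <- d[order(d$SUBJ, d$ITEM, d$Wpos), ]
# lag within each subject-by-item cell, so spill-over never crosses trials
d$SPILLOVER_1 <- ave(d$logRTresidual, d$SUBJ, d$ITEM, FUN = lag1)
d$SPILLOVER_2 <- ave(d$SPILLOVER_1,  d$SUBJ, d$ITEM, FUN = lag1)
```

The `NA`s at the start of each trial simply drop out of any subsequent model fit under the default `na.omit` behavior.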

17 thoughts on “Modeling self-paced reading data: Effects of word length, word position, spill-over, etc.”

    LingLangLung » More sophisticated spillover analysis said:
    February 25, 2008 at 3:22 pm

    […] this last week, and he mentioned an approach Florian Jaeger has been trying out.  Luckily, Florian blogged about this recently, and even included the R-code for the […]


    tiflo said:
    February 25, 2008 at 3:46 pm

    awesome. somebody is reading this! Hi Laura (or LingLangLung)


    […] effects in self-paced reading I’ve been working on some R-code for spill-over analysis for self-paced reading experiments. I’ll be posting the actual analysis a later. Here’s some code that adds the spill-over […]


    XIAO HE said:
    February 21, 2011 at 4:13 pm

    Hi Dr. Jaeger,

    I tried to fit a model following the code you provided above (as well as the codes in the relevant links); however, I am not sure how to interpret the results. Do you happen to have any slides or presentations where you discuss how to interpret lmer results obtained from this kind of model fitting process? Thank you!


      tiflo said:
      February 21, 2011 at 10:58 pm

      Hi Xiao,

      some people have applied the (updated) approach outlined here. For example, I think Philip Hofmeister has followed my suggestion in one of his papers. I’ll get him to post a link to his paper here. Check back in a day or two.



    Xiao He said:
    November 28, 2011 at 9:14 pm

    Hi Dr. Jaeger,

    Sorry to beat a dead horse…so to speak. I posted a question 9 months ago; for various reasons, I hadn’t had time to finish re-analyzing my data until now. My question is about how to interpret lmer output derived from a model that includes spill-over variables as fixed effects. I read Philip Hofmeister’s paper, but am still unclear as to how the model is created and how the result is interpreted. My data are from a study with a 2 by 2 design. I did trimming, log transformation and centering before conducting my analysis.

    I residualized the log reading time based on your code:
    > l = lmer(logRT~p + item + (1|subject) + (1|item), data=data)
    #p is the position of a word in a sentence; item is the position of the sentence in the list
    > data$logRTresidual = residuals(l)

    Then, I created spill-over variables according to the code you provided on your blog. I have 5 spill-over words, so I have 5 spill-over variables in total. I created a model as shown below:
    > lmer(logRTresidual~ms*es + SPILLOVER_1 + SPILLOVER_2 + SPILLOVER_3 + SPILLOVER_4 + SPILLOVER_5 + (1|subject) + (1|item), data=data)

    The output is at the bottom of my post. Take the main effect of “ms” as an example. Since it is centered and the design is balanced, we can interpret the result as an ANOVA-like main effect. So my specific questions are:

    1). Does this output suggest that in the spillover region, there is a significant main effect of “ms”? If this is the case, this analysis seems to be different from the analyses done in Philip’s studies where (if I interpreted what he wrote correctly) he seems to have conducted separate analyses for individual words in the spillover region, and with each word he analyzed, he included only the immediately preceding word as a fixed predictor.

    2). To check whether there is significant interaction, do I still do model comparison, comparing a full model with a model that doesn’t have the interaction term?

    Thanks so much for your time.


    [random effects omitted]

    Fixed effects:
                 Estimate Std. Error t value
    (Intercept) -1.314492   0.151007  -8.705
    ms           0.035616   0.013887   2.565
    es           0.005392   0.013704   0.393
    SPILLOVER_1  0.171864   0.016267  10.565
    SPILLOVER_2  0.040822   0.015451   2.642
    SPILLOVER_3  0.020948   0.015214   1.377
    SPILLOVER_4 -0.009387   0.015612  -0.601
    SPILLOVER_5 -0.008206   0.015986  -0.513
    ms:es       -0.123743   0.027342  -4.526


      tiflo said:
      November 28, 2011 at 10:36 pm

      Hi Xiao,

      First, let me note that the first step of your analysis, the residualization, differs from what I suggested in my post. The standard length correction would be to use an ordinary linear model against raw RTs.

      l = lm(RT ~ length, data)

      Now, if a logarithm transform is warranted (there is disagreement about this in the literature), you could regress against log RTs. You could also extend the regression to correct for effects of the position of the stimulus in a list. If so, I’d actually recommend a log-transformed effect of position since that is what seems to fit as well or better than linear effects in the analyses I have so far performed. In an ordinary linear model, that would give you (using the same variable names as you):

      l = lm(logRT ~ length + log10(p), data)

      In my post, I suggested that one could also control for word position in sentence, and I did that using a restricted cubic spline. Why? Because while I wasn’t sure how best to model potential effects of word position, a survey of the literature leaves no doubt that this effect is not linear. For example, you see numerous mentions of what are sometimes called ‘wrap-up’ effects, slow-downs at the end of stimuli. Overall, it seems that readers first get faster and then considerably slower towards the end of the stimulus. I refer to Kuperman et al. (2010) for a nice recent investigation of end-of-line, end-of-clause, and end-of-sentence effects. I think they even looked at paragraph effects, though it’s been a while since I read the paper. This paper appeared 2 years after my post, so I did not consider it in my original post.

      Finally, I considered removing individual subject variance already at the point of residualization, where it can be estimated using the information of items and fillers thereby potentially increasing the accuracy of the random variance(s) considerably. Generally, the advantage of using a two-stage analysis (doing residualization first) is only given if you have a lot more data for residualization than you would have if you threw in those predictors into the analysis of your items.

      I am saying all of this because your residualization seems to include a random effect for items. This suggests that you residualized using data from the experimental items only, which, in my opinion, defeats the purpose of doing a separate residualization.

      Second, I am a bit confused by the second step in your analysis. It seems that you are interpreting it as being on the spill-over region, but your code says that you used the same data set used for residualization (data). Perhaps this is a typo? Of course, you need to select the relevant region that you want to analyze (notice that in my example code the data set used for the second analysis has a different name – that’s because it’s only the RTs on the target word). It seems that, for you, the target region is what you call the spill-over region (e.g. the two words following a point of disambiguation). Usually, in order not to inflate the Type I error rate due to auto-correlations, we would analyze the average per-word (residual) reading times over that region. So you need to create that average first, e.g. via data aggregation.

      It seems to me that perhaps you are confused about the difference between the spillover predictors (the five predictor variables in your model) and the spillover region (which seems to be the target region of interest for you, if I understand you correctly)? Sorry, perhaps I am just misreading your post.

      Now, to answer your questions:

      1) The highly significant spill-over effects simply mean that the reading time of words in your data set is highly dependent on the reading time on the previous word(s). Put simply, there are strong auto-correlations between data points. Note that these effects are observed despite the fact that your model includes a random intercept by subject. What this means – and this should come as no surprise to anyone – is that RTs associated with words that are read in adjacency resemble each other more than expected from the average per-word RT of the subject. Notice also that it makes a lot of sense that the spill-over effect is stronger from words that are closer.

      The output you have does suggest that there is a significant main effect of ms on whatever the region is that you selected (if you didn’t select any region, it seems that the ms condition is overall lower on average across the sentence).

      2) If the interaction term exhibits no signs of multi-collinearity (e.g. if the fixed effect correlations of the interaction in your model output, which you didn’t post, are all below .4), the above output already constitutes evidence for an interaction and no model comparison is required. Indeed, if all variables were centered, the design was balanced, and you didn’t overfit to the data (see above), then there should be no problems with collinearity and it seems that you have a highly significant interaction.
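
      One quick way to run that collinearity check is via the variance-covariance matrix of the fixed effects; a sketch, assuming the fitted model object is called `l`:

      ```r
      # Sketch: fixed-effect correlations from the fitted lmer model `l`
      cc <- cov2cor(as.matrix(vcov(l)))
      round(cc, 2)  # check whether the ms:es row has any |r| > .4
      ```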



      Kuperman, V., Dambacher, M., Nuthmann, A. and Kliegl, R. (2010). The effect of word position on eye-movements in sentence and paragraph reading. The Quarterly Journal of Experimental Psychology, 63(9), 1838-1857.


        Xiao He said:
        November 29, 2011 at 12:24 am

        Hi Dr. Jaeger,

        Thanks for the detailed response.

        (1). Maybe I am looking at the wrong code? The code in the article on this page does use lmer() for residualization. Is it for something different? (The code is duplicated below)

        l <- lmer(logRT ~ EXPTsimple + Wlen + log(Lpos) + rcs(Wpos) + (1 | SUBJ), d)
        d$logRTresidual <- residuals(l)

        But I do remember reading your discussion on using ols for residualization in one of your PPTs.

        (2). The reason why I did not use “length” in the residualization procedure is because my study is on Chinese, and the words are all one to two characters long (Sorry, I should’ve mentioned it in my previous post)

        (3). Regarding SPILLOVER variables, I didn’t interpret them as the actual spillover regions. When I say spill-over region, I am referring to the region following my critical word – in my case, the critical word is a reflexive, and the spillover region is what follows the reflexive.

        What I am confused about is how to incorporate them in my analysis. For instance, words in positions 7 – 11 are in the spillover region. If I want to analyze Word 7 (i.e., p=7), I should do the following?

        lmer(logRTresidual~ms*es + SPILLOVER_1 + SPILLOVER_2 + SPILLOVER_3 + SPILLOVER_4 + SPILLOVER_5 + (1|subject) + (1|item), data=data, subset=p=7)

        #the data set “data” only contains experimental items

        (4). Lastly, when you say “we would analyze the average per-word (residual) reading times over that region”, do you mean we should take the average of the region of interest for each item for each subject, as opposed to conducting separate analyses for individual words. Hence, if there are 3 words in the region following the reflexive, I should take an average of the 3 words, and analyze the average residual RTs?

        Thank you again for your explanation and patience. I’ve never analyzed RT data in this way, but I am very eager to learn how to do it.



          tiflo said:
          November 29, 2011 at 12:52 am

          Hi Xiao,

          Regarding (1), yes I was using lmer. I was merely pointing out that the standard is still to use simple lm models. Just so you know.

          Regarding (2), yes that makes sense ;). Note though that I wouldn’t use item random effects and that you really should use fillers and items for residualization.

          Regarding (3), about the region: ah good. I was concerned. Btw, it should be …. subset = p == 7, if at all (if you aggregate –see below– and use the data set that only contains the aggregated RTs, you won’t need the subset option). If you don’t want to aggregate (which will likely inflate the Type I error rate), you could write … subset = p %in% 7:11.

          Regarding (4), I recommend you first aggregate to get one data point per stimulus, thereby avoiding potential problems with auto-correlations. That is, obtain the average per-word RT in the critical region for each response (i.e. the five words in position 7, 8, 9, 10, and 11 in your hypothetical scenario).
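
          A sketch of that aggregation step, using the variable names from your model (the condition variables are carried along, since they are constant within each trial):

          ```r
          # Sketch: one averaged residual RT per subject-item in the region
          region <- subset(data, p %in% 7:11)
          avg <- aggregate(logRTresidual ~ subject + item + ms + es,
                           data = region, FUN = mean)
          ```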

          Finally, make sure to consider random slopes – you are currently only using random intercepts, which can be a real problem, see Jaeger et al (2011) and Barr et al (submitted).
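
          A maximal specification along those lines might look like the following sketch (assuming ms and es are manipulated within subjects and within items; `avg` is a hypothetical name for the aggregated data set):

          ```r
          # Sketch: adding random slopes for the within-unit manipulations
          lmer(logRTresidual ~ ms * es +
                 (1 + ms * es | subject) + (1 + ms * es | item), data = avg)
          ```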




          Xiao He said:
          November 29, 2011 at 12:08 pm

          Thanks! This explanation is extremely helpful. I have one last question, regarding the spillover variables when I use the average per-word residual RT.

          I wonder if I should also aggregate the spillover variables. More specifically, let’s assume that each target word has 5 spillover variables (e.g., SPILLOVER_1, SPILLOVER_2, …, SPILLOVER_5), for every stimulus, should we aggregate all the target words’ SPILLOVER_1 variables, all the target words’ SPILLOVER_2 variables, so on and so forth. So in the end, for each stimulus, we will have one average SPILLOVER_1, one average SPILLOVER_2 … and one average SPILLOVER_5 for the entire target region (just like we only have one aggregate residual RT of the target region per stimuli).

          Thank you again for all the helpful explanations!



          tiflo said:
          November 29, 2011 at 2:22 pm

          Hi Xiao,

          what I did when I was in a similar situation is to create spill-over variables based on the (residual) RTs in word positions 6, 5, 4, … So rather than computing the spill-over of the immediately previous word, I defined ‘previous’ relative to the onset of the target region. That has the advantage that the spill-over variables have a relatively clean interpretation: they are really the spill-over of previous regions onto the target region.
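
          In code, that could look like the following sketch (positions 4-6 precede a target region starting at position 7; the names `SOL_4` etc. are illustrative):

          ```r
          # Sketch: spill-over predictors defined relative to the onset of
          # the target region, taken from the words in positions 6, 5, and 4
          pre <- subset(data, p %in% 4:6,
                        select = c(subject, item, p, logRTresidual))
          pre_wide <- reshape(pre, idvar = c("subject", "item"),
                              timevar = "p", direction = "wide")
          names(pre_wide) <- sub("logRTresidual\\.", "SOL_", names(pre_wide))
          # then merge pre_wide onto the per-trial averaged target-region
          # data by subject and item
          ```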




    Hannah said:
    January 25, 2015 at 7:32 pm

    Hello Dr. Jaeger,

    I’m not sure if you still check on this page, but I have a question about the method you explained above. I have also read Xiao’s comments and questions and your responses. This is my first time trying the spillover analysis, and I have trouble interpreting the results. Below are three sets of my output data, one without spillover included and the other two with spillover included.

    #LogResid.6 <- residual of log-transformed RT in region 6 (DV)
    #article <- IV1
    #NUM <- IV2
    > interLR6=lmer(LogResid.6~article*NUM+(1|participant)+(1|item), data=data)
    > pvals.fnc(interLR6)
                       Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
    (Intercept)         -0.0620  -0.0620    -0.1330     0.0136 0.1012   0.1329
    articleIN            0.0384   0.0379    -0.0260     0.1029 0.2502   0.2360
    NUM02pl              0.0471   0.0461    -0.0207     0.1078 0.1650   0.1469
    articleIN:NUM02pl   -0.0448  -0.0438    -0.1336     0.0489 0.3448   0.3288

    As you can see, there is no significant interaction here. The overall data (including fillers and everything; the model above has only experimental items in the critical region) showed the effect of spillover_2 so I tried and included it in the next analysis.

    #sol_2.6 <- spillover variable (spillover_2) for region 6
    > interLR6=lmer(LogResid.6~article*NUM*sol_2.6+(1|participant)+(1|item), data=data)
    > pvals.fnc(interLR6)
                               Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
    (Intercept)                 -0.2608  -0.2687    -1.0036     0.5129 0.4882   0.4951
    articleIN                    1.3172   1.2870     0.2003     2.2885 0.0172   0.0117
    NUM02pl                      0.7162   0.7507    -0.2665     1.7441 0.1500   0.1624
    sol_2.6                      0.0315   0.0328    -0.0912     0.1466 0.5894   0.5994
    articleIN:NUM02pl           -1.8396  -1.8716    -3.3327    -0.4039 0.0112   0.0130
    articleIN:sol_2.6           -0.2022  -0.1978    -0.3590    -0.0294 0.0208   0.0141
    NUM02pl:sol_2.6             -0.1054  -0.1110    -0.2697     0.0476 0.1766   0.1913
    articleIN:NUM02pl:sol_2.6    0.2835   0.2889     0.0508     0.5144 0.0136   0.0155

    –> When spillover_2 (sol_2.6) is added, the interaction of the two IVs becomes significant, as does the three-way interaction of the IVs and the spillover variable.

    The last analysis includes all three spillover variables, but not as interaction terms; I only added them to the model.

    > interLR6=lmer(LogResid.6~article*NUM+sol_1.6+ sol_2.6+sol_3.6+(1|participant)+(1|item), data=data)
    > pvals.fnc(interLR6)
                       Estimate MCMCmean HPD95lower HPD95upper  pMCMC Pr(>|t|)
    (Intercept)          0.2960   0.2967    -0.3676     1.0210 0.4018   0.3805
    articleIN            0.0380   0.0375    -0.0269     0.1015 0.2516   0.2396
    NUM02pl              0.0479   0.0472    -0.0186     0.1115 0.1512   0.1415
    sol_1.6              0.0066   0.0063    -0.0541     0.0719 0.8444   0.8309
    sol_2.6             -0.0571  -0.0554    -0.1138     0.0074 0.0746   0.0591
    sol_3.6             -0.0060  -0.0075    -0.0672     0.0568 0.8132   0.8436
    articleIN:NUM02pl   -0.0502  -0.0499    -0.1368     0.0483 0.2876   0.2766

    Only sol_2.6 approaches marginal significance, and the interaction of the two IVs is non-significant again. I believed I knew how to handle mixed effects modeling until I tried this spillover analysis. Could you please help me interpret these outputs?

    Thank you very much,


      tiflo responded:
      January 25, 2015 at 9:22 pm

      Hi Hannah,

      It’s a little hard to follow your post, as no details about the variables, coding, or the data set, not to mention the goal of the analysis, are given. I assume you sum-coded your variables? Or in some other way made sure that the interaction won’t be collinear with its main effects? Additionally, I noticed that you only use random intercepts. Unless your manipulations were both between items and between participants (rather than within), this would be anti-conservative.

      It’s theoretically possible that an effect only shows up once other variables (in this case, spill over effects) are controlled for. After all, that is one of the reasons for the use of multiple regression =). With the information you have provided, I can’t ascertain that this is the case here. Perhaps you could send a more detailed email to the lingRlang list? (but make sure to first address the points I raised above). I hope this helps at least a little bit.



    Tessa said:
    September 8, 2017 at 7:45 am

    Dear Dr. Jaeger,

    Thank you for your wonderful blog, I often find useful notes on here when I am looking for the right way to do my analyses! I read this particular post thinking of using your method on my own self-paced reading data, and then I saw your comment in response to Xiao’s question:

    “Generally, the advantage of using a two-stage analysis (doing residualization first) is only given if you have a lot more data for residualization than you would have if you threw in those predictors into the analysis of your items.”

    I think this applies to my data as well, which includes only target items (of various types, hence the lack of fillers). Do I understand correctly that using residualization is only recommended when the data set contains many more (filler) items than are used in the final analyses? I imagine there are some advantages in model estimation in the final step as well, given that adding predictors might increase the chance of convergence issues, but perhaps this is not sufficient reason to use residualization? Would you happen to have any references on the use of covariates versus residualization that I could look into?

    Many thanks,



      tiflo responded:
      September 8, 2017 at 10:49 am

      Dear Tessa,

      I didn’t mean my statement as a hard rule, rather as a gradient statement. Let’s say you have k data points for critical items and j for critical items and fillers. The larger the ratio between j and k, the better your estimate of the functional relation you want to residualize out (e.g., RT ~ word length), as long as the additional data points follow the same functional relation (e.g., as long as it’s reasonable to believe that word length affects RTs in the same way for critical and filler items).

      My hunch is that convergence issues are not a good reason to take the two-stage approach. If j==k it’ll generally be better to jointly estimate the effects of interest and the effect you want to residualize out in the same model. hth (and thanks for your interest in the blog).
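
      In that single-stage setup, the nuisance predictors simply enter the model of interest directly; a sketch with illustrative variable names:

      ```r
      # Sketch: jointly estimating the effect of interest (condition) and
      # the nuisance effects one would otherwise residualize out
      lmer(logRT ~ condition + scale(Wlen) + log10(Lpos) +
             (1 + condition | subject) + (1 | item), data = d)
      ```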



        Tessa said:
        September 11, 2017 at 4:08 am

        Dear Florian,

        Thank you for your quick reply, it’s very helpful!



