Benjamin Van Durme and Austin Frank (who still doesn’t have a webpage) have been doing some neat comparisons of web-based estimates of language experience against traditional data sources. This work is part of a project funded by the University of Rochester’s Provost Award for Multidisciplinary Research. Since I really like the results, I’m going to use some lazy time to blog about my favorites.
We found that web-based probability estimates can be used to investigate probability-sensitive human behavior. We used databases of word naming, picture naming, and lexical decision tasks, as well as a database of word durations derived from the Switchboard corpus of spontaneous speech. Comparing Google Web 1T 5-gram counts against CELEX (spoken and written), the BNC (spoken and written), and Switchboard counts, we estimated word frequencies from each source and compared models built on these different frequency estimates against the different types of probability-sensitive language behavior mentioned above (word naming RTs, etc.).
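The basic comparison here can be sketched in a few lines. This is a toy illustration with synthetic data, not the study’s actual pipeline: the variable names and the simulated "web" vs. "CELEX-like" frequency estimates are my invention, standing in for log counts drawn from the real corpora. The idea is simply that the frequency estimate which tracks true language experience more closely yields a better regression fit against RTs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 words with a latent "true" log frequency.
# Each corpus provides a noisy estimate of it; RTs get faster with
# higher frequency (plus measurement noise), as in lexical decision.
n = 200
true_logfreq = rng.normal(10, 2, n)
web_logfreq = true_logfreq + rng.normal(0, 0.5, n)    # less noisy estimate
celex_logfreq = true_logfreq + rng.normal(0, 1.0, n)  # noisier estimate
rt = 800 - 20 * true_logfreq + rng.normal(0, 30, n)   # simulated RTs (ms)

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return 1 - resid.var() / y.var()

print(f"web-based fit:  R^2 = {r_squared(web_logfreq, rt):.3f}")
print(f"CELEX-like fit: R^2 = {r_squared(celex_logfreq, rt):.3f}")
```

With these (made-up) noise levels the web-based estimate wins, which is the shape of the result, not its substance: in the actual comparison the counts come from Google Web 1T, CELEX, the BNC, and Switchboard, and the behavioral data from the naming, lexical decision, and word duration databases.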
I find this encouraging: web data, unlike traditionally used data sources, is cheap and readily available for many languages, thereby facilitating cross-linguistic work on probability-sensitive human language processing. Additionally, we found at least preliminary evidence that a simple principal component analysis over the various frequency estimates leads to better correlations with human language behavior.
The last row shows the fits of the 1st principal component against lexical decision RTs, word naming latencies, and picture naming latencies in English. As is evident from the tighter spread of the data along the fitted line, the PCA fit is better than the fits against CELEX or Google data alone. This suggests that we could substantially improve frequency controls in psycholinguistic experiments by using the principal components of several corpora, which together reflect human language experience better than any single corpus alone.
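Why would the first principal component beat any single corpus? Each corpus gives a noisy view of the same latent quantity (a word’s frequency in actual experience), and PC1 pools the shared variance while averaging out corpus-specific noise. A toy sketch with synthetic data (again, my own invented numbers, not the study’s):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: four corpora, each a noisy view of the same
# latent log frequency; RTs depend only on the latent value.
n = 300
latent = rng.normal(10, 2, n)
corpora = np.column_stack(
    [latent + rng.normal(0, s, n) for s in (0.9, 1.0, 1.1, 1.2)]
)
rt = 800 - 20 * latent + rng.normal(0, 30, n)

# PCA via SVD on the z-scored frequency matrix; the first right
# singular vector gives the loadings, z @ vt[0] the PC1 scores.
z = (corpora - corpora.mean(axis=0)) / corpora.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
pc1 = z @ vt[0]

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return 1 - (y - (slope * x + intercept)).var() / y.var()

best_single = max(r_squared(corpora[:, j], rt) for j in range(4))
print(f"best single-corpus R^2: {best_single:.3f}")
print(f"PC1 R^2:                {r_squared(pc1, rt):.3f}")
```

Because PC1 averages out the independent noise in the four estimates, its fit against the simulated RTs beats the best single corpus, which is exactly the pattern the last row of the figure shows for the real data.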
Finally, a cute little finding that I particularly like: using the Balota et al. (2001) data set (conveniently included in Harald Baayen’s languageR package), we were able to compare web-based estimates against the lexical decision times of younger vs. older subjects. Web-based estimates provide a better fit for younger subjects than for older subjects:
I love this! Let’s see whether it holds up.