How good is the web as an approximation of language experience?



Benjamin Van Durme and Austin Frank (who still doesn’t have a webpage) have been doing some neat comparisons of web-based estimates of language experience vs. traditional data sources. This work is part of a project funded by the University of Rochester’s Provost Award for Multidisciplinary Research. Since I really like the results, I am gonna use some lazy time to blog about my favorites.

We found that web-based probability estimates can be used to investigate probability-sensitive human behavior. We used databases of word naming, picture naming, and lexical decision tasks, as well as a database of word durations derived from the Switchboard corpus of spontaneous speech. We estimated word frequencies from Google Web 1T 5-gram counts, CELEX (spoken and written), the BNC (spoken and written), and Switchboard, and then compared models based on these different frequency estimates against the different types of probability-sensitive language behavior mentioned above (word naming RTs, etc.).
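To make the comparison concrete, here is a minimal sketch of how raw corpus counts are typically turned into comparable frequency estimates (smoothed log frequency per million tokens). The counts and corpus sizes below are made up for illustration; they are not the real CELEX or Web 1T totals.

```python
import math

# Hypothetical token counts for two words in two corpora; corpus sizes
# are illustrative placeholders, not the actual CELEX / Web 1T totals.
corpus_size = {"celex_written": 17_900_000, "web1t": 1_000_000_000_000}
counts = {
    "dog":   {"celex_written": 4_500, "web1t": 240_000_000},
    "axiom": {"celex_written": 30,    "web1t": 1_800_000},
}

def log_freq_per_million(count, corpus_tokens):
    """Add-one smoothed log10 frequency per million tokens."""
    return math.log10((count + 1) / corpus_tokens * 1_000_000)

# One log-frequency estimate per word per corpus
estimates = {
    word: {c: log_freq_per_million(n, corpus_size[c]) for c, n in by_corpus.items()}
    for word, by_corpus in counts.items()
}
```

Putting every corpus on the same per-million log scale is what makes it possible to regress the different estimates against the same behavioral measures.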

I find this encouraging: web data, unlike traditionally used data sources, is cheap and readily available for many languages, thereby facilitating cross-linguistic work on probability-sensitive human language processing. Additionally, we found at least preliminary evidence that a simple principal component analysis over the various frequency estimates leads to better correlations with human language behavior.
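For readers curious what "PCA over the various frequency estimates" amounts to in the simplest case: with two z-scored, positively correlated frequency estimates, the first principal component is just their scaled sum (the loading vector is (1, 1)/√2). A small sketch with made-up log frequencies:

```python
import math

def zscore(xs):
    """Standardize a list of values (sample standard deviation)."""
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))
    return [(x - m) / sd for x in xs]

def first_pc_two_vars(a, b):
    """First principal component of two standardized variables.
    The 2x2 correlation matrix [[1, r], [r, 1]] has its largest
    eigenvalue (1 + r, for r > 0) at eigenvector (1, 1)/sqrt(2),
    so PC1 scores are the scaled sums of the z-scores."""
    za, zb = zscore(a), zscore(b)
    return [(x + y) / math.sqrt(2) for x, y in zip(za, zb)]

# Hypothetical log frequencies for five words in two corpora
celex = [3.1, 2.4, 1.9, 0.8, 0.2]
web   = [3.4, 2.1, 2.0, 1.1, 0.5]
pc1 = first_pc_two_vars(celex, web)
```

The resulting PC1 scores can then be entered as the frequency predictor in place of either corpus's estimate alone; with more than two corpora the same idea applies with a full eigendecomposition (e.g. R's `prcomp`).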

CELEX (written), Google Ngram, and their 1st principal component fitted against probability-sensitive human language behavior

The last row shows the fits of the 1st principal component against lexical decision RTs and word naming and picture naming latencies in English. As is evident from the fact that the data spread more tightly along the fitted line, the PCA fit is better than the fits against the CELEX or Google data alone. This suggests that we could substantially improve controls in psycholinguistic experiments by using the principal components of various corpora, which together reflect human language experience better than any single corpus alone.

Finally, a cute little finding that I particularly like: using the Balota et al. (2001) data set (conveniently included in Harald Baayen’s languageR package), we were able to compare web-based estimates against the lexical decision times of younger vs. older subjects. Web-based estimates provide a better fit for the younger subjects than for the older subjects:

Google Ngram-based frequency estimates fitted against lexical decision RTs

I love this! Let’s see whether it holds up.

3 thoughts on “How good is the web as an approximation of language experience?”

    Klinton said:
    December 17, 2009 at 1:55 pm

    Very cool results! For the last one, did you also check the same fits for a non-web corpus? Or could it be that the young people are just more predictable in general? 😉


      tiflo said:
      December 17, 2009 at 2:02 pm

      Haha, I was thinking about that possibility as I was typing it. We’ve done some such comparisons (and found no difference between the old vs. young fits for CELEX, as I recall), but we haven’t done all corpora. Maybe Austin can say more about this. I’ll bug him.



Questions? Thoughts?