24
Jan
10

Two nice resources to find the language of your choice

I was just reading Haspelmath’s post on the CyberLingBlog in reply to a summary of a recent talk by Newmeyer. Most of you probably know the World Atlas of Language Structures (WALS), which allows you to browse through languages and linguistic properties, view their distributions over beautiful maps, and contains nice introductory articles to many typological features. It’s very well structured and gives you references for each language, too. Here’s a link to a page on a specific language, Polish.

There is another database that I didn’t know about which lets you browse or search an (as yet rather small) set of languages for morpho-syntactic properties: Syntactic Structures of the World’s Languages. Properties are defined in a pragmatic and manageable way. For example, SVO is defined as allowing that order in a “neutral” context. The definition also makes clear that SVO can be “yes”, while other word order features are “yes”, too.

It seems that you can even contribute to this database by entering your own data (though maybe you need to apply?), including examples with glosses. Looks interesting. The usual caveats apply, but it's great that someone is trying! Aside from this project I remember only one similar project that someone at the University of Vienna started while I was an undergrad … but as far as I remember that never reached critical mass.

If anybody knows of similar databases out there, feel free to post them below. Or even better: contribute to the CyberLingBlog. Everybody is invited.

23
Jan
10

Blog on ggplot2

If you haven’t already, check out this nice R blog with lots of code for good ggplot2 and lattice figures.

21
Dec
09

Some thoughts on the sensitivity of mixed models to (two types of) outliers

Outliers are a nasty problem for both ANOVA and regression analyses, though at least some folks consider them more of a problem for the latter type of analysis. So, I thought I'd post some of my thoughts on a recent discussion about outliers that took place at CUNY 2009. Hopefully, some of you will react and enlighten me/us (maybe there are some data, some simulations out there that may speak to the issues I mention below?). I first summarize a case where one outlier out of 400 apparently drove the result of a regression analysis, but wouldn’t have done so if the researchers had used AN(C)OVA. After that I’ll have some simulation data for you on another type of “outlier” (I am not even sure whether outlier is the right word): the case where a few levels of a group-level predictor may be driving the result. That is, how do we make sure that our results aren’t just due to item-specific or subject-specific properties? Continue reading ‘Some thoughts on the sensitivity of mixed models to (two types of) outliers’

19
Dec
09

new blog on cyberlinguistics

Folks,

It still looks like a new-born, but hopefully, it will serve as a source for anybody interested in … cyber linguistics. How is that for a name, huh?!? The blog is meant to bring together researchers in corpus linguistics, data archiving, field work, etc. with the goal of sharing news on conferences, funding, tools (e.g. for data collection, data organization, data annotation) and new books/tutorials etc. You can find a detailed mission statement and philosophy of the Cyberling blog on their page.

Be cyber,

Florian


18
Dec
09

Some R code to understand the difference between treatment and sum (ANOVA-style) coding

Below, I’ve posted some code that

  1. generates an artificial data set
  2. creates both treatment (a.k.a. dummy) and sum (a.k.a. contrast or ANOVA-style) coding for the data set
  3. compares the lmer output for the two coding systems
  4. suggests a way to test simple effects in a linear mixed model

Mostly though the code is just meant as a starting point for people who want to play with a balanced (non-Latin square design) data set to understand the consequences of coding your predictor variables in different ways.
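To give a flavor of what the two coding schemes do, here is a minimal sketch in Python rather than R. It is not the lmer code from the post: it ignores random effects entirely and just solves for the coefficients of a saturated 2x2 model from four cell means. The cell means, the factor names A and B, and the -1/+1 sum coding are all made up for illustration (R's contr.sum assigns the signs the other way round).

```python
from fractions import Fraction

# Hypothetical cell means of a balanced 2x2 design (made-up numbers).
CELL_MEANS = {("a1", "b1"): 10, ("a2", "b1"): 14,
              ("a1", "b2"): 12, ("a2", "b2"): 22}

# Numeric codes per factor level under the two schemes (illustrative choice).
TREATMENT = {"a1": 0, "a2": 1, "b1": 0, "b2": 1}
SUM = {"a1": -1, "a2": 1, "b1": -1, "b2": 1}


def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination (exact fractions)."""
    n = len(rhs)
    aug = [[Fraction(v) for v in row] + [Fraction(r)]
           for row, r in zip(matrix, rhs)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def coefficients(coding):
    """Coefficients of mean ~ 1 + A + B + A:B under the given coding."""
    cells = sorted(CELL_MEANS)
    design = [[1, coding[a], coding[b], coding[a] * coding[b]]
              for a, b in cells]
    means = [CELL_MEANS[cell] for cell in cells]
    return dict(zip(["intercept", "A", "B", "A:B"], solve(design, means)))


for name, coding in (("treatment", TREATMENT), ("sum", SUM)):
    print(name, {k: str(v) for k, v in coefficients(coding).items()})
```

Under treatment coding the intercept is the mean of the reference cell (a1, b1) and the A coefficient is the simple effect of A at b1. Under sum coding the intercept is the grand mean and the A coefficient is half the difference between the a2 and a1 marginal means, which is why the two outputs look so different for the very same data.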

Continue reading ‘Some R code to understand the difference between treatment and sum (ANOVA-style) coding’

18
Dec
09

Results of animacy and accessibility in Yucatec

Good news! We’ve analyzed the previously mentioned experiment on animacy and word order in Yucatec. We coded animacy of the Agent and Patient referents (human, animal, inanimate), transitivity (transitive, intransitive) and voice (active, passive, other) of the verb. We also coded the definiteness of the Agent and Patient referents (definite, indefinite).

Overall, Agent-Verb-Patient word order was strongly preferred (see Table 1). Moreover, human subjects were more likely to appear earlier in the sentence (ps<0.0001, interaction n.s., N=597), which is predicted by direct accessibility accounts. Human agents and patients were more likely to be described as definite (ps<0.0002), and definite NPs showed a tendency to be mentioned earlier (agent: p<0.0001; patient: n.s., interaction p<0.0001). Still, the effect of animacy held independently (ps<0.002; interaction n.s.). The agent animacy effect was somewhat mediated by an effect on transitivity (whether participants described an event as, e.g., an apple hitting a man or an apple falling on a man), in that inanimate agents were less often described transitively (p<0.0001; no patient effects). The agent animacy effect remained significant even for transitive sentences (p<0.004; no interaction, N=502). In terms of the effects of voice, human agents correlated with the use of active voice (p<0.0001), and human patients correlated with the use of passive voice, though not as strongly (p<0.03, N=604).

Table 1: Word order and voice

Agent, Patient and Verb of 531 transitives (excluding 161 non-transitives)

Word order            Total   Active   Passive   Other
Agent-Verb-Patient      440      427         7       6
Patient-Verb-Agent       63        2        61       0
Other                    28       20         7       1

What does this mean? Interesting results! In Yucatec, the passive voice is encoded by verbal morphology, and it neither presupposes nor precludes a word order change. When the patient was human, sentences were more likely to be in the passive voice. Moreover, human patients were more likely to be mentioned earlier. So, with human patients we see both passive voice morphology and earlier mention.

17
Dec
09

How good is the web as an approximation of language experience?

Benjamin Van Durme and Austin Frank (who still doesn’t have a webpage) have been doing some neat comparisons of web-based estimates of language experience vs. traditional data sources. This work is part of a project funded by the University of Rochester’s Provost Award for Multidisciplinary Research. Since I really like the results, I am gonna use some lazy time to blog about my favorites.

We found that web-based probability estimates can be used to investigate probability-sensitive human behavior. We used databases of word naming, picture naming, and lexical decision tasks, as well as a database of word durations derived from the Switchboard corpus of spontaneous speech. We estimated word frequencies from Google Web 1T 5-gram counts as well as from CELEX (spoken and written), BNC (spoken and written), and Switchboard counts, and compared models using these different frequency estimates against the different types of probability-sensitive language behaviors mentioned above (word naming RTs, etc.).

I find this encouraging, as web data, unlike traditionally used data sources, is cheap and readily available for many languages, thereby facilitating cross-linguistic work on probability-sensitive human language processing. Additionally, we found at least preliminary evidence that simple principal component analysis over the various frequency estimates leads to better correlation with human language behavior.
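For readers who haven't seen the principal component trick before, here is a rough sketch of how a first principal component over several frequency estimates can be computed. It is not the analysis we ran: the function names are hypothetical, the log frequencies are made-up toy values, and the power iteration used here assumes that the estimates are positively correlated with each other (which frequency estimates from different corpora typically are).

```python
import math


def standardize(column):
    """Center a column and scale it to unit (sample) standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]


def first_principal_component(columns, iterations=200):
    """Return (loadings, scores) of the first PC of the standardized columns.

    Uses plain power iteration on the correlation matrix, starting from equal
    positive weights. This assumes positively correlated columns.
    """
    z = [standardize(c) for c in columns]
    k, n = len(z), len(z[0])
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1)
             for j in range(k)] for i in range(k)]
    w = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(corr[i][j] * w[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    # One combined frequency score per word.
    scores = [sum(w[i] * z[i][row] for i in range(k)) for row in range(n)]
    return w, scores


# Made-up log frequencies of five words in two hypothetical sources.
source_a = [2.0, 3.1, 4.2, 5.0, 6.3]
source_b = [1.5, 3.0, 3.9, 5.4, 6.0]
loadings, combined = first_principal_component([source_a, source_b])
print(loadings)
print(combined)
```

The combined scores can then be entered as a single frequency predictor in place of any one of the individual estimates.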

CELEX (written), Google Ngram, and their 1st principal component fitted against probability-sensitive human language behavior

Continue reading ‘How good is the web as an approximation of language experience?’

13
Dec
09

Ever noticed? Have a closer look at Google counts

The other day, Anne Pier Salverda made me aware of the following strange coincidence. While googling for “two women were having dinner” only yields a handful of hits (3 when I was performing the search today), the search for “two men were having dinner” yields tens of thousands of hits (about 220,000 when I performed the search). Let’s not ask why Anne Pier was searching for these strings to begin with ;) , but it was curious, so I looked for more.

This does not seem to be driven by highly repeated mentions of only a few stories.  Neither does it seem to be gender specific. Searches for “two x were having dinner”, where x is one of “boys”, “girls”, “guys” yield no or only a few hits.

So, is this really just a weird coincidence of this particular string “two men were having dinner”? Curiously, the same set of patterns in the present progressive yields the same asymmetry: “two men are having dinner” yielded about 185,000 hits, whereas the other searches (“women”, “girls”, “boys”, “guys”) yielded no or only a handful of hits. The same asymmetry also holds for “two x have dinner”. Huh?!?

Then I had a closer look at these thousands of hits. You can do it yourself for any of the searches (provided that Google hasn’t changed this already): As soon as you click to see any of the hits beyond page 1, Google suddenly claims that there are only about 40-50 hits (depending on the particular “Two men HAVE dinner” string used). These few hits in turn seem to come from only a few news sources. So, actually, all that seems to be going on is that there are a few more hits for the “men” examples, which is perfectly consistent with the fact that the internet (surprisingly) seems to talk more about “men” than “women” (a 10:7 ratio when I last looked).

Great. Have you ever noticed anything comparable? Is this a bug? I just checked bing.com and they seem to return the right count. What a sad day.

13
Dec
09

Self-paced reading via WWW

And while I am at it, I may point out this sweet tool to run self-paced reading experiments via the WWW developed by Alex Drummond (many thanks to Carlos Gomez Gallo for pointing this software out to me). For an implemented example, see Masha Polinsky’s lab page. Also check this page on how this web-based self-paced reading paradigm has been tested with different keyboard setups.

13
Dec
09

Mech Turk and Written Recall

Ah, we’ve just got back the first results of two studies that used a written recall paradigm via Mechanical Turk to test a couple of predictions of Uniform Information Density. You can see an example template of a written recall procedure here (JavaScript required). Each study took about 1 day for the equivalent of 20 participants (balanced across 4 lists) at $.02 per trial plus some bonuses (see below).

The next step is to implement a spoken recall paradigm. If anyone out there has already done that, let me know.

We also tested progressive payment as a way to elicit more balanced data sets. Whereas normal MechTurk data sets exhibit Zipf distributions with regard to the trials per participant, a simple progressive scheme ($.20 for at least 20 trials, $.50 for at least 40 trials, etc.) worked quite well to drastically increase the percentage of data that comes from participants who’ve done the entire experiment.
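To make the arithmetic of such a scheme concrete, here is a small sketch. It only uses the two tiers mentioned above plus the $.02 per trial base rate; the function name is hypothetical, and so is the assumption that a worker receives only the highest bonus tier reached rather than the sum of all tiers.

```python
BASE_CENTS_PER_TRIAL = 2

# (minimum number of trials, bonus in cents); only the two tiers named above.
BONUS_TIERS = [(20, 20), (40, 50)]


def payment_cents(trials_completed, tiers=BONUS_TIERS):
    """Total payment in cents: base pay plus the highest bonus tier reached.

    Paying only the highest tier (not the sum of tiers) is an assumption.
    """
    base = trials_completed * BASE_CENTS_PER_TRIAL
    bonus = max((cents for minimum, cents in tiers
                 if trials_completed >= minimum), default=0)
    return base + bonus


for trials in (10, 20, 40):
    print(trials, "trials:", payment_cents(trials) / 100, "dollars")
```

Working in cents avoids floating point surprises when summing many small payments.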

Furthermore, HLP lab manager Andrew Watts has written a little script that makes sure that each item gets only seen in one condition by each participant and that conditions are counterbalanced across participants (worker IDs). We’re still working on some details, but once it’s ready for prime time, we’ll share it here.
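Until then, here is a minimal sketch of the general idea, a standard Latin square rotation. This is not Andrew's script: the function names are hypothetical, and assigning lists by a worker's order of arrival is just one simple assumption about how worker IDs could be mapped to lists.

```python
def latin_square_lists(n_items, conditions):
    """Build one presentation list per condition.

    In list k, item i appears in condition (i + k) mod n, so every item is
    seen in exactly one condition per list and in every condition across lists.
    """
    n = len(conditions)
    return [[(item, conditions[(item + k) % n]) for item in range(n_items)]
            for k in range(n)]


def list_for_worker(worker_index, lists):
    """Rotate through the lists by order of arrival (an assumed policy)."""
    return lists[worker_index % len(lists)]


lists = latin_square_lists(8, ["A", "B", "C", "D"])
for worker_index in range(4):
    print(worker_index, list_for_worker(worker_index, lists))
```

With 8 items and 4 conditions, each list contains every condition twice, and every fourth worker gets the same list, which keeps the lists balanced as long as workers complete the whole experiment.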

01
Nov
09

New support group for the Upstate region

Ok, we may have named it in haste, but maybe it’s still useful: We’ve created an email list that can be used to announce stats and machine learning workshops of interest to language researchers, psycholinguists, etc. in the Upstate area (Rochester, Cornell, Buffalo, etc.). Feel free to join, it’s very low traffic: http://groups.google.com/group/statistics-northeast

23
Oct
09

Getting Lingalyzer to work on MacOS X

We recently ran into a problem getting Lingalyzer, the analysis program from Doug Rohde’s Linger, to work on MacOS X. The problem occurs at least on 10.5 and up, but could well occur on lower versions as well. Lingalyzer depends on a statistics suite called |stat, which is where the actual problem lies.

When you run the lingalyzer script, it dies with the error "warning: this program uses gets(), which is unsafe." We were initially confused, because Lingalyzer is written in Tcl, which has a gets() function, and lingalyzer uses it quite a bit. But the problem was actually the |stat programs that it was calling, which use the C gets() function. The gets() function is well known for being a buffer overflow risk. GCC warns you sternly not to use it, but MacOS X goes so far as to trap calls to it and refuse to execute the offending program.

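It turns out that there is a relatively easy solution, namely replacing all calls to gets() with calls to fgets(). Wherever in the source code you see:
while (gets (line))
replace it with:
while (fgets (line, sizeof(line), stdin))
Two caveats: sizeof(line) only gives the buffer size if line is declared as an array (not as a pointer), and fgets(), unlike gets(), keeps the trailing newline in the buffer, so code that relies on the newline being stripped may need to remove it.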

I have a patch that can be applied to the |stat source code that replaces all of them, as well as adding the CFLAGS to the makefile to build a Universal Binary. However, the license for |stat appears to prohibit redistributing modified versions of the code, and a patch might run afoul of that. If you ask nicely I can email it to you though. The license also prohibits even local modifications for any purpose other than making it run on your system, so if MacOS didn’t terminate programs with gets() with extreme prejudice, then even the changes I made would be in violation. Weird.

29
Aug
09

Nagelkerke and CoxSnell Pseudo R2 for Mixed Logit Models

What to do when you need an intuitive measure of model quality for your logit (logistic) model? The problem is that logit models don’t have a nice measure such as R-square for linear models, which has a super intuitive interpretation. However, several pseudo R-square measures have been suggested, and some are more commonly used (e.g. Nagelkerke R2). In R, some model-fitting procedures for ordinary logistic regression provide the Nagelkerke R-square as part of the standard output (e.g. lrm in Harrell’s Design package). However, no such measure is provided for the most widely used mixed logit model-fitting procedure (lmer in Bates’ lme4 library). Below I provide some code that provides Nagelkerke and CoxSnell pseudo R-squares for mixed logit models. Continue reading ‘Nagelkerke and CoxSnell Pseudo R2 for Mixed Logit Models’
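For orientation, here are the two textbook formulas as a small Python sketch (not the R code from the full post). Both measures only need the log-likelihood of a baseline model, the log-likelihood of the fitted model, and the number of observations. The function names are hypothetical, and which baseline to use for a mixed model (e.g. an intercept-only model with or without random effects) is a choice this sketch leaves to you.

```python
import math


def cox_snell_r2(loglik_null, loglik_model, n_obs):
    """Cox & Snell pseudo R2: 1 - (L_null / L_model)^(2/n), via log-likelihoods."""
    return 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n_obs)


def nagelkerke_r2(loglik_null, loglik_model, n_obs):
    """Nagelkerke pseudo R2: Cox & Snell rescaled by its maximum possible value."""
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n_obs)
    return cox_snell_r2(loglik_null, loglik_model, n_obs) / max_r2


# Made-up log-likelihoods for 100 observations.
print(cox_snell_r2(-69.3, -55.0, 100))
print(nagelkerke_r2(-69.3, -55.0, 100))
```

The rescaling matters because the Cox & Snell measure cannot reach 1 for binary outcomes, even for a model that predicts every observation perfectly.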

18
Aug
09

www est mort, vive www

Our web server (www.hlp.rochester.edu) bit the dust yesterday. It was a seven year old computer that originally came from a compute cluster, and that we then had running non-stop for the last two years as a web server. A new computer is being ordered, but we’re likely to be down for a week or two, just so you know, if you care.

UPDATE (2009-09-16): WWW is back online.

11
Aug
09

HLP lab is growing/shrinking = grinking

It’s maaaaaaaaaa pleasure to announce that a couple of new folks will be joining/visiting HLP lab this Fall.  It’s less of a pleasure to say that some of the folks will leave the lab to move on, but that’s how it goes. So, here comes an introduction and a farewell.

Two new students will join via the PhD program:

  • Ting Qian has decided to join us for his graduate studies. He completed his B.Sc. in BCS (Artificial Intelligence track) at the University of Rochester after transferring from SUNY Oswego. He has worked on genetic algorithms, attribute-driven extraction of lexical classes, as well as on the distribution of information throughout discourses in written and spoken Mandarin Chinese (publications on the latter topic can be found on the HLP lab website).
  • Masha Fedzechkina will join us coming from the University of Cologne (originally from Belarus), where she finished a Magister in Data Processing including classes in linguistics and CS. She’s planning to work on processing-driven effects on acquisition using both computational and behavioral methods.

Additionally, two graduate researchers from other universities will join the lab to lead our NSF-funded research project on Field-based Psycholinguistics in the Yucatan (joint work with Juergen Bohnemeyer, UB):

  • Alice Lemieux will join us from the University of Chicago where she’s working on her PhD in Linguistics. She’s done other fieldwork (on Washo) before and is interested in language contact.
  • Lindsay K. Butler will join us from the University of Arizona where she’s working on her PhD in Linguistics. Lindsay already has some background in Yucatec from an intensive summer workshop at UNC (including a couple of weeks in the Yucatan).

Alice and Lindsay will take classes on Yucatec from Juergen Bohnemeyer to obtain basic speaking knowledge of Yucatec and to learn about the linguistic structure of Yucatec. They will run sentence production studies on Yucatec Maya (in the Yucatan) including experiments on accessibility and weight effects on word order and morphological choices. I want to take this opportunity to publicly thank Lis Norcliffe and Tania Nikitina for absolutely invaluable help with the preparation of the grant. Lis hugely influenced the design of our experiments and co-wrote large sections of the grant. Of course, she’s not to be blamed for anything we may screw up ;) . Thanks also go to Carlos Gomez Gallo and Katrina Housel who helped gather the pilot data for the grant during our previous visit to the Yucatan.

  • In September, we also will have an HLP lab visitor. Elma Kerz is joining us from the RWTH Aachen in Germany where she teaches computer-based linguistics, psycholinguistics, and various other things. Her work includes papers on construction grammar, cognitive grammar, and grammaticalization.

Finally, HLP will have its first alumni this year:

  • Benjamin VanDurme (“The Durmster”; PhD in CS with a Minor in Linguistics) got offers from Stanford (post-doc with Dan Jurafsky and Chris Manning) and the Human Language Technology Center of Excellence at Hopkins. After much consideration he chose Hopkins as research faculty where he will start soon. Most of Ben’s work over recent years was with Len Schubert and, more recently, also with Dan Gildea, but he’s also been involved in several HLP lab projects including work on the link between words’ redundancy in context and their pronunciation as well as work assessing the use of Google n-grams for research on language processing. It’s unclear what will happen to our espresso machine, now that he’s gone. Ben, just you wait and see. You’ll come back for that (lukewarm) espresso ;) !
  • Carlos Gomez Gallo is about to wrap things up, too (things=PhD in CS with a Minor in Linguistics). He received post-doc offers from Minnesota and Harvard, and an offer to join the American University of Beirut as faculty. He’s going to join Maria Polinsky’s lab at Harvard to run cross-linguistic studies on linguistic representations and processing. It looks like collaborations (among other things on Spanish and Maya data we collected in the Yucatan last Winter) will continue since Masha is interested in similar questions. Carlos is also about to finish a manuscript on work investigating incremental language production beyond the clausal level. Much of his work at Rochester was concerned with the creation and use of the Fruitcart corpus (which he will not fail to mention if you run into him ;) ). This work was done in collaboration with James Allen and others in his lab. Carlos, good luck at Harvard!

So, welcome and ciao ciao (but visit).
