Ok, we may have named it in haste, but maybe it’s still useful: We’ve created an email list that can be used to announce stats and machine learning workshops of interest to language researchers, psycholinguists, etc. in the Upstate area (Rochester, Cornell, Buffalo, etc.). Feel free to join, it’s very low traffic: http://groups.google.com/group/statistics-northeast
Author Archive for tiflo
What to do when you need an intuitive measure of model quality for your logit (logistic) model? The problem is that logit models don’t have a nice measure such as R-square for linear models, which has a super intuitive interpretation. However, several pseudo R-square measures have been suggested, and some are more commonly used than others (e.g. Nagelkerke R2). In R, some model-fitting procedures for ordinary logistic regression provide the Nagelkerke R-square as part of the standard output (e.g. lrm in Harrell’s Design package). However, no such measure is provided for the most widely used mixed logit model-fitting procedure (lmer in Bates’ lme4 library). Below I provide some code that provides Nagelkerke and CoxSnell pseudo R-squares for mixed logit models. Continue reading ‘Nagelkerke and CoxSnell Pseudo R2 for Mixed Logit Models’
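For readers who only see this excerpt, here is a minimal sketch of the two measures, computed from log-likelihoods. It is not the code from the full post: the helper name pseudo_r2 is made up for illustration, and the example uses an ordinary glm on the built-in mtcars data rather than a mixed model, so that it runs without extra packages. For a mixed logit model you would plug in the log-likelihoods of your fitted model and of whatever null model you consider appropriate (that choice is an assumption you have to make yourself).

```r
# Hypothetical helper (not the original post's code): Cox & Snell and
# Nagelkerke pseudo R-squares from two log-likelihoods and the sample size.
pseudo_r2 <- function(ll_model, ll_null, n) {
  # Cox & Snell: 1 - (L0 / L1)^(2/n), written on the log scale
  cox_snell <- 1 - exp((2 / n) * (ll_null - ll_model))
  # Largest value Cox & Snell can reach for this null model
  max_cox_snell <- 1 - exp((2 / n) * ll_null)
  # Nagelkerke rescales Cox & Snell so that its maximum is 1
  c(CoxSnell = cox_snell, Nagelkerke = cox_snell / max_cox_snell)
}

# Example with ordinary logistic regression on a built-in data set
fit  <- glm(am ~ wt, data = mtcars, family = binomial)
null <- glm(am ~ 1,  data = mtcars, family = binomial)
r2 <- pseudo_r2(as.numeric(logLik(fit)), as.numeric(logLik(null)), nobs(fit))
print(r2)
```

Both values are 0 when the model does no better than the null model, and Nagelkerke is never smaller than CoxSnell.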
It’s maaaaaaaaaa pleasure to announce that a couple of new folks will be joining/visiting HLP lab this Fall. It’s less of a pleasure to say that some of the folks will leave the lab to move on, but that’s how it goes. So, here comes an introduction and a farewell.
Two new students will join via the PhD program:
- Ting Qian has decided to join us for his graduate studies. He completed his B.Sc. in BCS (Artificial Intelligence track) at the University of Rochester after transferring from SUNY Oswego. He has worked on genetic algorithms, attribute-driven extraction of lexical classes, as well as on the distribution of information throughout discourses in written and spoken Mandarin Chinese (publications on the latter topic can be found on the HLP lab website).
- Masha Fedzechkina will join us coming from the University of Cologne (originally from Belarus), where she finished a Magister in Data Processing including classes in linguistics and CS. She’s planning to work on processing-driven effects on acquisition using both computational and behavioral methods.
Additionally, two graduate researchers from other universities will join the lab to lead our NSF-funded research project on Field-based Psycholinguistics in the Yucatan (joint work with Juergen Bohnemeyer, UB):
- Alice Lemieux will join us from the University of Chicago where she’s working on her PhD in Linguistics. She’s done other fieldwork (on Washo) before and is interested in language contact.
- Lindsay K. Butler will join us from the University of Arizona where she’s working on her PhD in Linguistics. Lindsay already has some background in Yucatec from an intensive summer workshop at UNC (including a couple of weeks in the Yucatan).
Alice and Lindsay will take classes on Yucatec from Juergen Bohnemeyer to obtain basic speaking knowledge of Yucatec and to learn about the linguistic structure of Yucatec. They will run sentence production studies on Yucatec Maya (in the Yucatan) including experiments on accessibility and weight effects on word order and morphological choices. I want to take this opportunity to publicly thank Lis Norcliffe and Tania Nikitina for absolutely invaluable help with the preparation of the grant. Lis hugely influenced the design of our experiments and co-wrote large sections of the grant. Of course, she’s not to be blamed for anything we may screw up. Thanks also go to Carlos Gomez Gallo and Katrina Housel who helped gather the pilot data for the grant during our previous visit to the Yucatan.
- In September, we also will have an HLP lab visitor. Elma Kerz is joining us from the RWTH Aachen in Germany where she teaches computer-based linguistics, psycholinguistics, and various other things. Her work includes papers on construction grammar, cognitive grammar, and grammaticalization.
Finally, HLP will have its first alumni this year:
- Benjamin VanDurme (“The Durmster”; PhD in CS with a Minor in Linguistics) got offers from Stanford (post-doc with Dan Jurafsky and Chris Manning) and the Human Language Technology Center of Excellence at Hopkins. After much consideration he chose Hopkins as research faculty where he will start soon. Most of Ben’s work over recent years was with Len Schubert and, more recently, also with Dan Gildea, but he’s also been involved in several HLP lab projects including work on the link between words’ redundancy in context and their pronunciation as well as work assessing the use of Google n-grams for research on language processing. It’s unclear what will happen to our espresso machine, now that he’s gone. Ben, just you wait and see. You’ll come back for that (lukewarm) espresso!
- Carlos Gomez Gallo is about to wrap things up, too (things=PhD in CS with a Minor in Linguistics). He received post-doc offers from Minnesota and Harvard, and an offer to join the American University at Beirut as faculty. He’s going to join Maria Polinsky’s lab at Harvard to run cross-linguistic studies on linguistic representations and processing. It looks like collaborations (among other things on Spanish and Maya data we collected in the Yucatan last Winter) will continue since Masha is interested in similar questions. Carlos is also about to finish a manuscript on work investigating incremental language production beyond the clausal level. Much of his work at Rochester was concerned with the creation and use of the Fruitcart corpus (which he will not fail to mention if you run into him). This work was done in collaboration with James Allen and others in his lab. Carlos, good luck at Harvard!
So, welcome and ciao ciao (but visit).
Berlin – it still rocks
Ok, every once in a while I should be allowed a kinda personal post here (I just decided that). So, I choose … dickes B. I just had a couple of wonderful days in Berlin. It still is full of surprises and August is still the best time to visit the best city ever. Why? Continue reading ‘Berlin – it still rocks’
The LSA Summer Institute is almost over and it has been a lot of fun so far. I didn’t get to see nearly as many talks and classes as I had hoped to, but instead there were tons of interesting conversations, new ideas, and just nice moments hanging out in the sun.
Brief update: It couldn’t have been different: I missed my flight. That happens every time I try to leave the Bay area. I am so used to it, I am not even trying to be on time anymore. Ah well, it gives me a chance to enjoy a cappuccino in my favorite SF Cafe (Ritual Roasters) and even to attend Dan’s party (yippie!). Oh, and to upload some random pictures from the class room. Yeah, pretty dark I know. If you have better pictures, can you send them to me so I can upload them? Also, here are some pics from our office hours at Caffee Strada (thanks to Judith and Alex for a great job!):
- Random class room shot (2)
- Random class room shot
- Late night “office hours” at Jupiter’s
- Michi smiling with TGrep2 at his command (almost!)
- Judith and Alex working hard to spread the word of Switchboard
- Judith (at hour 2 of 6)
- hmm, probably at Jupiter’s again
LSA125-ers: thanks for an enjoyable class, for all the questions, and I hope you keep enjoying your projects (or, if nothing else, now know for certain that you really really never want to work with corpora). Send us an update about your papers as they progress.
To everyone else out there: If you’re interested in the use of syntactic corpora to investigate language production, you may find our LSA125 class webpage useful (see especially the links and information on the corpus pages, but also the slides). If you use material from this page, please let us know. Thanks to Judith, we now have a nicely documented version of the TGrep2 Database Tools, which we have dubbed TDTlite. Alex and Judith have also prepared example projects. TDTlite allows you to combine the output of TGrep2 searches on syntactic corpora into a nice tab-delimited database that can be imported into R, Excel, or the stats program of your choice. While it doesn’t give you the full flexibility of scripting things yourself, it makes it considerably easier to start your own corpus-based project. We’re in the process of polishing things up for distribution (thanks to all the brave members of our class who helped us to understand which parts still need further improvement!). So, if something like that might be of interest to you, let us know whether you would like further information. We hope to have a beta release by the end of August.
Come to Cyberling
For all those of you at the LSA 09 Summer Institute: If you’re interested in replicability, scientific standards, resource creation and sharing, etc., come to next weekend’s Cyberling workshop. We need your input.
Two paper updates
Just a quick note. It feels good to be able to announce that the overview paper on cross-linguistic production (written with Elisabeth Norcliffe) is now available online. Thanks for your feedback and all the references that were sent our way.
Also: I’ve finally written up the first study from my thesis. Well, a considerably updated version of it. Anyway, if you’re interested in Uniform Information Density and/or the syntactic reduction of complement clauses in spontaneous speech … have a look at the paper… feedback is welcome. Oh, and did I mention that it is a corpus-based study?
Over and out (from lovely Berkeley, enjoying the LSA Summer Institute)
….. during the LSA Summer Institute at Berkeley. Come and join us. More information and registration can be found here.
I am always annoyed that one has to remind R to reduce the number of levels of a factor after a subset (of the original data set) has been created. In addition to screwing up tables (b/c they will contain all the zero rows/columns, too), this also can affect comparison of factor values (“Factors do not have the same number of levels”), and it makes RData files much bigger than they need to be. In our lab, we often work with large data files (up to 800,000 rows and 100-350 variables are relatively common), so that an RData file containing just that data.frame can easily be 100MB+. Say you select 5,000 rows out of 800,000: that may still leave you at an RData file size of 50MB+ because R remembers all original levels for all factors still in the data.frame. The little script I attach below takes either a factor or a data.frame as input and returns the factor or the data.frame in such a way that only levels still in the data are considered. In the files I recently worked with, that reduced the file size by 90%+, which in turn leads to a considerable speed-up in analyzing the data (I mean, on a small laptop, you will definitely feel the difference …). Anyway, nothing big, but maybe some of you will find it useful: Continue reading ‘little function to clean up factor variables after subset-ing’
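In case you only see this excerpt, the idea can be sketched roughly as follows. This is not the script from the full post: the function name drop_unused_levels is made up for illustration, and the sketch simply relies on the fact that calling factor() on a factor keeps only the levels that still occur. Newer versions of R also ship a built-in droplevels() that does much the same thing.

```r
# Hypothetical helper (not the original script): drop unused factor levels
# from a single factor or from every factor column of a data.frame.
drop_unused_levels <- function(x) {
  if (is.factor(x)) {
    # Re-creating the factor keeps only the levels still present
    return(factor(x))
  }
  if (is.data.frame(x)) {
    is_fac <- vapply(x, is.factor, logical(1))
    if (any(is_fac)) {
      x[is_fac] <- lapply(x[is_fac], factor)
    }
    return(x)
  }
  stop("expected a factor or a data.frame")
}

# Toy example: subsetting keeps all four original levels around
d <- data.frame(word = factor(c("a", "b", "c", "d")), n = 1:4)
sub <- d[d$word %in% c("a", "b"), ]
nlevels(sub$word)                      # 4: the unused levels are still there
nlevels(drop_unused_levels(sub)$word)  # 2: only the levels still in the data
```

Non-factor columns are left untouched, and the relative order of the remaining levels is preserved.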
The CCC, NSF, and CRA have announced a joint funding program for Computing Innovation Fellows. It provides 1-to-2 years of post-doc funding and you can take that funding to any lab registered on their webpage. You can also join several labs during that post-doc period. Check out our CI fellows page.






