New text resources available


Two new resources have recently become available that may be of interest to the NLP and psycholinguistics communities. First, the New York Times has released "The New York Times Annotated Corpus," available through the LDC. It has been marked up with tags for people, places, topics, and organizations, and 650,000 of the 1.8 million articles (36%) contain human-written summaries. The LDC listing has the details, and there's a nice write-up of the release at the NYT Open Blog.

I could see this being especially useful to anyone thinking about topic models/LSA, text summarization, or possibly knowledge extraction. For any masochists out there, I reckon you could get a lot of love from the community if you were able to provide a parsed version of this corpus, either in their XML format, in the NITE format, or in a TGrep2-able form (or all three, if you want to be canonized as St. Parser the First).
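To make the topic-model/LSA angle concrete, here's a minimal LSA sketch in plain NumPy. The "summaries" below are made-up stand-ins (the actual corpus requires an LDC license); the point is just the mechanics: build a term-document matrix, take a truncated SVD, and compare documents in the reduced space.

```python
import numpy as np

# Toy stand-ins for article summaries (the real corpus requires an LDC license)
docs = [
    "stocks fell as markets reacted to the report",
    "the senate passed the budget bill",
    "markets rallied after the budget report",
    "the team won the championship game",
]

# Build a simple term-document count matrix
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[index[w], j] += 1

# LSA: a truncated SVD gives low-rank "topic" representations of the documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional vector per document

# Cosine similarity in the reduced space
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(doc_vectors[0], doc_vectors[2]))  # two finance-flavored docs
print(cos(doc_vectors[0], doc_vectors[3]))  # finance vs. sports
```

On real data you'd want tf-idf weighting and a much larger k, but the pipeline is the same.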

Second, a small group has used the Twitter API to collect a huge dataset of "tweets" from the site. Twitter is sometimes referred to as a "micro-blogging" service: users post messages (called "tweets") of 140 characters or fewer, and these are displayed either to the whole world or just to people in the author's social network. The corpus consists of 10 million such messages produced by 2.7 million unique users, with information about 58 million connections between users. Some information may also be available about each user, including the user's self-reported name, location, description, website, and time zone. The corpus was released and then withdrawn due to some concerns from the folks in charge of Twitter; it should be released again soon. The place to keep track of news on this corpus is the blog of the Infochimps folks.
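The released format isn't specified here, so as a rough sketch, suppose the 58 million connections come as a "follower → followed" edge list (the user IDs below are hypothetical). Loading it into forward and reverse adjacency sets makes simple network questions, like finding reciprocal connections, easy to ask:

```python
from collections import defaultdict

# Hypothetical "follower -> followed" edge list; the real corpus format
# may differ, so treat this as a stand-in for the 58M connections.
edges = [
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "alice"), ("bob", "carol"),
    ("dave", "alice"),
]

following = defaultdict(set)   # who each user follows
followers = defaultdict(set)   # who follows each user
for src, dst in edges:
    following[src].add(dst)
    followers[dst].add(src)

# One question you might ask: which connections are reciprocal ("mutual")?
mutual = {(a, b) for a, b in edges if a in following[b]}

print(sorted(following["alice"]))  # ['bob', 'carol']
print(len(mutual))                 # 2 (alice<->bob, counted in both directions)
```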

The applications are less obvious for this corpus and our lab, but it could still be a lot of fun. For one thing, we always wonder how external (usually cognitive) constraints influence the application of optimal production strategies. In this case, we have a hard constraint that messages be at most 140 characters. It might be interesting to implement this constraint in an information-theoretic model (channel capacity?), and to ask whether the system is optimal overall and optimal given that constraint. A second idea has to do with network structure: it might be neat to look at whether there's priming between tweets, and whether this depends on the connections between people. More directly, the corpus contains information about "conversations": cases where one user replies to another user's tweet. Anyone working on alignment, priming, or other conversation-dependent behaviors should be able to use this feature. In theory we could set up a Switchboard-style resource using just the conversation-like subset of the Twitter corpus. And if anyone wants to be St. Parser the Second, maybe you could try a dependency parser on these odd little blurbs.
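One crude way to start on the channel-capacity idea: estimate per-character entropy from a sample of tweets and convert the 140-character limit into an information budget in bits. The tweets below are invented, and a unigram character model is obviously a loose upper bound, but the arithmetic shows the shape of the analysis:

```python
import math
from collections import Counter

# Toy sample of tweets (invented); on the real corpus you'd use millions
sample = [
    "just landed in nyc, anyone around for coffee?",
    "reading about language models all afternoon",
    "the game last night was unbelievable",
]
text = " ".join(sample)

# Unigram character entropy in bits: H = -sum p(c) log2 p(c)
counts = Counter(text)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())

# The 140-character hard limit, expressed as a per-message information budget
budget_bits = 140 * entropy
print(f"~{entropy:.2f} bits/char, so a tweet carries at most ~{budget_bits:.0f} bits")
```

A better model (character n-grams, or word-level) would tighten the estimate; the interesting question is then whether actual tweets sit near that budget.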

Will you use this data? How?

(For anyone working on stats or visualization projects, you might want to check out the other datasets put out by Infochimps. There's some stuff there that would make for great classroom examples or visualization demos.)


Questions? Thoughts?
