This post is looong overdue: in January 2009, Ben Van Durme and I reparsed the British National Corpus (BNC) with the Charniak-Johnson parser and encoded it for TGrep2, with the cool extra feature that each sentence node is now linked to a set of meta-data about the text itself and about the speaker/author. We wrote a little Python script that takes TGrep2 match IDs and gives you a file with the meta-data for all sentences with those IDs. For the written part of the BNC, the following information is available: medium, genre, domain, age and sex of the intended audience, size of the text in words, date of publication, and age and gender of the author. For the spoken part: age, sex, social class, dialect/accent, first language, and level of education of the speaker, as well as whether or not the speaker was recruited.
This makes it a neat resource for psycho- and sociolinguistic research. If you’re interested in using it (and have a license for the BNC), let us know.
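To give a rough idea of what the match-ID-to-meta-data step looks like, here is a minimal sketch. It is not our actual script: the ID format, the field names, and the in-memory `METADATA` table are all made up for illustration, and the helper names `metadata_rows` and `write_tsv` are hypothetical.

```python
import csv
import io

# Hypothetical meta-data table: sentence ID -> meta-data fields.
# The IDs, field names, and values are invented for illustration only.
METADATA = {
    "s1": {"part": "written", "genre": "fiction", "author_sex": "female"},
    "s2": {"part": "spoken", "speaker_age": "25-34", "speaker_sex": "male"},
}


def metadata_rows(match_ids, metadata):
    """Return one row per distinct matched sentence ID found in the table.

    Duplicate IDs (a sentence can match a pattern more than once) and IDs
    missing from the table are skipped; first-seen order is preserved.
    """
    rows = []
    seen = set()
    for sid in match_ids:
        if sid in seen or sid not in metadata:
            continue
        seen.add(sid)
        rows.append({"id": sid, **metadata[sid]})
    return rows


def write_tsv(rows, out):
    """Write rows as tab-separated text, filling absent fields with NA."""
    fields = ["id"] + sorted({key for row in rows for key in row if key != "id"})
    writer = csv.DictWriter(
        out, fieldnames=fields, delimiter="\t", restval="NA", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
```

Calling `write_tsv(metadata_rows(ids, METADATA), sys.stdout)` with a list of match IDs would print one line per matched sentence. Written and spoken sentences carry different fields, so the sketch pads the missing ones with `NA`, which keeps the output easy to load into R or a spreadsheet.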