I recently parsed the British National Corpus (BNC) using the latest version of the parser by Charniak’s group @ Brown. In running the results through ‘tgrep2 -p’ (i.e., building a corpus file), I ran into some troubles that I thought I’d put up here in case they save someone a bit of grief.
(Eventually I realized the problem stemmed from a few unmatched parentheses. An obvious source of error, and easily checked for, but given the following points it took a while before I checked for it specifically):
tgrep2 does not limit itself to line-by-line reading, as I had naively assumed. It reads from an opening paren until it finds the matching closing paren, newlines or no newlines.
tgrep2 does not complain if a file ends before the last tree is complete. That means a file containing only “(TOP (NN test)” will not throw an error. Early on I tried fragmenting my data into one-sentence fragments and building a corpus file from each, looking for errors. There were none (a sentence with a missing closing paren is exactly the kind of unfinished file that tgrep2 accepts silently), which mistakenly led me to believe each individual sentence was fine.
If a tree is missing a closing parenthesis, then because tgrep2 ignores newline characters (presumably in order to suck up “pretty printed” trees that span multiple lines), the subsequent lines of parse trees will all be read in as if they were subtrees of the earlier, uncompleted tree.
If the above happens over a short span, tgrep2 gives a warning that it can’t handle trees with, e.g., 400 kids unless the user specifies the “-K” flag.
If the above happens over a long span, tgrep2 simply segfaults with no error message. (Swell in a handbasket)
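To make the points above concrete, here is a toy model of a reader that treats newlines as ordinary whitespace. It is only a sketch of the behaviour described here, not tgrep2's actual code, and the function name split_trees is made up for illustration. It shows how one missing closing paren turns every later line into part of a single unfinished tree.

```python
def split_trees(text):
    """Split text into top-level bracketed chunks, ignoring line boundaries.

    Returns (complete_trees, leftover), where leftover is whatever remains
    of a tree that was still open when the input ran out.
    """
    trees = []
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i          # a new top-level tree begins here
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:         # the top-level tree is now closed
                trees.append(text[start:i + 1])
                start = None
    leftover = text[start:] if depth > 0 else ""
    return trees, leftover


good = "(TOP (NN a))\n(TOP (NN b))\n(TOP (NN c))\n"
# Same idea, but the second line is missing its final closing paren.
bad = "(TOP (NN a))\n(TOP (NN b)\n(TOP (NN c))\n(TOP (NN d))\n"

good_trees, good_left = split_trees(good)
bad_trees, bad_left = split_trees(bad)
print(len(good_trees), repr(good_left))
print(len(bad_trees), repr(bad_left))
```

In the second input only the first line comes out as a complete tree. Everything from the broken line onward ends up in the leftover, which is the "subtrees of the earlier tree" effect, and because the input simply ends there, a reader that does not check for an unfinished tree at end of file has nothing to complain about.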
Once I wrote a quick script for checking parenthesis counts, I found 33 trees (tending to be formulas) that had a right paren embedded inside a token. I still don’t know how these errors came to be. As a note: Charniak’s parser will replace a single parenthesis with a new symbol. If the paren is embedded inside a string, then technically there should be no problem, since a right paren used for bracketing can only be followed by whitespace or another right paren. tgrep2 does not buy this argument.
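For anyone who wants to run the same sanity check, here is a minimal sketch of a paren counter. It is not the script I originally used. It assumes the parser output has exactly one tree per line, and the name unbalanced_lines is made up for illustration.

```python
def unbalanced_lines(lines):
    """Yield (line_number, net_count, line) for each line whose parens don't balance.

    Assumes one complete tree per line. net_count is opens minus closes:
    positive means a closing paren is missing, negative means there is an
    extra one (for example, a right paren embedded inside a token).
    """
    for lineno, line in enumerate(lines, start=1):
        depth = 0
        went_negative = False
        for ch in line:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    went_negative = True   # closed more than was opened
        if depth != 0 or went_negative:
            yield lineno, depth, line.rstrip("\n")


sample = [
    "(TOP (NN test))\n",          # fine
    "(TOP (NN test)\n",           # missing a closing paren
    "(TOP (SYM a)b) (. .))\n",    # right paren embedded inside a token
]
for lineno, net, text in unbalanced_lines(sample):
    print(lineno, net, text)
```

To check a real file, pass an open file object instead of the sample list, e.g. unbalanced_lines(open(path)) where path is wherever your parser output lives. Any line it reports is worth fixing or dropping before handing the file to ‘tgrep2 -p’.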