reduce levels

little function to clean up factor variables after subsetting


I am always annoyed that one has to remind R to reduce the number of levels of a factor after a subset of the original data set has been created. In addition to screwing up tables (because they will contain all the zero rows/columns, too), this can also affect comparisons of factor values (“Factors do not have the same number of levels”), and it makes RData files much bigger than they need to be. In our lab, we often work with large data files (up to 800,000 rows and 100-350 variables are relatively common), so an RData file containing just that data.frame can easily be 100MB+. Say you select 5,000 rows out of 800,000: that may still leave you with an RData file of 50MB+, because R remembers all original levels for all factors still in the data.frame.

The little script I attach below takes either a factor or a data.frame as input and returns it with only those levels that still occur in the data. In the files I recently worked with, that reduced the file size by 90%+, which in turn leads to a considerable speed-up in analyzing the data (I mean, on a small laptop, you will definitely feel the difference …). Anyway, nothing big, but maybe some of you will find it useful.
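To make the idea concrete, here is a minimal sketch of what such a helper could look like. It is not the script from this post: the name `reduce_levels` and the toy data are made up for illustration, and the only thing it relies on is that calling `factor()` on an existing factor rebuilds the level set from the values actually present.

```r
# Hypothetical helper (not the original script from this post):
# drop unused levels from a factor, or from every factor column
# of a data.frame.
reduce_levels <- function(x) {
  if (is.factor(x)) {
    # factor() keeps only the levels that still occur in x
    return(factor(x))
  }
  if (is.data.frame(x)) {
    for (nm in names(x)) {
      if (is.factor(x[[nm]])) {
        x[[nm]] <- factor(x[[nm]])
      }
    }
    return(x)
  }
  stop("x must be a factor or a data.frame")
}

# Toy data: three levels originally, only one left after subsetting
d   <- data.frame(id  = 1:6,
                  grp = factor(c("a", "a", "b", "b", "c", "c")))
sub <- d[d$grp == "a", ]

nlevels(sub$grp)                 # 3: R still remembers "b" and "c"
nlevels(reduce_levels(sub)$grp)  # 1: only "a" is left
```

Base R also ships `droplevels()`, which accepts a factor or a data.frame and should give the same result in simple cases like this one, so a hand-rolled helper is mostly useful if you want to add your own checks or messages.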