little function to clean up factor variables after subset-ing



I am always annoyed that one has to remind R to reduce the number of levels of a factor after a subset of the original data set has been created. In addition to screwing up tables (because they will contain all the zero rows/columns, too), this can also affect comparisons of factor values (“Factors do not have the same number of levels”), and it makes RData files much bigger than they need to be. In our lab, we often work with large data files (up to 800,000 rows and 100-350 variables are relatively common), so an RData file containing just that data.frame can easily be 100MB+. Say you select 5,000 rows out of 800,000: that may still leave you with an RData file of 50MB+, because R remembers all original levels for all factors still in the data.frame. The little script below takes either a factor or a data.frame as input and returns it with only those levels that still occur in the data. In the files I recently worked with, that reduced the file size by 90%+, which in turn leads to a considerable speed-up in analyzing the data (I mean, on a small laptop, you will definitely feel the difference …). Anyway, nothing big, but maybe some of you will find it useful:

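myFactorCleanup <- function(x) {
	# A single factor: rebuild it from its values so that unused levels are dropped.
	# Note: going through as.character() re-sorts the levels alphabetically.
	if (is.factor(x)) {
		return(as.factor(as.character(x)))
	}
	# A data.frame (or matrix): clean up every factor column.
	if (is.data.frame(x) || is.matrix(x)) {
		x <- as.data.frame(x)
		# seq_len() is safe even when there are zero columns (1:0 would loop twice)
		for (i in seq_len(ncol(x))) {
			if (is.factor(x[, i])) { x[, i] <- as.factor(as.character(x[, i])) }
		}
		return(x)
	}
	# Anything else is returned unchanged.
	x
}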
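
To make the problem concrete, here is a minimal sketch on made-up data (the data.frame `d` and its columns `speaker` and `rt` are hypothetical, not from our lab files). The function is restated in compact form only so that the snippet runs on its own.

```r
# Compact restatement of the function above, so this snippet is self-contained.
myFactorCleanup <- function(x) {
  if (is.factor(x)) return(as.factor(as.character(x)))
  if (is.data.frame(x) || is.matrix(x)) {
    x <- as.data.frame(x)
    for (i in seq_len(ncol(x))) {
      if (is.factor(x[, i])) x[, i] <- as.factor(as.character(x[, i]))
    }
    return(x)
  }
  x
}

# Hypothetical toy data: 26 speakers with 4 observations each.
d <- data.frame(speaker = factor(rep(letters, each = 4)),
                rt = seq_len(104))

# Keep only two speakers; the factor still remembers all 26 levels.
sub <- d[d$speaker %in% c("a", "b"), ]
n_before <- nlevels(sub$speaker)   # 26
table(sub$speaker)                 # 24 of the 26 cells are zero

# After the cleanup only the levels that still occur are left.
clean <- myFactorCleanup(sub)
n_after <- nlevels(clean$speaker)  # 2
table(clean$speaker)
```

How much disk space this saves depends on how many factors and levels the original data has, so the toy example only shows the change in levels, not the file size.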

3 thoughts on “little function to clean up factor variables after subset-ing”

    dej said:
    June 14, 2009 at 10:35 pm
    tiflo said:
    June 14, 2009 at 10:58 pm

    lol. nice. I had been looking for this. so, actually their function is much nicer (thanks!):

    drop.levels <- function(dat){
    # Drop unused factor levels from all factors in a data.frame
    # Author: Kevin Wright. Idea by Brian Ripley.
    dat[] <- lapply(dat, function(x) x[,drop=TRUE])
    return(dat)
    }
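
For readers on newer versions of R: since R 2.12.0, base R ships `droplevels()`, which does this for both factors and data frames. The sketch below uses made-up data (`d`, `cond` and `y` are hypothetical names) and also shows one difference from the `as.factor(as.character(x))` approach: `droplevels()` keeps the existing level order, while the round trip through character re-sorts the levels alphabetically.

```r
# Hypothetical toy data with a deliberately non-alphabetical level order.
d <- data.frame(cond = factor(c("low", "mid", "high", "low"),
                              levels = c("low", "mid", "high")),
                y = 1:4)

# Subsetting removes all "mid" rows but keeps "mid" as a level.
sub <- d[d$cond != "mid", ]
levels(sub$cond)        # "low" "mid" "high"

# droplevels() removes the unused level and preserves the order.
dropped <- droplevels(sub)
levels(dropped$cond)    # "low" "high"

# The as.character() round trip also drops it, but re-sorts the levels.
resorted <- as.factor(as.character(sub$cond))
levels(resorted)        # "high" "low"
```

Whether the level order matters depends on the analysis; it does, for example, when the first level serves as the reference level in a model's treatment contrasts.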


    […] little function to clean up factor variables after subset-ing « HLP/Jaeger lab blog – dat[] <- lapply(dat, function(x) x[,drop=TRUE]) (tags: data.frame R cleanup) […]

