Wednesday, September 23, 2015

Dynamically constructed DDPLY


The function ddply from the plyr library is a powerful tool for performing grouped operations on data frames. For example, to summarize the ChickWeight data (available in the default R datasets) by chicken diet, we can use:

> library(plyr)
> ddply(ChickWeight, .(Diet), summarize, weight=mean(weight), Time=median(Time))

The above call returns one row per diet, showing the mean weight and median time recorded for the chickens on each of the four diets (the dataset covers 50 chickens in total).
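To make the grouping step more concrete, here is a rough base-R sketch of the same split-apply-combine idea. It is only an illustration of what the ddply call computes, not how plyr is implemented internally, and the variable names pieces and res are my own.

```r
# Split the data frame into one piece per diet
pieces <- split(ChickWeight, ChickWeight$Diet)

# Summarize each piece into a one-row data frame
rows <- lapply(pieces, function(p) {
  data.frame(Diet = p$Diet[1],
             weight = mean(p$weight),
             Time = median(p$Time))
})

# Combine the per-diet rows into a single data frame
res <- do.call(rbind, rows)
res
```

The ddply call does the splitting, summarizing and combining in a single step, which is why it is convenient as long as every referenced column exists.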

However, there are cases where the structure of the data frame is not known in advance and some columns might be missing. This is common when the ddply command is used inside a user-defined function. In such a case, a ddply command with a fixed column specification will fail. Consider the following:

> ChickWeight2 <- ChickWeight[,-2]
> ddply(ChickWeight2, .(Diet), summarize, weight=mean(weight), Time=median(Time)) #this will fail

The above ddply command will fail with an "object 'Time' not found" error. To deal with such situations, I present a solution below. The code adds each column, together with its desired aggregate function, only if that column exists in the data frame.

First we set up one dataset (complete or with missing columns):

>#create a dataset with all or fewer columns, choose one of these lines
>ChickWeight2 <- ChickWeight[,-2] #missing column 'Time'
># OR
>ChickWeight2 <- ChickWeight #includes column 'Time'

Then we apply the following commands, which build a list of arguments for a call to ddply.

>plyrlist <- .()  #initialize an empty list
>if ('weight' %in% colnames(ChickWeight2)) plyrlist <- c(plyrlist, .(weight = mean(weight))) #add optional term, conditional on column existence
>if ('Time' %in% colnames(ChickWeight2)) plyrlist <- c(plyrlist, .(Time = median(Time))) #add optional term, conditional on column existence

>do.call(ddply,c(.data = quote(ChickWeight2), .variables = 'Diet',.fun = quote(summarize), plyrlist)) #call to ddply, this works irrespective of whether Time exists

This approach can be expanded to handle as many columns as desired.
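As a rough sketch of such an expansion, the conditional terms can be generated in a loop rather than written out one by one. The helper name summarise_present and its aggs argument (a named character vector mapping each column name to the name of an aggregate function) are hypothetical and not part of plyr. Unlike the commands above, the sketch passes the data frame and summarize to do.call by value rather than quoted, and it builds the unevaluated terms with call() instead of .().

```r
library(plyr)

# Hypothetical helper (not part of plyr): summarize only the columns
# that actually exist in `data`.
#   data  - a data frame
#   group - name of the grouping column, e.g. "Diet"
#   aggs  - named character vector: column name -> aggregate function name
summarise_present <- function(data, group, aggs) {
  # Keep only the aggregates whose column is present
  present <- aggs[names(aggs) %in% colnames(data)]

  # Build one unevaluated call per available column, e.g. mean(weight)
  terms <- lapply(names(present), function(col) {
    call(present[[col]], as.name(col))
  })
  names(terms) <- names(present)

  # Assemble the argument list and call ddply
  args <- c(list(.data = data, .variables = group, .fun = summarize), terms)
  do.call(ddply, args)
}

aggs <- c(weight = "mean", Time = "median")
res_full <- summarise_present(ChickWeight, "Diet", aggs)        # Time present
res_part <- summarise_present(ChickWeight[, -2], "Diet", aggs)  # Time missing
```

With this layout, supporting another optional column only requires adding one entry to aggs.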