Tuesday, September 23, 2014

Plot C5.0 decision trees in R

Introduction

In many cases, the user of C50 R-package wants to present the resulted decision tree in some graphical form. Such a plotting feature however is not available in the C50 package. This R code overcomes the plotting issue by providing automatic interpretation of the decision tree to GraphViz design language, enabling then to straightforwardly plot the tree model using the dot command of GraphViz. In case the user has not a local installation GraphViz (open source software), they may freely acquire it from this site.

The steps to plot the decision tree are the following:
  1. Generate your model using the C50 package in R, as usual.
  2. Interpret the model output and save it to a file, by calling C5.0.graphviz function (given later in this page), using as parameters your C5.0 model and a desired text output filename.
  3. In your operating system, call the GraphViz's dot command with proper parameter syntax (example given below).

Usage

C5.0.graphviz ( C5.0.model, filename, fontname ='Arial',  col.draw ='black', col.font ='blue',  col.conclusion ='lightpink',  col.question = 'grey78', shape.conclusion ='box3d',  shape.question ='diamond', bool.substitute = c('None',     'yesno','truefalse','TF'), prefix=FALSE, vertical=TRUE )

Arguments

C5.0.model       The name of a variable which is a valid C5.0 model result.
filename         The name of a file where the output GraphViz model will be saved to.
fontname         The font that will be used for the graph.
col.draw         The color of the drawing lines.
col.font         The color of the font.
col.conclusion   The color which will be used for conclusion nodes (tree leaves).
col.question     The color which will be used for question nodes (tree inner nodes).
shape.conclusion The shape which will be used for conclusion nodes (tree leaves).
shape.question   The shape which will be used for question nodes (tree inner nodes).
bool.substitute  A substitution may take place for boolean comparisons. Default is 'None' which will plot '= 0' and '= 1' on the respective decision tree branches. The option 'yesno', will plot 'no' and 'yes' respectively, the 'truefalse' will plot 'false' and 'true' and the 'TF' option will plot '.F.' and '.T.' (considering the value of 0 as 'false' and 1 as 'true')
prefix           When set to true, the class nodes will have the prefix 'Class' before the class number (useful to multiclass problems where the class is referred by a single number).
vertical         The orientation of the decision tree. Default is vertical, and if this is set to False, the tree is drawn from left to right.

Details

In GraphViz, the X11 color scheme, the SVG scheme, and the Brewer scheme are supported, with X11 being the default. For an exhaustive list of candidate colors, the user can check the respective GraphViz page here.

An exhaustive list of candidate node shapes, is also included in this GraphViz page.

Some information on GraphViz's fonts can be found here.

Value

If successful, the function will create a text file at the given directory, containing the decision tree model described in GraphViz's dot language.

Note

In respect of boolean substitutions using the bool.substitute parameter, it is noted that the routine in this version is not able to know whether the comparison is indeed of boolean nature, neither traces other boolean comparisons with arithmetic arguments (e.g. between '1' and '2'), i.e. the comparison is performed between '0' and '1' (both required, as inner node options, for a successful substitution).

Update: Version 2 extends the translation into multi-branched trees. Version 1 was able to handle only trees with binary splits. 
Version 2.2 corrects a missing initialization of the firstindent variable.

Example

We use the example from the C50 package, a data set from the MLC++ machine learning software for modeling customer churn.

library(C50)
data(churn)
treeModel <- C5.0(x = churnTrain[, -20], y = churnTrain$churn)
summary(treeModel) #to compare output
C5.0.graphviz(treeModel, 'c:\\dtreeout.txt', col.question ='cyan')

(The generated output of the C5.0.graphviz routine, contained in the dtreeout.txt file is shown in the Appendix, at the end of this page).

Then, in the operating system, we ensure we have access to dot command of the GraphViz package (having the directory either in path, or having navigated to the respective directory) and we enter the following command (here, presumed from a WINDOWS command prompt):

dot -Tpng c:\dtreeout.txt > c:\dtreeout.png

This command produces a graphic file named 'dtreeout.png'.  This file, includes the following graph, which depicts graphically the example decision tree (not shown in actual size here):



Code

#---------------------------------------------------------#
# Function: C5.0.graphviz                                 # 
# Version: 2.2.0                                          #
# Date: 26/09/2014                                        #
# Author: Athanasios Tsakonas                             #
# This code implements C5.0.graphviz conversion routine   #
#---------------------------------------------------------#

C5.0.graphviz <- function( C5.0.model, filename, fontname ='Arial',col.draw ='black',
col.font ='blue',col.conclusion ='lightpink',col.question = 'grey78',
shape.conclusion ='box3d',shape.question ='diamond', 
bool.substitute = 'None', prefix=FALSE, vertical=TRUE ) {

library(cwhmisc)  
library(stringr) 
treeout <- C5.0.model$output
treeout<- substr(treeout, cpos(treeout, 'Decision tree:', start=1)+14,nchar(treeout))
treeout<- substr(treeout, 1,cpos(treeout, 'Evaluation on training data', start=1)-2)
variables <- data.frame(matrix(nrow=500, ncol=4)) 
names(variables) <- c('SYMBOL','TOKEN', 'TYPE' , 'QUERY') 
connectors <- data.frame(matrix(nrow=500, ncol=3)) 
names(connectors) <- c('TOKEN', 'START','END')
theStack <- data.frame(matrix(nrow=500, ncol=1)) 
names(theStack) <- c('ITEM')
theStackIndex <- 1
currentvar <- 1
currentcon <- 1
open_connection <- TRUE
previousindent <- -1
firstindent <- 4
substitutes <- data.frame(None=c('= 0','= 1'), yesno=c('no','yes'),
truefalse=c('false', 'true'),TF=c('F','T'))
dtreestring<-unlist( scan(text= treeout,   sep='\n', what =list('character')))

for (linecount in c(1:length(dtreestring))) {
lineindent<-0
shortstring <- str_trim(dtreestring[linecount], side='left')
leadingspaces <- nchar(dtreestring[linecount]) - nchar(shortstring)
lineindent <- leadingspaces/4
dtreestring[linecount]<-str_trim(dtreestring[linecount], side='left') 
while (!is.na(cpos(dtreestring[linecount], ':   ', start=1)) ) {
lineindent<-lineindent + 1 
dtreestring[linecount]<-substr(dtreestring[linecount],
ifelse(is.na(cpos(dtreestring[linecount], ':   ', start=1)), 1,
cpos(dtreestring[linecount], ':   ', start=1)+4),
nchar(dtreestring[linecount]) )
shortstring <- str_trim(dtreestring[linecount], side='left')
leadingspaces <- nchar(dtreestring[linecount]) - nchar(shortstring)
lineindent <- lineindent + leadingspaces/4
dtreestring[linecount]<-str_trim(dtreestring[linecount], side='left')
}
if (!is.na(cpos(dtreestring[linecount], ':...', start=1)))
lineindent<- lineindent +  1 
dtreestring[linecount]<-substr(dtreestring[linecount],
ifelse(is.na(cpos(dtreestring[linecount], ':...', start=1)), 1,
cpos(dtreestring[linecount], ':...', start=1)+4),
nchar(dtreestring[linecount]) )
dtreestring[linecount]<-str_trim(dtreestring[linecount])
stringlist <- strsplit(dtreestring[linecount],'\\:')
stringpart <- strsplit(unlist(stringlist)[1],'\\s')
if (open_connection==TRUE) { 
variables[currentvar,'TOKEN'] <- unlist(stringpart)[1]
variables[currentvar,'SYMBOL'] <- paste('node',as.character(currentvar), sep='')
variables[currentvar,'TYPE'] <- shape.question
variables[currentvar,'QUERY'] <- 1
   theStack[theStackIndex,'ITEM']<-variables[currentvar,'SYMBOL']
theStack[theStackIndex,'INDENT'] <-firstindent 
theStackIndex<-theStackIndex+1
currentvar <- currentvar + 1
if(currentvar>2) {
  connectors[currentcon - 1,'END'] <- variables[currentvar - 1, 'SYMBOL']
}
   }
connectors[currentcon,'TOKEN'] <- paste(unlist(stringpart)[2],unlist(stringpart)[3])
if (connectors[currentcon,'TOKEN']=='= 0') 
connectors[currentcon,'TOKEN'] <- as.character(substitutes[1,bool.substitute])
if (connectors[currentcon,'TOKEN']=='= 1') 
connectors[currentcon,'TOKEN'] <- as.character(substitutes[2,bool.substitute])
if (open_connection==TRUE) { 
if (lineindent<previousindent) {
theStackIndex <- theStackIndex-(( previousindent- lineindent)  +1 )
currentsymbol <-theStack[theStackIndex,'ITEM']
} else
currentsymbol <-variables[currentvar - 1,'SYMBOL']
} else {  
currentsymbol <-theStack[theStackIndex-((previousindent -lineindent ) +1    ),'ITEM']
theStackIndex <- theStackIndex-(( previousindent- lineindent)    )
}
connectors[currentcon, 'START'] <- currentsymbol
currentcon <- currentcon + 1
open_connection <- TRUE 
if (length(unlist(stringlist))==2) {
 stringpart2 <- strsplit(unlist(stringlist)[2],'\\s')
variables[currentvar,'TOKEN'] <- paste(ifelse((prefix==FALSE),'','Class'), unlist(stringpart2)[2]) 
variables[currentvar,'SYMBOL'] <- paste('node',as.character(currentvar), sep='')
variables[currentvar,'TYPE'] <- shape.conclusion
variables[currentvar,'QUERY'] <- 0
currentvar <- currentvar + 1
connectors[currentcon - 1,'END'] <- variables[currentvar - 1,'SYMBOL']
open_connection <- FALSE
}
previousindent<-lineindent
}
runningstring <- paste('digraph g {', 'graph ', sep='\n')
runningstring <- paste(runningstring, ' [rankdir="', sep='')
runningstring <- paste(runningstring, ifelse(vertical==TRUE,'TB','LR'), sep='' )
runningstring <- paste(runningstring, '"]', sep='')
  for (lines in c(1:(currentvar-1))) {
  runningline <- paste(variables[lines,'SYMBOL'], '[shape="')
  runningline <- paste(runningline,variables[lines,'TYPE'], sep='' )
  runningline <- paste(runningline,'" label ="', sep='' )
  runningline <- paste(runningline,variables[lines,'TOKEN'], sep='' )
  runningline <- paste(runningline,
  '" style=filled fontcolor=', sep='')
  runningline <- paste(runningline, col.font)
  runningline <- paste(runningline,' color=' )
  runningline <- paste(runningline, col.draw)
  runningline <- paste(runningline,' fontname=')
  runningline <- paste(runningline, fontname)
  runningline <- paste(runningline,' fillcolor=')
  runningline <- paste(runningline,
  ifelse(variables[lines,'QUERY']== 0 ,col.conclusion,col.question))
  runningline <- paste(runningline,'];')
  runningstring <- paste(runningstring, runningline , sep='\n')
  }
  for (lines in c(1:(currentcon-1)))
  runningline <- paste (connectors[lines,'START'], '->')
  runningline <- paste (runningline, connectors[lines,'END'])
  runningline <- paste (runningline,'[label="')
  runningline <- paste (runningline,connectors[lines,'TOKEN'], sep='')
  runningline <- paste (runningline,'" fontname=', sep='')
  runningline <- paste (runningline, fontname)
  runningline <- paste (runningline,'];')
  runningstring <- paste(runningstring, runningline , sep='\n')
  }
runningstring <- paste(runningstring,'}')
cat(runningstring)
  sink(filename, split=TRUE)
cat(runningstring)
sink()
}

Appendix

The example decision tree as shown in the C50 package summary and the generated output by the C5.0.graphviz routine are shown below.

C50 summary output tree:

total_day_minutes > 264.4:
:...voice_mail_plan = yes:
:   :...international_plan = no: no (45/1)
:   :   international_plan = yes: yes (8/3)
:   voice_mail_plan = no:
:   :...total_eve_minutes > 187.7:
:       :...total_night_minutes > 126.9: yes (94/1)
:       :   total_night_minutes <= 126.9:
:       :   :...total_day_minutes <= 277: no (4)
:       :       total_day_minutes > 277: yes (3)
:       total_eve_minutes <= 187.7:
:       :...total_eve_charge <= 12.26: no (15/1)
:           total_eve_charge > 12.26:
:           :...total_day_minutes <= 277:
:               :...total_night_minutes <= 224.8: no (13)
:               :   total_night_minutes > 224.8: yes (5/1)
:               total_day_minutes > 277:
:               :...total_night_minutes > 151.9: yes (18)
:                   total_night_minutes <= 151.9:
:                   :...account_length <= 123: no (4)
:                       account_length > 123: yes (2)
total_day_minutes <= 264.4:
:...number_customer_service_calls > 3:
    :...total_day_minutes <= 160.2:
    :   :...total_eve_charge <= 19.83: yes (79/3)
    :   :   total_eve_charge > 19.83:
    :   :   :...total_day_minutes <= 120.5: yes (10)
    :   :       total_day_minutes > 120.5: no (13/3)
    :   total_day_minutes > 160.2:
    :   :...total_eve_charge > 12.05: no (130/24)
    :       total_eve_charge <= 12.05:
    :       :...total_eve_calls <= 125: yes (16/2)
    :           total_eve_calls > 125: no (3)
    number_customer_service_calls <= 3:
    :...international_plan = yes:
        :...total_intl_calls <= 2: yes (51)
        :   total_intl_calls > 2:
        :   :...total_intl_minutes <= 13.1: no (173/7)
        :       total_intl_minutes > 13.1: yes (43)
        international_plan = no:
        :...total_day_minutes <= 223.2: no (2221/60)
            total_day_minutes > 223.2:
            :...total_eve_charge <= 20.5: no (295/22)
                total_eve_charge > 20.5:
                :...voice_mail_plan = yes: no (20)
                    voice_mail_plan = no:
                    :...total_night_minutes > 174.2: yes (50/8)
                        total_night_minutes <= 174.2:
                        :...total_day_minutes <= 246.6: no (12)
                            total_day_minutes > 246.6:
                            :...total_day_charge <= 43.33: yes (4)
                                total_day_charge > 43.33: no (2)

Produced C5.0.graphviz dot description: 

digraph g {
graph [rankdir="TB"]
node1 [shape="diamond" label ="total_day_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node2 [shape="diamond" label ="voice_mail_plan" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node3 [shape="diamond" label ="international_plan" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node4 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node5 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node6 [shape="diamond" label ="total_eve_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node7 [shape="diamond" label ="total_night_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node8 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node9 [shape="diamond" label ="total_day_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node10 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node11 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node12 [shape="diamond" label ="total_eve_charge" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node13 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node14 [shape="diamond" label ="total_day_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node15 [shape="diamond" label ="total_night_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node16 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node17 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node18 [shape="diamond" label ="total_night_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node19 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node20 [shape="diamond" label ="account_length" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node21 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node22 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node23 [shape="diamond" label ="number_customer_service_calls" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node24 [shape="diamond" label ="total_day_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node25 [shape="diamond" label ="total_eve_charge" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node26 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node27 [shape="diamond" label ="total_day_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node28 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node29 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node30 [shape="diamond" label ="total_eve_charge" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node31 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node32 [shape="diamond" label ="total_eve_calls" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node33 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node34 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node35 [shape="diamond" label ="international_plan" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node36 [shape="diamond" label ="total_intl_calls" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node37 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node38 [shape="diamond" label ="total_intl_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node39 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node40 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node41 [shape="diamond" label ="total_day_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node42 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node43 [shape="diamond" label ="total_eve_charge" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node44 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node45 [shape="diamond" label ="voice_mail_plan" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node46 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node47 [shape="diamond" label ="total_night_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node48 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node49 [shape="diamond" label ="total_day_minutes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node50 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node51 [shape="diamond" label ="total_day_charge" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= cyan ];
node52 [shape="box3d" label =" yes" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node53 [shape="box3d" label =" no" style=filled fontcolor= blue  color= black  fontname= Arial  fillcolor= lightpink ];
node1 -> node2 [label="> 264.4" fontname= Arial ];
node2 -> node3 [label="= yes" fontname= Arial ];
node3 -> node4 [label="= no" fontname= Arial ];
node3 -> node5 [label="= yes" fontname= Arial ];
node2 -> node6 [label="= no" fontname= Arial ];
node6 -> node7 [label="> 187.7" fontname= Arial ];
node7 -> node8 [label="> 126.9" fontname= Arial ];
node7 -> node9 [label="<= 126.9" fontname= Arial ];
node9 -> node10 [label="<= 277" fontname= Arial ];
node9 -> node11 [label="> 277" fontname= Arial ];
node6 -> node12 [label="<= 187.7" fontname= Arial ];
node12 -> node13 [label="<= 12.26" fontname= Arial ];
node12 -> node14 [label="> 12.26" fontname= Arial ];
node14 -> node15 [label="<= 277" fontname= Arial ];
node15 -> node16 [label="<= 224.8" fontname= Arial ];
node15 -> node17 [label="> 224.8" fontname= Arial ];
node14 -> node18 [label="> 277" fontname= Arial ];
node18 -> node19 [label="> 151.9" fontname= Arial ];
node18 -> node20 [label="<= 151.9" fontname= Arial ];
node20 -> node21 [label="<= 123" fontname= Arial ];
node20 -> node22 [label="> 123" fontname= Arial ];
node1 -> node23 [label="<= 264.4" fontname= Arial ];
node23 -> node24 [label="> 3" fontname= Arial ];
node24 -> node25 [label="<= 160.2" fontname= Arial ];
node25 -> node26 [label="<= 19.83" fontname= Arial ];
node25 -> node27 [label="> 19.83" fontname= Arial ];
node27 -> node28 [label="<= 120.5" fontname= Arial ];
node27 -> node29 [label="> 120.5" fontname= Arial ];
node24 -> node30 [label="> 160.2" fontname= Arial ];
node30 -> node31 [label="> 12.05" fontname= Arial ];
node30 -> node32 [label="<= 12.05" fontname= Arial ];
node32 -> node33 [label="<= 125" fontname= Arial ];
node32 -> node34 [label="> 125" fontname= Arial ];
node23 -> node35 [label="<= 3" fontname= Arial ];
node35 -> node36 [label="= yes" fontname= Arial ];
node36 -> node37 [label="<= 2" fontname= Arial ];
node36 -> node38 [label="> 2" fontname= Arial ];
node38 -> node39 [label="<= 13.1" fontname= Arial ];
node38 -> node40 [label="> 13.1" fontname= Arial ];
node35 -> node41 [label="= no" fontname= Arial ];
node41 -> node42 [label="<= 223.2" fontname= Arial ];
node41 -> node43 [label="> 223.2" fontname= Arial ];
node43 -> node44 [label="<= 20.5" fontname= Arial ];
node43 -> node45 [label="> 20.5" fontname= Arial ];
node45 -> node46 [label="= yes" fontname= Arial ];
node45 -> node47 [label="= no" fontname= Arial ];
node47 -> node48 [label="> 174.2" fontname= Arial ];
node47 -> node49 [label="<= 174.2" fontname= Arial ];
node49 -> node50 [label="<= 246.6" fontname= Arial ];
node49 -> node51 [label="> 246.6" fontname= Arial ];
node51 -> node52 [label="<= 43.33" fontname= Arial ];
node51 -> node53 [label="> 43.33" fontname= Arial ]; }

References

None.