Segments - Hierarchical Cluster Analysis

Creates a dendrogram showing the distances between variables, computed using hierarchical cluster analysis. See What is Hierarchical Clustering?, What is Dendrogram? and What are the Strengths and Weaknesses of Hierarchical Clustering? for more information on hierarchical clustering and dendrograms.

Usage

In Q, go to Create > Segments > Hierarchical Cluster Analysis.

In Displayr, go to Insert > Group/Segment > Hierarchical Cluster Analysis.

A new object will be added to the Page and the object inspector will become available on the right-hand side of the screen. In the object inspector under Inputs > Variables select the variables from your data that you want to include in your analysis.

Options

Variables: The variables that you would like to analyze. Select two or more numeric, categorical, or ordered categorical variables.

Number of clusters: The number of clusters to color-code in the dendrogram (1 to 100; the default is 1). It cannot exceed the number of variables in the analysis.
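
As a rough illustration of how a number of clusters turns into color groups, the sketch below cuts a small tree with base R's cutree and maps each variable to a color. The data frame toy and its columns a to d are invented for this example and are not part of Q; the palette mirrors the one used in the Code section.

```r
# Hypothetical toy data: 6 cases measured on 4 variables (values invented).
toy <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                  b = c(1, 2, 3, 4, 5, 7),
                  c = c(6, 5, 4, 3, 2, 1),
                  d = c(6, 5, 4, 3, 1, 1))

# The analysis clusters variables (columns), so transpose before dist().
hc <- hclust(dist(t(toy)), method = "ward.D2")

# Assign each variable to one of two clusters.
groups <- cutree(hc, k = 2)

# One color per cluster; variables in the same cluster share a color.
cluster.colors <- c("red", "blue")[groups]
print(data.frame(variable = names(groups), cluster = groups, color = cluster.colors))
```

With the number of clusters set to 1, every label would receive the same color.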

Distance: The formula used to compute the distance between variables, prior to clustering. The default is Euclidean. The options are (refer to the R function dist for more information):

Euclidean
Maximum
Manhattan
Canberra
Binary
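
To make the difference between these options concrete, the following sketch computes all five distances between two small vectors using base R's dist. The vectors x and y are invented for illustration; in the analysis itself each variable plays the role of one row.

```r
# Two hypothetical variables observed on four cases (values invented).
x <- c(1, 1, 2, 0)
y <- c(2, 3, 0, 4)
m <- rbind(x = x, y = y)  # dist() compares rows, so each variable is a row

methods <- c("euclidean", "maximum", "manhattan", "canberra", "binary")

# One distance per method between the same pair of variables.
distances <- sapply(methods, function(method) as.numeric(dist(m, method = method)))
print(round(distances, 3))
```

Note that the binary distance only looks at whether each value is zero or non-zero, so it is mainly of interest for indicator-style data.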

Clustering method: The algorithm used to form the clusters. The default is "Ward2 (ward.D2)", which is usually known as Ward's method. The options are (refer to the R function hclust for more information):

Ward1 (ward.D)
Ward2 (ward.D2)
Single
Complete
Average
McQuitty
Median
Centroid
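
The effect of the clustering method can be explored by running hclust with each option on the same distance matrix. The sketch below does this for randomly generated data; the data frame toy and the variable names v1 to v5 are assumptions made for the example only.

```r
# Hypothetical data: 5 variables (columns) over 8 cases, generated for illustration.
set.seed(1)
toy <- as.data.frame(matrix(rnorm(40), nrow = 8,
                            dimnames = list(NULL, paste0("v", 1:5))))
d <- dist(t(toy))  # Euclidean distances between variables

methods <- c("ward.D", "ward.D2", "single", "complete",
             "average", "mcquitty", "median", "centroid")

# Fit one tree per method from the same distance matrix.
trees <- lapply(setNames(methods, methods), function(m) hclust(d, method = m))

# Height of the last merge under each method; identical distances can
# still lead to differently shaped trees.
final.heights <- sapply(trees, function(tree) tree$height[length(tree$height)])
print(round(final.heights, 3))
```

According to the hclust documentation, the median and centroid methods do not guarantee monotonically increasing merge heights, so their dendrograms can be harder to read.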

Variable names: Display variable names instead of variable labels in the output.

Categorical as binary: Represent unordered categorical variables as binary (dummy) variables. Otherwise, they are represented as sequential integers (i.e., 1 for the first category, 2 for the second, etc.). Numeric - Multi variables are treated according to their numeric values and not converted to binary.
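
The two encodings can be sketched in base R as follows. This only illustrates the idea; the factor region is invented, and the exact columns produced by the Q code (via TidyRawData) may differ from what model.matrix returns here.

```r
# Hypothetical unordered categorical variable (values invented).
region <- factor(c("North", "South", "East", "South", "North"),
                 levels = c("North", "South", "East"))

# Option off: each category becomes a sequential integer (1, 2, 3, ...).
integer.codes <- as.integer(region)

# Option on: one 0/1 indicator column per category.
indicators <- model.matrix(~ region - 1)

print(integer.codes)
print(indicators)
```

The integer coding implies an ordering and spacing between categories that an unordered variable does not have, which is the usual reason to prefer the binary representation.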

Label margin: Set the width of the right-hand margin to accommodate long labels. The default is 200.

Acknowledgements

The R package networkD3 is used to create the dendrogram, while hierarchical clustering is performed by the hclust function in the stats R package.

Code

// Object inspector form definition (JavaScript).
var heading_text = 'Hierarchical Cluster Analysis';
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(heading_text, heading_text);
else 
    form.setHeading(heading_text);
form.dropBox({name: "formVariables", label: "Variables", types: ["V:numeric, categorical, ordered categorical"],
              multi:true, min_inputs: 2, height: 8, prompt: "Select two or more Variables."});
form.numericUpDown({name: "formClusters", label: "Number of clusters", default_value: 1, increment: 1, maximum:100, minimum: 1,
                    prompt: "Specify the number of clusters to color-code in the dendrogram"});
form.comboBox({name: "formDistanceMethod", label: "Distance", alternatives: ["Euclidean", "Maximum", "Manhattan", "Canberra", "Binary"],
               default_value: "Euclidean", prompt: "Specify the method used to compute distances"});
form.comboBox({name: "formClusteringMethod", label: "Clustering method",
               alternatives: ["Ward1 (ward.D)", "Ward2 (ward.D2)", "Single", "Complete", "Average", "McQuitty", "Median", "Centroid"],
               default_value: "Ward2 (ward.D2)", prompt: "Specify the algorithm used to form the clusters"});
form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Display names instead of labels"});
form.checkBox({name: "binaryCat", label: "Categorical as binary", default_value: false, prompt: "Code categorical variables as dummy variables"});
form.numericUpDown({name: "formLabelMargin", label: "Label margin", default_value: 200, increment: 50, minimum: 0, maximum: 10000,
                    prompt: "Set the width of the right-hand margin to accommodate long labels"});

# R code that runs the analysis, using the form inputs defined above.
library(flipData)
library(networkD3)

dat <- TidyRawData(QDataFrame(formVariables), subset = QFilter, as.binary = binaryCat,
                    weights = QCalibratedWeight, as.numeric = TRUE,
                    missing = "Exclude cases with missing data",
                    extract.common.lab.prefix = !formNames)
if (!formNames)
    colnames(dat) <- sapply(dat, attr, "label")

weights <- attr(dat, "weights")
if (!is.null(weights))
    dat <- sweep(dat, 1, weights, "*")

number.segments <- formClusters
if (number.segments > ncol(dat))
    stop("You have more segments than variables in the analysis.")

# Computing the distance matrix.
distance.method <- switch(formDistanceMethod,
                          Euclidean = "euclidean",
                          Maximum = "maximum",
                          Manhattan = "manhattan",
                          Canberra = "canberra",
                          Binary = "binary")
distance.matrix <- dist(t(data.matrix(dat)), method = distance.method)
# Hierarchical cluster analysis.
clustering.method <- switch(formClusteringMethod,
                            `Ward1 (ward.D)` = "ward.D",
                            `Ward2 (ward.D2)` = "ward.D2",
                            Single = "single",
                            Complete = "complete",
                            Average = "average",
                            McQuitty = "mcquitty",
                            Median = "median",
                            Centroid = "centroid")
hc <- hclust(distance.matrix, clustering.method)
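# Repeat the palette until it covers every cluster, then color each
# variable's label according to the cluster it belongs to.
qColors <- c("red", "blue", "green", "brown", "orange")
while (number.segments > length(qColors))
    qColors <- c(qColors, qColors)
colors <- qColors[cutree(hc, number.segments)]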

diagonalNetwork(as.radialNetwork(hc))
hca <- dendroNetwork(hc,
    textColour = colors,
    width = QOutputSizeWidth * 72 - 20,
    height = QOutputSizeHeight * 72 - 20,
    margins = list(top = 0, right = formLabelMargin, bottom = 0, left = 5))
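
The R code above depends on objects that only exist inside Q and Displayr (QDataFrame, QFilter, QCalibratedWeight and the form inputs). As a rough stand-alone sketch of the same steps, the code below uses only base R and the built-in USArrests dataset; the choice of dataset, the fixed settings, and the name base.colors are assumptions for illustration, and weighting is left out.

```r
# Stand-alone sketch: a built-in dataset stands in for QDataFrame(formVariables).
dat <- datasets::USArrests          # 50 cases, 4 numeric variables
dat <- dat[complete.cases(dat), ]   # exclude cases with missing data

number.segments <- 2                # stands in for formClusters
if (number.segments > ncol(dat))
    stop("You have more segments than variables in the analysis.")

# Distances between variables (columns), then hierarchical clustering.
distance.matrix <- dist(t(data.matrix(dat)), method = "euclidean")
hc <- hclust(distance.matrix, method = "ward.D2")

# Recycle a small palette and color each variable by its cluster.
base.colors <- c("red", "blue", "green", "brown", "orange")
colors <- rep_len(base.colors, number.segments)[cutree(hc, number.segments)]
print(data.frame(variable = hc$labels, color = colors))

# Static dendrogram with base graphics when run interactively.
if (interactive())
    plot(hc, main = "Hierarchical Cluster Analysis")
```

With networkD3 installed, passing hc and colors to dendroNetwork, as in the code above, would give the interactive version of the plot.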

See Also

Further reading: Market Segmentation Software