Segments - K-Means Cluster Analysis


Identifies clusters using K-Means cluster analysis. See What is k-Means Cluster Analysis? and What is Cluster Analysis? for more information.

Example

The output from K-Means shows the average value of each variable (rows) by cluster (columns). Where weights are provided, the percentages are weighted but the sample sizes (n) are not. The Variance Explained is a multivariate R-squared statistic, sometimes known as omega-squared in the cluster analysis literature. The Calinski-Harabasz statistic can be useful when selecting the number of segments (higher is better). However, it should not be relied upon as the ultimate arbiter of the number of segments, as it is not particularly scientific.
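
As a rough illustration of the two fit statistics, the sketch below computes them from raw data and a vector of cluster assignments using the standard unweighted textbook definitions. The function name clusterFitStatistics is hypothetical and not part of Q or Displayr. The sketch assumes complete data and at least one case in every cluster, and the exact computation used by the software (particularly with weights or missing data) may differ.

```javascript
// Illustrative helper (not part of Q or Displayr): computes Variance Explained
// and the Calinski-Harabasz statistic for unweighted, complete data.
function clusterFitStatistics(data, assignments, k) {
    const n = data.length;
    const p = data[0].length;

    // Grand mean of each variable.
    const grandMean = new Array(p).fill(0);
    for (const row of data)
        row.forEach((value, j) => { grandMean[j] += value / n; });

    // Mean of each variable within each cluster.
    const sizes = new Array(k).fill(0);
    const centers = Array.from({length: k}, () => new Array(p).fill(0));
    data.forEach((row, i) => {
        sizes[assignments[i]] += 1;
        row.forEach((value, j) => { centers[assignments[i]][j] += value; });
    });
    centers.forEach((center, c) =>
        center.forEach((sum, j) => { center[j] = sum / sizes[c]; }));

    // Total and within-cluster sums of squares, summed over all variables.
    let totalSS = 0;
    let withinSS = 0;
    data.forEach((row, i) => {
        row.forEach((value, j) => {
            totalSS += (value - grandMean[j]) ** 2;
            withinSS += (value - centers[assignments[i]][j]) ** 2;
        });
    });
    const betweenSS = totalSS - withinSS;

    return {
        varianceExplained: betweenSS / totalSS,
        calinskiHarabasz: (betweenSS / (k - 1)) / (withinSS / (n - k))
    };
}

// Toy data: two tight clusters that differ on the first variable.
const toyData = [[0, 0], [0, 2], [10, 0], [10, 2]];
const toyFit = clusterFitStatistics(toyData, [0, 0, 1, 1], 2);
console.log(toyFit.varianceExplained, toyFit.calinskiHarabasz);
```

For the toy data the between-cluster sum of squares is 100 out of a total of 104, so the Variance Explained is 100/104 (about 0.96) and the Calinski-Harabasz statistic is 50. Comparing the latter across solutions with different numbers of clusters gives the "higher is better" guidance described above.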

Usage

In Q, go to Create > Segments > K-Means Cluster Analysis

In Displayr, go to Insert > Group/Segment > K-Means Cluster Analysis

A new object will be added to the Page and the object inspector will become available on the right-hand side of the screen. In the object inspector under Inputs > Variables select the variables from your data that you want to include in your analysis.

Options

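Use Either K-Means (the default) or Variable containing segment membership. The K-Means option constructs a K-Means model and profiles the predicted clusters in one step. The Variable containing segment membership option profiles segments that are already stored in an existing variable.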

Variables The variables or a question containing variables to be used to identify the clusters.

Number of clusters The number of clusters to identify.

Missing data (see Missing Data Options):

Error if missing data
Exclude cases with missing data
Use partial data. This is the default.
Imputation (replace missing values with estimates)
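
The first three options can be thought of as a preprocessing step applied before clustering. The sketch below is only an illustration of that idea: the function applyMissingDataOption is hypothetical, missing values are assumed to be represented as null, and imputation is deliberately left out because this page does not describe the method used.

```javascript
// Illustrative preprocessing only. The option strings match the Missing data
// control, but the function itself is hypothetical. Missing values are null.
function applyMissingDataOption(data, option) {
    const isComplete = row => row.every(value => value !== null);
    switch (option) {
        case "Error if missing data":
            if (!data.every(isComplete))
                throw new Error("The data contains missing values.");
            return data;
        case "Exclude cases with missing data":
            // Drop any case with at least one missing value.
            return data.filter(isComplete);
        case "Use partial data":
            // Keep every case; later steps must skip the missing values.
            return data;
        default:
            // Imputation is not sketched: the method is not described here.
            throw new Error("Option not covered by this sketch: " + option);
    }
}

const casesWithGaps = [[1, 2], [3, null]];
console.log(applyMissingDataOption(casesWithGaps, "Exclude cases with missing data").length);
console.log(applyMissingDataOption(casesWithGaps, "Use partial data").length);
```

With Exclude cases with missing data the second case is dropped, whereas Use partial data keeps both cases and relies on the clustering step to work with whichever values are observed.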

Algorithm:

Batch This is the default and is the only algorithm that can accommodate weights or missing values. Refer to the Technical Details section below.
Hartigan-Wong Refer to kmeans for more information on this and the algorithms below.
Forgy
Lloyd
MacQueen

Output:

Means
Means table Show the cluster means. Best if wanting to export to another program.
Segment profiling table Show the composition of the Profiling variables within predicted clusters. More options to control the appearance are described here.

Weight Where a weight has been set for the R Output, the calibrated weight is used. See Weights in R.

Cluster labels An optional comma-separated list used to name the clusters predicted by the K-Means model.

Profiling variables Select other variables or variable sets to crosstab with the segment cluster.

SAVE VARIABLE(S)

Cluster Membership Saves a new variable to the data set that contains the segment membership.

Technical details

The Batch algorithm works as follows:

  1. The Hartigan-Wong K-Means algorithm is used to find clusters with missing data set to Exclude cases with missing data.
  2. Cases are assigned to the most similar cluster. Where Missing data is set to use partial data (the default), this means that cases that were ignored by Hartigan-Wong are now included in the analysis.
  3. The cluster centers are updated. Where weights have been applied, this means that the cluster centers now reflect weights (they were ignored by Hartigan-Wong).
  4. The previous two steps are repeated until either the maximum number of iterations, iter.max (which defaults to 100), has been exceeded or the Omega-Squared (the Variance Explained) no longer increases.
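
Steps 2 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the flipCluster implementation: the function batchRefine is hypothetical, initialCenters stands in for the Hartigan-Wong solution from step 1, missing values are represented as null, and similarity is assumed to be squared Euclidean distance over the variables observed for each case. Because the total sum of squares is fixed, the sketch stops when the weighted within-cluster sum of squares stops decreasing, which is equivalent to Omega-Squared no longer increasing.

```javascript
// Hypothetical sketch of steps 2 to 4 of the Batch algorithm.
function batchRefine(data, weights, initialCenters, iterMax = 100) {
    const k = initialCenters.length;
    const p = initialCenters[0].length;
    let centers = initialCenters.map(center => center.slice());
    let assignments = [];
    let bestWithinSS = Infinity;

    for (let iteration = 0; iteration < iterMax; iteration++) {
        // Step 2: assign each case to the closest center, using only the
        // variables observed for that case (assumed distance rule).
        const newAssignments = data.map(row => {
            let best = 0;
            let bestDistance = Infinity;
            centers.forEach((center, c) => {
                let distance = 0;
                row.forEach((value, j) => {
                    if (value !== null) distance += (value - center[j]) ** 2;
                });
                if (distance < bestDistance) { bestDistance = distance; best = c; }
            });
            return best;
        });

        // Step 3: recompute each center as a weighted mean of observed values.
        const sums = Array.from({length: k}, () => new Array(p).fill(0));
        const totals = Array.from({length: k}, () => new Array(p).fill(0));
        data.forEach((row, i) => {
            row.forEach((value, j) => {
                if (value === null) return;
                sums[newAssignments[i]][j] += weights[i] * value;
                totals[newAssignments[i]][j] += weights[i];
            });
        });
        const newCenters = centers.map((center, c) =>
            center.map((old, j) => totals[c][j] > 0 ? sums[c][j] / totals[c][j] : old));

        // Step 4: stop once the weighted within-cluster sum of squares no
        // longer decreases (Omega-Squared no longer increases).
        let withinSS = 0;
        data.forEach((row, i) => {
            row.forEach((value, j) => {
                if (value !== null)
                    withinSS += weights[i] * (value - newCenters[newAssignments[i]][j]) ** 2;
            });
        });
        if (withinSS >= bestWithinSS) break;
        bestWithinSS = withinSS;
        centers = newCenters;
        assignments = newAssignments;
    }
    return {centers: centers, assignments: assignments};
}

// Cases 2 and 4 have missing values, so Hartigan-Wong would have ignored them.
const partialData = [[1, 1], [2, null], [9, 9], [null, 10]];
const refined = batchRefine(partialData, [1, 3, 1, 1], [[1, 1], [9, 9]]);
console.log(refined.centers, refined.assignments);
```

In this toy example the two incomplete cases join the clusters found from the complete cases, and the first center moves from 1 to 1.75 on the first variable because the second case carries a weight of 3.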

Acknowledgements

Uses the kmeans function from the stats R package.

Code

var defaultSegType = "K-Means";
var segmentType = form.comboBox({name: "formSegmentType", label: "Use", alternatives: ["K-Means", "Variable containing segment membership"], default_value: defaultSegType}).getValue();

if (segmentType == "K-Means")
{
    pageTitle = 'K-Means Cluster Analysis';
    form.dropBox({label: "Variables", types: ["Q: pickone, pickonemulti, number, numbermulti, numbergrid, pickany, pickanycompact, pickanygrid", "Variable: Numeric, Date, Money, Categorical, OrderedCategorical"],  name: "formVariables", multi: true, min_inputs: 1, height: 8, prompt: "Select at least two Variables"}) 
    form.numericUpDown({label: "Number of clusters", increment: 1, minimum: 2, maximum: 1000, default_value: 2, name: "formNumberClusters", prompt: "Specify the number of clusters to identify"})
    form.comboBox({label: "Missing data",  alternatives: ["Error if missing data", "Exclude cases with missing data", "Use partial data", "Imputation (replace missing values with estimates)"],  name: "formMissing", default_value: "Use partial data", prompt: "Options for handling cases with missing data"})
    form.comboBox({label: "Algorithm", alternatives: ["Batch", "Hartigan-Wong", "Forgy", "Lloyd", "MacQueen"], name: "formAlgorithm", default_value: "Batch", prompt: "Specify the k-means clustering algorithm to use"})
    var outputType = form.comboBox({label: "Output", alternatives: ["Means", "Means table", "Segment profiling table"], name: "formOutput", default_value: "Means", prompt: "Select the output type"}).getValue();
    form.textBox({name: "formColLabels", label: "Cluster labels", default_value: "Cluster 1, Cluster 2, Cluster 3", required: false, prompt: "Specify names for clusters as a comma separated list. Leave blank to use default names. If labels are duplicated, then the corresponding clusters will be merged."});
    form.checkBox({label: "Variable names", name: "formNames", default_value: false, prompt: "Display names instead of labels"})
    form.checkBox({label: "Categorical as binary", name: "formBinary", default_value: false, prompt: "Code categorical variables as dummy variables"})

}
else
{
    pageTitle = "Segment Comparison Table";
    form.dropBox({label: "Variable",
            types:["Variable: Categorical, OrderedCategorical, Date"],
            prompt: "Categorical grouping variable used to predict the outcome variables",
            name: "formSegmentation"})
}
if (!!form.setObjectInspectorTitle)
    form.setObjectInspectorTitle(pageTitle, pageTitle);
else 
    form.setHeading(pageTitle);
var profVar = form.dropBox({label: "Profiling variables", name: "formProfVar",
            types:["Questions:!Text", "Variables:!Text"], multi: true, 
            required: segmentType != "K-Means" || outputType == "Segment profiling table"});

if (segmentType == "Variable containing segment membership" || 
    outputType == "Segment profiling table")
{
    form.group({label: "Display", expanded: true});
    form.checkBox({name: "formIndexValues", label: "Show index values", default_value: false,
        prompt: "Show column percentages as a ratio to the row total"});
    //var condFill = form.checkBox({name: "formCondFill", label: "Color cell fill conditional on cell values", default_value: true,
    //    prompt: "Values which are higher than the row mean are shown in blue, and values lower than the row mean are shown in red. Numeric values are scaled by 2 * standard deviation."}); 

    var condShade = form.comboBox({name: "formCondShadeType", label: "Shade", alternatives: ["None", "Cell colors", "Font colors", "Boxes", "Arrows", "Fonts and arrows", "Bars"], default_value: "Bars", prompt: "Select whether cells or elements inside the cell should be shaded to highlight the magnitude of the cell value"});
    if (condShade.getValue() == "Boxes")
    {
        form.numericUpDown({name: "formCondBoxWidth", label: "Box width", default_value: 2});
        form.numericUpDown({name: "formCondBoxRadius", label: "Box corner roundness", default_value: 0, maximum: 50, prompt: "Increase value to get rounder corner; setting to 50 gives ovals"});
    }

    if (condShade.getValue() != "None")
    {
        form.colorPicker({name: "formCondShade1", label: "Very small values", default_value: "#E99598"});
        form.numericUpDown({name: "formCondThres1", label: "Threshold for very small values", default_value: -0.2, minimum: -10, increment: 0.01, prompt: "Values are considered very small if the standardized value is smaller than the threshold. Numeric variables are standardized by centering by the population mean and scaling by the standard deviation; categorical variables are standardized by taking the ratio to the population mean"});


        form.colorPicker({name: "formCondShade2", label: "Small values", default_value: "#E5C8C4"});
        form.numericUpDown({name: "formCondThres2", label: "Threshold for small values", default_value: -0.1, minimum: -10, increment: 0.01, prompt: "Values are considered small if the standardized value is smaller than the threshold. Numeric variables are standardized by centering by the population mean and scaling by the standard deviation; categorical variables are standardized by taking the ratio to the population mean"});

        form.colorPicker({name: "formCondShade3", label: "Large values", default_value: "#A9C0DA"});
        form.numericUpDown({name: "formCondThres3", label: "Threshold for large values", default_value: 0.1, minimum: -10, increment: 0.01, prompt: "Values are considered large if the standardized value is larger than the threshold. Numeric variables are standardized by centering by the population mean and scaling by the standard deviation; categorical variables are standardized by taking the ratio to the population mean"});
        
        form.colorPicker({name: "formCondShade4", label: "Very large values", default_value: "#82A5CB"});
        form.numericUpDown({name: "formCondThres4", label: "Threshold for very large values", default_value: 0.2, minimum: -10, increment: 0.01, prompt: "Values are considered very large if the standardized value is larger than the threshold. Numeric variables are standardized by centering by the population mean and scaling by the standard deviation; categorical variables are standardized by taking the ratio to the population mean"});
    }

    var condText = form.checkBox({name: "formCondText", label: "Color cell text conditional on significance testing", default_value: true,
        prompt: "Grey out values which are not significantly different from the row mean"});
    if (condText.getValue())
    {
        form.colorPicker({name: "formCondTextColor", label: "Non-significant font color", default_value: "#999999"});
        form.checkBox({name: "formNonparametric", label: "Use non-parametric test", default_value: false, 
                      prompt: "Convert numeric variables to ranks before performing significance tests"});   
        form.checkBox({name: "formFDR", label: "False discovery rate correction", default_value: false});
        form.numericUpDown({name: "formConfidenceLevel", label: "Confidence level", default_value: 0.95, minimum: 0.0, maximum: 1.0, increment: 0.01});
    }
    if (condShade.getValue() != "None" && condText.getValue())
        form.checkBox({name: "formCondShadeSigOnly", label: "Only shade significant results", default_value: true});

    if (segmentType != "K-Means")
        form.textBox({name: "formColLabels", label: "Column labels", default_value: "", required: false, prompt: "Specify columns names as a comma separated list. Leave blank to use default names"});
    form.textBox({name: "formRowsHide", label: "Rows to hide", default_value: "NET, Total, SUM", prompt: "Specify rows to hide as a comma-separated list. Use double-quotes to escape commas", required: false})

    form.page("Format");
    form.group("Number formatting");
    form.numericUpDown({name: "formDecimalsPercentage", label: "Decimals shown for percentages", default_value:0});
    form.numericUpDown({name: "formDecimalsNumeric", label: "Decimals shown for numeric data", default_value: 1});

    form.group("Font");
    var fontFamilies = !!Q.GetAvailableFontNames ? Q.GetAvailableFontNames() : ["Arial", "Arial Black", "Century Gothic", "Comic Sans MS",
                     "Courier New", "Georgia", "Impact", "Open Sans", "Tahoma", "Times New Roman", "Trebuchet MS", "Verdana"];
    form.comboBox({name: "formFontFamily", label: "Font family", default_value: "Open Sans", alternatives: fontFamilies, editable: true, prompt: "Select the font to use. You can also type the name of a font directly (including custom fonts)."});
    form.colorPicker({name: "formFontColor", label: "Font color", default_value: "#444444"});
    var fontSize = form.numericUpDown({name: "formFontSize", label: "Font size", default_value: 8, increment: 0.5});
    form.comboBox({name: "formFontUnits", label: "Font units", alternatives: ["pt", "px"], default_value: "pt"});


    form.group("Spacing");
    form.numericUpDown({name: "formRowHeight", label: "Row height", default_value: fontSize.getValue() + 5});
    form.textBox({name: "formColumnWidths", label: "Column widths", required: false, default_value: "100px, 100px",
         prompt: "Comma separated values, e.g. '40px, 25%' or leave blank for equal widths"});

    form.group("Borders and fill");
    form.colorPicker({name: "formColHeadFill", label: "Column header fill", default_value: "#AEB7BA"});
    form.colorPicker({name: "formRowHeadFill", label: "Row header fill", default_value: "#F1F3F4"});
    form.colorPicker({name: "formSummaryFill", label: "Summary rows fill", default_value: "#FFFFFF"});
    form.colorPicker({name: "formCellFill", label: "Cell fill", default_value: "#FFFFFF"});
    form.colorPicker({name: "formBorderColor", label: "Border color", default_value: "#FFFFFF"});
    form.numericUpDown({name: "formBorderWidth", label: "Border width", default_value: 1});
}

library(flipAnalysisOfVariance)
library(flipCluster)

if (formSegmentType == "K-Means")
    kmeans <- KMeans(formVariables, 
        centers = formNumberClusters,
        algorithm = formAlgorithm,
        output = formOutput,
        subset = QFilter,
        weights = QPopulationWeight,
        missing = formMissing,
        show.labels = !formNames,
        binary = formBinary,
        show.index.values = get0("formIndexValues", ifnotfound = FALSE),
        centers.names = get0("formColLabels"),
        cond.shade = get0("formCondShadeType", ifnotfound = "None"),
        cond.box.radius = get0("formCondBoxRadius"),
        cond.box.width = get0("formCondBoxWidth"),
        cond.shade.colors = if (exists("formCondShadeType")) c(formCondShade1, formCondShade2, formCondShade3, formCondShade4, formCondShade4),
        cond.shade.cutoffs = if (exists("formCondShadeType")) c(formCondThres1, formCondThres2, formCondThres3, formCondThres4),
        cond.shade.sig.only = get0("formCondShadeSigOnly", ifnotfound = FALSE),

        format.percentage.decimals = get0("formDecimalsPercentage", ifnotfound = 0),
        format.numeric.decimals = get0("formDecimalsNumeric", ifnotfound = 0),    
        font.color.nonsignificant = get0("formCondTextColor", ifnotfound = ""),
        font.color.confidence = get0("formConfidenceLevel", ifnotfound = 0.95),
        font.color.FDRcorrection = get0("formFDR", ifnotfound = FALSE),
        font.color.nonparametric = get0("formNonparametric", ifnotfound = FALSE),
        row.names.to.remove = get0("formRowsHide", ifnotfound = ""),
        col.widths = get0("formColumnWidths", ifnotfound = ""),
        row.height = if (exists("formRowHeight")) paste0(formRowHeight, formFontUnits) else NULL, # set NULL for autofit
        global.font.family = get0("formFontFamily", ifnotfound = ""),
        font.size = get0("formFontSize", ifnotfound = 0),
        font.color = get0("formFontColor", ifnotfound = ""),
        font.unit = get0("formFontUnits", ifnotfound = ""),
        col.header.fill = get0("formColHeadFill", ifnotfound = ""),
        row.header.fill = get0("formRowHeadFill", ifnotfound = ""),
        summary.cell.fill = get0("formSummaryFill", ifnotfound = ""),
        cell.fill = get0("formCellFill", ifnotfound = ""),
        border.color = get0("formBorderColor"),
        border.width = get0("formBorderWidth", ifnotfound = 0),
        profile.var = formProfVar)

if (formSegmentType != "K-Means")
{
    segment.table <- SegmentComparisonTable(formProfVar, 
                       group = formSegmentation,
                       show.index.values = formIndexValues,
                       cond.shade = get0("formCondShadeType"),
                       cond.box.radius = get0("formCondBoxRadius"),
                       cond.box.width = get0("formCondBoxWidth"),
                       cond.shade.colors = if (exists("formCondShadeType")) c(formCondShade1, formCondShade2, formCondShade3, formCondShade4),
                       cond.shade.cutoffs = if (exists("formCondShadeType")) c(formCondThres1, formCondThres2, formCondThres3, formCondThres4),
                       cond.shade.sig.only = get0("formCondShadeSigOnly", ifnotfound = FALSE),
                       col.header.labels = get0("formColLabels"),
                       format.percentage.decimals = formDecimalsPercentage,
                       format.numeric.decimals = formDecimalsNumeric,    
                       font.color.set.if.nonsignificant = formCondText,
                       font.color.nonsignificant = formCondTextColor,
                       font.color.confidence = formConfidenceLevel,
                       font.color.FDRcorrection = formFDR,
                       font.color.nonparametric = formNonparametric,
                       row.names.to.remove = formRowsHide,
                       col.widths = formColumnWidths,
                       row.height = paste0(formRowHeight, formFontUnits), # set NULL for autofit
                       global.font.family = formFontFamily,
                       font.size = formFontSize,
                       font.color = formFontColor,
                       font.unit = formFontUnits,
                       col.header.fill = formColHeadFill,
                       row.header.fill = formRowHeadFill,
                       summary.cell.fill = formSummaryFill,
                       cell.fill = formCellFill,
                       border.color = formBorderColor,
                       border.width = formBorderWidth,
                       subset = QFilter,
                       weights = QPopulationWeight)
} else
    kmeans

See Also

Save variables from k-means cluster analysis outputs: Segments - Save Variable(s) - Cluster Membership

Further reading: Market Segmentation Software