Data - Data Set - Combine - By Case

From Q
(Redirected from Data - Merge - By Case)
Jump to: navigation, search

Combine data sets in the Displayr cloud drive by case

This page is currently under construction, or it refers to features which are under development and not yet available for use.
This page is under construction. Its contents are only visible to developers!

Combine multiple data sets together by case (matching variables). To use this feature, the data sets to be combined need to be uploaded to the cloud drive (accessed via the user icon button > Displayr cloud drive). The resulting combined data set is also written to the cloud drive. Once the combined data file has been created in the cloud drive, it can be added to a document via Data Sets > Add > Displayr Cloud Drive, and analyses can be performed as with any other data file in Displayr.

The data sets are combined by matching their variables, which is done automatically by matching variable names, labels and value labels, and/or manually. Automatic matching of names and labels may be exact or fuzzy (with a specified minimum match percentage), and manual matching can be specified over a range of variables for convenience (see Options section below). By default, all input variables are included in the combined data set but there is an option to exclude variables from the combined data set if they do not contain input from at least one data set from a list of specified data sets.

Example

The output shown below is a result of combining three data sets. The output displays the variables in the combined data set (named "Combined data set.sav"), along with some additional information. Each variable shown here is a result of combining variables from different data sets that have been matched together. The filled squares indicate which data sets contributed to each variable. Variables whose input variables contain differences in variable name, variable label or value label are highlighted in blue. Variables that have been created as a result of fuzzy matching input variables are highlighted in orange, and variables that were created from manual matching of input variables are highlighted in pink. More information on highlighted variables is available by clicking on the highlighted area, which expands out to show a table of the input variable names, labels and types and also a table of the input values and labels for categorical variables.

Options

DATA SETS

Data set 1, Data set 2, ... The names of the SPSS .sav data files containing data sets to be combined. These data files need to be uploaded to the Displayr cloud drive.

Combined data set name The name to use for the resulting combined data set that is written to the cloud drive.

AUTOMATIC VARIABLE MATCHING

Automatically determine what to match When selected, an algorithm determines what to match by (variable names, variable labels and/or value labels). When unselected, checkboxes appear below to manually choose what to match by (see below).

Match by variable names Whether to match variables using variable names.

Match by variable labels Whether to match variables using variable labels.

Match by value labels Whether to match variables using value labels.

Ignore letter case when matching Ignore letter case in variable names, variable labels and value labels when matching variables.

Ignore non-alphanumeric characters when matching Ignore non-alphanumeric characters in variable names, variable labels and value labels when matching variables

Minimum fuzzy match percentage Minimum match percentage for variable names, variable labels and value labels to be considered matching.

VARIABLES

Variables to manually combine These are text input controls where each should contain comma-separated names of variables from different data sets that should be combined together into a variable. Variable ranges are supported (a range is specified by the start and end variable names separated by a dash, e.g. "Q2-Q6"), which allows ranges of variables from different data sets to be combined into a range of variables. If the variables to combine have the same name in each data set, the variable name only needs to be specified once. To specify a variable from a specific data set, the data set index needs to be added in parentheses as a suffix, e.g., "Q2(3)".

Variables that should not be combined These are text input controls where each should contain comma-separated names of variables from different data sets that should be combined together into a variable. Variable ranges are supported (a range is specified by the start and end variable names separated by a dash, e.g. "Q2-Q6"), which allows specifying ranges of variables from different data sets that should not be combined (i.e. first variable in each range should not be combined, etc.). If the variables that should not be combined have the same name in each data set, the variable name only needs to be specified once. To specify a variable from a specific data set, the data set index needs to be added in parentheses as a suffix, e.g., "Q2(3)".

Variables to manually include These are text input controls where each should contain comma-separated names of variables to include in the combined data set. Variable ranges are supported (a range is specified by the start and end variable names separated by a dash, e.g. "Q2-Q6") as well as variable name wildcards, e.g. "Q2_*" which matches all variables with names starting with "Q2_". To specify a variable from a specific data set, the data set index needs to be added in parentheses as a suffix, e.g., "Q2(3)". These controls are only useful when not all data sets are selected when Combined variables must contain input from at least one data set (select to specify) is checked because otherwise variables in all data sets will be included in the combined data set.

Variables to manually omit These are text input controls where each should contain comma-separated names of variables to include in the combined data set. Variable ranges are supported (a range is specified by the start and end variable names separated by a dash, e.g. "Q2-Q6") as well as variable name wildcards, e.g. "Q2_*" which matches all variables with names starting with "Q2_". To specify a variable from a specific data set, the data set index needs to be added in parentheses as a suffix, e.g., "Q2(3)".

Combined variables must contain input from at least one data set (select to specify) When selected, checkboxes appear below allowing data sets to be selected. Only combined variables that contain input from these data sets will be included in the combined data set. For example, suppose that data sets 3 and 4 are selected by the checkboxes. Then a variable that is created by combining variables from data sets 1 and 2 will not be included in the combined data set, whereas a variable created by combining variables from data sets 2 and 3 will be.

NAMES AND LABELS

Prefer names and labels from the Combo box with the options "First data set" and "Last data set". If the former is chosen, variable names, variable labels and value labels from the first data set will be preferred for the combined variables, and vice versa when the latter is chosen.

When there are multiple labels for the same value Combo box with the options "Create new values for the labels" and "Use label from preferred data set". If the former is chosen, a new value is created for each of the labels (except the one from the preferred data set) whereas if the latter is chosen, the labels are replaced by the one from the preferred data set.

AUTOMATIC UPDATING

Automatic updating Whether to automatically update the stacked data set. This is used when the input data set is regularly updated.

Update period The time unit for regular updates. Shown when Automatic updating is selected.

Frequency The multiple of the Update period for regular updating. Shown when Automatic updating is selected.

Start date and time The date and time of the first update in the format dd-mm-yyyy hh:mm or mm-dd-yyyy hh:mm. Shown when Automatic updating is selected.

US date format Whether the Start date and time is expressed in US format i.e. mm-dd-yyyy hh:mm. Shown when Automatic updating is selected.

Time zone An optional time zone for the Start date and time, or else default of UTC applies. Format must be Continent/City, e.g. America/Los_Angeles. See Wikipedia for a list of time zones. Shown when Automatic updating is selected.

Code

form.setObjectInspectorTitle("Combine Data Sets by Case", "Combine Data Sets by Case");

form.group({label: "Data Sets", expanded: true});

var i = 1;
while (true)
{
    var data_set_name = form.textBox({name: "formDataSetName" + i,
                                      label: "Data set " + i,
                                      required: i < 1,
                                      prompt: "Name of data set in Displayr cloud drive"}).getValue();
    i++;
    if (data_set_name == "")
        break;
}
let n_data_sets = i - 2;

form.textBox({name: "formMergedDataSetName",
              label: "Combined data set name",
              default_value: "Combined data set.sav",
              required: false,
              prompt: "Name of combined data set to be saved to the Displayr cloud drive"});

form.group({label: "Automatic Variable Matching", expanded: true});

let auto_select = form.checkBox({name: "formAutoSelect",
                                 label: "Automatically determine what to match",
                                 default_value: true,
                                 prompt: "Uncheck to manually select what to match by (variable names, variable labels and/or value labels)"}).getValue();

if (!auto_select)
{
    form.checkBox({name: "formMatchByVarNames",
                   label: "Match by variable names",
                   default_value: true,
                   prompt: "Match variables using variable names"});
    form.checkBox({name: "formMatchByVarLabels",
                   label: "Match by variable labels",
                   default_value: true,
                   prompt: "Match variables using variable labels"});
    form.checkBox({name: "formMatchByValueLabels",
                   label: "Match by value labels",
                   default_value: true,
                   prompt: "Match variables using value labels"});
}

form.checkBox({name: "formIgnoreCase",
               label: "Ignore letter case when matching",
               default_value: true,
               prompt: "Ignore letter case in variable names, variable labels and value labels when matching variables"});

form.checkBox({name: "formIgnoreNonAlphanumChars",
               label: "Ignore non-alphanumeric characters when matching",
               default_value: true,
               prompt: "Ignore non-alphanumeric characters in variable names, variable labels and value labels when matching variables"});

form.numericUpDown({name: "formMinMatchPercentage",
                    label: "Minimum fuzzy match percentage",
                    default_value: 90,
                    maximum: 100,
                    minimum: 0,
                    prompt: "Minimum match percentage for variable names, variable labels and value labels to be considered matching"});

form.group({label: "Variables", expanded: true});

var i = 1;
while (true)
{
    var combine_str = form.textBox({name: "formVarToCombine" + i,
                                    label: "Variables to manually combine",
                                    required: false,
                                    prompt: "e.g.: Q1,Q6(2) or Q1-Q3,Q4-Q6"}).getValue();
    i++;
    if (combine_str == "")
        break;
}

var i = 1;
while (true)
{
    var not_combine_str = form.textBox({name: "formVarToNotCombine" + i,
                                        label: "Variables that should not be combined",
                                        required: false,
                                        prompt: "e.g.: Q1,Q6(2) or Q1-Q3,Q4-Q6"}).getValue();
    i++;
    if (not_combine_str == "")
        break;
}

var i = 1;
while (true)
{
    var keep_str = form.textBox({name: "formVarToKeep" + i,
                                 label: "Variables to manually include",
                                 required: false,
                                 prompt: "e.g.: Q1,Q2 or Q1-Q6 or Q2_*"}).getValue();
    i++;
    if (keep_str == "")
        break;
}

var i = 1;
while (true)
{
    var omit_str = form.textBox({name: "formVarToOmit" + i,
                                 label: "Variables to manually omit",
                                 required: false,
                                 prompt: "e.g.: Q1,Q2 or Q1-Q6 or Q2_*"}).getValue();
    i++;
    if (omit_str == "")
        break;
}

let is_vars_discarded = form.checkBox({name: "formIsKeepVarsFromDataSet",
                                       label: "Combined variables must contain input from at least one data set (select to specify)",
                                       default_value: false,
                                       prompt: "Manually combined or included variables will also be included"}).getValue();
if (is_vars_discarded)
{
    for (let i = 0; i < n_data_sets; i++)
    {
        form.checkBox({name: "formKeepVarsFromDataSet" + (i + 1),
                       label: "\u200b\xA0".repeat(4) + "Include combined variables that contain input from data set "+ (i + 1),
                       default_value: true});
    }
}

form.group({label: "Names and Labels", expanded: true});

form.comboBox({name: "formFirstLast",
               label: "Prefer names and labels from the ",
               alternatives: ["First data set", "Last data set"],
               default_value: "First data set",
               prompt: "Preferred data sets for selecting variable names, variable labels and value labels of combined variables"});

form.comboBox({name: "formMutlipleLabels",
               label: "When there are multiple labels for the same value",
               alternatives: ["Create new values for the labels", "Use label from preferred data set"],
               default_value: "Create new values for the labels",
               prompt: "Choose how to combine a value with different labels from different inputs"});

// Controls for regular updating
form.group({label: "Automatic Updating", expanded: false});
var updating = form.checkBox({label:"Automatic updating", name:"formUpdating", default_value:false, prompt:"Regularly update the output"}).getValue();
if (updating) {
    var period = form.comboBox({name: "formUpdatePeriod", label: "Update period", 
               alternatives: ["Months", "Weeks", "Days", "Hours", "Minutes", "Seconds"], default_value: "Days", prompt: "The time units for updating"}).getValue();
    var defaultFrequency = 1;
    if (period == "Seconds")
        defaultFrequency = 600;
    else if (period == "Minutes")
        defaultFrequency = 10;
    form.numericUpDown({name: "formFrequency", label: "Frequency", default_value: defaultFrequency,
                        prompt: "The update frequency in units of the update period", increment: 1, minimum: defaultFrequency, maximum: 1000000});
    var start = form.textBox({name: "formStart", label: "Start date and time", prompt: "The first update date and time",
              required: false, prompt: "Default now, or e.g. 31-12-2018 18:00:00"}).getValue();
    if (start != "") {
        form.checkBox({label:"US date format", name:"formUSDate", default_value:false, prompt: "Specify update start date as mm-dd-yyyy"});
        form.textBox({name: "formTimeZone", label: "Time zone", 
                  required: false, prompt: "Leave blank for UTC or enter e.g. America/New_York"});    
    }
}
library(flipData)
library(flipTime)

vars <- ls()
n.data.sets <- max(as.numeric(substr(vars[grepl("^formDataSetName[[:digit:]]+", vars)], 16, 20))) - 1
data.sets <- vapply(seq_len(n.data.sets), function(i) get0(paste0("formDataSetName", i)),
                    character(1))

n.variables.to.combine <- max(as.numeric(substr(vars[grepl("^formVarToCombine[[:digit:]]+", vars)],
                                                nchar("formVarToCombine") + 1, 
                                                nchar("formVarToCombine") + 5)))
variables.to.combine <- vapply(seq_len(n.variables.to.combine), function(i) get0(paste0("formVarToCombine", i)),
                               character(1))
variables.to.combine <- variables.to.combine[-n.variables.to.combine]

n.variables.to.not.combine <- max(as.numeric(substr(vars[grepl("^formVarToNotCombine[[:digit:]]+", vars)],
                                                    nchar("formVarToNotCombine") + 1, 
                                                    nchar("formVarToNotCombine") + 5)))
variables.to.not.combine <- vapply(seq_len(n.variables.to.not.combine), function(i) get0(paste0("formVarToNotCombine", i)),
                                   character(1))
variables.to.not.combine <- variables.to.not.combine[-n.variables.to.not.combine]

n.variables.to.keep <- max(as.numeric(substr(vars[grepl("^formVarToKeep[[:digit:]]+", vars)],
                                             nchar("formVarToKeep") + 1, 
                                             nchar("formVarToKeep") + 5)))
variables.to.keep <- vapply(seq_len(n.variables.to.keep), function(i) get0(paste0("formVarToKeep", i)),
                            character(1))
variables.to.keep <- variables.to.keep[-n.variables.to.keep]

n.variables.to.omit <- max(as.numeric(substr(vars[grepl("^formVarToOmit[[:digit:]]+", vars)],
                                             nchar("formVarToOmit") + 1, 
                                             nchar("formVarToOmit") + 5)))
variables.to.omit <- vapply(seq_len(n.variables.to.omit), function(i) get0(paste0("formVarToOmit", i)),
                            character(1))
variables.to.omit <- variables.to.omit[-n.variables.to.omit]

if (formIsKeepVarsFromDataSet)
{
    data.sets.whose.variables.are.kept <- which(vapply(seq_len(n.data.sets), function(i) {
        get0(paste0("formKeepVarsFromDataSet", i))
    }, logical(1)))
} else {
    data.sets.whose.variables.are.kept <- seq_len(n.data.sets)
}

# Create regular updating message
if (formUpdating) {
    if (formStart != "") {
        if (formTimeZone == "") formTimeZone <- "UTC"
        UpdateAt(formStart, us.format = formUSDate, time.zone = formTimeZone,
                units = tolower(formUpdatePeriod), frequency = formFrequency, options = "wakeup")
    } else
        UpdateEvery(formFrequency, units = tolower(formUpdatePeriod), options = "wakeup")
}

MergeDataSetsByCase(data.sets,
                    merged.data.set.name = formMergedDataSetName,
                    auto.select.what.to.match.by = formAutoSelect,
                    match.by.variable.names = get0("formMatchByVarNames"),
                    match.by.variable.labels = get0("formMatchByVarLabels"),
                    match.by.value.labels = get0("formMatchByValueLabels"),
                    ignore.case = formIgnoreCase,
                    ignore.non.alphanumeric = formIgnoreNonAlphanumChars,
                    min.match.percentage = formMinMatchPercentage,
                    variables.to.combine = variables.to.combine,
                    variables.to.not.combine = variables.to.not.combine,
                    variables.to.keep = variables.to.keep,
                    variables.to.omit = variables.to.omit,
                    use.names.and.labels.from = formFirstLast,
                    data.sets.whose.variables.are.kept = data.sets.whose.variables.are.kept,
                    when.multiple.labels.for.one.value = formMutlipleLabels)