Data - Data Set - Combine - By Variable

From Q
Jump to navigation Jump to search
This page is currently under construction, or it refers to features which are under development and not yet available for use.
This page is under construction. Its contents are only visible to developers!

Combine multiple data sets together by variable (matching cases). To use this feature, the data sets to be combined need to be uploaded to the cloud drive (accessed via the user icon button > Displayr cloud drive). The resulting combined data set is also written to the cloud drive. Once the combined data file has been created in the cloud drive, it can be added to a document via Data Sets > Add > Displayr Cloud Drive, and analyses can be performed as with any other data file in Displayr.

The data sets are combined either by matching cases using ID variables from each data set, or joined side-by-side (data sets are required to have the same number of cases). For each input data set, all variables can be included in the combined data set except for those specified to be omitted, or all variables can be omitted from the combined data set except for those specified to be included. For convenience, variable ranges and wildcard names are supported (see Options below).

Example

The output shown below is a result of combining two data sets. The output displays the variables in the combined data set (named "Combined data set.sav"), along with some additional information. The filled squares indicate which data sets contributed to each variable. Cases were matched using ID variables from both data sets (combined ID variable highlighted in blue). The variables Q2, Q3 and Q3_3 from data set 2 were omitted from the combined data set, as indicated by the footer.

Options

DATA SETS

Data set 1, Data set 2, ... The names of the SPSS .sav data files containing data sets to be combined. These data files need to be uploaded to the Displayr cloud drive.

Combined data set name The name to use for the resulting combined data set that is written to the cloud drive.

CASE MATCHING

Combine data sets by Combo box with the options "Matching IDs" and "Joining them side-by-side (no matching)". If the former is chosen, input text controls will appear below for the ID variable name to be specified for each input data set. If the latter is chosen, data sets are combined by joining the data sets side-by-side (data sets are required to have the same number of cases).

Only keep cases matched to all data sets A checkbox which if selected, only cases that have IDs in all data sets are retained. This checkbox is only shown when Combine data sets by has "Matching IDs" selected.

VARIABLES

Variables from data set 1, Variables from data set 2, ... Combo boxes with the options "Include all variables except those manually omitted" and "Only include manually specified variables". If the former is selected, all variables from the data set will be included in the combined data set except those manually specified to be omitted in the text controls that appear below. If the latter is selected, all variables from the data set will be excluded from the combined data set except those manually specified to be included in the text controls that appear below.

Variables to omit from data set 1, Variables to omit from data set 2, ... Input text controls that appear when "Include all variables except those manually omitted" is selected. These are used to specify variables that should be omitted from the combined data set. Variable ranges are supported (a range is specified by the start and end variable names separated by a dash, e.g. "Q2-Q6") and also variable name wildcards, e.g. "Q2_*" which matches all variables with names starting with "Q2_".

Variables to include from data set 1, Variables to include from data set 2, ... Input text controls that appear when "Only include manually specified variables" is selected. These are used to specify variables that should be included in the combined data set. Variable ranges are supported (a range is specified by the start and end variable names separated by a dash, e.g. "Q2-Q6") and also variable name wildcards, e.g. "Q2_*" which matches all variables with names starting with "Q2_".

AUTOMATIC UPDATING

Automatic updating Whether to automatically update the stacked data set. This is used when the input data set is regularly updated.

Update period The time unit for regular updates. Shown when Automatic updating is selected.

Frequency The multiple of the Update period for regular updating. Shown when Automatic updating is selected.

Start date and time The date and time of the first update in the format dd-mm-yyyy hh:mm or mm-dd-yyyy hh:mm. Shown when Automatic updating is selected.

US date format Whether the Start date and time is expressed in US format i.e. mm-dd-yyyy hh:mm. Shown when Automatic updating is selected.

Time zone An optional time zone for the Start date and time, or else default of UTC applies. Format must be Continent/City, e.g. America/Los_Angeles. See Wikipedia for a list of time zones. Shown when Automatic updating is selected.

Variables to include from data set 1, Variables to include from data set 2, ...

Code

form.setObjectInspectorTitle("Combine Data Sets by Variable", "Combine Data Sets by Variable");
form.group({label: "Data Sets", expanded: true});
var i = 1;
while (true)
{
    var data_set_name = form.textBox({name: "formDataSetName" + i,
                                      label: "Data set " + i,
                                      required: i < 3,
                                      prompt: "Name of data set in Displayr cloud drive"}).getValue();
    i++;
    if (i > 2 && data_set_name == "")
        break;
}
let n_data_sets = i - 2;

form.textBox({name: "formMergedDataSetName",
              label: "Combined data set name",
              required: true,
              prompt: "Name of combined data set to be saved to the Displayr cloud drive"});

form.group({label: "Case Matching", expanded: true});

let match_method = form.comboBox({name: "formMatchMethod",
                                  label: "Combine data sets by",
                   alternatives: ["Matching IDs",
                                  "Joining them side-by-side (no matching)"],
                   default_value: "Matching IDs"}).getValue();

if (match_method == "Matching IDs")
{
    for (var i = 0; i < n_data_sets; i++)
    {
        form.textBox({name: "formID" + (i + 1),
                    label: "ID variable (data set " + (i + 1) + ")",
                    required: true,
                    prompt: "Name of ID variable in data set " + (i + 1)});
    }

    form.checkBox({name: "formRequireFullMatch",
               label: "Only keep cases matched to all data sets",
               default_value: false});
}

form.group({label: "Variables", expanded: true});

for (var j = 0; j < n_data_sets; j++)
{
    let include_or_omit = form.comboBox({name: "formIncludeOrOmit" + (j + 1),
                   label: "Variables from data set " + (j + 1),
                   alternatives: ["Include all variables except those manually omitted",
                                  "Only include manually specified variables"],
                   default_value: "Include all variables except those manually omitted"}).getValue();
    var k = 1;
    while (true)
    {
        let lbl = include_or_omit == "Include all variables except those manually omitted" ?
            "Variables to omit from data set " + (j + 1) :
            "Variables to include from data set " + (j + 1);
        var omit_str = form.textBox({name: "formVarToIncludeOrOmit" + (j + 1) + "row" + k,
                                    label: lbl,
                                    required: include_or_omit == "Only include manually specified variables" && k == 1,
                                    prompt: "e.g.: Q1,Q2 or Q1-Q6 or Q2_*"}).getValue();
        k++;
        if (omit_str == "")
            break;
    }
}

// Controls for regular updating
form.group({label: "Automatic Updating", expanded: false});
var updating = form.checkBox({label:"Automatic updating", name:"formUpdating", default_value:false, prompt:"Regularly update the output"}).getValue();
if (updating) {
    var period = form.comboBox({name: "formUpdatePeriod", label: "Update period", 
               alternatives: ["Months", "Weeks", "Days", "Hours", "Minutes", "Seconds"], default_value: "Days", prompt: "The time units for updating"}).getValue();
    var defaultFrequency = 1;
    if (period == "Seconds")
        defaultFrequency = 600;
    else if (period == "Minutes")
        defaultFrequency = 10;
    form.numericUpDown({name: "formFrequency", label: "Frequency", default_value: defaultFrequency,
                        prompt: "The update frequency in units of the update period", increment: 1, minimum: defaultFrequency, maximum: 1000000});
    var start = form.textBox({name: "formStart", label: "Start date and time", prompt: "The first update date and time",
              required: false, prompt: "Default now, or e.g. 31-12-2018 18:00:00"}).getValue();
    if (start != "") {
        form.checkBox({label:"US date format", name:"formUSDate", default_value:false, prompt: "Specify update start date as mm-dd-yyyy"});
        form.textBox({name: "formTimeZone", label: "Time zone", 
                  required: false, prompt: "Leave blank for UTC or enter e.g. America/New_York"});    
    }
}
library(flipData)
library(flipTime)

vars <- ls()
n.data.sets <- max(as.numeric(substr(vars[grepl("^formDataSetName[[:digit:]]+", vars)], 16, 20))) - 1
data.sets <- vapply(seq_len(n.data.sets), function(i) get0(paste0("formDataSetName", i)),
                    character(1))

id.variables <- if (formMatchMethod == "Matching IDs") {
    vapply(seq_len(n.data.sets), function(i) {
        get0(paste0("formID", i))
    }, character(1))
} else {
    NULL
}

include.or.omit.variables <- vapply(seq_len(n.data.sets), function(i) {
    get0(paste0("formIncludeOrOmit", i))
}, character(1))

variables.to.include.or.omit <- lapply(seq_len(n.data.sets), function(i) {
    prefix <- paste0("formVarToIncludeOrOmit", i, "row");
    n.rows <- max(as.numeric(substr(vars[grepl(paste0("^", prefix, "[[:digit:]]+"), vars)],
                                                nchar(prefix) + 1, 
                                                nchar(prefix) + 5)))
    vapply(seq_len(n.rows), function(j) get0(paste0(prefix, j)), character(1))
})

# Create regular updating message
if (formUpdating) {
    if (formStart != "") {
        if (formTimeZone == "") formTimeZone <- "UTC"
        UpdateAt(formStart, us.format = formUSDate, time.zone = formTimeZone,
                units = tolower(formUpdatePeriod), frequency = formFrequency, options = "wakeup")
    } else
        UpdateEvery(formFrequency, units = tolower(formUpdatePeriod), options = "wakeup")
}


MergeDataSetsByVariable(data.sets,
                        merged.data.set.name = formMergedDataSetName,
                        id.variables = id.variables,
                        include.or.omit.variables = include.or.omit.variables,
                        variables.to.include.or.omit = variables.to.include.or.omit,
                        only.keep.cases.matched.to.all.data.sets = get0("formRequireFullMatch"))