Data Sets in R
This page describes how Data Sets are represented when their contents are referred to in R Code.
Variables
If you type a variable's Variable Name into R Code, in either an R Output or an R Variable, Q will automatically use the data for that variable. For example, in the Cola.Q example project file, if you refer to Q3 in R Code it will be interpreted as a variable of length 327.
Missing values appear as NAs in R, and NaNs remain as NaNs.
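The distinction between NA and NaN matters when testing for missing data. The following is a minimal sketch in plain R; the vector x is a made-up stand-in for a numeric variable, not data from Cola.Q.

```r
# Made-up stand-in for a numeric variable; in Q the values would come from the data set.
x <- c(25, NA, 31, NaN, 40)

# is.na() is TRUE for both NA and NaN, whereas is.nan() is TRUE only for NaN.
n_missing <- sum(is.na(x))
n_nan <- sum(is.nan(x))

# na.rm = TRUE drops both NA and NaN before the mean is computed.
mean_valid <- mean(x, na.rm = TRUE)
```

Because is.na() treats NaN as missing, use is.nan() when the two need to be told apart.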
Categorical variables
When a categorical variable (i.e., a Pick One or Pick One - Multi) is used in R, it is automatically converted to a factor or, if its Variable Type is Ordered Categorical, to an ordered factor (these are R classes). If the categories have been merged, this merging will be reflected in the way the data appears in R. This is done as follows:
- If all the categories of the variable are mutually exclusive and exhaustive, they all appear in R.
- Where there are overlapping categories, the broadest of these will be excluded. For example, if the data contains three unique values, 0, 1, and 2, with labels of A, B, and C, respectively, and the categories shown on the table are A, B, C, NET, the NET category will be removed. Similarly, if the categories are A, B, B + C, C, NET, then both NET and B + C are removed.
- Any categories that are missing (i.e., hidden) are inserted, such that the categories are mutually exclusive and exhaustive.
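The resulting R objects can be approximated outside Q with base R. The sketch below is an emulation under assumptions, not Q's actual conversion code: the codes vector is made up, and the merge of B and C is done by hand to show what a variable with merged categories might look like.

```r
# Made-up raw codes with value labels 0 = A, 1 = B, 2 = C.
codes <- c(0, 1, 2, 2, 1, 0, NA)

# A Pick One variable would arrive as an unordered factor.
f <- factor(codes, levels = c(0, 1, 2), labels = c("A", "B", "C"))

# An Ordered Categorical variable would arrive as an ordered factor.
o <- factor(codes, levels = c(0, 1, 2), labels = c("A", "B", "C"),
            ordered = TRUE)

# Emulate categories B and C having been merged into B + C:
# the levels stay mutually exclusive and exhaustive.
merged <- f
levels(merged) <- list("A" = "A", "B + C" = c("B", "C"))

levels(f)
levels(merged)
table(merged)
```

No NET level appears in any of these factors, consistent with the rule that the broadest overlapping categories are excluded.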
Attributes of variables
When a variable from a data set is referred to in R Code, the variable is automatically uploaded to the R Server prior to any R Code being run. A variable will have the following attributes:
- name. This is the Variable Name.
- question. This is the Question Name.
- label. This is the Variable Label.
While these attributes can be accessed in R in the usual way (e.g., attr(my.variable, "label")), the best way to access them is often using flipFormat::Labels, which will attempt to construct a label of the form Question Name: Variable Label where these are different, and Variable Label where these two are the same (e.g., flipFormat::Labels(Q3) will show Q3. Age). It falls back to name and, if even this is not provided, it attempts to discern the original name of the argument.
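As a rough illustration of reading the attributes with base R, the sketch below builds a stand-in variable by hand. The object q3 and its attribute values are assumptions made for the example; in Q the attributes are set automatically when the variable is uploaded.

```r
# Hypothetical stand-in for a variable as it might arrive from Q.
q3 <- c(23, 35, 41)
attr(q3, "name") <- "Q3"            # Variable Name (assumed value)
attr(q3, "question") <- "Q3. Age"   # Question Name (assumed value)
attr(q3, "label") <- "Q3. Age"      # Variable Label (assumed value)

# Read a single attribute.
variable_label <- attr(q3, "label")

# List the names of all attributes attached to the variable.
attribute_names <- names(attributes(q3))
```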
Questions
You can refer to a Question by its name in R Code. Where names contain spaces, they are surrounded by backticks (i.e., `). For example: `Q3. Age`. Where a Question contains multiple variables, they will be provided in a data.frame.
Where a question contains multiple variables, these can be selected using $. For example, `Q4. Frequency numeric`$Coffee will return a variable from the question called Q4. Frequency numeric. Here, Coffee refers to the name of one of the categories in the question, and may not correspond to a variable in the initial data file (e.g., because the user may have renamed the category, or created a new category by merging categories).
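The backtick and $ syntax is ordinary R and can be tried outside Q. In the sketch below the data.frame is built by hand with made-up values, and the Tea column is a hypothetical second category; in Q the object would be supplied automatically.

```r
# Hand-built stand-in for a multi-variable question (values are made up).
`Q4. Frequency numeric` <- data.frame(Coffee = c(3, 0, 7),
                                      Tea = c(1, 2, 0))

# Select one variable with $.
coffee <- `Q4. Frequency numeric`$Coffee

# [[ ]] is an equivalent alternative that takes the name as a string.
coffee_alt <- `Q4. Frequency numeric`[["Coffee"]]
```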
Multiple data sets
If you have Multiple Data Files in the project, and these contain variables or questions with the same names, the data file name is used to disambiguate (e.g., Cola.sav$`Q3. Age`). See Avoiding ambiguous references names for more information.
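The disambiguating syntax can be mimicked in plain R by treating each data file as a named list. This only emulates how the expression reads; it is not a description of how Q stores data files, and Other.sav is a hypothetical second file with made-up values.

```r
# Emulation only: each data file is represented as a named list.
Cola.sav <- list(`Q3. Age` = c(23, 35, 41))
Other.sav <- list(`Q3. Age` = c(52, 19))  # hypothetical second data file

# The file name prefix picks out which Q3. Age is meant.
age_cola <- Cola.sav$`Q3. Age`
age_other <- Other.sav$`Q3. Age`
```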