Excel and CSV Data File Specifications
The following specifications are designed to minimize the amount of time taken to analyze data from CSV files in Q. Please note that CSV files are, in general, a very poor format and it is preferable to one of the better file formats.
- 1 The overall structure of the file
- 2 Outfile labels rather than values
- 3 Non-Response and other types of missing data
- 4 Don’t Know
- 5 Single Response Questions
- 6 Multiple Response Questions
- 7 Multiple response questions with huge code frames
- 8 Grids
- 9 Rotations and randomizations – within questions
- 10 Rotations and randomizations – between questions
- 11 Rankings
- 12 Un-coded open-ended questions and “other specify”
- 13 Coded open-ended questions
- 14 Variable Names
- 15 Variables To Be Excluded
- 16 Specifications for advanced data types
- 17 See also
The overall structure of the file
All the data should appear in a single table, which should contain:
- A column for each of the different bits of data collected in the survey (e.g., age is in one column, education is in another).
- A single row at the top containing the headings that describe each column.
- A separate row for each respondent (i.e., all the data for the first respondent is shown in the row under the headings, the row beneath this contains the data for the second respondent, and so on).
- In Excel, this table should be on a single sheet.
- Column headers should not begin with numbers.
- Cells in the table should not be merged.
Outfile labels rather than values
In general, it is preferable to outfile labels rather than values. For example, if the questionnaire asked:
Q1c Do you have a credit card? No Yes
it is generally better to outfile all the No responses as No in the columns of the data file, rather than as 0s. Prior to Q4.9 it was generally better to outfile values rather than labels.
Non-Response and other types of missing data
Respondents who were not asked a particular question (i.e., were intentionally or unintentionally skipped), should have a
NA . It is never appropriate to record all missing values in a data file as having a value of 0 (this is very important, as for many binary variables the No response is often coded as a 0, making it impossible to determine which respondents said No and which were not asked the question).
Where there are multiple different types of missing data (e.g., where some questions were not asked of some respondents while others were asked but not answered), they should be coded with different values (e.g.,
NA where not asked to respondents and -99 if asked but not answered).
Sometimes it is appropriate to treat missing values for some of the questions as being equivalent to a “No” response (e.g., giving them a value of 0). For example, if people are asked which brands they have consumed, but are only shown brands that they are aware of. In this instance, the question should be included in the data file twice, once with the
NA values and once with the “No” responses instead.
Don't Know code needs to be different to the non-response code.
Single Response Questions
Each single response question should be represented in the data file by a single variable (i.e., column). A data file that uses a different variable for each unique response code of a single response question is not useful.
Multiple Response Questions
Where there are multiple response variables, a binary variable should be created for each possible response. For example, a question in a questionnaire may have been:
Q1 Which of the following products do you own? MULTIPLE RESPONSE Savings account Checking Account Credit Card Home loan
but the data file should be structured as if you had asked the following four questions.
Q1a Which of the following products do you own? A savings account? No Yes Q1b Which of the following products do you own? A checking account? No Yes Q1c Which of the following products do you own? A credit card? No Yes Q1d Which of the following products do you own? A home loan? No Yes
Multiple response questions with huge code frames
Some data files contain multiple response questions with extremely large code frames (e.g., 6,000 models of cars). Exporting these in the binary format results in data files that can be excessively large. In this situation they can be exported in max-multi format (i.e., essentially as multiple single response variables, each one recoding a separate response). However, this format is, in general, much less flexible than the binary format and should be avoided where possible. In particular, with data in this format there is no way of recording missing values (e.g., if a respondent is not shown as having selected an option, this may mean that they saw the option but did not choose it, or, it may mean that they were never shown the option).
Grid questions should contain names for variables that describe both the specific option being evaluated and also some common aspect to the wording. For example, the following labels are poor, because Q will not be able to recognize they should be grouped together:
Live a long life Be rich Have lots of friends
whereas these are much better:
How strongly do you agree that it is important to... Live a long life How strongly do you agree that it is important to... Be rich How strongly do you agree that it is important to... Have lots of friends
Rotations and randomizations – within questions
Rotations and randomizations that occur between respondents within a question need to be removed from the data prior to creating the data file. Generally, this should be done in accordance with the questionnaire’s coding. For example, if the questionnaire said:
Q1 Which of these have you eaten in the past week? RANDOMIZE ORDER OF 1 AND 2 McDonald’s Taco Bell None of these
Then half the sample will have been asked:
Q1 Which of these have you eaten in the past week? McDonald’s Taco Bell None of these
And the other half will have been asked:
Q1 Which of these have you eaten in the past week? Taco Bell McDonald’s None of these
When the data is exported, it should be done using the order in the questionnaire. That is, even though Taco Bell was shown first half the time and second half the time, the data file should be created as if Taco Bell was always shown second.
Rotations and randomizations – between questions
Where different respondents see questions in different orders, this order needs to be removed from the data prior to creating the data file. For example, if the respondents have been asked to rate the appeal of a random selection of three of four different products, and the order has been randomized or rotated, such as in this table:
ID Q1 Q2 Q3 1 Microsoft Apple IBM 2 Apple Microsoft IBM 3 IBM Google Apple 4 Google Microsoft IBM
then the data should be exported as if people had been asked four different questions and all respondents had seen them in the same order:
ID Q. Microsoft Q. Apple Q. IBM Q. Google 1 Data from Q1 Data from Q2 Data from Q3 NA 2 Data from Q2 Data from Q1 Data from Q3 NA 3 NA Data from Q3 Data from Q1 Data from Q2 4 Data from Q2 NA Data from Q3 Data from Q1
Further, the order with which the data was collected should also be exported as additional variables. For example:
ID Order Microsoft Order Apple Order IBM Order Google 1 1 2 3 NA 2 2 1 3 NA 3 NA 3 1 2 4 2 NA 3 1
Ranking questions need to be recorded with a single variable for each item being ranked. Ideally, the most preferred item will have the highest value and the least preferred the lowest, except where the questionnaire expressly indicates an alternative coding.
Un-coded open-ended questions and “other specify”
Verbatim responses to open-ended questions and “other specify” options should be stored as String variables if the data is text and Numeric variables if numeric.
Coded open-ended questions
Where open-ended questions have been coded, these are then included in the data file as if they are standard single or multiple response variables (in particular, the binary format is appropriate for multiple response questions). An additional string variable should store the raw responses.
Variable names should relate to the question numbers. It is often useful if separate question numbering is used for screeners, general questions and classification variables (i.e., S1, S, …., Q1, Q2, …, C1, C2,…). Where a question is represented by multiple variables, please use a common prefix (e.g., Q4a, Q4b, Q4c), rather than out-filing each variable with a different question number (e.g., Q231, Q232, Q233). Where a question is a loop of a multiple response question, this is generally best represented via a common prefix and two separate looping suffixes (e.g., Q4a1, Q4a2, Q4b1, Q4b2). While these are only guidelines, the core principle is to employ a convention that is easily understandable, whereby the variable names are informative as to the structure of the data.
Variables To Be Excluded
Variables that have no possible meaning to the user of the data file should be excluded from the data file. Some data collection programs automatically export useless variables that only relate to the way in which the questionnaire was set up. Examples of variables with no possible meaning that may be exported include:
- Looped variables, where one variable will have a value of 1 for every respondent, another will have a value of 2 for all respondents, and so on.
- Variables representing un-used codes in multiple response questions.
Specifications for advanced data types
The following specifications are for more advanced data types:
- MaxDiff Specifications contains instructions for the specifications of MaxDiff (maximum difference experiments).
- Experiments Specifications contains specifications for the set up of conjoint, discrete choice and other types of choice experiments.