Excel and CSV Data File Specifications

From Q
Jump to: navigation, search

The following specifications are designed to minimize the amount of time taken to analyze data from CSV files in Q. Please note that CSV files are, in general, a very poor format and it is preferable to one of the better file formats.

The overall structure of the file

All the data should appear in a single table, which should contain:

  • A column for each of the different bits of data collected in the survey (e.g., age is in one column, education is in another).
  • A single row at the top containing the headings that describe each column.
  • A separate row for each respondent (i.e., all the data for the first respondent is shown in the row under the headings, the row beneath this contains the data for the second respondent, and so on).
  • In Excel, this table should be on a single sheet.
  • Column headers should not begin with numbers.

DataFileInExcel.PNG

Outfile labels rather than values

In general, it is preferable to outfile labels rather than values. For example, if the questionnaire asked:

Q1c	Do you have a credit card?
	No
	Yes

it is generally better to outfile all the No responses as No in the columns of the data file, rather than as 0s. Prior to Q4.9 it was generally better to outfile values rather than labels.

Non-Response and other types of missing data

Respondents who were not asked a particular question (i.e., were intentionally or unintentionally skipped), should have a NA . It is never appropriate to record all missing values in a data file as having a value of 0 (this is very important, as for many binary variables the No response is often coded as a 0, making it impossible to determine which respondents said No and which were not asked the question). Where there are multiple different types of missing data (e.g., where some questions were not asked of some respondents while others were asked but not answered), they should be coded with different values (e.g., NA where not asked to respondents and -99 if asked but not answered). Sometimes it is appropriate to treat missing values for some of the questions as being equivalent to a “No” response (e.g., giving them a value of 0). For example, if people are asked which brands they have consumed, but are only shown brands that they are aware of. In this instance, the question should be included in the data file twice, once with the NA values and once with the “No” responses instead.

Don’t Know

The Don't Know code needs to be different to the non-response code.

Single Response Questions

Each single response question should be represented in the data file by a single variable (i.e., column). A data file that uses a different variable for each unique response code of a single response question is not useful.

Multiple Response Questions

Where there are multiple response variables, a binary variable should be created for each possible response. For example, a question in a questionnaire may have been:

Q1	Which of the following products do you own?
	MULTIPLE RESPONSE
	Savings account
	Checking Account
	Credit Card
	Home loan

but the data file should be structured as if you had asked the following four questions.

Q1a	Which of the following products do you own?  A savings account?
	No
	Yes

Q1b	Which of the following products do you own?  A checking account?
	No
	Yes

Q1c	Which of the following products do you own?  A credit card?
	No
	Yes

Q1d	Which of the following products do you own?  A home loan?
	No
	Yes

Multiple response questions with huge code frames

Some data files contain multiple response questions with extremely large code frames (e.g., 6,000 models of cars). Exporting these in the binary format results in data files that can be excessively large. In this situation they can be exported in max-multi format (i.e., essentially as multiple single response variables, each one recoding a separate response). However, this format is, in general, much less flexible than the binary format and should be avoided where possible. In particular, with data in this format there is no way of recording missing values (e.g., if a respondent is not shown as having selected an option, this may mean that they saw the option but did not choose it, or, it may mean that they were never shown the option).

Grids

Grid questions should contain names for variables that describe both the specific option being evaluated and also some common aspect to the wording. For example, the following labels are poor, because Q will not be able to recognize they should be grouped together:

Live a long life
Be rich
Have lots of friends

whereas these are much better:

How strongly do you agree that it is important to... Live a long life
How strongly do you agree that it is important to... Be rich
How strongly do you agree that it is important to... Have lots of friends

Rotations and randomizations – within questions

Rotations and randomizations that occur between respondents within a question need to be removed from the data prior to creating the data file. Generally, this should be done in accordance with the questionnaire’s coding. For example, if the questionnaire said:

Q1	Which of these have you eaten in the past week? RANDOMIZE ORDER OF 1 AND 2
	McDonald’s
	Taco Bell
	None of these

Then half the sample will have been asked:

Q1	Which of these have you eaten in the past week?
	McDonald’s
	Taco Bell
	None of these

And the other half will have been asked:

Q1	Which of these have you eaten in the past week?
	Taco Bell
	McDonald’s
	None of these

When the data is exported, it should be done using the order in the questionnaire. That is, even though Taco Bell was shown first half the time and second half the time, the data file should be created as if Taco Bell was always shown second.

Rotations and randomizations – between questions

Where different respondents see questions in different orders, this order needs to be removed from the data prior to creating the data file. For example, if the respondents have been asked to rate the appeal of a random selection of three of four different products, and the order has been randomized or rotated, such as in this table:

ID	Q1		Q2		Q3
1	Microsoft	Apple		IBM
2	Apple		Microsoft	IBM
3	IBM		Google		Apple
4	Google		Microsoft	IBM

then the data should be exported as if people had been asked four different questions and all respondents had seen them in the same order:

ID	Q. Microsoft	Q. Apple	Q. IBM		Q. Google
1	Data from Q1	Data from Q2	Data from Q3	NA
2	Data from Q2	Data from Q1	Data from Q3	NA
3	NA		Data from Q3	Data from Q1	Data from Q2
4	Data from Q2	NA		Data from Q3	Data from Q1

Further, the order with which the data was collected should also be exported as additional variables. For example:

ID	Order Microsoft	Order Apple	Order IBM	Order Google
1	1		2		3		NA
2	2		1		3		NA
3	NA		3		1		2
4	2		NA		3		1

Rankings

Ranking questions need to be recorded with a single variable for each item being ranked. Ideally, the most preferred item will have the highest value and the least preferred the lowest, except where the questionnaire expressly indicates an alternative coding.

Un-coded open-ended questions and “other specify”

Verbatim responses to open-ended questions and “other specify” options should be stored as String variables if the data is text and Numeric variables if numeric.

Coded open-ended questions

Where open-ended questions have been coded, these are then included in the data file as if they are standard single or multiple response variables (in particular, the binary format is appropriate for multiple response questions). An additional string variable should store the raw responses.

Variable Names

Variable names should relate to the question numbers. It is often useful if separate question numbering is used for screeners, general questions and classification variables (i.e., S1, S, …., Q1, Q2, …, C1, C2,…). Where a question is represented by multiple variables, please use a common prefix (e.g., Q4a, Q4b, Q4c), rather than out-filing each variable with a different question number (e.g., Q231, Q232, Q233). Where a question is a loop of a multiple response question, this is generally best represented via a common prefix and two separate looping suffixes (e.g., Q4a1, Q4a2, Q4b1, Q4b2). While these are only guidelines, the core principle is to employ a convention that is easily understandable, whereby the variable names are informative as to the structure of the data.

Variables To Be Excluded

Variables that have no possible meaning to the user of the data file should be excluded from the data file. Some data collection programs automatically export useless variables that only relate to the way in which the questionnaire was set up. Examples of variables with no possible meaning that may be exported include:

  • Looped variables, where one variable will have a value of 1 for every respondent, another will have a value of 2 for all respondents, and so on.
  • Variables representing un-used codes in multiple response questions.

Specifications for advanced data types

The following specifications are for more advanced data types:

  • MaxDiff Specifications contains instructions for the specifications of MaxDiff (maximum difference experiments).
  • Experiments Specifications contains specifications for the set up of conjoint, discrete choice and other types of choice experiments.

See also

Setting Up Files With No Metadata