# Create New Variables - Automatically Combine Categories - By Value - Percentiles

Create a new variable with categories formed by specifying percentiles.

The features in **Automatically Combine Categories - By Value** allow you to create categories from numeric data using one of several different rules. Each category contains a range of values. This is sometimes referred to as *banding*, *binning*, or *aggregating*.

There are four different methods available for combining numeric values in this tool:

- **Tidy categories** divides the range of values into *tidy* categories, which are intervals of 2, 5, 10, 20, 50, and so on.
- **Percentiles** divides the range up into percentiles. This is useful when you want to create categories which contain even proportions of the cases in your data.
- **Equally spaced categories** divides the range of values into categories that all have the same range. This is similar to **Tidy categories**, but has additional options for customization.
- **Custom categories** divides the range of values up into ranges of your choice. This is useful when you want to specify uneven ranges of values.

The **Percentiles** method breaks the range of values into categories which contain specified percentages of the data. This option is useful when you want categories that represent even proportions of the data (for example, if you want five categories, each of which contains 20% of the cases). Alternatively, it is useful when your analysis requires grouping cases into specific proportions (for example, if you want one category which contains cases with the lowest 50% of the scores, another category which contains the next 45%, and a final category which contains the top 5% of scores).

It is important to note that when working with percentiles, if the data are clumpy (that is, many cases share the same value), the new categories will not always contain the exact proportions requested.
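The effect of clumpy data can be illustrated with a minimal sketch in plain JavaScript. It assumes a nearest-rank percentile rule and categories that include their upper cut point; neither is confirmed as the rule this tool uses, and the names `percentileValue` and `categoryShares` are hypothetical helpers, not part of any QScript library.

```javascript
// Hypothetical helper: value at percentile p (0 to 100), nearest-rank rule (an assumption).
function percentileValue(sortedValues, p) {
    const rank = Math.ceil((p / 100) * sortedValues.length);
    return sortedValues[Math.max(rank, 1) - 1];
}

// Share of cases in each category. Each category holds the values above the
// previous cut point, up to and including its own cut point.
function categoryShares(values, percentages) {
    const sorted = values.slice().sort((a, b) => a - b);
    const cutPoints = percentages.map(p => percentileValue(sorted, p));
    const counts = new Array(cutPoints.length + 1).fill(0);
    for (const v of sorted) {
        let index = cutPoints.findIndex(cut => v <= cut);
        if (index === -1) index = cutPoints.length; // above the last cut point
        counts[index] += 1;
    }
    return counts.map(c => c / sorted.length);
}

const evenlySpread = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10];
const clumpy = [1, 1, 1, 1, 1, 1, 1, 8, 9, 10];
console.log(categoryShares(evenlySpread, [50])); // shares of 0.5 and 0.5
console.log(categoryShares(clumpy, [50]));       // shares of 0.7 and 0.3
```

In the clumpy data, seven of the ten cases share the value at the 50th percentile, so they all fall into the first category and the requested 50/50 split cannot be achieved.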

## Examples

Consider the following numeric data, which has recorded the amount of time (in seconds) for people to complete a particular task:

The times range from `10` seconds to `1004` seconds. Applying the **Percentiles** option with **Percentages** set to `10` results in new categories which each represent roughly `10%` of the data:

There are a few key points to note:

- The categories are not exactly `10%`, which results from slight clumpiness in the data.
- The **Label style** has been set to **Tidy labels**, which results in the ranges of data values in each percentile being displayed. Compare this to the next example, which produces labels which describe the percentiles as percentages.

In some cases, you may be interested in uneven percentile groupings. For example, you may wish to compare the cases with the top 5% of scores to those with the next highest 10%, and so on. To achieve this, you can type a list of percentages in the **Percentages** field. For example, if you wish to divide the range at 50%, 80%, 90%, and 95% you can enter `50, 80, 90, 95`, which results in new categories for the data from the example above:

Note that this time the **Label style** is set to **Percentiles**.

The behavior of the **Percentiles** method is slightly different when the variable you are working with contains categories which represent *ranges* of values. An example of this kind of data is the following, which comes from a survey question asking respondents for their income bracket:

Here, each label represents a range of incomes with two numbers. That is, except for the first and last category, which contain single numbers: `Under $1,000` and `$200,000 or over`. Applying the **Percentiles** method to data like this results in new categories which have as even a distribution as possible. For example, applying **Percentiles** to the example above and choosing the **Number of categories** to be `5` gives the following set of combined categories:

Note that:

- While we have asked for `5` categories, the table shows `7` because there are two categories in the original data which don't contain any numbers and hence don't make sense to combine with the rest of the data.
- Each income bracket has only been combined with those which are directly above or below it, so that the new categories still represent meaningful ranges.
- The start and end points for the new categories correspond to start and end points of the original data.
- The percentages are as even as possible based on the specified number of categories.
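The idea of combining only adjacent brackets into roughly even groups can be sketched as follows. This is a simple greedy rule written for illustration; the rule the tool actually applies is not documented here and may differ. The function name `combineAdjacent`, the bracket labels, and the counts are all hypothetical.

```javascript
// Sketch only: a simple greedy grouping, not necessarily the tool's own rule.
// `brackets` is an ordered array of {label, count}; all names are hypothetical.
function combineAdjacent(brackets, numberOfCategories) {
    const total = brackets.reduce((sum, b) => sum + b.count, 0);
    const groups = [];
    let current = [];
    let cumulative = 0;
    brackets.forEach((bracket, i) => {
        current.push(bracket.label);
        cumulative += bracket.count;
        // Cumulative count at which the group being built should ideally end.
        const target = (total * (groups.length + 1)) / numberOfCategories;
        const bracketsLeft = brackets.length - i - 1;
        const groupsLeft = numberOfCategories - groups.length - 1;
        const isLastBracket = bracketsLeft === 0;
        // Close early when each remaining bracket is needed for its own group.
        const mustClose = bracketsLeft === groupsLeft;
        if (isLastBracket || (groupsLeft > 0 && (cumulative >= target || mustClose))) {
            groups.push(current);
            current = [];
        }
    });
    return groups;
}

// Made-up counts for seven ordered brackets.
const exampleBrackets = [
    { label: 'b1', count: 5 }, { label: 'b2', count: 15 }, { label: 'b3', count: 20 },
    { label: 'b4', count: 10 }, { label: 'b5', count: 25 }, { label: 'b6', count: 15 },
    { label: 'b7', count: 10 }
];
console.log(combineAdjacent(exampleBrackets, 5));
// Groups: b1+b2, b3, b4+b5, b6, b7
```

Because brackets are only ever merged with their neighbours, every combined category still describes one continuous range, and its start and end points coincide with boundaries from the original data.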

## Usage

- Select a table showing the question you want to use to create combined categories, or select one or more variables in the **Variables and Questions** tab.
- Select this option from **Automate > Browse Online Library > Automatically Combine Categories**.
- To change how the categories are combined:
  - Select the new variable or question in the **Variables and Questions** tab.
  - Right-click and select **Edit R Variable**.
  - Choose the desired options in the **Inputs** section on the right.
  - Click **Update R Variable**.

## Options

The settings that are available for **Automatically Combine Categories - By Value** change depending on which *method* you choose, and whether the input data are categorical. Settings that are available for all methods are shown first, and method-specific settings are shown below.

**Variable(s)** Choose which variable(s) are being used to create new categories.

**Use numbers found in category labels** This option will only appear if there are any categorical variables selected in **Variable(s)**. When this is ticked, the tool will try to identify numeric values from the data labels and use those values when forming categories. If this is not ticked, the tool will ignore the category labels and will instead use the underlying data values for each category.

**Labels contain** This option will also only appear if there are any categorical variables selected in **Variable(s)**. This option allows you to communicate the nature of the values in the labels:

- **Single values** This causes the tool to assume that your labels always contain single values. Any labels which contain more than one numeric value will not be combined.
- **Ranges of values** This causes the tool to assume your data labels describe ranges of values. The tool will expect to find a pair of values in each label, with the exception of the lowest and highest categories in the range.

**Method** The method to be used to identify ranges of values to use for creating new categories. The different methods are described above.

**Category boundary** Whether the value at the start of each range is included in the new category, or the value at the end of each range is included in the new category. For example, if dividing the interval up into ranges of 10, do we create a category which is `10 to 19` (**Start of range**) or `11 to 20` (**End of range**).
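The difference between the two boundary settings can be sketched for whole-number data. This is an illustration only: it assumes integer values and a fixed width, and `bandFor` is a hypothetical helper rather than a function from the QScript library.

```javascript
// Hypothetical helper: label of the band containing a whole-number value,
// for bands of a fixed width, under each Category boundary setting.
function bandFor(value, width, boundary) {
    if (boundary === 'Start of range') {
        // The value at the start of each range belongs to the category.
        const start = Math.floor(value / width) * width;
        return start + ' to ' + (start + width - 1);
    }
    // 'End of range': the value at the end of each range belongs to the category.
    const end = Math.ceil(value / width) * width;
    return (end - width + 1) + ' to ' + end;
}

console.log(bandFor(10, 10, 'Start of range')); // 10 to 19
console.log(bandFor(20, 10, 'End of range'));   // 11 to 20
console.log(bandFor(20, 10, 'Start of range')); // 20 to 29
```

The same value, `20`, lands in a different category depending on the setting, which is why the choice matters when many cases sit exactly on a boundary.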

**Label style** Determine the style of the labels for the new categories.

- **Tidy labels** describes the range of values that is contained within each range in English. For example, `10 to 20`.
- **Inequality notation** describes the range of values that is contained within each range using greater-than and less-than symbols. For example, `10 to <21`.
- **Interval notation** describes the range of values using interval notation. For example, `[10, 20)` describes a range of values which includes the value `10` and ranges up to `20` but does not include the value of `20`. Similarly, `(10, 20]` describes a range which does not include `10` but includes all values greater than `10`, up to and including the value of `20`.
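Interval notation can be made concrete with a short sketch that builds the label and the matching membership test together. Linking the bracket style to the **Category boundary** setting is an assumption made for illustration, and `makeInterval` is a hypothetical helper.

```javascript
// Hypothetical helper: builds an interval label plus a function that tests
// whether a value belongs to that interval.
function makeInterval(start, end, boundary) {
    const includeStart = boundary === 'Start of range';
    return {
        // Square bracket: endpoint included. Round bracket: endpoint excluded.
        label: (includeStart ? '[' : '(') + start + ', ' + end + (includeStart ? ')' : ']'),
        contains: value => includeStart
            ? value >= start && value < end
            : value > start && value <= end
    };
}

const closedAtStart = makeInterval(10, 20, 'Start of range');
console.log(closedAtStart.label);        // [10, 20)
console.log(closedAtStart.contains(10)); // true
console.log(closedAtStart.contains(20)); // false

const closedAtEnd = makeInterval(10, 20, 'End of range');
console.log(closedAtEnd.label);          // (10, 20]
console.log(closedAtEnd.contains(20));   // true
```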

**Use open-ended labels** When ticked, this setting produces open-ended category labels at the start and end of the range. For example, an open-ended label would be `Less than 10`, whereas a non-open-ended label would be `0 to 9`.

**Number prefix / Number suffix** Allows you to add text before and after the numeric value in each new label. For example, you may wish to add a dollar-sign or other currency symbol in front of each number in the new labels. This is not available if using **Ranges of values**, where the text to place before and after is drawn from the original category labels.

**Decimals in label** Choose the number of decimals that are displayed in the new category labels.

**Decimals symbol / Thousands symbol** These options are only available when there are categorical variables selected in **Variable(s)**. They allow you to communicate the number format of the data labels. For example, the number convention in the United States and other countries is to use periods to denote the position for decimal values, and commas to separate units of thousands, resulting in numbers of the form `10,000.50`. In other countries, the convention is reversed, so that the same value may be represented by `10.000,50`.
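To show why these two symbols need to be specified, here is a minimal sketch of reading a number from label text under either convention. `parseLabelNumber` is a hypothetical helper, not the tool's own parser, and it assumes the text contains nothing but the number.

```javascript
// Hypothetical helper: reads a number from text given the symbols used for
// decimals and thousands. Assumes the text holds only the number itself.
function parseLabelNumber(text, decimalSymbol, thousandsSymbol) {
    // Drop the thousands separators first, then normalize the decimal symbol.
    const withoutThousands = text.split(thousandsSymbol).join('');
    const normalized = withoutThousands.split(decimalSymbol).join('.');
    return parseFloat(normalized);
}

console.log(parseLabelNumber('10,000.50', '.', ',')); // 10000.5
console.log(parseLabelNumber('10.000,50', ',', '.')); // 10000.5
```

With the symbols swapped by mistake, `10.000,50` would be read as a value near ten rather than ten thousand, which would place the category in the wrong band.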

### Tidy categories

**Target Number of categories** This option allows you to specify how many categories should be created when using the **Tidy categories** option. Note that the **Tidy categories** algorithm always tries to make ranges of values which are intervals of 2, 5, 10, 20, 50, etc., and so it will not always be able to achieve the desired number of categories. For more control, you can try using the **Equally spaced categories** method instead.

### Percentiles

**Percentages** This option allows you to control how categories are created when using the **Percentiles** method. You can enter a single number here, to divide the range up into even percentiles. For example, entering `10` will create the 10th, 20th, 30th, etc. percentiles. Alternatively, if you don't want even percentiles, you can enter a comma-separated list of percentiles. For example, you can enter `50, 75, 85, 90, 95, 99, 100` to create these uneven percentiles.
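The two ways of filling in this field can be sketched as a small parser. `parsePercentages` is a hypothetical helper; whether the tool itself includes the 100th percentile when expanding a single number is an assumption of this sketch.

```javascript
// Hypothetical helper: turns the text of the Percentages field into a list
// of percentiles. A single number is expanded into even steps up to 100.
function parsePercentages(input) {
    const numbers = input.split(',').map(part => parseFloat(part.trim()));
    if (numbers.some(n => Number.isNaN(n) || n <= 0)) {
        throw new Error('Percentages must be positive numbers');
    }
    if (numbers.length > 1) {
        return numbers; // an explicit list is used as entered
    }
    const step = numbers[0];
    const percentiles = [];
    for (let p = step; p <= 100; p += step) {
        percentiles.push(p);
    }
    return percentiles;
}

console.log(parsePercentages('25'));             // 25, 50, 75, 100
console.log(parsePercentages('50, 80, 90, 95')); // 50, 80, 90, 95
```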

### Equally spaced categories

**Number of categories** Specify how many categories will be created. Alternatively, rather than dividing the range into a set number of categories, you can set the **Increment** option to divide the range into categories that have a set width.

**Start point / End point** Use these options to determine where you want the range to start and end. This range will then be divided up according to the **Number of categories** setting, or the **Increment** setting.

**Increment** Use this option if you want to specify how wide each interval should be instead of specifying how many categories you wish to create.
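How **Start point**, **End point**, **Number of categories**, and **Increment** interact can be sketched as follows. `equallySpacedBreaks` and its option names are hypothetical, and how the tool handles a range that is not a whole multiple of the increment is an assumption (here the final category is simply narrower).

```javascript
// Hypothetical helper: boundaries for equally spaced categories, given either
// a number of categories or an increment (the width of each category).
function equallySpacedBreaks(start, end, options) {
    const breaks = [];
    if (options.increment !== undefined) {
        if (options.increment <= 0) throw new Error('Increment must be positive');
        for (let i = 0; start + i * options.increment < end; i++) {
            breaks.push(start + i * options.increment);
        }
        breaks.push(end); // the last category may be narrower than the others
    } else {
        const n = options.numberOfCategories;
        for (let i = 0; i <= n; i++) {
            breaks.push(start + ((end - start) * i) / n);
        }
    }
    return breaks;
}

console.log(equallySpacedBreaks(0, 100, { numberOfCategories: 4 })); // 0, 25, 50, 75, 100
console.log(equallySpacedBreaks(0, 100, { increment: 30 }));         // 0, 30, 60, 90, 100
```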

### Custom categories

**Cut points** Use this field to specify which values should define the new categories.

**Always include highest and lowest values** When this option is ticked, the entire range of values will always be included in the set of categories even if the highest and lowest values are not entered in **Cut points**. Turning this option off will result in missing values for any numeric values which fall outside the range of values specified in **Cut points**. This is useful if you want to exclude values which are too low or too high (e.g. outliers, respondents providing unrealistically high or low values in a survey).
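The effect of this option can be sketched in plain JavaScript. `categorize` is a hypothetical helper; the label format and the choice to include each category's starting value are assumptions made for the example, and `null` stands in for a missing value.

```javascript
// Hypothetical helper: assigns each value to a category defined by cut points.
// Values outside the cut points become null (missing) unless the range is
// extended to the lowest and highest values in the data.
function categorize(values, cutPoints, alwaysIncludeExtremes) {
    const bounds = cutPoints.slice().sort((a, b) => a - b);
    if (alwaysIncludeExtremes) {
        const lowest = Math.min(...values);
        const highest = Math.max(...values);
        if (lowest < bounds[0]) bounds.unshift(lowest);
        if (highest > bounds[bounds.length - 1]) bounds.push(highest);
    }
    return values.map(v => {
        for (let i = 0; i < bounds.length - 1; i++) {
            const isLastCategory = i === bounds.length - 2;
            // Each category includes its start; the last also includes its end.
            if (v >= bounds[i] && (v < bounds[i + 1] || (isLastCategory && v === bounds[i + 1]))) {
                return bounds[i] + ' to ' + bounds[i + 1];
            }
        }
        return null; // outside the range covered by the cut points
    });
}

const responses = [5, 15, 25, 500];
console.log(categorize(responses, [10, 20, 30], false));
// null, '10 to 20', '20 to 30', null
console.log(categorize(responses, [10, 20, 30], true));
// '5 to 10', '10 to 20', '20 to 30', '30 to 500'
```

With the option off, the implausibly high response of `500` is left out of every category, which is the behaviour described above for excluding outliers.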

### Ranges of values

**Number of categories** Specify how many categories will be created. When the input data are categorical, and the category labels contain ranges, this tool will combine adjacent ranges into this many categories.

**Start of range / End of range** Many examples of range categories, like `Age` or `Income` questions from surveys, contain ranges that are open-ended at the start and end of the range. For example, an `Age` question may start with `18 years or younger` and end with `65 years and older`. If combining such range categories into **Equally spaced categories**, the algorithm does not know the start and end points for the range of values. These fields allow you to communicate values to use as the highest and lowest in the range. For example, if you conduct a survey, and the youngest respondents who were allowed to complete the survey were 13 years old, entering `13` in **Start of range** would have the effect of converting `18 years or younger` to `13 to 18` from the perspective of finding ranges of values in this algorithm.
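The conversion described above can be sketched with a small label parser. `labelToRange` and its `position` argument are hypothetical, the regular expression handles only plain digits (no thousands separators or currency handling), and the end value of `99` in the example is made up.

```javascript
// Hypothetical helper: converts a category label into a [low, high] range.
// Open-ended first and last labels are completed with the supplied values.
function labelToRange(label, position, startOfRange, endOfRange) {
    const numbers = (label.match(/\d+(\.\d+)?/g) || []).map(Number);
    if (numbers.length >= 2) {
        return [numbers[0], numbers[1]];
    }
    if (numbers.length === 1 && position === 'first') {
        return [startOfRange, numbers[0]];
    }
    if (numbers.length === 1 && position === 'last') {
        return [numbers[0], endOfRange];
    }
    return null; // no usable numbers, so the category cannot be combined
}

console.log(labelToRange('18 years or younger', 'first', 13, 99)); // 13 and 18
console.log(labelToRange('25 to 34 years', 'middle', 13, 99));     // 25 and 34
console.log(labelToRange('65 years and older', 'last', 13, 99));   // 65 and 99
console.log(labelToRange('Prefer not to say', 'middle', 13, 99));  // null
```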

## How to apply this QScript

- Start typing the name of the QScript into the **Search features and data** box in the top right of the Q window.
- Click on the QScript when it appears in the **QScripts and Rules** section of the search results.

*OR*

- Select **Automate > Browse Online Library**.
- Select this QScript from the list.

## Customizing the QScript

This QScript is written in JavaScript and can be customized by copying and modifying the JavaScript.

### Customizing QScripts in Q4.11 and more recent versions

- Start typing the name of the QScript into the **Search features and data** box in the top right of the Q window.
- Hover your mouse over the QScript when it appears in the **QScripts and Rules** section of the search results.
- Press **Edit a Copy** (bottom-left corner of the preview).
- Modify the JavaScript (see QScripts for more detail on this).
- Either:
  - Run the QScript, by pressing the blue triangle button.
  - Save the QScript and run it at a later time, using **Automate > Run QScript (Macro) from File**.

### Customizing QScripts in older versions

## JavaScript

```
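// Load the online library of functions for automatically combining categories.
includeWeb('QScript Functions for Automatically Combining Categories');
// Create the combined-category variables by value, using the Percentiles method.
createAutomaticallyCombinedCategoryVariables('Value', {method: 'Percentiles'});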
```

## See also

- QScript for more general information about QScripts.
- QScript Examples Library for other examples.
- Online JavaScript Libraries for the libraries of functions that can be used when writing QScripts.
- QScript Reference for information about how QScript can manipulate the different elements of a project.
- JavaScript for information about the JavaScript programming language.
- Table JavaScript and Plot JavaScript for tools for using JavaScript to modify the appearance of tables and charts.
