Statistical Model for Latent Class Analysis, Mixed-Mode Tree, and Mixed-Mode Cluster Analysis


Model

Latent Class Analysis

The statistical framework employed by Q’s Latent Class Analysis is the finite mixture framework, whereby the density of a given vector of data, [math]\displaystyle{ \mathbf{x} }[/math], is computed as the weighted sum of [math]\displaystyle{ C }[/math] class-specific densities. The parameters of each class, [math]\displaystyle{ \mathbf{\theta}_1,\mathbf{\theta}_2,...,\mathbf{\theta}_C }[/math], and the sizes of each class, [math]\displaystyle{ \pi_1,\pi_2,...,\pi_C }[/math], are estimated from the data:

[math]\displaystyle{ \text{density}(\mathbf{x}) = \sum^C_{c=1}\pi_c g(\mathbf{x}|\mathbf{\theta}_c) }[/math]
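
The following is a minimal numerical sketch of this density, assuming univariate normal class-specific densities [math]\displaystyle{ g }[/math]; the class sizes and parameters below are invented for illustration rather than estimated from data:

<syntaxhighlight lang="python">
import math

def normal_pdf(x, mean, sd):
    """Class-specific density g: here, a univariate normal."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def mixture_density(x, sizes, params):
    """density(x) = sum over classes of pi_c * g(x | theta_c)."""
    return sum(pi_c * normal_pdf(x, *theta_c)
               for pi_c, theta_c in zip(sizes, params))

# Invented two-class example: sizes pi_1, pi_2 and parameters
# theta_1, theta_2 = (mean, standard deviation).
sizes = [0.6, 0.4]
params = [(0.0, 1.0), (3.0, 1.5)]
print(mixture_density(1.0, sizes, params))
</syntaxhighlight>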

Mixed-Mode Tree

Mixed-Mode Tree uses the same statistical model as latent class analysis, except that:

  • The observations being classified are the categories of the independent variables rather than respondents. For example, if age is selected as an independent variable and there are seven age categories, a latent class analysis is performed grouping together these seven categories, with the individual respondents in each of the age breaks treated as repeated measures (a sketch of this reshaping appears after this list).
  • Objective is automatically set to Discrete (see #Estimation).
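
A minimal sketch of the reshaping described in the first point above, with invented respondent records and a hypothetical age variable:

<syntaxhighlight lang="python">
from collections import defaultdict

# Hypothetical respondent records: (age category, response).
respondents = [(1, "yes"), (1, "no"), (2, "yes"),
               (2, "yes"), (3, "no"), (3, "yes")]

# Each age category becomes one observation to be classified; the
# responses of its respondents are treated as repeated measures.
observations = defaultdict(list)
for age_category, response in respondents:
    observations[age_category].append(response)

for category, repeated_measures in sorted(observations.items()):
    print(category, repeated_measures)
</syntaxhighlight>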

Mixed-Mode Cluster Analysis

This is identical to Latent Class Analysis, except that:

  • The priors are assumed to be constant (i.e., [math]\displaystyle{ \pi_c }[/math] is set to 1.0 in the density).
  • Objective is automatically set to Clustering (see #Estimation).

Estimation

Estimation is via:

  • The EM algorithm.
  • A modification of the EM algorithm whereby, in the E-Step, respondents are allocated discretely to segments rather than via probabilities (used when Objective is set to Discrete). This setting can be used to make Q mimic the behavior of other data analysis tools.[1] When only numeric data are used, this algorithm is sometimes referred to as a batch k-means algorithm. A minimal sketch of the discrete E-Step appears after this list.
  • Numerical optimization using a modification of the BFGS algorithm when estimating mixtures of a single Experiment or Ranking question.
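
The sketch below illustrates the discrete E-Step modification for a univariate normal mixture with invented data and starting values; Q's actual implementation covers the full mixed-mode model. Each iteration allocates every observation wholly to its most likely class and then re-estimates the class parameters from that allocation:

<syntaxhighlight lang="python">
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def classification_em(data, n_classes, n_iter=50, seed=0):
    """EM with a discrete E-Step (cf. Celeux and Govaert [1]):
    observations are allocated wholly to their most likely class,
    rather than fractionally via posterior probabilities."""
    rng = random.Random(seed)
    means = rng.sample(data, n_classes)
    sds = [1.0] * n_classes
    sizes = [1.0 / n_classes] * n_classes
    for _ in range(n_iter):
        # Discrete E-Step: hard allocation to the most likely class.
        labels = [max(range(n_classes),
                      key=lambda c: sizes[c] * normal_pdf(x, means[c], sds[c]))
                  for x in data]
        # M-Step: re-estimate each class from its allocated members.
        for c in range(n_classes):
            members = [x for x, label in zip(data, labels) if label == c]
            if members:
                means[c] = sum(members) / len(members)
                sds[c] = max(1e-6, math.sqrt(sum((x - means[c]) ** 2
                                                 for x in members) / len(members)))
                sizes[c] = len(members) / len(data)
    return means, sds, sizes

print(classification_em([0.1, 0.3, -0.2, 5.0, 5.2, 4.8], 2))
</syntaxhighlight>

With only numeric variables and equal, fixed variances, the hard allocation above reduces to the batch k-means algorithm mentioned in the list.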

Categorical questions

Variables in Pick Any, Pick One - Multi and Pick One questions are modeled as draws from independent multinomial distributions. Where a Pick One - Multi question is included in an analysis, Q’s algorithm takes no account of the grouping of variables into the question. For example, a Pick One - Multi question with 10 variables is treated by Q as being identical to 10 Pick One questions.
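
As a sketch of the class-specific density for categorical data (the category probabilities below are invented), the density of a respondent's responses within a class is the product of independent multinomial probabilities, one per variable, regardless of how the variables are grouped into questions:

<syntaxhighlight lang="python">
# Hypothetical class-specific multinomial parameters for two
# Pick One variables; each variable's probabilities sum to 1.
class_probs = [
    {"yes": 0.7, "no": 0.3},
    {"a": 0.2, "b": 0.5, "c": 0.3},
]

def categorical_density(responses, probs):
    """Product of independent multinomial probabilities."""
    density = 1.0
    for response, variable_probs in zip(responses, probs):
        density *= variable_probs[response]
    return density

print(categorical_density(["yes", "b"], class_probs))  # 0.7 * 0.5 = 0.35
</syntaxhighlight>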

Ranking questions

Ranking questions are modeled as exploded categorical questions with ties. For example, the density of a ranking of three categories is treated as the product of the density of a Pick One question with three categories and a Pick One question with two categories. This model is a rank-ordered logit model with ties [2]. See http://surveyanalysis.org/wiki/Rank-Ordered_Logit_Model_With_Ties.
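
A minimal sketch of the exploding idea for a complete ranking without ties follows; the utilities are invented, and the treatment of ties follows Allison and Christakis [2]. The ranking a > b > c decomposes into a Pick One among three categories followed by a Pick One among the remaining two:

<syntaxhighlight lang="python">
import math

def ranking_density(ranking, utilities):
    """Exploded rank-ordered logit density of a complete ranking:
    at each stage, the next-ranked alternative is 'picked' from
    those not yet ranked."""
    remaining = list(utilities)
    density = 1.0
    for alternative in ranking[:-1]:  # the final pick is certain
        exp_u = {a: math.exp(utilities[a]) for a in remaining}
        density *= exp_u[alternative] / sum(exp_u.values())
        remaining.remove(alternative)
    return density

# Hypothetical utilities for three alternatives.
print(ranking_density(["a", "b", "c"], {"a": 1.0, "b": 0.5, "c": 0.0}))
</syntaxhighlight>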

Approximations are employed in situations where numerical precision makes exact computation impossible or impractical. This can cause the log-likelihood to decrease from time to time during estimation.

Numeric and Experiment questions

Variables in Number - Multi and Number questions are modeled as draws from independent normal distributions. These distributions are constrained in terms of the standard deviations of the variables. If Pool variance is selected for the Distribution (in Advanced), the model pools the variance of each variable across classes; to use the nomenclature of Fraley and Raftery[3], this model is diagonal and assumes equal volume and shape. If Multivariate Normal - Spherical is selected, a single variance is used for all variables; if Pool variance is also selected, a single variance is assumed for all variables and all classes (spherical, equal volume) wherever a multivariate normal distribution is assumed. Pool variance is not used with a latent class analysis of a ratings-based Experiment question.
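
These constraints can be summarized as restrictions on the class-specific standard deviations. A minimal sketch, with invented means and standard deviations for one class of a two-variable analysis:

<syntaxhighlight lang="python">
import math

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def class_density(x, means, sds):
    """Independent normals: the class-specific density is the
    product of per-variable normal densities."""
    return math.prod(normal_pdf(xv, m, s)
                     for xv, m, s in zip(x, means, sds))

x = [1.0, 2.0]
# Diagonal: each variable has its own standard deviation
# (pooled across classes when Pool variance is selected).
diagonal = class_density(x, [0.0, 1.5], [1.0, 3.0])
# Spherical: a single standard deviation shared by all variables
# (and by all classes as well when Pool variance is selected).
spherical = class_density(x, [0.0, 1.5], [2.0, 2.0])
print(diagonal, spherical)
</syntaxhighlight>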

When Multivariate Normal – Spherical is selected, the model assumes that differences between all variables are equally important and that variables are measured on the same scale. Thus, if one variable has a much higher standard deviation than another, this variable will, all else being equal, be more likely to differ across the resulting classes. Multivariate Normal – Spherical is generally to be preferred if the variables are all known to be measured on the same scale (e.g., utilities from conjoint analysis). This assumption is implicit in most cluster analysis algorithms.

When Multivariate Normal – Diagonal is selected, the model essentially assumes that the variables are measured on different scales and takes this into account, so that variables that are highly correlated end up being the key discriminators between the classes, even if they have small standard deviations (all else being equal). This is essentially equivalent to transforming variables into z-scores prior to a traditional cluster analysis.

Question weights

You can modify the importance of a particular question in determining the final solution by modifying its Weight setting (press Advanced in the Segments dialog box).

All else being equal, increasing the weight of a question will increase the role that it plays in the segmentation. Setting Weight to 0 for a question is equivalent to ignoring a question, whereas setting Weight to 2 is equivalent to putting a question into the analysis twice.

Technical details

  • If you wish to have different weights for different variables within a question, either:
    • Split the question apart on the Variables and Questions tab (e.g., right-click and select Revert to Source).
    • If the question is Number - Multi, you can multiply individual variables by a constant; provided that the Distribution is set to Multivariate Normal - Spherical, this will change the weight of that variable (in the same way that multiplying variables by a constant changes their importance in cluster analysis).
  • Where d is the density of a particular question, this is replaced by [math]\displaystyle{ d^{weight} }[/math] when computing the likelihood (see the sketch after this list).
  • Information Criteria are invalid if any weight is set to more than 1 (strictly speaking, the theory of information criteria is only applicable if weights are set to 1 for all questions).
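
The replacement of each question's density d by d^weight means the log-likelihood is a weighted sum of per-question log densities; a minimal sketch with invented densities:

<syntaxhighlight lang="python">
import math

# Hypothetical per-question densities for one respondent.
question_densities = [0.4, 0.1, 0.25]
weights = [1.0, 2.0, 0.0]  # 2 counts a question twice; 0 ignores it

# Replacing d by d ** weight makes each question's log-likelihood
# contribution equal to weight * log(d).
log_likelihood = sum(w * math.log(d)
                     for d, w in zip(question_densities, weights)
                     if w > 0)
print(log_likelihood)
</syntaxhighlight>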

References

  1. Celeux, G. and G. Govaert (1992). "A Classification EM Algorithm for Clustering and Two Stochastic Versions." Computational Statistics and Data Analysis 14: 315-332.
  2. Allison, P. D. and N. A. Christakis (1994). "Logit Models for Sets of Ranked Items." Sociological Methodology 24: 199-228.
  3. Fraley, C. and A. E. Raftery (2002). "Model-Based Clustering, Discriminant Analysis, and Density Estimation." Journal of the American Statistical Association 97: 611-631. and Fraley, C. and A. E. Raftery (2006). "MCLUST Version 3 for R: Normal Mixture Modeling and Model-Based Clustering." Department of Statistics, University of Washington, Technical Report no. 504.

Further reading

Latent Class Analysis Software