(Brief) Introduction

In this post, we refer to a class of distance-based clustering methods that combine dimension reduction (i.e., reduce the number of variables) with clustering of the observations. These methods are implemented in R with our package clustrd (Markos, Iodice D’Enza and van de Velden 2018) (available on CRAN). For data sets with continuous variables, the package provides implementations of Factorial K-means (Vichi and Kiers 2001) and Reduced K-means (De Soete and Carroll 1994); both methods combine Principal Component Analysis (PCA) with K-means clustering. For data sets with categorical variables, MCA K-means (Hwang, Dillon and Takane 2006), iFCB (Iodice D’Enza and Palumbo 2013) and Cluster Correspondence Analysis (van de Velden, Iodice D’Enza and Palumbo 2017) are available; these methods combine variants of (Multiple) Correspondence Analysis (MCA) with K-means.

Note that combining dimension reduction methods (e.g., PCA/MCA) with distance-based clustering techniques (e.g., hierarchical clustering, K-means) is not a new strategy and is usually performed in two separate steps: PCA/MCA is applied to the data first, and a clustering algorithm is then applied to the factor scores obtained in the first step. This approach, known as tandem analysis, is not always optimal, because the dimension reduction step optimises a different objective function from the clustering step. As a result, dimension reduction may mask the clustering structure. The ineffectiveness of tandem analysis has been pointed out by Arabie and Hubert (1994), Milligan (1996), Dolnicar and Grün (2008) and van de Velden et al. (2017), among others. In contrast, joint dimension reduction and clustering techniques combine the two criteria (dimension reduction and clustering) into a single objective function, which can be optimized via Alternating Least Squares (ALS). For more details see, e.g., van de Velden et al. (2017) and Markos et al. (2018).
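
For reference, a plain tandem analysis takes only a few lines of base R. The sketch below is purely illustrative (it uses the Iris data introduced in the next section and arbitrary parameter choices): PCA is applied first and K-means is then run on the first two component scores.

# tandem analysis: PCA first, then K-means on the retained component scores
pca_fit = prcomp(iris[,-5], center = TRUE, scale. = TRUE)
scores = pca_fit$x[, 1:2]   # keep the first two principal components
km_fit = kmeans(scores, centers = 3, nstart = 100)
table(km_fit$cluster)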

The clustrd package contains three main functions: cluspca(), clusmca() and tuneclus(). In this post, we will demonstrate cluspca(), a function that implements joint dimension reduction and clustering methods for continuous data.

Applying Reduced K-means to the Iris data set

Let’s try to cluster a set of observations described by continuous variables. How about the Iris flower data set? If you are now screaming “not this one again!”, please be patient - that’s a convenient choice to demonstrate a clustering method.

We begin by installing and loading the package:

# installing clustrd
install.packages("clustrd")
# loading clustrd
library(clustrd)

The cluspca() function has three arguments that must be provided: a data.frame (data), the number of clusters (nclus) and the number of dimensions (ndim). Optional arguments include alpha, method, center, scale, rotation, nstart, smartStart and seed, described below; a short usage sketch follows the list.

alpha is a scalar between 0 and 1; alpha = 0.5 leads to Reduced K-means, alpha = 0 to Factorial K-means, and alpha = 1 reduces to the tandem approach (PCA followed by K-means). alpha can also take intermediate values between 0 and 1, adjusting for the relative importance of RKM and FKM in the objective function. The default value of alpha is 0.5 (Reduced K-means).

method can be either “RKM” for Reduced K-means or “FKM” for Factorial K-means (default is “RKM”).

center and scale are logicals indicating whether the variables should be shifted to be zero centered and scaled to have unit variance, respectively.

rotation specifies the method used to rotate the factors. Options are “none” for no rotation, “varimax” for varimax rotation with Kaiser normalization and “promax” for promax rotation (default is “none”).

nstart indicates the number of random starts (default is 100).

smartStart accepts a cluster membership vector as a starting solution (default is NULL, in which case a starting solution is generated randomly). Note that these methods are iterative, so a smart initialization can help.

seed is an integer passed to set.seed() to initialize the random number generator when smartStart = NULL. The default value is 1234.
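
To make these options concrete, here is a short usage sketch; only data, nclus and ndim are required, and the other values shown are purely illustrative:

# Reduced K-means (the default: alpha = 0.5, method = "RKM")
rkm = cluspca(iris[,-5], nclus = 3, ndim = 2)
# Factorial K-means
fkm = cluspca(iris[,-5], nclus = 3, ndim = 2, method = "FKM")
# tandem approach (PCA followed by K-means)
tandem = cluspca(iris[,-5], nclus = 3, ndim = 2, alpha = 1)
# varimax-rotated Reduced K-means with a fixed seed
rkm_rot = cluspca(iris[,-5], nclus = 3, ndim = 2, rotation = "varimax", seed = 1234)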

We set the number of clusters to 3 and the number of dimensions to 2 (in fact, setting the number of dimensions to the number of clusters minus 1 is recommended in the literature).

set.seed(123)
# apply Reduced K-means to iris (class variable is excluded)
out.cluspca = cluspca(iris[,-5], 3, 2)
summary(out.cluspca)

Here’s a summary of the solution:

Solution with 3 clusters of sizes 53 (35.3%), 50 (33.3%), 47 (31.3%) in 2 dimensions. 
Variables were mean centered and standardized.

Cluster centroids:
            Dim.1   Dim.2
Cluster 1  0.5896  0.7930
Cluster 2 -2.2237 -0.2380
Cluster 3  1.7008 -0.6411

Variable scores:
               Dim.1   Dim.2
Sepal.Length  0.5014 -0.4359
Sepal.Width  -0.2864 -0.8973
Petal.Length  0.5846  0.0024
Petal.Width   0.5699 -0.0699

Within cluster sum of squares by cluster:
[1] 34.5002 44.3823 34.4589
 (between_SS / total_SS =  80.13 %) 

Objective criterion value: 69.4442 

Available output:

 [1] "obscoord"  "attcoord"  "centroid"  "cluster"   "criterion" "size"      "odata"     "scale"    
 [9] "center"    "nstart"

An important feature of Reduced K-means is that observations, cluster centers and variables can be visualized in a low-dimensional space, similar to PCA.

# plotting the RKM solution
plot(out.cluspca, cludesc = TRUE)

This creates a biplot of observations and variables, as shown below. Note that cludesc = TRUE also creates a parallel coordinate plot to facilitate cluster interpretation. For more details, see help(plot.cluspca).

Reduced K-means biplot of observations (points) and variables (biplot axes) of the Iris data set with respect to components 1 (horizontal) and 2 (vertical). Cluster means are labelled C1 through C3.

Reduced K-means Biplot

It is helpful to compute the confusion matrix between the true class partition (iris[,5]) and the one obtained by Reduced K-means (out.cluspca$cluster). The agreement between the two partitions is moderate, with an adjusted Rand index of .62 (a snippet for computing the index is given after the table below). Note, however, that our goal is clustering, not classification.

> table(iris[,5],out.cluspca$cluster)
            
              1  2  3
  setosa      0 50  0
  versicolor 39  0 11
  virginica  14  0 36
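
The adjusted Rand index reported above can be computed, for example, with the adjustedRandIndex() function of the mclust package (one of several packages providing this index; shown here purely as an illustration, it is not part of clustrd):

# install.packages("mclust")  # if not already installed
library(mclust)
# agreement between the species labels and the RKM partition
adjustedRandIndex(iris[,5], out.cluspca$cluster)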

The parallel coordinate plot (shown below) depicts the cluster means for each variable and provides quick insights into the differences between groups. The mean values, depicted on the vertical axis, correspond to mean centered and standardized variables. All instances of setosa are in the second cluster, which has large sepal width and low petal length and width compared to the other clusters. The first cluster contains most instances of versicolor (39 out of 50), characterized by low sepal width. Most instances of virginica (36 out of 50) are in the third cluster, which has high sepal length, petal length and petal width.

Parallel coordinate plot of the cluster means (line thickness is proportional to cluster size).

Parallel Coordinate Plot

Choosing the number of clusters and dimensions

In the case of Iris data, where the true clusters are known, the decision on the number of clusters and dimensions is more or less obvious. In many cases, however, the selection of the most appropriate number of clusters and dimensions requires a careful assessment of solutions corresponding to different parameter choices. To facilitate a quantitative appraisal of solutions corresponding to different parameter settings, the clustrd package provides the function tuneclus(). The function requires a dataset (data argument), the range of clusters (nclusrange) and dimensions (ndimrange), as well as the desired method, through the method argument.

The function provides two well-established distance-based statistics to assess the quality of an obtained clustering solution via the criterion argument; the overall average silhouette width (ASW) index (criterion = “asw”) and the Calinski-Harabasz (CH) index (criterion = “ch”). The ASW index, which ranges from −1 to 1, reflects the compactness of the clusters and indicates whether a cluster structure is well separated or not. The CH index is the ratio of between-cluster variance to within-cluster variance, corrected according to the number of clusters, and takes values between 0 and infinity. In general, the higher the ASW and CH values, the better the cluster separation. Also, the function allows the specification of the data matrix used to compute the distances between observations, via the dst argument. When dst = “full” (default) the appropriate distance measure is computed on the original data. In particular, the Euclidean distance is used for continuous variables and Gower’s distance for categorical variables. When the option is set to “low”, the distance is computed between the low-dimensional object scores.
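
For intuition, the ASW of a given partition can also be computed directly, for instance with the silhouette() function of the cluster package. The sketch below is illustrative (and not necessarily how tuneclus() computes the index internally); it uses Euclidean distances between the low-dimensional object scores, mirroring the dst = “low” setting:

library(cluster)
# Euclidean distances between observations in the reduced space
d = dist(out.cluspca$obscoord)
sil = silhouette(out.cluspca$cluster, d)
# overall average silhouette width
mean(sil[, "sil_width"])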

To demonstrate parameter tuning, we apply Reduced K-means to the Iris data for a range of clusters between 3 and 5 and dimensions ranging from 2 to 4. The Euclidean distance was computed between observations in the low-dimensional space (dst = “low”). The average silhouette width was used as the cluster quality criterion (criterion = “asw”). As shown below, the best solution was obtained for 3 clusters and 2 dimensions.

# Cluster quality assessment based on the average silhouette width 
# in the low dimensional space
bestRKM = tuneclus(iris[,-5], 3:5, 2:4, method = "RKM", criterion = "asw",
          dst = "low")
bestRKM
The best solution was obtained for 3 clusters of sizes 53 (35.3%), 50 (33.3%), 
47 (31.3%) in 2 dimensions, for a cluster quality criterion value of 0.511. 
Variables were mean centered and standardized.

Cluster quality criterion values across the specified range of clusters (rows) 
and dimensions (columns):
      2     3     4
3 0.511            
4 0.444 0.424      
5 0.413 0.359 0.342

Cluster centroids:
            Dim.1   Dim.2
Cluster 1  0.5896  0.7930
Cluster 2 -2.2237 -0.2380
Cluster 3  1.7008 -0.6411

Within cluster sum of squares by cluster:
[1] 34.5002 44.3823 34.4589
 (between_SS / total_SS =  80.13 %) 

Objective criterion value: 69.4442 

Available output:

[1] "clusobjbest" "nclusbest"   "ndimbest"    "critbest"    "critgrid"

The “best” solution can be subsequently plotted.

plot(bestRKM)

Epilogue

For a thorough treatment of joint dimension reduction and clustering methods and additional applications, you can read our JSS paper. The clustrd package is still evolving; refinements, new features and bug fixes are added regularly. The next version will include a function for handling mixed-type data sets, that is, data sets with both continuous and categorical variables, which are frequently encountered in practice.

In a follow-up post, we’ll demonstrate the application of joint dimension reduction and clustering to categorical and mixed data sets.

References

Arabie, P., & Hubert, L. (1994). Cluster Analysis in Marketing Research. In R. P. Bagozzi. (Ed.), Advanced Methods of Marketing Research (pp. 160-189). Cambridge: Blackwell.

De Soete, G., & Carroll, J. D. (1994). K-means clustering in a low-dimensional euclidean space. In E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, & B. Burtschy (Eds.), New Approaches in Classification and Data Analysis (pp. 212-219). Springer-Verlag, Berlin.

Dolnicar, S., & Grün, B. (2008). Challenging “factor–cluster segmentation”. Journal of Travel Research, 47(1), 63-71.

Hwang, H., Dillon, W. R., & Takane, Y. (2006). An Extension of Multiple Correspondence Analysis for Identifying Heterogenous Subgroups of Respondents. Psychometrika, 71, 161-171.

Iodice D’Enza, A., & Palumbo, F. (2013). Iterative factor clustering of binary data. Computational Statistics, 28(2), 789-807.

Markos, A., Iodice D’Enza, A., & van de Velden, M. (2018). Beyond Tandem Analysis: Joint Dimension Reduction and Clustering in R. Journal of Statistical Software (accepted for publication).

Milligan, G. W. (1996). Clustering validation: results and implications for applied analyses. In P. Arabie, L. J. Hubert (Eds.), Clustering and classification (pp. 341-375). River Edge: World Scientific Publications.

van de Velden, M., Iodice D’Enza, A., & Palumbo, F. (2017). Cluster Correspondence Analysis. Psychometrika, 82(1), 158-185.

Vichi, M., & H. A. (2001). Factorial k-means analysis for two-way data. Computational Statistics & Data Analysis, 37(1), 49-64.