Multiple Correspondence Analysis for Tall Data Sets

Angelos Markos (1), George Menexes (2) and Theophilos Papapadimitriou (3)

(1) Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
(2) Lab of Agronomy, School of Agriculture, Aristotle University of Thessaloniki, Greece
(3) Department of International Economic Relations and Development, Democritus University of Thrace, Komotini, Greece

A critical step of the Multiple Correspondence Analysis (MCA) algorithm is the Singular Value Decomposition (SVD) of a coded matrix. The size of this matrix drastically affects the computational cost of the analysis: as the matrix grows, the method becomes computationally expensive or even infeasible. We propose an alternative MCA scheme that overcomes this limitation without affecting the accuracy of the results. A set of Monte Carlo simulations and real data applications showed the efficiency of the proposed approach over the standard one, especially for tall data sets. We provide Matlab and R implementations of MCA in two versions: a) the standard MCA algorithm and b) the modified MCA algorithm.
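To make the size issue concrete, the following minimal Python sketch codes a small categorical table as a 0/1 indicator matrix and accumulates its cross-product (Burt-type) matrix in a single pass over the rows. It is an illustration only, not the scheme implemented in mcastd or mcatall: it assumes that the "coded matrix" is an indicator matrix, and the function names and toy data are hypothetical. The point it shows is that the indicator matrix has one row per observation, while the cross-product matrix has one row and one column per category, so its size does not depend on the number of observations.

```python
def indicator_matrix(rows):
    """Code categorical rows as a 0/1 indicator matrix (one column per category)."""
    n_vars = len(rows[0])
    # Sorted category levels observed for each variable.
    levels = [sorted({row[j] for row in rows}) for j in range(n_vars)]
    # Map (variable index, level) -> column index of the indicator matrix.
    columns = {}
    for j, var_levels in enumerate(levels):
        for level in var_levels:
            columns[(j, level)] = len(columns)
    Z = []
    for row in rows:
        coded = [0] * len(columns)
        for j, level in enumerate(row):
            coded[columns[(j, level)]] = 1
        Z.append(coded)
    return Z, columns


def burt_matrix(rows, columns):
    """Accumulate the J x J cross-product matrix Z'Z in one pass over the rows."""
    J = len(columns)
    B = [[0] * J for _ in range(J)]
    for row in rows:
        # Columns that are "switched on" for this observation.
        active = [columns[(j, level)] for j, level in enumerate(row)]
        for a in active:
            for b in active:
                B[a][b] += 1
    return B


# Toy data (made up): 6 observations on 2 categorical variables.
rows = [("a", "x"), ("b", "y"), ("a", "y"), ("c", "x"), ("b", "x"), ("a", "x")]
Z, columns = indicator_matrix(rows)
B = burt_matrix(rows, columns)

print(len(Z), "x", len(Z[0]))  # 6 x 5: grows with the number of observations
print(len(B), "x", len(B[0]))  # 5 x 5: depends only on the number of categories
```

With more observations the first dimension of the indicator matrix keeps growing, whereas the cross-product matrix stays at the same size; any decomposition applied to the smaller matrix is therefore insensitive to how tall the data are.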

Software

Matlab Implementation

Standard MCA algorithm - mcastd.m
Modified (fast) MCA algorithm - mcatall.m

R Implementation

Standard MCA algorithm - mcastd.r
Modified (fast) MCA algorithm - mcatall.r

Datasets

A 267,264 x 6 dataset - talldata.txt