### Abstract

Original language | English |
---|---|
Publication status | Published - 2018 |
Event | Big Data meets Survey Science 2018 - Universitat Pompeu Fabra, Barcelona, Spain. Duration: 25 Oct 2018 → 27 Oct 2018. https://www.bigsurv18.org/ |

### Conference

Conference | Big Data meets Survey Science 2018 |
---|---|
Abbreviated title | BigSurv 2018 |
Country | Spain |
City | Barcelona |
Period | 25/10/18 → 27/10/18 |
Internet address | https://www.bigsurv18.org/ |

### Cite this

*The effect of survey measurement error on clustering algorithms*. Paper presented at Big Data meets Survey Science 2018, Barcelona, Spain.

**The effect of survey measurement error on clustering algorithms.** / Pankowska, P.K.P.; Oberski, D.L.; Pavlopoulos, D.

Research output: Contribution to Conference › Paper › Academic

TY - CONF

T1 - The effect of survey measurement error on clustering algorithms

AU - Pankowska, P.K.P.

AU - Oberski, D.L.

AU - Pavlopoulos, D.

PY - 2018

Y1 - 2018

N2 - Data mining and machine learning often employ a variety of clustering techniques, which aim to separate the data into interesting groups for further analysis or interpretation (Kaufman & Rousseeuw 2005; Aggarwal & Reddy 2014). Examples of well-known algorithms from the data mining literature are K-means, DBSCAN, PAM, Ward, and Gaussian or binomial mixture models - the latter two known, respectively, as latent profile and latent class analysis in the social science literature. Some of these algorithms (K-means, Ward, mixtures) are commonly applied to surveys, while others (DBSCAN, PAM) may be less familiar to survey researchers but can be equally useful. Surveys, however, are well known to contain measurement errors. Such errors may adversely affect clustering - for instance, by producing spurious clusters or by obscuring clusters that would have been detectable without errors. To date, little work has examined the effect that survey errors may exert on commonly used clustering techniques. Furthermore, while adaptations exist that make a few specific clustering algorithms "error-aware" (Aggarwal 2009, Ch. 8; Aggarwal & Reddy 2014, Ch. 18), no generic methods to correct clustering techniques for such errors are available. In this paper, we present a novel method for performing error-aware clustering - that is, clustering with correction for measurement error through multiple imputation (Boeschoten et al. 2018). We investigate how clustering of a large labor force survey differs with and without this correction. Implications for the application of clustering techniques to survey data are discussed.

M3 - Paper

ER -