The effect of survey measurement error on clustering algorithms

Research output: Contribution to Conference › Paper › Academic

Abstract

Data mining and machine learning often employ a variety of clustering techniques, which aim to separate the data into interesting groups for further analysis or interpretation (Kaufman & Rousseeuw 2005; Aggarwal & Reddy 2014). Examples of well-known algorithms from the data mining literature are K-means, DBSCAN, PAM, Ward, and Gaussian or Binomial mixture models; the last two are known in the social science literature as latent profile and latent class analysis, respectively. Some of these algorithms (K-means, Ward, mixtures) are commonly applied to surveys, while others (DBSCAN, PAM) may be less familiar to survey researchers but can be equally useful. Surveys, however, are well known to contain measurement errors. Such errors may adversely affect clustering, for instance by producing spurious clusters or by obscuring clusters that would have been detectable without errors. To date, little work has examined the effect that survey errors may exert on commonly used clustering techniques. Furthermore, while adaptations exist that make a few specific clustering algorithms "error-aware" (Aggarwal 2009, Ch. 8; Aggarwal & Reddy 2014, Ch. 18), no generic methods to correct clustering techniques for such errors are available. In this paper, we present a novel method for performing error-aware clustering, that is, clustering with correction for measurement error through multiple imputation (Boeschoten et al. 2018). We investigate how clustering of a large labor force survey differs with and without this correction. Implications for the application of clustering techniques to survey data are discussed.
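The claim that measurement error can obscure otherwise detectable clusters can be illustrated with a small simulation. The sketch below is not the authors' method and does not implement the multiple-imputation correction; it only assumes two true groups on a single variable, adds normally distributed error of increasing size, and runs a hand-written two-cluster K-means on the observed values. All function names, group means, and error sizes are hypothetical choices made for illustration.

```python
import random


def two_means_1d(values, max_iter=100):
    """Lloyd's algorithm with k=2 on 1-D data; returns a 0/1 label per value."""
    low, high = min(values), max(values)  # deterministic starting centres
    labels = [0] * len(values)
    for _ in range(max_iter):
        # Assignment step: each value goes to its nearest centre.
        labels = [0 if abs(v - low) <= abs(v - high) else 1 for v in values]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        if not group0 or not group1:
            break
        # Update step: centres move to the group means.
        new_low = sum(group0) / len(group0)
        new_high = sum(group1) / len(group1)
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return labels


def agreement(labels, truth):
    """Share of units placed in their true group, allowing for label switching."""
    match = sum(a == b for a, b in zip(labels, truth)) / len(truth)
    return max(match, 1.0 - match)


def simulate(error_sd, n=500, seed=1):
    """Cluster error-prone observations and compare with the true groups."""
    rng = random.Random(seed)
    truth = [i % 2 for i in range(n)]
    # Assumed true scores: group means 0 and 4, within-group SD 1.
    true_scores = [rng.gauss(0.0 if t == 0 else 4.0, 1.0) for t in truth]
    # Assumed error model: independent additive normal measurement error.
    observed = [x + rng.gauss(0.0, error_sd) for x in true_scores]
    return agreement(two_means_1d(observed), truth)


if __name__ == "__main__":
    for sd in (0.0, 1.0, 3.0):
        print(f"error SD {sd}: agreement with true groups = {simulate(sd):.3f}")
```

Under these assumptions, agreement with the true grouping drops as the error standard deviation grows, which is the "obscured clusters" case. Showing spurious clusters would need a different setup, for example a single true group with non-normal error.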
Original language: English
Publication status: Published - 2018
Event: Big Data meets Survey Science 2018 - Universitat Pompeu Fabra, Barcelona, Spain
Duration: 25 Oct 2018 to 27 Oct 2018
https://www.bigsurv18.org/

Conference

Conference: Big Data meets Survey Science 2018
Abbreviated title: BigSurv 2018
Country: Spain
City: Barcelona
Period: 25/10/18 to 27/10/18

Fingerprint

Measurement errors
Clustering algorithms
Partitioning around medoids (PAM)
Data mining
Social sciences
Learning systems
Personnel

Cite this

Pankowska, P. K. P., Oberski, D. L., & Pavlopoulos, D. (2018). The effect of survey measurement error on clustering algorithms. Paper presented at Big Data meets Survey Science 2018, Barcelona, Spain.