The sampling threat when mining generalizable inter-library usage patterns

Research output: Contribution to Journal, Article, Academic, peer-review

Abstract

Tool support in software engineering often relies on relationships, regularities, patterns, or rules mined from other users’ code. Examples include approaches to bug prediction, code recommendation, and code autocompletion. Mining is typically performed on samples of code rather than the entirety of available software projects. While sampling is crucial for scaling data analysis, it can affect the generalization of the mined patterns.
This paper focuses on sampling software projects filtered for specific libraries and frameworks, and on mining patterns that connect different libraries. We call these inter-library patterns. We observe that limiting the sample to a specific library may hinder the generalization of inter-library patterns, posing a threat to their use or interpretation. Using a simulation and a real case study, we demonstrate this threat for different sampling methods. Our simulation shows that the implication of a pattern generalizes well only when the sample is filtered for the disjunction of both libraries involved in that implication. Additionally, we show that real empirical data sampled using the GitHub search API does not behave as expected from our simulation. This identifies a potential threat relevant to many studies that use the GitHub search API to study inter-library patterns.
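The effect of the sampling filter can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the simulation used in the paper: the library names `A` and `B`, the population counts, the helper names, and the use of rule confidence as the measure of an implication are all hypothetical choices made for illustration.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Project = FrozenSet[str]  # the set of libraries a project uses


def make_population() -> List[Project]:
    """Build a fixed, hypothetical population of 1000 projects."""
    return (
        [frozenset({"A", "B"})] * 300   # projects using both libraries
        + [frozenset({"A"})] * 200      # projects using only A
        + [frozenset({"B"})] * 100      # projects using only B
        + [frozenset()] * 400           # projects using neither
    )


def confidence(projects: List[Project], antecedent: str,
               consequent: str) -> Optional[float]:
    """Confidence of `antecedent -> consequent`: the share of projects
    using the antecedent that also use the consequent."""
    with_antecedent = [p for p in projects if antecedent in p]
    if not with_antecedent:
        return None  # the rule cannot be evaluated on this sample
    hits = sum(1 for p in with_antecedent if consequent in p)
    return hits / len(with_antecedent)


def compare_filters(
    population: List[Project],
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return (conf(A -> B), conf(B -> A)) for each sampling filter."""
    samples = {
        "population": population,
        "A only": [p for p in population if "A" in p],
        "B only": [p for p in population if "B" in p],
        "A or B": [p for p in population if "A" in p or "B" in p],
    }
    return {
        name: (confidence(s, "A", "B"), confidence(s, "B", "A"))
        for name, s in samples.items()
    }


if __name__ == "__main__":
    for name, (a_to_b, b_to_a) in compare_filters(make_population()).items():
        print(f"{name:>10}: conf(A->B)={a_to_b:.2f}  conf(B->A)={b_to_a:.2f}")
```

In this toy population, filtering on a single library pushes the confidence of the rule whose consequent is that library to 1.0, because every sampled project contains it. Only the sample filtered for `A or B` keeps both confidences at their population values (0.60 and 0.75). Real data may behave differently, as the abstract notes for samples drawn through the GitHub search API.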
Original language: English
Article number: 103393
Pages (from-to): 1-18
Number of pages: 18
Journal: Science of Computer Programming
Volume: 248
Early online date: 27 Sept 2025
Publication status: Published - Mar 2026
