TY - GEN
T1 - Classification of APIs by hierarchical clustering
AU - Härtel, Johannes
AU - Aksu, Hakan
AU - Lämmel, Ralf
PY - 2018/5/28
Y1 - 2018/5/28
N2 - APIs can be classified according to the programming domains (e.g., GUIs, databases, collections, or security) that they address. Such classification is vital in searching repositories (e.g., the Maven Central Repository for Java) and for understanding the technology stack used in software projects. We apply hierarchical clustering to a curated suite of Java APIs to compare the computed API clusters with preexisting API classifications. Clustering entails various parameters (e.g., the choice of IDF versus LSI versus LDA). We describe the corresponding variability in terms of a feature model. We exercise all possible configurations to determine the maximum correlation with respect to two baselines: i) a smaller suite of APIs manually classified in previous research; ii) a larger suite of APIs from the Maven Central Repository, thereby taking advantage of crowd-sourced classification while relying on a threshold-based approach for identifying important APIs and versions thereof, subject to an API dependency analysis on GitHub. We discuss the configurations found in this way and we examine the influence of particular features on the correlation between computed clusters and baselines. To this end, we also leverage interactive exploration of the parameter space and the resulting dendrograms. In this manner, we can also identify issues with the use of classifiers (e.g., missing classifiers) in the baselines and limitations of the clustering approach.
AB - APIs can be classified according to the programming domains (e.g., GUIs, databases, collections, or security) that they address. Such classification is vital in searching repositories (e.g., the Maven Central Repository for Java) and for understanding the technology stack used in software projects. We apply hierarchical clustering to a curated suite of Java APIs to compare the computed API clusters with preexisting API classifications. Clustering entails various parameters (e.g., the choice of IDF versus LSI versus LDA). We describe the corresponding variability in terms of a feature model. We exercise all possible configurations to determine the maximum correlation with respect to two baselines: i) a smaller suite of APIs manually classified in previous research; ii) a larger suite of APIs from the Maven Central Repository, thereby taking advantage of crowd-sourced classification while relying on a threshold-based approach for identifying important APIs and versions thereof, subject to an API dependency analysis on GitHub. We discuss the configurations found in this way and we examine the influence of particular features on the correlation between computed clusters and baselines. To this end, we also leverage interactive exploration of the parameter space and the resulting dendrograms. In this manner, we can also identify issues with the use of classifiers (e.g., missing classifiers) in the baselines and limitations of the clustering approach.
UR - https://www.scopus.com/pages/publications/85051626965
U2 - 10.1145/3196321.3196344
DO - 10.1145/3196321.3196344
M3 - Conference contribution
SN - 9781450357142
T3 - Proceedings - International Conference on Software Engineering
SP - 233
EP - 243
BT - Proceedings - 2018 ACM/IEEE 26th International Conference on Program Comprehension, ICPC 2018
PB - IEEE Computer Society
T2 - ACM/IEEE 26th International Conference on Program Comprehension, ICPC 2018, collocated with the 40th International Conference on Software Engineering, ICSE 2018
Y2 - 27 May 2018 through 28 May 2018
ER -