Abstract
Mining high-quality datasets of security defects is important for cybersecurity. In this paper, we focus on mining a dataset of reviews that discuss potential security defects in code or other artifacts. Mining such datasets often involves labeling, which is challenging because security defects are rare. We investigate the use of active learning with a fine-tuned large language model (LLM) to make the mining and labeling of such datasets more effective. Our simulations demonstrate that active learning can increase the effectiveness of human annotators by a factor of 13. This means we can produce datasets with 13 times more defects than found in random samples of the same size. We conducted an empirical study on over four million unlabeled reviews from GitHub, showing that active learning increases the effectiveness by a factor greater than 6. In total, 246 out of 1298 labeled reviews were identified as discussing security defects. We do not depend on a keyword list for upfront candidate selection; instead, we dynamically evolve an LLM for this purpose. Our work holds the potential to inspire future research in this area, resolving rare-class and imbalance problems at the root where they appear, by adjusting the mining and labeling of the datasets. Our final dataset and model are publicly available.
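The figures in the abstract can be related with a little arithmetic. The sketch below is illustrative only and assumes that the "effectiveness factor" is the ratio between the share of defect-discussing reviews in the actively selected labeled sample and the share in a random sample of the same size, which is how the abstract phrases the simulation result. The function names are hypothetical and do not come from the paper.

```python
def defect_share(defects_found: int, labeled_total: int) -> float:
    """Share of labeled reviews that discuss a security defect."""
    return defects_found / labeled_total


def enrichment_factor(defects_found: int, labeled_total: int, base_rate: float) -> float:
    """Ratio of the labeled sample's defect share to the random-sample base rate.

    Assumption: this ratio is what the abstract calls the effectiveness factor.
    """
    return defect_share(defects_found, labeled_total) / base_rate


def implied_base_rate(defects_found: int, labeled_total: int, factor: float) -> float:
    """Base rate that would yield exactly the given enrichment factor."""
    return defect_share(defects_found, labeled_total) / factor


# Numbers reported in the abstract for the empirical study.
share = defect_share(246, 1298)                 # roughly 19% of labeled reviews
bound = implied_base_rate(246, 1298, factor=6)  # upper bound on the base rate

print(f"defect share in labeled sample: {share:.1%}")
print(f"base rate implied by a factor greater than 6: below {bound:.1%}")
```

Under this reading, a factor greater than 6 together with 246 defects in 1298 labeled reviews implies that fewer than about 3.2% of randomly sampled reviews would discuss a security defect.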
| Original language | English |
|---|---|
| Title of host publication | EASE '25: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering |
| Editors | Muhammad Ali Babar, Ayse Tosun, Stefan Wagner, Viktoria Stray |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 1014-1023 |
| Number of pages | 10 |
| ISBN (Electronic) | 9798400713859 |
| Publication status | Published - 2025 |
| Event | 29th International Conference on Evaluation and Assessment of Software Engineering, EASE 2025 - Istanbul, Turkey; Duration: 17 Jun 2025 → 20 Jun 2025 |
Conference
| Conference | 29th International Conference on Evaluation and Assessment of Software Engineering, EASE 2025 |
|---|---|
| Country/Territory | Turkey |
| City | Istanbul |
| Period | 17/06/25 → 20/06/25 |
Bibliographical note
Publisher Copyright: © 2025 Copyright held by the owner/author(s).
Keywords
- active learning
- code review
- cybersecurity
- dataset curation
- empirical study
- manual labeling
- security defects
- simulation
Fingerprint
Dive into the research topics of 'Improved Labeling of Security Defects in Code Review by Active Learning with LLMs'. Together they form a unique fingerprint.