
Improved Labeling of Security Defects in Code Review by Active Learning with LLMs

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Mining high-quality datasets of security defects is important for cybersecurity. In this paper, we focus on mining a dataset of reviews that discuss potential security defects in code or other artifacts. Mining such datasets often involves labeling, which is challenging because security defects are rare. We investigate the use of active learning with a fine-tuned large language model to make the mining and labeling of such datasets more effective. Our simulations demonstrate that active learning can increase the effectiveness of human annotators by a factor of 13. This means we can produce datasets with 13 times more defects than found in random samples of the same size. We conducted an empirical study on over four million unlabeled reviews from GitHub, showing that active learning increases the effectiveness by a factor greater than 6. In total, 246 out of 1298 labeled reviews were identified as discussing security defects. Rather than depending on a keyword list for upfront candidate selection, we dynamically evolve an LLM for this purpose. Our work has the potential to inspire future research in this area, resolving rare-class and class-imbalance problems at their root by adjusting how the datasets are mined and labeled. Our final dataset and model are publicly available.
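
The figures in the abstract can be related with a little arithmetic. The following is a minimal sketch, not taken from the paper: the function names are hypothetical, and it assumes the reported factor is the ratio between the share of security-defect reviews in the actively selected labeled set and the share expected in a random sample of the same size. Under that assumption, a factor greater than 6 implies an upper bound on the rate of such reviews in a random sample.

```python
def positive_rate(positives: int, labeled: int) -> float:
    """Share of labeled reviews that discuss a security defect."""
    if labeled <= 0:
        raise ValueError("labeled must be a positive count")
    return positives / labeled


def enrichment_factor(active_rate: float, random_rate: float) -> float:
    """Ratio of the positive rate under active learning to the rate under random sampling.

    Assumption: this ratio is what the abstract calls the effectiveness factor.
    """
    if random_rate <= 0:
        raise ValueError("random_rate must be positive")
    return active_rate / random_rate


# Figures reported in the abstract: 246 of 1298 labeled reviews discuss security defects.
active_rate = positive_rate(246, 1298)

# Assumption: the empirical factor is at least 6, so a random sample of the
# same size would contain security-defect reviews at no more than this rate.
REPORTED_MIN_FACTOR = 6
implied_random_rate_bound = active_rate / REPORTED_MIN_FACTOR

print(f"Positive rate in the labeled set: {active_rate:.1%}")
print(f"Implied upper bound on the random-sample rate: {implied_random_rate_bound:.1%}")
```

Read this way, roughly one in five actively selected reviews is a positive, while a random sample would need to contain at most about one positive in thirty for the reported factor to hold. The bound is an inference from the abstract's numbers, not a figure stated by the authors.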

Original language: English
Title of host publication: EASE '25: Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering
Editors: Muhammad Ali Babar, Ayse Tosun, Stefan Wagner, Viktoria Stray
Publisher: Association for Computing Machinery, Inc
Pages: 1014-1023
Number of pages: 10
ISBN (Electronic): 9798400713859
Publication status: Published - 2025
Event: 29th International Conference on Evaluation and Assessment of Software Engineering, EASE 2025 - Istanbul, Turkey
Duration: 17 Jun 2025 to 20 Jun 2025

Conference

Conference: 29th International Conference on Evaluation and Assessment of Software Engineering, EASE 2025
Country/Territory: Turkey
City: Istanbul
Period: 17/06/25 to 20/06/25

Bibliographical note

Publisher Copyright:
© 2025 Copyright held by the owner/author(s).

Keywords

  • active learning
  • code review
  • cybersecurity
  • dataset curation
  • empirical study
  • manual labeling
  • security defects
  • simulation
