A Graph-Based Stratified Sampling Methodology for the Analysis of (Underground) Forums

Giorgio Di Tizio*, Gilberto Atondo Siu, Alice Hutchings, Fabio Massacci

*Corresponding author for this work

Research output: Contribution to Journal, Article, Academic, peer-review

Abstract

Researchers analyze underground forums to study abuse and cybercrime activities. Due to the size of the forums and the domain expertise required to identify criminal discussions, most approaches employ supervised machine learning techniques to automatically classify the posts of interest. Human annotation is costly, which raises the question of how to select samples for annotation that account for the structure of the forum. We present a methodology to generate stratified samples based on information about the centrality properties of the population and evaluate classifier performance. We observe that by employing a sample obtained from a uniform distribution of the post degree centrality metric, we maintain the same level of precision but significantly increase the recall (+30%) compared to a sample whose distribution respects the population stratification. We find that classifiers trained with similar samples disagree on the classification of criminal activities up to 33% of the time when deployed on the entire forum.
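
The contrast the abstract draws between a sample that follows the population stratification and one that is uniform across the degree centrality metric can be illustrated with a minimal sketch. This is not the authors' implementation: the equal-width binning, the function names (degree_strata, allocate, stratified_sample), and the input mapping from post identifier to degree are all assumptions made for illustration only.

```python
import random
from collections import defaultdict


def degree_strata(degrees, n_bins):
    """Group posts into equal-width bins of their degree (assumed binning)."""
    lo, hi = min(degrees.values()), max(degrees.values())
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width when all degrees match
    strata = defaultdict(list)
    for post, deg in degrees.items():
        idx = min(int((deg - lo) / width), n_bins - 1)  # top edge goes in last bin
        strata[idx].append(post)
    return strata


def allocate(strata, sample_size, mode):
    """Return how many posts to draw from each non-empty stratum.

    "proportional" follows the stratum sizes of the population;
    "uniform" draws the same number from every stratum, capped by its size.
    Rounding means the counts may not sum exactly to sample_size.
    """
    total = sum(len(posts) for posts in strata.values())
    if mode == "proportional":
        return {k: round(sample_size * len(v) / total) for k, v in strata.items()}
    if mode == "uniform":
        per_stratum = sample_size // len(strata)
        return {k: min(per_stratum, len(v)) for k, v in strata.items()}
    raise ValueError(f"unknown mode: {mode}")


def stratified_sample(degrees, sample_size, mode, n_bins=4, seed=0):
    """Draw a stratified sample of post identifiers to hand to annotators."""
    rng = random.Random(seed)
    strata = degree_strata(degrees, n_bins)
    counts = allocate(strata, sample_size, mode)
    sample = []
    for k in sorted(strata):
        sample.extend(rng.sample(sorted(strata[k]), counts[k]))
    return sample
```

In a forum where most posts have low degree, the proportional mode yields a sample dominated by low-degree posts, while the uniform mode gives high-degree posts the same weight as the rest. The sketch only shows the sampling step; annotation and classifier training are out of its scope.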

Original language: English
Pages (from-to): 5473-5483
Number of pages: 11
Journal: IEEE Transactions on Information Forensics and Security
Volume: 18
Early online date: 11 Aug 2023
Publication status: Published - 2023

Bibliographical note

Publisher Copyright:
© 2005-2012 IEEE.

Funding

Funder: Horizon 2020 Framework Programme
Funder numbers: 949127, 952647, 830929

    Keywords

    • Cybercrime
    • machine learning
    • underground forum
