TY - JOUR
T1 - Measuring similarity between Karel programs using character and word n-grams
AU - Sidorov, G.
AU - Ibarra Romero, M.
AU - Markov, I.
AU - Guzman-Cabrera, R.
AU - Chanona-Hernández, L.
AU - Velásquez, F.
PY - 2017/1/1
Y1 - 2017/1/1
N2 - © 2017, Pleiades Publishing, Ltd. We present a method for measuring similarity between source code programs. We approach this task from a machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. We also explore the contribution of latent semantic analysis to this task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved by a Support Vector Machine classifier with latent semantic analysis applied and word trigrams as features.
AB - © 2017, Pleiades Publishing, Ltd. We present a method for measuring similarity between source code programs. We approach this task from a machine learning perspective, using character and word n-grams as features and examining different machine learning algorithms. We also explore the contribution of latent semantic analysis to this task. To evaluate the proposed approach, we developed a corpus of around 10,000 programs written in the Karel programming language to solve 100 different tasks. The results show that the highest classification accuracy is achieved by a Support Vector Machine classifier with latent semantic analysis applied and word trigrams as features.
U2 - 10.1134/S0361768817010066
DO - 10.1134/S0361768817010066
M3 - Article
SN - 0361-7688
VL - 43
SP - 47
EP - 50
JO - Programming and Computer Software
JF - Programming and Computer Software
IS - 1
ER -