The comparison of program sources using the sequence alignment of tokens

Article

Published: 19.11.2014

Published in issue: #9(33)/2014

DOI: 10.18698/2308-6033-2014-9-1318

Detecting borrowed code is a pressing problem. In this work, a known algorithm for biopolymer sequence alignment was modified so that it can compare program sources and detect similar fragments in them. The algorithm takes source code as input and treats it as a sequence of symbols. The set of lexical domains serves as the alphabet of symbols that make up these sequences. The algorithm was implemented and demonstrated on code fragments written in Scheme. The prospects and limitations of applying the algorithm are also discussed.
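
The idea summarized above can be illustrated with a minimal sketch in Python. It is not the article's implementation: the abstract does not name the alignment algorithm or the lexical domains, so the sketch assumes Smith-Waterman local alignment as a stand-in, a hypothetical set of token classes for Scheme-like code, and arbitrary scoring values (match, mismatch, gap). The names `lex_classes` and `local_alignment` are likewise illustrative.

```python
import re

# Hypothetical lexical classes for Scheme-like code (an assumption;
# the article's actual set of lexical domains is not given here).
TOKEN_RE = re.compile(r'''
    (?P<LPAREN>\() |
    (?P<RPAREN>\)) |
    (?P<NUMBER>-?\d+(?:\.\d+)?) |
    (?P<STRING>"[^"]*") |
    (?P<IDENT>[^\s()"]+)
''', re.VERBOSE)

# A small, assumed subset of Scheme special forms.
KEYWORDS = {"define", "lambda", "if", "let", "cond"}


def lex_classes(source):
    """Map source text to a sequence of lexical class names."""
    classes = []
    for m in TOKEN_RE.finditer(source):
        kind = m.lastgroup
        if kind == "IDENT" and m.group() in KEYWORDS:
            kind = "KEYWORD"
        classes.append(kind)
    return classes


def local_alignment(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, (start_a, end_a), (start_b, end_b)), where the
    ranges are half-open token index ranges of the best-scoring pair of
    similar fragments. Scoring values are assumptions for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the dynamic programming table, clamping scores at zero.
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    end_a, end_b = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i, end_a), (j, end_b)
```

Because the alignment runs over lexical classes rather than raw text, two fragments that differ only in identifier names, such as `(define (square x) (* x x))` and `(define (sq y) (* y y))`, produce identical class sequences and align over their full length. The price of this design is that unrelated code with the same lexical shape also aligns, which is one kind of restriction a class-based comparison has to account for.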
