SemanticCloneBench: A Semantic Code Clone Benchmark using Crowd-Source Knowledge

Farouq Al-omari, Chanchal K. Roy, Tonghao Chen


Abstract
Not only newly proposed code clone detection techniques, but also existing techniques and tools, need to be evaluated and compared. This evaluation can be done by assessing the reported clones manually or by using benchmarks. The main limitations of available benchmarks are that they are restricted to one programming language, they contain a limited number of clone pairs confined to the selected system(s), they require manual validation, and they do not support all types of code clones. To overcome these limitations, we propose a methodology to generate a wide range of semantic clone benchmarks for different programming languages with minimal human validation. Our technique is based on the knowledge provided by developers who participate in the crowd-sourced information website Stack Overflow. We applied automatic filtering, selection, and validation to the source code in Stack Overflow answers. Finally, we built a semantic code clone benchmark of 4,000 clone pairs for the languages Java, C, C#, and Python.
Cite:
Farouq Al-omari, Chanchal K. Roy, and Tonghao Chen. 2020. SemanticCloneBench: A Semantic Code Clone Benchmark using Crowd-Source Knowledge. In 2020 IEEE 14th International Workshop on Software Clones (IWSC).