2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER)
- Anthology ID:
- G20-97
- Month:
- Year:
- 2020
- Address:
- Venue:
- GWF
- SIG:
- Publisher:
- IEEE
- URL:
- https://gwf-uwaterloo.github.io/gwf-publications/G20-97
- DOI:
SAGA: Efficient and Large-Scale Detection of Near-Miss Clones with GPU Acceleration
Guanhua Li
|
Yijian Wu
|
Chanchal K. Roy
|
Jun Sun
|
Xin Peng
|
Nanjie Zhan
|
Bin Hu
|
Jingyi Ma
Clone detection on large code repository is necessary for many big code analysis tasks. The goal is to provide rich information on identical and similar code across projects. Detecting near-miss code clones on big code is challenging since it requires intensive computing and memory resources as the scale of the source code increases. In this work, we propose SAGA, an efficient suffix-array based code clone detection tool designed with sophisticated GPU optimization. SAGA not only detects Type-l and Type-2 clones but also does so for cross-project large repositories and for the most computationally expensive Type-3 clones. Meanwhile, it also works at segment granularity, which is even more challenging. It detects code clones in 100 million lines of code within 11 minutes (with recall and precision comparable to state-of-the-art approaches), which is more than 10 times faster than state-of-the-art tools. It is the only tool that efficiently detects Type-3 near-miss clones at segment granularity in large code repository (e.g., within 11 hours on 1 billion lines of code). We conduct a preliminary case study on 85,202 GitHub Java projects with 1 billion lines of code and exhibit the distribution of clones across projects. We find about 1.23 million Type-3 clone groups, containing 28 million lines of code at arbitrary segment granularity, which are only detectable with SAGA. We believe SAGA is useful in many software engineering applications such as code provenance analysis, code completion, change impact analysis, and many more.
Associating Code Clones with Association Rules for Change Impact Analysis
Manishankar Mondal
|
Banani Roy
|
Chanchal K. Roy
|
Kevin A. Schneider
When a programmer makes changes to a target program entity (files, classes, methods), it is important to identify which other entities might also get impacted. These entities constitute the impact set for the target entity. Association rules have been widely used for discovering the impact sets. However, such rules only depend on the previous co-change history of the program entities ignoring the fact that similar entities might often need to be updated together consistently even if they did not co-change before. Considering this fact, we investigate whether cloning relationships among program entities can be associated with association rules to help us better identify the impact sets. In our research, we particularly investigate whether the impact set detection capability of a clone detector can be utilized to enhance the capability of the state-of-the-art association rule mining technique, Tarmaq, in discovering impact sets. We use the well known clone detector called NiCad in our investigation and consider both regular and micro-clones. Our evolutionary analysis on thousands of commit operations of eight diverse subject systems reveals that consideration of code clones can enhance the impact set detection accuracy of Tarmaq with a significantly higher precision and recall. Micro-clones of 3LOC and 4LOC and regular code clones of 5LOC to 20LOC contribute the most towards enhancing the detection accuracy.
HistoRank: History-Based Ranking of Co-change Candidates
Manishankar Mondal
|
Banani Roy
|
Chanchal K. Roy
|
Kevin A. Schneider
Evolutionary coupling is a well investigated phenomenon during software evolution and maintenance. If two or more program entities co-change (i.e., change together) frequently during evolution, it is expected that the entities are coupled. This type of coupling is called evolutionary coupling or change coupling in the literature. Evolutionary coupling is realized using association rules and two measures: support and confidence. Association rules have been extensively used for predicting co-change candidates for a target program entity (i.e., an entity that a programmer attempts to change). However, association rules often predict a large number of co-change candidates with many false positives. Thus, it is important to rank the predicted co-change candidates so that the true positives get higher priorities. The predicted co-change candidates have always been ranked using the support and confidence measures of the association rules. In our research, we investigate five different ranking mechanisms on thousands of commits of ten diverse subject systems. On the basis of our findings, we propose a history-based ranking approach, HistoRank (History-based Ranking), that analyzes the previous ranking history to dynamically select the most appropriate one from those five ranking mechanisms for ranking co-change candidates of a target program entity. According to our experiment result, HistoRank outperforms each individual ranking mechanism with a significantly better MAP (mean average precision). We investigate different variants of HistoRank and realize that the variant that emphasizes the ranking in the most recent occurrence of co-change in the history performs the best.
Exploring Type Inference Techniques of Dynamically Typed Languages
C M Khaled Saifullah
|
Muhammad Asaduzzaman
|
Chanchal K. Roy
Developers often prefer dynamically typed programming languages, such as JavaScript, because such languages do not require explicit type declarations. However, such a feature hinders software engineering tasks, such as code completion, type related bug fixes and so on. Deep learning-based techniques are proposed in the literature to infer the types of code elements in JavaScript snippets. These techniques are computationally expensive. While several type inference techniques have been developed to detect types in code snippets written in statically typed languages, it is not clear how effective those techniques are for inferring types in dynamically typed languages, such as JavaScript. In this paper, we investigate the type inference techniques of JavaScript to understand the above two issues further. While doing that we propose a new technique that considers the locally specific code tokens as the context to infer the types of code elements. The evaluation result shows that the proposed technique is 20-47% more accurate than the statically typed language-based techniques and 5–14 times faster than the deep learning techniques without sacrificing accuracy. Our analysis of sensitivity, overlapping of predicted types and the number of training examples justify the importance of our technique.