2023
With the emergence of Machine Learning, there has been a surge in leveraging its capabilities for problem-solving across various domains. In the code clone realm, the identification of type-4 or semantic clones has emerged as a crucial yet challenging task. Researchers aim to utilize Machine Learning to tackle this challenge, often relying on the Big-CloneBench dataset. However, it's worth noting that BigCloneBench, originally not designed for semantic clone detection, presents several limitations that hinder its suitability as a comprehensive training dataset for this specific purpose. Furthermore, CLCDSA dataset suffers from a lack of reusable examples aligning with real-world software systems, rendering it inadequate for cross-language clone detection approaches. In this work, we present a comprehensive semantic clone and cross-language clone benchmark, GPTCloneBench <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sup> by exploiting SemanticCloneBench and OpenAI's GPT-3 model. In particular, using code fragments from SemanticCloneBench as sample inputs along with appropriate prompt engineering for GPT-3 model, we generate semantic and cross-language clones for these specific fragments and then conduct a combination of extensive manual analysis, tool-assisted filtering, functionality testing and automated validation in building the benchmark. From 79,928 clone pairs of GPT-3 output, we created a benchmark with 37,149 true semantic clone pairs, 19,288 false semantic pairs(Type-1/Type-2), and 20,770 cross-language clones across four languages (Java, C, C#, and Python). Our benchmark is 15-fold larger than SemanticCloneBench, has more functional code examples for software systems and programming language support than CLCDSA, and overcomes BigCloneBench's qualities, quantification, and language variety limitations. GPTCloneBench can be found here <sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">1</sup> .
2022
Advances in scientific domains are led to an increase in the complexity of the experiments. To address this growing complexity, scientists from different domains require to work collaboratively. Scientific Workflow Management Systems (SWfMSs) are popular tools for data-intensive experiments. To the best of our knowledge, very few of the existing SWfMSs support collaboration, and it is not efficient in many cases. Researchers share a single version of the workflow in existing collaborative data analysis systems, which increases the chance of interference as the number of collaborators grows. Moreover, for effective collaboration, contributors require a clear view of the project's status, the information that existing SWfMSs do not provide. Another significant problem is most scientists are not capable of adding collaborative tools to existing SWfMSs, and they need software engineers to take on this responsibility. Even for software engineers such tasks could be challenging and time consuming. In this paper, we attempted to address this crucial issue in scientific workflow composition and doing so in a collaborative setting. Hence, we propose a tool to facilitate collaborative workflow composition. This tool provides branching and versioning, which are standard version control system features to allow multiple researchers to contribute to the project asynchronously. We also suggest some visualizations and a variety of reports to increase group awareness and help the scientists to realize the project's status and issues. As a proof of concept, we developed an API to capture the provenance data and provide collaborative tools. This API is developed as an example for software engineers to help them understand how to integrate collaborative tools into any SWfMS. We collect provenance information during workflow composition and then employ it to track workflow versions using the proposed collaborative tool. Prior to implementing the visualizations, we surveyed to discover how much the proposed visualizations could contribute to group awareness. Moreover, in the survey we investigated to what extent the proposed version control system could help address shortcomings in collaborative experiments. The survey participants provided us with valuable feedback. In future, we will use the survey responses to enhance the proposed version control system and visualizations.
Reading through code, finding relevant methods, classes and files takes a significant portion of software development time. Having good tool support for this code browsing activity can reduce human effort and increase overall developer productivity. To help with program comprehension activities, building an abstract code summary of a software system from its call graph is an active research area. A call graph is a visual representation of the caller-callee relationships between different methods of a software system. Call graphs can be difficult to comprehend for a large code-base. Previous work by Gharibi et al. on abstract code summarizing suggested using the Agglomerative Hierarchical Clustering (AHC) tree for understanding the codebase. Each node in the tree is associated with the top five method names. When we replicated the previous approach, we observed that the number of nodes in the AHC tree is burdensome for developers to explore. We also noticed only five method names for each node is not sufficient to comprehend an abstract node. We propose a technique to transform the AHC tree using cluster flattening for natural grouping and reduced nodes. We also generate a natural text summary for each abstract node derived from method comments. In order to evaluate our proposed approach, we collected developers’ opinions about the abstract code summary tree based on their codebase. The evaluation results confirm that our approach can not only help developers get an overview of their codebases but also could assist them in doing specific software maintenance tasks.
Exploring the source code of a software system is a prevailing task that is frequently done by contributors to a system. Practitioners often use call graphs to aid in understanding the source code of an inadequately documented software system. Call graphs, when visualized, show caller and callee relationships between functions. A static call graph provides an overall structure of a software system and dynamic call graphs generated from dynamic execution logs can be used to trace program behaviour for a particular scenario. Unfortunately a call graph of an entire system can be very complicated and hard to understand. Hierarchically abstracting a call graph can be used to summarize an entire system’s structure and more easily comprehending function calls. In this work, we mine concepts from source code entities (functions) to generate a concept cluster tree with improved naming of cluster nodes to complement existing studies and facilitate more effective program comprehension for developers. We apply three different information retrieval techniques (TFIDF, LDA, and LSI) on function names and function name variants to label the nodes of a concept cluster tree generated by clustering execution paths. From our experiment in comparing automatic labelling with manual labeling by participants for 12 use cases, we found that among the techniques on average, TFIDF performs better with 64% matching. LDA and LSI had 37% and 23% matching respectively. In addition, using the words in function name variants performed at least 5% better in participant ratings for all three techniques on average for the use cases.
Testing software is considered to be one of the most crucial phases in software development life cycle. Software bug fixing requires a significant amount of time and effort. A rich body of recent research explored ways to predict bugs in software artifacts using machine learning based techniques. For a reliable and trustworthy prediction, it is crucial to also consider the explainability aspects of such machine learning models. In this paper, we show how the feature transformation techniques can significantly improve the prediction accuracy and build confidence in building bug prediction models. We propose a novel approach for improved bug prediction that first extracts the features, then finds a weighted transformation of these features using a genetic algorithm that best separates bugs from non-bugs when plotted in a low-dimensional space, and finally, trains the machine learning model using the transformed dataset. In our experiment with real-life bug datasets, the random forest and k-nearest neighbor classifier models that leveraged feature transformation showed 4.25% improvement in recall values on an average of over 8 software systems when compared to the models built on original data.
Software developers often submit questions to technical Q&A sites like Stack Overflow (SO) to resolve their code-level problems. Usually, they include example code segments with their questions to explain the programming issues. When users of SO attempt to answer the questions, they prefer to reproduce the issues reported in questions using the given code segments. However, such code segments could not always reproduce the issues due to several unmet challenges (e.g., too short code segment) that might prevent questions from receiving appropriate and prompt solutions. A previous study produced a catalog of potential challenges that hinder the reproducibility of issues reported at SO questions. However, it is unknown how the practitioners (i.e., developers) perceive the challenge catalog. Understanding the developers’ perspective is inevitable to introduce interactive tool support that promotes reproducibility. We thus attempt to understand developers’ perspectives by surveying 53 users of SO. In particular, we attempt to – (1) see developers’ viewpoints on the agreement to those challenges, (2) find the potential impact of those challenges, (3) see how developers address them, and (4) determine and prioritize tool support needs. Survey results show that about 90% of participants agree to the already exposed challenges. However, they report some additional challenges (e.g., error log missing) that might prevent reproducibility. According to the participants, too short code segment and absence of required Class/Interface/Method from code segments severely prevent reproducibility, followed by missing important part of code. To promote reproducibility, participants strongly recommend introducing tool support that interacts with question submitters with suggestions for improving the code segments if the given code segments fail to reproduce the issues.
Software bug prediction is one of the promising research areas in software engineering. Software developers must allocate a reasonable amount of time and resources to test and debug the developed software extensively to improve software quality. However, it is not always possible to test software thoroughly with limited time and resources to develop high quality software. Sometimes software companies release software products in a hurry to make profit in a competitive environment. As a result the released software might have software defects and can affect the reputation of those software companies. Ideally, any software application that is already in the market should not contain bugs. If it does, depending on its severity, it might cause a great cost. Although a significant amount of work has been done to automate different parts of testing to detect bugs, fixing a bug after it is discovered is still a costly task that developers need to do. Sometimes these bug fixing changes introduce new bugs in the system. Researchers estimated that 80% of the total cost of a software system is spent on fixing bugs [8]. They show that the software faults and failures costs the US economy $59.5 billion a year [9].
Software development is largely dependent on libraries to reuse existing functionalities instead of reinventing the wheel. Software developers often need to find analogical libraries (libraries similar to ones they are already familiar with) as an analogical library may offer improved or additional features. Developers also need to search for analogical libraries across programming languages when developing applications in different languages or for different platforms. However, manually searching for analogical libraries is a time-consuming and difficult task. This paper presents a technique, called XLibRec, that recommends analogical libraries across different programming languages. XLibRec collects Stack Overflow question titles containing library names, library usage information from Stack Overflow posts, and library descriptions from a third party website, Libraries.io. We generate word-vectors for each information and calculate a weight-based cosine similarity score from them to recommend analogical libraries. We performed an extensive evaluation using a large number of analogical libraries across four different programming languages. Results from our evaluation show that the proposed technique can recommend cross-language analogical libraries with great accuracy. The precision for the Top-3 recommendations ranges from 62-81% and has achieved 8-45% higher precision than the state-of-the-art technique.
A software release note is one of the essential documents in the software development life cycle. The software release contains a set of information, e.g., bug fixes and security fixes. Release notes are used in different phases, e.g., requirement engineering, software testing and release management. Different types of practitioners (e.g., project managers and clients) get benefited from the release notes to understand the overview of the latest release. As a result, several studies have been done about release notes production and usage in practice. However, two significant problems (e.g., duplication and inconsistency in release notes contents) exist in producing well-written & well-structured release notes and organizing appropriate information regarding different targeted users' needs. For that reason, practitioners face difficulties in writing and reading the release notes using existing tools. To mitigate these problems, we execute two different studies in our paper. First, we execute an exploratory study by analyzing 3,347 release notes of 21 GitHub repositories to understand the documented contents of the release notes. As a result, we find relevant key artifacts, e.g., issues (29%), pull-requests (32%), commits (19%), and common vulnerabilities and exposures (CVE) issues (6%) in the release note contents. Second, we conduct a survey study with 32 professionals to understand the key information that is included in release notes regarding users' roles. For example, project managers are more interested in learning about new features than less critical bug fixes. Our study can guide future research directions to help practitioners produce the release notes with relevant content and improve the documentation quality.
2021
To process a large amount of data sequentially and systematically, proper management of workflow components (i.e., modules, data, configurations, associations among ports and links) in a Scientific Workflow Management System (SWfMS) is inevitable. Managing data with provenance in a SWfMS to support reusability of workflows, modules, and data is not a simple task. Handling such components is even more burdensome for frequently assembled and executed complex workflows for investigating large datasets with different technologies (i.e., various learning algorithms or models). However, a great many studies propose various techniques and technologies for managing and recommending services in a SWfMS, but only a very few studies consider the management of data in a SWfMS for efficient storing and facilitating workflow executions. Furthermore, there is no study to inquire about the effectiveness and efficiency of such data management in a SWfMS from a user perspective. In this paper, we present and evaluate a GUI version of such a novel approach of intermediate data management with two use cases (Plant Phenotyping and Bioinformatics). The technique we call GUI-RISPTS (Recommending Intermediate States from Pipelines Considering Tool-States) can facilitate executions of workflows with processed data (i.e., intermediate outcomes of modules in a workflow) and can thus reduce the computational time of some modules in a SWfMS. We integrated GUI-RISPTS with an existing workflow management system called SciWorCS. In SciWorCS, we present an interface that users use for selecting the recommendation of intermediate states (i.e., modules' outcomes). We investigated GUI-RISPTS's effectiveness from users' perspectives along with measuring its overhead in terms of storage and efficiency in workflow execution.
To process a large amount of data sequentially and systematically, proper management of workflow components (i.e., modules, data, configurations, associations among ports and links) in a Scientific Workflow Management System (SWfMS) is inevitable. Managing data with provenance in a SWfMS to support reusability of workflows, modules, and data is not a simple task. Handling such components is even more burdensome for frequently assembled and executed complex workflows for investigating large datasets with different technologies (i.e., various learning algorithms or models). However, a great many studies propose various techniques and technologies for managing and recommending services in a SWfMS, but only a very few studies consider the management of data in a SWfMS for efficient storing and facilitating workflow executions. Furthermore, there is no study to inquire about the effectiveness and efficiency of such data management in a SWfMS from a user perspective. In this paper, we present and evaluate a GUI version of such a novel approach of intermediate data management with two use cases (Plant Phenotyping and Bioinformatics). The technique we call GUI-RISPTS (Recommending Intermediate States from Pipelines Considering Tool-States) can facilitate executions of workflows with processed data (i.e., intermediate outcomes of modules in a workflow) and can thus reduce the computational time of some modules in a SWfMS. We integrated GUI-RISPTS with an existing workflow management system called SciWorCS. In SciWorCS, we present an interface that users use for selecting the recommendation of intermediate states (i.e., modules' outcomes). We investigated GUI-RISPTS's effectiveness from users' perspectives along with measuring its overhead in terms of storage and efficiency in workflow execution.
DOI
bib
abs
A Testing Approach While Re-engineering Legacy Systems: An Industrial Case Study
Hamid Khodabandehloo,
Banani Roy,
Manishankar Mondal,
Chanchal K. Roy,
Kevin A. Schneider,
Hamid Khodabandehloo,
Banani Roy,
Manishankar Mondal,
Chanchal K. Roy,
Kevin A. Schneider
2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
Many organizations use legacy systems as these systems contain their valuable business rules. However, these legacy systems answer the past requirements but are difficult to maintain and evolve due to old technology use. In this situation, stockholders decide to renovate the system with a minimum amount of cost and risk. Although the renovation process is a more affordable choice over redevelopment, it comes with its risks such as performance loss and failure to obtain quality goals. A proper test process can minimize risks incorporated with the renovation process. This work introduces a testing model tailored for the migration and re-engineering process and employs test automation, which results in early bug detection. Moreover, the automated tests ensure functional sameness between the old and the new system. This process enhances reliability, accuracy, and speed of testing.
DOI
bib
abs
A Testing Approach While Re-engineering Legacy Systems: An Industrial Case Study
Hamid Khodabandehloo,
Banani Roy,
Manishankar Mondal,
Chanchal K. Roy,
Kevin A. Schneider,
Hamid Khodabandehloo,
Banani Roy,
Manishankar Mondal,
Chanchal K. Roy,
Kevin A. Schneider
2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
Many organizations use legacy systems as these systems contain their valuable business rules. However, these legacy systems answer the past requirements but are difficult to maintain and evolve due to old technology use. In this situation, stockholders decide to renovate the system with a minimum amount of cost and risk. Although the renovation process is a more affordable choice over redevelopment, it comes with its risks such as performance loss and failure to obtain quality goals. A proper test process can minimize risks incorporated with the renovation process. This work introduces a testing model tailored for the migration and re-engineering process and employs test automation, which results in early bug detection. Moreover, the automated tests ensure functional sameness between the old and the new system. This process enhances reliability, accuracy, and speed of testing.
Software architectural changes involve more than one module or component and are complex to analyze compared to local code changes. Development teams aiming to review architectural aspects (design) of a change commit consider many essential scenarios such as access rules and restrictions on usage of program entities across modules. Moreover, design review is essential when proper architectural formulations are paramount for developing and deploying a system. Untangling architectural changes, recovering semantic design, and producing design notes are the crucial tasks of the design review process. To support these tasks, we construct a lightweight tool [4] that can detect and decompose semantic slices of a commit containing architectural instances. A semantic slice consists of a description of relational information of involved modules, their classes, methods and connected modules in a change instance, which is easy to understand to a reviewer. We extract various directory and naming structures (DANS) properties from the source code for developing our tool. Utilizing the DANS properties, our tool first detects architectural change instances based on our defined metric and then decomposes the slices (based on string processing). Our preliminary investigation with ten open-source projects (developed in Java and Kotlin) reveals that the DANS properties produce highly reliable precision and recall (93-100%) for detecting and generating architectural slices. Our proposed tool will serve as the preliminary approach for the semantic design recovery and design summary generation for the project releases.
Software architectural changes involve more than one module or component and are complex to analyze compared to local code changes. Development teams aiming to review architectural aspects (design) of a change commit consider many essential scenarios such as access rules and restrictions on usage of program entities across modules. Moreover, design review is essential when proper architectural formulations are paramount for developing and deploying a system. Untangling architectural changes, recovering semantic design, and producing design notes are the crucial tasks of the design review process. To support these tasks, we construct a lightweight tool [4] that can detect and decompose semantic slices of a commit containing architectural instances. A semantic slice consists of a description of relational information of involved modules, their classes, methods and connected modules in a change instance, which is easy to understand to a reviewer. We extract various directory and naming structures (DANS) properties from the source code for developing our tool. Utilizing the DANS properties, our tool first detects architectural change instances based on our defined metric and then decomposes the slices (based on string processing). Our preliminary investigation with ten open-source projects (developed in Java and Kotlin) reveals that the DANS properties produce highly reliable precision and recall (93-100%) for detecting and generating architectural slices. Our proposed tool will serve as the preliminary approach for the semantic design recovery and design summary generation for the project releases.
Evolutionary coupling is a well investigated phenomenon in software maintenance research and practice. Association rules and two related measures, support and confidence, have been used to identify evolutionary coupling among program entities. However, these measures only emphasize the co-change (i.e., changing together) frequency of entities and cannot determine whether the entities co-evolved by experiencing related changes. Consequently, the approach reports false positives and fails to detect evolutionary coupling among infrequently co-changed entities. We propose a new measure, identifier correspondence (id-correspondence), that quantifies the extent to which changes that occurred to the co-changed entities are related based on identifier similarity. Identifiers are the names given to different program entities such as variables, methods, classes, packages, interfaces, structures, unions etc. We use Dice-Sørensen co-efficient for measuring lexical similarity between the identifiers involved in the changed lines of the co-changed entities. Our investigation on thousands of revisions from nine subject systems covering three programming languages shows that id-correspondence can considerably improve the detection accuracy of evolutionary coupling. It outperforms the existing state-of-the-art evolutionary coupling based techniques with significantly higher recall and F-score in predicting future co-change candidates.
Evolutionary coupling is a well investigated phenomenon in software maintenance research and practice. Association rules and two related measures, support and confidence, have been used to identify evolutionary coupling among program entities. However, these measures only emphasize the co-change (i.e., changing together) frequency of entities and cannot determine whether the entities co-evolved by experiencing related changes. Consequently, the approach reports false positives and fails to detect evolutionary coupling among infrequently co-changed entities. We propose a new measure, identifier correspondence (id-correspondence), that quantifies the extent to which changes that occurred to the co-changed entities are related based on identifier similarity. Identifiers are the names given to different program entities such as variables, methods, classes, packages, interfaces, structures, unions etc. We use Dice-Sørensen co-efficient for measuring lexical similarity between the identifiers involved in the changed lines of the co-changed entities. Our investigation on thousands of revisions from nine subject systems covering three programming languages shows that id-correspondence can considerably improve the detection accuracy of evolutionary coupling. It outperforms the existing state-of-the-art evolutionary coupling based techniques with significantly higher recall and F-score in predicting future co-change candidates.
When a programmer changes a particular code fragment, the other similar code fragments in the code-base may also need to be changed together (i.e., co-changed) consistently to ensure that the software system remains consistent. Existing studies and tools apply clone detectors to identify these similar co-change candidates for a target code fragment. However, clone detectors suffer from a confounding configuration choice problem and it affects their accuracy in retrieving co-change candidates.In our research, we propose and empirically evaluate a lightweight co-change suggestion technique that can automatically suggest fragment level similar co-change candidates for a target code fragment using WA-DiSC (Weighted Average Dice-Sørensen Co-efficient) through a context-sensitive mining of the entire code-base. We apply our technique, FLeCCS (Fragment Level Co-change Candidate Suggester), on six subject systems written in three different programming languages (Java, C, and C#) and compare its performance with the existing state-of-the-art techniques. According to our experiment, our technique outperforms not only the existing code clone based techniques but also the association rule mining based techniques in detecting co-change candidates with a significantly higher accuracy (precision and recall). We also find that File Proximity Ranking performs significantly better than Similarity Extent Ranking when ranking the co-change candidates suggested by our proposed technique.
When a programmer changes a particular code fragment, the other similar code fragments in the code-base may also need to be changed together (i.e., co-changed) consistently to ensure that the software system remains consistent. Existing studies and tools apply clone detectors to identify these similar co-change candidates for a target code fragment. However, clone detectors suffer from a confounding configuration choice problem and it affects their accuracy in retrieving co-change candidates.In our research, we propose and empirically evaluate a lightweight co-change suggestion technique that can automatically suggest fragment level similar co-change candidates for a target code fragment using WA-DiSC (Weighted Average Dice-Sørensen Co-efficient) through a context-sensitive mining of the entire code-base. We apply our technique, FLeCCS (Fragment Level Co-change Candidate Suggester), on six subject systems written in three different programming languages (Java, C, and C#) and compare its performance with the existing state-of-the-art techniques. According to our experiment, our technique outperforms not only the existing code clone based techniques but also the association rule mining based techniques in detecting co-change candidates with a significantly higher accuracy (precision and recall). We also find that File Proximity Ranking performs significantly better than Similarity Extent Ranking when ranking the co-change candidates suggested by our proposed technique.
Release notes are admitted as an essential technical document in software maintenance. They summarize the main changes, e.g. bug fixes and new features, that have happened in the software since the previous release. Manually producing release notes is a time-consuming and challenging task. For that reason, sometimes developers neglect to write release notes. For example, we collect data from GitHub with over 1900 releases, and among them, 37% of the release notes are empty. To mitigate this problem, we propose an automatic release notes generation approach by applying the text summarization techniques, i.e. TextRank. To improve the keyword extraction method of traditional TextRank, we integrate the GloVe word embedding technique with TextRank. After generating release notes automatically, we apply machine learning algorithms to classify the release note contents (or sentences). We classify the contents into six categories, e.g. bug fixes and performance improvements, to represent the release notes better for users. We use the evaluation metric, e.g. ROUGE, to evaluate the automatically generated release notes. We also compare the performance of our technique with two popular extractive algorithms, e.g. Luhn’s and latent semantic analysis (LSA). Our evaluation results show that the improved TextRank method outperforms the two algorithms.
Release notes are admitted as an essential technical document in software maintenance. They summarize the main changes, e.g. bug fixes and new features, that have happened in the software since the previous release. Manually producing release notes is a time-consuming and challenging task. For that reason, sometimes developers neglect to write release notes. For example, we collect data from GitHub with over 1900 releases, and among them, 37% of the release notes are empty. To mitigate this problem, we propose an automatic release notes generation approach by applying the text summarization techniques, i.e. TextRank. To improve the keyword extraction method of traditional TextRank, we integrate the GloVe word embedding technique with TextRank. After generating release notes automatically, we apply machine learning algorithms to classify the release note contents (or sentences). We classify the contents into six categories, e.g. bug fixes and performance improvements, to represent the release notes better for users. We use the evaluation metric, e.g. ROUGE, to evaluate the automatically generated release notes. We also compare the performance of our technique with two popular extractive algorithms, e.g. Luhn’s and latent semantic analysis (LSA). Our evaluation results show that the improved TextRank method outperforms the two algorithms.
Applications of image registration tasks are computation-intensive, memory-intensive, and communication-intensive. Robust efforts are required on error recovery and re-usability of both the data and the operations, along with performance optimization. Considering these, we explore various programming models aiming to minimize the folding operations (such as join and reduce) which are the primary candidates of data shuffling, concurrency bugs and expensive communication in a distributed cluster. Particularly, we analyze modular MapReduce execution of an image registration pipeline (IRP) with the external and internal data (data-tunneling) flow mechanism and compare them with the compact model. Experimental analyzes with the ComputeCanada cluster and a crop field data-sets containing 1000 images show that these design options are valuable for large-scale IRPs executed with a MapReduce cluster. Additionally, we present an effectiveness measurement metric to analyze the impact of a design model for the Big IRP, accumulating the error-recovery and re-usability metrics along with the data size and execution time. Our explored design models and their performance analysis can serve as a benchmark for the researchers and application developers who deploy large-scale image registration and other image processing tasks.
Applications of image registration tasks are computation-intensive, memory-intensive, and communication-intensive. Robust efforts are required on error recovery and re-usability of both the data and the operations, along with performance optimization. Considering these, we explore various programming models aiming to minimize the folding operations (such as join and reduce) which are the primary candidates of data shuffling, concurrency bugs and expensive communication in a distributed cluster. Particularly, we analyze modular MapReduce execution of an image registration pipeline (IRP) with the external and internal data (data-tunneling) flow mechanism and compare them with the compact model. Experimental analyzes with the ComputeCanada cluster and a crop field data-sets containing 1000 images show that these design options are valuable for large-scale IRPs executed with a MapReduce cluster. Additionally, we present an effectiveness measurement metric to analyze the impact of a design model for the Big IRP, accumulating the error-recovery and re-usability metrics along with the data size and execution time. Our explored design models and their performance analysis can serve as a benchmark for the researchers and application developers who deploy large-scale image registration and other image processing tasks.
2020
The fork-based development mechanism provides the flexibility and the unified processes for software teams to collaborate easily in a distributed setting without too much coordination overhead.Currently, multiple social coding platforms support fork-based development, such as GitHub, GitLab, and Bitbucket. Although these different platforms virtually share the same features, they have different emphasis. As GitHub is the most popular platform and the corresponding data is publicly available, most of the current studies are focusing on GitHub hosted projects. However, we observed anecdote evidences that people are confused about choosing among these platforms, and some projects are migrating from one platform to another, and the reasons behind these activities remain unknown.With the advances of Software Heritage Graph Dataset (SWHGD),we have the opportunity to investigate the forking activities across platforms. In this paper, we conduct an exploratory study on 10popular open-source projects to identify cross-platform forks and investigate the motivation behind. Preliminary result shows that cross-platform forks do exist. For the 10 subject systems in this study, we found 81,357 forks in total among which 179 forks are on GitLab. Based on our qualitative analysis, we found that most of the cross-platform forks that we identified are mirrors of the repositories on another platform, but we still find cases that were created due to preference of using certain functionalities (e.g. Continuous Integration (CI)) supported by different platforms. This study lays the foundation of future research directions, such as understanding the differences between platforms and supporting cross-platform collaboration.
Scientific workflow management systems such as Galaxy, Taverna and Workspace, have been developed to automate scientific workflow management and are increasingly being used to accelerate the specification, execution, visualization, and monitoring of data-intensive tasks. For example, the popular bioinformatics platform Galaxy is installed on over 168 servers around the world and the social networking space myExperiment shares almost 4,000 Galaxy scientific workflows among its 10,665 members. Most of these systems offer graphical interfaces for composing workflows. However, while graphical languages are considered easier to use, graphical workflow models are more difficult to comprehend and maintain as they become larger and more complex. Text-based languages are considered harder to use but have the potential to provide a clean and concise expression of workflow even for large and complex workflows. A recent study showed that some scientists prefer script/text-based environments to perform complex scientific analysis with workflows. Unfortunately, such environments are unable to meet the needs of scientists who prefer graphical workflows. In order to address the needs of both types of scientists and at the same time to have script-based workflow models because of their underlying benefits, we propose a visually guided workflow modeling framework that combines interactive graphical user interface elements in an integrated development environment with the power of a domain-specific language to compose independently developed and loosely coupled services into workflows. Our domain-specific language provides scientists with a clean, concise, and abstract view of workflow to better support workflow modeling. As a proof of concept, we developed VizSciFlow, a generalized scientific workflow management system that can be customized for use in a variety of scientific domains. As a first use case, we configured and customized VizSciFlow for the bioinformatics domain. We conducted three user studies to assess its usability, expressiveness, efficiency, and flexibility. Results are promising, and in particular, our user studies show that VizSciFlow is more desirable for users to use than either Python or Galaxy for solving complex scientific problems.
When a programmer makes changes to a target program entity (files, classes, methods), it is important to identify which other entities might also get impacted. These entities constitute the impact set for the target entity. Association rules have been widely used for discovering the impact sets. However, such rules only depend on the previous co-change history of the program entities ignoring the fact that similar entities might often need to be updated together consistently even if they did not co-change before. Considering this fact, we investigate whether cloning relationships among program entities can be associated with association rules to help us better identify the impact sets. In our research, we particularly investigate whether the impact set detection capability of a clone detector can be utilized to enhance the capability of the state-of-the-art association rule mining technique, Tarmaq, in discovering impact sets. We use the well known clone detector called NiCad in our investigation and consider both regular and micro-clones. Our evolutionary analysis on thousands of commit operations of eight diverse subject systems reveals that consideration of code clones can enhance the impact set detection accuracy of Tarmaq with a significantly higher precision and recall. Micro-clones of 3LOC and 4LOC and regular code clones of 5LOC to 20LOC contribute the most towards enhancing the detection accuracy.
Evolutionary coupling is a well investigated phenomenon during software evolution and maintenance. If two or more program entities co-change (i.e., change together) frequently during evolution, it is expected that the entities are coupled. This type of coupling is called evolutionary coupling or change coupling in the literature. Evolutionary coupling is realized using association rules and two measures: support and confidence. Association rules have been extensively used for predicting co-change candidates for a target program entity (i.e., an entity that a programmer attempts to change). However, association rules often predict a large number of co-change candidates with many false positives. Thus, it is important to rank the predicted co-change candidates so that the true positives get higher priorities. The predicted co-change candidates have always been ranked using the support and confidence measures of the association rules. In our research, we investigate five different ranking mechanisms on thousands of commits of ten diverse subject systems. On the basis of our findings, we propose a history-based ranking approach, HistoRank (History-based Ranking), that analyzes the previous ranking history to dynamically select the most appropriate one from those five ranking mechanisms for ranking co-change candidates of a target program entity. According to our experiment result, HistoRank outperforms each individual ranking mechanism with a significantly better MAP (mean average precision). We investigate different variants of HistoRank and realize that the variant that emphasizes the ranking in the most recent occurrence of co-change in the history performs the best.
Code clones are the same or nearly similar code fragments in a software system's code-base. While the existing studies have extensively studied regular code clones in software systems, micro-clones have been mostly ignored. Although an existing study investigated consistent changes in exact micro-clones, near-miss micro-clones have never been investigated. In our study, we investigate the importance of near-miss micro-clones in software evolution and maintenance by automatically detecting and analyzing the consistent updates that they experienced during the whole period of evolution of our subject systems. We compare the consistent co-change tendency of near-miss micro-clones with that of exact micro-clones and regular code clones. According to our investigation on thousands of revisions of six open-source subject systems written in two different programming languages, near-miss micro-clones have a significantly higher tendency of experiencing consistent updates compared to exact micro-clones and regular (both exact and near-miss) code clones. Consistent updates in near-miss micro-clones have a high tendency of being related with bug-fixes. Moreover, the percentage of commit operations where near-miss micro-clones experience consistent updates is considerably higher than that of regular clones and exact micro-clones. We finally observe that near-miss micro-clones staying in close proximity to each other have a high tendency of experiencing consistent updates. Our research implies that near-miss micro-clones should be considered equally important as of regular clones and exact micro-clones when making clone management decisions.
Abstract While there are novel approaches for detecting and categorizing similar software applications, previous research focused on detecting similarity in applications written in the same programming language and not on detecting similarity in applications written in different programming languages. Cross-language software similarity detection is inherently more challenging due to variations in language, application structures, support libraries used, and naming conventions. In this paper we propose a novel model, CroLSim, to detect similar software applications across different programming languages. We define a semantic relationship among cross-language libraries and API methods (both local and third party) using functional descriptions and a word-vector learning model. Our experiments show that CroLSim can successfully detect cross-language similar software applications, which outperforms all existing approaches (mean average precision rate of 0.65, confidence rate of 3.6, and 75% highly rated successful queries). Furthermore, we applied CroLSim to a source code repository to see whether our model can recommend cross-language source code fragments if queried directly with source code. From our experiments we found that CroLSim can recommend cross-language functional similar source code when source code is directly used as a query (average precision=0.28, recall=0.85, and F-Measure=0.40).
Abstract Freshwater ecosystems, particularly those in agricultural areas, remain at risk of eutrophication due to anthropogenic inputs of nutrients. While community-based monitoring has helped improve awareness and spur action to mitigate nutrient loads, monitoring is challenging due to the reliance on expensive laboratory technology, poor data management, time lags between measurement and availability of results, and risk of sample degradation during transport or storage. In this study, an easy-to-use smartphone-based application (The Nutrient App) was developed to estimate NO 3 and PO 4 concentrations through the image-processing of on-site qualitative colorimetric-based results obtained via cheap commercially-available instantaneous test kits. The app was tested in rivers, wetlands, and lakes across Canada and relative errors between 30% (filtered samples) and 70% (unfiltered samples) were obtained for both NO 3 and PO 4 . The app can be used to identify sources and hotspots of contamination, which can empower communities to take immediate remedial action to reduce nutrient pollution.
Abstract A code clone is a pair of code fragments, within or between software systems that are similar. Since code clones often negatively impact the maintainability of a software system, several code clone detection techniques and tools have been proposed and studied over the last decade. However, the clone detection tools are not always perfect and their clone detection reports often contain a number of false positives or irrelevant clones from specific project management or user perspective. To detect all possible similar source code patterns in general, the clone detection tools work on the syntax level while lacking user-specific preferences. This often means the clones must be manually inspected before analysis in order to remove those false positives from consideration. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning approach for automating the validation process. First, a training dataset is built by taking code clones from several clone detection tools for different subject systems and then manually validating those clones. Second, several features are extracted from those clones to train the machine learning model by the proposed approach. The trained algorithm is then used to automatically validate clones without human inspection. Thus the proposed approach can be used to remove the false positive clones from the detection results, automatically evaluate the precision of any clone detectors for any given set of datasets, evaluate existing clone benchmark datasets, or even be used to build new clone benchmarks and datasets with minimum effort. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges. The proposed method also shows better results in several comparative studies with the existing related approaches for clone classification.
2019
In modern days, mobile applications (apps) have become omnipresent. Components of mobile apps (such as 3rd party libraries) require to be separated and analyzed differently for security issue detection, repackaged app detection, tumor code purification and so on. Various techniques are available to automatically analyze mobile apps. However, analysis of the app's executable binary remains challenging due to required curated database, large codebases and obfuscation. Considering these, we focus on exploring a versatile technique to separate different components with design-based features independent of code obfuscation. Particularly, we conducted an empirical study using design patterns and fuzzy signatures to separate app components such as 3rd party libraries. In doing so, we built a system for automatically extracting design patterns from both the executable package (APK) and Jar of an Android application. The experimental results with various standard datasets containing 3rd party libraries, obfuscated apps and malwares reveal that design features like these are present significantly within them (within 60% APKs including malware). Moreover, these features remain unaltered even after app obfuscation. Finally, as a case study, we found that the design patterns alone can detect 3rd party libraries within the obfuscated apps considerably (F1 score is 32%). Overall, our empirical study reveals that design features might play a versatile role in separating various Android components for various purposes.
Abstract Code clones are identical or nearly similar code fragments in a code-base. According to the existing studies, code clones are directly related to bugs. Code cloning, creating code clones, is suspected to propagate temporarily hidden bugs from one code fragment to another. However, there is no study on the intensity of bug-propagation through code cloning. In this paper, we define two clone evolutionary patterns that reasonably indicate bug propagation through code cloning. By analyzing software evolution history, we identify those code clones that evolved following the bug propagation patterns. According to our study on thousands of commits of seven subject systems, overall 18.42% of the clone fragments that experience bug-fixes contain propagated bugs. Type-3 clones are primarily involved with bug-propagation. Bug propagation is more likely to occur in the clone fragments that are created in the same commit rather than in different commits. Moreover, code clones residing in the same file have a higher possibility of containing propagated bugs compared to those residing in different files. Severe bugs can sometimes get propagated through code cloning. Automatic support for immediately identifying occurrences of bug-propagation can be beneficial for software maintenance. Our findings are important for prioritizing code clones for management.
The identical or nearly similar code fragments in a code-base are called code clones. There is a common belief that code cloning (copy/pasting code fragments) can introduce bugs in a software system if the copied code fragments are not properly adapted to their contexts (i.e., surrounding code). However, none of the existing studies have investigated whether such bugs are really present in code clones. We denote these bugs as Context Adaptation Bugs, or simply Context-Bugs, in our paper and investigate the extent to which they can be present in code clones. We define and automatically analyze two clone evolutionary patterns that indicate fixing of Context-Bugs. According to our analysis on thousands of revisions of six open-source subject systems written in Java, C, and C#, code cloning often introduces Context-Bugs in software systems. Around 50% of the clone related bug-fixes can occur for fixing Context-Bugs. Cloning (copy/pasting) a newly created code fragment (i.e., a code fragment that was not added in a former revision) is more likely to introduce Context-Bugs compared to cloning a preexisting fragment (i.e., a code fragment that was added in a former revision). Moreover, cloning across different files appears to have a significantly higher tendency of introducing Context-Bugs compared to cloning within the same file. Finally, Type 3 clones (gapped clones) have the highest tendency of containing Context-Bugs among the three major clone-types. Our findings can be important for early detection as well as removal of Context-Bugs in code clones.
Scientific Workflow Management Systems (SWfMSs) have become popular for accelerating the specification, execution, visualization, and monitoring of data-intensive scientific experiments. Unfortunately, to the best of our knowledge no existing SWfMSs directly support collaboration. Data is increasing in complexity, dimensionality, and volume, and the efficient analysis of data often goes beyond the realm of an individual and requires collaboration with multiple researchers from varying domains. In this paper, we propose a groupware system architecture for data analysis that in addition to supporting collaboration, also incorporates features from SWfMSs to support modern data analysis processes. As a proof of concept for the proposed architecture we developed SciWorCS - a groupware system for scientific data analysis. We present two real-world use-cases: collaborative software repository analysis and bioinformatics data analysis. The results of the experiments evaluating the proposed system are promising. Our bioinformatics user study demonstrates that SciWorCS can leverage real-world data analysis tasks by supporting real-time collaboration among users.
A code clone is a pair of similar code fragments, within or between software systems. To detect each possible clone pair from a software system while handling the complex code structures, the clone detection tools undergo a lot of generalization of the original source codes. The generalization often results in returning code fragments that are only coincidentally similar and not considered clones by users, and hence requires manual validation of the reported possible clones by users which is often both time-consuming and challenging. In this paper, we propose a machine learning based tool 'CloneCognition' (Open Source Codes: https://github.com/pseudoPixels/CloneCognition ; Video Demonstration: https://www.youtube.com/watch?v=KYQjmdr8rsw) to automate the laborious manual validation process. The tool runs on top of any code clone detection tools to facilitate the clone validation process. The tool shows promising clone classification performance with an accuracy of up to 87.4%. The tool also exhibits significant improvement in the results when compared with state-of-the-art techniques for code clone validation.
Software clones are detrimental to software maintenance and evolution and as a result many clone detectors have been proposed. These tools target clone detection in software applications written in a single programming language. However, a software application may be written in different languages for different platforms to improve the application's platform compatibility and adoption by users of different platforms. Cross language clones (CLCs) introduce additional challenges when maintaining multi-platform applications and would likely go undetected using existing tools. In this paper, we propose CLCDSA, a cross language clone detector which can detect CLCs without extensive processing of the source code and without the need to generate an intermediate representation. The proposed CLCDSA model analyzes different syntactic features of source code across different programming languages to detect CLCs. To support large scale clone detection, the CLCDSA model uses an action filter based on cross language API call similarity to discard non-potential clones. The design methodology of CLCDSA is two-fold: (a) it detects CLCs on the fly by comparing the similarity of features, and (b) it uses a deep neural network based feature vector learning model to learn the features and detect CLCs. Early evaluation of the model observed an average precision, recall and F-measure score of 0.55, 0.86, and 0.64 respectively for the first phase and 0.61, 0.93, and 0.71 respectively for the second phase which indicates that CLCDSA outperforms all available models in detecting cross language clones.
Scientific workflow management system (SWFMS) is one of the inherent parts of Big Data analytics systems. Analyses in such data intensive research using workflows are very costly. SWFMSs or workflows keep track of every bit of executions through logs, which later could be used on demand. For example, in the case of errors, security breaches, or even any conditions, we may need to trace back to the previous steps or look at the intermediate data elements. Such fashion of logging is known as workflow provenance. However, prominent workflows being domain specific and developed following different programming paradigms, their architectures, logging mechanisms, information in the logs, provenance queries, and so on differ significantly. So, provenance technology of one workflow from a certain domain is not easily applicable in another domain. Facing the lack of a general workflow provenance standard, we propose a programming model for automated workflow logging. The programming model is easy to implement and easily configurable by domain experts independent of workflow users. We implement our workflow programming model on Bioinformatics research—for evaluation and collect workflow logs from various scientific pipelines’ executions. Then we focus on some fundamental provenance questions inspired by recent literature that can derive many other complex provenance questions. Finally, the end users are provided with discovered insights from the workflow provenance through online data visualization as a separate web service.
Big Data analytics or systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights from a huge amount of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Independent schemes from different studies for managing programs and data of workflows have been already proposed by many researchers and most of the systems have been presented with data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. In order to address the shortcomings, we propose a scheme of Big Data management for micro-level modular computation-intensive programs in a Spark and Hadoop-based platform. In this paper, we investigate whether management of the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. From our experiments, we obtained prominent results, e.g., we have reported that with the intermediate data management, we can gain up to 87% computation time for an image processing job.
2018
Workflows are frequently built and used to systematically process large datasets using workflow management systems (WMS). A workflow (i.e., a pipeline) is a finite set of processing modules organized as a series of steps that is applied to an input dataset to produce a desired output. In a workflow management system, users generally create workflows manually for their own investigations. However, workflows can sometimes be lengthy and the constituent processing modules might often be computationally expensive. In this situation, it would be beneficial if users could reuse intermediate stage results generated by previously executed workflows for executing their current workflow.In this paper, we propose a novel technique based on association rule mining for suggesting which intermediate stage results from a workflow that a user is going to execute should be stored for reusing in the future. We call our proposed technique, RISP (Recommending Intermediate States from Pipelines). According to our investigation on hundreds of workflows from two scientific workflow management systems, our proposed technique can efficiently suggest intermediate state results to store for future reuse. The results that are suggested to be stored have a high reuse frequency. Moreover, for creating around 51% of the entire pipelines, we can reuse results suggested by our technique. Finally, we can achieve a considerable gain (74% gain) in execution time by reusing intermediate results stored by the suggestions provided by our proposed technique. We believe that our technique (RISP) has the potential to have a significant positive impact on Big-Data systems, because it can considerably reduce execution time of the workflows through appropriate reuse of intermediate state results, and hence, can improve the performance of the systems.
If two or more program entities (such as files, classes, methods) co-change (i.e., change together) frequently during software evolution, then it is likely that these two entities are coupled (i.e., the entities are related). Such a coupling is termed as evolutionary coupling in the literature. The concept of traditional evolutionary coupling restricts us to assume coupling among only those entities that changed together in the past. The entities that did not co-change in the past might also have coupling. However, such couplings can not be retrieved using the current concept of detecting evolutionary coupling in the literature. In this paper, we investigate whether we can detect such couplings by applying transitive rules on the evolutionary couplings detected using the traditional mechanism. We call these couplings that we detect using our proposed mechanism as transitive evolutionary couplings. According to our research on thousands of revisions of four subject systems, transitive evolutionary couplings combined with the traditional ones provide us with 13.96% higher recall and 5.56% higher precision in detecting future co-change candidates when compared with a state-of-the-art technique.
A code clone is a pair of code fragments, within or between software systems that are similar. Since code clones often negatively impact the maintainability of a software system, a great many numbers of code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, the clone detection tools work on syntax level (such as texts, tokens, AST and so on) while lacking user-specific preferences. This often means the reported clones must be manually validated prior to any analysis in order to filter out the true positive clones from task or user-specific considerations. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning based approach for automating the validation process. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges. The proposed method shows promising results in several comparative studies with the existing related approaches for automatic code clone validation. We also present our experimental results in terms of different code clone detection tools, machine learning algorithms and open source software systems.
In today's open source era, developers look forsimilar software applications in source code repositories for anumber of reasons, including, exploring alternative implementations, reusing source code, or looking for a better application. However, while there are a great many studies for finding similarapplications written in the same programming language, there isa marked lack of studies for finding similar software applicationswritten in different languages. In this paper, we fill the gapby proposing a novel modelCroLSimwhich is able to detectsimilar software applications across different programming lan-guages. In our approach, we use the API documentation tofind relationships among the API calls used by the differentprogramming languages. We adopt a deep learning based word-vector learning method to identify semantic relationships amongthe API documentation which we then use to detect cross-language similar software applications. For evaluating CroLSim, we formed a repository consisting of 8,956 Java, 7,658 C#, and 10,232 Python applications collected from GitHub. Weobserved thatCroLSimcan successfully detect similar softwareapplications across different programming languages with a meanaverage precision rate of 0.65, an average confidence rate of3.6 (out of 5) with 75% high rated successful queries, whichoutperforms all related existing approaches with a significantperformance improvement.
Scientific Workflow Management Systems are being widely used in recent years for data-intensive analysis tasks or domain-specific discoveries. It often becomes challenging for an individual to effectively analyze the large scale scientific data of relatively higher complexity and dimensions, and requires a collaboration of multiple members of different disciplines. Hence, researchers have focused on designing collaborative workflow management systems. However, consistency management in the face of conflicting concurrent operations of the collaborators is a major challenge in such systems. In this paper, we propose a locking scheme (e.g., collaborator gets write access to non-conflicting components of the workflow at a given time) to facilitate consistency management in collaborative scientific workflow management systems. The proposed method allows locking workflow components at a granular level in addition to supporting locks on a targeted part of the collaborative workflow. We conducted several experiments to analyze the performance of the proposed method in comparison to related existing methods. Our studies show that the proposed method can reduce the average waiting time of a collaborator by up to 36.19% in comparison to existing descendent modular level locking techniques for collaborative scientific workflow management systems.