Golam Mostaeen


2020

DOI bib
A machine learning based framework for code clone validation
Golam Mostaeen, Banani Roy, Chanchal K. Roy, Kevin A. Schneider, Jeffrey Svajlenko
Journal of Systems and Software, Volume 169

Abstract A code clone is a pair of code fragments, within or between software systems that are similar. Since code clones often negatively impact the maintainability of a software system, several code clone detection techniques and tools have been proposed and studied over the last decade. However, the clone detection tools are not always perfect and their clone detection reports often contain a number of false positives or irrelevant clones from specific project management or user perspective. To detect all possible similar source code patterns in general, the clone detection tools work on the syntax level while lacking user-specific preferences. This often means the clones must be manually inspected before analysis in order to remove those false positives from consideration. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning approach for automating the validation process. First, a training dataset is built by taking code clones from several clone detection tools for different subject systems and then manually validating those clones. Second, several features are extracted from those clones to train the machine learning model by the proposed approach. The trained algorithm is then used to automatically validate clones without human inspection. Thus the proposed approach can be used to remove the false positive clones from the detection results, automatically evaluate the precision of any clone detectors for any given set of datasets, evaluate existing clone benchmark datasets, or even be used to build new clone benchmarks and datasets with minimum effort. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges. The proposed method also shows better results in several comparative studies with the existing related approaches for clone classification.

2019

DOI bib
Designing for Real-Time Groupware Systems to Support Complex Scientific Data Analysis
Golam Mostaeen, Banani Roy, Chanchal K. Roy, Kevin A. Schneider
Proceedings of the ACM on Human-Computer Interaction, Volume 3, Issue EICS

Scientific Workflow Management Systems (SWfMSs) have become popular for accelerating the specification, execution, visualization, and monitoring of data-intensive scientific experiments. Unfortunately, to the best of our knowledge no existing SWfMSs directly support collaboration. Data is increasing in complexity, dimensionality, and volume, and the efficient analysis of data often goes beyond the realm of an individual and requires collaboration with multiple researchers from varying domains. In this paper, we propose a groupware system architecture for data analysis that in addition to supporting collaboration, also incorporates features from SWfMSs to support modern data analysis processes. As a proof of concept for the proposed architecture we developed SciWorCS - a groupware system for scientific data analysis. We present two real-world use-cases: collaborative software repository analysis and bioinformatics data analysis. The results of the experiments evaluating the proposed system are promising. Our bioinformatics user study demonstrates that SciWorCS can leverage real-world data analysis tasks by supporting real-time collaboration among users.

DOI bib
CloneCognition: machine learning based code clone validation tool
Golam Mostaeen, Jeffrey Svajlenko, Banani Roy, Chanchal K. Roy, Kevin A. Schneider
Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

A code clone is a pair of similar code fragments, within or between software systems. To detect each possible clone pair from a software system while handling the complex code structures, the clone detection tools undergo a lot of generalization of the original source codes. The generalization often results in returning code fragments that are only coincidentally similar and not considered clones by users, and hence requires manual validation of the reported possible clones by users which is often both time-consuming and challenging. In this paper, we propose a machine learning based tool 'CloneCognition' (Open Source Codes: https://github.com/pseudoPixels/CloneCognition ; Video Demonstration: https://www.youtube.com/watch?v=KYQjmdr8rsw) to automate the laborious manual validation process. The tool runs on top of any code clone detection tools to facilitate the clone validation process. The tool shows promising clone classification performance with an accuracy of up to 87.4%. The tool also exhibits significant improvement in the results when compared with state-of-the-art techniques for code clone validation.

DOI bib
A Data Management Scheme for Micro-Level Modular Computation-Intensive Programs in Big Data Platforms
Debasish Chakroborti, Banani Roy, Amit Kumar Mondal, Golam Mostaeen, Chanchal K. Roy, Kevin A. Schneider, Ralph Deters
Studies in Big Data

Big Data analytics or systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights from a huge amount of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Independent schemes from different studies for managing programs and data of workflows have been already proposed by many researchers and most of the systems have been presented with data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. In order to address the shortcomings, we propose a scheme of Big Data management for micro-level modular computation-intensive programs in a Spark and Hadoop-based platform. In this paper, we investigate whether management of the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. From our experiments, we obtained prominent results, e.g., we have reported that with the intermediate data management, we can gain up to 87% computation time for an image processing job.

2018

DOI bib
[Research Paper] On the Use of Machine Learning Techniques Towards the Design of Cloud Based Automatic Code Clone Validation Tools
Golam Mostaeen, Jeffrey Svajlenko, Banani Roy, Chanchal K. Roy, Kevin A. Schneider
2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM)

A code clone is a pair of code fragments, within or between software systems that are similar. Since code clones often negatively impact the maintainability of a software system, a great many numbers of code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, the clone detection tools work on syntax level (such as texts, tokens, AST and so on) while lacking user-specific preferences. This often means the reported clones must be manually validated prior to any analysis in order to filter out the true positive clones from task or user-specific considerations. This manual clone validation effort is very time-consuming and often error-prone, in particular for large-scale clone detection. In this paper, we propose a machine learning based approach for automating the validation process. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against the manual validation by multiple expert judges. The proposed method shows promising results in several comparative studies with the existing related approaches for automatic code clone validation. We also present our experimental results in terms of different code clone detection tools, machine learning algorithms and open source software systems.

DOI bib
Fine-Grained Attribute Level Locking Scheme for Collaborative Scientific Workflow Development
Golam Mostaeen, Banani Roy, Chanchal K. Roy, Kevin A. Schneider
2018 IEEE International Conference on Services Computing (SCC)

Scientific Workflow Management Systems are being widely used in recent years for data-intensive analysis tasks or domain-specific discoveries. It often becomes challenging for an individual to effectively analyze the large scale scientific data of relatively higher complexity and dimensions, and requires a collaboration of multiple members of different disciplines. Hence, researchers have focused on designing collaborative workflow management systems. However, consistency management in the face of conflicting concurrent operations of the collaborators is a major challenge in such systems. In this paper, we propose a locking scheme (e.g., collaborator gets write access to non-conflicting components of the workflow at a given time) to facilitate consistency management in collaborative scientific workflow management systems. The proposed method allows locking workflow components at a granular level in addition to supporting locks on a targeted part of the collaborative workflow. We conducted several experiments to analyze the performance of the proposed method in comparison to related existing methods. Our studies show that the proposed method can reduce the average waiting time of a collaborator by up to 36.19% in comparison to existing descendent modular level locking techniques for collaborative scientific workflow management systems.