Studies in Big Data

Anthology ID: G19-206
Year: 2019
Venue: GWF
Publisher: Springer International Publishing
URL: https://gwf-uwaterloo.github.io/gwf-publications/G19-206

Workflow Provenance for Big Data: From Modelling to Reporting
Rayhan Ferdous | Banani Roy | Chanchal K. Roy | Kevin A. Schneider

A scientific workflow management system (SWFMS) is an integral part of Big Data analytics systems. Analyses in such data-intensive research using workflows are very costly. SWFMSs or workflows keep track of every detail of their executions through logs, which can later be used on demand. For example, in the case of errors, security breaches, or other conditions, we may need to trace back to previous steps or inspect intermediate data elements. This style of logging is known as workflow provenance. However, because prominent workflows are domain specific and developed following different programming paradigms, their architectures, logging mechanisms, log contents, provenance queries, and so on differ significantly. As a result, the provenance technology of a workflow from one domain is not easily applicable in another domain. Facing the lack of a general workflow provenance standard, we propose a programming model for automated workflow logging. The programming model is easy to implement and easily configurable by domain experts, independently of workflow users. We implement our workflow programming model for Bioinformatics research as an evaluation and collect workflow logs from the executions of various scientific pipelines. Then we focus on some fundamental provenance questions, inspired by recent literature, from which many other complex provenance questions can be derived. Finally, end users are provided with the insights discovered from the workflow provenance through online data visualization offered as a separate web service.
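To make the idea of automated workflow logging concrete, the following is a minimal sketch, not the authors' programming model. It assumes a hypothetical `logged_step` decorator, an in-memory `PROVENANCE_LOG`, and two toy bioinformatics-style steps (`trim_reads`, `count_gc`), none of which come from the paper. It shows how step executions could be recorded without the workflow user writing any logging code, and how a basic provenance question could be answered from the log.

```python
import functools
import time
import uuid

# Hypothetical in-memory provenance log; a real SWFMS would persist this.
PROVENANCE_LOG = []


def logged_step(func):
    """Record inputs, output, status, and duration of one workflow step."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {
            "id": uuid.uuid4().hex,
            "step": func.__name__,
            "inputs": repr((args, kwargs)),
            "started": time.time(),
        }
        try:
            result = func(*args, **kwargs)
            record["status"] = "ok"
            record["output"] = repr(result)
            return result
        except Exception as exc:
            # Failed steps are logged too, then the error is re-raised.
            record["status"] = "error"
            record["error"] = repr(exc)
            raise
        finally:
            record["duration"] = time.time() - record["started"]
            PROVENANCE_LOG.append(record)

    return wrapper


@logged_step
def trim_reads(reads, min_len):
    """Toy step: keep only reads of at least min_len characters."""
    return [r for r in reads if len(r) >= min_len]


@logged_step
def count_gc(reads):
    """Toy step: count G and C bases across all reads."""
    return sum(r.count("G") + r.count("C") for r in reads)


def steps_with_status(status):
    """A basic provenance query: which steps ended with the given status?"""
    return [r["step"] for r in PROVENANCE_LOG if r["status"] == status]


kept = trim_reads(["ACGT", "GG", "CCGTA"], min_len=3)
gc_total = count_gc(kept)
```

Because the logging lives in the decorator, a domain expert could change what is recorded in one place while workflow users keep writing plain step functions.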

Big Data analytics systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights in huge amounts of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Many researchers have already proposed independent schemes for managing the programs and data of workflows, and most of these systems have been presented with data or metadata management. However, to the best of our knowledge, no study specifically discusses the performance implications of utilizing the intermediate states of data and programs generated at various execution steps of a workflow on distributed platforms. To address this shortcoming, we propose a Big Data management scheme for micro-level, modular, computation-intensive programs on a Spark- and Hadoop-based platform. In this paper, we investigate whether managing the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs in the Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. Our experiments produced notable results: with intermediate data management, we can save up to 87% of the computation time for an image processing job.
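The reuse of intermediate states can be illustrated with a small sketch. It is not the scheme proposed in the paper: it assumes a local temporary directory standing in for HDFS, plain Python lists standing in for images, and hypothetical helpers (`run_step`, `run_pipeline`) that key each stored result on the step name and a hash of the input. A second pipeline that shares a prefix with an earlier one then skips the shared step.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Assumption: a local directory stands in for HDFS storage.
STORE = Path(tempfile.mkdtemp(prefix="intermediate_"))
EXECUTED = []  # names of steps that actually ran (not served from the store)


def run_step(name, func, data):
    """Run one pipeline step, reusing a stored result when one exists."""
    key = hashlib.sha256(name.encode() + pickle.dumps(data)).hexdigest()
    path = STORE / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = func(data)
    EXECUTED.append(name)
    path.write_bytes(pickle.dumps(result))
    return result


def run_pipeline(image, steps):
    """Apply (name, function) steps in order, storing each intermediate state."""
    for name, func in steps:
        image = run_step(name, func, image)
    return image


def invert(img):
    return [[255 - p for p in row] for row in img]


def threshold(img):
    return [[255 if p >= 128 else 0 for p in row] for row in img]


def flip(img):
    return [row[::-1] for row in img]


image = [[0, 100], [200, 255]]
first = run_pipeline(image, [("invert", invert), ("threshold", threshold)])
# The second pipeline shares the "invert" step, so that result is reused.
second = run_pipeline(image, [("invert", invert), ("flip", flip)])
```

After both runs, `EXECUTED` lists `invert` only once, which is the kind of saving the abstract attributes to intermediate data management. Whether the saving outweighs the storage and lookup cost on a real cluster is exactly the question the paper investigates.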