Selected Publications

[1] Mostaeen, Golam, et al. "CloneCognition: machine learning based code clone validation tool." Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 2019.

Abstract: A code clone is a pair of similar code fragments within or between software systems. To detect every possible clone pair in a software system while handling complex code structures, clone detection tools apply considerable generalization to the original source code. This generalization often results in reported code fragments that are only coincidentally similar and are not considered clones by users, so the reported candidate clones must be manually validated, which is often both time-consuming and challenging. In this paper, we propose a machine learning based tool, 'CloneCognition' (open source code: https://github.com/pseudoPixels/CloneCognition ; video demonstration: https://www.youtube.com/watch?v=KYQjmdr8rsw), to automate this laborious manual validation process. The tool runs on top of any code clone detection tool to facilitate the clone validation process. It shows promising clone classification performance, with an accuracy of up to 87.4%, and exhibits significant improvement over state-of-the-art techniques for code clone validation.
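A minimal sketch of the validation idea (not the actual CloneCognition implementation): a classifier trained on similarity features of candidate clone pairs predicts whether a reported pair is a true clone. The two features below are illustrative stand-ins for the paper's feature set.

    # Sketch of ML-based clone validation; features are hypothetical.
    from sklearn.ensemble import RandomForestClassifier

    def extract_features(fragment_a: str, fragment_b: str) -> list:
        """Toy similarity features for a candidate clone pair."""
        tokens_a, tokens_b = set(fragment_a.split()), set(fragment_b.split())
        jaccard = len(tokens_a & tokens_b) / max(len(tokens_a | tokens_b), 1)
        length_ratio = min(len(fragment_a), len(fragment_b)) / max(
            len(fragment_a), len(fragment_b), 1)
        return [jaccard, length_ratio]

    # Tiny synthetic training set standing in for clone pairs manually
    # labeled by expert judges (1 = true clone, 0 = false positive).
    pairs = [("int a = b + c;", "int x = y + z;", 1),
             ("int a = b + c;", "for (i = 0; i < n; i++) s += i;", 0)]
    X = [extract_features(a, b) for a, b, _ in pairs]
    y = [label for _, _, label in pairs]
    model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

    def validate(fragment_a: str, fragment_b: str) -> bool:
        """True if the pair is predicted to be a real clone."""
        return bool(model.predict([extract_features(fragment_a, fragment_b)])[0])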



[2] Mostaeen, Golam, et al. "On the Use of Machine Learning Techniques Towards the Design of Cloud Based Automatic Code Clone Validation Tools." 2018 IEEE 18th International Working Conference on Source Code Analysis and Manipulation (SCAM). IEEE, 2018.

Abstract: A code clone is a pair of code fragments, within or between software systems, that are similar. Since code clones often negatively impact the maintainability of a software system, a great number of code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, clone detection tools work at the syntax level (e.g., text, tokens, ASTs) while lacking user-specific preferences. This often means the reported clones must be manually validated prior to any analysis in order to filter the true positive clones according to task- or user-specific considerations. This manual clone validation effort is very time-consuming and often error-prone, particularly for large-scale clone detection. In this paper, we propose a machine learning based approach for automating the validation process. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against manual validation by multiple expert judges. The proposed method shows promising results in several comparative studies with existing approaches for automatic code clone validation. We also present our experimental results in terms of different code clone detection tools, machine learning algorithms, and open source software systems.
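The abstract compares different machine learning algorithms; a minimal sketch of such a comparison, assuming a labeled dataset of clone-pair feature vectors (random data stands in for real features and labels here):

    # Sketch of comparing classifiers for clone validation via
    # cross-validation; X/y are synthetic placeholders for real
    # clone-pair features and manual true/false-positive labels.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.random((200, 8))         # 200 clone pairs, 8 similarity features
    y = rng.integers(0, 2, 200)      # 1 = true clone, 0 = false positive

    for name, clf in [("random forest", RandomForestClassifier(random_state=0)),
                      ("SVM", SVC()),
                      ("k-NN", KNeighborsClassifier())]:
        scores = cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
        print(f"{name}: mean accuracy = {scores.mean():.3f}")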



[3] Mostaeen, Golam, et al. "A Framework for Building Collaborative Scientific Workflows for Plant Phenotyping and Genotyping." Annual Plant Phenotyping and Imaging Research Centre Symposium, June 2017. (Poster)

Abstract: Plant genotyping and phenotyping are important for ensuring global food security. Various frameworks (e.g., Galaxy, iPlant Collaborative, GenAp, and LemnaTec) have been developed to automate scientific workflows/pipelines and support the computational needs of this domain. One challenge with these frameworks is that the associated stakeholders (e.g., agronomists, data specialists, image analysts, and tool developers) work in isolation to perform their tasks towards developing a pipeline, and consequently cannot perform those tasks effectively. These problems could be solved or mitigated if stakeholders could communicate and collaborate effectively while building a pipeline. As the existing frameworks for plant genotyping and phenotyping do not support building collaborative genomic and image processing pipelines, the design of such pipelines is error-prone and time-consuming. To address these shortcomings, we design a cloud-based framework that allows collaborative building of scientific pipelines. We report a preliminary evaluation involving three stakeholders: a tool developer, a bioinformatician, and an agronomist. Our user study shows that the developed framework is promising for building collaborative on-the-fly scientific pipelines for the subject domain.
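As a rough illustration of the pipeline-composition idea only (the step names and API below are hypothetical, not the framework's actual interface), different stakeholders could contribute steps that are chained into a single workflow:

    # Hypothetical sketch of composing a phenotyping pipeline from steps
    # contributed by different stakeholders; not the framework's real API.
    def segment_leaves(data: dict) -> dict:        # image analyst's step
        data["leaf_mask"] = f"mask({data['image']})"
        return data

    def measure_traits(data: dict) -> dict:        # agronomist's step
        data["leaf_area"] = f"area({data['leaf_mask']})"
        return data

    def run_pipeline(steps, data: dict) -> dict:
        """Executes contributed steps in order, passing shared state along."""
        for step in steps:
            data = step(data)
        return data

    result = run_pipeline([segment_leaves, measure_traits], {"image": "plot_42.png"})
    print(result)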



[4] Rahman, ABM Ashikur, Golam Mostaeen, and Md Hasanul Kabir. "A statistical approach for offline signature verification using local gradient features." 2016 2nd International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE). IEEE, 2016.

Abstract: Signatures are widely used as a means of personal verification, which emphasizes the need for automated signature verification systems. A single signature feature often produces unacceptable error rates. The features used in this method are mainly local key-point features that capture the orientation around each key-point. Before extracting the features, preprocessing of the scanned image is necessary to isolate the region of interest of the signature and remove any spurious noise. The system is initially trained using a database of signatures obtained from the individuals whose signatures are to be authenticated. To extract the features, key-points of the image are detected and, for each key-point, the orientation around it is computed. A decision is made by matching the features of the sample and test signatures: if a query signature falls within the acceptance range, it is considered authentic; otherwise, it is considered forged.
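A minimal sketch of the key-point orientation idea, using OpenCV's ORB detector as a stand-in for the paper's key-point and gradient machinery (the histogram size and acceptance threshold are illustrative assumptions, not the paper's values):

    # Sketch of key-point orientation matching for offline signature
    # verification; ORB and the threshold are illustrative stand-ins.
    import cv2
    import numpy as np

    def orientation_histogram(image_path: str, bins: int = 36) -> np.ndarray:
        """Histogram of key-point orientations (degrees), L1-normalized."""
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        keypoints = cv2.ORB_create().detect(img, None)
        angles = [kp.angle for kp in keypoints]
        hist, _ = np.histogram(angles, bins=bins, range=(0, 360))
        return hist / max(hist.sum(), 1)

    def is_authentic(reference_path: str, query_path: str,
                     threshold: float = 0.35) -> bool:
        """Accepts the query if its orientation histogram is close enough
        to the enrolled reference (threshold is an assumption)."""
        ref = orientation_histogram(reference_path)
        query = orientation_histogram(query_path)
        distance = np.abs(ref - query).sum()   # L1 distance between histograms
        return distance <= threshold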



[5] Ajwad, Rasif, Syed Nayem Hossain, Golam Mostaeen, and M. A. Mottalib. "An optimized algorithm to find maximum parsimonious tree using PrimeNucleotide based approach." In 2014 17th International Conference on Computer and Information Technology (ICCIT), pp. 127-131. IEEE, 2014.

Abstract: Phylogenetic inference methods like maximum parsimony perform an exhaustive search to extract evolutionary information from genomic sequences. Complexity arises as the number of sequences grows, since the number of possible solutions increases exponentially alongside it. In this paper, we propose an algorithm that efficiently identifies the highest-repeating nucleotide (the PrimeNucleotide) at each informative site to fix one ParentNode with the best-fitted nucleotide, using a predefined WeightMatrix, and thereby finds the most parsimonious phylogenetic tree in linear time. The algorithm has been applied to the genome sequences of different bacteria and viruses to demonstrate its efficiency and generality. The results obtained were similar to those of the traditional Transverse-parsimony method, and significant improvements in both time consumption and memory usage were achieved.
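A rough sketch of the PrimeNucleotide idea as described in the abstract (the weight matrix values are placeholders, not the paper's): for each site, the most frequent nucleotide is fixed at the parent node and the observed nucleotides are scored against it.

    # Sketch: per site, fix the parent node to the most frequent nucleotide
    # (the PrimeNucleotide) and score substitutions with a weight matrix.
    from collections import Counter

    # Placeholder weights: transitions cheaper than transversions.
    WEIGHT = {(a, b): (0 if a == b else
                       1 if {a, b} in ({"A", "G"}, {"C", "T"}) else 2)
              for a in "ACGT" for b in "ACGT"}

    def parsimony_parent(sequences: list) -> tuple:
        """Returns (parent sequence, total cost) in one linear pass per site."""
        parent, cost = [], 0
        for site in zip(*sequences):                 # column-wise over alignment
            prime = Counter(site).most_common(1)[0][0]   # the PrimeNucleotide
            parent.append(prime)
            cost += sum(WEIGHT[(prime, base)] for base in site)
        return "".join(parent), cost

    seqs = ["ACGTACGT", "ACGTACGA", "ACCTACGT"]
    print(parsimony_parent(seqs))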



[6] Mostaeen, Golam, Banani Roy, Chanchal Roy, and Kevin Schneider. "Designing for Real-Time Groupware Systems to Support Complex Scientific Data Analysis." Proceedings of the ACM on Human-Computer Interaction 3, no. EICS (2019): 1-28.

Abstract: Scientific Workflow Management Systems (SWfMSs) have become popular for accelerating the specification, execution, visualization, and monitoring of data-intensive scientific experiments. Unfortunately, to the best of our knowledge, no existing SWfMS directly supports collaboration. Data is increasing in complexity, dimensionality, and volume, and its efficient analysis often goes beyond the realm of an individual and requires collaboration with multiple researchers from varying domains. In this paper, we propose a groupware system architecture for data analysis that, in addition to supporting collaboration, also incorporates features from SWfMSs to support modern data analysis processes. As a proof of concept for the proposed architecture we developed SciWorCS, a groupware system for scientific data analysis. We present two real-world use-cases: collaborative software repository analysis and bioinformatics data analysis. The results of the experiments evaluating the proposed system are promising. Our bioinformatics user study demonstrates that SciWorCS can support real-world data analysis tasks through real-time collaboration among users.
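A toy sketch of the real-time collaboration idea, as an in-memory publish/subscribe of workflow edits (SciWorCS is a web system; its actual protocol is not shown here, and all names below are illustrative):

    # Toy sketch of real-time edit propagation in a collaborative
    # workflow editor; an in-memory stand-in for a web-based protocol.
    from dataclasses import dataclass, field

    @dataclass
    class Collaborator:
        name: str
        seen: list = field(default_factory=list)

        def receive(self, edit: dict) -> None:
            self.seen.append(edit)        # e.g., update the local view

    class WorkflowSession:
        """Broadcasts each collaborator's edit to everyone else."""
        def __init__(self):
            self.members = []

        def join(self, member: Collaborator) -> None:
            self.members.append(member)

        def apply_edit(self, author: Collaborator, edit: dict) -> None:
            for member in self.members:
                if member is not author:
                    member.receive(edit)

    session = WorkflowSession()
    alice, bob = Collaborator("alice"), Collaborator("bob")
    session.join(alice); session.join(bob)
    session.apply_edit(alice, {"module": "align", "param": "threads", "value": 4})
    print(bob.seen)   # bob sees alice's edit immediately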



[7] Mostaeen, Golam, Banani Roy, Chanchal K. Roy, and Kevin A. Schneider. "Fine-Grained Attribute Level Locking Scheme for Collaborative Scientific Workflow Development." In 2018 IEEE International Conference on Services Computing (SCC), pp. 273-277. IEEE, 2018.

Abstract: Scientific Workflow Management Systems have been widely used in recent years for data-intensive analysis tasks and domain-specific discoveries. Effectively analyzing large-scale scientific data of high complexity and dimensionality often exceeds what an individual can do and requires collaboration among members of different disciplines, so researchers have focused on designing collaborative workflow management systems. However, consistency management in the face of conflicting concurrent operations by collaborators is a major challenge in such systems. In this paper, we propose a locking scheme (e.g., a collaborator gets write access to non-conflicting components of the workflow at a given time) to facilitate consistency management in collaborative scientific workflow management systems. The proposed method allows locking workflow components at a granular level in addition to supporting locks on a targeted part of the collaborative workflow. We conducted several experiments to analyze the performance of the proposed method in comparison to related existing methods. Our studies show that the proposed method can reduce the average waiting time of a collaborator by up to 36.19% in comparison to existing descendent modular level locking techniques for collaborative scientific workflow management systems.
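A minimal sketch of the attribute-level locking idea: locks are keyed by (component, attribute) rather than by whole components or the whole workflow, so collaborators editing non-conflicting attributes proceed concurrently (the API names are illustrative, not the paper's implementation):

    # Sketch of fine-grained attribute-level locking for a
    # collaborative workflow; names are illustrative.
    import threading

    class AttributeLockManager:
        """Grants write locks per (component, attribute) pair, so edits to
        non-conflicting parts of the workflow never block each other."""
        def __init__(self):
            self._guard = threading.Lock()
            self._owners = {}        # (component, attribute) -> collaborator

        def acquire(self, collaborator: str, component: str, attribute: str) -> bool:
            key = (component, attribute)
            with self._guard:
                if self._owners.get(key) in (None, collaborator):
                    self._owners[key] = collaborator
                    return True
                return False         # someone else holds this attribute

        def release(self, collaborator: str, component: str, attribute: str) -> None:
            key = (component, attribute)
            with self._guard:
                if self._owners.get(key) == collaborator:
                    del self._owners[key]

    locks = AttributeLockManager()
    print(locks.acquire("alice", "align_module", "num_threads"))  # True
    print(locks.acquire("bob", "align_module", "input_path"))     # True: no conflict
    print(locks.acquire("bob", "align_module", "num_threads"))    # False: held by alice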



[8] Mostaeen, Golam. "Towards Collaborative Scientific Workflow Management System." Master's Thesis, December 2018.

Abstract: The big data explosion has impacted several domains in recent years, from research areas to a diverse range of business models. As this intensive amount of data opens up the possibility of many interesting knowledge discoveries, diverse research domains have shifted towards analyzing such massive amounts of data. Scientific Workflow Management Systems (SWfMSs) have gained much popularity in recent years in accelerating these data-intensive analyses, visualizations, and discoveries of important information. Data-intensive tasks are often significantly time-consuming and complex in nature, and hence SWfMSs are designed to efficiently support the specification, modification, execution, failure handling, and monitoring of the tasks in a scientific workflow. As far as the complexity, dimensionality, and volume of data are concerned, effective analysis or management often becomes challenging for an individual and instead requires the collaboration of multiple scientists. Hence the notion of a 'Collaborative SWfMS' was coined, which has gained significant interest among researchers in recent years, as none of the existing SWfMSs directly support real-time collaboration among scientists. In collaborative SWfMSs, consistency management in the face of conflicting concurrent operations by collaborators is a major challenge because of the highly interconnected document structure among the computational modules, where a minor change in one part of the workflow can strongly affect other parts through the data-link relations among them. In addition to consistency management, studies identify several other challenges that need to be addressed for a successful design of collaborative SWfMSs, such as sub-workflow composition and execution by different sub-groups, the relationship between scientific workflows and collaboration models, sub-workflow monitoring, and seamless integration and access control of workflow components among collaborators. In this thesis, we propose a locking scheme to facilitate consistency management in collaborative SWfMSs. The proposed method works by locking workflow components at a granular attribute level in addition to supporting locks on a targeted part of the collaborative workflow. We conducted several experiments to analyze the performance of the proposed method in comparison to related existing methods. Our studies show that the proposed method can reduce the average waiting time of a collaborator by up to 36% while increasing the average workflow update rate by up to 15% in comparison to existing descendent modular level locking techniques for collaborative SWfMSs. We also propose a role-based access control technique for the management of collaborative SWfMSs, leveraging the Collaborative Interactive Application Methodology (CIAM) to investigate role-based access control in the context of collaborative SWfMSs, and we present the proposed method with a use-case from the plant phenotyping and genotyping research domain. Recent studies show that collaborative SWfMSs present their own sets of opportunities and challenges. From our investigation of existing research on collaborative SWfMSs and the findings of our prior two studies, we propose an architecture for collaborative SWfMSs. We propose SciWorCS, a Collaborative Scientific Workflow Management System, as a proof of concept of the proposed architecture; to the best of our knowledge, it is the first of its kind. We present several real-world use-cases of scientific workflows using SciWorCS. Finally, we conduct several user studies using SciWorCS comprising different real-world scientific workflows (i.e., from myExperiment) to understand user behavior and styles of work in the context of collaborative SWfMSs. In addition to evaluating SciWorCS, the user studies reveal several interesting findings that can significantly contribute to the research domain, as none of the existing methods considered such empirical studies, relying instead only on computer-generated simulation studies for evaluation.
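The thesis also proposes role-based access control for collaborative SWfMSs; a minimal sketch of that idea (the roles and permissions below are illustrative assumptions, not the thesis's actual model):

    # Illustrative sketch of role-based access control for a collaborative
    # SWfMS; roles and permissions are assumptions, not the thesis's model.
    ROLE_PERMISSIONS = {
        "workflow_owner": {"view", "add_module", "remove_module",
                           "edit_params", "execute"},
        "analyst":        {"view", "edit_params", "execute"},
        "observer":       {"view"},
    }

    def is_allowed(role: str, operation: str) -> bool:
        """Checks whether a collaborator's role permits the operation."""
        return operation in ROLE_PERMISSIONS.get(role, set())

    assert is_allowed("analyst", "execute")
    assert not is_allowed("observer", "remove_module")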



[9] Chakroborti, Debasish, Banani Roy, Amit Mondal, Golam Mostaeen, Chanchal K. Roy, Kevin A. Schneider, and Ralph Deters. "A Data Management Scheme for Micro-Level Modular Computation-Intensive Programs in Big Data Platforms." In Data Management and Analysis, pp. 135-153. Springer, Cham, 2020.

Abstract: Big Data analytics and systems developed with parallel distributed processing frameworks (e.g., Hadoop and Spark) are becoming popular for finding important insights in huge amounts of heterogeneous data (e.g., image, text, and sensor data). These systems offer a wide range of tools and connect them to form workflows for processing Big Data. Independent schemes for managing the programs and data of workflows have already been proposed in different studies, and most such systems have been presented with data or metadata management. However, to the best of our knowledge, no study particularly discusses the performance implications of utilizing the intermediate states of data and programs generated at various execution steps of a workflow in distributed platforms. To address this shortcoming, we propose a Big Data management scheme for micro-level modular computation-intensive programs in a Spark and Hadoop based platform. In this paper, we investigate whether managing the intermediate states can speed up the execution of an image processing pipeline consisting of various image processing tools/APIs on the Hadoop Distributed File System (HDFS) while ensuring appropriate reusability and error monitoring. Our experiments produced notable results; for example, with intermediate data management we can save up to 87% of the computation time for an image processing job.
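A rough sketch of the intermediate-state reuse idea: before recomputing a pipeline stage, check whether its output already exists in HDFS and reuse it. The `hdfs dfs -test -e` existence check is standard Hadoop CLI; the stage names and paths are hypothetical, and this is not the paper's actual scheme.

    # Sketch of reusing intermediate pipeline states stored in HDFS;
    # assumes a Hadoop environment with the `hdfs` CLI on PATH.
    import subprocess

    def hdfs_exists(path: str) -> bool:
        """`hdfs dfs -test -e` exits with 0 if the path exists."""
        result = subprocess.run(["hdfs", "dfs", "-test", "-e", path],
                                capture_output=True)
        return result.returncode == 0

    def run_stage(name: str, compute, out_path: str) -> str:
        """Skips recomputation when the stage's intermediate output exists."""
        if hdfs_exists(out_path):
            print(f"{name}: reusing cached intermediate state at {out_path}")
            return out_path
        print(f"{name}: computing and persisting to {out_path}")
        compute(out_path)             # e.g., a Spark job writing to HDFS
        return out_path

    # Hypothetical two-stage image pipeline sharing intermediate results.
    run_stage("denoise", lambda p: None, "/data/pipeline/denoised")
    run_stage("segment", lambda p: None, "/data/pipeline/segmented")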



[10] Mostaeen, Golam, Banani Roy, Chanchal Roy, Kevin Schneider, and Jeffrey Svajlenko. "A Machine Learning Based Framework for Code Clone Validation." arXiv preprint arXiv:2005.00967 (2020).

Abstract: A code clone is a pair of code fragments, within or between software systems, that are similar. Since code clones often negatively impact the maintainability of a software system, several code clone detection techniques and tools have been proposed and studied over the last decade. To detect all possible similar source code patterns in general, clone detection tools work at the syntax level while lacking user-specific preferences. This often means the clones must be manually inspected before analysis in order to remove false positives from consideration. This manual clone validation effort is very time-consuming and often error-prone, particularly for large-scale clone detection. In this paper, we propose a machine learning approach for automating the validation process. Our machine learning based approach is used to automatically validate clones without human inspection. Thus the proposed approach can be used to remove false positive clones from detection results, automatically evaluate the precision of any clone detector on any given dataset, evaluate existing clone benchmark datasets, or even build new clone benchmarks and datasets with minimal effort. In an experiment with clones detected by several clone detectors in several different software systems, we found our approach has an accuracy of up to 87.4% when compared against manual validation by multiple expert judges. The proposed method also shows better results in several comparative studies with existing approaches for clone classification.
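One use-case named in the abstract is automatically evaluating a detector's precision; a minimal sketch, assuming a trained pair classifier like the one sketched under [1] (`validate` below is that hypothetical classifier, passed in as a function):

    # Sketch of estimating a clone detector's precision with a trained
    # validator; `validate(pair)` returns True for predicted true clones.
    def estimated_precision(reported_pairs, validate) -> float:
        """Fraction of the detector's reported pairs classified as true clones."""
        if not reported_pairs:
            return 0.0
        true_clones = sum(1 for pair in reported_pairs if validate(pair))
        return true_clones / len(reported_pairs)

    # Example with a stand-in validator:
    pairs = [("f1", "f2"), ("f3", "f4"), ("f5", "f6")]
    print(estimated_precision(pairs, validate=lambda pair: pair != ("f5", "f6")))
    # -> 0.666...: two of the three reported pairs validated as true clones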