ML-Based Framework for Evaluating Workflow Precision on HPC

Supervisors: Philippe Bonnet
Semester: Fall 2020
Tags: ML, reproducibility, workflow, HPC

Reproducibility is a cornerstone of the scientific method. Several systems are available today for building reproducible and shareable data and analysis pipelines, including workflow engines (e.g., GWL, Nextflow), package managers (e.g., bioconda), and container systems (e.g., Singularity). However, validating their executions on high-performance computers remains an open issue. Indeed, there are many sources of non-determinism at runtime (e.g., parallelism, scheduling, errors, heterogeneous cores). As a result, static workflow analysis is not enough to guarantee reproducibility. It is necessary to validate workflows dynamically, in real time as they are executed, without sacrificing performance or security.
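One concrete source of runtime non-determinism is floating-point arithmetic under parallelism: addition is not associative in floating point, so the order in which parallel partial results are reduced can change the final value. A minimal illustration:

```python
# Floating-point addition is not associative, so two schedules that
# reduce the same partial sums in different orders can produce
# different results -- a source of run-to-run divergence on HPC.
left_to_right = (0.1 + 0.2) + 0.3   # one reduction order
right_to_left = 0.1 + (0.2 + 0.3)   # another reduction order

print(left_to_right)                # 0.6000000000000001
print(right_to_left)                # 0.6
print(left_to_right == right_to_left)  # False
```

This is why bitwise comparison of numeric pipeline outputs across runs can fail even when every run is "correct", motivating tolerance-based or learned comparison of executions.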

The first step in the project is to reproduce a bioinformatics pipeline on a high-performance computer. The second step is to develop an ML-based framework for comparing workflow executions and thus reason about the precision of a given pipeline.
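As a starting point for the second step, comparing two executions could mean checking each workflow stage's outputs: exact (hash-based) equality for opaque files, and tolerance-based equality for numeric summaries. A minimal sketch, with hypothetical step names and a hypothetical run representation (step name mapped to either raw bytes or a float):

```python
import hashlib
import math

def compare_runs(run_a, run_b, rel_tol=1e-6):
    """Compare two workflow executions step by step.

    run_a / run_b are dicts mapping step names to outputs:
    bytes for opaque artifacts, floats for numeric summaries.
    Returns a per-step dict of booleans (True = outputs agree).
    """
    report = {}
    for step, a in run_a.items():
        b = run_b.get(step)
        if isinstance(a, float) and isinstance(b, float):
            # Numeric outputs: allow small run-to-run divergence.
            report[step] = math.isclose(a, b, rel_tol=rel_tol)
        else:
            # Opaque outputs: require bitwise-identical content.
            report[step] = (b is not None and
                            hashlib.sha256(a).hexdigest()
                            == hashlib.sha256(b).hexdigest())
    return report

# Hypothetical outputs of two runs of the same pipeline.
run1 = {"align": b"SAM-output", "mean_coverage": 30.000000001}
run2 = {"align": b"SAM-output", "mean_coverage": 30.000000002}
print(compare_runs(run1, run2))  # {'align': True, 'mean_coverage': True}
```

An ML-based framework would go beyond fixed tolerances, e.g., by learning from many reference executions which per-step divergences are benign and which signal a reproducibility problem.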