Using a Custom Compiler to Reduce Toil

At OICR, we analyse samples prepared in our in-house lab. The laboratory staff enters data into MISO, our laboratory information system, about each sample and how it was processed. What analysis we need to do depends on this information. That metadata is...not perfect. Sometimes there are data entry errors and some times additional information is added as the lab does quality control based on the analysis results. Sometimes there are changes based on lab housekeeping (e.g., transfer of materials between departments, re-labelling).

To get around this, we have an abstraction layer for data from the LIMS, called Pinery, that computes a hash of the metadata. If information changes in MISO, the Pinery hash will change and we know that the analysis may be incorrect.

This turns out to be too conservative though. For instance, if the lab moves a sample to a different freezer, this changes the metadata, but in a way that almost no analysis cares about. We can't simply exclude this field because some of that information does make it into QC reports. Unfortunately, a human has to reconcile these records and decide whether to re-analyse the data or update the hash. Determining which pieces of metadata are important is hard, differs for different analysis types, and it is easy to make a mistake.

Fortunately, the Shesmu compiler can come to the rescue. Shessu extracts all kinds of metadata from Pinery and makes it available to a small programs, called an olive, that direct what analysis to do. Here's an extract from an olive that runs BWAmem:

Where
    metatype =="chemical/seq-na-fastq-gzip"
    && (kit == "10X Genome Linked Reads") == linked_reads
    && workflow In ["bcl2fastq", "CASAVA", "chromium_mkfastq", "FileImportForAnalysis"]
    && library_design In ["WG", "EX" ,"TS", "NN", "CH", "AS", "CM"]
    && tissue_type != "X"
    && config::project_info::get(project).cancer_pipeline != ``

In this example, kit, library_design, tissue_type, and project come from Pinery. Other values, including, metatype and workflow, come from other sources. Since Shesmu knows that these values are important, it can use them to create a signature of only the data that is important.

The author of an input format can specify what fields are signable and the Shesmu compiler will track all signable variables. There are then signers, which are algorithms to convert these variables into an output signature. Built-in to Shesmu are:

  • std::signature::sha1 - create a stable SHA1 hash from all the signable variables
  • std::json::signature - create a JSON object with all of the signable variables as fields

Additional signers can be added to Shesmu using its plugin system.

Since a signature uses only the data we know was required. This makes it possible to automatically determine if the changes in MISO should trigger re-analysis or not. If the hashes from Pinery are mismatched, but the Shesmu signature match, the Pinery signatures of the analysis can be automatically updated. If the Shesmu signatures do not match, but the Pinery hashes match, a structural change has been made to the olive and it is equivalent to the old analysis, so the Shesmu signature can be updated. If both mismatch, then analysis is required.

This drastically reduces the amount of toil our operations group has to do sorting these out. The additional cost to the olive developers to use the signatures is tiny and requires to updating as the olive is changed and different fields are used.

To view or add a comment, sign in

More articles by Andre Masella

  • Using LLVM for fun and performance

    BAM is a common biological data format composed of reads, individual sequences of DNA observed using a DNA sequencing…

  • Don't argue with me, I know ptrace

    At OICR, we use Illumina DNA sequencing instruments which produce more output than we would want for a single sample…

    1 Comment
  • Getting it done fast and fast with Rust

    At OICR, I was writing a workflow for bcl2fastq, the tool that extracts data from Illumina sequencers into FASTQ, a…

  • Using INVOKE DYNAMIC on the JVM for Pluggable Services

    Shesmu is a program I developed at OICR to help execute analysis workflows. We keep a record of what analysis has been…

Explore content categories