Campus Event Calendar

Event Entry

What and Who

Terabyte-Scale Genome Analysis for Underfunded Labs

Sven Rahmann
MMCI; CISPA Helmholtz Center for Information Security
Joint Lecture Series
AG 1, INET, AG 5, RG1, SWS, AG 2, AG 4, D6, AG 3  
Public Audience
English

Date, Time and Location

Wednesday, 1 February 2023
12:15
60 Minutes
E1 5
002
Saarbrücken

Abstract

In 2023, you can get your personal genome sequenced for under 1000 Euros. If you do, you will receive the data (50 GB to 100 GB) on a USB stick or hard drive. The data consists of many short strings (of length 100 or 150) over the famous DNA alphabet {A,C,G,T}, for a total of 50 to 100 billion letters.
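To make the scale concrete, here is a rough back-of-envelope calculation in Python (illustrative numbers of my own, derived from the ranges quoted above):

    # Back-of-envelope check of the data scale (illustrative numbers only).
    total_letters = 100 * 10**9        # upper end: 100 billion letters
    read_length = 150                  # typical read length
    n_reads = total_letters // read_length
    print(f"~{n_reads:,} reads")       # ~666,666,666 reads
    # At one byte per letter, the raw sequence alone occupies ~100 GB,
    # consistent with the 50 GB to 100 GB range quoted above.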


With this personal data, you may want to learn about your ancestry, or judge your personal risk of various conditions, such as myocardial infarction, stroke, and breast cancer. To do so, you must search your individual genome for known DNA variants that are associated with certain population groups or diseases.
Because the raw data ("reads") consists of essentially random substrings of the genome, the first step is to find the place of origin of each read in the genome, an error-tolerant pattern search task.
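To illustrate the search task, here is a toy Python sketch of error-tolerant matching under the Hamming distance (a naive linear scan; real read mappers use indexed data structures and also handle insertions and deletions):

    def best_match(genome: str, read: str) -> tuple[int, int]:
        """Return (position, mismatches) of the best placement
        of `read` in `genome` under the Hamming distance."""
        best_pos, best_mm = -1, len(read) + 1
        for pos in range(len(genome) - len(read) + 1):
            window = genome[pos:pos + len(read)]
            mm = sum(a != b for a, b in zip(window, read))
            if mm < best_mm:
                best_pos, best_mm = pos, mm
        return best_pos, best_mm

    print(best_match("ACGTACGTTGCA", "ACGTTGCA"))  # (4, 0): exact hit at position 4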
In a medium-scale research study (say, on heart disease), we face similar tasks for a few hundred individual patients and healthy controls, for a total of roughly 30 to 50 TB of data, delivered on a few USB hard drives.
After primary analysis, the task is to find genetic variants (or, more coarsely, genes) related to the disease; that is, we face a pattern mining problem with millions of features but only a few hundred samples.
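As a toy illustration of one such per-variant association test (the counts below are hypothetical; real studies use dedicated tools and must correct for testing millions of variants):

    from scipy.stats import fisher_exact

    # One candidate variant: carrier counts in cases vs. controls
    # (hypothetical numbers for illustration).
    table = [[30, 70],   # cases:    carriers, non-carriers
             [10, 90]]   # controls: carriers, non-carriers
    odds_ratio, p_value = fisher_exact(table)
    print(odds_ratio, p_value)  # carriers are enriched among cases
    # Repeated for millions of variants, the p-values must then be
    # corrected for multiple testing.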
The full workflow for such a study consists of more than 100,000 individual steps, including simple per-sample steps (e.g., removing low-quality reads) and complex ones involving statistical models across all samples for variant calling. Particularly in a medical setting, each step must be fully reproducible, and we need to trace data provenance and maintain a chain of accountability.
Over the past ten years, we have worked on and contributed to many aspects of variant-calling workflows, and we have realized that the strategy of attacking ever-growing data with ever-growing compute clusters and storage systems will not scale well in the near future. Thus, our current work focuses on so-called alignment-free methods, which have the potential to yield the same answers as current state-of-the-art methods with 10 to 100 times less CPU work.
I will present our recent advances in laying better foundations for alignment-free methods: engineered and optimized parallel hash tables for short DNA pieces (k-mers), and the design of masks for gapped k-mers with optimal error tolerance. These new methods will enable even small labs to analyze large genomics datasets on a "good gaming PC", investing less than 5000 Euros in computational hardware.
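To give an idea of what a gapped k-mer under a mask is, here is a minimal Python sketch (the mask below is an arbitrary example; designing masks with optimal error tolerance, and replacing the plain dictionary with an engineered parallel hash table, is the actual research described above):

    from collections import defaultdict

    MASK = "##_#_##"   # '#': significant position, '_': don't-care (arbitrary example)
    CARE = [i for i, c in enumerate(MASK) if c == "#"]

    def gapped_kmers(seq: str):
        """Yield the gapped k-mer of every length-len(MASK) window of `seq`."""
        for start in range(len(seq) - len(MASK) + 1):
            window = seq[start:start + len(MASK)]
            yield "".join(window[i] for i in CARE)

    counts = defaultdict(int)                  # stand-in for a parallel hash table
    for kmer in gapped_kmers("ACGTACGTAC"):
        counts[kmer] += 1
    print(dict(counts))

The don't-care positions make the gapped k-mer tolerant to sequencing errors at exactly those positions, which is what a well-designed mask exploits.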
I will also advertise our workflow language and execution engine "Snakemake", a combination of Make and Python that is now one of the most frequently used bioinformatics workflow management tools, yet is not restricted to bioinformatics research.
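To give a flavor of Snakemake, here is a minimal, hypothetical Snakefile for one per-sample step (the file names and the quality_filter command are placeholders, not real tools):

    # Snakefile: a minimal, hypothetical per-sample filtering workflow.
    SAMPLES = ["patient1", "patient2"]         # plain Python above the rules

    rule all:
        input:
            expand("filtered/{sample}.fastq", sample=SAMPLES)

    # Remove low-quality reads, one job per sample.
    rule filter_reads:
        input:
            "raw/{sample}.fastq"
        output:
            "filtered/{sample}.fastq"
        shell:
            "quality_filter --min-q 20 {input} > {output}"  # placeholder command

Snakemake infers the dependency graph from the file-name patterns and re-runs only out-of-date steps, which is what keeps workflows with 100,000+ steps reproducible.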

Contact

Jennifer Müller
+49 681 9325 2900
email hidden

Virtual Meeting Details

Zoom
997 1565 5535
Passcode visible to logged-in users only
