Publications

For a full list of publications, see Google Scholar or my DBLP record.

Breaking through the computational bottlenecks in biology and medicine.

2023 MetaStore: High-Performance Metagenomic Analysis via In-Storage Computing

Conference

Metagenomics, the study of the genome sequences of diverse organisms in a common environment, has led to significant advancements in many fields. Since the species present in a metagenomic sample are not known in advance, metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species’ genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing data movement overhead. However, designing an in-storage processing system for metagenomics is challenging because none of the existing approaches can be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MetaStore, the first in-storage processing system designed to significantly reduce the data movement overhead of end-to-end metagenomic analysis. MetaStore is enabled by our lightweight and cooperative design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) lightweight in-storage accelerators, and 5) data mapping. Our evaluation shows that MetaStore outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7–37.2× and 6.9–100.2×, respectively, while matching the accuracy of the accuracy-optimized tool.
MetaStore achieves 1.5–5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated tool, while achieving significantly higher accuracy.

BibTex Key
Authors Arvid Gollwitzer | Can Firtina | Haiyu Mao | Harun Mustafa | Jisung Park | Joel Lindegger | Julien Eudine | Meryem Banu Cavlak | Mohammad Sadrosadati | Mohammed Alser | Nika Mansouri Ghiasi | Onur Mutlu
Tags
DOI Number https://doi.org/10.48550/arXiv.2311.12527

2024 MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing

Conference

Metagenomics, the study of the genome sequences of diverse organisms in a common environment, has led to significant advances in many fields. Since the species present in a metagenomic sample are not known in advance, metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species’ genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS’s design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. 
Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7×–37.2× and 6.9×–100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5×–5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.

BibTex Key
Authors Arvid Gollwitzer | Can Firtina | Haiyu Mao | Harun Mustafa | Jisung Park | Joel Lindegger | Julien Eudine | Meryem Banu Cavlak | Mohammad Sadrosadati | Mohammed Alser | Nika Mansouri Ghiasi | Onur Mutlu
Tags
DOI Number https://doi.org/10.1109/ISCA59077.2024.00054

2022 GenStore: A High-Performance In-Storage Processing System for Genome Sequence Analysis

Conference

Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, many prior works propose various approaches, such as accurate filters that select the reads within a dataset of genomic reads (called a read set) that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the amount of expensive computation, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depends on the genomes that are being compared. Through rigorous analysis of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD). 
Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07–6.05× (1.52–3.32×) for read sets with high similarity to the reference genome and 1.45–33.63× (2.70–19.2×) for read sets with low similarity to the reference genome.

BibTex Key
Authors Arvid Gollwitzer | Ataberk Olgun | Can Firtina | Damla Senol Cali | Haiyu Mao | Harun Mustafa | Jeremie Kim | Jisung Park | Mohammed Alser | Nandita Vijaykumar | Nika Mansouri Ghiasi | Nour Almadhoun Alserr | Onur Mutlu | Rachata Ausavarungnirun
Tags
DOI Number https://doi.org/10.1145/3503222.3507702

2024 SequenceLab: A Comprehensive Benchmark of Computational Methods for Comparing Genomic Sequences

Journal Article

Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most fundamental computational steps in most genomic analyses. Due to its high computational complexity, optimized exact and heuristic algorithms are still being developed. We find that these methods are highly sensitive to the underlying data, its quality, and various hyperparameters. Despite their wide use, no in-depth analysis of this sensitivity has been performed, so these methods may falsely discard genetic sequences from further analysis and unnecessarily inflate computational costs. We provide the first analysis and benchmark of this heterogeneity. We deliver an actionable overview of the 11 most widely used state-of-the-art methods for comparing genomic sequences. We also inform readers about the advantages and downsides of each method, using thorough experimental evaluation and different real datasets from all major manufacturers (i.e., Illumina, ONT, and PacBio). SequenceLab is publicly available at https://github.com/CMU-SAFARI/SequenceLab.
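
As a rough illustration of what "comparing genomic sequences" with an exact method can look like, here is a minimal sketch of textbook edit distance (Levenshtein) in Python. It is an assumption chosen for illustration only: the abstract does not say which 11 methods are benchmarked, and the function name `edit_distance` is hypothetical, not part of SequenceLab.

```python
def edit_distance(a: str, b: str) -> int:
    """Textbook Levenshtein distance between two sequences.

    Illustrative only; not taken from SequenceLab.
    Runs in O(len(a) * len(b)) time and O(len(b)) memory.
    """
    # prev[j] = distance between the current prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # One deleted base separates these two short sequences.
    print(edit_distance("ACGTTGCA", "ACGTGCA"))
```

The quadratic running time of this exact computation is the kind of cost that motivates the fast heuristic methods mentioned above.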

BibTex Key
Authors Arvid E. Gollwitzer | Can Firtina | Joel Lindegger | Maximilian-David Rumpf | Mohammed Alser | Nour Almadhoun | Onur Mutlu | Serghei Mangul
Tags
DOI Number https://doi.org/10.48550/arXiv.2310.16908

2024 MetaTrinity: Enabling Fast Metagenomic Classification via Seed Counting and Edit Distance Approximation

Journal Article

Metagenomics, the study of genome sequences of diverse organisms cohabiting in a shared environment, has experienced significant advancements across various medical and biological fields. Metagenomic analysis is crucial, for instance, in clinical applications such as infectious disease screening and the diagnosis and early detection of diseases such as cancer. A key task in metagenomics is to determine the species present in a sample and their relative abundances. Currently, the field is dominated by either alignment-based tools, which offer high accuracy but are computationally expensive, or alignment-free tools, which are fast but lack the needed accuracy for many applications. To address this tradeoff, we introduce MetaTrinity, a novel metagenomic classification tool leveraging heuristic-based seed counting and edit distance approximation, achieving a balance of speed and accuracy that surpasses existing methods. We benchmark MetaTrinity against two leading metagenomic classifiers, each representing a different end of the performance-accuracy spectrum. On one end, Kraken2, a tool optimized for performance, shows modest accuracy yet a rapid runtime. On the other end is Metalign, a tool optimized for accuracy. Our evaluations show that MetaTrinity achieves an accuracy comparable to Metalign while gaining a 4× speedup. Compared to Kraken2, MetaTrinity requires a 5× longer runtime yet delivers a 17× improvement in accuracy. This demonstrates a 3.4× enhancement in the accuracy-runtime tradeoff for MetaTrinity. This dual comparison positions MetaTrinity as a broadly applicable solution for metagenomic classification, combining the advantages of both ends of the spectrum: speed and accuracy. MetaTrinity is publicly available at https://github.com/CMU-SAFARI/MetaTrinity.
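
The 3.4-fold tradeoff figure is consistent with dividing the accuracy improvement by the runtime increase relative to Kraken2. The short Python sketch below checks that arithmetic. The abstract does not define the metric explicitly, so the ratio used here is an assumption, and `tradeoff_gain` is a hypothetical helper name.

```python
def tradeoff_gain(accuracy_gain: float, slowdown: float) -> float:
    """Accuracy improvement per unit of extra runtime.

    Assumed definition (not stated in the abstract): accuracy gain / slowdown.
    """
    if slowdown <= 0:
        raise ValueError("slowdown must be positive")
    return accuracy_gain / slowdown


if __name__ == "__main__":
    # Figures quoted in the abstract, relative to Kraken2:
    # 17-fold accuracy improvement at a 5-fold longer runtime.
    print(f"{tradeoff_gain(17.0, 5.0):.1f}")
```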

BibTex Key
Authors Arvid E. Gollwitzer | Can Firtina | Joel Bergtholdt | Joel Lindegger | Maximilian-David Rumpf | Mohammed Alser | Onur Mutlu | Serghei Mangul
Tags
DOI Number https://doi.org/10.48550/arXiv.2311.02029