Skip to main content

2024 | Buch

Research in Computational Molecular Biology

28th Annual International Conference, RECOMB 2024, Cambridge, MA, USA, April 29–May 2, 2024, Proceedings

insite
SUCHEN

Über dieses Buch

This book constitutes the proceedings of the 28th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2024, held in Cambridge, MA, USA, during April 29–May 2, 2024.

The 57 full papers included in this book were carefully reviewed and selected from 352 submissions. They were organized in topical sections as follows: theoretical and foundational algorithm contributions and more applied directions that engage with new technologies and intriguing biological questions.

Inhaltsverzeichnis

Frontmatter
Fast Approximate IsoRank for Scalable Global Alignment of Biological Networks

The pioneering and still popular IsoRank method of Singh, Xu, and Berger for global alignment of two protein-protein interaction networks across species was introduced at Recomb in 2007, and was awarded the Recomb test of time award in 2019. However, with the availability of increasing amounts of experimental data the number of edges in the networks to align has grown considerably, making running IsoRank unfeasible on these networks without access to substantial computational resources. In this paper, we develop a new IsoRank approximation that exploits the mathematical properties of IsoRank’s linear system to solve the problem in quadratic time with respect to the maximum size of the two PPI networks. We further propose a refinement to this initial approximation so that the updated result is even closer to the original IsoRank formulation while remaining computationally inexpensive. In experiments on synthetic and real PPI networks with various proposed metrics to measure alignment quality, we find the results of our approximate IsoRank are nearly as accurate as the original IsoRank. In fact, for functional enrichment-based measures of global network alignment quality we find our approximation performs better than exact IsoRank, doubtless because it is more robust to the noise of missing or incorrect edges. It also performs competitively against two more recent global network alignment algorithms.

Kapil Devkota, Anselm Blumer, Xiaozhe Hu, Lenore Cowen
Sequential Optimal Experimental Design of Perturbation Screens Guided by Multi-modal Priors

Understanding a cell’s expression response to genetic perturbations helps to address important challenges in biology and medicine. This includes the inference of gene circuits, the discovery of therapeutic targets and the reprogramming and engineering of cells. In recent years, Perturb-seq, pooled genetic screens with single cell RNA-seq (scRNA-seq) readouts, has emerged as a common method to collect such data. Despite technological advancements, the unpredictable, non-additive effects of gene perturbation combinations imply that the number of experimental configurations far exceeds what is experimentally feasible. In some cases, this may even surpass the number of available cells for research. While recent machine learning models, trained on existing Perturb-seq datasets, can predict perturbation outcomes with some degree of accuracy, they are currently limited by sub-optimal training set selection and the small number of cell contexts of training data. This leads to poor predictions for unexplored parts of perturbation space. As biologists deploy Perturb-seq across diverse biological systems, there is a significant need for algorithms to design iterative experiments. These tools are essential for exploring the large space of possible perturbations and their combinations. Here, we propose a sequential approach for designing Perturb-seq experiments that uses the model to strategically select the most informative perturbations at each step for subsequent experiments. This enables a significantly more efficient exploration of the perturbation space, while predicting the effect of the rest of the unseen perturbations with high-fidelity. Analysis of a previous large-scale Perturb-seq experiment reveals that our setting is severely restricted by the number of examples and rounds, falling into a non-conventional active learning regime called “active learning on a budget”. Motivated by this insight, we develop IterPert, a novel active learning method that exploits rich and multi-modal prior knowledge in order to efficiently guide the selection of subsequent perturbations. Using prior knowledge for this task is novel, and crucial for successful active learning on a budget. We validate IterPert using in-silico benchmarking of active learning, constructed from a large-scale CRISPRi Perturb-seq dataset. We find that IterPert outperforms other active learning strategies by reaching comparable accuracy at only a third of the number of perturbations profiled as the next best method. Overall, our results demonstrate the potential of sequentially designing perturbation screens through IterPert.

Kexin Huang, Romain Lopez, Jan-Christian Hütter, Takamasa Kudo, Antonio Rios, Aviv Regev
Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistic and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models.We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings.Availability. The software is freely available at https://github.com/ fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2 -reproducibility .

Askar Gafurov, Tomáš Vinař, Paul Medvedev, Broňa Brejová
Secure Federated Boolean Count Queries Using Fully-Homomorphic Cryptography

Biomedical data is often distributed between a network of custodians, causing challenges for researchers wishing to securely compute aggregate statistics on those data without centralizing everything—the prototypical ‘count query’ asks how many patients match some multifaceted set of conditions across a network of hospitals. Difficulty arises from two sources: (1) the need to deduplicate patients who may be present in the records of multiple hospitals and (2) the need to unify partial records for the same patient which may be split across hospitals. Although cryptographic tools for secure computation promise to enable collaborative studies with formal privacy guarantees, existing approaches either are computationally impractical or support only simplified analysis pipelines. To the best of our knowledge, no existing practical secure method addresses both of these difficulties simultaneously.Here, we introduce secure federated Boolean count queries using a novel 2-stage probabilistic sketching and sampling protocol that can be efficiently implemented in off-the-shelf federated homomorphic encryption libraries (Palisade and Lattigo), provably ensuring data security. To this end, we needed several key technological innovations, including re-encoding the LogLog union-cardinality sketch and designing an appropriate sampling for intersection cardinalities. Our benchmarking shows that we can answer federated Boolean count queries in less than 2 CPU-minutes with absolute errors in the range of 6% of the total number of touched records, while revealing only the final answer and the total number of touched records. With modern core-parallelism, we can thus answer queries on the order of seconds. Our study demonstrates that by computing on compressed and encrypted data, it is possible to securely answer federated Boolean count queries in real-time.

Alexander T. Leighton, Yun William Yu
FragXsiteDTI: Revealing Responsible Segments in Drug-Target Interaction with Transformer-Driven Interpretation

Drug-Target Interaction (DTI) prediction is vital for drug discovery, yet challenges persist in achieving model interpretability and optimizing performance. We propose a novel transformer-based model, FragXsiteDTI, that aims to address these challenges in DTI prediction. Notably, FragXsiteDTI is the first DTI model to simultaneously leverage drug molecule fragments and protein pockets. Our information-rich representations for both proteins and drugs offer a detailed perspective on their interaction. Inspired by the Perceiver IO framework, our model features a learnable latent array, initially interacting with protein binding site embeddings using cross-attention and later refined through self-attention and used as a query to the drug fragments in the drug’s cross-attention transformer block. This learnable query array serves as a mediator and enables seamless information translation, preserving critical nuances in drug-protein interactions. Our computational results on three benchmarking datasets demonstrate the superior predictive power of our model over several state-of-the-art models. We also show the interpretability of our model in terms of the critical components of both target proteins and drug molecules within drug-target pairs.

Ali Khodabandeh Yalabadi, Mehdi Yazdani-Jahromi, Niloofar Yousefi, Aida Tayebi, Sina Abdidizaji, Ozlem Ozmen Garibay
An Integer Programming Framework for Identifying Stable Components in Asynchronous Boolean Networks

Executable models of biological circuits offer the ability to simulate their behavior under different settings with important biomedical applications. In particular, Boolean network models have been a prime research focus and dozens of manually curated Boolean models are available in public databases. A key challenge in studying the dynamics of these models is determining their asymptotic behavior, that is the state-sets or attractors they converge to. This is particularly challenging for large networks, as the state space size grows exponentially. Here we introduce a novel method for identifying stable components within attractors under an asynchronous update scheme. Our method leverages the observation that the majority of cellular functions in current models can be described as linear threshold functions, facilitating an efficient integer programming formulation for the problem. We conduct simulations on both synthetic and real biological networks, demonstrating that our proposed method is highly efficient and outperforms previous methods.

Shani Jacobson, Roded Sharan
ImputeCC Enhances Integrative Hi-C-Based Metagenomic Binning Through Constrained Random-Walk-Based Imputation

Metagenomic Hi-C (metaHi-C) enables the recognition of relationships between contigs in terms of their physical proximity within the same cell, facilitating the reconstruction of high-quality metagenome-assembled genomes (MAGs) from complex microbial communities. However, current Hi-C-based contig binning methods solely depend on Hi-C interactions between contigs to group them, ignoring invaluable biological information, including the presence of single-copy marker genes. Here, we introduce ImputeCC, an integrative contig binning tool tailored for metaHi-C datasets. ImputeCC integrates Hi-C interactions with the inherent discriminative power of single-copy marker genes, initially clustering them as preliminary bins, and develops a new constrained random walk with restart (CRWR) algorithm to improve Hi-C connectivity among these contigs. Extensive evaluations on mock and real metaHi-C datasets from diverse environments, including the human gut, wastewater, cow rumen, and sheep gut, demonstrate that ImputeCC consistently outperforms other Hi-C-based contig binning tools. ImputeCC’s genus-level analysis of the sheep gut microbiota further reveals its ability and potential to recover essential species from dominant genera such as Bacteroides, detect previously unrecognized genera, and shed light on the characteristics and functional roles of genera such as Alistipes within the sheep gut ecosystem.Availability: ImputeCC is implemented in Python and available at https://github.com/dyxstat/ImputeCC . The Supplementary Information is available at https://doi.org/10.5281/zenodo.10776604 .

Yuxuan Du, Wenxuan Zuo, Fengzhu Sun
Graph-Based Genome Inference from Hi-C Data

Three-dimensional chromosome structure plays an important role in fundamental genomic functions. Hi-C, a high-throughput, sequencing-based technique, has drastically expanded our comprehension of 3D chromosome structures. The first step of Hi-C analysis pipelines involves mapping sequencing reads from Hi-C to linear reference genomes. However, the linear reference genome does not incorporate genetic variation information, which can lead to incorrect read alignments, especially when analyzing samples with substantial genomic differences from the reference such as cancer samples. Using genome graphs as the reference facilitates more accurate mapping of reads, however, new algorithms are required for inferring linear genomes from Hi-C reads mapped on genome graphs and constructing corresponding Hi-C contact matrices, which is a prerequisite for the subsequent steps of the Hi-C analysis such as identifying topologically associated domains and calling chromatin loops. We introduce the problem of genome sequence inference from Hi-C data mediated by genome graphs. We formalize this problem, show the hardness of solving this problem, and introduce a novel heuristic algorithm specifically tailored to this problem. We provide a theoretical analysis to evaluate the efficacy of our algorithm. Finally, our empirical experiments indicate that the linear genomes inferred from our method lead to the creation of improved Hi-C contact matrices, which are more effective in accurately capturing the structures of topologically associated domains.

Yihang Shen, Lingge Yu, Yutong Qiu, Tianyu Zhang, Carl Kingsford
Meta-colored Compacted de Bruijn Graphs

The colored compacted de Bruijn graph (c-dBG) has become a fundamental tool used across several areas of genomics and pangenomics. For example, it has been widely adopted by methods that perform read mapping or alignment, abundance estimation, and subsequent downstream analyses. These applications essentially regard the c-dBG as a map from $$k$$ k -mers to the set of references in which they appear. The c-dBG data structure should retrieve this set—the color of the $$k$$ k -mer—efficiently for any given $$k$$ k -mer, while using little memory. To aid retrieval, the colors are stored explicitly in the data structure and take considerable space for large reference collections, even when compressed. Reducing the space of the colors is therefore of utmost importance for large-scale sequence indexing.We describe the meta-colored compacted de Bruijn graph (Mac-dBG)—a new colored de Bruijn graph data structure where colors are represented holistically, i.e., taking into account their redundancy across the whole collection being indexed, rather than individually as atomic integer lists. This allows the factorization and compression of common sub-patterns across colors. While optimizing the space of our data structure is NP-hard, we propose a simple heuristic algorithm that yields practically good solutions. Results show that the Mac-dBG data structure improves substantially over the best previous space/time trade-off, by providing remarkably better compression effectiveness for the same (or better) query efficiency . This improved space/time trade-off is robust across different datasets and query workloads.Code availability. A C++17 implementation of the Mac-dBG is publicly available on GitHub at: https://github.com/jermp/fulgor .

Giulio Ermanno Pibiri, Jason Fan, Rob Patro
Color Coding for the Fragment-Based Docking, Design and Equilibrium Statistics of Protein-Binding ssRNAs

We revisit the fragment-based docking and design of single-stranded RNA aptamers (ssRNAs), consisting of k nucleotides, onto a rigid protein. Fragments, representing short sequences of (modified) nucleotides, are individually docked as poses onto the protein surface using a force field. Compatible poses are then assembled while optimizing for an additive notion of energy, to obtain stable conformations that can either be constrained to represent an input ssRNA sequence (docking) or left unconstrained (design). However, a brute-force enumeration of clash-free conformations quickly becomes prohibitive due to their superexponential ( $$\varTheta (n^k)$$ Θ ( n k ) worst-case) combinatorial explosion, hindering the potential of fragment-based methods towards docking and design.In this work, we adapt the color-coding technique, introduced by Alon, Yuster and Zwick, to optimize over self-avoiding fragment assemblies in time/space linear on n the number of poses, and in time only exponential on k the number of fragments. The dynamic programming algorithm at the core of our method is surprisingly simple, and can be extended to produce suboptimal candidates, or modified to perform Boltzmann sampling of candidates assemblies. Using a rejection principle, and further optimized by a clique decomposition of clashing poses, these algorithms can be leveraged into efficient algorithms optimizing over clash-free complexes. The resulting sampling procedure can further be adapted into statistically-consistent estimators for any computable feature of interest.We showcase some of the capabilities of this new framework by reanalyzing a set of 7 documented ssRNA-protein complexes, demonstrating its practical relevance and versatility.

Taher Yacoub, Roy González-Alemán, Fabrice Leclerc, Isaure Chauvot de Beauchêne, Yann Ponty
Automated Design of Efficient Search Schemes for Lossless Approximate Pattern Matching

We present a novel method for the automated design of search schemes for lossless approximate pattern matching. Search schemes are combinatorial structures that define a series of searches, each of which specifies lower and upper bounds on the number of errors in each part of a partitioned search pattern, and the processing order of these parts. Collectively, these searches guarantee that all approximate occurrences of a search pattern up to a predefined number of k errors are identified. Because generating efficient search schemes is increasingly computationally expensive for larger k, search schemes have been proposed in literature for only up to $$k=4$$ k = 4 errors.To design search schemes allowing more errors, we combine a greedy algorithm and a new Integer Linear Programming (ILP) formulation. Efficient, ILP-optimal search schemes for up to $$k=7$$ k = 7 errors are proposed and shown to outperform alternative strategies, both in theory and in practice. Additionally, we propose a technique to dynamically select an appropriate search scheme given a specific search pattern. These combined approaches result in reductions of up to 53% in runtime for higher values of k.We introduce Hato, an open-source software tool (AGPL-3.0 license) to automatically generate search schemes. It implements the greedy algorithm and solves the ILP formulation using CPLEX. Furthermore, we present Columba 1.2, an open-source lossless read mapper (AGPL-3.0 license) implemented in C++. Columba can identify all approximate occurrences of 100 000 Illumina reads (150 bp) in the human reference genome in 24 s (maximum edit distance of 4) and in 75 s (edit distance of 6) using a single CPU core, thereby outperforming existing state-of-the-art tools for lossless approximate matching by a large margin.

Luca Renders, Lore Depuydt, Sven Rahmann, Jan Fostier
CELL-E: A Text-to-Image Transformer for Protein Image Prediction

Accurately predicting cellular activities of proteins based on their primary amino acid sequences would greatly improve our understanding of the proteome. In this paper, we present CELL-E, a text-to-image transformer model that generates 2D probability density images describing the spatial distribution of proteins within cells. Given an amino acid sequence and a reference image for cell or nucleus morphology, CELL-E predicts a more refined representation of protein localization, as opposed to previous in silico methods that rely on pre-defined, discrete class annotations of protein localization to subcellular compartments.

Emaad Khwaja, Yun S. Song, Bo Huang
A Scalable Optimization Algorithm for Solving the Beltway and Turnpike Problems with Uncertain Measurements

The Beltway and Turnpike problems entail the reconstruction of circular and linear one-dimensional point sets from unordered pairwise distances. These problems arise in computational biology when the measurements provide distances but do not associate those distances with the entities that gave rise to them. Such applications include molecular structure determination, genomic sequencing, tandem mass spectrometry, and molecular error-correcting codes (since sequencing and mass spec technologies can give lengths or weights, usually without connecting them to endpoints). Practical algorithms for Turnpike are known when the distance measurements are accurate, but both problems become strongly NP-hard under any level of measurement uncertainty. This is problematic since all known applications experience some degree of uncertainty from uncontrollable factors. Traditional algorithms cope with this complexity by exploring a much larger solution space, leading to exponential blowup in terms of both time and space. To alleviate both issues, we propose a novel alternating optimization algorithm that can scale to large, uncertain distance sets with as many as 100,000 points. This algorithm is space and time-efficient, with each step running in $$\mathcal {O}(m \log (m))$$ O ( m log ( m ) ) time and requiring only $$\mathcal {O}(\sqrt{m})$$ O ( m ) working space for a distance set of size m. Evaluations of this approach on synthetic and partial digest data showcase improved accuracy and scalability in the presence of uncertain, duplicated, and missing distances. Our implementation of the algorithm is available at https://github.com/Kingsford-Group/turnpikesolvermm .

C. S. Elder, Minh Hoang, Mohsen Ferdosi, Carl Kingsford
Overcoming Observation Bias for Cancer Progression Modeling

Cancers evolve by accumulating genetic alterations, such as mutations and copy number changes. The chronological order of these events is important for understanding the disease, but not directly observable from cross-sectional genomic data. Cancer progression models (CPMs), such as Mutual Hazard Networks (MHNs), reconstruct the progression dynamics of tumors by learning a network of causal interactions between genetic events from their co-occurrence patterns. However, current CPMs fail to include effects of genetic events on the observation of the tumor itself and assume that observation occurs independently of all genetic events. Since a dataset contains by definition only tumors at their moment of observation, neglecting any causal effects on this event leads to the “conditioning on a collider” bias: Events that make the tumor more likely to be observed appear anti-correlated, which results in spurious suppressive effects or masks promoting effects among genetic events. Here, we extend MHNs by modeling effects from genetic progression events on the observation event, thereby correcting for the collider bias. We derive an efficient tensor formula for the likelihood function and learn two models on somatic mutation datasets from the MSK-IMPACT study. In colon adenocarcinoma, we find a strong effect on observation by mutations in TP53, and in lung adenocarcinoma by mutations in EGFR. Compared to classical MHNs, this explains away many spurious suppressive interactions and uncovers several promoting effects.The data, code, and results are available at https://github.com/cbg-ethz/ObservationMHN .

Rudolf Schill, Maren Klever, Andreas Lösch, Y. Linda Hu, Stefan Vocht, Kevin Rupp, Lars Grasedyck, Rainer Spang, Niko Beerenwinkel
Inferring Metabolic States from Single Cell Transcriptomic Data via Geometric Deep Learning

The ability to measure gene expression at single-cell resolution has elevated our understanding of how biological features emerge from complex and interdependent networks at molecular, cellular, and tissue scales. As technologies have evolved that complement scRNAseq measurements with things like single-cell proteomic, epigenomic, and genomic information, it becomes increasingly apparent how much biology exists as a product of multimodal regulation. Biological processes such as transcription, translation, and post-translational or epigenetic modification impose both energetic and specific molecular demands on a cell and are therefore implicitly constrained by the metabolic state of the cell. While metabolomics is crucial for defining a holistic model of any biological process, the chemical heterogeneity of the metabolome makes it particularly difficult to measure, and technologies capable of doing this at single-cell resolution are far behind other multiomics modalities. To address these challenges, we present GEFMAP (Gene Expression-based Flux Mapping and Metabolic Pathway Prediction), a method based on geometric deep learning for predicting flux through reactions in a global metabolic network using transcriptomics data, which we ultimately apply to scRNAseq. GEFMAP leverages the natural graph structure of metabolic networks to learn both a biological objective for each cell and estimate a mass-balanced relative flux rate for each reaction in each cell using novel deep learning models.

Holly R. Steach, Siddharth Viswanath, Yixuan He, Xitong Zhang, Natalia Ivanova, Matthew Hirn, Michael Perlmutter, Smita Krishnaswamy
Computing Robust Optimal Factories in Metabolic Reaction Networks

Perhaps the most fundamental model in synthetic and systems biology for inferring pathways in metabolic reaction networks is a metabolic factory: a system of reactions that starts from a set of source compounds and produces a set of target molecules, while conserving or not depleting intermediate metabolites. Finding a shortest factory—that minimizes a sum of real-valued weights on its reactions to infer the most likely pathway—is NP-complete. The current state-of-the-art for shortest factories solves a mixed-integer linear program with a major drawback: it requires the user to set a critical parameter, where too large a value can make optimal solutions infeasible, while too small a value can yield degenerate solutions due to numerical error.We present the first robust algorithm for optimal factories that is both parameter-free (relieving the user from determining a parameter setting) and degeneracy-free (guaranteeing it finds an optimal nondegenerate solution). We also give for the first time a complete characterization of the graph-theoretic structure of shortest factories via cuts of hypergraphs that reveals two important classes of degenerate solutions which were overlooked and potentially output by the prior state-of-the-art. In addition we settle the relationship between the two established pathway models of hyperpaths and factories by proving that hyperpaths are actually a subclass of factories. Comprehensive experiments over all instances from the standard metabolic reaction databases in the literature demonstrate our algorithm is fast in practice, quickly finding optimal factories in large real-world networks containing thousands of reactions.A preliminary implementation of our algorithm for robust optimal factories in a new tool called Freeia is available free for research use at http://freeia.cs.arizona.edu .

Spencer Krieger, John Kececioglu
Undesignable RNA Structure Identification via Rival Structure Generation and Structure Decomposition

RNA design is the search for a sequence or set of sequences that will fold into predefined structures, also known as the inverse problem of RNA folding. While numerous RNA design methods have been invented to find sequences capable of folding into a target structure, little attention has been given to the identification of undesignable structures according to the minimum free energy ( $$\textrm{MFE}$$ MFE ) criterion under the Turner model. In this paper, we address this gap by first introducing mathematical theorems outlining sufficient conditions for recognizing undesignable structures, then proposing efficient algorithms, guided by these theorems, to verify the undesignability of RNA structures. Through the application of these theorems and algorithms to the Eterna100 puzzles, we demonstrate the ability to efficiently establish that 15 of the puzzles indeed fall within the category of undesignable structures. In addition, we provide specific insights from the study of undesignability, in the hope that it will enable more understanding of RNA folding and RNA design.Availability: Our source code is available at https://github.com/shanry/RNA-Undesign .

Tianshuo Zhou, Wei Yu Tang, David H. Mathews, Liang Huang
Structure- and Function-Aware Substitution Matrices via Learnable Graph Matching

Substitution matrices, which are crafted to quantify the functional impact of substitutions or deletions in biomolecules, are central component of remote homology detection, functional element discovery, and structure prediction algorithms. In this work we explore the use of biological structures and prior knowledge about molecular function (e.g. experimental data or functional annotations) as additional information for building more expressive substitution matrices compared to the traditional frequency-based methods. External prior knowledge in the form of family annotations have been exploited for specialized sequence alignment methods, and substitution matrices on structural alphabets have led to advances in remote homology detection. However, no method has integrated both structural information as well as external priors without the need of pre-curated alignments.Here we propose a general algorithmic framework for learning structure-based substitution matrices automatically conditioned on any prior knowledge. In particular, we represent the structures of interest as graphs and we learn, using graph neural networks, suitable substitution cost matrices such that the resulting graph matching metric correlates with the prior at hand. Our method shows promising performance in functional similarity classification tasks and molecular database searching and shows potential for interpreting the functional importance of substructures.Code and data are available at: https://github.com/BorgwardtLab/GraphMatchingSubstitutionMatrices .

Paolo Pellizzoni, Carlos Oliver, Karsten Borgwardt
Secure Discovery of Genetic Relatives Across Large-Scale and Distributed Genomic Datasets

Finding related individuals in genomic datasets is a necessary step in many genetic analysis workflows and has broader societal value as a tool for retrieving lost relatives. However, detecting such relationships is often infeasible when the dataset is distributed across multiple entities due to privacy concerns. Although cryptographic techniques for secure computation offer ways to jointly analyze distributed datasets with privacy guarantees, the sheer computational burden of operations required for identifying kinship, such as pairwise sequence comparison of all individuals across large datasets, presents key challenges to developing a practical privacy-preserving solution. We introduce SF-Relate, a secure federated algorithm for identifying genetic relatives across data silos (Fig. 1) that scales efficiently to large datasets that include hundreds of thousands of individuals. We leverage the key insight that the number of individual pairs to compare can be vastly reduced, while maintaining accurate detection, through innovative locality-sensitive hashing of individuals who are likely to be related together into buckets and then testing relationships only between individuals in corresponding buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with our novel micro-bucketing strategy is key to achieving accurate and practical private relative detection. To guarantee privacy, we devise an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow the parties to cooperatively compute the relatedness coefficients between pairs of individuals, and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on real genomic datasets of varying sizes, from the UK Biobank and All of Us datasets. On the largest dataset of 200K individuals split between two parties, SF-Relate securely detects 94.9% of third degree relatives, and 99.9% of relatives second-degree or closer within 15 h of runtime. Our work enables secure identification of relatives across large-scale genomic datasets. Our code is available at https://github.com/froelich/sf-relate , and the full manuscript preprint is available at https://doi.org/10.1101/2024.02.16.580613 .

Matthew M. Hong, David Froelicher, Ricky Magner, Victoria Popic, Bonnie Berger, Hyunghoon Cho
GFETM: Genome Foundation-Based Embedded Topic Model for scATAC-seq Modeling

Single-cell Assay for Transposase-Accessible Chromatin with sequencing (scATAC-seq) has emerged as a powerful technique for investigating open chromatin landscapes at the single-cell level. Yet, scATAC-seq cell representation learning and its downstream tasks remain challenging due to the inherent high dimensional, sparse, and noisy properties of the data.

Yimin Fan, Yu Li, Jun Ding, Yue Li
SEM: Size-Based Expectation Maximization for Characterizing Nucleosome Positions and Subtypes

Nucleosome landscapes across the genome are typically characterized using micrococcal nuclease sequencing (MNase-seq). MNase is an endo-exonuclease that preferentially digests accessible DNA between nucleosomes.

Jianyu Yang, Kuangyu Yen, Shaun Mahony
Centrifuger: Lossless Compression of Microbial Genomes for Efficient and Accurate Metagenomic Sequence Classification

Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. Due to the increasing availability of microbial genomes, classification methods tend to store the genome database in an approximate way, keeping the memory footprint within a practical range. In contrast, Centrifuger losslessly compresses the Burrows-Wheeler transformed (BWT) sequence from microbial genomes using a novel compression algorithm called run-block compression. We prove that the run-block compression achieves sublinear space complexity, $$O(\frac{n}{\sqrt{l}})$$ O ( n l ) words, where $$n$$ n is the sequence length and $$l$$ l is the average run length. This space complexity falls between the no-compression wavelet tree representation using $$O\left(n\right)$$ O n words and the run-length compression representation using $$O(\frac{n}{l})$$ O ( n l ) words. Run-block compression is effective at compressing microbial databases like RefSeq, where the average run length of the BWT sequence is low, e.g., about 6.8. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the index size by half compared to its predecessor, Centrifuge. Lossless compression helps Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels such as species and genus. Additionally, run-block compression supports rapid rank queries in $$O\left(log\sigma \right)\mathrm{ time}$$ O l o g σ time , the same order as the wavelet-tree rank query. Despite its use of a compressed data structure, Centrifuger is as fast as Centrifuge in terms of processing speed.

Li Song, Ben Langmead
BONOBO: Bayesian Optimized Sample-Specific Networks Obtained by Omics Data

Correlation networks can provide important insights into biological systems by uncovering intricate interactions between genes and their molecular regulators. However, methods for estimating co-expression networks generally derive an aggregate population-specific network that represents the mean regulatory properties of the entire population and hence falls short in capturing heterogeneity across individuals. While numerous methods have been proposed to estimate sample-specific co-expression networks, they fail to estimate positive semidefinite correlation networks and, hence, are subject to misinterpretation. To fill this gap in co-expression network inference, we introduce BONOBO (Bayesian Optimized Networks Obtained By assimilating Omics data), a scalable Bayesian model for deriving individual sample-specific co-expression networks by acknowledging heterogeneity in molecular interactions across individuals. For each sample, BONOBO imposes a Gaussian distribution on the log-transformed, centered gene expression and a conjugate Inverse Wishart prior distribution on the sample-specific co-expression matrix constructed from assimilating all other samples in the data. BONOBO yields a closed-form solution for the posterior distribution of the sample-specific co-expression matrices by combining the sample-specific gene expression with the prior distribution. We demonstrate the advantages of BONOBO using several simulated and real datasets. BONOBO is computationally scalable and available as open-source software through the Network Zoo package (from netZooPy v0.10.0; netzoo.github.io). A preprint associated with this abstract can be found on bioRxiv (doi: 10.1101/2023.11.16.567119v1).

Enakshi Saha, Viola Fanfani, Panagiotis Mandros, Marouen Ben-Guebila, Jonas Fischer, Katherine H. Shutta, Kimberly Glass, Dawn L. DeMeo, Camila M. Lopes-Ramos, John Quackenbush
regLM: Designing Realistic Regulatory DNA with Autoregressive Language Models

We present regLM, a framework to design synthetic CREs with desired properties, such as high, low or cell type-specific activity, using autoregressive language models in conjunction with supervised sequence-to-function models. Using regLM, we designed synthetic yeast promoters of defined strength, as well as cell type-specific human enhancers. We show that the synthetic CREs generated by regLM contain biological features similar to experimentally validated CREs.

Avantika Lal, David Garfield, Tommaso Biancalani, Gokcen Eraslan
DexDesign: A New OSPREY-Based Algorithm for Designing de novo D-peptide Inhibitors

D-peptide inhibitors offer unique advantages as therapeutics, including increased metabolic stability and low immunogenicity. We introduce DexDesign, an OSPREY-based algorithm for computationally designing de novo D-peptide inhibitors, and use it to design inhibitors of two biomedically important PDZ domains. Novel techniques enabling exponential reductions in peptide search space are presented: Minimum Flexible Set, Inverse Alanine Scans, and K*-based Mutational Scans. Designed D-peptide inhibitors are predicted to significantly improve binding affinity (KD) over the endogenous ligand.

Nathan Guerin, Henry Childs, Pei Zhou, Bruce R. Donald
Memory-Bound and Taxonomy-Aware K-Mer Selection for Ultra-Large Reference Libraries

Classifying sequencing reads based on $$k$$ k -mer matches to a reference library is widely used in applications such as taxonomic profiling. Given the ever-increasing number of genomes publicly available, it is increasingly impossible to keep all or a majority of their $$k$$ k -mers in memory. Thus, there is a growing need for methods for selecting a subset of $$k$$ k -mers while accounting for taxonomic relationships. We propose $$k$$ k -mer RANKer (KRANK), a method that uses a set of heuristics to efficiently and effectively select a size-constrained subset of $$k$$ k -mers from a diverse and imbalanced taxonomy that suffers biased sampling. Empirical evaluations demonstrate that a fraction of all $$k$$ k -mers in large reference libraries can achieve comparable accuracy to the full set.

Ali Osman Berk Şapcı, Siavash Mirarab
SpaCeNet: Spatial Cellular Networks from Omics Data

We propose SpaCeNet as a method for analyzing patterns of correlation in spatial transcriptomics data. SpaCeNet extends the concept of conditional independence to spatially distributed information, and disentangles conditional independence relations between inter- and intra-cellular variables. Moreover, SpaCeNet addresses the various length scales over which intercellular communication occurs via flexible interaction potentials in combination with appropriate regularization strategies.

Stefan Schrod, Niklas Lück, Robert Lohmayer, Stefan Solbrig, Tina Wipfler, Katherine H. Shutta, Marouen Ben Guebila, Andreas Schäfer, Tim Beißbarth, Helena U. Zacharias, Peter J. Oefner, John Quackenbush, Michael Altenbuchinger
Discovering and Overcoming the Bias in Neoantigen Identification by Unified Machine Learning Models

Neoantigens, formed by genetic mutations in tumor cells, are abnormal peptides that can trigger immune responses. Precisely identifying neoantigens from vast mutations is the key to tumor immunotherapy design. There are three main steps in the neoantigen immune process, i.e., binding with MHCs, extracellular presentation, and induction of immunogenicity. Various machine learning methods have been developed to predict the probability of one of the three events, but the overall accuracy of neoantigen identification remains far from satisfactory. To gain a systematic understanding of the key factors of neoantigen identification, we developed a unified transformer-based machine learning framework ImmuBPI that comprised three tasks and achieved state-of-the-art performance. Through cross-task model interpretation, we have discovered an underestimation of data bias for immunogenicity prediction, which has led to skewed discriminatory boundaries of current machine learning models. We designed a mutual information-based debiasing strategy that performed well on mutation variants immunogenicity prediction, a task where current methods fell short. Clustering immunogenic peptides with debiased representations uncovers unique preferences for biophysical properties, such as hydrophobicity and polarity. These observations serve as an important complement to the past understanding that accurately predicting neoantigen is constrained by limited data, highlighting the necessity of bias control. We expect this study will provide novel and insightful perspectives for neoantigen prediction methods and benefit future neoantigen-mediated immunotherapy designs.

Ziting Zhang, Wenxu Wu, Lei Wei, Xiaowo Wang
MaSk-LMM: A Matrix Sketching Framework for Linear Mixed Models in Association Studies

Linear mixed models have been widely used in genome-wide association studies to control for population stratification and cryptic relatedness. Unfortunately, estimating LMM parameters is computationally expensive, necessitating large-scale matrix operations to build the genetic relatedness matrix. Randomized Linear Algebra has provided alternative approaches to such matrix operations by leveraging matrix sketching, which often results in provably accurate fast and efficient approximations. We leverage matrix sketching to develop a fast and efficient LMM method called Matrix-Sketching LMM (MaSk-LMM) by sketching the genotype matrix to reduce its dimensions and speed up computations. Our framework provides theoretical guarantees and a strong empirical performance compared to current methods.

Myson Burch, Aritra Bose, Gregory Dexter, Laxmi Parida, Petros Drineas
Community Structure and Temporal Dynamics of Viral Epistatic Networks Allow for Early Detection of Emerging Variants with Altered Phenotypes

In this study, we demonstrated that SARS-CoV-2 emerging variants can be detected or predicted by examining the community structure of viral coordinated substitution networks. These variants can be linked to dense network communities, which become discernible earlier than their associated viral variants reach noticeable prevalence levels. From these insights, we developed HELEN (Heralding Emerging Lineages in Epistatic Networks), a computational framework that identifies densely connected communities of SAV alleles and merges them into haplotypes using a combination of statistical inference, population genetics, and discrete optimization techniques. Our methodology can be employed to detect emerging and circulating strains of any highly mutable pathogen with adequate genomic surveillance data, while offering greater scalability than phylogenetic lineage tracing methods.

Fatemeh Mohebbi, Alexander Zelikovsky, Serghei Mangul, Gerardo Chowell, Pavel Skums
Maximum Likelihood Inference of Time-Scaled Cell Lineage Trees with Mixed-Type Missing Data

Recent dynamic lineage tracing technologies combine CRISPR-based genome editing with single-cell sequencing to track cell divisions during development. A key problem in lineage tracing is to infer a cell lineage tree from the measured CRISPR-induced mutations. Several features of lineage tracing data distinguish this problem from standard phylogenetic tree inference: CRISPR-induced mutations are non-modifiable and can result in distinct sets of possible mutations at each target site; the number of mutations decreases over time due to non-modifiability; and CRISPR-based genome-editing and single-cell sequencing results in high rates of both heritable and non-heritable (dropout) missing data. To model these features, we introduce the Probabilistic Mixed-type Missing (PMM) model. We describe an algorithm, LAML (Lineage Analysis via Maximum Likelihood), to compute a maximum likelihood tree under the PMM model. LAML combines an Expectation Maximization (EM) algorithm with a heuristic tree search to jointly estimate tree topology, branch lengths and missing data parameters.

Uyen Mai, Gillian Chu, Benjamin J. Raphael
TRIBAL: Tree Inference of B Cell Clonal Lineages

B cells are a critical component of the adaptive immune system. Single cell RNA-sequencing (scRNA-seq) has allowed for both profiling of B cell receptor (BCR) sequences and gene expression. However, understanding the adaptive and evolutionary mechanisms of B cells in response to specific stimuli remains a significant challenge in the field of immunology. We introduce a new method, TRIBAL, which aims to infer the evolutionary history of clonally related B cells from scRNA-seq data. The key insight of TRIBAL is that inclusion of isotype data into the B cell lineage inference problem is valuable for reducing phylogenetic uncertainty that arises when only considering the receptor sequences. Consequently, the TRIBAL inferred B cell lineage trees jointly capture the somatic mutations introduced to the B cell receptor during affinity maturation and isotype transitions during class switch recombination. In addition, TRIBAL infers isotype transition probabilities that are valuable for gaining insight into the dynamics of class switching. Via in silico experiments, we demonstrate that TRIBAL infers isotype transition probabilities with the ability to distinguish between direct versus sequential switching in a B cell population. This results in more accurate B cell lineage trees and corresponding ancestral sequence and class switch reconstruction compared to competing methods. Using real-world scRNA-seq datasets, we show that TRIBAL recapitulates expected biological trends in a model affinity maturation system. Furthermore, the B cell lineage trees inferred by TRIBAL were equally plausible for the BCR sequences as those inferred by competing methods but yielded lower entropic partitions for the isotypes of the sequenced B cell. Thus, our method holds the potential to further advance our understanding of vaccine responses, disease progression, and the identification of therapeutic antibodies.

Leah L. Weber, Derek Reiman, Mrinmoy S. Roddur, Yuanyuan Qi, Mohammed El-Kebir, Aly A. Khan
Mapping the Topography of Spatial Gene Expression with Interpretable Deep Learning

Spatially resolved transcriptomics technologies provide high-throughput measurements of gene expression in a tissue slice, but the sparsity of this data complicates the analysis of spatial gene expression patterns. We address this issue by deriving a topographic map of a tissue slice—analogous to a map of elevation in a landscape—using a novel quantity called the isodepth. Contours of constant isodepth enclose spatial domains with distinct cell type composition, while gradients of the isodepth indicate spatial directions of maximum change in gene expression. We develop GASTON, an unsupervised and interpretable deep learning algorithm that simultaneously learns the isodepth, spatial gene expression gradients, and piecewise linear functions of the isodepth. GASTON models both continuous gradients and discontinuous spatial variation in the expression of individual genes. We show that GASTON accurately identifies spatial domains and marker genes in multiple SRT datasets.

Uthsav Chitra, Brian J. Arnold, Hirak Sarkar, Cong Ma, Sereno Lopez-Darwin, Kohei Sanno, Benjamin J. Raphael
GraSSRep: Graph-Based Self-supervised Learning for Repeat Detection in Metagenomic Assembly

Repetitive DNA (repeats) poses significant challenges for accurate and efficient genome assembly and sequence alignment. This is particularly true for metagenomic data, where genome dynamics such as horizontal gene transfer, gene duplication, and gene loss/gain complicate accurate genome assembly from metagenomic communities. Detecting repeats is a crucial first step in overcoming these challenges. To address this issue, we propose GraSSRep, a novel approach that leverages the assembly graph’s structure through graph neural networks (GNNs) within a self-supervised learning framework to classify DNA sequences into repetitive and non-repetitive categories. Specifically, we frame this problem as a node classification task within a metagenomic assembly graph. In a self-supervised fashion, we rely on a high-precision (but low-recall) heuristic to generate pseudo-labels for a small proportion of the nodes. We then use those pseudo-labels to train a GNN embedding and a random forest classifier to propagate the labels to the remaining nodes. In this way, GraSSRep combines sequencing features with pre-defined and learned graph features to achieve state-of-the-art performance in repeat detection. We evaluate our method using simulated and synthetic metagenomic datasets. The results on the simulated data highlight our GraSSRep’s robustness to repeat attributes, demonstrating its effectiveness in handling the complexity of repeated sequences. Additionally, our experiments with synthetic metagenomic datasets reveal that incorporating the graph structure and the GNN enhances our detection performance. Finally, in comparative analyses, GraSSRep outperforms existing repeat detection tools with respect to precision and recall.

Ali Azizpour, Advait Balaji, Todd J. Treangen, Santiago Segarra
PRS-Net: Interpretable Polygenic Risk Scores via Geometric Learning

Polygenic risk score (PRS) serves as a valuable tool for predicting the genetic risk of complex human diseases for individuals, playing a pivotal role in advancing precision medicine. Traditional PRS methods, predominantly following a linear structure, often fall short in capturing the intricate relationships between genotype and phenotype. We present PRS-Net, an interpretable deep learning-based framework designed to effectively model the nonlinearity of biological systems for enhanced disease prediction and biological discovery. PRS-Net begins by deconvoluting the genome-wide PRS at the single-gene resolution, and then it encapsulates gene-gene interactions for genetic risk prediction leveraging a graph neural network, thereby enabling the characterization of biological nonlinearity underlying complex diseases. An attentive readout module is specifically introduced into the framework to facilitate model interpretation and biological discovery. Through extensive tests across multiple complex diseases, PRS-Net consistently outperforms baseline PRS methods, showcasing its superior performance on disease prediction. Moreover, the interpretability of PRS-Net has been demonstrated by the identification of genes and gene-gene interactions that significantly influence the risk of Alzheimer’s disease and multiple sclerosis. In summary, PRS-Net provides a potent tool for parallel genetic risk prediction and biological discovery for complex diseases.

Han Li, Jianyang Zeng, Michael P. Snyder, Sai Zhang
Haplotype-Aware Sequence Alignment to Pangenome Graphs

Modern pangenome graphs are built using haplotype-resolved genome assemblies. While mapping reads to a pangenome graph, prioritizing alignments that are consistent with the known haplotypes has been shown to improve genotyping accuracy. However, the existing rigorous formulations for sequence-to-graph co-linear chaining and alignment problems do not consider the haplotype paths in a pangenome graph. This often leads to spurious read alignments to those paths that are unlikely recombinations of the known haplotypes.We present novel formulations and algorithms for haplotype-aware sequence alignment to directed acyclic graphs (DAGs). We consider both sequence-to-DAG chaining and sequence-to-DAG alignment problems. Drawing inspiration from the commonly used models for genotype imputation, we assume that a query sequence is an imperfect mosaic of the reference haplotypes. Accordingly, we extend previous chaining and alignment formulations by introducing a recombination penalty for a haplotype switch. First, we solve haplotype-aware sequence-to-DAG alignment in $$O(|Q||E||\mathcal {H}|)$$ O ( | Q | | E | | H | ) time where Q is the query sequence, E is the set of edges, and $$\mathcal {H}$$ H is the set of haplotypes represented in the graph. To complement our solution, we prove that an algorithm significantly faster than $$O(|Q||E||\mathcal {H}|)$$ O ( | Q | | E | | H | ) is impossible under the Strong Exponential Time Hypothesis (SETH). Second, we propose a haplotype-aware chaining algorithm that runs in $$O(|\mathcal {H}|N \log {|\mathcal {H}|N})$$ O ( | H | N log | H | N ) time after graph preprocessing, where N is the count of input anchors. We then establish that a chaining algorithm significantly faster than $$O(|\mathcal {H}|N)$$ O ( | H | N ) is impossible under SETH. As a proof-of-concept of our algorithmic solutions, we implemented the chaining algorithm in the Minichain aligner ( https://github.com/at-cg/minichain ). We demonstrate the advantage of the algorithm by aligning sequences sampled from human major histocompatibility complex (MHC) to a pangenome graph of 60 MHC haplotypes. The proposed algorithm offers better consistency with ground-truth recombinations when compared to a haplotype-agnostic algorithm.

Ghanshyam Chandra, Daniel Gibney, Chirag Jain
Disease Risk Predictions with Differentiable Mendelian Randomization

Predicting future disease onset is crucial in preventive healthcare, yet longitudinal datasets linking early risk factors to subsequent health outcomes are scarce. To address this challenge, we introduce Differentiable Mendelian Randomization (DMR), an extension of the classical Mendelian Randomization framework for disease risk predictions without longitudinal data. To do so, DMR leverages risk factors and genetic profiles from a healthy cohort, along with results from genome-wide association studies (GWAS) of diseases of interest. In this work, we describe the DMR framework and confirm its reliability and effectiveness in simulations and an application to a type 2 diabetes (T2D) cohort.

Ludwig Gräf, Daniel Sens, Liubov Shilova, Francesco Paolo Casale
DIISCO: A Bayesian Framework for Inferring Dynamic Intercellular Interactions from Time-Series Single-Cell Data

Characterizing cell-cell communication and tracking its variability over time is essential for understanding the coordination of biological processes mediating normal development, progression of disease, or responses to perturbations such as therapies. Existing tools lack the ability to capture time-dependent intercellular interactions, such as those influenced by therapy, and primarily rely on existing databases compiled from limited contexts. We present DIISCO, a Bayesian framework for characterizing the temporal dynamics of cellular interactions using single-cell RNA-sequencing data from multiple time points. Our method uses structured Gaussian process regression to unveil time-resolved interactions among diverse cell types according to their co-evolution and incorporates prior knowledge of receptor-ligand complexes. We show the interpretability of DIISCO in new data collected from CAR-T cells co-cultured with lymphoma cells, demonstrating its potential to uncover dynamic cell-cell crosstalk.Availability: DIISCO is publicly accessible at https://github.com/azizilab/DIISCO_public . All data will be deposited to GEO upon publication.

Cameron Park, Shouvik Mani, Nicolas Beltran-Velez, Katie Maurer, Satyen Gohil, Shuqiang Li, Teddy Huang, David A. Knowles, Catherine J. Wu, Elham Azizi
Enhancing Gene Set Analysis in Embedding Spaces: A Novel Best-Match Approach

Embedding techniques have become valuable strategies for extracting crucial information from high-dimensional data and transforming it into more interpretable lower-dimensional spaces. In biology, embeddings are frequently used to capture a variety of functional relationships between genes to encode individual genes in a compact latent space. Genes, however, do not function in isolation but in coordinated gene sets where groups of proteins form complexes, function in pathways, or, more simply, have a localized set of possible interactions. Gene embeddings have been used mostly for downstream machine learning tasks, or, at best, comparisons between pairs of genes. There has been limited methodological development towards comparing gene sets in embedding spaces. Here, we propose a new method, ANDES, that compares how two gene sets are related in gene embedding spaces. ANDES uses a novel best-match approach that considers gene similarity while reconciling gene set diversity. ANDES is a flexible framework that has wide-ranging potential, especially when combined with different types of embeddings.

Lechuan Li, Ruth Dannenfelser, Charlie Cruz, Vicky Yao
Prompt-Based Learning on Large Protein Language Models Improves Signal Peptide Prediction

Signal peptides (SP) play a crucial role in protein localization in cells. The development of large protein language models (PLMs) provides a new opportunity for SP prediction. We applied a prompt-based learning framework, Parameter-Efficient Fine-Tuning (PEFT) for SP prediction, PEFT-SP, to effectively utilize pre-trained PLMs. We integrated low-rank adaptation (LoRA) into ESM-2 models to better leverage the protein sequence evolutionary knowledge of PLMs. Experiments show that PEFT-SP using LoRA enhances state-of-the-art results, leading to a maximum MCC gain of 0.372 for SPs with small training samples and an overall MCC gain of 0.048. Furthermore, we also employed two other prompt-based learning methods, i.e., Prompt Tuning and Adapter Tuning, into ESM-2 for SP prediction. More elaborate experiments show that PEFT-SP using Adapter Tuning can also improve the state-of-the-art results with up to 0.202 MCC gain for SPs with small training samples and an overall MCC gain of 0.030. LoRA requires fewer computing resources and less memory than the Adapter during the training stage, making it possible to adapt larger and more powerful protein models for SP prediction. The PEFT-SP framework is available at https://github.com/shuaizengMU/PEFT-SP . The web server for SP predic-tion leveraging the PEFT-SP framework is publicly available at https://www.mu-loc.org/peftsp/ .

Shuai Zeng, Duolin Wang, Lei Jiang, Dong Xu
Decoil: Reconstructing Extrachromosomal DNA Structural Heterogeneity from Long-Read Sequencing Data

Circular extrachromosomal DNA (ecDNA) is a form of oncogene amplification found across cancer types and associated with poor outcome in patients. EcDNA can be structurally complex and contain rearranged DNA sequences derived from multiple chromosome locations. As the structure of ecDNA can impact oncogene regulation and may indicate mechanisms of its formation, disentangling it at high resolution from sequencing data is essential. Even though methods have been developed to identify and reconstruct ecDNA in cancer genome sequencing, it remains challenging to resolve complex ecDNA structures, in particular amplicons with shared genomic footprints. We here introduce Decoil, a computational method which combines a breakpoint-graph approach with LASSO regression to reconstruct complex ecDNA and deconvolve co-occurring ecDNA elements with overlapping genomic footprints from long-read nanopore sequencing. Decoil outperforms de-novo assembly and alignment-based methods in simulated long-read sequencing data for both simple and complex ecDNAs. Applying Decoil on whole genome sequencing data uncovered different ecDNA topologies and explored ecDNA structure heterogeneity in neuroblastoma tumors and cell lines, indicating that this method may improve ecDNA structural analyzes in cancer.

Mădălina Giurgiu, Nadine Wittstruck, Elias Rodriguez-Fos, Rocío Chamorro González, Lotte Brückner, Annabell Krienelke-Szymansky, Konstantin Helmsauer, Anne Hartebrodt, Philipp Euskirchen, Richard P. Koche, Kerstin Haase, Knut Reinert, Anton G. Henssen
Privacy Preserving Epigenetic PaceMaker: Stronger Privacy and Improved Efficiency

DNA methylation data plays a crucial role in estimating chronological age in mammals, offering real-time insights into an individual’s aging process. The Epigenetic Pacemaker (EPM) model allows inference of the epigenetic age as deviations from the population trend. Given the sensitivity of this data, it is essential to safeguard both inputs and outputs of the EPM model. In a recent study by Goldenberg et al., a privacy-preserving approach for EPM computation was introduced, utilizing Fully Homomorphic Encryption (FHE). However, their method had limitations, including having high communication complexity and being impractical for large datasets. Our work presents a new privacy preserving protocol for EPM computation, improving both privacy and complexity. Notably, we employ a single server for the secure computation phase while ensuring privacy even in the event of server corruption (compared to requiring two non-colluding servers in Goldenberg et al.). Using techniques from symbolic algebra and number theory, the new protocol eliminates the need for communication during the secure computing phase, significantly improves asymptotic runtime, and offers better compatibility to parallel computing for further time complexity reduction. We have implemented our protocol, demonstrating its ability to produce results similar to the standard (insecure) EPM model with substantial performance improvement compared to Goldenberg et al. These findings hold promise for enhancing data security in medical applications where personal privacy is paramount. The generality of both the new approach and the EPM, suggests that this protocol may be useful to other uses employing similar expectation maximization techniques.

Meir Goldenberg, Loay Mualem, Amit Shahar, Sagi Snir, Adi Akavia
Mapping Cell Fate Transition in Space and Time

Cell fate transition is fundamentally a spatiotemporal process, but previous work has largely neglected the spatial dimension. Incorporating both space and time into models of cell fate transition would be a key step toward characterizing how interactions among neighboring cells, the presence of local niche factors, and physical migration of cells contribute to tissue development. To realize this potential, we propose a model for jointly inferring spatial and temporal dynamics of cell fate transition from spatial transcriptomic data. Our approach extends the RNA velocity framework to model single-cell gene expression dynamics of an entire tissue with spatially coupled differential equations. Our principled probabilistic approach enables the incorporation of time point labels and multiple slices. We further introduce the idea of cell velocity, which is defined as the physical direction of cell maturation and migration. Simulated data analysis indicates that incorporating spatial coordinates significantly improves the accuracy of velocity and time inference. Our work introduces a new dimension into the study of cell fate transitions and lays a foundation for modeling the collective dynamics of cells comprising an entire tissue. The full paper is at https://www.biorxiv.org/content/10.1101/2024.02.12.579941v1 .

Yichen Gu, Jialin Liu, Chen Li, Joshua D. Welch
Protein Domain Embeddings for Fast and Accurate Similarity Search

Recently developed protein language models have enabled a variety of applications of the protein contextual embeddings. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins, however limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins of single domains but not multi-domain proteins. Here we propose a novel approach that first segments proteins into domains and then applies discrete cosine transformation to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach called DCTdomain utilizes predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in quadratic time to the protein length. We showed such domain-level contextual vectors (termed as DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities.

Benjamin Giovanni Iovino, Haixu Tang, Yuzhen Ye
Processing-Bias Correction with DEBIAS-M Improves Cross-Study Generalization of Microbiome-Based Prediction Models

Microbiome profiling exhibits strong study- and batch-specific effects, impeding the identification of signals that are reproducible across studies and the development of generalizable prediction models. Prior work has attributed this to biases introduced during experimental protocols [1], with factors such as the type of DNA extraction kit affecting the efficiency of extracting and sequencing different microbes [2, 3]. While existing batch-correction methods show benefit in microbiome analysis [4–7], many make strong parametric assumptions, which do not necessarily apply in this data, or require the use of the outcome variable, which risks overfitting [8]. Lastly and importantly, the transformations performed to the data are largely non-interpretable, e.g., introducing counts to features that were initially very sparse. Here, we present DEBIAS-M (Domain adaptation with phenotype Estimation and Batch Integration Across Studies of the Microbiome), an interpretable framework for processing-bias inference, batch correction, and domain adaptation in microbiome studies. DEBIAS-M learns bias-correction factors for each microbe in each batch that simultaneously minimize batch effects and maximize cross-study associations with phenotypes. Using benchmarks, including HIV classification from gut microbiome data, we demonstrate that DEBIAS-M outperforms alternative batch-correction methods commonly used in the field. Overall, we show that DEBIAS-M facilitates better modeling of microbiome data and identification of signals that are reproducible across studies.

George I. Austin, Aya Brown Kav, Heekuk Park, Jana Biermann, Anne-Catrin Uhlemann, Tal Korem
VICTree - A Variational Inference Method for Clonal Tree Reconstruction

Clonal tree inference brings crucial insights to the analysis of tumor heterogeneity and cancer evolution. Recent progress in single cell sequencing has prompted a demand for more advanced probabilistic models of copy number evolution, coupled with inference methods which can account for the noisy nature of the data along with dependencies between adjacent sites in copy number profiles. We present VICTree, a variational inference based algorithm for joint Bayesian inference of clonal trees, together with a novel Tree-structured Mixture Hidden Markov Model (TSMHMM) which combines HMMs related through a tree with a mixture model. For the tree inference, we introduce a new algorithm, LARS, for sampling directed labeled multifurcating trees. To evaluate our proposed method, we conduct experiments on simulated data and on samples of multiple myeloma and breast cancer. We demonstrate VICTree’s capacity for reliable clustering, clonal tree reconstruction, copy number evolution and the utility of the ELBO for model selection. Lastly, VICTree’s results are compared in terms of quality and speed of inference to other state-of-the-art methods. The code for VICTree is available on GitHub: github.com/Lagergren-Lab/victree and the full paper on bioRxiv .

Harald Melin, Vittorio Zampinetti, Andrew McPherson, Jens Lagergren
DeST-OT: Alignment of Spatiotemporal Transcriptomics Data

Spatially resolved transcriptomics (SRT) measures mRNA transcripts at thousands of locations within a tissue slice, revealing spatial variations in gene expression as well as the spatial distribution of cell types. In recent studies, SRT has been applied to tissue slices from multiple timepoints during the development of an organism. Alignment of this spatiotemporal transcriptomics data can provide insights into the gene expression programs governing the growth and differentiation of cells over space and time. We introduce DeST-OT (Developmental SpatioTemporal Optimal Transport), a method to align SRT slices from pairs of developmental timepoints using the framework of optimal transport (OT). DeST-OT uses semi-relaxed optimal transport to precisely model cellular growth, death, and differentiation processes that are not well-modeled by existing alignment methods. We further introduce two metrics to quantify the plausibility of a spatiotemporal alignment: a growth distortion metric which quantifies the discrepancy between the inferred and the true cell type growth rates, and a migration metric which quantifies the distance traveled between ancestor and descendant cells.

Peter Halmos, Xinhao Liu, Julian Gold, Feng Chen, Li Ding, Benjamin J. Raphael
Determining Optimal Placement of Copy Number Aberration Impacted Single Nucleotide Variants in a Tumor Progression History

Intratumoral heterogeneity arises as a result of genetically distinct subclones emerging during tumor progression. These subclones are characterized by various types of somatic genomic aberrations, with single nucleotide variants (SNVs) and copy number aberrations (CNAs) being the most prominent. In this paper, we introduce DETOPT, a combinatorial optimization method for accurate tumor progression tree inference that places SNVs impacted by CNAs on trees of tumor progression with minimal distortion on their variant allele frequencies observed across available samples of a tumor. We show that on simulated data DETOPT provides more accurate tree placement of SNVs impacted by CNAs than the available alternatives. When applied to a set of multi-sample bulk exome-sequenced tumor metastases from a treatment-refractory, triple-positive metastatic breast cancer, DETOPT reports biologically plausible trees of tumor progression, identifying the tree placement of copy number state gains and losses impacting SNVs, including those in clinically significant genes.Full Text Preprint. https://www.biorxiv.org/content/10.1101/2024.03.10.584318v1 .

Chih Hao Wu, Suraj Joshi, Welles Robinson, Paul F. Robbins, Russell Schwartz, S. Cenk Sahinalp, Salem Malikić
Accurate Assembly of Circular RNAs with TERRACE

Circular RNA (circRNA) is a class of RNA molecules that forms a closed loop with its 5’ and 3’ ends covalently bonded. CircRNAs were severely overlooked previously owing to the biases in the RNA-seq protocols and in the detection algorithms, but recently gained tremendous attentions in both aspects. Most existing methods for assembling circRNAs heavily rely on the annotated transcriptomes, and hence exhibit unsatisfactory accuracy when a high-quality annotation is unavailable. Here we present TERRACE, a new algorithm for full-length assembly of circRNAs from paired-end total RNA-seq data. TERRACE is compared with leading circRNA detection methods on both simulations and biological datasets. Our method consistently outperforms by a large margin in sensitivity while maintaining better or comparable precision. In particular, when the annotations are not provided, TERRACE can assemble 123%–412% more correct circRNAs than state-of-the-art methods on human tissues. TERRACE presents a major leap on assembling full-length circRNAs from RNA-seq data, and we expect it to be widely used in the downstream research on circRNAs. TERRACE is freely available at https://github.com/Shao-Group/TERRACE . The full version of this manuscript is available at https://doi.org/10.1101/2024.02.09.579380 .

Tasfia Zahin, Qian Shi, Xiaofei Carl Zang, Mingfu Shao
Semi-supervised Learning While Controlling the FDR with an Application to Tandem Mass Spectrometry Analysis

Canonical procedures to control the false discovery rate (FDR) among the list of putative discoveries rely on our ability to compute informative p-values. Competition-based approach offers a fairly novel and increasingly popular alternative when computing such p-values is impractical. The popularity of this approach stems from its wide applicability: instead of computing p-values, which requires knowing the entire null distribution for each null hypothesis, a competition-based approach only requires a single draw from each such null distribution. This drawn example is known as a “decoy” in the mass spectrometry community (which was the first to adopt the competition approach) or as a “knockoff” in the statistics community. The decoy is competed with the original observation so that only the higher scoring of the two is retained. The number of decoy wins is subsequently used to estimate and control the FDR among the target wins.In this paper we offer a novel method to extend the competition-based approach to control the FDR while taking advantage of side information, i.e., additional features that can help us distinguish between correct and incorrect discoveries. Our motivation comes from the problem of peptide detection in tandem mass spectrometry proteomics data. Specifically, we recently showed that a popular mass spectrometry analysis software tool, Percolator, can apparently fail to control the FDR. We address this problem here by developing a general protocol called “RESET” that can take advantage of the additional features, such as the ones Percolator uses, while still theoretically and empirically controlling the FDR.

Jack Freestone, Lukas Käll, William Stafford Noble, Uri Keich
CoRAL Accurately Resolves Extrachromosomal DNA Genome Structures with Long-Read Sequencing

Extrachromosomal DNA (ecDNA) is a central mechanism of focal oncogene amplification in cancer and can drive tumor formation, evolution, and drug resistance. Elucidating the genomic architecture of ecDNA amplifications is critical for understanding tumor pathology and developing more effective therapies. Current short-read based methods can predict the ecDNA presence in cancer samples, but are limited in resolving complex and heterogeneous ecDNA structures. Here, we propose CoRAL, an algorithm for identifying and reconstructing ecDNA amplicon structures from long-reads. CoRAL takes mapped long-reads as input, builds a breakpoint graph for each focally amplified region, and extracts cycles from the breakpoint graph representing the ecDNA structures. Through extensive benchmarks on simulated data and previously-characterized cell lines, we report that CoRAL substantially improves breakpoint detection and reconstruction of complex ecDNA structures over the existing short and long-read based methods.Code Availability: https://github.com/AmpliconSuite/CoRAL .

Kaiyuan Zhu, Matthew G. Jones, Jens Luebeck, Xinxin Bu, Hyerim Yi, King L. Hung, Ivy Tsz-Lo Wong, Shu Zhang, Paul S. Mischel, Howard Y. Chang, Vineet Bafna
A Scalable Adaptive Quadratic Kernel Method for Interpretable Epistasis Analysis in Complex Traits

Our knowledge of the contribution of genetic interactions (epistasis) to variation in human complex traits remains limited, partly due to the lack of efficient, powerful, and interpretable algorithms to detect interactions. Recently proposed approaches for set-based association tests show promise in improving power to detect epistasis by examining the aggregated effects of multiple variants. Nevertheless, these methods either do not scale to large numbers of individuals available in Biobank datasets or do not provide interpretable results. We, therefore, propose QuadKAST, a scalable algorithm focused on testing pairwise interaction effects (also termed as quadratic effects) of a set of genetic variants on a trait and quantifying the proportion of phenotypic variance explained by these effects. We performed comprehensive simulations and demonstrated that QuadKAST is well-calibrated with good statistical power. We applied QuadKAST to 53 quantitative phenotypes measured in $$\approx 300,000$$ ≈ 300 , 000 unrelated white British individuals in the UK Biobank to test for quadratic effects within each of $$9,515$$ 9 , 515 protein-coding genes (after accounting for linear additive effects). We detected $$32$$ 32 trait-gene pairs across $$17$$ 17 traits that demonstrate statistically significant signals of quadratic effects ( $$p \le \frac{0.05}{9,515\times 53}$$ p ≤ 0.05 9 , 515 × 53 accounting for the number of genes and traits tested). Our method enables the detailed investigation of epistasis on a large scale, offering new insights into its role and importance.

Boyang Fu, Prateek Anand, Aakarsh Anand, Joel Mefford, Sriram Sankararaman
Optimal Tree Metric Matching Enables Phylogenomic Branch Length Estimation

Evolutionary histories are uncertain and heterogeneous across the genome. As a result, for any given set of species, we may obtain a large number of incongruent trees. The differences in these trees are not limited to topology as branch lengths are also uncertain and biologically heterogeneous. Yet, despite the biological significance of branch length differences, comparing and reconciling trees has predominantly focused on topology. To close this gap, we explore the problem of matching a query tree to a reference tree by assigning new branch lengths to the query tree. We formulate this objective as a least-squares optimization problem defined on the set of all pairwise distances. We prove that the problem is convex and thus can be solved optimally using standard tools. We also introduce dynamic programming algorithms to compute the required inputs to the optimization problem in quadratic time. We use this framework to estimate the branch lengths of a fixed species tree topology in the unit of the expected number of substitutions per site by matching it to gene trees that have branch lengths in the same unit.

Shayesteh Arasti, Puoya Tabaghi, Yasamin Tabatabaee, Siavash Mirarab
Inferring Allele-Specific Copy Number Aberrations and Tumor Phylogeography from Spatially Resolved Transcriptomics

A key challenge in cancer research is to reconstruct the somatic evolution within a tumor over time and across space. Spatially resolved transcriptomics (SRT) measures gene expression at thousands of spatial locations in a tumor, but does not directly reveal genetic aberrations. We introduce CalicoST, an algorithm to simultaneously infer allele-specific copy number aberrations (CNAs) and a spatial model of tumor evolution from SRT of tumor slices. By modeling CNA-induced perturbations in both total and allele-specific gene expression, CalicoST identifies important types of CNAs - including copy-neutral loss of heterozygosity (CNLOH) and mirrored subclonal CNAs- that are invisible to total copy number analysis. CalicoST achieves high accuracy by modeling both correlations in space with a Hidden Markov Random Field and across genomic segments with a Hidden Markov Model.

Cong Ma, Metin Balaban, Jingxian Liu, Siqi Chen, Li Ding, Benjamin J. Raphael
Contrastive Fitness Learning: Reprogramming Protein Language Models for Low-N Learning of Protein Fitness Landscape

Machine learning (ML) is revolutionizing our ability to model the fitness landscape of protein sequences. Recently, the protein language model (pLM) has become the foundation of state-of-the-art ML solutions for many problems in protein biology. However, significant challenges remain in leveraging pLMs for protein fitness prediction, in part due to the disparity between the scarce number of sequences functionally characterized by high-throughput assays and the massive data samples required for training large pLMs. To bridge this gap, we introduce Contrastive Fitness Learning (ConFit), a pLM-based ML method for learning the protein fitness landscape with limited (low-N) experimental fitness measurements as training data (Fig. 1). We propose a novel contrastive learning strategy to fine-tune the pre-trained pLM, tailoring it to achieve protein-specific fitness prediction while avoiding overfitting.

Junming Zhao, Chao Zhang, Yunan Luo
Scalable Summary Statistics-Based Heritability Estimation Method with Individual Genotype Level Accuracy

SNP heritability, the proportion of phenotypic variation explained by genotyped SNPs, is an important parameter in understanding the genetic architecture underlying various diseases and traits. Methods that aim to estimate SNP heritability from individual genotype and phenotype data are limited by their ability to scale to Biobank-scale datasets and by the restrictions in access to individual-level data. These limitations have motivated the development of methods that only require summary statistics. While the availability of publicly accessible summary statistics makes them widely applicable, these methods lack the accuracy of methods that utilize individual genotypes.Here we present a SUMmary statistics-based Randomized Haseman-Elston regression (SUM-RHE), a method that can estimate the SNP heritability of complex phenotypes with accuracies comparable to approaches that require individual genotypes, while exclusively relying on summary statistics. SUM-RHE employs Genome-Wide Association Study (GWAS) summary statistics and statistics obtained on a reference population, which can be efficiently estimated and readily shared for public use. Our results demonstrate that SUM-RHE obtains estimates of SNP heritability that are substantially more accurate compared to other summary statistic methods and on par with methods that rely on individual-level data.

Moonseong Jeong, Ali Pazokitoroudi, Zhengtong Liu, Sriram Sankararaman
scMulan: A Multitask Generative Pre-Trained Language Model for Single-Cell Analysis

Gene expression could be perceived as a form of “cell language”, with underlying regulatory mechanisms akin to biological grammar. Decoding this language is critical in understanding cellular functions and behaviors. In this study, we proposed a new pre-training paradigm by integrating rich metadata and pre-training tasks, and developed scMulan, a multitask generative pre-trained language model for single-cell analyses. scMulan can accomplish multiple tasks in zero-shot manner such as cell-type annotation, batch integration, and conditional cell generation, guided by different task prompts. scMulan is also ready to be expanded for novel tasks through fine-tuning.

Haiyang Bian, Yixin Chen, Xiaomin Dong, Chen Li, Minsheng Hao, Sijie Chen, Jinyi Hu, Maosong Sun, Lei Wei, Xuegong Zhang
Backmatter
Metadaten
Titel
Research in Computational Molecular Biology
herausgegeben von
Jian Ma
Copyright-Jahr
2024
Electronic ISBN
978-1-0716-3989-4
Print ISBN
978-1-0716-3988-7
DOI
https://doi.org/10.1007/978-1-0716-3989-4

Premium Partner