dRFEtools: Dynamic recursive feature elimination for omics
Abstract
Motivation: Advances in technology have generated larger omics datasets with potential applications for machine learning. In many datasets, however, cost and limited sample availability result in far more features than observations. Moreover, biological processes are associated with networks of core and peripheral genes, while traditional feature selection approaches capture only core genes.
Results: To overcome these limitations, we present dRFEtools, which implements dynamic recursive feature elimination (RFE), reducing computational time with high accuracy compared to standard RFE, expanding dynamic RFE to regression algorithms, and outputting the subsets of features that hold predictive power with and without peripheral features. dRFEtools integrates with scikit-learn (the popular Python machine learning platform) and thus provides new opportunities for dynamic RFE in large-scale omics data while enhancing its interpretability.
Availability: dRFEtools is freely available on PyPI at https://pypi.org/project/drfetools/ or on GitHub at https://github.com/LieberInstitute/dRFEtools, implemented in Python 3, and supported on Linux, Windows, and macOS.
Supplementary information: Supplementary data are available at Bioinformatics online and https://github.com/LieberInstitute/dRFEtools_manuscript.
Categories: Bioinformatics Trends
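The dRFEtools abstract does not spell out what makes the elimination "dynamic". The sketch below assumes it means dropping a fixed fraction of the remaining features at each round instead of a fixed count, so the number of model refits grows roughly logarithmically with the feature count. The names `next_size`, `dynamic_rfe`, `importance_fn`, and `elim_rate` are hypothetical and are not taken from the dRFEtools API.

```python
import math


def next_size(n, elim_rate, min_features=1):
    """Number of features kept after one round: drop a fraction, at least one."""
    n_drop = max(1, math.floor(n * elim_rate))
    return max(min_features, n - n_drop)


def dynamic_rfe(feature_names, importance_fn, elim_rate=0.2, min_features=1):
    """Fraction-based recursive feature elimination (illustrative only).

    importance_fn(kept) must return a dict mapping each kept feature to an
    importance score; in practice it would refit a model on the kept features.
    Returns the list of feature subsets, one per round, largest first.
    """
    if not 0 < elim_rate < 1:
        raise ValueError("elim_rate must be in (0, 1)")
    kept = list(feature_names)
    history = [list(kept)]
    while len(kept) > min_features:
        scores = importance_fn(kept)  # re-rank using only the surviving features
        n_keep = next_size(len(kept), elim_rate, min_features)
        kept = sorted(kept, key=lambda f: scores[f], reverse=True)[:n_keep]
        history.append(list(kept))
    return history
```

With `elim_rate=0.2`, ten features shrink through subsets of size 10, 8, 7, 6, 5, 4, 3, 2, 1, whereas removing one feature per round would need nine refits for ten features and about 20,000 for a 20,000-gene expression matrix.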
Joint embedding of biological networks for cross-species functional alignment
Abstract
Motivation: Model organisms are widely used to better understand the molecular causes of human disease. While sequence similarity greatly aids this transfer, sequence similarity does not imply functional similarity, and thus, several current approaches incorporate protein-protein interactions (PPIs) to help map findings between species. Existing transfer methods either formulate the alignment problem as a matching problem which pits network features against known orthology, or more recently, as a joint embedding problem.
Results: We propose a novel state-of-the-art joint embedding solution: Embeddings to Network Alignment (ETNA). ETNA generates individual network embeddings based on network topological structures and then uses a Natural Language Processing-inspired cross-training approach to align the two embeddings using sequence-based orthologs. The final embedding preserves both within and between species gene functional relationships, and we demonstrate that it captures both pairwise and group functional relevance. In addition, ETNA’s embeddings can be used to transfer genetic interactions across species and identify phenotypic alignments, laying the groundwork for potential opportunities for drug repurposing and translational studies.
Availability: https://github.com/ylaboratory/ETNA
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
VAPEX: an interactive web server for the deep exploration of natural virus and phage genomes
Abstract
Motivation: Studying the genetic makeup of viruses and phages through genome analysis is crucial for comprehending their function in causing diseases, advancing medicine, tracing their evolutionary history, monitoring the environment, and creating innovative biotechnologies. However, accessing the necessary data can be challenging due to a lack of dedicated comparative genomic tools and because viral and phage databases are often outdated. Moreover, many wet bench experimentalists may not have the computational proficiency required to manipulate large amounts of genomic data.
Results: We have developed VAPEX (Virus And Phage EXplorer), a web server which is supported by a database and features a user-friendly web interface. This tool enables users to easily perform various genomic analysis queries on all natural viruses and phages that have been fully sequenced and are listed in the NCBI compendium. One of VAPEX's key strengths is producing visual depictions of fully resolved synteny maps. VAPEX can display a vast array of orthologous gene classes simultaneously through the use of symbolic representation. Additionally, VAPEX can fully analyze user-submitted viral and phage genomes, including those that have not yet been annotated.
Availability and implementation: VAPEX can be accessed from all current web browsers such as Chrome, Firefox, Edge, Safari and Opera. VAPEX is freely accessible at https://archaea.i2bc.paris-saclay.fr/vapex/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Somatic mutation effects diffused over microRNA dysregulation
Abstract
Motivation: As an important player in transcriptome regulation, microRNAs may effectively diffuse somatic mutation impacts to broad cellular processes and ultimately manifest disease and dictate prognosis. Previous studies that tried to correlate mutation with gene expression dysregulation neglected to adjust for the disparate multitudes of false positives associated with unequal sample sizes and uneven class balancing scenarios.
Results: To properly address this issue, we developed a statistical framework to rigorously assess the extent of mutation impact on microRNAs in relation to a permutation-based null distribution of a matching sample structure. Carrying out the framework in a pan-cancer study, we ascertained 9008 protein-coding genes with statistically significant mutation impacts on miRNAs. Of these, the collective miRNA expression for 83 genes showed significant prognostic power in nine cancer types. For example, in lower-grade glioma, 10 genes’ mutations broadly impacted miRNAs, all of which showed prognostic value with the corresponding miRNA expression. Our framework was further validated with functional analysis and augmented with rich features including the ability to analyze miRNA isoforms; aggregative prognostic analysis; advanced annotations such as mutation type, regulator alteration, somatic motif, and disease association; and instructive visualization such as mutation OncoPrint, Ideogram, and interactive mRNA-miRNA network.
Availability: http://innovebioinfo.com/Database/TmiEx/MutMix.php
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
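The idea of a permutation-based null distribution with a matching sample structure can be illustrated with a minimal sketch. It assumes the test statistic is the difference in mean miRNA expression between mutated and non-mutated samples; the abstract does not name the statistic, so that choice, along with the function names, is purely illustrative. Shuffling the labels keeps the number of mutated and non-mutated samples fixed, so each permuted data set has the same class imbalance as the observed one.

```python
import random


def mean_difference(values, mutated):
    """Mean expression in mutated samples minus mean in non-mutated samples."""
    mut = [v for v, m in zip(values, mutated) if m]
    wt = [v for v, m in zip(values, mutated) if not m]
    return sum(mut) / len(mut) - sum(wt) / len(wt)


def permutation_p_value(values, mutated, n_perm=1000, seed=0):
    """Two-sided permutation p-value with group sizes preserved."""
    rng = random.Random(seed)  # seeded for reproducibility
    observed = abs(mean_difference(values, mutated))
    labels = list(mutated)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # same counts of True/False, new assignment
        if abs(mean_difference(values, labels)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

Because the null is rebuilt for each gene from its own sample structure, a gene mutated in 3 of 10 samples is compared only against other 3-versus-7 splits.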
Cell-connectivity-guided trajectory inference from single-cell data
Abstract
Motivation: Single-cell RNA-sequencing enables cell-level investigation of cell differentiation, which can be modelled using trajectory inference methods. While tremendous effort has been put into designing these methods, inferring accurate trajectories automatically remains difficult. Therefore, the standard approach involves testing different trajectory inference methods and picking the trajectory giving the most biologically sensible model. As the default parameters are often suboptimal, their tuning requires methodological expertise.
Results: We introduce Totem, an open-source, easy-to-use R package designed to facilitate inference of tree-shaped trajectories from single-cell data. Totem generates a large number of clustering results, estimates their topologies as minimum spanning trees, and uses them to measure the connectivity of the cells. Besides enabling automatic selection of an appropriate trajectory, cell connectivity makes it possible to visually pinpoint branching points and milestones relevant to the trajectory. Furthermore, testing different trajectories with Totem is fast, easy, and does not require in-depth methodological knowledge.
Availability: Totem is available as an R package at https://github.com/elolab/Totem.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
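One step named in the Totem abstract, estimating the topology of a clustering as a minimum spanning tree, can be sketched as follows. This is not Totem's code (Totem is an R package); it is a small Python illustration that assumes the tree is built over cluster centroids with Euclidean distance, which the abstract does not specify. How Totem turns many such trees into a cell connectivity measure is not described in the abstract and is not attempted here.

```python
import math


def centroids(points, labels):
    """Mean coordinate of each cluster; points are tuples, labels are cluster ids."""
    sums = {}
    for p, lab in zip(points, labels):
        acc = sums.setdefault(lab, [0, [0.0] * len(p)])
        acc[0] += 1
        acc[1] = [a + b for a, b in zip(acc[1], p)]
    return {lab: tuple(v / n for v in vec) for lab, (n, vec) in sums.items()}


def minimum_spanning_tree(nodes):
    """Prim's algorithm over a dict of label -> coordinates.

    Returns the tree as a sorted list of (label, label) edges.
    """
    labels = sorted(nodes)
    if not labels:
        return []
    in_tree = {labels[0]}
    edges = []
    while len(in_tree) < len(labels):
        # Cheapest edge from the current tree to a node outside it.
        a, b = min(
            ((u, v) for u in in_tree for v in labels if v not in in_tree),
            key=lambda e: math.dist(nodes[e[0]], nodes[e[1]]),
        )
        in_tree.add(b)
        edges.append(tuple(sorted((a, b))))
    return sorted(edges)
```

For three clusters whose centroids lie along a line, the tree links neighbouring clusters and leaves out the long edge between the two ends, which is the tree-shaped backbone a trajectory method would then orient.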
DEP2: an upgraded comprehensive analysis toolkit for quantitative proteomics data
Abstract
Summary: Mass spectrometry (MS)-based proteomics has become the most powerful approach to study the proteome of given biological and clinical samples. Advancements in sample preparation and MS detection have extended the application of proteomics, but have also brought new demands on data analysis. Appropriate proteomics data analysis workflow mainly requires quality control, hypothesis testing, functional mining, and visualization. Although there are numerous tools for each process, an efficient and universal tandem analysis toolkit to obtain a quick overall view of various proteomics data is still urgently needed. Here, we present DEP2, an updated version of DEP we previously established, for proteomics data analysis. We amended the analysis workflow by incorporating alternative approaches to accommodate diverse proteomics data, introducing peptide-protein summarization and coupling biological function exploration. In summary, DEP2 is a well-rounded toolkit designed for protein- and peptide-level quantitative proteomics data. It features a more flexible differential analysis workflow and includes a user-friendly Shiny application to facilitate data analysis.
Availability and implementation: DEP2 is available at https://github.com/mildpiggy/DEP2, released under the MIT license. For further information and usage details, please refer to the package website at https://mildpiggy.github.io/DEP2/.
Categories: Bioinformatics Trends
libSBOLj3: A graph-based library for design and data exchange in synthetic biology
Abstract
Summary: The Synthetic Biology Open Language version 3 data standard provides a graph-based approach to exchange information about biological designs. The new data model has major updates and offers several features for software tools. Here, we present libSBOLj3 to facilitate data exchange and provide interoperability between computer-aided design and automation tools using this standard. The library adopts a graph-based approach. Tool developers can extend these graphs with application-specific information and use detailed validation reports to identify errors and interoperability issues and apply best practice rules.
Availability and implementation: The libSBOLj3 library is implemented in Java and can be downloaded or used as a Maven dependency. The open-source project, code examples and documentation about accessing and using the library are available via GitHub at https://github.com/SynBioDex/libSBOLj3.
Categories: Bioinformatics Trends
Gonomics: Uniting high performance and readability for genomics with Go
Abstract
Summary: Many existing software libraries for genomics require researchers to pick between competing considerations: the performance of compiled languages and the accessibility of interpreted languages. Go, a modern compiled language, provides an opportunity to address this conflict. We introduce Gonomics, an open-source collection of command line programs and bioinformatic libraries implemented in Go that unites readability and performance for genomic analyses. Gonomics contains packages to read, write, and manipulate a wide array of file formats (e.g. FASTA, FASTQ, BED, BEDPE, SAM, BAM, and VCF), and can convert and interface between these formats. Furthermore, our modular library structure provides a flexible platform for researchers developing their own software tools to address specific questions. These commands can be combined and incorporated into complex pipelines to meet the growing need for high-performance bioinformatic resources.
Availability and implementation: Gonomics is implemented in the Go programming language. Source code, installation instructions, and documentation are freely available at https://github.com/vertgenlab/gonomics.
Categories: Bioinformatics Trends
MULGA, a unified multi-view graph autoencoder-based approach for identifying drug-protein interaction and drug repositioning
Abstract
Motivation: Identifying drug-protein interactions (DPIs) is a critical step in drug repositioning, which allows reuse of approved drugs that may be effective for treating a different disease and thereby alleviates the challenges of new drug development. Although a great variety of computational approaches for DPI prediction have been proposed, key challenges, such as extendable and unbiased similarity calculation, heterogeneous information utilization and reliable negative sample selection, remain to be addressed.
Results: To address these issues, we propose a novel, unified multi-view graph autoencoder framework, termed MULGA, for both DPI and drug repositioning predictions. MULGA features: (i) a multi-view learning technique to effectively learn authentic drug affinity and target affinity matrices; (ii) a graph autoencoder to infer missing DPIs; and (iii) a new “guilty-by-association”-based negative sampling approach for selecting highly reliable non-DPIs. Benchmark experiments demonstrate that MULGA outperforms state-of-the-art methods in DPI prediction, and the ablation studies verify the effectiveness of each proposed component. Importantly, we highlight the top drugs shortlisted by MULGA that target the spike glycoprotein of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), offering additional insights into, and potentially useful treatment options for, COVID-19. Together with the availability of datasets and source code, we envision that MULGA can be explored as a useful tool for DPI prediction and drug repositioning.
Availability and implementation: MULGA is publicly available for academic purposes at https://github.com/jianiM/MULGA/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
CellAnn: A comprehensive, super-fast, and user-friendly single-cell annotation web server
Abstract
Motivation: Single-cell sequencing technology has become routine in studying many biological problems. A core step of analyzing single-cell data is the assignment of cell clusters to specific cell types. Reference-based methods have been proposed for predicting cell types for single-cell clusters. However, limited scalability and the lack of preprocessed reference datasets prevent them from being practical and easy to use.
Results: Here we introduce a reference-based cell annotation web server, CellAnn, which is super-fast and easy to use. CellAnn contains a comprehensive reference database with 204 human and 191 mouse single-cell datasets. These reference datasets cover 32 organs. Furthermore, we developed a cluster-to-cluster alignment method to transfer cell labels from the reference to the query datasets, which is superior to the existing methods with higher accuracy and higher scalability. Finally, CellAnn is an online tool that integrates all the procedures in cell annotation, including reference searching, transferring cell labels, visualizing results, and harmonizing cell annotation labels. Through the user-friendly interface, users can identify the best annotation by cross-validating with multiple reference datasets. We believe that CellAnn can greatly facilitate single-cell sequencing data analysis.
Availability and implementation: The web server is available at www.cellann.io, and the source code is available at https://github.com/Pinlyu3/CellAnn_shinyapp.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Pretrained Transformer Models for Predicting the Withdrawal of Drugs from the Market
Abstract
Motivation: The process of drug discovery is notoriously complex, costing an average of 2.6 billion dollars and taking approximately 13 years to bring a new drug to the market. The success rate for new drugs is alarmingly low (around 0.0001%), and severe adverse drug reactions (ADRs) frequently occur, some of which may even result in death. Early identification of potential ADRs is critical to improve the efficiency and safety of the drug development process.
Results: In this study, we employed pretrained large language models (LLMs) to predict the likelihood of a drug being withdrawn from the market due to safety concerns. Our method achieved an area under the curve (AUC) of over 0.75 through cross-database validation, outperforming classical machine-learning models and graph-based models. Notably, our pretrained LLMs successfully identified over 50% of the drugs that were subsequently withdrawn, when predictions were made on a subset of drugs with inconsistent labeling between the training and test sets.
Availability: The code and datasets are available at https://github.com/eyalmazuz/DrugWithdrawn.
Supplementary information: Supplementary data associated with this research are available at Bioinformatics online.
Categories: Bioinformatics Trends
Coherent pathway enrichment estimation by modeling inter-pathway dependencies using regularized regression
Abstract
Motivation: Gene set enrichment methods are a common tool to improve the interpretability of gene lists as obtained, for example, from differential gene expression analyses. They are based on computing whether dysregulated genes are located in certain biological pathways more often than expected by chance. Gene set enrichment tools rely on pre-existing pathway databases such as KEGG, Reactome, or the Gene Ontology. These databases are increasing in size and in the number of redundancies between pathways, which complicates the statistical enrichment computation.
Results: We address this problem and develop a novel gene set enrichment method, called pareg, which is based on a regularized generalized linear model and directly incorporates dependencies between gene sets related to certain biological functions, for example, due to shared genes, in the enrichment computation. We show that pareg is more robust to noise than competing methods. Additionally, we demonstrate the ability of our method to recover known pathways as well as to suggest novel treatment targets in an exploratory analysis using breast cancer samples from TCGA.
Availability and implementation: pareg is freely available as an R package on Bioconductor (https://bioconductor.org/packages/release/bioc/html/pareg.html) as well as on https://github.com/cbg-ethz/pareg. The GitHub repository also contains the Snakemake workflows needed to reproduce all results presented here.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
metGWAS 1.0: An R workflow for network-driven over-representation analysis between independent metabolomic and meta-genome-wide association studies
Abstract
Motivation: Combining GWAS and metabolomics provides a quantitative approach to pinpoint metabolic pathways and genes linked to specific diseases; however, such analyses require both genomics and metabolomics datasets from the same individuals/samples. In most cases, this approach is not feasible due to high costs, lack of technical infrastructure, unavailability of samples, and other factors. Therefore, an unmet need exists for a bioinformatics tool that can identify gene loci-associated polymorphic variants for metabolite alterations seen in disease states using standalone metabolomics.
Results: Here, we developed a bioinformatics tool, metGWAS 1.0, that integrates independent GWAS data from the GWAS database and standalone metabolomics data using a network-based systems biology approach to identify novel disease/trait-specific metabolite-gene associations. The tool was evaluated using standalone metabolomics datasets extracted from two metabolomics-GWAS case studies. It discovered both the observed and novel gene loci with known single nucleotide polymorphisms when compared to the original studies.
Availability and implementation: The developed metGWAS 1.0 framework is implemented in an R pipeline and available at: https://github.com/saifurbd28/metGWAS-1.0.
Categories: Bioinformatics Trends
crosshap: R package for local haplotype visualization for trait association analysis
Abstract
Summary: GWAS excels at harnessing dense genomic variant datasets to identify candidate regions responsible for producing a given phenotype. However, GWAS and traditional fine-mapping methods do not provide insight into the complex local landscape of linkage that contains and has been shaped by the causal variant(s). Here, we present ‘crosshap’, an R package that performs robust density-based clustering of variants based on their linkage profiles to capture haplotype structures in a local genomic region of interest. Following this, ‘crosshap’ is equipped with visualization tools for choosing optimal clustering parameters (ɛ) before producing an intuitive figure that provides an overview of the complex relationships between linked variants, haplotype combinations, phenotype and metadata traits.
Availability: The ‘crosshap’ package is freely available under the MIT license and can be downloaded directly from CRAN with R > 4.0.0. The development version is available on GitHub alongside issue support (https://github.com/jacobimarsh/crosshap). Tutorial vignettes and documentation are available (https://jacobimarsh.github.io/crosshap/).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
MSDRP: a deep learning model based on multi-source data for predicting drug response
Abstract
Motivation: Cancer heterogeneity drastically affects cancer therapeutic outcomes. Predicting drug response in vitro is expected to help formulate personalized therapy regimens. In recent years, several computational models based on machine learning and deep learning have been proposed to predict drug response in vitro. However, most of these methods capture drug features based on a single drug description (e.g., drug structure), without considering the relationships between drugs and biological entities (e.g., targets, diseases and side effects). Moreover, most of these methods collect features separately for drugs and cell lines but fail to consider the pairwise interactions between drugs and cell lines.
Results: In this paper, we propose a deep learning framework, named MSDRP, for drug response prediction. MSDRP uses an interaction module to capture interactions between drugs and cell lines, and integrates multiple associations/interactions between drugs and biological entities through similarity network fusion (SNF) algorithms, outperforming some state-of-the-art models in all performance measures for all experiments. The experimental results of the de novo test and the independent test demonstrate the excellent performance of our model for new drugs. Furthermore, several case studies illustrate the rationale for using feature vectors derived from drug similarity matrices from multi-source data to represent drugs, as well as the interpretability of our model.
Availability: The code of MSDRP is available at https://github.com/xyzhang-10/MSDRP.
Categories: Bioinformatics Trends
Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation
Abstract
Motivation: The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates.
Results: To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications.
Availability: MashMap3 is available at https://github.com/marbl/MashMap
Categories: Bioinformatics Trends
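The minmer idea described above, sampling several k-mers per window instead of one, can be sketched in a few lines. This is a naive illustration and not MashMap's implementation: it re-sorts every window instead of maintaining a rolling minhash, it assumes the sampled k-mers are the `s` smallest hash values in each window, and the function names and the choice of BLAKE2b as the hash are arbitrary. The Jaccard estimate uses the standard bottom-s minhash estimator.

```python
import hashlib


def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (Python's built-in hash is salted)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def window_sketches(seq, k, w, s):
    """For each window of w consecutive k-mers, its s smallest distinct hashes."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    return [sorted(set(hashes[i:i + w]))[:s] for i in range(len(hashes) - w + 1)]


def minmers(seq, k, w, s):
    """Hashes sampled in at least one window; s=1 reduces to minimizers."""
    return set().union(*window_sketches(seq, k, w, s))


def jaccard_estimate(sketch_a, sketch_b, s):
    """Bottom-s minhash estimate of the Jaccard similarity of two k-mer sets."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(h in shared for h in merged) / len(merged)
```

Two windows can then be compared with `jaccard_estimate(window_sketches(a, k, w, s)[i], window_sketches(b, k, w, s)[j], s)`; larger `s` lowers the variance of the estimate at the cost of storing more sampled k-mers.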
Optimal Selection of Suitable Templates in Protein Interface Prediction
Abstract
Motivation: Molecular-level classification of protein-protein interfaces can greatly assist in functional characterization and rational drug design. The most accurate protein interface predictions rely on finding homologous proteins with known interfaces since most interfaces are conserved within the same protein family. The accuracy of these template-based prediction approaches depends on the correct choice of suitable templates. Choosing the right templates in the immunoglobulin superfamily (IgSF) is challenging because its members share low sequence identity and display a wide range of alternative binding sites despite structural homology.
Results: We present a new approach to predict protein interfaces. First, template-specific, informative evolutionary profiles are established using a mutual information-based approach. Next, based on the similarity of residue-level conservation scores derived from the evolutionary profiles, a query protein is hierarchically clustered with all available template proteins in its superfamily with known interface definitions. Once clustered, a subset of the most closely related templates is selected, and an interface prediction is made. These initial interface predictions are subsequently refined by extensive docking. This method was benchmarked on 51 IgSF proteins and can predict non-trivial interfaces of IgSF proteins with an average and median F-score of 0.64 and 0.78, respectively. We also provide a way to assess the confidence of the results. The average and median F-scores increase to 0.8 and 0.81, respectively, if 27% of low confidence cases and 17% of medium confidence cases are removed. Lastly, we provide residue-level interface predictions, protein complexes, and confidence measurements for singletons in the IgSF.
Availability: Source code is freely available at: https://gitlab.com/fiserlab.org/interdct_with_refinement
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
StarPep Toolbox: An Open-Source Software to Assist Chemical Space Analysis of Bioactive Peptides and Their Functions using Complex Networks
Abstract
Motivation: Antimicrobial peptides (AMPs) are promising molecules to treat infectious diseases caused by multidrug-resistant pathogens, some types of cancer, and other conditions. Computer-aided strategies are efficient tools for the high-throughput screening of AMPs.
Results: This report highlights StarPep Toolbox, an open-source and user-friendly software to study the bioactive chemical space of AMPs using complex network-based representations, clustering, and similarity-searching models. The novelty of this research lies in the combination of network science and similarity-searching techniques, distinguishing it from conventional methods based on machine learning and other computational approaches. The network-based representation of the AMP chemical space presents promising opportunities for peptide drug repurposing, development, and optimization. This approach could serve as a baseline for the discovery of a new generation of therapeutic peptides.
Availability: All underlying code and installation files are accessible through GitHub (https://github.com/Grupo-Medicina-Molecular-y-Traslacional/StarPep) under the Apache 2.0 license.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
Dynamic applicability domain (dAD): compound-target binding affinity estimates with local conformal prediction
Abstract
Motivation: Increasing efforts are being made in the field of machine learning to advance the learning of robust and accurate models from experimentally measured data and enable more efficient drug discovery processes. The prediction of binding affinity is one of the most frequent tasks of compound bioactivity modelling. Learned models for binding affinity prediction are assessed by their average performance on unseen samples, but point predictions are typically not provided with a rigorous confidence assessment. Approaches such as the conformal predictor framework equip conventional models with a more rigorous assessment of confidence for individual point predictions. In this paper, we extend the inductive conformal prediction (ICP) framework for interaction data, in particular the compound-target binding affinity prediction task. The new framework is based on dynamically defined calibration sets that are specific for each testing pair and provides prediction assessment in the context of calibration pairs from its compound-target neighbourhood, enabling improved estimates based on the local properties of the prediction model.
Results: The effectiveness of the approach is benchmarked on several publicly available datasets and tested in realistic use-case scenarios with increasing levels of difficulty on a complex compound-target binding affinity space. We demonstrate that in such scenarios, a novel approach combining the applicability domain paradigm with the conformal prediction framework produces superior confidence assessment, with valid and more informative prediction regions, compared to other state-of-the-art conformal prediction approaches.
Availability: The dataset and the code are available on GitHub (https://github.com/mlkr-rbi/dAD).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
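A calibration set chosen per test sample, as described in the dAD abstract, can be illustrated with a minimal inductive conformal regression sketch. Several things here are assumptions rather than the authors' method: the nonconformity score is the absolute residual, the neighbourhood is the `k` nearest calibration samples under a caller-supplied `distance` function, and samples are single objects instead of compound-target pairs. All names are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """The ceil((n + 1) * (1 - alpha))-th smallest residual.

    Returns infinity when the calibration set is too small for the
    requested confidence level.
    """
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(residuals)[rank - 1]


def local_interval(x_test, y_pred, calibration, k, alpha, distance):
    """Prediction interval calibrated on the k nearest calibration samples.

    calibration is a list of (x, y_true, y_model) triples held out from training.
    """
    nearest = sorted(calibration, key=lambda c: distance(x_test, c[0]))[:k]
    residuals = [abs(y_true - y_model) for _, y_true, y_model in nearest]
    q = conformal_quantile(residuals, alpha)
    return y_pred - q, y_pred + q
```

Because only nearby calibration samples contribute, the interval narrows in regions where the model is locally accurate and widens where it is not, whereas a single global calibration set would give every prediction the same width.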
Predicted structural proteome of Sphagnum divinum and proteome-scale annotation
Abstract
Motivation: Sphagnum-dominated peatlands store a substantial amount of terrestrial carbon. The genus is undersampled and under-studied. No experimental crystal structure from any Sphagnum species exists in the Protein Data Bank and fewer than 200 Sphagnum-related genes have structural models available in the AlphaFold Protein Structure Database. Tools and resources are needed to help bridge these gaps, and to enable the analysis of other structural proteomes now made possible by accurate structure prediction.
Results: We present the predicted structural proteome (25,134 primary transcripts) of S. divinum computed using AlphaFold, structural alignment results of all high-confidence models against an annotated non-redundant crystallographic database of over 90,000 structures, a structure-based classification of putative Enzyme Commission (EC) numbers across this proteome, and the computational method to perform this proteome-scale structure-based annotation.
Availability: All data and code are available in public repositories, detailed at https://github.com/BSDExabio/SAFA. The structural models of the S. divinum proteome have been deposited in the ModelArchive repository at https://modelarchive.org/doi/10.5452/ma-ornl-sphdiv.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends