Subscribe to Bioinformatics Oxford Journals feed
Updated: 5 hours 5 min ago

SpikePro: a webserver to predict the fitness of SARS-CoV-2 variants

Thu, 21/07/2022 - 5:30am
Abstract
Motivation: The SARS-CoV-2 virus has shown a remarkable ability to evolve and spread across the globe through successive waves of variants since the original Wuhan lineage. Despite all the efforts of the last two years, the early and accurate prediction of variant severity is still a challenging issue which needs to be addressed to help, for example, the decision of activating COVID-19 plans long before the peak of new waves. Upstream preparation would indeed make it possible to avoid the overflow of health systems and limit the most severe cases.
Results: We recently developed SpikePro, a structure-based computational model capable of quickly and accurately predicting the viral fitness of a variant from its spike protein sequence. It is based on the impact of mutations on the stability of the spike protein as well as on its binding affinity for the angiotensin-converting enzyme 2 (ACE2) and for a set of neutralizing antibodies. It yields a precise indication of the virus transmissibility, infectivity, immune escape and basic reproduction rate. We present here an updated version of the model that is now available on an easy-to-use webserver, and illustrate its power in a retrospective study of fitness evolution and reproduction rate of the main viral lineages. SpikePro is thus expected to be of great help in assessing the fitness of newly emerging SARS-CoV-2 variants in genomic surveillance and viral evolution programs.
Availability: SpikePro webserver: http://babylone.ulb.ac.be/SpikePro/
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

PanTools v3: functional annotation, classification, and phylogenomics

Thu, 21/07/2022 - 5:30am
Abstract
Summary: The ever-increasing number of sequenced genomes necessitates the development of pangenomic approaches for comparative genomics. Introduced in 2016, PanTools is a platform that allows pangenome construction, homology grouping and pangenomic read mapping. The use of graph database technology makes PanTools versatile, applicable from small viral genomes like SARS-CoV-2 up to large plant or animal genomes like tomato or human. Here we present our third major update to PanTools that enables the integration of functional annotations and provides both gene-level analyses and phylogenetics.
Availability and implementation: PanTools is implemented in Java 8 and released under the GNU GPLv3 license. Software and documentation are available at https://git.wur.nl/bioinformatics/pantools
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Predicting and explaining the impact of genetic disruptions and interactions on organismal viability

Thu, 21/07/2022 - 5:30am
Abstract
Motivation: Existing computational models can predict single- and double-mutant fitness but they do have limitations. First, they are often tested via evaluation metrics that are inappropriate for imbalanced datasets. Second, all of them only predict a binary outcome (viable or not, and negatively interacting or not). Third, most are uninterpretable black box machine learning models.
Results: Budding yeast datasets were used to develop high performance Multinomial Regression (MN) models capable of predicting the impact of single, double, and triple genetic disruptions on viability. These models are interpretable, give realistic non-binary predictions, and can predict negative genetic interactions in triple-gene knockouts. They are based on a limited set of gene features, and their predictions are influenced by the probability of the target gene participating in molecular complexes or pathways. Furthermore, the MN models have utility in other organisms such as fission yeast, fruit flies, and humans, with the single gene fitness MN model being able to distinguish essential genes necessary for cell autonomous viability from those required for multicellular survival. Finally, our models exceed the performance of previous models, without sacrificing interpretability.
Availability: All code used to generate results and figures in this manuscript is available at our GitHub repository at https://github.com/KISRDevelopment/cell_viability_paper. The repository also contains a link to the genetic interaction (GI) prediction website that lets users search for GIs using the MN models.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

ANISE: an Application to Design Mechanobiology Simulations of Planar Epithelia

Wed, 20/07/2022 - 5:30am
Abstract
Summary: TiFoSi is an efficient computational tool for performing mechanobiology simulations of planar epithelia. A drawback of this tool is that it relies on an XML configuration file (input data) that can be cumbersome to set up and/or decode due to the endless possibilities of the software. Moreover, some modeling know-how is needed in order to provide equations that describe gene regulatory interactions. These factors limit the usability of this tool for users with a weak computational and/or mathematical background. Here we introduce ANISE, a web app that makes it easy to set up the configuration of mechanobiology simulations using TiFoSi. The application covers all the configuration modules in TiFoSi comprehensively (from basic to advanced editing options) and uses a graphical approach (e.g., to build the modeling equations of gene regulatory networks).
Availability: http://github.com/lsym-uveg/anise (server: http://lsymserver.uv.es/lsym/ANISE)
Categories: Bioinformatics Trends

High-dimension to high-dimension screening for detecting genome-wide epigenetic and noncoding RNA regulators of gene expression

Wed, 20/07/2022 - 5:30am
Abstract
Motivation: The advancement of high-throughput technology characterizes a wide variety of epigenetic modifications and noncoding RNAs across the genome involved in disease pathogenesis via regulating gene expression. The high dimensionality of both epigenetic/noncoding RNA and gene expression data makes it challenging to identify the important regulators of genes. Conducting a univariate test for each possible regulator-gene pair is subject to a serious multiple comparison burden, and direct application of regularization methods to select regulator-gene pairs is computationally infeasible. Applying fast screening to reduce dimension first before regularization is more efficient and stable than applying regularization methods alone.
Results: We propose a novel screening method based on robust partial correlation to detect epigenetic and noncoding RNA regulators of gene expression over the whole genome, a problem that includes both high-dimensional predictors and high-dimensional responses. Compared to existing screening methods, our method is conceptually innovative in that it reduces the dimension of both predictor and response, and screens at both node (regulators or genes) and edge (regulator-gene pairs) levels. We develop data-driven procedures to determine the conditional sets and the optimal screening threshold, and implement a fast iterative algorithm. Simulations and applications to long non-coding RNA and microRNA regulation in Kidney cancer and DNA methylation regulation in Glioblastoma Multiforme illustrate the validity and advantage of our method.
Availability: The R package, related source codes and real data sets used in this paper are provided at https://github.com/kehongjie/rPCor.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching

Tue, 19/07/2022 - 5:30am
Abstract
Motivation: Adaptive immune receptor (AIR) repertoires (AIRRs) record past immune encounters with exquisite specificity. Therefore, identifying identical or similar AIR sequences across individuals is a key step in AIRR analysis for revealing convergent immune response patterns that may be exploited for diagnostics and therapy. Existing methods for quantifying AIRR overlap scale poorly with increasing dataset numbers and sizes. To address this limitation, we developed CompAIRR, which enables ultra-fast computation of AIRR overlap, based on either exact or approximate sequence matching.
Results: CompAIRR improves computational speed 1000-fold relative to the state of the art and uses only one-third of the memory: on the same machine, the exact pairwise AIRR overlap of 10^4 AIRRs with 10^5 sequences is found in ∼17 minutes, while the fastest alternative tool requires 10 days. CompAIRR has been integrated with the machine learning ecosystem immuneML to speed up commonly used AIRR-based machine learning applications.
Availability: CompAIRR code and documentation are available at https://github.com/uio-bmi/compairr. Docker images are available at https://hub.docker.com/r/torognes/compairr. The code to replicate the synthetic datasets, scripts for benchmarking and creating figures, and all raw data underlying the figures are available at https://github.com/uio-bmi/compairr-benchmarking.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
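
To make the notion of repertoire overlap by exact or approximate matching in the CompAIRR entry concrete, here is a minimal brute-force sketch. It is not CompAIRR's algorithm or interface: the function names, the choice of Hamming distance as the approximate-matching criterion, and the overlap definition (sum over matching sequence pairs of the product of their counts) are all assumptions made for illustration, and the toy sequences are invented.

```python
from collections import Counter


def hamming_within(a, b, max_dist):
    """Return True if a and b have equal length and differ in at most max_dist positions."""
    if len(a) != len(b):
        return False
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > max_dist:
                return False
    return True


def repertoire_overlap(rep_a, rep_b, max_dist=0):
    """Count matching sequence pairs between two repertoires (illustrative definition).

    Each pair of matching distinct sequences contributes count_a * count_b.
    max_dist=0 means exact matching; larger values allow substitutions only.
    """
    counts_a = Counter(rep_a)
    counts_b = Counter(rep_b)
    if max_dist == 0:
        # Exact matching reduces to a hash lookup per distinct sequence.
        return sum(n * counts_b[s] for s, n in counts_a.items() if s in counts_b)
    total = 0
    # Quadratic scan: fine for a toy example, far too slow for real repertoires.
    for seq_a, n_a in counts_a.items():
        for seq_b, n_b in counts_b.items():
            if hamming_within(seq_a, seq_b, max_dist):
                total += n_a * n_b
    return total


# Hypothetical toy repertoires.
rep_a = ["CASSL", "CASSF", "CAWSV"]
rep_b = ["CASSL", "CASSL", "CATSV"]
exact = repertoire_overlap(rep_a, rep_b)
approx = repertoire_overlap(rep_a, rep_b, max_dist=1)
```

The quadratic inner loop is exactly the kind of cost that becomes prohibitive at the dataset sizes quoted in the abstract, which is why a dedicated tool is needed there.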

Avoiding C-hacking when evaluating survival distribution predictions with discrimination measures

Tue, 12/07/2022 - 5:30am
Abstract
Motivation: In this paper we consider how to evaluate survival distribution predictions with measures of discrimination. This is non-trivial as discrimination measures are the most commonly used in survival analysis and yet there is no clear method to derive a risk prediction from a distribution prediction. We survey methods proposed in the literature and software and consider their respective advantages and disadvantages.
Results: Whilst distributions are frequently evaluated by discrimination measures, we find that the method for doing so is rarely described in the literature and often leads to unfair comparisons or ‘C-hacking’. We demonstrate by example how simple it can be to manipulate results and use this to argue for better reporting guidelines and transparency in the literature. We recommend that machine learning survival analysis software implements clear transformations between distribution and risk predictions in order to allow more transparent and accessible model evaluation.
Availability: The code used in the final experiment is available at https://github.com/RaphaelS1/distribution_discrimination.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
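
The C-hacking entry turns on the fact that a predicted survival distribution must be reduced to a single risk score before a concordance index can be computed, and that the choice of reduction can change the result. The sketch below illustrates this with a plain Harrell-style concordance index and three candidate reductions. It is not taken from the paper's code: the function name, the three transformations and the toy survival curves (deliberately chosen to cross) are assumptions for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell-style C: share of comparable pairs where the earlier event has the higher risk.

    A pair (i, j) is comparable when subject i has an observed event before time j.
    Ties in risk count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical predicted survival probabilities S(t) at t = 1, 2, 3 for three subjects.
surv = [
    [0.90, 0.50, 0.10],  # subject 0, event at t = 1
    [0.80, 0.60, 0.30],  # subject 1, event at t = 2
    [0.70, 0.65, 0.60],  # subject 2, event at t = 3
]
times = [1, 2, 3]
events = [1, 1, 1]

# Three ways of turning a distribution into a risk score (all assumptions).
risk_early = [1 - s[0] for s in surv]   # 1 - S(t) at the first time point
risk_late = [1 - s[-1] for s in surv]   # 1 - S(t) at the last time point
risk_mean = [-sum(s) for s in surv]     # negated sum of S(t), a proxy for expected survival

c_early = concordance_index(times, events, risk_early)
c_late = concordance_index(times, events, risk_late)
c_mean = concordance_index(times, events, risk_mean)
```

On this toy data the same predictions score 0.0 under the first transformation and 1.0 under the other two, which is why an evaluation that does not state its transformation is hard to compare with another.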

DiffChIPL: A differential peak analysis method for high throughput sequencing data with biological replicates based on limma

Sat, 09/07/2022 - 5:30am
Abstract
Motivation: ChIP-seq detects protein-DNA interactions within chromatin, such as that of chromatin structural components and transcription machinery. ChIP-seq profiles are often noisy and variable across replicates, posing a challenge to the development of effective algorithms to accurately detect differential peaks. Methods have recently been designed for this purpose but sometimes yield conflicting results that are inconsistent with the underlying biology. Most existing algorithms perform well on limited datasets. To improve differential analysis of ChIP-seq, we present a novel Differential analysis method for ChIP-seq based on limma (DiffChIPL).
Results: DiffChIPL is adaptive to asymmetrical or symmetrical data and can accurately report global differences. We used simulated and real datasets for transcription factor (TF) and histone modification marks to validate and benchmark our algorithm. DiffChIPL shows superior performance in sensitivity and false positive rate (FPR) in different simulations and control datasets. DiffChIPL also performs well on real ChIP-seq, CUT&RUN, CUT&Tag, and ATAC-seq datasets. DiffChIPL is an accurate and robust method, exhibiting better performance in differential analysis for a variety of applications including TF binding, histone modifications, and chromatin accessibility.
Availability: https://github.com/yancychy/DiffChIPL.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Identifying modifications on DNA-bound histones with joint deep learning of multiple binding sites in DNA sequence

Sat, 09/07/2022 - 5:30am
Abstract
Motivation: Histone modifications are epigenetic markers that impact gene expression by altering the chromatin structure or recruiting histone modifiers. Their accurate identification is key to unraveling the mechanisms by which they regulate gene expression. However, the solutions for this task can be improved by exploiting multiple relationships from the dataset and exploring designs of learning models, e.g. joint learning techniques.
Results: This paper proposes a deep learning-based multi-objective computational approach, iHMnBS, to identify which of the seven typical histone modifications a DNA sequence may choose to bind, and which parts of the DNA sequence bind to them. iHMnBS employs a customized dataset that allows the marking of modifications contained in histones that may bind to any position in the DNA sequence. iHMnBS tries to mine the information implicit in this richer data by means of deep neural networks. In comprehensive comparisons, iHMnBS outperforms a baseline method, and the probability of binding to modified histones assigned to a representative nucleotide of a DNA sequence can serve as a reference for biological experiments. Since the interaction between transcription factors (TFs) and histone modifications has an important role in gene expression, we extracted a number of sequence patterns that may bind to TFs, and explored their possible impact on disease.
Availability: The source code is available at https://github.com/lennylv/iHMnBS.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Optimization of synthetic molecular reporters for a mesenchymal glioblastoma transcriptional program by integer programming

Sat, 09/07/2022 - 5:30am
Abstract
Motivation: A recent approach to perform genetic tracing of complex biological problems involves the generation of synthetic DNA probes that specifically mark cells with a phenotype of interest. These synthetic locus control regions (sLCRs), in turn, drive the expression of a reporter gene, such as a fluorescent protein. To build functional and specific sLCRs, it is critical to accurately select multiple bona fide cis-regulatory elements from the target cell phenotype cistrome. This selection occurs by maximizing the number and diversity of transcription factors (TFs) within the sLCR, yet the size of the final sLCR should remain limited.
Results: In this work, we discuss how optimization, in particular integer programming, can be used to systematically address the construction of a specific sLCR and optimize pre-defined properties of the sLCR. Our presented instance of a linear optimization problem maximizes the activation potential of the sLCR such that its size is limited to a pre-defined length and a minimum number of all TFs deemed sufficiently characteristic for the phenotype of interest is covered. We generated an sLCR to trace the mesenchymal glioblastoma program in patients by solving our corresponding linear program with the software optimizer Gurobi. Considering the binding strength of transcription factor binding sites (TFBSs) with their TFs as a proxy for activation potential, the optimized sLCR scores similarly to an sLCR experimentally validated in vivo, and is smaller in size while having the same coverage of TFBSs.
Availability: We provide a Python implementation of the presented framework in the Supplementary material with which an optimal selection of cis-regulatory elements can be calculated once the target set of TFs and their binding strength with their TFBSs is known.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
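
The selection problem described in the sLCR entry (maximise activation potential subject to a length limit and a minimum number of covered TFs) can be stated compactly in code. The sketch below solves a toy instance by exhaustive enumeration instead of integer programming, so it needs no solver; it is not the authors' implementation, and the element names, lengths, scores and TF labels are all hypothetical.

```python
from itertools import combinations


def best_selection(elements, max_len, min_tfs):
    """Pick the subset of elements with the highest total score.

    Constraints: total length <= max_len and at least min_tfs distinct TFs covered.
    Exhaustive search over all subsets, so only suitable for very small inputs;
    an integer programming solver would replace this loop at realistic scale.
    """
    best_score, best_names = None, None
    for size in range(1, len(elements) + 1):
        for subset in combinations(elements, size):
            total_len = sum(e["length"] for e in subset)
            covered = set().union(*(e["tfs"] for e in subset))
            if total_len > max_len or len(covered) < min_tfs:
                continue  # violates the length or coverage constraint
            score = sum(e["score"] for e in subset)
            if best_score is None or score > best_score:
                best_score, best_names = score, [e["name"] for e in subset]
    return best_score, best_names


# Hypothetical candidate cis-regulatory elements.
elements = [
    {"name": "e1", "length": 100, "score": 5, "tfs": {"A", "B"}},
    {"name": "e2", "length": 150, "score": 7, "tfs": {"B"}},
    {"name": "e3", "length": 120, "score": 4, "tfs": {"C"}},
    {"name": "e4", "length": 80, "score": 3, "tfs": {"A", "C"}},
]
score, chosen = best_selection(elements, max_len=300, min_tfs=3)
```

Note that the two highest-scoring elements together (e1 and e2) are rejected because they cover only two TFs, which shows how the coverage constraint shapes the optimum.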

RNAloops: a database of RNA multiloops

Sat, 09/07/2022 - 5:30am
Abstract
Motivation: Knowledge of the three-dimensional structure of RNA supports discovering its functions and is crucial for designing drugs and modern therapeutic solutions. Thus, much attention is devoted to experimental determination and computational prediction targeting the global fold of RNA and its local substructures. The latter include multi-branched loops – functionally significant elements that highly affect the spatial shape of the entire molecule. Unfortunately, their computational modeling constitutes a weak point of structural bioinformatics. A remedy for this is in collecting these motifs and analyzing their features.
Results: RNAloops is a self-updating database that stores multi-branched loops identified in the PDB-deposited RNA structures. A description of each loop includes angular data – planar and Euler angles computed between pairs of adjacent helices to allow studying their mutual arrangement in space. The system enables search and analysis of multiloops, presents their structure details numerically and visually, and computes data statistics.
Availability: RNAloops is freely accessible at https://rnaloops.cs.put.poznan.pl.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Needle: A fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments

Fri, 08/07/2022 - 5:30am
Abstract
Motivation: The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data.
Results: As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in less than two hours and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query.
Supplementary information: Supplementary data are available at Bioinformatics online.
Availability and implementation: https://github.com/seqan/needle
Categories: Bioinformatics Trends
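
The idea in the Needle entry of keeping several Bloom filters, each holding k-mers selected by their multiplicity, can be illustrated with a heavily simplified sketch. Everything here is an assumption rather than Needle's actual design: the filters are separate plain Bloom filters rather than interleaved ones, every k-mer is used instead of a representative subset, a k-mer is stored in every level whose threshold it reaches, and the estimate is simply the highest level at which at least half of the query k-mers are found.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """A small Bloom filter backed by a bytearray (may report false positives)."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(reads, k, thresholds):
    """One Bloom filter per threshold, holding k-mers seen at least that many times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    filters = {t: BloomFilter() for t in thresholds}
    for km, count in counts.items():
        for t in thresholds:
            if count >= t:
                filters[t].add(km)
    return filters


def estimate(filters, transcript, k, min_fraction=0.5):
    """Highest threshold whose filter contains at least min_fraction of the query k-mers."""
    query = kmers(transcript, k)
    for t in sorted(filters, reverse=True):
        hits = sum(km in filters[t] for km in query)
        if hits >= min_fraction * len(query):
            return t
    return 0


# Hypothetical toy experiment: one sequence seen four times, another seen once.
reads = ["ACGTAC"] * 4 + ["TTGCAA"]
index = build_index(reads, k=4, thresholds=[1, 2, 4])
high = estimate(index, "ACGTAC", k=4)
low = estimate(index, "TTGCAA", k=4)
absent = estimate(index, "GGGGGG", k=4)
```

The estimate is only as fine-grained as the chosen thresholds, which is the trade-off that makes such an index small: it stores membership per level rather than exact counts.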

TogoID: an exploratory ID converter to bridge biological datasets

Fri, 08/07/2022 - 5:30am
Abstract
Motivation: Understanding life cannot be accomplished without making full use of biological data, which are scattered across databases of diverse categories in life sciences. To connect such data seamlessly, identifier (ID) conversion plays a key role. However, existing ID conversion services have disadvantages, such as covering only a limited range of biological categories of databases, not keeping up with the updates of the original databases, and outputs being hard to interpret in the context of biological relations, especially when converting IDs in multiple steps.
Results: TogoID is an ID conversion service implementing unique features with an intuitive web interface and an API for programmatic access. TogoID currently supports 65 datasets covering various biological categories. TogoID users can perform exploratory multistep conversions to find a path among IDs. To guide the interpretation of biological meanings in the conversions, we crafted an ontology that defines the semantics of the dataset relations.
Availability and implementation: The TogoID service is freely available on the TogoID website, and the API is also provided to allow programmatic access. To encourage developers to add new dataset pairs, the system stores the configurations of pairs at the GitHub repository and accepts requests for additional pairs.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

GenBank as a Source to Monitor and Analyze Host-Microbiome Data

Fri, 08/07/2022 - 5:30am
Abstract
Motivation: Microbiome datasets are often constrained by sequencing limitations. GenBank is the largest collection of publicly available DNA sequences, which is maintained by the National Center for Biotechnology Information (NCBI). The metadata of GenBank records are a largely understudied resource and may be uniquely leveraged to access the sum of prior studies focused on microbiome composition. Here, we developed a computational pipeline to analyze GenBank metadata, containing data on hosts, microorganisms, and their place of origin. This work provides the first opportunity to leverage the totality of GenBank to shed light on compositional data practices that shape how microbiome datasets are formed as well as examine host-microbiome relationships.
Results: The collected dataset contains multiple kingdoms of microorganisms, consisting of bacteria, viruses, archaea, protozoa, fungi, and invertebrate parasites, and hosts of multiple taxonomical classes, including mammals, birds, and fish. A human data subset of this dataset provides insights into gaps in current microbiome data collection, which is biased towards clinically relevant pathogens. Clustering and phylogenetic analysis reveal the potential to use these data to model host taxonomy and evolution, revealing groupings formed by host diet, environment, and coevolution.
Availability: GenBank Host-Microbiome Pipeline is available at https://github.com/bcbi/genbank_holobiome. The GenBank loader is available at https://github.com/bcbi/genbank_loader.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Docking cyclic peptides formed by a disulfide bond through a hierarchical strategy

Fri, 08/07/2022 - 5:30am
Abstract
Motivation: Cyclization is a common strategy to enhance the therapeutic potential of peptides. Many cyclic peptide drugs have been approved for clinical use, among which disulfide-driven cyclic peptides are one of the most prevalent categories. Molecular docking is a powerful computational method to predict the binding modes of molecules. For protein-cyclic peptide docking, a big challenge is considering the flexibility of peptides with conformers constrained by cyclization.
Results: Integrating our efficient peptide 3D conformation sampling algorithm MODPEP2.0 and knowledge-based scoring function ITScorePP, we have proposed an extended version of our hierarchical peptide docking algorithm, named HPEPDOCK2.0, to predict the binding modes of the peptide cyclized through a disulfide against a protein. Our HPEPDOCK2.0 approach was extensively evaluated on diverse test sets and compared with the state-of-the-art cyclic peptide docking program AutoDock CrankPep (ADCP). On a benchmark data set of 18 cyclic peptide-protein complexes, HPEPDOCK2.0 obtained a native contact fraction of above 0.5 for 61% of the cases when the top prediction was considered, compared with 39% for ADCP. On a larger test set of 25 cyclic peptide-protein complexes, HPEPDOCK2.0 yielded a success rate of 44% for the top prediction, compared with 20% for ADCP. In addition, HPEPDOCK2.0 was also validated on two other test sets of 10 and 11 complexes with apo and predicted receptor structures, respectively. HPEPDOCK2.0 is computationally efficient and the average running time for docking a cyclic peptide is about 34 minutes on a single CPU core, compared with 496 minutes for ADCP. HPEPDOCK2.0 will facilitate the study of the interaction between cyclic peptides and proteins and the development of therapeutic cyclic peptide drugs.
Availability and implementation: http://huanglab.phys.hust.edu.cn/hpepdock/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

IIFDTI: predicting drug-target interactions through interactive and independent features based on attention mechanism

Fri, 08/07/2022 - 5:30am
Abstract
Motivation: Identifying drug-target interactions (DTIs) is a crucial step for drug discovery and design. Traditional biochemical experiments can reliably validate drug-target interactions. However, they are also extremely laborious, time-consuming, and expensive. With the collection of more validated biomedical data and the advancement of computing technology, computational methods based on chemogenomics, which can guide experimental verification, are gradually attracting more attention.
Results: In this study, we propose an end-to-end deep learning-based method named IIFDTI to predict DTIs based on independent features of drug-target pairs and interactive features of their substructures. First, the interactive features of substructures between drugs and targets are extracted by the bidirectional encoder-decoder architecture. The independent features of drugs and targets are extracted by the graph neural networks and convolutional neural networks, respectively. Then, all extracted features are fused and inputted into fully connected dense layers in downstream tasks for predicting DTIs. IIFDTI takes into account the independent features of drugs/targets and simulates the interactive features of the substructures from the biological perspective. Multiple experiments show that IIFDTI outperforms the state-of-the-art methods in terms of AUC, AUPR, precision, and recall on benchmark datasets. In addition, the mapped visualizations of attention weights indicate that IIFDTI has learned biologically meaningful knowledge, and two case studies illustrate the capabilities of IIFDTI in practical applications.
Availability and implementation: The codes of IIFDTI are available at https://github.com/czjczj/IIFDTI.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

StabilitySort: assessment of protein stability changes on a genome-wide scale to prioritise potentially pathogenic genetic variation

Fri, 08/07/2022 - 5:30am
Abstract
Summary: Missense mutations that change protein stability are strongly associated with human genetic disease. With the recent availability of predicted structures for all human proteins generated using the AlphaFold2 prediction model, genome-wide assessment of the stability effects of genetic variation can, for the first time, be easily performed. This facilitates the interrogation of personal genetic variation for potentially pathogenic effects through the application of stability metrics. Here, we present a novel tool to prioritise variants predicted to cause strong instability in essential proteins. We show that by filtering by ΔΔG values and then prioritising by StabilitySort Z-scores, we are able to more accurately discriminate pathogenic, protein-destabilising mutations from population variation, compared with other mutation effect predictors.
Availability and implementation: StabilitySort is available as a web service (https://www.stabilitysort.org), as a data download for integration with other tools (https://www.stabilitysort.org/download) or can be deployed as a standalone system from source code (https://gitlab.com/baaron/StabilitySort).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
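
The two-step procedure named in the StabilitySort entry, filtering by ΔΔG and then ranking by Z-score, amounts to a filter followed by a sort. The sketch below shows only that shape. It is not StabilitySort's code or data format: the `Variant` fields, the cutoff value, and the sign conventions (larger ΔΔG meaning more destabilising, larger Z-score meaning higher priority) are assumptions, and the variants are invented.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    ddg: float      # predicted stability change; assumed: larger = more destabilising
    z_score: float  # assumed: larger = higher priority


def prioritise(variants, ddg_cutoff=1.0):
    """Keep variants at or above the ddG cutoff, then rank them by Z-score, highest first."""
    kept = [v for v in variants if v.ddg >= ddg_cutoff]
    return sorted(kept, key=lambda v: v.z_score, reverse=True)


# Hypothetical variants with made-up values.
variants = [
    Variant("v1", ddg=2.5, z_score=1.2),
    Variant("v2", ddg=0.3, z_score=3.0),  # high Z-score but fails the ddG filter
    Variant("v3", ddg=1.8, z_score=2.4),
]
ranked = [v.name for v in prioritise(variants)]
```

Applying the filter first means a variant with a high Z-score but a small predicted stability change (v2 here) never reaches the ranking step.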

CondiS Web App: Imputation of Censored Lifetimes for Machine Learning-Based Survival Analysis

Fri, 08/07/2022 - 5:30am
Abstract
Summary: In the era of big data, machine learning techniques are widely applied to every area in biomedical research including survival analysis. It’s well recognized that censoring, which is a common missing-data issue in survival time data, hampers the direct usage of these machine learning techniques. Here we present CondiS, a web toolkit with a graphical user interface to help impute the survival times for censored observations and predict the survival times for future enrolled patients. CondiS imputes a censored survival time based on its distribution conditional on its observed part. When covariates are available, CondiS-X incorporates this information to further increase the imputation accuracy. Users can also upload data of newly enrolled patients and predict their survival times. As the first web-app tool with an imputation function for censored lifetime data, CondiS web can facilitate conducting survival analysis with machine learning approaches.
Availability: CondiS is an open-source application implemented with Shiny in R, available free at: https://biostatistics.mdanderson.org/shinyapps/CondiS/.
Categories: Bioinformatics Trends

SLPred: A Multi-view Subcellular Localization Prediction Tool for Multi-location Human Proteins

Fri, 08/07/2022 - 5:30am
Abstract
Summary: Accurate prediction of the subcellular locations of proteins is a critical topic in protein science. In this study, we present SLPred, an ensemble-based multi-view and multi-label protein subcellular localization prediction tool. For a query protein sequence, SLPred provides predictions for nine main subcellular locations using independent machine learning models trained for each location. We used UniProtKB/Swiss-Prot human protein entries and their curated subcellular location (SL) annotations as our source data. We connected all disjoint terms in the UniProt SL hierarchy based on the corresponding term relationships in the cellular component category of Gene Ontology, and constructed a training dataset that is both reliable and large-scale using the re-organized hierarchy. We tested SLPred on multiple benchmarking datasets including our in-house sets, and compared its performance against six state-of-the-art methods. Results indicated that SLPred outperforms other tools in the majority of cases.
Availability: SLPred is available both as an open-access and user-friendly web-server (https://slpred.kansil.org) and a stand-alone tool (https://github.com/kansil/SLPred). All datasets used in this study are also available at https://slpred.kansil.org
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

DEMoS: A Deep Learning-based Ensemble Approach for Predicting the Molecular Subtypes of Gastric Adenocarcinomas from Histopathological Images

Fri, 08/07/2022 - 5:30am
Abstract
Motivation: The molecular subtyping of gastric cancer (adenocarcinoma) into four main subtypes based on integrated multiomics profiles, as proposed by The Cancer Genome Atlas (TCGA) initiative, represents an effective strategy for patient stratification. However, this approach requires the use of multiple technological platforms, and is quite expensive and time consuming to perform. A computational approach that uses histopathological image data to infer molecular subtypes could be a practical, cost- and time-efficient complementary tool for prognostic and clinical management purposes.
Results: Here, we propose a deep learning-based ensemble learning approach (called DEMoS) capable of predicting the four recognized molecular subtypes of gastric cancer directly from histopathological images. Using an independent test dataset, DEMoS achieved tile-level area under the receiver-operating characteristic curve (AUROC) values of 0.785, 0.668, 0.762, and 0.811 for the prediction of these four subtypes of gastric cancer, i.e. (1) Epstein-Barr virus (EBV)-infected, (2) microsatellite instability (MSI), (3) genomically-stable (GS), and (4) chromosomally unstable tumors (CIN), respectively. At the patient-level, it achieved AUROC values of 0.897, 0.764, 0.890, and 0.898, respectively. Thus, these four subtypes are well-predicted by DEMoS. Benchmarking experiments further suggest that DEMoS is able to achieve an improved classification performance for image-based subtyping and prevent model overfitting. This study highlights the feasibility of using a deep learning ensemble-based method to rapidly and reliably subtype gastric cancer (adenocarcinoma) solely using features from histopathological images.
Availability: All WSIs used in this study were collected from the TCGA database. This study builds upon our previously published HEAL framework, with related documentation and tutorials available at http://heal.erc.monash.edu.au. The source code and related models are freely accessible at https://github.com/Docurdt/DEMoS.git.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
