Subscribe to Bioinformatics Oxford Journals feed

Neural Collective Matrix Factorization for Integrated Analysis of Heterogeneous Biomedical Data

Fri, 05/08/2022 - 5:30am
Abstract
Motivation: In many biomedical studies, there arises the need to integrate data from multiple directly or indirectly related sources. Collective matrix factorization (CMF) and its variants are models designed to collectively learn from arbitrary collections of matrices. The latent factors learnt are rich integrative representations that can be used in downstream tasks such as clustering or relation prediction with standard machine learning models. Previous CMF-based methods have numerous modeling limitations. They do not adequately capture complex non-linear interactions and do not explicitly model varying sparsity and noise levels in the inputs, and some cannot model inputs with multiple datatypes. These inadequacies limit their use on many biomedical datasets.
Results: To address these limitations, we develop Neural Collective Matrix Factorization (NCMF), the first fully neural approach to CMF. We evaluate NCMF on relation prediction tasks of gene-disease association prediction and adverse drug event prediction, using multiple datasets. In each case, data is obtained from heterogeneous publicly available databases, and used to learn representations to build predictive models. NCMF is found to outperform previous CMF-based methods and several state-of-the-art graph embedding methods for representation learning in our experiments. Our experiments illustrate the versatility and efficacy of NCMF in representation learning for seamless integration of heterogeneous data.
Availability: https://github.com/ajayago/NCMF_bioinformatics
Categories: Bioinformatics Trends

Hierarchical deep learning for predicting GO annotations by integrating protein knowledge

Fri, 05/08/2022 - 5:30am
Abstract
Motivation: Experimental testing and manual curation are the most precise ways of assigning Gene Ontology (GO) terms describing protein functions. However, they are expensive, time-consuming, and cannot cope with the exponential growth of data generated by high-throughput sequencing methods. Hence, researchers need reliable computational systems to help fill the gap with automatic function prediction. The results of the last Critical Assessment of Function Annotation challenge revealed that GO term prediction remains a very challenging task. Recent developments in deep learning are pushing back the frontiers of protein research, leading to new knowledge thanks to the integration of data from multiple sources. However, the deep models developed so far for functional prediction focus mainly on sequence data and have not yet achieved breakthrough performance.
Results: We propose DeeProtGO, a novel deep learning model for predicting GO annotations by integrating protein knowledge. DeeProtGO was trained to solve 18 different prediction problems, defined by the three GO sub-ontologies, the type of proteins, and the taxonomic kingdom. Our experiments showed higher prediction quality when more protein knowledge is integrated. We also benchmarked DeeProtGO against state-of-the-art methods on public datasets, and showed it can effectively improve the prediction of GO annotations.
Availability: DeeProtGO and a use case are available at https://github.com/gamerino/DeeProtGO
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Prediction of Gene Co-expression from Chromatin Contacts with Graph Attention Network

Fri, 05/08/2022 - 5:30am
Abstract
Motivation: The technology of high-throughput chromatin conformation capture (Hi-C) allows genome-wide measurement of chromatin interactions. Several studies have shown statistically significant relationships between gene-gene spatial contacts and their co-expression. It is desirable to uncover epigenetic mechanisms of transcriptional regulation behind such relationships using computational modeling. Existing methods for predicting gene co-expression from Hi-C data use manual feature engineering or unsupervised learning, which either limits the prediction accuracy or lacks interpretability.
Results: To address these issues, we propose HiCoEx, a novel end-to-end framework for explainable prediction of gene co-expression from Hi-C data based on a graph neural network. We apply a graph attention mechanism to a gene contact network inferred from Hi-C data to distinguish the importance of the different neighboring genes of each gene, and learn gene representations to predict co-expression in a supervised and task-specific manner. Then, from the trained model, we extract the learned gene embeddings as a model interpretation to distill biological insights. Experimental results show that HiCoEx can learn gene representations from 3D genomics signals automatically to improve prediction accuracy, and make the black box model explainable by capturing some biologically meaningful patterns, e.g., in a gene contact network, the common neighbors of two central genes might contribute to the co-expression of the two central genes through sharing enhancers.
Availability: The source code is freely available at https://github.com/JieZheng-ShanghaiTech/HiCoEx
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

DBFE: Distribution-based feature extraction from structural variants in whole-genome data

Fri, 05/08/2022 - 5:30am
Abstract
Motivation: Whole-genome sequencing has revolutionized biosciences by providing tools for constructing complete DNA sequences of individuals. With entire genomes at hand, scientists can pinpoint DNA fragments responsible for oncogenesis and predict patient responses to cancer treatments. Machine learning plays a paramount role in this process. However, the sheer volume of whole-genome data makes it difficult to encode the characteristics of genomic variants as features for learning algorithms.
Results: In this paper, we propose three feature extraction methods that facilitate classifier learning from sets of genomic variants. The core contributions of this work include: (1) strategies for determining features using variant length binning, clustering, and density estimation; (2) a programming library for automating distribution-based feature extraction in machine learning pipelines. The proposed methods have been validated on five real-world datasets using four different classification algorithms and a clustering approach. Experiments on genomes of 219 ovarian, 61 lung, and 929 breast cancer patients show that the proposed approaches automatically identify genomic biomarkers associated with cancer subtypes and clinical response to oncological treatment. Finally, we show that the extracted features can be used alongside unsupervised learning methods to analyze genomic samples.
Availability: The source code of the presented algorithms and reproducible experimental scripts are available on GitHub at https://github.com/MNMdiagnostics/dbfe
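The variant length binning strategy named in the abstract can be illustrated with a minimal Python sketch. This is not the dbfe library's API: the function name `length_bin_features`, the bin edges, and the toy variant lengths are all hypothetical, and the sketch assumes a sample is summarized simply as the number of variants falling into each length bin.

```python
from bisect import bisect_right


def length_bin_features(variant_lengths, bin_edges):
    """Count variants per length bin (hypothetical helper, not the dbfe API).

    bin_edges must be sorted ascending. With edges [e0, e1, ...] the bins are
    (-inf, e0), [e0, e1), ..., [e_last, +inf), so len(bin_edges) + 1 features
    are returned.
    """
    counts = [0] * (len(bin_edges) + 1)
    for length in variant_lengths:
        # bisect_right returns the index of the bin the length falls into
        counts[bisect_right(bin_edges, length)] += 1
    return counts


# Toy example: lengths (in base pairs) of structural variants in one sample
sample_lengths = [50, 120, 900, 5000, 20000, 1000]
edges = [100, 1000, 10000]  # assumed bin boundaries, for illustration only
features = length_bin_features(sample_lengths, edges)
print(features)  # [1, 2, 2, 1]
```

Each sample then becomes a fixed-length count vector that a standard classifier can consume, regardless of how many variants the sample contains.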
Categories: Bioinformatics Trends

ASURAT: Functional annotation-driven unsupervised clustering of single-cell transcriptomes

Thu, 04/08/2022 - 5:30am
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) analysis reveals heterogeneity and dynamic cell transitions. However, conventional gene-based analyses require intensive manual curation to interpret biological implications of computational results. Hence, a theory for efficiently annotating individual cells remains warranted.
Results: We present ASURAT, a computational tool for simultaneously performing unsupervised clustering and functional annotation of disease, cell type, biological process, and signaling pathway activity for single-cell transcriptomic data, using a correlation graph decomposition for genes in database-derived functional terms. We validated the usability and clustering performance of ASURAT using scRNA-seq datasets for human peripheral blood mononuclear cells, which required fewer manual curations than existing methods. Moreover, we applied ASURAT to scRNA-seq and spatial transcriptome datasets for human small cell lung cancer and pancreatic ductal adenocarcinoma, respectively, identifying previously overlooked subpopulations and differentially expressed genes. ASURAT is a powerful tool for dissecting cell subpopulations and improving biological interpretability of complex and noisy transcriptomic data.
Availability: ASURAT is published on Bioconductor (DOI: 10.18129/B9.bioc.ASURAT). The code for analyzing data in this article is available at GitHub (https://github.com/keita-iida/ASURATBI) or figshare (https://doi.org/10.6084/m9.figshare.19200254.v3).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

DiNAMIC.Duo: detecting somatic DNA copy number differences without a normal reference

Thu, 04/08/2022 - 5:30am
Abstract
Motivation: Somatic DNA copy number alterations (CNAs) arise in tumor tissue because of underlying genomic instability. Recurrent CNAs that occur in the same genomic region across multiple independent samples are of interest to researchers because they may contain genes that contribute to the cancer phenotype. However, differences in copy number states between cancers are also commonly of interest, for example when comparing tumors with distinct morphologies in the same anatomic location. Current methodologies are limited by their inability to perform direct comparisons of CNAs between tumor cohorts, and thus they cannot formally assess the statistical significance of observed copy number differences or identify regions of the genome where these differences occur.
Results: We introduce the DiNAMIC.Duo R package that can be used to identify recurrent copy number alterations in a single cohort or recurrent copy number differences between two cohorts, including when neither cohort is copy neutral. The package utilizes Python scripts for computational efficiency and provides functionality for producing figures and summary output files.
Availability: The DiNAMIC.Duo R package is available from CRAN at https://cran.r-project.org/web/packages/DiNAMIC.Duo/index.html.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

3D GAN image synthesis and dataset quality assessment for bacterial biofilm

Thu, 04/08/2022 - 5:30am
Abstract
Motivation: Data-driven deep learning techniques usually require a large quantity of labeled training data to achieve reliable solutions in bioimage analysis. However, noisy image conditions and high cell density in bacterial biofilm images make 3D cell annotations difficult to obtain. Alternatively, data augmentation via synthetic data generation is attempted, but current methods fail to produce realistic images.
Results: This paper presents a bioimage synthesis and assessment workflow with application to augment bacterial biofilm images. 3D cyclic generative adversarial networks (GAN) with unbalanced cycle consistency loss functions are exploited in order to synthesize 3D biofilm images from binary cell labels. Then, a stochastic synthetic dataset quality assessment (SSQA) measure that compares statistical appearance similarity between random patches from random images in two datasets is proposed. Both SSQA scores and other existing image quality measures indicate that the proposed 3D Cyclic GAN, along with the unbalanced loss function, provides a reliably realistic (as measured by mean opinion score) 3D synthetic biofilm image. In 3D cell segmentation experiments, a GAN-augmented training model also presents more realistic signal-to-background intensity ratio and improved cell counting accuracy.
Availability and Implementation: https://github.com/jwang-c/DeepBiofilm.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CPRiL: Compound-Protein Relationships in Literature

Wed, 03/08/2022 - 5:30am
Abstract
Newly discovered functional relationships of (bio-)molecules are a key component in molecular biology and life science research. Especially in the drug discovery field, knowledge of how small molecules associate with proteins plays a fundamental role in understanding how drugs or metabolites can affect cells, tissues, and human metabolism. Finding relevant information about these relationships among the huge number of published articles is becoming increasingly challenging and time-consuming. On average, more than 25,000 new (bio-)medical articles are added to the literature database PubMed weekly. In this work, we present a new web server (CPRiL) which provides information on functional relationships between small molecules and proteins in literature. Currently, CPRiL contains ∼465,000 unique names and synonyms of small molecules, ∼100,000 unique proteins, and more than 9 million described functional relationships between these entities. The applied BioBERT machine learning model for the determination of functional relationships between small molecules and proteins in texts was extensively trained and tested. On a related benchmark, CPRiL yielded a high performance, with an F1-score of 84.3%, precision of 82.9%, and recall of 85.7%.
Availability: CPRiL is freely available at https://www.pharmbioinf.uni-freiburg.de/cpril.
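The three reported metrics are related: the F1-score is the harmonic mean of precision and recall. The short Python check below uses only the figures quoted in the abstract; the function name `f1_score` is chosen here for illustration and is not part of CPRiL.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract
f1 = f1_score(0.829, 0.857)
print(f"{f1:.3f}")  # 0.843
```

The result agrees with the reported F1-score of 84.3%.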
Categories: Bioinformatics Trends

Identification of bacteriophage genome sequences with representation learning

Wed, 03/08/2022 - 5:30am
Abstract
Motivation: Bacteriophages (phages) are viruses that infect and replicate within bacteria and archaea. Phage therapy offers another potential solution to antibiotic resistance, which is one of the threats to global health. To develop phage therapies, the identification of phages from metagenome sequences is the first step. Currently, there are two main methods for identifying phages: database-based (alignment-based) methods and alignment-free methods. Database-based methods typically use a large number of sequences as references; alignment-free methods usually learn the features of the sequences with machine learning and deep learning models.
Results: We propose INHERIT, which uses a deep representation learning model to integrate both database-based and alignment-free methods, combining the strengths of both. Pre-training is used as an alternative way of acquiring knowledge representations from existing databases, while the BERT-style deep learning framework retains the advantage of alignment-free methods. We compare INHERIT with four existing methods on a third-party benchmark dataset. Our experiments show that INHERIT achieves better performance, with an F1-score of 0.9932. In addition, we find that pre-training two species separately helps the non-alignment deep learning model make more accurate predictions.
Availability: The code of INHERIT is available at https://github.com/Celestial-Bai/INHERIT.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

PanExplorer: A web-based tool for exploratory analysis and visualization of bacterial pan-genomes

Tue, 02/08/2022 - 5:30am
Abstract
Motivation: Pan-genome approaches are widely employed for bacterial comparative genomics and evolution analyses but remain difficult for non-bioinformatician biologists to carry out, so there is a need for an innovative tool facilitating the exploration of bacterial pan-genomes.
Results: PanExplorer is a web application providing various genomic analyses and reports, giving intuitive views that enable a better understanding of bacterial pan-genomes. As an example, we produced the pan-genome for 121 Anaplasmataceae strains (including 30 Ehrlichia, 15 Anaplasma, 68 Wolbachia).
Availability and implementation: PanExplorer is written in Perl CGI and relies on several JavaScript libraries for visualization (hotmap.js, MauveViewer, CircosJS). It is freely available at http://panexplorer.southgreen.fr. The source code has been released in a GitHub repository https://github.com/SouthGreenPlatform/PanExplorer. A documentation section is available on the PanExplorer website.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Quasi-Entropy Closure: A Fast and Reliable Approach to Close the Moment Equations of the Chemical Master Equation

Tue, 02/08/2022 - 5:30am
Abstract
Motivation: The Chemical Master Equation is a stochastic approach to describe the evolution of a (bio)chemical reaction system. Its solution is a time-dependent probability distribution on all possible configurations of the system. As this number is typically large, the Master Equation is often practically unsolvable. The Method of Moments reduces the system to the evolution of a few moments, which are described by ordinary differential equations. Those equations are not closed, since lower order moments generally depend on higher order moments. Various closure schemes have been suggested to solve this problem. Two major problems with these approaches are first that they are open loop systems, which can diverge from the true solution, and second, some of them are computationally expensive.
Results: Here we introduce Quasi-Entropy Closure, a moment closure scheme for the Method of Moments. It estimates higher order moments by reconstructing the distribution that minimizes the distance to a uniform distribution subject to lower order moment constraints. Quasi-Entropy Closure can be regarded as an advancement of Zero-Information Closure, which similarly maximizes the information entropy. Results show that both approaches outperform truncation schemes. Quasi-Entropy Closure is computationally much faster than Zero-Information Closure, although both methods consider solutions on the space of configurations and hence do not completely overcome the curse of dimensionality. In addition, our scheme includes a plausibility check for the existence of a distribution satisfying a given set of moments on the feasible set of configurations. All results are evaluated on different benchmark problems.
Supplementary information: Supplementary data are available at Bioinformatics online.
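A small worked example may help show why the moment equations are not closed. It is a generic textbook-style illustration rather than one of the paper's benchmarks: the single-species reaction 2X → ∅, the copy number x, the rate constant c, and the propensity c·x(x−1)/2 are all assumptions chosen here. Each firing removes two molecules, so averaging the change in x over the propensity gives

```latex
\frac{d\langle x \rangle}{dt}
  = \Big\langle -2 \cdot \tfrac{c}{2}\, x(x-1) \Big\rangle
  = -c \left( \langle x^2 \rangle - \langle x \rangle \right)
```

The equation for the first moment ⟨x⟩ already involves the second moment ⟨x²⟩, and the equation for ⟨x²⟩ in turn involves ⟨x³⟩. A closure scheme supplies an estimate of the first moment that falls outside the retained set so that the hierarchy can be cut off.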
Categories: Bioinformatics Trends

Identifying cellular cancer mechanisms through pathway-driven data integration

Tue, 02/08/2022 - 5:30am
Abstract
Motivation: Cancer is a genetic disease in which accumulated mutations of driver genes induce a functional reorganisation of the cell by reprogramming cellular pathways. Current approaches identify cancer pathways as those most internally perturbed by gene expression changes. However, driver genes characteristically perform hub roles between pathways. Therefore, we hypothesise that cancer pathways should be identified by changes in their pathway-pathway relationships.
Results: To learn an embedding space that captures the relationships between pathways in a healthy cell, we propose pathway-driven non-negative matrix tri-factorisation (PNMTF). In this space, we determine condition-specific (i.e., diseased and healthy) embeddings of pathways and genes. Based on these embeddings, we define our ‘NMTF centrality’ to measure a pathway’s or gene’s functional importance, and our ‘moving-distance’, to measure the change in its functional relationships. We combine both measures to predict 15 genes and pathways involved in four major cancers, predicting 60 gene-cancer associations in total, covering 28 unique genes. To further exploit driver genes’ tendency to perform hub roles, we model our network data using graphlet-adjacency, which considers nodes adjacent if their interaction patterns form specific shapes (e.g., paths or triangles). We find that the predicted genes rewire pathway-pathway interactions in the immune system and provide literature evidence that many are druggable (15/28) and implicated in the associated cancers (47/60). We predict six druggable cancer-specific drug targets.
Availability: The source code is available at: https://gitlab.bsc.es/swindels/pathway_driven_nmtf
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Multidrug Representation Learning Based on Pretraining Model and Molecular Graph for Drug Interaction and Combination Prediction

Fri, 29/07/2022 - 5:30am
Abstract
Motivation: Approaches for the diagnosis and treatment of diseases often adopt multidrug therapy because it can increase the efficacy or reduce the toxic side effects of drugs. Using different drugs simultaneously may trigger unexpected pharmacological effects. Therefore, efficient identification of drug interactions is essential for the treatment of complex diseases. Currently proposed computational methods are often limited by the collection of redundant drug features, a small amount of labeled data, and low model generalization capabilities. Meanwhile, there is also a lack of dedicated methods for multidrug representation learning, which makes it more difficult to take full advantage of the already scarce data.
Results: Inspired by graph models and pretraining models, we integrated a large amount of unlabeled drug molecular graph information and target information, then designed a pretraining framework, MGP-DR (Molecular Graph Pretraining for Drug Representation), specifically for drug pair representation learning. The model uses self-supervised learning strategies to mine the contextual information within and between drug molecules to predict drug–drug interactions and drug combinations. The results achieved promising performance across multiple metrics compared with other state-of-the-art methods. Our MGP-DR model can be used to provide a reliable candidate set for the combined use of multiple drugs.
Availability and implementation: Code of the model, datasets and results can be downloaded from GitHub (https://github.com/LiangYu-Xidian/MGP-DR)
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

tmVar 3.0: an improved variant concept recognition and normalization tool

Fri, 29/07/2022 - 5:30am
Abstract
Motivation: Previous studies have shown that automated text-mining tools are becoming increasingly important for successfully unlocking variant information in scientific literature at large scale. Despite multiple attempts in the past, existing tools are still of limited recognition scope and precision.
Results: We propose tmVar 3.0: an improved variant recognition and normalization system. Compared to its predecessors, tmVar 3.0 recognizes a wider spectrum of variant-related entities (e.g., allele and copy number variants), and groups together different variant mentions belonging to the same genomic sequence position in an article for improved accuracy. Moreover, tmVar 3.0 provides advanced variant normalization options such as allele-specific identifiers from the ClinGen Allele Registry. tmVar 3.0 exhibits state-of-the-art performance with over 90% in F-measure for variant recognition and normalization, when evaluated on three independent benchmarking datasets. tmVar 3.0 as well as annotations for the entire PubMed and PMC datasets are freely available for download.
Availability: https://github.com/ncbi/tmVar3
Categories: Bioinformatics Trends

MIB2: Metal ion-binding site prediction and modeling server

Fri, 29/07/2022 - 5:30am
Abstract
Motivation: MIB2 attempts to overcome the limitation of structure-based prediction approaches, with many proteins lacking a solved structure. MIB2 also offers more accurate prediction performance and more metal ion types.
Results: MIB2 utilizes both the (PS)2 method and the AlphaFold Protein Structure Database to acquire predicted structures to perform metal ion docking and predict binding residues. MIB2 offers marked improvements over MIB by collecting more metal ion-binding residue templates and using the metal ion type-specific scoring function. It offers a total of 18 types of metal ions for binding site predictions.
Availability: Freely available on the web at http://bioinfo.cmu.edu.tw/MIB2/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

The K-mer File Format: a standardized and compact disk representation of sets of k-mers

Fri, 29/07/2022 - 5:30am
Abstract
Summary: Bioinformatics applications increasingly rely on ad-hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here we introduce the K-mer File Format (KFF) as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5x compared to other formats, and bringing interoperability across tools.
Availability: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/
Categories: Bioinformatics Trends

ViReMaShiny: An Interactive Application for Analysis of Viral Recombination Data

Fri, 29/07/2022 - 5:30am
Abstract
Motivation: Recombination is an essential driver of virus evolution and adaption, giving rise to new chimeric viruses, structural variants, sub-genomic RNAs, and Defective-RNAs. Next-Generation Sequencing of virus samples, either from experimental or clinical settings, has revealed a complex distribution of recombination events that contributes to the intrahost diversity. We and others have previously developed alignment tools to discover and map these diverse recombination events in NGS data. However, there is no standard for data visualization to contextualize events of interest and downstream analysis often requires bespoke coding.
Results: We present ViReMaShiny, a web-based application built using the R Shiny framework to allow interactive exploration and point-and-click visualization of viral recombination data provided in BED format generated by computational pipelines such as ViReMa (Viral-Recombination-Mapper).
Availability: The application is hosted at https://routhlab.shinyapps.io/ViReMaShiny/ with associated documentation at https://jayeung12.github.io/. Code is available at https://github.com/routhlab/ViReMaShiny.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

SECEDO: SNV-based subclone detection using ultra-low coverage single-cell DNA sequencing

Thu, 28/07/2022 - 5:30am
Abstract
Motivation: Several recently developed single-cell DNA sequencing technologies enable whole-genome sequencing of thousands of cells. However, the ultra-low coverage of the sequenced data (< 0.05x per cell) mostly limits their usage to the identification of copy number alterations in multi-megabase segments. Many tumors are not copy number-driven, and thus single-nucleotide variant (SNV)-based subclone detection may contribute to a more comprehensive view on intra-tumor heterogeneity. Due to the low coverage of the data, the identification of SNVs is only possible when superimposing the sequenced genomes of hundreds of genetically similar cells. Thus, we have developed a new approach to efficiently cluster tumor cells based on a Bayesian filtering approach of relevant loci and exploiting read overlap and phasing.
Results: We developed Single Cell Data Tumor Clusterer (SECEDO, lat. ‘to separate’), a new method to cluster tumor cells based solely on SNVs, inferred on ultra-low coverage single-cell DNA sequencing data. We applied SECEDO to a synthetic dataset simulating 7,250 cells and eight tumor subclones from a single patient and were able to accurately reconstruct the clonal composition, detecting 92.11% of the somatic SNVs, with the smallest clusters representing only 6.9% of the total population. When applied to five real single-cell sequencing datasets from a breast cancer patient, each consisting of ≈ 2,000 cells, SECEDO was able to recover the major clonal composition in each dataset at the original coverage of 0.03x, achieving an Adjusted Rand Index (ARI) score of ≈ 0.6. The current state-of-the-art SNV-based clustering method achieved an ARI score of ≈ 0, even after merging cells to create higher coverage data (factor 10 increase), and was only able to match SECEDO’s performance when pooling data from all five datasets, in addition to artificially increasing the sequencing coverage by a factor of 7. Variant calling on the resulting clusters recovered more than twice as many SNVs as would have been detected if calling on all cells together. Further, the allelic ratio of the called SNVs on each subcluster was more than double relative to the allelic ratio of the SNVs called without clustering, thus demonstrating that calling variants on subclones, in addition to both increasing sensitivity of SNV detection and attaching SNVs to subclones, significantly increases the confidence of the called variants.
Availability: SECEDO is implemented in C++ and is publicly available at https://github.com/ratschlab/secedo. Instructions to download the data and the evaluation code to reproduce the findings in this paper are available at: https://github.com/ratschlab/secedo-evaluation. The code and data of the submitted version are archived at: https://doi.org/10.5281/zenodo.6516955.
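For readers unfamiliar with the Adjusted Rand Index used to score the clusterings above, the following is a minimal pure-Python sketch of the standard pair-counting formula. It is not taken from SECEDO's evaluation code; the function name and the toy label lists are illustrative assumptions.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    # Pairs of items placed together in both labelings
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each labeling separately
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = together_true * together_pred / comb(n, 2)  # chance-level agreement
    max_index = (together_true + together_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (together_both - expected) / (max_index - expected)


# Same partition with swapped label names scores 1.0
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
# A partition that splits every true cluster scores below zero
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

An ARI of 1 means the two partitions are identical up to label names and a value near 0 means agreement no better than chance, which is how the reported scores of ≈ 0.6 and ≈ 0 should be read.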
Categories: Bioinformatics Trends

CoGO: a contrastive learning framework to predict disease similarity based on gene network and ontology structure

Thu, 28/07/2022 - 5:30am
Abstract
Motivation: Quantifying the similarity of human diseases provides guiding insights into the discovery of micro-scale mechanisms from a macro scale. Previous work demonstrated that better performance can be gained by integrating multi-view data sources or applying machine learning techniques. However, designing an efficient framework to extract and incorporate information from different biological data using deep learning models remains unexplored.
Results: We present CoGO, a Contrastive learning framework to predict disease similarity based on Gene network and Ontology structure, which incorporates the gene interaction network and gene ontology (GO) domain knowledge using graph deep learning models. First, graph deep learning models are applied to encode the features of genes and GO terms from separate graph structure data. Next, gene and GO features are projected to a common embedding space via a non-linear projection. Then cross-view contrastive loss is applied to maximize the agreement of corresponding gene-GO associations and yield meaningful gene representations. Finally, CoGO infers the similarity between diseases by the cosine similarity of disease representation vectors derived from related gene embeddings. In our experiments, CoGO outperforms the most competitive baseline method on both AUROC and AUPRC, with a 19.57% improvement in AUPRC (0.7733). The prediction results are comparable with those of other disease similarity studies and thus credible. Furthermore, we conduct a detailed case study of the top similar disease pairs, which are supported by other studies. Empirical results show that CoGO achieves strong performance on the disease similarity problem.
Availability: https://github.com/yhchen1123/CoGO.
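The final scoring step, cosine similarity between disease vectors derived from gene embeddings, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CoGO's code: the abstract does not say how gene embeddings are aggregated into a disease vector, so the sketch assumes a simple mean, and the gene names, disease names, and two-dimensional embeddings are made up.

```python
from math import sqrt


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (assumed aggregation)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


# Hypothetical learned gene embeddings and disease-gene associations
gene_embeddings = {"G1": [1.0, 0.0], "G2": [0.0, 1.0], "G3": [1.0, 1.0]}
disease_genes = {"disease_A": ["G1", "G2"], "disease_B": ["G3"], "disease_C": ["G1"]}

disease_vectors = {
    disease: mean_vector([gene_embeddings[g] for g in genes])
    for disease, genes in disease_genes.items()
}

sim_ab = cosine_similarity(disease_vectors["disease_A"], disease_vectors["disease_B"])
sim_ac = cosine_similarity(disease_vectors["disease_A"], disease_vectors["disease_C"])
print(round(sim_ab, 4), round(sim_ac, 4))  # 1.0 0.7071
```

Because cosine similarity ignores vector length, disease_A and disease_B score 1.0 here even though they share no genes: their aggregated vectors point in the same direction.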
Categories: Bioinformatics Trends

PLCOjs, a FAIR GWAS web SDK for the NCI Prostate, Lung, Colorectal, and Ovarian Cancer Genetic Atlas Project

Thu, 28/07/2022 - 5:30am
Abstract
Motivation: The Division of Cancer Epidemiology and Genetics (DCEG) and the Division of Cancer Prevention (DCP) at the National Cancer Institute (NCI) have recently generated genome-wide association study (GWAS) data for multiple traits in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Genomic Atlas project. The GWAS included 110,000 participants. The dissemination of the genetic association data through a data portal called GWAS Explorer, in a manner that addresses the modern expectations of FAIR reusability by data scientists and engineers, is the main motivation for the development of the open-source JavaScript Software Development Kit (SDK) reported here.
Results: The PLCO GWAS Explorer resource relies on a public stateless HTTP API deployed as the sole backend service for both the landing page’s web application and third-party analytical workflows. The core PLCOjs SDK is mapped to each of the API methods, and also to each of the reference graphic visualizations in the GWAS Explorer. A few additional visualization methods extend it. As is the norm with Web SDKs, no download or installation is needed and modularization supports targeted code injection for web applications, reactive notebooks (Observable) and node-based Web services.
Availability: Code at https://github.com/episphere/plco; project page at https://episphere.github.io/plco
Supplementary information: Tutorial at https://youtu.be/87dXT9YtbfY (17 mins).
Categories: Bioinformatics Trends
