statgenMPP: an R package implementing an IBD-based mixed model approach for QTL mapping in a wide range of multi-parent populations

Bioinformatics Oxford Journals - Tue, 04/10/2022 - 5:30am
Abstract
Motivation: Multi-parent populations (MPPs) are popular for QTL mapping because they combine wide genetic diversity in parents with easy control of population structure, but only a limited number of software tools for QTL mapping are specifically developed for general MPP designs.
Results: We developed an R package called statgenMPP, adopting a unified identity-by-descent (IBD)-based mixed model approach for QTL analysis in MPPs. The package offers easy-to-use functionality for IBD calculations, mixed model solutions, and visualizations for QTL mapping in a wide range of MPP designs, including diallel, nested-association mapping (NAM) populations, multi-parent advanced genetic inter-cross (MAGIC) populations and other complicated MPPs with known crossing schemes.
Availability: The R package statgenMPP is open-source and freely available on CRAN at https://CRAN.R-project.org/package=statgenMPP
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Learning Temporal Difference Embeddings for Biomedical Hypothesis Generation

Bioinformatics Oxford Journals - Tue, 04/10/2022 - 5:30am
Abstract
Motivation: Hypothesis Generation (HG) refers to the discovery of meaningful implicit connections between disjoint scientific terms, which is of great significance for drug discovery, prediction of drug side effects and precision treatment. More recently, a few initial studies have attempted to model the dynamic meaning of the terms or term pairs for HG. However, most existing methods still fail to accurately capture and utilize the dynamic evolution of scientific term relations.
Results: This paper proposes a novel Temporal Difference Embedding (TDE) learning framework to model the temporal difference information evolution of term-pair relations for predicting future interactions. Specifically, the HG problem is formulated as a future connectivity prediction task on a temporal sequence of a dynamic attributed graph. Our approach models both the local neighbor changes of the term-pairs and the changes of the global graph structure over time, learning local and global TDE of node-pairs, respectively. Future term-pair relations can be inferred in a recurrent network based on the local and global TDE. Experiments on three real-world biomedical term relationship datasets show the effectiveness and superiority of the proposed approach.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

wenda_gpu: fast domain adaptation for genomic data

Bioinformatics Oxford Journals - Tue, 04/10/2022 - 5:30am
Abstract
Motivation: Domain adaptation allows for development of predictive models even in cases with limited sample data. Weighted elastic net domain adaptation specifically leverages features of genomic data to maximize transferability, but the method is too computationally demanding to apply to many genome-sized datasets.
Results: We developed wenda_gpu, which uses GPyTorch to train models on genomic data within hours on a single GPU-enabled machine. We show that wenda_gpu returns comparable results to the original wenda implementation, and that it predicts cancer mutation status on small sample sizes better than regular elastic net.
Availability: wenda_gpu is available on GitHub at https://github.com/greenelab/wenda_gpu/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

WMSA: a novel method for multiple sequence alignment of DNA sequences

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Motivation: Multiple sequence alignment (MSA) is a fundamental problem in bioinformatics. The quality of the alignment affects downstream analysis. MAFFT uses the FFT method to search for homologous segments and uses them as anchors to divide the sequences, then aligns only the segments, which saves time and memory without overly reducing alignment quality. However, MAFFT becomes slow when the dataset is large.
Results: We developed a software tool, WMSA, which uses the divide-and-conquer method to split the sequences into clusters, aligns those clusters into profiles with the center star strategy, and then makes a progressive profile-profile alignment. The alignment is conducted by the compiled algorithms of MAFFT, K-Band with multithread parallelism. Our method can balance time, space and quality and performs better than MAFFT in test experiments on highly conserved datasets.
Availability and implementation: Source code is freely available at https://github.com/malabz/WMSA/, which is implemented in C/C++ and supported on Linux, and datasets are available at https://github.com/malabz/WMSA-dataset.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
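
The center star strategy mentioned in the WMSA abstract above can be illustrated with a minimal Python sketch. It only shows how a center sequence might be chosen inside one cluster: the sequence with the smallest total distance to all others. The plain edit distance, the function names and the toy cluster are assumptions made for illustration; they are not taken from WMSA, which relies on compiled MAFFT and K-Band routines.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def pick_center(seqs: list[str]) -> int:
    """Index of the sequence with the smallest summed distance to all others."""
    totals = [0] * len(seqs)
    for i, j in combinations(range(len(seqs)), 2):
        d = edit_distance(seqs[i], seqs[j])
        totals[i] += d
        totals[j] += d
    return min(range(len(seqs)), key=totals.__getitem__)


# Hypothetical toy cluster of highly similar DNA sequences.
cluster = ["ACGTACGT", "ACGTTCGT", "ACGAACGT", "TCGTACGA"]
center = pick_center(cluster)
```

In a full center star alignment, every other sequence in the cluster would then be aligned pairwise to the chosen center and the pairwise alignments merged into one profile; that merging step is omitted here.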

GeneNetTools: Tests for Gaussian graphical models with shrinkage

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Background: Gaussian graphical models (GGMs) are network representations of random variables (as nodes) and their partial correlations (as edges). GGMs overcome the challenges of high-dimensional data analysis by using shrinkage methodologies. Therefore, they have become useful for reconstructing gene regulatory networks from gene expression profiles. However, it is often ignored that the partial correlations are ‘shrunk’ and that they cannot be compared or assessed directly. Accurate (differential) network analyses therefore need to account for the number of variables, the sample size, and the shrinkage value; otherwise the analysis and its biological interpretation would be biased. To date, there are no appropriate methods to account for these factors and address these issues.
Results: We derive the statistical properties of the partial correlation obtained with the Ledoit-Wolf shrinkage. Our result provides a toolbox for (differential) network analyses as i) confidence intervals, ii) a test for zero partial correlation (null-effects), and iii) a test to compare partial correlations. Our novel (parametric) methods account for the number of variables, the sample size, and the shrinkage values. Additionally, they are computationally fast, simple to implement, and require only basic statistical knowledge. Our simulations show that the novel tests perform better than DiffNetFDR, a recently published alternative, in terms of the trade-off between true and false positives. The methods are demonstrated on synthetic data and two gene expression datasets from Escherichia coli and Mus musculus.
Availability: The R package with the methods and the R script with the analysis are available at https://github.com/V-Bernal/GeneNetTools
Categories: Bioinformatics Trends
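
To make the notion of 'shrunk' partial correlations in the GeneNetTools abstract above concrete, here is a minimal Python sketch. It shrinks the sample correlation matrix towards the identity, inverts it, and rescales the precision matrix into partial correlations. The function name, the hand-picked shrinkage intensity and the random data are assumptions for illustration; Ledoit-Wolf-type estimators choose the intensity from the data, and the tests derived in the paper are not reproduced here.

```python
import numpy as np


def shrunk_partial_correlations(x: np.ndarray, lam: float) -> np.ndarray:
    """Partial correlations from a correlation matrix shrunk towards the identity.

    x   : samples-by-variables data matrix
    lam : shrinkage intensity in (0, 1]; fixed by hand here (an assumption)
    """
    r = np.corrcoef(x, rowvar=False)
    r_shrunk = (1.0 - lam) * r + lam * np.eye(r.shape[0])
    precision = np.linalg.inv(r_shrunk)   # invertible because lam > 0
    d = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(d, d)    # standard precision-to-pcor rescaling
    np.fill_diagonal(pcor, 1.0)
    return pcor


rng = np.random.default_rng(0)
data = rng.normal(size=(20, 50))          # fewer samples than variables
pcor = shrunk_partial_correlations(data, lam=0.3)
```

Without shrinkage the 50-by-50 sample correlation matrix built from 20 samples would be singular; the shrinkage term is what makes the inversion possible, and it is also why the resulting values depend on the shrinkage intensity.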

GBZ File Format for Pangenome Graphs

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Motivation: Pangenome graphs representing aligned genome assemblies are being shared in the text-based Graphical Fragment Assembly format. As the number of assemblies grows, there is a need for a file format that can store the highly repetitive data space-efficiently.
Results: We propose the GBZ file format based on data structures used in the Giraffe short read aligner. The format provides good compression, and the files can be efficiently loaded into in-memory data structures. We provide compression and decompression tools and libraries for using GBZ graphs, and we show that they can be efficiently used on a variety of systems.
Availability: C++ and Rust implementations are available at https://github.com/jltsiren/gbwtgraph and https://github.com/jltsiren/gbwt-rs, respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Integrating Phylogenetic and Functional Data in Microbiome Studies

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Motivation: Microbiome functional data are frequently analyzed to identify associations between microbial functions (e.g., genes) and sample groups of interest. However, it is challenging to distinguish between different possible explanations for variation in community-wide functional profiles by considering functions alone. To help address this problem, we have developed POMS, a package that implements multiple phylogeny-aware frameworks to more robustly identify enriched functions.
Results: The key contribution is an extended balance-tree workflow that incorporates functional and taxonomic information to identify functions that are consistently enriched in sample groups across independent taxonomic lineages. Our package also includes a workflow for running phylogenetic regression. Based on simulated data we demonstrate that these approaches more accurately identify gene families that confer a selective advantage compared with commonly used tools. We also show that POMS in particular can identify enriched functions in real-world metagenomics datasets that are potential targets of strong selection on multiple members of the microbiome.
Availability: These workflows are freely available in the POMS R package at https://github.com/gavinmdouglas/POMS.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CITEdb: a manually curated database of cell-cell interactions in human

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Motivation: The interactions among various types of cells play critical roles in cell functions and the maintenance of the entire organism. While cell-cell interactions are traditionally revealed from experimental studies, recent developments in single cell technologies combined with data mining methods have enabled computational prediction of cell-cell interactions, which has broadened our understanding of how cells work together and has important implications for therapeutic interventions targeting cell-cell interactions in cancers and other diseases. Despite this importance, to our knowledge there is no database for systematic documentation of high-quality cell-cell interactions at cell type level, which hinders the development of computational approaches to identify cell-cell interactions.
Results: We develop a publicly accessible database, CITEdb (Cell-cell InTEraction database, https://citedb.cn/), which not only facilitates interactive exploration of cell-cell interactions in specific physiological contexts (e.g., a disease or an organ), but also provides a benchmark dataset to interpret and evaluate computationally derived cell-cell interactions from different tools. CITEdb contains 728 pairs of cell-cell interactions in human that are manually curated. Each interaction is equipped with structured annotations including the physiological context, the ligand-receptor pairs that mediate the interaction, etc. Our database provides a web interface to search, visualize, and download cell-cell interactions. Users can search for cell-cell interactions by selecting the physiological context of interest or specific cell types involved. CITEdb is the first attempt to catalogue cell-cell interactions at cell type level, which is beneficial to experimental, computational, and clinical studies of cell-cell interactions.
Availability and implementation: CITEdb is freely available at https://citedb.cn/ and the R package implementing the benchmark is available at https://github.com/shanny01/benchmark.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Top-Down Crawl: A method for the ultra-rapid and motif-free alignment of sequences with associated binding metrics

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Summary: Several high-throughput protein–DNA binding methods currently available produce highly reproducible measurements of binding affinity at the level of the k-mer. However, understanding where a k-mer is positioned along a binding site sequence depends on alignment. Here we present Top-Down Crawl (TDC), an ultra-rapid tool designed for the alignment of k-mer level data in a rank-dependent and position weight matrix (PWM)-independent manner. As the framework only depends on the rank of the input, the method can accept input from many types of experiments (protein binding microarray, SELEX-seq, SMiLE-seq, etc.) without the need for specialized parameterization. Measuring the performance of the alignment using multiple linear regression with 5-fold cross-validation, we find TDC to perform as well as or better than computationally expensive PWM-based methods.
Availability and implementation: TDC can be run online at https://topdowncrawl.usc.edu or locally as a Python package available through pip at https://pypi.org/project/TopDownCrawl.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

mHapTk: A comprehensive toolkit for the analysis of DNA methylation haplotypes

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Summary: Bisulfite sequencing (BS-seq) remains the gold standard technique to detect DNA methylation profiles at single-nucleotide resolution. The DNA methylation status of CpG sites on the same fragment represents a discrete methylation haplotype (mHap). The mHap-level metrics were demonstrated to be promising cancer biomarkers and to explain more gene expression variation than average methylation. However, most existing tools focus on average methylation and neglect mHap patterns. Here, we present mHapTk, a comprehensive Python toolkit for the analysis of DNA methylation haplotypes. It calculates eight mHap-level summary statistics in predefined regions or across individual CpGs in a genome-wide manner. It identifies methylation haplotype blocks (MHBs), in which the methylation of pairwise CpGs is tightly correlated. Furthermore, mHap patterns can be visualized with the built-in functions in mHapTk or external tools such as IGV and deepTools.
Availability: https://jiantaoshi.github.io/mhaptk/index.html
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

SpaceX: Gene Co-expression Network Estimation for Spatial Transcriptomics

Bioinformatics Oxford Journals - Fri, 30/09/2022 - 5:30am
Abstract
Motivation: The analysis of spatially-resolved transcriptomes enables the understanding of the spatial interactions between the cellular environment and transcriptional regulation. In particular, the characterization of gene-gene co-expression at distinct spatial locations or cell types in the tissue enables delineation of spatial co-regulatory patterns as opposed to standard differential single gene analyses. To enhance the ability and potential of spatial transcriptomics technologies to drive biological discovery, we develop a statistical framework to detect gene co-expression patterns in a spatially structured tissue consisting of different clusters in the form of cell classes or tissue domains.
Results: We develop SpaceX (spatially dependent gene co-expression network), a Bayesian methodology to identify both shared and cluster-specific co-expression networks across genes. SpaceX uses an over-dispersed spatial Poisson model coupled with a high-dimensional factor model which is based on a dimension reduction technique for computational efficiency. We show, via simulations, accuracy gains in co-expression network estimation and structure by accounting for (increasing) spatial correlation and appropriate noise distributions. In-depth analysis of two spatial transcriptomics datasets in mouse hypothalamus and human breast cancer using SpaceX detected multiple hub genes which are related to cognitive abilities for the hypothalamus data and multiple cancer genes (e.g. collagen family) from the tumor region for the breast cancer data.
Availability and implementation: The SpaceX R-package is available at github.com/bayesrx/SpaceX.
Supplementary information: Supplementary data are available at bookdown.org/satwik91/SpaceX_supplementary/.
Categories: Bioinformatics Trends

SCAFE: a software suite for analysis of transcribed cis-regulatory elements in single cells

Bioinformatics Oxford Journals - Thu, 29/09/2022 - 5:30am
Abstract
Motivation: Cell type-specific activities of cis-regulatory elements (CRE) are central to understanding gene regulation and disease predisposition. Single-cell RNA 5’end sequencing (sc-end5-seq) captures the transcription start sites (TSS) which can be used as a proxy to measure the activity of transcribed CREs (tCREs). However, a substantial fraction of TSS identified from sc-end5-seq data may not be genuine due to various artifacts, hindering the use of sc-end5-seq for de novo discovery of tCREs.
Results: We developed SCAFE (Single Cell Analysis of Five-prime Ends), a software suite that processes sc-end5-seq data to de novo identify TSS clusters based on multiple logistic regression. It annotates tCREs based on the identified TSS clusters and generates a tCRE-by-cell count matrix for downstream analyses. The software suite consists of a set of flexible tools that could either be run independently or as pre-configured workflows.
Availability: SCAFE is implemented in Perl and R. The source code and documentation are freely available for download under the MIT License from https://github.com/chung-lab/SCAFE. Docker images are available from https://hub.docker.com/r/cchon/scafe. The submitted software version and test data are archived at https://doi.org/10.5281/zenodo.7023163 and https://doi.org/10.5281/zenodo.7024060 respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Homologue Series Detection and Management in LC-MS data with homologueDiscoverer

Bioinformatics Oxford Journals - Tue, 27/09/2022 - 5:30am
Abstract
Summary: Untargeted metabolomics data analysis is highly labor intensive and can be severely frustrated by both experimental noise and redundant features. Homologous polymer series are a particular case of features that can either represent large numbers of noise features, or alternatively represent features of interest with large peak redundancy. Here we present homologueDiscoverer, an R package which allows for the targeted and untargeted detection of homologue series as well as their evaluation and management using interactive plots and simple local database functionalities.
Availability: homologueDiscoverer is freely available on GitHub at https://github.com/kevinmildau/homologueDiscoverer.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Correction to: Integrative data semantics through a model-enabled data stewardship

Bioinformatics Oxford Journals - Fri, 23/09/2022 - 5:30am
This is a correction to: Philipp Wegner, Sebastian Schaaf, Mischa Uebachs, Daniel Domingo-Fernández, Yasamin Salimi, Stephan Gebel, Astghik Sargsyan, Colin Birkenbihl, Stephan Springstubbe, Thomas Klockgether, Juliane Fluck, Martin Hofmann-Apitius, Alpha Tom Kodamullil, Integrative data semantics through a model-enabled data stewardship, Bioinformatics, Volume 38, Issue 15, 1 August 2022, Pages 3850–3852, https://doi.org/10.1093/bioinformatics/btac375
Categories: Bioinformatics Trends

Computational modelling in health and disease: highlights of the 6th annual SysMod meeting

Bioinformatics Oxford Journals - Thu, 22/09/2022 - 5:30am
Abstract
Summary: The Community of Special Interest (COSI) in Computational Modelling of Biological Systems (SysMod) brings together interdisciplinary scientists interested in combining data-driven computational modelling, multi-scale mechanistic frameworks, large-scale -omics data and bioinformatics. SysMod’s main activity is an annual meeting at the Intelligent Systems for Molecular Biology (ISMB) conference, a meeting for computer scientists, biologists, mathematicians, engineers and computational and systems biologists. The 2021 SysMod meeting was conducted virtually due to the ongoing COVID-19 pandemic (coronavirus disease 2019). During the 2-day meeting, the development of computational tools, approaches and predictive models was discussed, along with their application to biological systems, emphasizing disease mechanisms. This report summarizes the meeting.
Availability and implementation: All resources and further information are freely accessible at https://sysmod.info.
Categories: Bioinformatics Trends

Predicting colorectal cancer tumor mutational burden from histopathological images and clinical information using multi-modal deep learning

Bioinformatics Oxford Journals - Wed, 21/09/2022 - 5:30am
Abstract
Motivation: Tumor mutational burden (TMB) is an indicator of the efficacy and prognosis of immune checkpoint therapy in colorectal cancer (CRC). Cancer patients with high TMB (TMB_H) values tend to benefit from immunotherapy, whereas those with low TMB (TMB_L) values tend not to. Though whole-exome sequencing (WES) is considered the gold standard for determining TMB, it is difficult to apply in clinical practice due to its high cost. There are also a few DNA panel-based methods to estimate TMB; however, their detection cost is also high, and the associated wet-lab experiments usually take days, which emphasizes the need for faster and cheaper alternatives.
Methods: In this study, we propose a multi-modal deep learning model based on a residual network (ResNet) and multi-modal compact bilinear pooling to predict TMB status (i.e., TMB_H or TMB_L) directly from histopathological images and clinical data. We applied the model to CRC data from The Cancer Genome Atlas and compared it with four other popular methods, namely, ResNet18, ResNet50, VGG19, and AlexNet. We tested different TMB thresholds, namely, percentiles of 10%, 14.3%, 15%, 16.3%, 20%, 30% and 50%, to differentiate TMB_H and TMB_L.
Results: For the percentile 14.3% (i.e., TMB value 20) and ResNet18, our model achieved an area under the receiver operating characteristic curve of 0.817 after five-fold cross-validation, which was better than that of other compared models. In addition, we also found that TMB values were significantly associated with tumor stage and N and M stages. Our study shows that deep learning models can predict TMB status from histopathological images and clinical information only, which is worth clinical application.
Categories: Bioinformatics Trends
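
The percentile thresholds listed in the TMB abstract above can be sketched in a few lines of Python. The sketch assumes that a percentile of p% means the p% of samples with the highest TMB are labelled TMB_H; the abstract does not spell this out, so that reading, the function name and the toy values are all assumptions.

```python
import numpy as np


def label_tmb(tmb_values: np.ndarray, top_percent: float) -> np.ndarray:
    """Label samples at or above the (100 - top_percent)th percentile as TMB_H."""
    cutoff = np.percentile(tmb_values, 100.0 - top_percent)
    return np.where(tmb_values >= cutoff, "TMB_H", "TMB_L")


# Hypothetical TMB values for ten samples (not data from the study).
tmb = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 40.0])
labels = label_tmb(tmb, top_percent=20.0)
n_high = int(np.sum(labels == "TMB_H"))
```

Sweeping `top_percent` over several values, as the study does with its seven thresholds, changes the class balance that a downstream classifier has to cope with.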

BioBulkFoundary: A customized webserver for exploring biosynthetic potentials of bulk chemicals

Bioinformatics Oxford Journals - Wed, 21/09/2022 - 5:30am
Abstract
Summary: Advances in metabolic engineering have boosted the production of bulk chemicals, resulting in large production volumes of some bulk chemicals at very low prices. The decrease in production cost and the overproduction of bulk chemicals make it necessary and desirable to explore the potential to synthesize higher-value products from them. It is also useful and important for society to explore the use of design methods involving synthetic biology to increase the economic value of these bulk chemicals. Therefore, we developed “BioBulkFoundary,” which provides an elaborate analysis of the biosynthetic potential of bulk chemicals based on the state-of-the-art exploration of pathways to synthesize value-added chemicals, along with an associated comprehensive technological and economic database, in a user-friendly framework.
Availability: Freely available on the web at http://design.rxnfinder.org/biobulkfoundary/
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Increasing confidence in proteomic spectral deconvolution through mass defect

Bioinformatics Oxford Journals - Wed, 21/09/2022 - 5:30am
Abstract
Motivation: Confident deconvolution of proteomic spectra is critical for several applications such as de novo sequencing, cross-linking mass spectrometry, and handling chimeric mass spectra.
Results: In general, all deconvolution algorithms may eventually report mass peaks that are not compatible with the chemical formula of any peptide. We show how to remove these artifacts by considering their mass defects. We introduce YADA 3.0, a fast deconvolution algorithm that can remove peaks with unacceptable mass defects. Our approach is effective for polypeptides smaller than 10 kDa and its essence can be easily incorporated into any deconvolution algorithm.
Availability: YADA 3.0 is freely available for academic use at http://patternlabforproteomics.org/yada3.
Supplementary information: Supplementary information is available at Bioinformatics online.
Categories: Bioinformatics Trends
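
The idea of discarding peaks with implausible mass defects, as described in the YADA 3.0 abstract above, can be illustrated with a rough Python sketch. It is not YADA's rule. It assumes that peptide monoisotopic masses lie close to integer multiples of about 1.000495 Da, and it uses a fixed tolerance of 0.1 Da; both numbers, the function names and the example masses are assumptions for illustration only.

```python
PEPTIDE_SCALE = 1.000495  # assumed average ratio of monoisotopic to nominal peptide mass


def defect_deviation(mass: float, scale: float = PEPTIDE_SCALE) -> float:
    """Distance in Da between `mass` and the nearest mass expected for a peptide."""
    nominal = round(mass / scale)
    return abs(mass - nominal * scale)


def filter_peaks(masses: list[float], tolerance: float = 0.1) -> list[float]:
    """Keep only masses whose deviation from the expected defect is within `tolerance`."""
    return [m for m in masses if defect_deviation(m) <= tolerance]


# Hypothetical deconvoluted masses; the middle one falls between expected values.
peaks = [1045.5345, 1045.95, 2465.20]
kept = filter_peaks(peaks)
```

Because the natural spread of peptide mass defects grows with mass, a fixed band like this stops being selective for large molecules, which is consistent with the abstract limiting the approach to polypeptides below 10 kDa.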

Highly significant improvement of protein sequence alignments with AlphaFold2

Bioinformatics Oxford Journals - Wed, 21/09/2022 - 5:30am
Abstract
Motivation: Protein sequence alignments are essential to structural, evolutionary and functional analysis but their accuracy is often limited by sequence similarity unless molecular structures are available. Protein structures predicted at experimental grade accuracy, as achieved by AlphaFold2, could therefore have a major impact on sequence analysis.
Results: Here, we find that multiple sequence alignments estimated on AlphaFold2 predictions are almost as accurate as alignments estimated on experimental structures and significantly closer to the structural reference than sequence-based alignments. We also show that AlphaFold2 structural models of relatively low quality can be used to obtain highly accurate alignments. These results suggest that, besides structure modeling, AlphaFold2 encodes higher-order dependencies that can be exploited for sequence analysis.
Availability: All data, analyses, and results are available on Zenodo (https://doi.org/10.5281/zenodo.7031286). The code and scripts have been deposited in GitHub (https://github.com/cbcrg/msa-af2-nf) and the various containers in (https://cloud.sylabs.io/library/athbaltzis/af2/alphafold, https://hub.docker.com/r/athbaltzis/pred).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Lung Cancer Subtype Diagnosis using Weakly-paired Multi-omics Data

Bioinformatics Oxford Journals - Tue, 20/09/2022 - 5:30am
Abstract
Motivation: Cancer subtype diagnosis is crucial for precise treatment, as different subtypes need different therapies. Although the diagnosis can be greatly improved by fusing multi-omics data, most fusion solutions depend on paired omics data, which are actually weakly-paired, with different omics views missing for different samples. Incomplete multi-view learning based solutions can alleviate this issue but are still far from satisfactory because they: (i) mainly focus on shared information while ignoring the important individuality of multi-omics data; (ii) cannot pick out interpretable features for precise diagnosis.
Results: We introduce an interpretable and flexible solution (LungDWM) for Lung cancer subtype Diagnosis using Weakly-paired Multi-omics data. LungDWM first builds an attention-based encoder for each omics to pick out important diagnostic features and extract shared and complementary information across omics. Next, it proposes an individual loss to jointly extract the specific information of each omics, and performs generative adversarial learning to impute missing omics of samples using extracted features. After that, it fuses the extracted and imputed features to diagnose cancer subtypes. Experiments on benchmark datasets show that LungDWM achieves a better performance than recent competitive methods, and has a high authenticity and good interpretability.
Availability: The code is available at http://www.sdu-idea.cn/codes.php?name=LungDWM.
Categories: Bioinformatics Trends

