Subscribe to Bioinformatics Oxford Journals feed

MarkerMAG: linking metagenome-assembled genomes (MAGs) with 16S rRNA marker genes using paired-end short reads

Fri, 17/06/2022 - 5:30am
Abstract
Motivation: Metagenome-assembled genomes (MAGs) have substantially extended our understanding of microbial functionality. However, 16S rRNA genes, which are commonly used in phylogenetic analysis and environmental surveys, are often missing from MAGs. Here, we developed MarkerMAG, a pipeline that links 16S rRNA genes to MAGs using paired-end sequencing reads.
Results: Assessment of MarkerMAG on three benchmarking metagenomic datasets with various degrees of complexity shows substantial increases in the number of MAGs with 16S rRNA genes and a 100% assignment accuracy. MarkerMAG also estimates the copy number of 16S rRNA genes in MAGs with high accuracy. Assessments on three real metagenomic datasets demonstrate 1.1- to 14.2-fold increases in the number of MAGs with 16S rRNA genes. We also show that MarkerMAG-improved MAGs increase the accuracy of functional prediction from 16S rRNA gene amplicon data. MarkerMAG is helpful in connecting information in MAG databases with that in 16S rRNA databases and surveys and hence contributes to our increasing understanding of microbial diversity, function, and phylogeny.
Availability: MarkerMAG is implemented in Python3 and freely available at https://github.com/songweizhi/MarkerMAG.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Sequence Tagging For Biomedical Extractive Question Answering

Fri, 17/06/2022 - 5:30am
Abstract
Motivation: Current studies in extractive question answering (EQA) have modeled the single-span extraction setting, where a single answer span is a label to predict for a given question-passage pair. This setting is natural for general domain EQA as the majority of the questions in the general domain can be answered with a single span. Following general domain EQA models, current biomedical EQA (BioEQA) models utilize the single-span extraction setting with post-processing steps.
Results: In this paper, we investigate the question distribution across the general and biomedical domains and discover that biomedical questions are more likely to require list-type answers (multiple answers) than factoid-type answers (single answer). This necessitates models capable of producing multiple answers for a question. Based on this preliminary study, we propose a sequence tagging approach for BioEQA, which is a multi-span extraction setting. Our approach directly tackles questions with a variable number of phrases as their answer and can learn to decide the number of answers for a question from training data. On the BioASQ 7b and 8b list-type questions, our approach outperformed the best-performing existing models without requiring post-processing steps.
Availability: Source code and resources are freely available for download at https://github.com/dmis-lab/SeqTagQA.
Supplementary information: Supplementary data are available at Bioinformatics online.
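To make the multi-span idea above concrete, here is a minimal sketch of how per-token tags can be decoded into a variable number of answer phrases. The B/I/O tag set, the function name `decode_spans`, and the toy sentence are assumptions made for illustration; the abstract does not state which tagging scheme or decoding routine the authors use.

```python
from typing import List


def decode_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Turn per-token B/I/O tags into a list of answer phrases.

    Assumed scheme: "B" opens a new answer, "I" continues the open
    answer, "O" marks a token outside any answer.
    """
    spans: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            # A new answer starts; close the previous one if it is open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # "O", or a stray "I" with no open span: close any open answer.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Made-up passage and tags, only to show that the number of answers
# falls out of the tag sequence rather than being fixed in advance.
tokens = ["Mutations", "in", "BRCA1", "and", "TP53", "are", "reported"]
tags = ["O", "O", "B", "O", "B", "O", "O"]
answers = decode_spans(tokens, tags)
```

Because every "B" tag opens a new phrase, a list-type question simply yields more spans, and no separate step is needed to choose how many answers to return.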
Categories: Bioinformatics Trends

Shepherd: Accurate Clustering for Correcting DNA Barcode Errors

Thu, 16/06/2022 - 5:30am
Abstract
Motivation: DNA barcodes are short, random nucleotide sequences introduced into cell populations to track the relative counts of hundreds of thousands of individual lineages over time. Lineage tracking is widely applied, e.g. to understand evolutionary dynamics in microbial populations and the progression of breast cancer in humans. Barcode sequences are unknown upon insertion and must be identified using next-generation sequencing technology, which is error prone. In this study, we frame the barcode error correction task as a clustering problem with the aim of identifying true barcode sequences from noisy sequencing data. We present Shepherd, a novel clustering method that is based on an indexing system of barcode sequences using k-mers, and a Bayesian statistical test incorporating a substitution error rate to distinguish true from error sequences.
Results: When benchmarking with synthetic data, Shepherd provides barcode count estimates that are significantly more accurate than state-of-the-art methods, producing 10 to 150 times fewer spurious lineages. For empirical data, Shepherd produces results that are consistent with the improvements seen on synthetic data. These improvements enable higher resolution lineage tracking and more accurate estimates of biologically relevant quantities, e.g. the detection of small effect mutations.
Availability: A Python implementation of Shepherd is freely available at https://www.github.com/Nik-Tavakolian/Shepherd.
Supplementary information: Supplementary data are available at Bioinformatics online.
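The k-mer indexing idea mentioned above can be illustrated with a small sketch: barcodes that share at least one k-mer with a query are retrieved as candidate neighbours, so that only those need a full comparison. This is not Shepherd's implementation. The function names, the choice of k, and the toy barcodes are assumptions, and the Bayesian test described in the abstract is left out entirely.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_kmer_index(barcodes: List[str], k: int) -> Dict[str, Set[int]]:
    """Map every k-mer to the indices of the barcodes that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for i, barcode in enumerate(barcodes):
        for j in range(len(barcode) - k + 1):
            index[barcode[j:j + k]].add(i)
    return index


def candidate_neighbours(query: str, index: Dict[str, Set[int]], k: int) -> Set[int]:
    """Return indices of barcodes sharing at least one k-mer with the query."""
    hits: Set[int] = set()
    for j in range(len(query) - k + 1):
        hits |= index.get(query[j:j + k], set())
    return hits


def hamming(a: str, b: str) -> int:
    """Number of substitutions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


# Toy data: the second barcode differs from the first by one substitution.
barcodes = ["ACGTACGT", "ACGTACGA", "TTTTGGGG"]
index = build_kmer_index(barcodes, k=4)
near = candidate_neighbours("ACGTACGT", index, k=4)
distances = {i: hamming("ACGTACGT", barcodes[i]) for i in sorted(near)}
```

The unrelated third barcode shares no 4-mer with the query and is never compared, which is the point of indexing before clustering.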
Categories: Bioinformatics Trends

DrawTetrado to create layer diagrams of G4 structures

Wed, 15/06/2022 - 5:30am
Abstract
Motivation: Quadruplexes are specific 3D structures found in nucleic acids. Due to the exceptional properties of these motifs, their exploration with general-purpose bioinformatics methods can be problematic or insufficient. The same applies to visualizing their structure. A hand-drawn layer diagram is the most common way to represent the quadruplex anatomy. No molecular visualization software generates such a structural model based on atomic coordinates.
Results: DrawTetrado is an open-source Python program for automated visualization targeting the structures of quadruplexes and G4-helices. It generates static layer diagrams that represent structural data in a pseudo-3D perspective. The possibility to set color schemes, nucleotide labels, inter-element distances, or angle of view allows for easy customization of the output drawing.
Availability: The program is available under the MIT license at https://github.com/RNApolis/drawtetrado.
Categories: Bioinformatics Trends

scCNC: A method based on Capsule Network for Clustering scRNA-seq Data

Tue, 14/06/2022 - 5:30am
Abstract
Motivation: A large number of studies have shown that clustering is a crucial step in scRNA-seq analysis. Most existing methods are based on unsupervised learning without the prior exploitation of any domain knowledge, and so do not utilize available gold-standard labels. When confronted by the high dimensionality and general dropout events of scRNA-seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters, which complicates cell type assignment.
Results: In this paper, we propose a semi-supervised clustering method based on a capsule network, named scCNC, that integrates domain knowledge into the clustering step. We also propose a Semi-supervised Greedy Iterative Training (SGIT) method used to train the whole network. Experiments on real scRNA-seq datasets show that scCNC can significantly improve clustering performance and facilitate downstream analyses.
Availability: The source code of scCNC is freely available at https://github.com/WHY-17/scCNC.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MetBP: A Software Tool for Detection of Interaction between Metal Ion-RNA Base Pairs

Mon, 13/06/2022 - 5:30am
Abstract
Motivation: The role of metals in the shaping and functioning of RNA is well established, and understanding it through the analysis of structural data has biological relevance. Often metal ions bind to one or more atoms of the nucleobase of an RNA. This fact becomes more interesting when such bases form a base pair with any other base. Furthermore, when metal ions bind to any residue of an RNA, the secondary structural features of the residue (helix, loop, unpaired, etc.) are also biologically important. The available metal binding related software tools cannot address such type-specific queries.
Results: To address this limitation, we have designed a software tool called MetBP. It is a stand-alone command-line tool with no dependency on other existing software. It accepts a structure file in mmCIF or PDB format, computes the base pairs, and then reports all metals that bind to one or more nucleotides that form pairs with another. It reports binding distances and angles along with base pair stability. It also reports several other important aspects, e.g. the secondary structure of the residue in the RNA. MetBP can also be used as a generalized metal binding site detection tool for proteins and DNA.
Availability: https://github.com/computational-biology/metbp
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

SPEAR: Systematic ProtEin AnnotatoR

Mon, 13/06/2022 - 5:30am
Abstract
Summary: We present SPEAR, a lightweight and rapid SARS-CoV-2 variant annotation and scoring tool, for identifying mutations contributing to potential immune escape and transmissibility (ACE2 binding) at point of sequencing. SPEAR can be used in the field to evaluate genomic surveillance results in real-time and features a powerful interactive data visualisation report.
Availability and implementation: SPEAR and its documentation are freely available on GitHub at https://github.com/m-crown/SPEAR. SPEAR is implemented in Python and installable via a Conda environment.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

hapCon: Estimating Contamination of Ancient Genomes by Copying from Reference Haplotypes

Mon, 13/06/2022 - 5:30am
Abstract
Motivation: Human ancient DNA (aDNA) studies have surged in recent years, revolutionizing the study of the human past. Typically, aDNA is preserved poorly, making such data prone to contamination from other human DNA. Therefore, it is important to rule out substantial contamination before proceeding to downstream analysis. As most aDNA samples can only be sequenced to low coverages (<1x average depth), computational methods that can robustly estimate contamination in the low coverage regime are needed. However, the ultra low-coverage regime (0.1x and below) remains challenging for existing approaches.
Results: We present a new method to estimate contamination in aDNA for male modern humans. It utilizes a Li & Stephens haplotype copying model for haploid X chromosomes, with mismatches modelled as errors or contamination. We assessed this new approach, hapCon, on simulated and down-sampled empirical aDNA data. Our experiments demonstrate that hapCon outperforms a commonly used tool for estimating male X contamination (ANGSD), with substantially lower variance and narrower confidence intervals, especially in the low coverage regime. We found that hapCon provides useful contamination estimates for coverages as low as 0.1x for SNP capture data (1240k) and 0.02x for whole genome sequencing data (WGS), substantially extending the coverage limit of previous male X chromosome based contamination estimation methods. Our experiments demonstrate that hapCon has little bias for contamination up to 25-30% as long as the contaminating source is specified within continental genetic variation, and that its application range extends to human aDNA as old as ∼45,000 years and various global ancestries.
Availability: We make hapCon available as part of a Python package (hapROH), which is available at the Python Package Index (https://pypi.org/project/hapROH) and can be installed via pip. The documentation provides example use cases as blueprints for custom applications (https://haproh.readthedocs.io/en/latest/hapCon.html). The program can analyze either BAM files or pileup files produced with samtools. An implementation of our software (hapCon) using Python and C is deposited at https://github.com/hyl317/hapROH.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

SNIKT: sequence-independent adapter identification and removal in long-read shotgun sequencing data

Mon, 13/06/2022 - 5:30am
Abstract
Summary: Here we introduce SNIKT, a command-line tool for sequence-independent visual confirmation and input-assisted removal of adapter contamination in whole-genome shotgun or metagenomic shotgun long-read sequencing DNA or RNA data.
Availability and implementation: SNIKT is implemented in R and is compatible with Unix-like platforms. The source code, along with documentation, is freely available under an MIT license at https://github.com/piyuranjan/SNIKT.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MiMIR: R-shiny application to infer risk factors and endpoints from Nightingale Health’s 1H-NMR Metabolomics data

Mon, 13/06/2022 - 5:30am
Abstract
Motivation: 1H-NMR metabolomics is rapidly becoming a standard resource in large epidemiological studies to acquire metabolic profiles in large numbers of samples in a relatively low-priced and standardized manner. Concomitantly, metabolomics-based models that capture disease risk or clinical risk factors are increasingly being developed. These developments raise the need for a user-friendly toolbox to inspect new 1H-NMR metabolomics data and project a wide array of previously established risk models.
Results: We present MiMIR (Metabolomics-based Models for Imputing Risk), a graphical user interface that provides an intuitive framework for ad-hoc statistical analysis of Nightingale Health’s 1H-NMR metabolomics data and allows for the projection and calibration of 24 pre-trained metabolomics-based models, without any pre-required programming knowledge.
Availability: The R-shiny package is available on CRAN or downloadable at https://github.com/DanieleBizzarri/MiMIR, together with an extensive user manual (also available as Supplementary Documents to the paper).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

ZipHiC: a novel Bayesian framework to identify enriched interactions and experimental biases in Hi-C data

Thu, 09/06/2022 - 5:30am
Abstract
Motivation: Several computational and statistical methods have been developed to analyse data generated through 3C-based methods, especially Hi-C. Most of the existing methods do not account for dependency in Hi-C data.
Results: Here, we present ZipHiC, a novel statistical method to explore Hi-C data focusing on the detection of enriched contacts. ZipHiC implements a Bayesian method based on a hidden Markov random field (HMRF) model and Approximate Bayesian Computation (ABC) to detect interactions in two-dimensional space based on a Hi-C contact frequency matrix. ZipHiC uses data on the sources of biases related to the contact frequency matrix, allows borrowing information from neighbours using the Potts model and improves computation speed by using the ABC model. In addition to outperforming existing tools on both simulated and real data, our model also provides insights into different sources of biases that affect Hi-C data. We show that some datasets display higher biases from DNA accessibility or Transposable Elements content. Furthermore, our analysis in D. melanogaster showed that approximately half of the detected significant interactions connect promoters with other parts of the genome, indicating a functional biological role. Finally, we found that the micro-C datasets display higher biases from DNA accessibility compared to a similar Hi-C experiment, but this can be corrected by ZipHiC.
Categories: Bioinformatics Trends

Accelerating in-silico saturation mutagenesis using compressed sensing

Thu, 09/06/2022 - 5:30am
Abstract
Motivation: In-silico saturation mutagenesis (ISM) is a popular approach in computational genomics for calculating feature attributions on biological sequences that proceeds by systematically perturbing each position in a sequence and recording the difference in model output. However, this method can be slow because systematically perturbing each position requires performing a number of forward passes proportional to the length of the sequence being examined.
Results: In this work, we propose a modification of ISM that leverages the principles of compressed sensing to require only a constant number of forward passes, regardless of sequence length, when applied to models that contain operations with a limited receptive field, such as convolutions. Our method, named Yuzu, can reduce the time that ISM spends in convolution operations by several orders of magnitude and, consequently, Yuzu can speed up ISM on several commonly used architectures in genomics by over an order of magnitude. Notably, we found that Yuzu provides speedups that increase with the complexity of the convolution operation and the length of the sequence being analyzed, suggesting that Yuzu provides large benefits in realistic settings.
Availability: We have made this tool available at https://github.com/kundajelab/yuzu.
Supplementary information: Supplementary data are available at Bioinformatics online.
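The cost argument above is easy to see in a sketch of plain ISM, the baseline that Yuzu is described as accelerating. The code below is not Yuzu and does not use compressed sensing. The function `naive_ism` and the GC-counting `toy_model` are hypothetical stand-ins for a real sequence model, chosen only so that the number of forward passes can be counted.

```python
from typing import Callable, Dict, List, Tuple

ALPHABET = "ACGT"


def naive_ism(seq: str, model: Callable[[str], float]) -> Tuple[List[Dict[str, float]], int]:
    """Plain ISM: score every single-base substitution against the reference.

    Returns the per-position score differences and the number of model
    evaluations (forward passes) that were needed.
    """
    calls = 0

    def forward(s: str) -> float:
        nonlocal calls
        calls += 1
        return model(s)

    reference = forward(seq)
    attributions: List[Dict[str, float]] = []
    for i, base in enumerate(seq):
        row: Dict[str, float] = {}
        for alt in ALPHABET:
            if alt == base:
                continue
            mutated = seq[:i] + alt + seq[i + 1:]
            row[alt] = forward(mutated) - reference
        attributions.append(row)
    return attributions, calls


def toy_model(s: str) -> float:
    """Hypothetical stand-in for a trained model: counts G and C bases."""
    return float(s.count("G") + s.count("C"))


scores, n_calls = naive_ism("ACGT", toy_model)
# One reference pass plus three substitutions at each of the four positions.
```

For a sequence of length L over a four-letter alphabet this takes 1 + 3L forward passes, which is the linear growth the abstract refers to.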
Categories: Bioinformatics Trends

OverProt: secondary structure consensus for protein families

Wed, 08/06/2022 - 5:30am
Abstract
Summary: Every protein family has a set of characteristic secondary structures. However, due to individual variations, a single structure is not enough to represent the whole family. OverProt can create a secondary structure consensus, showing the general fold of the family as well as its variation. Our server provides precomputed results for all CATH superfamilies and user-defined computations, visualized by an interactive viewer, which shows the SSE type, length, frequency of occurrence, spatial variability, and β-connectivity.
Availability and implementation: OverProt Server is freely available at https://overprot.ncbr.muni.cz.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Looking at the BiG Picture: Incorporating bipartite graphs in drug response prediction

Wed, 08/06/2022 - 5:30am
Abstract
Motivation: The increasing number of publicly available databases containing drugs’ chemical structures, their response in cell lines, and molecular profiles of the cell lines has drawn attention to the problem of drug response prediction. However, many existing methods do not fully leverage the information that is shared among cell lines and drugs with similar structure. As such, drug similarities in terms of cell line responses and chemical structures could prove to be useful in forming drug representations to improve drug response prediction accuracy.
Results: We present two deep learning approaches, BiG-DRP and BiG-DRP+, for drug response prediction. Our models take advantage of the drugs’ chemical structure and the underlying relationships of drugs and cell lines through a bipartite graph and a heterogeneous graph convolutional network that incorporate sensitive and resistant cell line information in forming drug representations. Evaluation of our methods and other state-of-the-art models in different scenarios shows that incorporating this bipartite graph significantly improves the prediction performance. Additionally, genes that contribute significantly to the performance of our models also point to important biological processes and signaling pathways. Analysis of predicted drug response of patients’ tumors using our model revealed important associations between mutations and drug sensitivity, illustrating the utility of our model in pharmacogenomics studies.
Availability and implementation: An implementation of the algorithms in Python is provided at https://github.com/ddhostallero/BiG-DRP.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

RNAsolo: a repository of cleaned PDB-derived RNA 3D structures

Wed, 08/06/2022 - 5:30am
Abstract
Motivation: The development of algorithms dedicated to RNA 3D structures contributes to the demand for training, testing, and benchmarking data. A reliable source of such data derived from computational prediction is the RNA-Puzzles repository. In contrast, the largest resource with experimentally determined structures is the Protein Data Bank. However, files in this archive often contain other molecular data in addition to the RNA structure itself, which should be removed before the files are used by RNA processing algorithms.
Results: RNAsolo is a self-updating database dedicated to RNA bioinformatics. It systematically collects experimentally determined RNA 3D structures stored in the PDB, removes non-RNA chains from them, and groups them into equivalence classes. It allows users to download various subsets of data (clustered by resolution, source, data format, etc.) for further processing and analysis with a single click.
Availability: The repository is publicly available at https://rnasolo.cs.put.poznan.pl.
Categories: Bioinformatics Trends

WAT3R: Recovery of T Cell Receptor Variable Regions From 3’ Single-Cell RNA-Sequencing

Wed, 08/06/2022 - 5:30am
Abstract
Summary: Diversity of the T cell receptor (TCR) repertoire is central to adaptive immunity. The TCR is composed of α and β chains, encoded by the TRA and TRB genes, of which the variable regions determine antigen specificity. To generate novel biological insights into the complex functioning of immune cells, combined capture of variable regions and single-cell transcriptomes provides a compelling approach. Recent developments enable the enrichment of TRA and TRB variable regions from widely used technologies for 3’-based single-cell RNA-sequencing (scRNA-seq). However, a comprehensive computational pipeline to process TCR-enriched data from 3’ scRNA-seq is not available. Here we present an analysis pipeline to process TCR variable regions enriched from 3’ scRNA-seq cDNA. The tool reports TRA and TRB nucleotide and amino acid sequences linked to cell barcodes, enabling the reconstruction of T cell clonotypes with associated transcriptomes. We demonstrate the software using peripheral blood mononuclear cells (PBMCs) from a healthy donor and detect TCR sequences in a high proportion of single T cells. Detection of TCR sequences is low in non-T cell populations, demonstrating specificity. Finally, we show that TCR clones are larger in CD8 memory T cells than in other T cell types, indicating an association between T cell clonotypes and differentiation states.
Availability and implementation: The Workflow for Association of T cell receptors from 3’ single-cell RNA-seq (WAT3R), including test data, is available on GitHub (https://github.com/mainciburu/WAT3R), on Docker Hub (https://hub.docker.com/r/mainciburu/wat3r), and as a workflow on the Terra platform (https://app.terra.bio). The test dataset is available on GEO (accession number GSE195956).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

meGPS: a multi-omics signature for hepatocellular carcinoma detection integrating methylome and transcriptome data

Wed, 08/06/2022 - 5:30am
Abstract
Motivation: Hepatocellular carcinoma (HCC) is a primary malignancy with poor prognosis. Recently, multi-omics molecular-level measurements have enabled HCC diagnosis and prognosis prediction, which is crucial for early intervention with personalized therapy to reduce mortality. Here, we introduce a novel strategy utilizing DNA methylation and RNA expression data to achieve a multi-omics gene pair signature (GPS) for HCC discrimination.
Results: The immune genes with negative correlations between expression and promoter methylation are enriched in the highly connected cancer-related pathway network and are considered candidates for HCC detection. We then separately construct a methylation GPS (mGPS) and an expression GPS (eGPS), and assemble them into a meGPS with five gene pairs, in which significant methylation and expression changes occur between HCC tumor and non-tumor groups. Reliable performance has been validated by independent tissue (age, gender, and etiology) and blood datasets. This study proposes a procedure for multi-omics GPS identification and develops a novel HCC signature using both methylome and transcriptome data, suggesting potential molecular targets for the detection and therapy of HCC.
Availability and implementation: Models are available at https://github.com/bioinformaticStudy/meGPS.git.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

HyperGraphs.jl – representing high-order relationships in Julia

Wed, 08/06/2022 - 5:30am
Abstract
Summary: HyperGraphs.jl is a Julia package that implements hypergraphs. These are a generalisation of graphs that allow us to represent n-ary relationships and not just binary, pairwise relationships. High-order interactions are commonplace in biological systems and are of critical importance to their dynamics; hypergraphs thus offer a natural way to accurately describe and model these systems.
Availability and implementation: HyperGraphs.jl is freely available under the MIT license. Source code and documentation can be found at https://github.com/lpmdiaz/HyperGraphs.jl.
Supplementary information: Supplementary information is available at Bioinformatics online.
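The difference between a graph edge and a hyperedge can be shown with a very small data structure. The sketch below is written in Python for consistency with the other examples on this page and does not reproduce the HyperGraphs.jl API. The class `Hypergraph`, its methods, and the enzyme/substrate labels are all assumptions made for illustration.

```python
from typing import FrozenSet, Hashable, List, Set


class Hypergraph:
    """Minimal undirected hypergraph: each edge is a set of any size."""

    def __init__(self) -> None:
        self.edges: List[FrozenSet[Hashable]] = []

    def add_edge(self, *vertices: Hashable) -> None:
        """Add one hyperedge joining all the given vertices at once."""
        self.edges.append(frozenset(vertices))

    def vertices(self) -> Set[Hashable]:
        """All vertices that appear in at least one hyperedge."""
        return set().union(*self.edges)

    def degree(self, vertex: Hashable) -> int:
        """Number of hyperedges that contain the vertex."""
        return sum(vertex in edge for edge in self.edges)


# Hypothetical example: one three-way interaction and one pairwise one.
h = Hypergraph()
h.add_edge("enzyme", "substrate", "cofactor")  # a single 3-ary relationship
h.add_edge("enzyme", "product")                # an ordinary binary edge
largest = max(len(edge) for edge in h.edges)
```

An ordinary graph would have to break the three-way interaction into three pairwise edges, losing the information that all three participants act together.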
Categories: Bioinformatics Trends

Prediction of Allosteric Communication Pathways in Proteins

Wed, 08/06/2022 - 5:30am
Abstract
Motivation: Allostery in proteins is an essential phenomenon in biological processes. In this paper, we present a computational model to predict paths of maximum information transfer between active and allosteric sites. In this information theoretic study, we use mutual information as the measure of information transfer, where the transition probability of information from one residue to its contacting neighbors is proportional to the magnitude of mutual information between the two residues. Starting from a given residue and using a Hidden Markov Model, we successively determine the neighboring residues that eventually lead to a path of optimum information transfer. The Gaussian approximation of mutual information between residue pairs is adopted. The limits of validity of this approximation are discussed in terms of a nonlinear theory of mutual information and its reduction to the Gaussian form.
Results: Predictions of the model are tested on six widely studied cases: CheY Bacterial Chemotaxis, B-cell Lymphoma extra-large Bcl-xL, Human proline isomerase cyclophilin A (CypA), Dihydrofolate reductase DHFR, HRas GTPase, and Caspase-1. The predicted communication, which renders the propagation of local fluctuations from the active sites throughout the structure along multiple paths, correlates well with the known experimental data. Distinct paths originating from the active site may represent multifunctionality, such as the involvement of more than one allosteric site and/or the preexistence of some other functional states. Our model is computationally fast and simple, and can give allosteric communication pathways, which are crucial for the understanding and control of protein functionality.
Supplementary information: Supplementary data are available at Bioinformatics online.
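The two ingredients named above, a Gaussian form of mutual information and transition probabilities proportional to it, can be sketched as follows. As a simplifying assumption, each residue's fluctuation is treated here as a single scalar variable, for which the mutual information of two jointly Gaussian variables with correlation ρ is −½ ln(1 − ρ²). The paper's own formulation may be vector-valued, and the Hidden Markov Model step is not shown. The function names and correlation values are hypothetical.

```python
import math
from typing import Dict


def gaussian_mi(rho: float) -> float:
    """Mutual information (in nats) of two jointly Gaussian scalars.

    Uses I = -0.5 * ln(1 - rho^2); requires |rho| < 1.
    """
    return -0.5 * math.log(1.0 - rho * rho)


def transition_probabilities(mi_to_neighbours: Dict[str, float]) -> Dict[str, float]:
    """Normalise mutual information values so they sum to one."""
    total = sum(mi_to_neighbours.values())
    return {name: value / total for name, value in mi_to_neighbours.items()}


# Hypothetical correlations between one residue and its contacting neighbours.
correlations = {"j": 0.8, "k": 0.3}
mi = {name: gaussian_mi(rho) for name, rho in correlations.items()}
probs = transition_probabilities(mi)
```

Under this scalar assumption the strongly correlated neighbour receives most of the probability mass, which is the sense in which information is more likely to flow along highly coupled contacts.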
Categories: Bioinformatics Trends

Deep Learning for Survival Analysis in Breast Cancer with Whole Slide Image Data

Wed, 08/06/2022 - 5:30am
Abstract
Motivation: Whole slide tissue images contain detailed data on the sub-cellular structure of cancer. Quantitative analyses of this data can lead to novel biomarkers for better cancer diagnosis and prognosis and can improve our understanding of cancer mechanisms. Such analyses are challenging to execute because of the size and complexity of whole slide image data and the relatively limited volume of training data for machine learning methods.
Results: We propose and experimentally evaluate a multi-resolution deep learning method for breast cancer survival analysis. The proposed method integrates image data at multiple resolutions and tumor, lymphocyte and nuclear segmentation results from deep learning models. Our results show that this approach can significantly improve the deep learning model performance compared to using only the original image data. The proposed approach achieves a c-index value of 0.706 compared to a c-index value of 0.551 from an approach that uses only color image data at the highest image resolution. Furthermore, when clinical features (sex, age and cancer stage) are combined with image data, the proposed approach achieves a c-index of 0.773.
Availability: https://github.com/SBU-BMI/deep_survival_analysis
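For readers unfamiliar with the c-index values quoted above, the following sketch computes a concordance index in the style of Harrell's C for right-censored survival data. The abstract does not say which variant or implementation the authors used, so the pair-selection rule, the tie handling, and the toy data below are assumptions.

```python
from typing import Sequence


def concordance_index(times: Sequence[float],
                      events: Sequence[int],
                      risks: Sequence[float]) -> float:
    """Fraction of comparable pairs that the risk scores order correctly.

    A pair (i, j) is comparable when subject i had an observed event
    strictly before subject j's recorded time. It is concordant when the
    model gave i the higher risk; tied risks count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up data: four patients, the third one censored (event = 0).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risks = [0.9, 0.4, 0.5, 0.1]
c = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the reported improvement from 0.551 to 0.706.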
Categories: Bioinformatics Trends
