ConsensuSV—from the whole genome sequencing data to the complete variant list

Bioinformatics Oxford Journals - Mon, 31/10/2022 - 5:30am
Abstract
Summary: The detection of structural variants (SVs) using Illumina sequencing of human DNA is not an easy task. Multiple approaches have been proposed; however, all of them have limitations. In this paper we present the ConsensuSV pipeline, which aids research in complex variant detection. Using a consensus meta-approach, eight independent SV callers are combined to identify a uniform set of high-quality structural variants. The pipeline works on raw sequencing data and performs all the necessary steps automatically, significantly reducing the time researchers need to process the data. The output files contain structural variants, single nucleotide polymorphisms and indels. The pipeline uses the luigi framework, allowing the software to run efficiently and in parallel on high-performance computing (HPC) infrastructure. We strongly believe that the software is useful to the scientific community interested in germline variant detection.
Availability: https://github.com/SFGLab/ConsensuSV-pipeline
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
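
The abstract above mentions a consensus meta-approach over several SV callers but does not say how calls are merged. The following is only a minimal sketch of one possible voting scheme, not the ConsensuSV method: the record layout, the `tolerance` and `min_callers` parameters, and the function name are all assumptions made for illustration.

```python
from collections import namedtuple

# Hypothetical record for one structural-variant call from one caller.
SV = namedtuple("SV", "chrom start end svtype caller")


def consensus_calls(calls, tolerance=100, min_callers=2):
    """Group calls with the same chromosome and type whose breakpoints lie
    within `tolerance` bp of a cluster's first call, then keep clusters
    supported by at least `min_callers` distinct callers."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end)):
        for cluster in clusters:
            seed = cluster[0]
            if (seed.chrom == call.chrom and seed.svtype == call.svtype
                    and abs(seed.start - call.start) <= tolerance
                    and abs(seed.end - call.end) <= tolerance):
                cluster.append(call)
                break
        else:
            clusters.append([call])

    consensus = []
    for cluster in clusters:
        callers = {c.caller for c in cluster}
        if len(callers) >= min_callers:
            starts = sorted(c.start for c in cluster)
            ends = sorted(c.end for c in cluster)
            # Report the (upper) median breakpoint of the supporting calls.
            consensus.append((cluster[0].chrom, starts[len(starts) // 2],
                              ends[len(ends) // 2], cluster[0].svtype,
                              len(callers)))
    return consensus


example = [
    SV("chr1", 1000, 2000, "DEL", "callerA"),
    SV("chr1", 1020, 1990, "DEL", "callerB"),
    SV("chr1", 5000, 6000, "DUP", "callerA"),
]
print(consensus_calls(example))  # [('chr1', 1020, 2000, 'DEL', 2)]
```

The single-caller duplication is dropped because it lacks support from a second caller; a real pipeline would likely use a more careful matching rule than this greedy clustering.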

Convex hull as diagnostic tool in single-molecule localization microscopy

Bioinformatics Oxford Journals - Mon, 31/10/2022 - 5:30am
Abstract
Motivation: Single-molecule localization microscopy resolves individual fluorophores or fluorescence-labeled biomolecules. Data is provided as a set of localizations that distribute normally around the true fluorophore position with a variance determined by the localization precision. Characterizing the spatial fluorophore distribution to differentiate between resolution-limited localization clusters, which resemble individual biomolecules, and extended structures, which represent aggregated molecular complexes, is a common challenge.
Results: We demonstrate use of the convex hull and related hull properties of localization clusters for diagnostic purposes, as a parameter for cluster selection, or as a tool to determine localization precision.
Availability: https://github.com/super-resolution/Ebert-et-al-2022-supplement
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
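
As a rough illustration of the hull properties mentioned in the abstract above, the convex hull of a 2D localization cluster and its area can be computed with the standard monotone-chain algorithm and the shoelace formula. This is a generic sketch using only the Python standard library, not the authors' code; the function names and the toy coordinates are assumptions.

```python
def convex_hull(points):
    """Return hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of (a - o) x (b - o); positive for a left turn.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i in range(len(vertices)):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Toy cluster: four corner localizations and one interior localization.
cluster = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
hull = convex_hull(cluster)
print(len(hull), polygon_area(hull))  # 4 4.0
```

The interior point does not appear in the hull, so the hull area depends only on the outermost localizations of the cluster.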

kmdiff, large-scale and user-friendly differential k-mer analyses

Bioinformatics Oxford Journals - Mon, 31/10/2022 - 5:30am
Abstract
Genome Wide Association Studies (GWAS) elucidate links between genotypes and phenotypes. Recent studies point out the interest of conducting such experiments using k-mers as the base signal instead of single-nucleotide polymorphisms. We propose a tool, kmdiff, that performs differential k-mer analyses on large sequencing cohorts in an order of magnitude less time and memory than previously possible.
Availability: https://github.com/tlemane/kmdiff
Funding: The work was funded by IPL Inria Neuromarkers, ANR Inception (ANR-16-CONV-0005), ANR Prairie (ANR-19-P3IA-0001), ANR SeqDigger (ANR-19-CE45-0008), H2020 ITN ALPACA grant 956229.
Categories: Bioinformatics Trends
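
To make the idea of a differential k-mer analysis concrete, here is a toy sketch that compares how often each k-mer is present in case samples versus control samples. It is not kmdiff's algorithm and performs no statistical testing; the function names, the `min_diff` threshold, and the example sequences are assumptions for illustration only.

```python
from collections import Counter


def kmers(seq, k):
    """Set of distinct k-mers occurring in one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_presence(samples, k):
    """For each k-mer, count the number of samples that contain it."""
    presence = Counter()
    for seq in samples:
        presence.update(kmers(seq, k))
    return presence


def differential_kmers(cases, controls, k, min_diff=1.0):
    """Return k-mers whose presence fraction differs between the two groups
    by at least `min_diff` (positive values mean enriched in cases)."""
    case_presence = sample_presence(cases, k)
    control_presence = sample_presence(controls, k)
    result = {}
    for kmer in set(case_presence) | set(control_presence):
        diff = (case_presence[kmer] / len(cases)
                - control_presence[kmer] / len(controls))
        if abs(diff) >= min_diff:
            result[kmer] = diff
    return result


cases = ["ACGTA", "ACGTT"]
controls = ["TTTTT", "TTTTA"]
print(sorted(differential_kmers(cases, controls, k=3).items()))
# [('ACG', 1.0), ('CGT', 1.0), ('TTT', -1.0)]
```

A cohort-scale tool would need compact k-mer counting and a proper association test with multiple-testing correction, which this sketch deliberately leaves out.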

HaploDMF: viral Haplotype reconstruction from long reads via Deep Matrix Factorization

Bioinformatics Oxford Journals - Sat, 29/10/2022 - 5:30am
Abstract
Motivation: Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e. haplotypes) in one virus population helps study the viruses’ evolution and their interactions with the host/other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length by next-generation sequencing (NGS) makes complete haplotype reconstruction difficult.
Results: In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to learn latent features from aligned reads automatically. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF is able to achieve highly robust performance on data with different coverage, haplotype number, and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against the state-of-the-art tools on simulated and real sequencing TGS data on different viruses. The results show that HaploDMF competes favorably against all others.
Availability: The source code and the documentation of HaploDMF are available at https://github.com/dhcai21/HaploDMF.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

ADViSELipidomics: a workflow for analyzing lipidomics data

Bioinformatics Oxford Journals - Sat, 29/10/2022 - 5:30am
Abstract
Summary: ADViSELipidomics is a novel Shiny app for preprocessing, analyzing, and visualizing lipidomics data. It handles the outputs from LipidSearch and LIQUID for lipid identification and quantification and the data from the Metabolomics Workbench. ADViSELipidomics extracts information by parsing lipid species (using LIPID MAPS classification) and, together with information available on the samples, performs several exploratory and statistical analyses. When the experiment includes internal lipid standards, ADViSELipidomics can normalize the data matrix, providing normalized concentration values per lipid and sample. Moreover, it identifies differentially abundant lipids in simple and complex experimental designs, dealing with batch effect correction. Finally, ADViSELipidomics has a user-friendly Graphical User Interface (GUI) and supports an extensive series of interactive graphics.
Availability and Implementation: ADViSELipidomics is freely available at https://github.com/ShinyFabio/ADViSELipidomics
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

HAT: Haplotype Assembly Tool using short and error-prone long reads

Bioinformatics Oxford Journals - Sat, 29/10/2022 - 5:30am
Abstract
Motivation: Haplotypes are the set of alleles co-occurring on a single chromosome and inherited together to the next generation. Because a monoploid reference genome loses this co-occurrence information, it has limited use in associating phenotypes with allelic combinations of genotypes. Therefore, methods to reconstruct the complete haplotypes from DNA sequencing data are crucial. Recently, several attempts have been made at haplotype reconstructions, but significant limitations remain. High-quality continuous haplotypes cannot be created reliably, particularly when there are few differences between the homologous chromosomes.
Results: Here, we introduce HAT, a haplotype assembly tool that exploits short and long reads along with a reference genome to reconstruct haplotypes. HAT tries to take advantage of the accuracy of short reads and the length of the long reads to reconstruct haplotypes. We tested HAT on the aneuploid yeast strain Saccharomyces pastorianus CBS1483 and multiple simulated polyploid data sets of the same strain, showing that it outperforms existing tools.
Availability: https://github.com/AbeelLab/hat/
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Common data model for COVID-19 datasets

Bioinformatics Oxford Journals - Thu, 27/10/2022 - 5:30am
Abstract
Motivation: A global medical crisis like the COVID-19 pandemic requires interdisciplinary and highly collaborative research from all over the world. One of the key challenges for collaborative research is a lack of interoperability among various heterogeneous data sources. Interoperability, standardization and mapping of datasets are necessary for data analysis and applications in advanced algorithms such as developing personalized risk prediction modeling.
Results: To ensure the interoperability and compatibility among COVID-19 datasets, we present here a Common Data Model (CDM) which has been built from 11 different COVID-19 datasets from various geographical locations. The current version of the CDM holds 4639 data variables related to COVID-19 such as basic patient information (age, biological sex, and diagnosis) as well as disease-specific data variables, for example, anosmia and dyspnea. Each of the data variables in the data model is associated with specific data types, variable mappings, value ranges, data units, and data encodings that could be used for standardizing any dataset. Moreover, the compatibility with established data standards like OMOP and FHIR makes the CDM a well-designed common data model for COVID-19 data interoperability.
Availability: The CDM is available in a public repo here: https://github.com/Fraunhofer-SCAI-Applied-Semantics/COVID-19-Global-Model
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Neuron Tracing from Light Microscopy Images: Automation, Deep Learning, and Bench Testing

Bioinformatics Oxford Journals - Thu, 27/10/2022 - 5:30am
Abstract
Motivation: Large-scale neuronal morphologies are essential to neuronal typing, connectivity characterization and brain modeling. It is widely accepted that automation is critical to the production of neuronal morphology. Although survey papers on neuron tracing from light microscopy data have appeared over the last decade, the rapid development of the field calls for an updated review focusing on new methods and remarkable applications.
Results: This review outlines neuron tracing in various scenarios with the goal of helping the community understand and navigate tools and resources. We describe the status, examples, and accessibility of automatic neuron tracing. We survey recent advances in the increasingly popular deep learning enhanced methods. We highlight the semi-automatic methods for single neuron tracing of mammalian whole brains as well as the resulting datasets, each containing thousands of full neuron morphologies. Finally, we exemplify the commonly used datasets and metrics for neuron tracing bench testing.
Categories: Bioinformatics Trends

EvAM-Tools: tools for evolutionary accumulation and cancer progression models

Bioinformatics Oxford Journals - Wed, 26/10/2022 - 5:30am
Abstract
Summary: EvAM-Tools is an R package and web application that provides a unified interface to state-of-the-art cancer progression models (CPMs) and, more generally, evolutionary models of event accumulation. The output includes, in addition to the fitted models, the transition (and transition rate) matrices between genotypes and the probabilities of evolutionary paths. Generation of random cancer progression models is also available. Using the GUI in the web application, users can easily construct models (modifying Directed Acyclic Graphs, or DAGs, of restrictions, matrices of mutual hazards, or specifying genotype composition), generate data from them (with user-specified observational/genotyping error), and analyze the data.
Availability and Implementation: Implemented in R and C; open source code available under the GNU Affero General Public License v3.0 at https://github.com/rdiaz02/EvAM-Tools. Docker images freely available from https://hub.docker.com/u/rdiaz02. Web app freely accessible at https://iib.uam.es/evamtools.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Identifying the critical state of complex biological systems by the directed-network rank score method

Bioinformatics Oxford Journals - Tue, 25/10/2022 - 5:30am
Abstract
Motivation: Catastrophic transitions are ubiquitous in the dynamic progression of complex biological systems; that is, a critical transition occurs at which a complex system suddenly shifts from one stable state to another. Identifying such a critical point or tipping point is essential for revealing the underlying mechanism of complex biological systems. However, it is difficult to identify the tipping point since few significant differences in the critical state are detected in terms of traditional static measurements.
Results: In this study, by exploring the dynamic changes in gene cooperative effects between the before-transition and critical states, we present a model-free approach, the directed-network rank score (DNRS), to detect the early-warning signal of critical transition in complex biological systems. The proposed method is applicable to both bulk and single-cell RNA-sequencing (scRNA-seq) data. This computational method was validated by the successful identification of the critical or pre-transition state for both simulated and six real datasets, including three scRNA-seq datasets of embryonic development and three tumor datasets. In addition, the functional and pathway enrichment analyses suggested that the corresponding DNRS signaling biomarkers were involved in key biological processes.
Availability: The source code is freely available at https://github.com/zhongjiayuan/DNRS
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

GlycoEnzOnto: A GlycoEnzyme Pathway and Molecular Function Ontology

Bioinformatics Oxford Journals - Tue, 25/10/2022 - 5:30am
Abstract
Motivation: The ‘glycoEnzymes’ include a set of proteins having related enzymatic, metabolic, transport, structural and cofactor functions. Currently there is no established ontology to describe glycoEnzyme properties and to relate them to glycan biosynthesis pathways.
Results: We present GlycoEnzOnto, an ontology describing 403 human glycoEnzymes curated along 139 glycosylation pathways, 134 molecular functions and 22 cellular compartments. The pathways described regulate nucleotide-sugar metabolism, glycosyl-substrate/donor transport, glycan biosynthesis, and degradation. The role of each enzyme in the glycosylation initiation, elongation/branching, and capping/termination phases is described. IUPAC linear strings present systematic human/machine readable descriptions of individual reaction steps and enable automated knowledge-based curation of biochemical networks. All GlycoEnzOnto knowledge is integrated with the Gene Ontology (GO) biological processes. GlycoEnzOnto enables improved transcript overrepresentation analyses and glycosylation pathway identification compared to other available schema, e.g. KEGG and Reactome. Overall, GlycoEnzOnto represents a holistic glycoinformatics resource for systems-level analyses.
Availability: https://github.com/neel-lab/GlycoEnzOnto
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

EcoTransLearn: an R-package to easily use Transfer Learning for Ecological Studies. A plankton case study

Bioinformatics Oxford Journals - Tue, 25/10/2022 - 5:30am
Abstract
Summary: In recent years, Deep Learning (DL) has been increasingly used in many fields, in particular in image recognition, due to its ability to solve problems where traditional machine learning algorithms fail. However, building an appropriate DL model from scratch, especially in the context of ecological studies, is a difficult task due to the dynamic nature and morphological variability of living organisms, as well as the high cost in terms of time, human resources and skills required to label a large number of training images. To overcome this problem, Transfer Learning (TL) can be used to improve a classifier by transferring information learnt from many domains, thanks to a very large training set composed of various images, to another domain with a smaller amount of training data. To compensate for the lack of “easy-to-use” software optimized for ecological studies, we propose the EcoTransLearn R-package, which allows greater automation in classification of images acquired with various devices (FlowCam, ZooScan, photographs, etc.), thanks to different TL methods pre-trained on the generic ImageNet dataset.
Availability and Implementation: EcoTransLearn is an open-source package. It is implemented in R, and calls Python scripts for the image classification step (using the reticulate and tensorflow libraries). The source code, instruction manual and examples can be found at https://github.com/IFREMER-LERBL/EcoTransLearn.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

The LCD-Composer Webserver: High-Specificity Identification and Functional Analysis of Low-Complexity Domains in Proteins

Bioinformatics Oxford Journals - Tue, 25/10/2022 - 5:30am
Abstract
Summary: Low-complexity domains (LCDs) in proteins are regions enriched in a small subset of amino acids. LCDs exist in all domains of life, often have unusual biophysical behavior, and function in both normal and pathological processes. We recently developed an algorithm to identify LCDs based predominantly on amino acid composition thresholds. Here, we have integrated this algorithm with a webserver and augmented it with additional analysis options. Specifically, users can: 1) search for LCDs in whole proteomes by setting minimum composition thresholds for individual or grouped amino acids, 2) submit a known LCD sequence to search for similar LCDs, 3) search for and plot LCDs within a single protein, 4) statistically test for enrichment of LCDs within a user-provided protein set, and 5) specifically identify proteins with multiple types of LCDs.
Availability: The LCD-Composer server can be accessed at http://lcd-composer.bmb.colostate.edu. The corresponding command-line scripts can be accessed at https://github.com/RossLabCSU/LCD-Composer/tree/master/WebserverScripts.
Categories: Bioinformatics Trends
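
The abstract above describes identifying LCDs through amino acid composition thresholds. A minimal sliding-window sketch of that general idea follows; it is not the LCD-Composer algorithm, and the window size, threshold, function name, and example sequence are assumptions chosen for illustration.

```python
def find_lcds(sequence, residues, window=20, min_fraction=0.4):
    """Return merged (start, end) half-open intervals in which every
    contributing window has at least `min_fraction` of its positions
    occupied by one of the given residues."""
    residues = set(residues)
    hits = []
    for start in range(len(sequence) - window + 1):
        segment = sequence[start:start + window]
        fraction = sum(aa in residues for aa in segment) / window
        if fraction >= min_fraction:
            hits.append((start, start + window))

    # Merge overlapping or adjacent passing windows into single domains.
    merged = []
    for start, end in hits:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy protein with a ten-residue glutamine run starting at index 3.
protein = "MKT" + "Q" * 10 + "AVLGE"
print(find_lcds(protein, "Q", window=5, min_fraction=0.8))  # [(2, 14)]
```

Passing several residues (for example `"QN"`) treats them as a group, which loosely mirrors the grouped-amino-acid thresholds mentioned in the abstract.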

AHoJ: rapid, tailored search and retrieval of apo and holo protein structures for user-defined ligands

Bioinformatics Oxford Journals - Tue, 25/10/2022 - 5:30am
Abstract
Summary: Understanding the mechanism of action of a protein or designing better ligands for it often requires access to a bound (holo) and an unbound (apo) state of the protein. Resources for the quick and easy retrieval of such conformations are severely limited. Apo-Holo Juxtaposition (AHoJ) is a web application for retrieving apo-holo structure pairs for user-defined ligands. Given a query structure and one or more user-specified ligands, it retrieves all other structures of the same protein that feature the same binding site(s), aligns them, and examines the superimposed binding sites to determine whether each structure is apo or holo, in reference to the query. The resulting superimposed datasets of apo-holo pairs can be visualized and downloaded for further analysis. AHoJ accepts multiple input queries, allowing the creation of customized apo-holo datasets.
Availability: Freely available for non-commercial use at http://apoholo.cz. Source code available at https://github.com/cusbg/AHoJ-project.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Correction to: Efficient permutation-based genome-wide association studies for normal and skewed phenotypic distributions

Bioinformatics Oxford Journals - Sat, 22/10/2022 - 5:30am
This is a correction to: Maura John, Markus J. Ankenbrand, Carolin Artmann, Jan A. Freudenthal, Arthur Korte, and Dominik G. Grimm, Efficient permutation-based genome-wide association studies for normal and skewed phenotypic distributions, Bioinformatics, Volume 38, Issue Supplement_2, September 2022, Pages ii5–ii12, https://doi.org/10.1093/bioinformatics/btac455
Categories: Bioinformatics Trends

OmicsEV: a tool for comprehensive quality evaluation of omics data tables

Bioinformatics Oxford Journals - Sat, 22/10/2022 - 5:30am
Abstract
Summary: RNA-Seq and mass spectrometry-based studies generate omics data tables with measurements for tens of thousands of genes across all samples in a study. The success of a study relies on the quality of these data tables, which is determined by both experimental data generation and computational methods used to process raw experimental data into quantitative data tables. We present OmicsEV, an R package for quality evaluation of omics data tables. For each data table, OmicsEV uses a series of methods to evaluate data depth, data normalization, batch effect, biological signal, platform reproducibility, and multi-omics concordance, producing comprehensive visual and quantitative evaluation results that help assess data quality of individual data tables and facilitate the identification of the optimal data processing method and parameters for the omics study under investigation.
Availability: The source code and the user manual of OmicsEV are available at https://github.com/bzhanglab/OmicsEV, and the source code is released under the GPL-3 license.
Categories: Bioinformatics Trends

CRISPRon/off: CRISPR/Cas9 on- and off-target gRNA design

Bioinformatics Oxford Journals - Sat, 22/10/2022 - 5:30am
Abstract
Summary: The effectiveness of CRISPR/Cas9-mediated genome editing experiments largely depends on the guide RNA (gRNA) used by the CRISPR/Cas9 system for target recognition and cleavage activation. Careful design is necessary to select a gRNA with high editing efficiency at the on-target site and with minimum off-target potential. Here we present our webserver for gRNA design with a user-friendly graphical interface, which provides interoperability between our on- and off-target prediction tools, CRISPRon and CRISPRoff, for a complete and streamlined gRNA selection.
Availability and implementation: The graphical interface uses the Integrative Genomic Viewer (IGV) JavaScript plugin. The backend tools are implemented in Python and C. The CRISPRon and CRISPRoff webservers and command-line tools are freely available at https://rth.dk/resources/crispr.
Categories: Bioinformatics Trends

DeepPerVar: a multimodal deep learning framework for functional interpretation of genetic variants in personal genome

Bioinformatics Oxford Journals - Sat, 22/10/2022 - 5:30am
Abstract
Motivation: Understanding the functional consequence of genetic variants, especially the noncoding ones, is important but particularly challenging. Genome-wide association studies or quantitative trait locus analyses may be subject to limited statistical power and linkage disequilibrium, and thus are less optimal to pinpoint the causal variants. Moreover, most existing machine learning approaches, which exploit the functional annotations to interpret and prioritize putative causal variants, cannot accommodate the heterogeneity of personal genetic variations and traits in a population study, targeting a specific disease.
Results: By leveraging paired whole genome sequencing data and epigenetic functional assays in a population study, we propose a multi-modal deep learning framework to predict genome-wide quantitative epigenetic signals by considering both personal genetic variations and traits. The proposed approach can further evaluate the functional consequence of noncoding variants on an individual level by quantifying the allelic difference of predicted epigenetic signals. By applying the approach to the ROSMAP cohort studying Alzheimer’s disease (AD), we demonstrate that the proposed approach can accurately predict quantitative genome-wide epigenetic signals and in key genomic regions of AD causal genes, learn canonical motifs reported to regulate gene expression of AD causal genes, improve the partitioning heritability analysis, and prioritize putative causal variants in a GWAS risk locus. Finally, we release the proposed deep learning model as a stand-alone Python toolkit and a web server.
Availability: https://github.com/lichen-lab/DeepPerVar
Categories: Bioinformatics Trends

Defining the extent of gene function using ROC curvature

Bioinformatics Oxford Journals - Sat, 22/10/2022 - 5:30am
Abstract
Motivation: Interactions between proteins help us understand how genes are functionally related and how they contribute to phenotypes. Experiments provide imperfect “ground truth” information about a small subset of potential interactions in a specific biological context, which can then be extended to the whole genome across different contexts, such as conditions, tissues, or species, through machine learning methods. However, evaluating the performance of these methods remains a critical challenge. Here, we propose to evaluate the generalizability of gene characterizations through the shape of performance curves.
Results: We identify Functional Equivalence Classes (FECs), subsets of annotated and unannotated genes that jointly drive performance, by assessing the presence of straight lines in ROC curves built from gene-centric prediction tasks, such as function or interaction predictions. FECs are widespread across data types and methods and can be used to evaluate the extent and context-specificity of functional annotations in a data-driven manner. For example, FECs suggest that B cell markers can be decomposed into shared primary markers (10 to 50 genes), and tissue-specific secondary markers (100 to 500 genes). In addition, FECs suggest the existence of functional modules that span a wide range of the genome, with marker sets spanning at most 5% of the genome and data-driven extensions of Gene Ontology sets spanning up to 40% of the genome. Simple to assess visually and statistically, the identification of FECs in performance curves paves the way for novel functional characterization and increased robustness in the definition of functional gene sets.
Availability: Code for analyses and figures is available at https://github.com/yexilein/pyroc.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

LaGAT: Link-aware Graph Attention Network for Drug-Drug Interaction Prediction

Bioinformatics Oxford Journals - Sat, 22/10/2022 - 5:30am
Abstract
Motivation: Drug-drug interaction (DDI) prediction is a challenging problem in pharmacology and clinical applications. With the increasing availability of large biomedical databases, large-scale biological knowledge graphs containing drug information have been widely used for DDI prediction. However, large knowledge graphs inevitably suffer from data noise problems, which limit the performance and interpretability of models based on the knowledge graph. Recent studies attempt to improve models by introducing inductive bias through an attention mechanism. However, they all only depend on the topology of entity nodes independently to generate fixed attention pathways, without considering the semantic diversity of entity nodes in different drug pair links. This makes it difficult for models to select more meaningful nodes to overcome data quality limitations and make more interpretable predictions.
Results: To address this issue, we propose a Link-aware Graph Attention method for DDI prediction, called LaGAT, which is able to generate different attention pathways for drug entities based on different drug pair links. For a drug pair link, the LaGAT uses the embedding representation of one of the drugs as a query vector to calculate the attention weights, thereby selecting the appropriate topological neighbor nodes to obtain the semantic information of the other drug. We separately conduct experiments on binary and multi-class classification and visualize the attention pathways generated by the model. The results prove that LaGAT can better capture semantic relationships and achieves remarkably superior performance over both the classical and state-of-the-art models on DDI prediction.
Availability: The source code and data are available at https://github.com/Azra3lzz/LaGAT
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
