
SYNPHONI: scale-free & phylogeny-aware reconstruction of synteny conservation & transformation across animal genomes

Bioinformatics Oxford Journals - Fri, 21/10/2022 - 5:30am
Abstract
Summary: Current approaches detect conserved genomic order either at chromosomal (macro-synteny) or at subchromosomal (micro-synteny) scales. The latter generally requires collinearity and hard thresholds on syntenic region size, thus excluding a major proportion of syntenies with recent expansions or minor rearrangements. SYNPHONI bridges the gap between micro- and macro-synteny detection, providing detailed information on both synteny conservation and transformation throughout the evolutionary history of animal genomes.
Availability and implementation: Source code is freely available at https://github.com/nsmro/SYNPHONI, implemented in Python 3.9.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Improving and evaluating deep learning models of cellular organization

Bioinformatics Oxford Journals - Thu, 20/10/2022 - 5:30am
Abstract
Motivation: Cells contain dozens of major organelles and thousands of other structures, many of which vary extensively in their number, size, shape and spatial distribution. This complexity and variation dramatically complicates the use of both traditional and deep learning methods to build accurate models of cell organization. Most cellular organelles are distinct objects with defined boundaries that do not overlap, while the pixel resolution of most imaging methods is not sufficient to resolve these boundaries. Thus, while cell organization is conceptually object-based, most current methods are pixel-based. Using extensive image collections in which particular organelles were fluorescently labeled, deep learning methods can be used to build conditional autoencoder models for particular organelles. A major advance occurred with the use of a U-net approach to make multiple models all conditional upon a common unlabeled reference image, allowing the relationships between different organelles to be at least partially inferred.
Results: We have developed improved GAN-based approaches for learning these models and novel criteria for evaluating how well synthetic cell images reflect the properties of real images. The first set of criteria measures how well models preserve the expected property that organelles do not overlap. We also developed a modified loss function that allows retraining of the models to minimize that overlap. The second set of criteria uses object-based modeling to compare object shape and spatial distribution between synthetic and real images. Our work provides the first demonstration that, at least for some organelles, deep learning models can capture object-level properties of cell images.
Availability: http://murphylab.cbd.cmu.edu/Software/2022_insilico
Supplementary information: Supplementary data are available at Bioinformatics online.
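The non-overlap criterion described in the abstract lends itself to a simple quantitative check. Below is a minimal sketch, not the paper's actual metric: the function name and the choice of normalizing by the smaller mask's area are illustrative assumptions.

```python
import numpy as np

def overlap_fraction(mask_a, mask_b):
    """Overlap between two organelle segmentation masks: the number of
    shared pixels divided by the smaller mask's area (0 = disjoint,
    1 = one mask fully contained in the other)."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    inter = np.logical_and(mask_a, mask_b).sum()
    smaller = min(mask_a.sum(), mask_b.sum())
    return inter / smaller if smaller else 0.0
```

A model that preserves the object-level property would keep this value near zero for pairs of distinct organelles in synthetic images.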
Categories: Bioinformatics Trends

A novel pipeline for computerized mouse spermatogenesis staging

Bioinformatics Oxford Journals - Thu, 20/10/2022 - 5:30am
Abstract
Motivation: Differentiating the 12 stages of the mouse seminiferous epithelial cycle is vital to understanding the dynamic spermatogenesis process. However, it is challenging, since two adjacent spermatogenic stages are morphologically similar. Distinguishing Stages I-III from Stages IV-V is important for histologists to understand sperm development in wildtype mice and spermatogenic defects in infertile mice. To achieve this, we propose a novel pipeline for Computerized Spermatogenesis Staging (CSS).
Results: The CSS pipeline comprises four parts: (1) a seminiferous tubule segmentation model is developed to extract every single tubule; (2) a Multi-Scale Learning (MSL) model is developed to integrate local and global information of a seminiferous tubule to distinguish Stages I-V from Stages VI-XII; (3) a Multi-Task Learning (MTL) model is developed to segment the Multiple Testicular Cells (MTCs) for Stages I-V without an exhaustive requirement for manual annotation; (4) a set of 204-dimensional image-derived features is developed to discriminate Stages I-III from Stages IV-V by capturing cell-level and image-level representation. Experimental results suggest that the proposed MSL and MTL models outperform classic single-scale and single-task models when manual annotation is limited. In addition, the proposed image-derived features are discriminative between Stages I-III and Stages IV-V. In conclusion, the CSS pipeline can not only provide histologists with a solution to facilitate quantitative analysis for spermatogenesis stage identification but also help them to uncover novel computerized image-derived biomarkers.
Availability and implementation: https://github.com/jydada/CSS
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MAGScoT - a fast, lightweight, and accurate bin-refinement tool

Bioinformatics Oxford Journals - Thu, 20/10/2022 - 5:30am
Abstract
Motivation: Recovery of metagenome-assembled genomes (MAGs) from shotgun metagenomic data is an important task for the comprehensive analysis of microbial communities from variable sources. Individual binning tools differ in their ability to leverage specific aspects of MAG reconstruction; the use of ensemble bin-refinement tools is often time consuming, and their computational demand increases with community complexity. We introduce MAGScoT, a fast, lightweight and accurate implementation for the reconstruction of highest-quality MAGs from the output of multiple genome-binning tools.
Results: MAGScoT outperforms popular bin-refinement solutions in terms of quality and quantity of MAGs as well as computation time and resource consumption.
Availability: MAGScoT is available via GitHub (https://github.com/ikmb/MAGScoT) and as an easy-to-use Docker container (https://hub.docker.com/repository/docker/ikmb/magscot).
Supplementary information: Supplementary data are available at Bioinformatics online. All scripts to produce the binning results and the subsequent refinement are available via GitHub (https://github.com/mruehlemann/MAGScoT_benchmarking_scripts).
Categories: Bioinformatics Trends

Powerful and interpretable control of false discoveries in two-group differential expression studies

Bioinformatics Oxford Journals - Thu, 20/10/2022 - 5:30am
Abstract
Motivation: The standard approach for statistical inference in differential expression (DE) analyses is to control the false discovery rate (FDR). However, controlling the FDR does not in fact imply that the proportion of false discoveries is upper bounded. Moreover, no statistical guarantee can be given on subsets of genes selected by FDR thresholding. These known limitations are overcome by post hoc inference, which provides guarantees on the number or proportion of false discoveries among arbitrary gene selections. However, post hoc inference methods are not yet widely used for DE studies.
Results: In this paper, we demonstrate the relevance and illustrate the performance of adaptive interpolation-based post hoc methods for two-group DE studies. First, we formalize the use of permutation-based methods to obtain sharp confidence bounds that are adaptive to the dependence between genes. Then, we introduce a generic linear-time algorithm for computing post hoc bounds, making these bounds applicable to large-scale two-group DE studies. The use of the resulting Adaptive Simes bound is illustrated on an RNA sequencing study. Comprehensive numerical experiments based on real microarray and RNA sequencing data demonstrate the statistical performance of the method.
Availability and implementation: A cross-platform open-source implementation within the R package sanssouci is available at https://sanssouci-org.github.io/sanssouci/.
Supplementary information: Supplementary data are available at Bioinformatics online. Rmarkdown vignettes for the differential analysis of microarray and RNAseq data are available from the package.
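To make the idea of a post hoc bound concrete: the classical (non-adaptive) Simes bound upper-bounds the number of false positives in any gene selection, whatever p-value thresholding produced it. The sketch below shows that classical bound, which the paper's adaptive, permutation-calibrated bounds refine; the function name and signature are illustrative, not the sanssouci API.

```python
import numpy as np

def simes_posthoc_bound(p_selected, m, alpha=0.05):
    """Upper confidence bound (at level 1 - alpha) on the number of
    false positives among an arbitrary selection of p-values, out of
    m tested hypotheses, under the Simes inequality:
        V(S) <= min_k ( #{i in S : p_i > alpha*k/m} + k - 1 )."""
    p = np.asarray(p_selected, dtype=float)
    s = len(p)
    bounds = [np.sum(p > alpha * k / m) + k - 1 for k in range(1, s + 1)]
    return int(min(min(bounds), s))  # a selection holds at most s false positives
```

For a selection of five very small p-values out of 100 genes, the bound certifies zero false positives; for five large p-values it gives the trivial bound of five.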
Categories: Bioinformatics Trends

ATLIGATOR: Editing protein interactions with an atlas-based approach

Bioinformatics Oxford Journals - Wed, 19/10/2022 - 5:30am
Abstract
Motivation: Recognition of specific molecules by proteins is a fundamental cellular mechanism and relevant for many applications. Being able to modify binding is a key interest and can be achieved by repurposing established interaction motifs. We were specifically interested in a methodology for the design of peptide binding modules. By leveraging interaction data from known protein structures, we aim to accelerate the design of novel protein or peptide binders.
Results: We developed ATLIGATOR, a computational method to support the analysis and design of a protein's interaction with a single side chain. Our program enables the building of interaction atlases based on structures from the PDB. From these atlases, pocket definitions are extracted that can be searched for frequent interactions. These searches can reveal similarities in unrelated proteins, as we show here for one example. Such frequent interactions can then be grafted onto a new protein scaffold as a starting point of the design process. The ATLIGATOR tool is made accessible through a Python API as well as a CLI with Python scripts.
Availability and implementation: Source code can be downloaded from GitHub (https://www.github.com/Hoecker-Lab/atligator) or installed from PyPI ("atligator") and is implemented in Python 3.
Categories: Bioinformatics Trends

CoxMKF: A Knockoff Filter for High-Dimensional Mediation Analysis with a Survival Outcome in Epigenetic Studies

Bioinformatics Oxford Journals - Tue, 18/10/2022 - 5:30am
Abstract
Motivation: It is of scientific interest to identify DNA methylation CpG sites that might mediate the effect of an environmental exposure on a survival outcome in high-dimensional mediation analysis. However, there is a lack of powerful statistical methods that can provide a guarantee of false discovery rate (FDR) control in finite-sample settings.
Results: In this article, we propose a novel method called CoxMKF, which applies aggregation of multiple knockoffs to a Cox proportional hazards model for a survival outcome with high-dimensional mediators. CoxMKF can achieve FDR control even in finite-sample settings, which is particularly advantageous when the sample size is not large. Moreover, CoxMKF can overcome the randomness of the unstable model-X knockoffs. Our simulation results show that CoxMKF controls the FDR well in finite samples. We further apply CoxMKF to a lung cancer dataset from The Cancer Genome Atlas (TCGA) project with 754 subjects and 365,306 DNA methylation CpG sites, and identify four DNA methylation CpG sites that might mediate the effect of smoking on overall survival among lung cancer patients.
Availability: The R package CoxMKF is publicly available at https://github.com/MinhaoYaooo/CoxMKF.
Supplementary information: Supplementary data are available at Bioinformatics online.
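The knockoff machinery the abstract builds on selects variables with a data-dependent threshold on per-feature statistics W (large positive values indicate likely signal). A minimal sketch of the standard knockoff+ selection rule follows; CoxMKF's specific contribution, aggregating over multiple knockoff draws to tame their randomness, is not shown, and the function name is illustrative.

```python
import numpy as np

def knockoff_select(W, q=0.1):
    """Knockoff+ selection rule: find the smallest threshold t at which
    the estimated false discovery proportion
        (1 + #{W_j <= -t}) / max(1, #{W_j >= t})
    drops to q or below, then select features with W_j >= t."""
    W = np.asarray(W, dtype=float)
    for t in np.sort(np.abs(W[W != 0])):       # candidate thresholds
        fdp_hat = (1 + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if fdp_hat <= q:
            return np.flatnonzero(W >= t)
    return np.array([], dtype=int)             # nothing passes: select none
```

Because a fresh knockoff draw yields fresh W statistics, single-draw selections fluctuate; aggregating several draws, as CoxMKF does, stabilizes the selected set.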
Categories: Bioinformatics Trends

AIscEA: Unsupervised Integration of Single-cell Gene Expression and Chromatin Accessibility via Their Biological Consistency

Bioinformatics Oxford Journals - Mon, 17/10/2022 - 5:30am
Abstract
Motivation: The integrative analysis of single-cell gene expression and chromatin accessibility measurements is essential for revealing gene regulation, but it is one of the key challenges in computational biology. Gene expression and chromatin accessibility are measurements from different modalities, and no common features can be directly used to guide integration. Current state-of-the-art methods lack practical solutions for handling heterogeneous clusters and might not generate reliable results when cluster heterogeneity exists. More importantly, current methods lack an effective way to select hyper-parameters in an unsupervised setting. Therefore, applying computational methods to integrate single-cell gene expression and chromatin accessibility measurements remains difficult.
Results: We introduce AIscEA (Alignment-based Integration of single-cell gene Expression and chromatin Accessibility), a computational method that integrates single-cell gene expression and chromatin accessibility measurements using their biological consistency. AIscEA first defines a ranked similarity score to quantify the biological consistency between cell clusters across measurements. AIscEA then uses the ranked similarity score and a novel permutation test to identify cluster alignment across measurements. AIscEA further utilizes graph alignment for the aligned cell clusters to align the cells across measurements. We compared AIscEA with the competing methods on several benchmark datasets and demonstrated that AIscEA is highly robust to the choice of hyper-parameters and can better handle the cluster heterogeneity problem. Furthermore, AIscEA significantly outperforms the state-of-the-art methods when integrating real-world SNARE-seq and scMultiome-seq datasets in terms of integration accuracy.
Availability: AIscEA is available at https://figshare.com/articles/software/AIscEA_zip/21291135 on FigShare as well as https://github.com/elhaam/AIscEA on GitHub.
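The permutation test mentioned above follows a generic recipe: compare an observed similarity score against a null distribution built by shuffling one side of the pairing. A sketch of that recipe is below; AIscEA's actual ranked similarity score is not reproduced here, so `score_fn` is a stand-in and all names are illustrative.

```python
import numpy as np

def permutation_pvalue(observed, score_fn, items, n_perm=999, seed=0):
    """Generic one-sided permutation test: p-value for an observed
    score against a null built by permuting `items`. `score_fn` maps
    a permuted item order to a score (higher = more similar)."""
    rng = np.random.default_rng(seed)
    items = np.asarray(items)
    null = np.empty(n_perm)
    for i in range(n_perm):
        null[i] = score_fn(rng.permutation(items))
    # add-one correction keeps the p-value away from exactly zero
    return (1 + np.sum(null >= observed)) / (n_perm + 1)
```

A cluster pairing whose observed score exceeds virtually all permuted scores gets a small p-value and is accepted as an alignment.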
Categories: Bioinformatics Trends

CEDA: integrating gene expression data with CRISPR pooled screen data identifies essential genes with higher expression

Bioinformatics Oxford Journals - Mon, 17/10/2022 - 5:30am
Abstract
Motivation: CRISPR-based genetic perturbation screens are a powerful tool to probe gene function. However, experimental noise, especially for lowly expressed genes, needs to be accounted for to maintain proper control of the false positive rate.
Methods: We develop a statistical method, named CRISPR screen with Expression Data Analysis (CEDA), to integrate gene expression profiles and CRISPR screen data for identifying essential genes. CEDA stratifies genes based on expression level and adopts a three-component mixture model for the log-fold change of single-guide RNAs (sgRNAs). An empirical Bayesian prior and the expectation-maximization (EM) algorithm are used for parameter estimation and false discovery rate inference.
Results: Taking advantage of gene expression data, CEDA identifies essential genes with higher expression. Compared to existing methods, CEDA shows comparable reliability but higher sensitivity in detecting essential genes with moderate sgRNA fold change. Therefore, using the same CRISPR data, CEDA generates an additional hit gene list.
Supplementary information: Supplementary data are available at Bioinformatics online.
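A three-component mixture fit by EM, as described above, can be sketched in a few lines for the one-dimensional case (sgRNA log-fold changes split into depleted / null / enriched components). This is a plain Gaussian-mixture EM for illustration only; CEDA's actual model adds an empirical Bayesian prior and expression-level stratification, which are omitted here.

```python
import numpy as np

def em_mixture3(x, n_iter=100):
    """EM for a three-component 1-D Gaussian mixture.
    Returns (weights, means, variances)."""
    x = np.asarray(x, dtype=float)
    mu = np.quantile(x, [0.1, 0.5, 0.9])       # spread-out initialization
    var = np.full(3, x.var())
    pi = np.full(3, 1.0 / 3)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        dens = pi * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and variances
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-8
    return pi, mu, var
```

After fitting, the posterior responsibility of the "depleted" component for each sgRNA is the quantity an FDR calculation would be built on.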
Categories: Bioinformatics Trends

scGNN 2.0: a graph neural network tool for imputation and clustering of single-cell RNA-Seq data

Bioinformatics Oxford Journals - Mon, 17/10/2022 - 5:30am
Abstract
Motivation: Gene expression imputation has become an essential step of the single-cell RNA-Seq data analysis workflow. Among several deep learning methods, the debut of scGNN gained substantial recognition in 2021 for its superior performance and its ability to produce a cell-cell graph. However, the implementation of scGNN was relatively time-consuming and its performance could still be optimized.
Results: The implementation of scGNN 2.0 is significantly faster than scGNN thanks to a simplified closed-loop architecture. For all eight datasets, cell clustering performance was increased by 85.02% on average in terms of adjusted Rand index, and the imputation median L1 error was reduced by 67.94% on average. With the built-in visualizations, users can quickly assess the imputation and cell clustering results, compare against benchmarks, and interpret the cell-cell interactions. The expanded input and output formats also pave the way for custom workflows that integrate scGNN 2.0 with other scRNA-Seq toolkits on both Python and R platforms.
Availability: scGNN 2.0 is implemented in Python (version 3.8) with the source code available at https://github.com/OSU-BMBL/scGNN2.0.
Supplementary information: Supplementary files are available at Bioinformatics online.
Categories: Bioinformatics Trends

Correction to: plotsr: visualizing structural similarities and rearrangements between multiple genomes

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
This is a correction to: Manish Goel and Korbinian Schneeberger, plotsr: visualizing structural similarities and rearrangements between multiple genomes, Bioinformatics, Volume 38, Issue 10, 15 May 2022, https://doi.org/10.1093/bioinformatics/btac196
Categories: Bioinformatics Trends

Estimation of Speciation Times Under the Multispecies Coalescent

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: The multispecies coalescent model is now widely accepted as an effective model for incorporating variation in the evolutionary histories of individual genes into methods for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent can be computationally expensive for large datasets, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and associated parameters, including speciation times and effective population sizes.
Results: We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site pattern probabilities can be computed under the assumption of a constant θ throughout the species tree. We demonstrate that the MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the nonparametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset for gibbons.
Availability and implementation: The method has been implemented in the PAUP* program, freely available at https://paup.phylosolutions.com for Macintosh, Windows, and Linux operating systems.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

E-SNPs&GO: Embedding of protein sequence and function improves the annotation of human pathogenic variants

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly changing their sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage these new encoding schemes for an efficient annotation of protein variants.
Results: E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, predicts whether the variation is disease-related or neutral. The proposed method adopts an input encoding completely based on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101,146 human protein single amino acid variants in 13,661 proteins, derived from public resources. When tested on a blind set comprising 10,266 variants, our method compares well with recent approaches released in the literature for the same task, reaching a Matthews correlation coefficient (MCC) score of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets.
Availability: The method is available as a webserver at https://esnpsandgo.biocomp.unibo.it. Datasets and predictions are available at https://esnpsandgo.biocomp.unibo.it/datasets.
Supplementary information: Supplementary data are available at Bioinformatics online.
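For readers unfamiliar with the MCC score reported above, it is computed from confusion-matrix counts and ranges from -1 (total disagreement) through 0 (random) to +1 (perfect). A small self-contained helper (illustrative; any metrics library provides the same):

```python
def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from true/false positive and
    negative counts:
        (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    num = tp * tn - fp * fn
    den = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return num / den if den else 0.0  # define MCC = 0 for degenerate tables
```

Unlike plain accuracy, the MCC stays informative on imbalanced variant datasets, which is why it is the metric of choice for this task.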
Categories: Bioinformatics Trends

Biomedical Evidence Engineering for Data-Driven Discovery

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: With the rapid development of precision medicine, a large amount of health data (such as electronic health records, gene sequencing and medical images) has been produced, encouraging growing interest in data-driven insight discovery from these data. A reasonable way to verify the derived insights is by checking evidence from the biomedical literature. However, manual verification is inefficient and not scalable, so an intelligent technique is necessary to solve this problem.
Results: This paper introduces a framework for biomedical evidence engineering that addresses this problem more effectively. The framework consists of a biomedical literature retrieval module and an evidence extraction module. The retrieval module ensembles several methods and achieves state-of-the-art performance in biomedical literature retrieval. A BERT-based evidence extraction model is proposed to extract evidence from the literature in response to queries. Moreover, we create a dataset with 1 million examples of biomedical evidence, 10,000 of which are manually annotated.
Availability: Datasets are available at https://github.com/SendongZhao.
Categories: Bioinformatics Trends

Evaluation of efficiency prediction algorithms and development of ensemble model for CRISPR/Cas9 gRNA selection

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: The CRISPR/Cas9 system is widely used for genome editing. The editing efficiency of CRISPR/Cas9 is mainly determined by the guide RNA (gRNA). Although many computational algorithms have been developed in recent years, it is still a challenge to select optimal bioinformatics tools for gRNA design in different experimental settings.
Results: We performed a comprehensive comparison of fifteen public algorithms for gRNA design, using fifteen experimental gRNA datasets. Based on this analysis, we identified the top-performing algorithms, with which we further implemented various computational strategies to build ensemble models for performance improvement. Validation analysis indicates that the new ensemble model had improved performance over any individual algorithm alone at predicting gRNA efficacy under various experimental conditions.
Availability: The new sgRNA design tool is freely accessible as a web application at https://crisprdb.org. The source code and stand-alone version are available at Figshare (https://doi.org/10.6084/m9.figshare.21295863) and GitHub (https://github.com/wang-lab/CRISPRDB).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Leveraging a Pharmacogenomics Knowledge-base to Formulate a Drug Response Phenotype Terminology for Genomic Medicine

Bioinformatics Oxford Journals - Wed, 12/10/2022 - 5:30am
Abstract
Motivation: Despite increasing evidence of the utility of genomic medicine in clinical practice, systematically integrating genomic medicine information and knowledge into clinical systems with a high level of consistency, scalability and computability remains challenging. A comprehensive terminology is required for the relevant concepts, along with an associated knowledge model for representing relationships.
Methods: In this study, we leveraged PharmGKB, a comprehensive pharmacogenomics (PGx) knowledgebase, to formulate a terminology for drug response phenotypes that can represent relationships between genetic variants and treatments. We evaluated the coverage of the terminology through manual review of a randomly selected subset of 200 sentences extracted from genetic reports that contained concepts for "Genes and Gene Products" and "Treatments".
Results: Results showed that our proposed drug response phenotype terminology could cover 96% of the drug response phenotypes in genetic reports. Among 18,653 sentences that contained both "Genes and Gene Products" and "Treatments", 3,011 sentences could be mapped to a drug response phenotype in our proposed terminology, among which the most discussed drug response phenotypes were response (994), sensitivity (829) and survival (332). In addition, we were able to re-analyze genetic report context incorporating the proposed terminology and enrich our previously proposed PGx knowledge model to reveal relationships between genetic variants and treatments.
Conclusion: We proposed a drug response phenotype terminology that enhances structured knowledge representation of genomic medicine.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

METAbolomics data Balancing with Over-sampling Algorithms (META-BOA): an online resource for addressing class imbalance

Bioinformatics Oxford Journals - Wed, 12/10/2022 - 5:30am
Abstract
Motivation: Class imbalance, or unequal sample sizes between classes, is an increasing concern in machine learning for metabolomic and lipidomic data mining, as it can result in overfitting for the over-represented class. Numerous methods have been developed for handling class imbalance, but they are not readily accessible to users with limited computational experience. Moreover, there is no resource that enables users to easily evaluate the effect of different over-sampling algorithms.
Results: METAbolomics data Balancing with Over-sampling Algorithms (META-BOA) is a web-based application that enables users to select between four different methods for class balancing, followed by data visualization and classification of the samples to observe the augmentation effects. META-BOA outputs a newly balanced dataset, generating additional samples in the minority class according to the user's choice of Synthetic Minority Over-sampling Technique (SMOTE), Borderline-SMOTE (BSMOTE), Adaptive Synthetic (ADASYN) or Random Over-Sampling Examples (ROSE). To present the effect of over-sampling on the data, META-BOA further displays both principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) visualizations of the data pre- and post-over-sampling. Random forest classification is used to compare sample classification in the original and balanced datasets, enabling users to select the most appropriate method for their further analyses.
Availability and implementation: META-BOA is available at https://complimet.ca/meta-boa.
Supplementary information: Supplementary material is available at Bioinformatics online.
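The core idea behind SMOTE and its variants listed above is simple: synthesize new minority-class samples by interpolating between a minority point and one of its k nearest minority neighbours. Below is a toy numpy sketch of that idea for illustration; it is not the META-BOA or imbalanced-learn implementation, and all names are illustrative.

```python
import numpy as np

def smote_like(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation
    between random minority points and their k nearest minority
    neighbours (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # never pick yourself
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                      # random minority point
        b = nbrs[a, rng.integers(min(k, n - 1))] # one of its neighbours
        lam = rng.random()                       # interpolation weight
        out[i] = X_min[a] + lam * (X_min[b] - X_min[a])
    return out
```

Because each synthetic point is a convex combination of two real minority samples, the augmented class stays inside the original minority region of feature space.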
Categories: Bioinformatics Trends

Prediction of Drug-likeness using Graph Convolutional Attention Network

Bioinformatics Oxford Journals - Wed, 12/10/2022 - 5:30am
Abstract
Motivation: Drug-likeness has been widely used as a criterion to distinguish drug-like molecules from non-drugs. Developing reliable computational methods to predict the drug-likeness of compounds is crucial to triage unpromising molecules and accelerate the drug discovery process.
Results: In this study, a deep learning method was developed to predict drug-likeness directly from molecular structures, based on a graph convolutional attention network (D-GCAN). Results showed that the D-GCAN model outperformed other state-of-the-art models for drug-likeness prediction. The combination of graph convolution and the attention mechanism made an important contribution to the performance of the model: the attention mechanism improved accuracy by 4.0%, and graph convolution improved it by 6.1%. Results on a dataset beyond Lipinski's rule-of-five space and on a non-US dataset showed that the model had good versatility. The billion-scale GDB-13 database was then used as a case study to screen for SARS-CoV-2 3C-like protease inhibitors. Sixty-five drug candidates were screened out, most of whose substructures are similar to those of existing oral drugs. Candidates screened from S-GDB13 have higher similarity to existing drugs and better molecular docking performance than those from the rest of GDB-13, and screening S-GDB13 is significantly faster than screening GDB-13 directly. In general, D-GCAN is a promising tool to predict drug-likeness for selecting potential candidates, accelerating drug discovery by excluding unpromising candidates and avoiding unnecessary biological and clinical testing.
Availability: The source code, model and tutorials are available at https://github.com/JinYSun/D-GCAN. The S-GDB13 database is available at https://doi.org/10.5281/zenodo.7054367.
Supplementary information: Supplementary data are available at Bioinformatics online.
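The graph-convolution building block behind models like D-GCAN can be sketched in one function: node features are mixed along molecular-graph edges with symmetric degree normalization (Kipf-Welling style). This is a generic sketch, not D-GCAN's architecture, which additionally includes the attention mechanism credited above; names are illustrative.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer:
        H' = ReLU( D^{-1/2} (A + I) D^{-1/2} H W )
    where A is the adjacency matrix (atoms as nodes, bonds as edges),
    H the node feature matrix and W a learned weight matrix."""
    A_hat = A + np.eye(len(A))                   # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)
```

Stacking a few such layers and pooling the node features into a single vector yields the molecule-level representation a drug-likeness classifier is trained on.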
Categories: Bioinformatics Trends

Tree2GD: A Phylogenomic Method to Detect Large Scale Gene Duplication Events

Bioinformatics Oxford Journals - Tue, 11/10/2022 - 5:30am
Abstract
Motivation: Whole-genome duplication events have long been discovered throughout the evolution of eukaryotes, contributing to genome complexity and biodiversity and leaving traces in the descendant organisms. Therefore, an accurate and rapid phylogenomic method is needed to identify the retained duplicated genes on various lineages across the target taxonomy.
Results: Here we present Tree2GD, an integrated method to identify large-scale gene duplication events by automatically performing multiple procedures, including sequence alignment, homolog recognition, gene tree/species tree reconciliation, Ks distribution of gene duplicates, and synteny analyses. Application of Tree2GD to two datasets, 12 metazoan genomes and 68 angiosperms, successfully identified all reported whole-genome duplication events exhibited by these species, showing the effectiveness and efficiency of Tree2GD for phylogenomic analyses of large-scale gene duplications.
Availability and implementation: Tree2GD is written in Python and C++ and is available at https://github.com/Dee-chen/Tree2gd.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database

Bioinformatics Oxford Journals - Tue, 11/10/2022 - 5:30am
Abstract
Motivation: The Genome Taxonomy Database (GTDB) and associated taxonomic classification toolkit (GTDB-Tk) have been widely adopted by the microbiology community. However, the growing size of the GTDB bacterial reference tree has resulted in GTDB-Tk requiring substantial amounts of memory (∼320 GB), which limits its adoption and ease of use. Here we present an update to GTDB-Tk that uses a divide-and-conquer approach in which user genomes are initially placed into a bacterial reference tree with family-level representatives, followed by placement into an appropriate class-level subtree comprising species representatives. This substantially reduces the memory requirements of GTDB-Tk while having minimal impact on classification.
Availability: GTDB-Tk is implemented in Python and licensed under the GNU General Public License v3.0. Source code and documentation are available at https://github.com/ecogenomics/gtdbtk.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
