Subscribe to Bioinformatics Oxford Journals feed

Single-cell mutation calling and phylogenetic tree reconstruction with loss and recurrence

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Tumours evolve as heterogeneous populations of cells, which may be distinguished by different genomic aberrations. The resulting intra-tumour heterogeneity plays an important role in cancer patient relapse and treatment failure, so that obtaining a clear understanding of each patient's tumour composition and evolutionary history is key for personalised therapies. Single-cell sequencing now provides the possibility to resolve tumour heterogeneity at the highest resolution of individual tumour cells, but brings with it challenges related to the particular noise profiles of the sequencing protocols as well as the complexity of the underlying evolutionary process.
Results: By modelling the noise processes and allowing mutations to be lost or to reoccur during tumour evolution, we present a method to jointly call mutations in each cell, reconstruct the phylogenetic relationship between cells, and determine the locations of mutational losses and recurrences. Our Bayesian approach allows us to accurately call mutations as well as to quantify our certainty in such predictions. We show the advantages of allowing mutational loss or recurrence with simulated data and present its application to tumour single-cell sequencing data.
Availability: SCIϕN is available at https://github.com/cbg-ethz/SCIPhIN
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

ntHash2: recursive spaced seed hashing for nucleotide sequences

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research.
Results: ntHash2 is up to 2.1x faster at hashing various spaced seeds than the previous version and 3.8x faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.
Availability: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license.
Supplementary information: Supplementary data are available at Bioinformatics online.
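To illustrate why spaced seeds tolerate mismatches better than contiguous k-mers, here is a minimal Python sketch. It is not the ntHash2 algorithm, which hashes seeds recursively; the function name `spaced_seeds` and the pattern `1101` are illustrative assumptions, where `1` marks a position that is kept and `0` a position that is ignored.

```python
from typing import List


def spaced_seeds(sequence: str, pattern: str) -> List[str]:
    """Return the masked word for every window of len(pattern) in sequence.

    pattern is a string of '1' (care) and '0' (don't care) characters.
    This is a naive, non-recursive illustration, not the ntHash2 method.
    """
    k = len(pattern)
    # Offsets inside each window that contribute to the seed.
    care = [i for i, flag in enumerate(pattern) if flag == "1"]
    return [
        "".join(sequence[start + i] for i in care)
        for start in range(len(sequence) - k + 1)
    ]


reference = "ACGTAC"
read = "ACCTAC"  # differs from the reference at index 2 only

ref_seeds = spaced_seeds(reference, "1101")
read_seeds = spaced_seeds(read, "1101")
```

In this toy input the first contiguous 4-mers differ (`ACGT` versus `ACCT`), yet the first spaced seeds are identical because the mismatch falls on the ignored position.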
Categories: Bioinformatics Trends

GTFtools: a software package for analyzing various features of gene models

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Gene-centric bioinformatics studies frequently involve calculation or extraction of various features of genes such as splice sites, promoters, independent introns, and untranslated regions (UTRs) through manipulation of gene models. Gene models are often annotated in gene transfer format (GTF) files. The features are essential for subsequent analysis such as intron retention detection, DNA-binding site identification, and computing splicing strength of splice sites. Some features such as independent introns and splice sites are not provided in existing resources including the commonly used BioMart database. A package that implements and integrates functions to analyze various features of genes will greatly ease routine analysis for related bioinformatics studies. However, to the best of our knowledge, such a package is not available yet.
Results: In this work, we introduce GTFtools, a stand-alone command-line tool that provides a set of functions to calculate various gene features, including splice sites, independent introns, transcription start site (TSS)-flanking regions, UTRs, isoform coordinates and lengths, different types of gene lengths, etc. It takes ENSEMBL or GENCODE GTF files as input, and can be applied to both human and non-human gene models, such as those of the lab mouse. We compare the utilities of GTFtools with those of two related tools: Bedtools and BioMart. GTFtools is implemented in Python and does not depend on any third-party software, making it very easy to install and use.
Availability: GTFtools is freely available at www.genemine.org/gtftools.php as well as on PyPI and Bioconda.
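As a rough illustration of the kind of feature derivation described above, the following Python sketch computes introns and splice-site positions from the exon coordinates of a single transcript. It is not GTFtools code and does not reproduce its definition of independent introns. The function names are hypothetical, and the sketch assumes GTF-style 1-based inclusive coordinates, with a splice site reported as the first or last intronic base.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end), as in GTF


def introns_from_exons(exons: List[Interval]) -> List[Interval]:
    """Return the gaps between consecutive exons of one transcript."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Skip adjacent or overlapping exons, which leave no intron.
        if next_start - prev_end > 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


def splice_sites(introns: List[Interval], strand: str) -> List[Tuple[int, int]]:
    """Return (donor, acceptor) positions for each intron.

    Assumption: the donor (5' splice site) is the first intronic base in
    transcript orientation and the acceptor (3' splice site) is the last.
    """
    if strand == "+":
        return [(start, end) for start, end in introns]
    # On the minus strand the transcript runs from high to low coordinates.
    return [(end, start) for start, end in introns]


exons = [(500, 600), (100, 200), (300, 400)]
introns = introns_from_exons(exons)
minus_sites = splice_sites(introns, "-")
```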
Categories: Bioinformatics Trends

Isoform function prediction by Gene Ontology embedding

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: High-resolution annotation of gene functions is a central task in functional genomics. Multiple proteoforms translated from alternatively spliced isoforms of a single gene are the actual function performers and greatly increase functional diversity. The specific functions of different isoforms can decipher the molecular basis of various complex diseases at a finer granularity. Multi-instance learning (MIL) based solutions have been developed to distribute gene (bag)-level Gene Ontology (GO) annotations to isoforms (instances), but they simply presume that a particular annotation of the gene is attributable to only one isoform, neglect the hierarchical structures and semantics of massive GO terms (labels), or can only handle dozens of terms.
Results: We propose an effective approach, IsofunGO, to differentiate massive functions of isoforms by GO embedding. In particular, IsofunGO first introduces an attributed hierarchical network to model massive GO terms, and a GO network embedding strategy to learn compact representations of GO terms and project GO annotations of genes into compressed ones. This strategy not only explores and preserves the hierarchy between GO terms but also greatly reduces the prediction load. Next, it develops an attention-based multi-instance learning network to fuse genomics and transcriptomics data of isoforms and predict isoform functions by referring to the compressed annotations. Extensive experiments on benchmark datasets demonstrate the efficacy of IsofunGO. Both the GO embedding and the attention mechanism can boost performance and interpretability.
Availability: The code of IsofunGO is available at http://www.sdu-idea.cn/codes.php?name=IsofunGO
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CLNN-loop: A deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Three-dimensional (3D) genome organization is of vital importance in gene regulation and disease mechanisms. Previous studies have shown that CTCF-mediated chromatin loops are crucial to studying the 3D structure of cells. Although various experimental techniques have been developed to detect chromatin loops, they have been found to be time-consuming and costly. Nowadays, various sequence-based computational methods can capture significant features of 3D genome organization and help predict chromatin loops. However, these methods have low performance and poor generalization ability in predicting chromatin loops.
Results: Here, we propose a novel deep learning model, called CLNN-loop, to predict chromatin loops in different cell lines and CTCF-binding sites (CBS) pair types by fusing multiple sequence-based features. The analysis of a series of examinations based on the datasets in the previous study shows that CLNN-loop has satisfactory performance and is superior to the existing methods in terms of predicting chromatin loops. In addition, we apply the SHAP framework to interpret the predictions of different models, and find that CTCF motif and sequence conservation are important signs of chromatin loops in different cell lines and CBS pair types. The source code of CLNN-loop is freely available at https://github.com/HaoWuLab-Bioinformatics/CLNN-loop and the webserver of CLNN-loop is freely available at http://hwclnn.sdu.edu.cn.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Predicting cancer drug response using parallel heterogeneous graph convolutional networks with neighborhood interactions

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Due to cancer heterogeneity, the therapeutic effect may not be the same when a cohort of patients of the same cancer type receive the same treatment. Anticancer drug response prediction may help develop personalized therapy regimens to increase survival and reduce patients' expenses. Recently, graph neural network-based methods have aroused widespread interest and achieved impressive results on the drug response prediction task. However, most of them apply graph convolution to process cell line-drug bipartite graphs while ignoring the intrinsic differences between cell line and drug nodes. Moreover, most of these methods aggregate node-wise neighbor features but fail to consider the element-wise interaction between cell lines and drugs.
Results: This work proposes a neighborhood interaction-based heterogeneous graph convolution network method, namely NIHGCN, for anticancer drug response prediction in an end-to-end way. First, it constructs a heterogeneous network consisting of drugs, cell lines and the known drug response information. Cell line gene expression and drug molecular fingerprints are linearly transformed and input as node attributes into an interaction module. The interaction module consists of a parallel graph convolution network (PGCN) layer, which aggregates node-level features from neighbors through graph convolution, and a neighborhood interaction (NI) layer, which considers element-level interactions with neighbors. Finally, the drug response predictions are made by calculating the linear correlation coefficients of the feature representations of cell lines and drugs. We have conducted extensive experiments to assess the effectiveness of our model on the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) datasets. It achieved the best performance compared with state-of-the-art algorithms, especially in predicting drug responses for new cell lines, new drugs and targeted drugs. Furthermore, our model, once trained on the GDSC dataset, can be successfully applied to predict samples of PDX and TCGA, which verifies the transferability of our model from in vitro cell lines to in vivo datasets.
Availability: The source code can be obtained from https://github.com/weiba/NIHGCN.
Supplementary information: Supplementary data are available at Bioinformatics online.
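The final prediction step described above, scoring each cell line-drug pair by the linear correlation of their feature representations, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names and the toy three-dimensional feature vectors are hypothetical, the linear correlation is taken to be the Pearson coefficient, and the graph convolution layers that would produce the real representations are omitted.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


def predict_responses(
    cell_features: List[Sequence[float]],
    drug_features: List[Sequence[float]],
) -> List[List[float]]:
    """Return a cell-by-drug matrix of correlation scores."""
    return [[pearson(c, d) for d in drug_features] for c in cell_features]


# Toy learned representations (hypothetical values).
cells = [[1.0, 2.0, 3.0]]
drugs = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
scores = predict_responses(cells, drugs)
```

Here the first drug's representation is perfectly correlated with the cell line's and the second is perfectly anti-correlated, so the two scores sit at the extremes of the [-1, 1] range.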
Categories: Bioinformatics Trends

YAMACS: a graphical interface for GROMACS

Tue, 23/08/2022 - 5:30am
Abstract
Summary: A graphical user interface for the GROMACS program has been developed as a set of plugins for the YASARA molecular graphics suite. The most significant GROMACS methods can be run entirely via a windowed menu system, and the results are shown on screen in real time.
Availability and Implementation: YAMACS is written in Python, is freely available for download at https://github.com/YAMACS-SML/YAMACS and is supported on Linux. It has been released under the GPL-3.0 license.
Supplementary information: The YAMACS User Manual is available at https://github.com/YAMACS-SML/YAMACS
Categories: Bioinformatics Trends

MMGraph: a multiple motif predictor based on graph neural network and coexisting probability for ATAC-seq data

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Transcription factor binding site (TFBS) prediction is a crucial step in revealing functions of transcription factors (TFs) from high-throughput sequencing data. Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) provides insight into TFBSs and nucleosome positioning by probing open chromatin, and can simultaneously reveal multiple TFBSs compared to traditional technologies. The existing tools based on convolutional neural networks (CNNs) only find TFBSs of fixed length from ATAC-seq data. A graph neural network (GNN) can be considered an extension of a CNN and has great potential for finding multiple TFBSs of different lengths from ATAC-seq data.
Results: We develop a motif predictor called MMGraph, based on a three-layer GNN and the coexisting probability of k-mers, for finding multiple motifs from ATAC-seq data. The results of an experiment conducted on 88 ATAC-seq datasets indicate that MMGraph achieved the best performance, with an area of eight metrics radar (AEMR) score of 2.31, and could find 207 higher-quality multiple motifs than other existing tools.
Availability: MMGraph is provided as a Python package, which is available at https://github.com/zhangsq06/MMGraph.git
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

NetTIME: a Multitask and Base-pair Resolution Framework for Improved Transcription Factor Binding Site Prediction

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here we propose NetTIME, a multitask learning framework for predicting cell-type-specific transcription factor binding sites with base-pair resolution.
Results: We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method’s predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings.
Availability: NetTIME is freely available at https://github.com/ryi06/NetTIME
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

HNOXPred: a web tool for the prediction of gas sensing H-NOX proteins from amino acid sequence

Mon, 22/08/2022 - 5:30am
Abstract
Summary: HNOXPred is a webserver for the prediction of gas sensing H-NOX proteins from amino acid sequence. Heme-Nitric oxide/Oxygen (H-NOX) proteins are gas sensing hemoproteins found in diverse organisms ranging from bacteria to eukaryotes. Recently, gas sensing complex multi-functional proteins containing only the conserved amino acids at the heme centers of H-NOX proteins have been identified through a motif-based approach. Based on experimental data and H-NOX candidates reported in the literature, HNOXPred was created to automate and facilitate the identification of similar H-NOX centers across systems. The server features HNOXSCORES, scaled from 0 to 1, whose calculation considers the physicochemical properties of the amino acids constituting the heme center in H-NOX in addition to the conserved amino acids within the center. From a user-supplied amino acid sequence, the server returns positive hits and their calculated HNOXSCORES, ordered from high to low confidence and accompanied by interpretation guides and recommendations. The utility of this server is demonstrated using the human proteome as an example.
Availability and implementation: The HNOXPred server is available at https://www.hnoxpred.com.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

The FASTQ+ format and PISA

Mon, 22/08/2022 - 5:30am
Abstract
Summary: The FASTQ+ format is designed for single-cell experiments. It adds various optional tags, including cell barcodes and unique molecular identifiers, to the sequence identifier, and is fully compatible with the FASTQ format. In addition, PISA implements various utilities for processing sequences in the FASTQ format and alignments in the SAM/BAM/CRAM format from single-cell experiments, such as converting FASTQ to FASTQ+, annotating alignments, PCR deduplication, feature counting, and barcode correction. The software is open-source and written in C.
Availability: https://doi.org/10.5281/zenodo.6787430 or https://github.com/shiquan/PISA
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

scWMC: Weighted Matrix Completion-based Imputation of scRNA-seq Data via Prior Subspace Information

Fri, 19/08/2022 - 5:30am
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) can provide insight into gene expression patterns at the resolution of individual cells, which offers new opportunities to study the behavior of different cell types. However, it is often plagued by dropout events, a phenomenon where the expression value of a gene tends to be measured as zero in the expression matrix due to various technical defects.
Results: In this paper, we argue that borrowing gene and cell information across column and row subspaces directly results in suboptimal solutions due to noise contamination in imputing dropout values. Thus, to impute the dropout events in scRNA-seq data more precisely, we develop a regularization for leveraging that imperfect prior information to estimate the true underlying prior subspace and then embed it in a typical low-rank matrix completion-based framework, named scWMC. To evaluate the performance of the proposed method, we conduct comprehensive experiments on simulated and real scRNA-seq data. Extensive data analysis, including simulated analysis, cell clustering, differential expression analysis, functional genomic analysis, cell trajectory inference and scalability analysis, demonstrates that our method produces improved imputation results compared to competing methods, which benefits downstream analysis.
Availability: The source code is available at https://github.com/XuYuanchi/scWMC and test data is available at https://doi.org/10.5281/zenodo.6832477.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Beacon V2 Reference Implementation: a Toolkit to enable federated sharing of genomic and phenotypic data

Thu, 18/08/2022 - 5:30am
Abstract
Summary: Beacon v2 is an API specification established by the Global Alliance for Genomics and Health initiative (GA4GH) that defines a standard for federated discovery of genomic and phenotypic data. Here we present the Beacon v2 Reference Implementation (B2RI), a set of open-source software tools that allow lighting up a local Beacon instance “out-of-the-box”. Along with the software, we have created detailed “Read the Docs” documentation that includes information on deployment and installation.
Availability: The B2RI is released under GNU General Public License v3.0 and Apache License v2.0. Documentation and source code are available at: https://b2ri-documentation.readthedocs.io
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Omnibus and Robust Deconvolution Scheme for Bulk RNA Sequencing Data Integrating Multiple Single-Cell Reference Sets and Prior Biological Knowledge

Thu, 18/08/2022 - 5:30am
Abstract
Motivation: Cell-type deconvolution of bulk tissue RNA sequencing (RNA-seq) data is an important step towards understanding the variations in cell-type composition among disease conditions. Owing to recent advances in single-cell RNA sequencing (scRNA-seq) and the availability of large amounts of bulk RNA-seq data in disease-relevant tissues, various deconvolution methods have been developed. However, the performance of existing methods heavily relies on the quality of information provided by external data sources, such as the selection of scRNA-seq data as a reference and prior biological information.
Results: We present the Integrated and Robust Deconvolution (InteRD) algorithm to infer cell-type proportions from target bulk RNA-seq data. Owing to the innovative use of penalized regression with a new evaluation criterion for deconvolution, InteRD has three primary advantages. First, it is able to effectively integrate deconvolution results from multiple scRNA-seq datasets. Second, InteRD calibrates estimates from reference-based deconvolution by taking into account extra biological information as priors. Third, the proposed algorithm is robust to inaccurate external information imposed in the deconvolution system. Extensive numerical evaluations and real data applications demonstrate that InteRD yields more accurate and robust cell-type proportion estimates that agree well with known biology.
Availability and implementation: The proposed InteRD framework is implemented in R and the package is available at https://cran.r-project.org/web/packages/InteRD/index.html.
Supplementary information: Supplementary Materials including pseudo algorithms, more simulation results, and extra discussion and information are available at Bioinformatics online.
Categories: Bioinformatics Trends

Correction of image distortion in large-field ssEM stitching by an unsupervised intermediate-space solving network

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: Serial-section electron microscopy (ssEM) is a powerful technique for cellular visualization, especially for large-scale specimens. Because the field of view is limited, a megapixel image of the whole specimen is typically obtained by stitching several overlapping images. However, because the images suffer from distortion caused by manual operations, lens distortion or electron impact, simple rigid transformations are not adequate for perfect mosaic generation. Non-linear deformation usually causes a “ghosting” phenomenon, especially at high magnification. To date, existing microscope image processing tools provide mature rigid stitching methods, but do not address local distortion correction.
Results: In this paper, following the development of unsupervised deep learning, we present a multi-scale network to predict the dense deformation fields of image pairs in ssEM and blend these images into a clear and seamless montage. The model is composed of two pyramidal backbones, sharing parameters and interacting with a set of registration modules, in which the pyramidal architecture can effectively capture large deformations through multi-scale decomposition. A novel “intermediate-space solving” paradigm is adopted in our model to treat input images equally and ensure nearly perfect stitching of the overlapping regions. Combined with the existing rigid transformation method, our model further improves the accuracy of sequential image stitching. Extensive experimental results demonstrate the superiority of our method over other traditional methods.
Availability: The code is available at https://github.com/HeracleBT/ssEM_stitching.
Categories: Bioinformatics Trends

Fec: a fast error correction method based on two-rounds overlapping and caching

Wed, 17/08/2022 - 5:30am
Abstract
Third-generation sequencing technology has advanced genome analysis with long read lengths, but the reads need error correction due to the high error rate. Error correction is a time-consuming process, especially when the sequencing coverage is high. Generally, for a pair of overlapping reads A and B, existing error correction methods perform a base-level alignment from B to A when correcting read A, and another base-level alignment from A to B when correcting read B. However, based on our observation, the base-level alignment information can be reused. In this paper, we present a fast error correction tool, Fec, using two-round overlapping and caching. Fec can be used independently or as an error correction step in an assembly pipeline. In the first round, Fec uses a large window size (20) to quickly find enough overlaps to correct most of the reads. In the second round, a small window size (5) is used to find more overlaps for the reads with insufficient overlaps in the first round. When performing base-level alignment, Fec searches the cache first. If the alignment exists in the cache, Fec takes this alignment out and deduces the second alignment from it. Otherwise, Fec performs base-level alignment and stores the alignment in the cache. We test Fec on nine datasets, and the results show that Fec achieves a 1.24-38.56 times speed-up compared to MECAT, CANU, and MINICNS on five PacBio datasets and a 1.16-27.8 times speed-up compared to NECAT and CANU on four nanopore datasets.
Availability and Implementation: Fec is available at https://github.com/zhangjuncsu/Fec
Supplementary information: Supplementary data are available at Bioinformatics online.
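The caching idea described above can be sketched in Python as follows. This is a hedged illustration, not Fec's implementation: the class `AlignmentCache`, the CIGAR-like `(operation, length)` alignment representation, and the placeholder aligner `toy_align` are all assumptions made for the example. The sketch assumes that the reverse alignment can be deduced by swapping insertions and deletions.

```python
from typing import Callable, Dict, List, Tuple

# An alignment is a list of (operation, length) pairs:
# "M" = aligned bases, "I" = insertion in the query, "D" = deletion.
Alignment = List[Tuple[str, int]]

_SWAP = {"M": "M", "I": "D", "D": "I"}


def reverse_alignment(alignment: Alignment) -> Alignment:
    """Deduce the target-to-query alignment from the query-to-target one."""
    return [(_SWAP[op], length) for op, length in alignment]


class AlignmentCache:
    """Align each overlapping read pair once and reuse the result."""

    def __init__(self, align_fn: Callable[[str, str], Alignment]) -> None:
        self._align_fn = align_fn
        self._cache: Dict[Tuple[str, str], Alignment] = {}
        self.computed = 0  # number of real base-level alignments performed

    def get(self, target_id: str, query_id: str) -> Alignment:
        reverse_key = (query_id, target_id)
        if reverse_key in self._cache:
            # Take the stored alignment out and deduce the second one.
            return reverse_alignment(self._cache.pop(reverse_key))
        alignment = self._align_fn(target_id, query_id)
        self.computed += 1
        self._cache[(target_id, query_id)] = alignment
        return alignment


def toy_align(target_id: str, query_id: str) -> Alignment:
    """Placeholder for an expensive base-level aligner (hypothetical)."""
    return [("M", 3), ("I", 1), ("M", 2)]


cache = AlignmentCache(toy_align)
first = cache.get("A", "B")   # correcting read A: computed and stored
second = cache.get("B", "A")  # correcting read B: deduced from the cache
```

Removing an entry once it has been reused keeps the cache small, since each overlapping pair is needed at most twice in this scheme.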
Categories: Bioinformatics Trends

FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: A genome read data set can be quickly and efficiently remapped from one reference to another similar reference (e.g., between two reference versions or two similar species) using a variety of tools, e.g., the commonly used CrossMap tool. With the explosion of available genomic data sets and references, high-performance remapping tools will be even more important for keeping up with the computational demands of genome assembly and analysis.
Results: We provide FastRemap, a fast and efficient tool for remapping reads between genome assemblies. FastRemap provides up to a 7.19× speedup (5.97×, on average) and uses as little as 61.7% (80.7%, on average) of the peak memory consumption of the state-of-the-art remapping tool, CrossMap.
Availability: FastRemap is written in C++. Source code and user manual are freely available at github.com/CMU-SAFARI/FastRemap; a Docker image (https://hub.docker.com/r/alkanlab/fast) and a Bioconda package are also available.
Categories: Bioinformatics Trends

Guided interactive image segmentation using machine learning and color-based image set clustering

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: Over the last decades, image processing and analysis has become one of the key technologies in systems biology and medicine. The quantification of anatomical structures and dynamic processes in living systems is essential for understanding the complex underlying mechanisms and allows, among other things, the construction of spatio-temporal models that illuminate the interplay between architecture and function. Recently, deep learning significantly improved the performance of traditional image analysis in cases where imaging techniques provide large amounts of data. However, if only a few images are available or qualified annotations are expensive to produce, the applicability of deep learning is still limited.
Results: We present a novel approach that combines machine learning-based interactive image segmentation using supervoxels with a clustering method for the automated identification of similarly colored images in large image sets, which enables a guided reuse of interactively trained classifiers. Our approach solves the problem of deteriorated segmentation and quantification accuracy when reusing trained classifiers, which is due to the significant color variability that is prevalent and often unavoidable in biological and medical images. This increase in efficiency improves the suitability of interactive segmentation for larger image sets, enabling efficient quantification or the rapid generation of training data for deep learning with minimal effort. The presented methods are applicable to almost any image type and represent a useful tool for image analysis tasks in general.
Availability: The presented methods are implemented in our image processing software TiQuant, which is freely available at tiquant.hoehme.com.
Supplementary information: Supplementary information is available at Bioinformatics online and test data is provided at our website.
Categories: Bioinformatics Trends

Comparing Petri net-based models of biological systems using Holmes

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: The first and necessary step in a systems approach to studying biological phenomena is building a formal model. One possibility is to construct a model based on Petri nets. On the one hand, they have an intuitive graphical representation; on the other, they can be analyzed using formal mathematical methods. Finding homologies or conserved processes playing important roles in various biological systems can be done by comparing models. Models expressed as Petri nets are especially well suited for such a comparison, but there is a lack of software tools for this task.
Results: To resolve this problem, a new analytical tool has been implemented in the Holmes application and is described in this paper. It offers four different comparison methods, based on t-invariants, decomposition, graphlets and branching vertices.
Availability and implementation: Available at http://www.cs.put.poznan.pl/mradom/Holmes/holmes.html
Categories: Bioinformatics Trends

Metagenomic binning with assembly graph embeddings

Tue, 16/08/2022 - 5:30am
Abstract
Motivation: Despite recent advancements in sequencing technologies and assembly methods, obtaining high-quality microbial genomes from metagenomic samples is still not a trivial task. Current metagenomic binners do not take full advantage of assembly graphs and are not optimized for long-read assemblies. Deep graph learning algorithms have been proposed in other fields to deal with complex graph data structures. The graph structure generated during the assembly process could be integrated with contig features to obtain better bins with deep learning.
Results: We propose GraphMB, which uses graph neural networks to incorporate the assembly graph into the binning process. We test GraphMB on long-read datasets of different complexities, and compare the performance with other binners in terms of the number of High Quality (HQ) genome bins obtained. With our approach, we were able to obtain unique bins on all real datasets, and obtain more bins on most datasets. In particular, we obtained on average 17.5% more HQ bins when compared to state-of-the-art binners and 13.7% more when aggregating the results of our binner with the others. These results indicate that a deep learning model can integrate contig-specific and graph-structure information to improve metagenomic binning.
Availability: GraphMB is available from https://github.com/MicrobialDarkMatter/GraphMB
Categories: Bioinformatics Trends
