Subscribe to Bioinformatics Oxford Journals feed

Omnibus and Robust Deconvolution Scheme for Bulk RNA Sequencing Data Integrating Multiple Single-Cell Reference Sets and Prior Biological Knowledge

Thu, 18/08/2022 - 5:30am
Abstract
Motivation: Cell-type deconvolution of bulk tissue RNA sequencing (RNA-seq) data is an important step towards understanding the variations in cell-type composition among disease conditions. Owing to recent advances in single-cell RNA sequencing (scRNA-seq) and the availability of large amounts of bulk RNA-seq data in disease-relevant tissues, various deconvolution methods have been developed. However, the performance of existing methods heavily relies on the quality of information provided by external data sources, such as the selection of scRNA-seq data as a reference and prior biological information.
Results: We present the Integrated and Robust Deconvolution (InteRD) algorithm to infer cell-type proportions from target bulk RNA-seq data. Owing to the innovative use of penalized regression with a new evaluation criterion for deconvolution, InteRD has three primary advantages. First, it is able to effectively integrate deconvolution results from multiple scRNA-seq datasets. Second, InteRD calibrates estimates from reference-based deconvolution by taking into account extra biological information as priors. Third, the proposed algorithm is robust to inaccurate external information imposed in the deconvolution system. Extensive numerical evaluations and real data applications demonstrate that InteRD yields more accurate and robust cell-type proportion estimates that agree well with known biology.
Availability and implementation: The proposed InteRD framework is implemented in R and the package is available at https://cran.r-project.org/web/packages/InteRD/index.html.
Supplementary information: Supplementary Materials, including pseudo algorithms, more simulation results, and extra discussion and information, are available at Bioinformatics online.
Categories: Bioinformatics Trends

Correction of image distortion in large-field ssEM stitching by an unsupervised intermediate-space solving network

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: Serial-section electron microscopy (ssEM) is a powerful technique for cellular visualization, especially for large-scale specimens. Because the field of view is limited, a megapixel image of the whole specimen is typically captured by stitching several overlapping images. However, because the images suffer distortion from manual operations, lens distortion, or electron impact, simple rigid transformations are not adequate for perfect mosaic generation. Non-linear deformation usually causes a “ghosting” phenomenon, especially at high magnification. To date, existing microscope image processing tools provide mature rigid stitching methods but do not address local distortion correction.
Results: In this paper, following the development of unsupervised deep learning, we present a multi-scale network to predict the dense deformation fields of image pairs in ssEM and blend these images into a clear and seamless montage. The model is composed of two pyramidal backbones, sharing parameters and interacting with a set of registration modules, in which the pyramidal architecture can effectively capture large deformations through multi-scale decomposition. A novel “intermediate-space solving” paradigm is adopted in our model to treat input images equally and ensure nearly perfect stitching of the overlapping regions. Combined with the existing rigid transformation method, our model further improves the accuracy of sequential image stitching. Extensive experimental results demonstrate the superiority of our method over traditional methods.
Availability: The code is available at https://github.com/HeracleBT/ssEM_stitching.
Categories: Bioinformatics Trends

Fec: a fast error correction method based on two-rounds overlapping and caching

Wed, 17/08/2022 - 5:30am
Abstract
Third-generation sequencing technology has advanced genome analysis with long read lengths, but the reads need error correction because of their high error rate. Error correction is a time-consuming process, especially when the sequencing coverage is high. Generally, for a pair of overlapping reads A and B, existing error correction methods perform a base-level alignment from B to A when correcting read A, and another base-level alignment from A to B when correcting read B. However, based on our observation, the base-level alignment information can be reused. In this paper, we present a fast error correction tool, Fec, which uses two-rounds overlapping and caching. Fec can be used independently or as an error correction step in an assembly pipeline. In the first round, Fec uses a large window size (20) to quickly find enough overlaps to correct most of the reads. In the second round, a small window size (5) is used to find more overlaps for the reads with insufficient overlaps in the first round. When performing base-level alignment, Fec searches the cache first. If the alignment exists in the cache, Fec takes this alignment out and deduces the second alignment from it. Otherwise, Fec performs base-level alignment and stores the alignment in the cache. We test Fec on nine datasets, and the results show that Fec achieves a 1.24-38.56 times speed-up compared to MECAT, CANU, and MINICNS on five PacBio datasets and a 1.16-27.8 times speed-up compared to NECAT and CANU on four nanopore datasets.
Availability and implementation: Fec is available at https://github.com/zhangjuncsu/Fec
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
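
The caching step described in the Fec abstract above can be illustrated with a minimal Python sketch. It is not Fec's implementation: the CIGAR-like alignment representation, the `invert` helper, the `AlignmentCache` class, and the `toy_align` stand-in aligner are all hypothetical names introduced here, and the sketch assumes the reverse alignment can be deduced simply by swapping insertions and deletions (strand handling is ignored).

```python
from typing import Callable, Dict, List, Tuple

# CIGAR-like alignment: ("M", n) match/mismatch, ("I", n) insertion, ("D", n) deletion.
Alignment = List[Tuple[str, int]]


def invert(aln: Alignment) -> Alignment:
    """Deduce the target-to-query alignment from a query-to-target one.

    Assumption: swapping insertions and deletions is sufficient.
    """
    swap = {"M": "M", "I": "D", "D": "I"}
    return [(swap[op], n) for op, n in aln]


class AlignmentCache:
    """Hypothetical cache keyed by (query_id, target_id)."""

    def __init__(self, align_fn: Callable[[str, str], Alignment]) -> None:
        self._align = align_fn
        self._cache: Dict[Tuple[str, str], Alignment] = {}
        self.computed = 0  # number of base-level alignments actually performed

    def get(self, query: str, target: str) -> Alignment:
        reverse_key = (target, query)
        if reverse_key in self._cache:
            # Cache hit: take the stored alignment out and deduce the second one.
            return invert(self._cache.pop(reverse_key))
        # Cache miss: perform the base-level alignment and store it.
        aln = self._align(query, target)
        self.computed += 1
        self._cache[(query, target)] = aln
        return aln


def toy_align(query: str, target: str) -> Alignment:
    """Stand-in aligner returning a fixed alignment; a real tool would align bases."""
    return [("M", 3), ("I", 1), ("M", 2)]


cache = AlignmentCache(toy_align)
b_to_a = cache.get("B", "A")  # computed while correcting read A
a_to_b = cache.get("A", "B")  # deduced from the cache while correcting read B
```

In this sketch only one base-level alignment is computed for the pair, which is the source of the saving the abstract describes. Removing the entry on a hit keeps the cache small, on the assumption that each pair is needed at most twice.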

FastRemap: A Tool for Quickly Remapping Reads between Genome Assemblies

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: A genome read data set can be quickly and efficiently remapped from one reference to another similar reference (e.g., between two reference versions or two similar species) using a variety of tools, e.g., the commonly used CrossMap tool. With the explosion of available genomic data sets and references, high-performance remapping tools will be even more important for keeping up with the computational demands of genome assembly and analysis.
Results: We provide FastRemap, a fast and efficient tool for remapping reads between genome assemblies. FastRemap provides up to a 7.19× speedup (5.97×, on average) and uses as low as 61.7% (80.7%, on average) of the peak memory consumption compared to the state-of-the-art remapping tool, CrossMap.
Availability: FastRemap is written in C++. Source code and user manual are freely available at github.com/CMU-SAFARI/FastRemap. A Docker image is available at https://hub.docker.com/r/alkanlab/fast. FastRemap is also available in Bioconda.
Categories: Bioinformatics Trends

Guided interactive image segmentation using machine learning and color-based image set clustering

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: Over the last decades, image processing and analysis have become key technologies in systems biology and medicine. The quantification of anatomical structures and dynamic processes in living systems is essential for understanding the complex underlying mechanisms and allows, among other things, the construction of spatio-temporal models that illuminate the interplay between architecture and function. Recently, deep learning has significantly improved the performance of traditional image analysis in cases where imaging techniques provide large amounts of data. However, if only a few images are available or qualified annotations are expensive to produce, the applicability of deep learning is still limited.
Results: We present a novel approach that combines machine-learning-based interactive image segmentation using supervoxels with a clustering method for the automated identification of similarly colored images in large image sets, which enables a guided reuse of interactively trained classifiers. Our approach solves the problem of deteriorated segmentation and quantification accuracy when reusing trained classifiers, which is due to the significant color variability that is prevalent and often unavoidable in biological and medical images. This increase in efficiency improves the suitability of interactive segmentation for larger image sets, enabling efficient quantification or the rapid generation of training data for deep learning with minimal effort. The presented methods are applicable to almost any image type and represent a useful tool for image analysis tasks in general.
Availability: The presented methods are implemented in our image processing software TiQuant, which is freely available at tiquant.hoehme.com.
Supplementary information: Supplementary information is available at Bioinformatics online, and test data are provided at our website.
Categories: Bioinformatics Trends

Comparing Petri net-based models of biological systems using Holmes

Wed, 17/08/2022 - 5:30am
Abstract
Motivation: The first and necessary step in a systems approach to studying biological phenomena is building a formal model. One possibility is to construct a model based on Petri nets. On the one hand, they have an intuitive graphical representation; on the other, they can be analyzed using formal mathematical methods. Finding homologies or conserved processes that play important roles in various biological systems can be done by comparing models. Models expressed as Petri nets are especially well suited for such a comparison, but there is a lack of software tools for this task.
Results: To resolve this problem, a new analytical tool has been implemented in the Holmes application and is described in this paper. It offers four different comparison methods, based on t-invariants, decomposition, graphlets, and branching vertices.
Availability and implementation: Available at http://www.cs.put.poznan.pl/mradom/Holmes/holmes.html
Categories: Bioinformatics Trends

Metagenomic binning with assembly graph embeddings

Tue, 16/08/2022 - 5:30am
Abstract
Motivation: Despite recent advancements in sequencing technologies and assembly methods, obtaining high-quality microbial genomes from metagenomic samples is still not a trivial task. Current metagenomic binners do not take full advantage of assembly graphs and are not optimized for long-read assemblies. Deep graph learning algorithms have been proposed in other fields to deal with complex graph data structures. The graph structure generated during the assembly process could be integrated with contig features to obtain better bins with deep learning.
Results: We propose GraphMB, which uses graph neural networks to incorporate the assembly graph into the binning process. We test GraphMB on long-read datasets of different complexities, and compare the performance with other binners in terms of the number of High Quality (HQ) genome bins obtained. With our approach, we were able to obtain unique bins on all real datasets, and obtain more bins on most datasets. In particular, we obtained on average 17.5% more HQ bins when compared to state-of-the-art binners and 13.7% when aggregating the results of our binner with the others. These results indicate that a deep learning model can integrate contig-specific and graph-structure information to improve metagenomic binning.
Availability: GraphMB is available from https://github.com/MicrobialDarkMatter/GraphMB
Categories: Bioinformatics Trends

Stitching and registering highly multiplexed whole slide images of tissues and tumors using ASHLAR

Tue, 16/08/2022 - 5:30am
Abstract
Motivation: Stitching microscope images into a mosaic is an essential step in the analysis and visualization of large biological specimens, particularly human and animal tissues. Recent approaches to highly multiplexed imaging generate high-plex data from sequential rounds of lower-plex imaging. These multiplexed imaging methods promise to yield precise molecular single-cell data and information on cellular neighborhoods and tissue architecture. However, attaining mosaic images with single-cell accuracy requires robust image stitching and image registration capabilities that are not met by existing methods.
Results: We describe the development and testing of ASHLAR, a Python tool for coordinated stitching and registration of 10^3 or more individual multiplexed images to generate accurate whole-slide mosaics. ASHLAR reads image formats from most commercial microscopes and slide scanners, and we show that it performs better than existing open-source and commercial software. ASHLAR outputs standard OME-TIFF images that are ready for analysis by other open-source tools and recently developed image analysis pipelines.
Availability: ASHLAR is written in Python and available under the MIT license at https://github.com/labsyspharm/ashlar. An informational website with user guides and test data is available at https://labsyspharm.github.io/ashlar/
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

A Spatial Attention Guided Deep Learning System for Prediction of Pathological Complete Response Using Breast Cancer Histopathology Images

Sat, 13/08/2022 - 5:30am
Abstract
Motivation: Accurately predicting pathological complete response (pCR) to neoadjuvant chemotherapy (NAC) in triple-negative breast cancer (TNBC) patients is urgently needed for clinical decision making. pCR is also regarded as a strong predictor of overall survival. In this work, we propose a deep learning system to predict pCR to NAC based on serial pathology images stained with hematoxylin and eosin (H&E) and two immunohistochemical biomarkers (Ki67 and PHH3). To support guidance based on prior human domain knowledge and enhance the interpretability of the deep learning system, we introduce a spatial attention mechanism derived from human knowledge to inform deep learning models of informative tissue areas of interest. For each patient, three serial breast tumor tissue sections were cut from biopsy blocks, stained with three different stains, and integrated. The resulting comprehensive attention information from the image triplets is used to guide our prediction system towards prognostic tissue regions.
Results: The experimental dataset consists of 26,419 pathology image patches of 1,000×1,000 pixels from 73 TNBC patients treated with NAC. Image patches from 43 randomly selected patients are used as the training dataset, and image patches from the remaining 30 are used as the testing dataset. By maximum voting over patch-level results, our proposed model achieves a 93% patient-level accuracy, outperforming baselines and other state-of-the-art systems and suggesting its high potential for clinical decision making.
Availability: The code, documentation, and example data are openly available at https://github.com/jkonglab/PCR_Prediction_Serial_WSIs_biomarkers
Categories: Bioinformatics Trends
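
The step from patch-level results to a patient-level prediction mentioned in the abstract above can be sketched in Python. The sketch takes "maximum voting" to mean a simple majority vote over each patient's patches, which is an assumption rather than something the abstract specifies. The function name `patient_level_votes`, the input layout, and the toy labels are hypothetical and are not taken from the authors' code.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def patient_level_votes(
    patch_predictions: Iterable[Tuple[str, Hashable]],
) -> Dict[str, Hashable]:
    """Aggregate (patient_id, predicted_label) pairs by majority vote.

    Assumption: ties go to the label seen first for that patient, which is
    how Counter.most_common orders elements with equal counts.
    """
    by_patient: Dict[str, List[Hashable]] = defaultdict(list)
    for patient_id, label in patch_predictions:
        by_patient[patient_id].append(label)
    return {
        patient_id: Counter(labels).most_common(1)[0][0]
        for patient_id, labels in by_patient.items()
    }


# Toy patch-level predictions (1 = pCR, 0 = no pCR); the values are made up.
toy_patches = [
    ("p1", 1), ("p1", 1), ("p1", 0),
    ("p2", 0), ("p2", 0), ("p2", 1),
]
toy_result = patient_level_votes(toy_patches)
```

Patient-level accuracy would then be the fraction of patients whose voted label matches the clinical outcome. A weighted vote using patch confidence scores is a possible alternative, though the abstract does not say whether one was used.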

Structural analogue-based protein structure domain assembly assisted by deep learning

Sat, 13/08/2022 - 5:30am
Abstract
Motivation: With the breakthrough of AlphaFold2, the protein structure prediction problem has seen remarkable progress through end-to-end deep learning techniques, with which correct folds can be built for nearly all single-domain proteins. However, full-chain modelling appears to have lower average accuracy than modelling of the constituent domains and places higher demands on computing hardware, indicating that the performance of full-chain modelling still needs to be improved. In this study, we investigate whether the prediction accuracy of the full-chain model can be further improved by domain assembly assisted by deep learning.
Results: In this article, we developed a structural analogue-based protein structure domain assembly method assisted by deep learning, named SADA. In SADA, a multi-domain protein structure database (MPDB) was constructed for full-chain analogue detection using individual domain models. Starting from the initial model constructed from the analogue, the domain assembly simulation was performed to generate the full-chain model through a two-stage differential evolution algorithm guided by an energy function with an inter-residue distance potential predicted by deep learning. SADA was compared with state-of-the-art domain assembly methods on 356 benchmark proteins, and the average TM-score of SADA models is 8.1% and 27.0% higher than that of DEMO and AIDA, respectively. We also assembled 293 human multi-domain proteins, where the average TM-score of the full-chain model after assembly by SADA is 1.1% higher than that of the model by AlphaFold2. To conclude, we find that domains often interact in a similar way in their quaternary orientations if the domains have similar tertiary structures. Furthermore, homologous templates and structural analogues are complementary for multi-domain protein full-chain modelling.
Availability: http://zhanglab-bioinf.com/SADA
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Deep Local Analysis evaluates protein docking conformations with locally oriented cubes

Sat, 13/08/2022 - 5:30am
Abstract
Motivation: With the recent advances in protein 3D structure prediction, protein interactions are becoming more central than ever before. Here, we address the problem of determining how proteins interact with one another. More specifically, we investigate the possibility of discriminating near-native protein complex conformations from incorrect ones by exploiting local environments around interfacial residues.
Results: Deep Local Analysis (DLA)-Ranker is a deep learning framework applying 3D convolutions to a set of locally oriented cubes representing the protein interface. It explicitly considers the local geometry of the interfacial residues along with their neighboring atoms and the regions of the interface with different solvent accessibility. We assessed its performance on three docking benchmarks made of half a million acceptable and incorrect conformations. We show that DLA-Ranker successfully identifies near-native conformations from ensembles generated by molecular docking. It surpasses or competes with other deep learning-based scoring functions. We also showcase its usefulness to discover alternative interfaces.
Availability: http://gitlab.lcqb.upmc.fr/dla-ranker/DLA-Ranker.git
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Graph attention network for link prediction of gene regulations from single cell RNA-sequencing data

Fri, 12/08/2022 - 5:30am
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) data provide unprecedented opportunities to reconstruct gene regulatory networks (GRNs) at fine-grained resolution. Numerous unsupervised or self-supervised models have been proposed to infer GRNs from bulk RNA-seq data, but few of them are appropriate for scRNA-seq data, which suffer from a low signal-to-noise ratio and dropout. Fortunately, the surge in TF-DNA binding data (e.g., ChIP-seq) makes supervised GRN inference possible. We regard supervised GRN inference as a graph-based link prediction problem that aims to learn low-dimensional vectorized representations of genes to predict potential regulatory interactions.
Results: In this paper, we present GENELink to infer latent interactions between transcription factors (TFs) and target genes in a GRN using a graph attention network. GENELink projects the single-cell gene expression with observed TF-gene pairs to a low-dimensional space. Then, the specific gene representations are learned, by optimizing the embedding space, to serve downstream similarity measurement or causal inference for gene pairs. Compared to eight existing GRN reconstruction methods, GENELink achieves comparable or better performance on seven scRNA-seq datasets with four types of ground-truth networks. We further apply GENELink to scRNA-seq data of human breast cancer metastasis and reveal regulatory heterogeneity of Notch and Wnt signaling pathways between primary tumour and lung metastasis. Moreover, the ontology enrichment results of the unique lung metastasis GRN indicate that mitochondrial oxidative phosphorylation (OXPHOS) is functionally important during the seeding step of the cancer metastatic cascade, which is validated by pharmacological assays.
Availability and implementation: The code and data are available at https://github.com/zpliulab/GENELink.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MultiGran-SMILES: Multi-Granularity SMILES Learning for Molecular Property Prediction

Fri, 12/08/2022 - 5:30am
Abstract
Motivation: Extracting useful molecular features is essential for molecular property prediction. Atom-level representation is a common representation of molecules but ignores their sub-structure or branch information to some extent, whereas the reverse holds for substring-level representation. Both atom-level and substring-level representations may lose the neighborhood or spatial information of molecules. Molecular graph representation, which aggregates the neighborhood information of a molecule, is weak at expressing chiral molecules or symmetrical structures. In this paper, we aim to make use of the advantages of representations at different granularities simultaneously for molecular property prediction. To this end, we propose a fusion model named MultiGran-SMILES, which integrates the molecular features of atoms, sub-structures, and graphs from the input. Compared with single-granularity representations of molecules, our method leverages the advantages of various granularity representations simultaneously and adaptively adjusts the contribution of each type of representation for molecular property prediction.
Results: The experimental results show that our MultiGran-SMILES method achieves state-of-the-art performance on the BBBP, LogP, HIV, and ClinTox datasets. For the BACE, FDA, and Tox21 datasets, the results are comparable with those of state-of-the-art models. Moreover, the experimental results show that the gains of our proposed method are larger for molecules with obvious functional groups or branches.
Availability: The code and data underlying this work are available on GitHub, at https://github.com/Jiangjing0122/MultiGran.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

medna-metadata: an open-source data management system for tracking environmental DNA samples and metadata

Fri, 12/08/2022 - 5:30am
Abstract
Motivation: Environmental DNA (eDNA), as a rapidly expanding research field, stands to benefit from shared resources including sampling protocols, study designs, discovered sequences, and taxonomic assignments to sequences. High quality community shareable eDNA resources rely heavily on comprehensive metadata documentation that captures the complex workflows covering field sampling, molecular biology lab work, and bioinformatic analyses. There are limited sources that provide documentation of database development on comprehensive metadata for eDNA and these workflows and no open-source software.
Results: We present medna-metadata, an open-source, modular system that aligns with Findable, Accessible, Interoperable, and Reusable (FAIR) guiding principles that support scholarly data reuse and the database and application development of a standardized metadata collection structure that encapsulates critical aspects of field data collection, wet lab processing, and bioinformatic analysis. Medna-metadata is showcased with metabarcoding data from the Gulf of Maine (Polinski et al., 2019).
Availability: The source code of the medna-metadata web application is hosted on GitHub (https://github.com/Maine-eDNA/medna-metadata). Medna-metadata is a docker-compose installable package. Documentation can be found at https://medna-metadata.readthedocs.io/en/latest/?badge=latest. The application is implemented in Python, PostgreSQL and PostGIS, RabbitMQ, and NGINX, with all major browsers supported. A demo can be found at https://demo.metadata.maine-edna.org/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Detecting and quantifying antibody reactivity in PhIP-Seq data with BEER

Fri, 12/08/2022 - 5:30am
Abstract
Summary: Because of their high abundance, easy accessibility in peripheral blood, and relative stability ex vivo, antibodies serve as excellent records of environmental exposures and immune responses. Phage Immuno-Precipitation Sequencing (PhIP-Seq) is the most efficient technique available for assessing antibody binding to hundreds of thousands of peptides at cohort scale. PhIP-Seq is a high-throughput approach for assessing antibody reactivity to hundreds of thousands of candidate epitopes. Accurate detection of weakly reactive peptides is particularly important for characterizing the development and decline of antibody responses. Here, we present BEER (Bayesian Enrichment Estimation in R), a software package specifically developed for quantification of peptide reactivity from PhIP-Seq experiments. BEER implements a hierarchical model, and produces posterior probabilities for peptide reactivity, and a fold change estimate to quantify the magnitude. BEER also offers functionality to infer peptide reactivity based on the edgeR package, though the improvement in speed is offset by slightly lower sensitivity compared to the Bayesian approach, specifically for weakly reactive peptides.
Availability and implementation: BEER is implemented in R, and freely available from the Bioconductor repository at https://bioconductor.org/packages/release/bioc/html/beer.html.
Categories: Bioinformatics Trends

GameRank: R package for feature selection and construction

Thu, 11/08/2022 - 5:30am
Abstract
Motivation: Calibrated and discriminating predictive models can be built through the direct optimization of model performance metrics with combinatorial search algorithms. Often, predictive algorithms are desired in clinical settings to identify patients who may be at high or low risk. However, because of the large combinatorial search space, these algorithms are slow and do not guarantee global optimality of their selection.
Results: Here we present a novel and quick maximum-likelihood-based feature selection algorithm, named GameRank. The method is implemented in an R package that includes additional functions to build calibrated and discriminative predictive models.
Availability: GameRank is available at https://github.com/Genentech/GameRank and released under the MIT License.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

mBrainAligner-Web: A Web Server for Cross-Modal Coherent Registration of Whole Mouse Brains

Thu, 11/08/2022 - 5:30am
Abstract
Summary: Recent whole-brain mapping projects are collecting increasingly large sets of high-resolution brain images using a variety of imaging, labeling, and sample preparation techniques. Both mining and analysis of these data require reliable and robust cross-modal registration tools. We recently developed mBrainAligner, a pipeline for performing cross-modal registration of the whole mouse brain. However, using this tool requires scripting or command-line skills to assemble and configure the different modules of mBrainAligner to accommodate different registration requirements and platform settings. In this application note, we present mBrainAligner-Web, a web server with a user-friendly interface that allows users to configure and run mBrainAligner locally or remotely across platforms.
Availability and implementation: mBrainAligner-Web is available at http://mbrainaligner.ahu.edu.cn/ with source code at https://github.com/reaneyli/mBrainAligner-web.
Categories: Bioinformatics Trends

Detection of orthologous exons and isoforms using EGIO

Wed, 10/08/2022 - 5:30am
Abstract
Motivation: Alternative splicing is an important mechanism for generating transcriptomic and phenotypic diversity. Existing methods have limited power to detect orthologous isoforms.
Results: We develop a new method, EGIO, to detect orthologous exons and orthologous isoforms from two species. EGIO uses unique exonic regions to construct exon groups, during which a dynamic programming strategy is used to align exons. EGIO can cover all the coding exons within orthologous genes. A comparison between EGIO and ExTraMapper shows that EGIO can detect more orthologous isoforms with conserved sequence and exon structures. We apply EGIO to compare human and chimpanzee protein-coding isoforms expressed in the frontal cortex and identify 6912 genes that express human unique isoforms. Unexpectedly, more human unique isoforms are detected than isoforms conserved between humans and chimpanzees.
Availability: Source code and test data of EGIO are available at https://github.com/wu-lab-egio/EGIO.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

massDatabase: utilities for the operation of the public compound and pathway database

Tue, 09/08/2022 - 5:30am
Abstract
Summary: One of the major challenges in LC-MS data is converting many metabolic feature entries to biological function information, such as metabolite annotation and pathway enrichment, which are based on the compound and pathway databases. Multiple online databases have been developed. However, no tool has been developed for operating all these databases for biological analysis. Therefore, we developed massDatabase, an R package that operates the online public databases and combines with other tools for streamlined compound annotation and pathway enrichment. massDatabase is a flexible, simple, and powerful tool that can be installed on all platforms, allowing the users to leverage all the online public databases for biological function mining. A detailed tutorial and a case study are provided in the Supplementary Materials.
Availability and implementation: https://massdatabase.tidymass.org/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MICER: A Pre-trained Encoder-Decoder Architecture for Molecular Image Captioning

Fri, 05/08/2022 - 5:30am
Abstract
Motivation: Automatic recognition of chemical structures from molecular images provides an important avenue for the rediscovery of chemicals. Traditional rule-based approaches that rely on expert knowledge and fail to consider all the stylistic variations of molecular images usually suffer from cumbersome recognition processes and low generalization ability. Deep learning-based methods that integrate different image styles and automatically learn valuable features are flexible, but currently under-researched and have limitations, and are therefore not fully exploited.
Results: MICER, an encoder-decoder-based, reconstructed architecture for molecular image captioning, combines transfer learning, attention mechanisms, and several strategies to strengthen effectiveness and plasticity in different datasets. The effects of stereochemical information, molecular complexity, data volume, and pre-trained encoders on MICER performance were evaluated. Experimental results show that the intrinsic features of the molecular images and the sub-model match have a significant impact on the performance of this task. These findings inspire us to design the training dataset and the encoder for the final validation model, and the experimental results suggest that the MICER model consistently outperforms the state-of-the-art methods on four datasets. MICER was more reliable and scalable due to its interpretability and transfer capacity and provides a practical framework for developing comprehensive and accurate automated molecular structure identification tools to explore unknown chemical space.
Availability: https://github.com/Jiacai-Yi/MICER
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends


