
Hierarchical Reinforcement Learning for Automatic Disease Diagnosis

Bioinformatics Oxford Journals - Fri, 01/07/2022 - 5:30am
Abstract
Motivation: Disease-diagnosis-oriented dialogue systems model the interactive consultation procedure as a Markov decision process, and reinforcement learning algorithms are used to solve the problem. Existing approaches usually employ a flat policy structure that treats all symptoms and diseases equally when choosing actions. This strategy works well in simple scenarios where the action space is small; however, its efficiency is challenged in real environments. Inspired by the offline consultation process, we propose to integrate a two-level hierarchical policy structure into the dialogue system for policy learning. The high-level policy consists of a master model that is responsible for triggering a low-level model; the low-level policy consists of several symptom checkers and a disease classifier. The proposed policy structure can handle diagnosis problems involving large numbers of diseases and symptoms.
Results: Experimental results on three real-world datasets and a synthetic dataset demonstrate that our hierarchical framework achieves higher accuracy and symptom recall in disease diagnosis compared with existing systems. We construct a benchmark including datasets and implementations of existing algorithms to encourage follow-up research.
Availability: The code and data are available from https://github.com/FudanDISC/DISCOpen-MedBox-DialoDiagnosis
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

ProtNAff: Protein-bound Nucleic Acid filters and fragment libraries

Bioinformatics Oxford Journals - Fri, 01/07/2022 - 5:30am
Abstract
Motivation: Atomistic models of nucleic acid (NA) fragments can be used to model the 3D structures of specific protein-NA interactions and to address the problem of great NA flexibility, especially in single-stranded regions. One way to obtain relevant NA fragments is to extract them from existing 3D structures corresponding to the targeted context (e.g. specific 2D structures, protein families, sequences) and to learn from them. Several databases exist for specific NA 3D motifs, especially in RNA, but none can handle the variety of possible contexts.
Results: This paper presents protNAff, a new pipeline for the design of searchable databases on the 2D and 3D structures of protein-bound NA, the selection of context-specific (regions of) NA structures by combinations of filters, and the creation of context-specific NA fragment libraries. The strength of this pipeline is its modularity, which allows users to adapt it to many specific modeling problems. As examples, the pipeline is applied to the quantitative analysis of (i) the sequence-specificity of trinucleotide conformations, (ii) the conformational diversity of RNA at several levels of resolution, (iii) the effect of protein binding on RNA local conformations, and (iv) the protein-binding propensity of RNA hairpin loops of various lengths.
Availability: The source code is freely available for download at https://github.com/isaureCdB/protNAff. The database and the trinucleotide fragment library are downloadable at https://zenodo.org/record/6483823#.YmbVhFxByV4.
Categories: Bioinformatics Trends

Mirage 2.0: fast and memory-efficient reconstruction of gene-content evolution considering heterogeneous evolutionary patterns among gene families

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Summary: We present Mirage 2.0, which accurately estimates gene-content evolutionary history by considering heterogeneous evolutionary patterns among gene families. Notably, we introduce a deterministic pattern mixture (DPM) model, which makes Mirage substantially faster and more memory-efficient, and thus applicable to large datasets with thousands of genomes.
Availability: The source code is freely available at https://github.com/fukunagatsu/Mirage.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Ab-CoV: a curated database for binding affinity and neutralization profiles of coronavirus related antibodies

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Summary: We have developed a database, Ab-CoV, which contains manually curated experimental interaction profiles of 1780 coronavirus-related neutralizing antibodies. It contains more than 3200 datapoints on half maximal inhibitory concentration (IC50), half maximal effective concentration (EC50) and binding affinity (KD). Each entry with an experimentally known three-dimensional structure is complemented with the predicted changes in stability and affinity for all possible point mutations of interface residues. Ab-CoV also includes information on epitopes and paratopes, structural features of viral proteins, therapeutic antibodies with similar sequences and Collier de Perles plots. It supports structure visualization and offers options to search, display and download the data.
Availability and implementation: The Ab-CoV database is freely available at https://web.iitm.ac.in/bioinfo2/ab-cov/home.
Categories: Bioinformatics Trends

Topological analysis as a tool for detection of abnormalities in protein-protein interaction data

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Motivation: Protein-protein interaction datasets, which can be modeled as networks, constitute an essential layer in the multi-omics approach to biomedical knowledge. This representation gives insight into molecular pathways and helps to uncover novel potential drug targets or predict a therapy outcome. Nevertheless, the data that constitute such systems are frequently incomplete, error-prone and biased by scientific trends. Implementing methods for detecting such shortcomings could improve protein-protein interaction data analysis.
Results: We performed topological analysis of three protein-protein interaction networks (PPINs) from the IntAct Molecular Database, covering cancer and Parkinson’s disease (the two most common subjects in PPIN analysis) and the Human Reference Interactome. The data collections were shown to be often biased by scientific interests, which strongly affects the network structure. This may obscure correct systematic biological interpretation of the protein-protein interactions and limit their application potential. As a solution to this problem, we propose a set of topological methods for bias detection which, when performed as a first step, provide more objective biological conclusions regarding protein-protein interactions and their multi-omics consequences.
Availability: A user-friendly tool, ETNA (Extensive Tool for Network Analysis), is available at https://github.com/AlicjaNowakowska/ETNA. The software includes a graphical Colab notebook: https://githubtocolab.com/AlicjaNowakowska/ETNA/blob/main/ETNAColab.ipynb.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

i6mA-Caps: A CapsuleNet-based framework for identifying DNA N6-methyladenine sites

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Motivation: Recent research has demonstrated that DNA N6-methyladenine (6mA) has an essential function in epigenetic modification in eukaryotic species. 6mA has been linked to various biological processes. It is critical to create a new algorithm that can rapidly and reliably detect 6mA sites in genomes in order to investigate their biological roles. The identification of 6mA marks in the genome is the first and most important step in understanding the underlying molecular processes, as well as their regulatory functions.
Results: In this paper, we propose a novel computational tool called i6mA-Caps, a CapsuleNet-based framework for identifying DNA N6-methyladenine sites. The proposed framework uses a single encoding scheme for numerical representation of the DNA sequence. The numerical data are then used by a set of convolution layers to extract low-level features. These features are then used by the capsule network to extract intermediate-level and later high-level features to classify the 6mA sites. The proposed network is evaluated on three datasets belonging to three genomes: Rosaceae, rice and A. thaliana. The proposed method attained accuracies of 96.71%, 94% and 86.83% on the independent Rosaceae, rice and A. thaliana datasets, respectively. The proposed framework exhibited improved results when compared with existing top-of-the-line methods.
Availability: A user-friendly web server is made available for biological experts and can be accessed at: http://nsclbio.jbnu.ac.kr/tools/i6mA-Caps/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

InterpolatedXY: a two-step strategy to normalise DNA methylation microarray data avoiding sex bias

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Motivation: Data normalization is an essential step to reduce technical variation within and between arrays. Due to the different karyotypes and the effects of X chromosome inactivation, females and males exhibit distinct methylation patterns on sex chromosomes, which poses a significant challenge to normalising sex chromosome data without introducing bias. Existing methods do not provide unbiased solutions for normalising sex chromosome data; they usually process autosomal and sex chromosomes indiscriminately.
Results: Here, we demonstrate that ignoring this sex difference introduces artificial sex bias, especially for thousands of autosomal CpGs. We present a novel two-step strategy (interpolatedXY) to address this issue, which is applicable to all quantile-based normalisation methods. With this strategy, the autosomal CpGs are first normalised independently by conventional methods, such as funnorm or dasen; then the corrected methylation values of sex chromosome linked CpGs are estimated as the weighted average of their nearest neighbours on autosomes. The proposed two-step strategy can also be applied to other non-quantile-based normalisation methods, as well as other array-based data types. Moreover, we propose a useful concept, the sex explained fraction of variance, to quantitatively measure the normalisation effect.
Availability: The proposed methods are available by calling the function ‘adjustedDasen’ or ‘adjustedFunnorm’ in the latest wateRmelon package (https://github.com/schalkwyk/wateRmelon), with methods compatible with all the major workflows, including minfi.
Supplementary information: Supplementary data are available at Bioinformatics online.
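The second step described above (estimating corrected sex-chromosome values from their nearest autosomal neighbours) can be illustrated with a small sketch. This is not the wateRmelon implementation: the function name `interpolate_from_autosomes` and the choice of plain linear interpolation between the two nearest raw autosomal values are assumptions made for illustration only, applied to one sample at a time.

```python
from bisect import bisect_left


def interpolate_from_autosomes(raw_auto, norm_auto, raw_sex):
    """Map raw sex-chromosome values onto the normalised autosomal scale.

    raw_auto  : raw values of autosomal probes (one sample)
    norm_auto : the same probes after a conventional normalisation
    raw_sex   : raw values of sex-chromosome probes (same sample)

    Hypothetical helper: each sex-chromosome value becomes a weighted
    average of the normalised values of its two nearest autosomal
    neighbours (nearest in raw value), i.e. linear interpolation.
    """
    pairs = sorted(zip(raw_auto, norm_auto))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    corrected = []
    for v in raw_sex:
        i = bisect_left(xs, v)
        if i == 0:
            # Below the autosomal range: clamp to the lowest neighbour.
            corrected.append(ys[0])
        elif i == len(xs):
            # Above the autosomal range: clamp to the highest neighbour.
            corrected.append(ys[-1])
        else:
            # bisect_left guarantees xs[i - 1] < v <= xs[i], so x1 > x0.
            x0, x1 = xs[i - 1], xs[i]
            y0, y1 = ys[i - 1], ys[i]
            w = (v - x0) / (x1 - x0)
            corrected.append(y0 + w * (y1 - y0))
    return corrected
```

Because the sex-chromosome probes never take part in the autosomal normalisation itself, they cannot shift the autosomal values, which is the source of bias the abstract describes.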
Categories: Bioinformatics Trends

microbiomeMarker: an R/Bioconductor package for microbiome marker identification and visualization

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Summary: Characterizing biomarkers based on microbiome profiles has great potential for translational medicine and precision medicine. Here, we present microbiomeMarker, an R/Bioconductor package implementing commonly used normalization and differential analysis methods, and three supervised learning models to identify microbiome markers. microbiomeMarker also allows comparison of different methods of differential analysis and confounder analysis. It uses standardized input and output formats, which renders it highly scalable and extensible, and allows it to seamlessly interface with other microbiome packages and tools. In addition, the package provides a set of functions to visualize and interpret the identified microbiome markers.
Availability and implementation: microbiomeMarker is freely available from Bioconductor (https://www.bioconductor.org/packages/microbiomeMarker). Source code is available and maintained at GitHub (https://github.com/yiluheihei/microbiomeMarker).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

HSMotifDiscover: identification of motifs in sequences composed of non-single-letter elements

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Summary: The functional substring(s) of a biopolymer sequence define the specificity of its interactions with other biomolecules and are often referred to as motifs. Computational algorithms and software have been broadly developed for finding such motifs in sequences in which the individual elements are single characters, such as those in DNA and protein sequences. However, there are more complex scenarios where the motifs exist in non-single-letter contexts, for example, preferred patterns of chemical modifications on proteins, DNAs, RNAs, or polysaccharides. To search for those motifs, we describe a new method that converts the modified sequence elements to representative single-letter codes and then uses a modified Gibbs-sampling algorithm to define the position specific scoring matrix (PSSM) representing the motif(s). As a proof of principle, we describe the implementation and application of an R package for discovering heparan sulfate (HS) motifs in glycan sequences, which are important in regulating protein-protein interactions. This software can be valuable for analyzing high-throughput glycoprotein binding data using microarrays with HS oligosaccharides or other biological polymers.
Availability and implementation: HSMotifDiscover is freely available as an open source R package released under an MIT license at https://github.com/bioinfoDZ/HSMotifDiscover and also available in the form of an app at https://hsmotifdiscover.shinyapps.io/HSMotifDiscover_ShinyApp/.
Supplementary information: Supplementary data are available at Bioinformatics online.
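The encoding idea in this summary can be sketched in a few lines. The example below is not the HSMotifDiscover R package: the codebook, the element labels and both function names are hypothetical, and the Gibbs-sampling search is left out entirely. It only shows how multi-character elements might be collapsed to single letters, and how a per-position probability matrix (the table from which a log-odds PSSM is usually derived) could be computed from sites that are already aligned.

```python
from collections import Counter


def encode(elements, codebook):
    """Collapse a list of multi-character elements into a one-letter string."""
    return "".join(codebook[e] for e in elements)


def position_probabilities(sites, alphabet, pseudocount=1.0):
    """Per-position letter probabilities for equal-length aligned sites.

    A pseudocount keeps unseen letters at a non-zero probability.
    """
    width = len(sites[0])
    total = len(sites) + pseudocount * len(alphabet)
    matrix = []
    for pos in range(width):
        counts = Counter(site[pos] for site in sites)
        matrix.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return matrix


# Hypothetical codebook: two disaccharide labels mapped to single letters.
codebook = {"GlcA-GlcNAc": "a", "IdoA2S-GlcNS6S": "b"}
encoded = encode(["GlcA-GlcNAc", "IdoA2S-GlcNS6S", "GlcA-GlcNAc"], codebook)
```

Once sequences are in single-letter form, any motif finder that expects ordinary character strings can in principle be applied to them.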
Categories: Bioinformatics Trends

ResPAN: a powerful batch correction model for scRNA-seq data through residual adversarial networks

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Motivation: With the advancement of technology, we can generate and access large-scale, high-dimensional and diverse genomics data, especially through single-cell RNA sequencing (scRNA-seq). However, integrative downstream analysis of multiple scRNA-seq datasets remains challenging due to batch effects.
Results: In this paper, we propose a light-structured deep learning framework called ResPAN for scRNA-seq data integration. ResPAN is based on a Wasserstein Generative Adversarial Network (WGAN) combined with random walk mutual nearest neighbor pairing and fully skip-connected autoencoders to reduce the differences among batches. We also discuss the limitations of existing methods and demonstrate the advantages of our model over seven other methods through extensive benchmarking studies on both simulated data under various scenarios and real datasets across different scales. Our model achieves leading performance on both batch correction and biological information conservation and remains scalable to datasets with over half a million cells.
Availability: An open-source implementation of ResPAN and scripts to reproduce the results can be downloaded from: https://github.com/AprilYuge/ResPAN.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

RAPPPID: Towards Generalisable Protein Interaction Prediction with AWD-LSTM Twin Networks

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Motivation: Computational methods for the prediction of protein-protein interactions, while important tools for researchers, are plagued by challenges in generalising to unseen proteins. Datasets used for modelling protein-protein predictions are particularly predisposed to information leakage and sampling biases.
Results: In this study, we introduce RAPPPID, a method for the Regularised Automatic Prediction of Protein-Protein Interactions using Deep Learning. RAPPPID is a twin AWD-LSTM network which employs multiple regularisation methods during training time to learn generalised weights. Testing on stringent interaction datasets composed of proteins not seen during training, RAPPPID outperforms state-of-the-art methods. Further experiments show that RAPPPID’s performance holds regardless of the particular proteins in the testing set and its performance is higher for experimentally supported edges. This study serves to demonstrate that appropriate regularisation is an important component of overcoming the challenges of creating models for protein-protein interaction prediction that generalise to unseen proteins. Additionally, as part of this study, we provide datasets corresponding to several data splits of various strictness, in order to facilitate assessment of PPI reconstruction methods by others in the future.
Availability: Code and datasets are freely available at https://github.com/jszym/rapppid and https://zenodo.org.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

mapDATAge: a ShinyR package to chart ancient DNA data through space and time

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Summary: Ancient DNA datasets are increasingly difficult to visualise for users lacking computational experience. Here, we describe mapDATAge, which aims to provide user-friendly automated modules for the interactive mapping of allele, haplogroup and/or ancestry distributions through space and time. mapDATAge enhances collaborative data sharing while assisting the assessment and reporting of spatio-temporal patterns of genetic changes.
Availability: mapDATAge is a Shiny R application designed for exploring spatiotemporal patterns in ancient DNA data through a graphical user interface (GUI). It is freely available under GNU Public License on GitHub: https://github.com/xuefenfei712/mapDATAge.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

PScL-DDCFPred: an ensemble deep learning-based approach for characterizing multiclass subcellular localization of human proteins from bioimage data

Bioinformatics Oxford Journals - Thu, 30/06/2022 - 5:30am
Abstract
Motivation: Characterization of protein subcellular localization has become an important and long-standing task in bioinformatics and computational biology, which provides valuable information for elucidating various cellular functions of proteins and guiding drug design.
Results: Here, we develop a novel bioimage-based computational approach, termed PScL-DDCFPred, to accurately predict protein subcellular localizations in human tissues. PScL-DDCFPred first extracts multiview image features, including global and local features, as base or pure features; next, it applies a new integrative feature selection method based on stepwise discriminant analysis and generalized discriminant analysis to identify the optimal feature sets from the extracted pure features; finally, a classifier based on a deep neural network (DNN) and deep-cascade forest (DCF) is established. Stringent ten-fold cross-validation tests on the new protein subcellular localization training dataset, constructed from the human protein atlas databank, illustrate that PScL-DDCFPred achieves better performance than several existing state-of-the-art methods. Moreover, the independent test set further illustrates the generalization capability and superiority of PScL-DDCFPred over existing predictors. In-depth analysis shows that the excellent performance of PScL-DDCFPred can be attributed to three critical factors, namely the effective combination of the DNN and DCF models, the complementarity of global and local features, and the use of the optimal feature sets selected by the integrative feature selection algorithm.
Availability: https://github.com/csbio-njust-edu/PScL-DDCFPred
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MSRCall: A Multi-scale Deep Neural Network to Basecall Oxford Nanopore Sequences

Bioinformatics Oxford Journals - Wed, 29/06/2022 - 5:30am
Abstract
Motivation: MinION, a third-generation sequencer from Oxford Nanopore Technologies, is a portable device that can provide long nucleotide read data in real-time. It primarily aims to deduce the makeup of nucleotide sequences from the ionic current signals generated when passing DNA/RNA fragments through nanopores charged with a voltage difference. To determine nucleotides from measured signals, a translation process known as basecalling is required. However, compared to NGS basecallers, the calling accuracy of MinION still needs to be improved.
Results: In this work, a simple but powerful neural network architecture called MSRCall is proposed. MSRCall comprises a multi-scale structure, recurrent layers, a fusion block, and a CTC decoder. To better identify both short-range and long-range dependencies, the recurrent layer is redesigned to capture various time-scale features with a multi-scale structure. The results show that MSRCall outperforms other basecallers in terms of both read and consensus accuracies.
Availability: MSRCall is available at: https://github.com/d05943006/MSRCall
Supplementary information: Supplementary data are available.
Categories: Bioinformatics Trends

Outlier Detection for Multi-Network Data

Bioinformatics Oxford Journals - Tue, 28/06/2022 - 5:30am
Abstract
Motivation: It has become routine in neuroscience studies to measure brain networks for different individuals using neuroimaging. These networks are typically expressed as adjacency matrices, with each cell containing a summary of connectivity between a pair of brain regions. There is an emerging statistical literature describing methods for the analysis of such multi-network data in which nodes are common across networks but the edges vary. However, there has been essentially no consideration of the important problem of outlier detection. In particular, for certain subjects, the neuroimaging data are of such poor quality that the network cannot be reliably reconstructed. For such subjects, the resulting adjacency matrix may be mostly zero or exhibit a bizarre pattern not consistent with a functioning brain. These outlying networks may serve as influential points, contaminating subsequent statistical analyses. We propose a simple Outlier DetectIon for Networks (ODIN) method relying on an influence measure under a hierarchical generalized linear model for the adjacency matrices. An efficient computational algorithm is described, and ODIN is illustrated through simulations and an application to data from the UK Biobank.
Results: ODIN was successful in identifying moderate to extreme outliers. Removing such outliers can significantly change inferences in downstream applications.
Availability: ODIN has been implemented in both Python and R, and these implementations, along with other code, are publicly available at github.com/pritamdey/ODIN-python and github.com/pritamdey/ODIN-r, respectively.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

A Unifying Network Modeling Approach for Codon Optimization

Bioinformatics Oxford Journals - Tue, 28/06/2022 - 5:30am
Abstract
Motivation: Synthesising genes to be expressed in other organisms is an essential tool in biotechnology. While the many-to-one mapping from codons to amino acids makes the genetic code degenerate, codon usage in a particular organism is not random either. This bias in codon use may have a remarkable effect on the level of gene expression. A number of measures have been developed to quantify a given codon sequence’s strength to express a gene in a host organism. Codon optimization aims to find a codon sequence that will optimize one or more of these measures. Efficient computational approaches are needed since the possible number of codon sequences grows exponentially as the number of amino acids increases.
Results: We develop a unifying modeling approach for codon optimization. With our mathematical formulations based on graph/network representations of amino acid sequences, any combination of measures can be optimized in the same framework by finding a path satisfying additional limitations in an acyclic layered network. We tested our approach on bi-objectives commonly used in the literature, namely, Codon Pair Bias vs. Codon Adaptation Index and Relative Codon Pair Bias vs. Relative Codon Bias. However, our framework is general enough to handle any number of objectives concurrently with certain restrictions or preferences on the use of specific nucleotide sequences. We implemented our models using Python’s Gurobi interface and showed the efficacy of our approach even for the largest proteins available. We also provided experimentation showing that highly expressed genes have objective values close to the optimized values in the bi-objective codon design problem.
Availability and implementation: http://alpersen.bilkent.edu.tr/NetworkCodon.zip
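To make the layered-network view concrete, here is a minimal sketch in which each amino acid position is a layer, each synonymous codon is a node, and the best codon sequence is the highest-scoring path through the layers. It is not the authors' Gurobi formulation: it uses plain dynamic programming, handles a single additive objective with no side constraints, and the function name `best_codon_path` and all scores are hypothetical.

```python
def best_codon_path(protein, codons, codon_score, pair_score):
    """Return (score, codon list) of the best path through the layers.

    protein     : amino acid string, one layer per residue
    codons      : dict mapping amino acid -> list of synonymous codons
    codon_score : dict mapping codon -> node score (e.g. a log weight)
    pair_score  : dict mapping (codon, next codon) -> edge score;
                  missing pairs are scored 0.0
    """
    # layer[c] = (best score of any path ending at codon c, that path)
    layer = {c: (codon_score[c], [c]) for c in codons[protein[0]]}
    for aa in protein[1:]:
        nxt = {}
        for c in codons[aa]:
            # Pick the predecessor that maximises path score plus edge score.
            prev, (s, path) = max(
                layer.items(),
                key=lambda kv: kv[1][0] + pair_score.get((kv[0], c), 0.0),
            )
            edge = pair_score.get((prev, c), 0.0)
            nxt[c] = (s + edge + codon_score[c], path + [c])
        layer = nxt
    return max(layer.values(), key=lambda v: v[0])


# Toy instance with made-up scores.
toy_codons = {"M": ["ATG"], "K": ["AAA", "AAG"]}
toy_codon_score = {"ATG": 0.0, "AAA": -0.5, "AAG": -0.1}
toy_pair_score = {("ATG", "AAA"): 1.0}
score, path = best_codon_path("MK", toy_codons, toy_codon_score, toy_pair_score)
```

The work per layer depends only on the number of synonymous codons in adjacent layers, so the running time grows linearly with protein length even though the number of candidate codon sequences grows exponentially.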
Categories: Bioinformatics Trends

Improving candidate Biosynthetic Gene Clusters in fungi through reinforcement learning

Bioinformatics Oxford Journals - Tue, 28/06/2022 - 5:30am
Abstract
Motivation: Precise identification of Biosynthetic Gene Clusters (BGCs) is a challenging task. The performance of BGC discovery tools is limited by their capacity to accurately predict components belonging to candidate BGCs, often overestimating cluster boundaries. To support optimizing the composition and boundaries of candidate BGCs, we propose a reinforcement learning approach relying on protein domains and functional annotations from expert-curated BGCs.
Results: The proposed reinforcement learning method aims to improve candidate BGCs obtained with state-of-the-art tools. It was evaluated on candidate BGCs obtained for two fungal genomes, Aspergillus niger and Aspergillus nidulans. The results highlight an improvement of the gene precision by above 15% for TOUCAN, fungiSMASH and DeepBGC, and of the cluster precision by above 25% for fungiSMASH and DeepBGC, allowing these tools to obtain almost perfect precision in cluster prediction. This can pave the way for optimizing current prediction of candidate BGCs in fungi, while minimizing the curation effort required by domain experts.
Availability and implementation: https://github.com/bioinfoUQAM/RL-bgc-components
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Robust Identification of Temporal Biomarkers in Longitudinal Omics Studies

Bioinformatics Oxford Journals - Tue, 28/06/2022 - 5:30am
Abstract
Motivation: Longitudinal studies increasingly collect rich 'omics' data sampled frequently over time and across large cohorts to capture dynamic health fluctuations and disease transitions. However, the generation of longitudinal omics data has preceded the development of analysis tools that can efficiently extract insights from such data. In particular, there is a need for statistical frameworks that can identify not only which omics features are differentially regulated between groups but also over what time intervals. Additionally, longitudinal omics data may have inconsistencies, including nonuniform sampling intervals, missing data points, subject dropout, and differing numbers of samples per subject.
Results: In this work, we developed OmicsLonDA, a statistical method that provides robust identification of time intervals of temporal omics biomarkers. OmicsLonDA is based on a semi-parametric approach, in which we use smoothing splines to model longitudinal data and infer significant time intervals of omics features based on an empirical distribution constructed through a permutation procedure. We benchmarked OmicsLonDA on five simulated datasets with diverse temporal patterns, and the method showed specificity greater than 0.99 and sensitivity greater than 0.87. Applying OmicsLonDA to the iPOP cohort revealed temporal patterns of genes, proteins, hormone metabolites, and microbes that are differentially regulated in male versus female subjects following a respiratory infection. In addition, we applied OmicsLonDA to a longitudinal multi-omics dataset of pregnant women with and without preeclampsia, and the method identified potential lipid markers that are temporally significantly different between the two groups.
Availability: We provide an open-source R package (https://bioconductor.org/packages/OmicsLonDA) to enable widespread use.
Supplementary information: Supplementary data are available at Bioinformatics online.
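The permutation step mentioned in the results can be illustrated in isolation. The sketch below is not OmicsLonDA: it replaces the smoothing-spline fit with a simple difference of group means within a single time interval, and the function name `permutation_pvalue` and its parameters are assumptions chosen for the example.

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Group labels are shuffled n_perm times to build an empirical null
    distribution of the test statistic; the observed statistic is then
    ranked against it. Adding 1 to numerator and denominator keeps the
    estimate strictly positive.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

In a longitudinal setting one would compute such a p-value per feature and per time interval and then correct for multiple testing; that surrounding machinery is omitted here.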
Categories: Bioinformatics Trends

Vfold-Pipeline: a web server for RNA 3D structure prediction from sequences

Bioinformatics Oxford Journals - Mon, 27/06/2022 - 5:30am
Abstract
Summary: RNA 3D structures are critical for understanding their functions and for RNA-targeted drug design. However, experimental determination of RNA 3D structures is laborious and technically challenging, leading to a huge gap between the number of sequences and the availability of RNA structures. Therefore, computer-aided prediction of RNA 3D structures from sequences becomes a highly desirable solution to this problem. Here, we present a pipeline server for RNA 3D structure prediction from sequences that integrates the Vfold2D, Vfold3D, and VfoldLA programs. The Vfold2D program can incorporate SHAPE experimental data in 2D structure prediction. The pipeline can also automatically extract 2D structural constraints from the Rfam database. Furthermore, with a significantly expanded 3D template database for various motifs, this Vfold-Pipeline server can efficiently return accurate 3D structure predictions or reliable initial 3D structures for further refinement.
Availability and implementation: http://rna.physics.missouri.edu/vfoldPipeline/index.html
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Single-cell generalized trend model (scGTM): a flexible and interpretable model of gene expression trend along cell pseudotime

Bioinformatics Oxford Journals - Mon, 27/06/2022 - 5:30am
Abstract
Motivation: Modeling single-cell gene expression trends along cell pseudotime is a crucial analysis for exploring biological processes. Most existing methods rely on nonparametric regression models for their flexibility; however, nonparametric models often provide trends too complex to interpret. Other existing methods use interpretable but restrictive models. Since model interpretability and flexibility are both indispensable for understanding biological processes, the single-cell field needs a model that improves the interpretability and largely maintains the flexibility of nonparametric regression models.
Results: Here we propose the single-cell generalized trend model (scGTM) for capturing a gene’s expression trend, which may be monotone, hill-shaped, or valley-shaped, along cell pseudotime. The scGTM has three advantages: (1) it can capture non-monotonic trends that are still easy to interpret, (2) its parameters are biologically interpretable and trend informative, and (3) it can flexibly accommodate common distributions for modeling gene expression counts. To tackle the complex optimization problems, we use the particle swarm optimization algorithm to find the constrained maximum likelihood estimates for the scGTM parameters. As an application, we analyze several single-cell gene expression data sets using the scGTM and show that it can capture interpretable gene expression trends along cell pseudotime and reveal molecular insights underlying the biological processes.
Availability and implementation: The Python package scGTM is open-access and available at https://github.com/ElvisCuiHan/scGTM.
Categories: Bioinformatics Trends
