Subscribe to Bioinformatics Oxford Journals feed

FastMix: A Versatile Data Integration Pipeline for Cell Type-Specific Biomarker Inference

Fri, 26/08/2022 - 5:30am
Abstract
Motivation: Flow cytometry (FCM) and transcription profiling are two widely used assays in translational immunology research. However, there is no data integration pipeline for analyzing these two types of assays together with experiment variables for biomarker inference. Current FCM data analysis mainly relies on subjective manual gating analysis, which is difficult to integrate directly with other automated computational methods. Existing deconvolutional analysis of bulk transcriptomics relies on predefined marker genes in the transcriptomics data, which are unavailable for novel cell types, and does not utilize the FCM data that provide canonical phenotypic definitions of the cell types.
Results: We developed a novel analytics pipeline - FastMix - for computational immunology, which integrates flow cytometry, bulk transcriptomics, and clinical covariates for identifying cell type-specific gene expression signatures and biomarker genes. FastMix addresses the “large p, small n” problem in the gene expression and flow cytometry integration analysis via a linear mixed effects model (LMER) for both cross-sectional and longitudinal studies. Its novel moment-based estimator not only reduces bias in parameter estimation but is also more efficient than iterative optimization. The FastMix pipeline also includes a cutting-edge flow cytometry data analysis method - DAFi - for identifying cell populations of interest and their characteristics. Simulation studies showed that FastMix produced smaller type I/II errors than competing methods. Validation using real data from two vaccine studies showed that FastMix identified a set of signature genes consistent with those found in an independent single-cell RNA-seq analysis, and produced additional interesting findings.
Availability: Source code of FastMix is publicly available at https://github.com/terrysun0302/FastMix.
Supplementary information: Supplementary text and data are available at Bioinformatics online.
Categories: Bioinformatics Trends

hCoCena: Horizontal integration and analysis of transcriptomics datasets

Fri, 26/08/2022 - 5:30am
Abstract
Motivation: Transcriptome-based gene co-expression analysis has become a standard procedure for structured and contextualized understanding and comparison of different conditions and phenotypes. Since large study designs with a broad variety of conditions are costly and laborious, extensive comparisons are hindered when utilizing only a single data set. Thus, there is an increased need for tools that allow the integration of multiple transcriptomic data sets with subsequent joint analysis, which can provide a more systematic understanding of gene co-expression and co-functionality within and across conditions. To make such an integrative analysis accessible to a wide spectrum of users with differing levels of programming expertise, it is essential to provide user-friendliness and customizability as well as thorough documentation.
Results: This paper introduces horizontal CoCena (hCoCena: horizontal construction of co-expression networks and analysis), an R-package for network-based co-expression analysis that allows the analysis of a single transcriptomic data set as well as the joint analysis of multiple data sets. With hCoCena we provide a freely available, user-friendly, and adaptable tool for integrative multi-study or single-study transcriptomics analyses alongside extensive comparisons to other existing tools.
Availability: The hCoCena R-package is provided together with R Markdowns that implement an exemplary analysis workflow including extensive documentation and detailed descriptions of data structures and objects. Such efforts not only make the tool easy to use but also enable the seamless integration of user-written scripts and functions into the workflow, creating a tool that provides a clear design while remaining flexible and highly customizable. The package and additional information including an extensive Wiki are freely available on GitHub: https://github.com/MarieOestreich/hCoCena. The version at the time of writing has been added to Zenodo under the following link: https://doi.org/10.5281/zenodo.6911782
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Propeller: testing for differences in cell type proportions in single cell data

Thu, 25/08/2022 - 5:30am
Abstract
Motivation: Single cell RNA Sequencing (scRNA-seq) has rapidly gained popularity over the last few years for profiling the transcriptomes of thousands to millions of single cells. This technology is now being used to analyse experiments with complex designs including biological replication. One question that can be asked from single cell experiments, which has been difficult to directly address with bulk RNA-seq data, is whether the cell type proportions are different between two or more experimental conditions. As well as gene expression changes, the relative depletion or enrichment of a particular cell type can be the functional consequence of disease or treatment. However, cell type proportion estimates from scRNA-seq data are variable and statistical methods that can correctly account for different sources of variability are needed to confidently identify statistically significant shifts in cell type composition between experimental conditions.
Results: We have developed propeller, a robust and flexible method that leverages biological replication to find statistically significant differences in cell type proportions between groups. Using simulated cell type proportions data we show that propeller performs well under a variety of scenarios. We applied propeller to test for significant changes in cell type proportions related to human heart development, ageing and COVID-19 disease severity.
Availability: The propeller method is publicly available in the open source speckle R package (https://github.com/phipsonlab/speckle). All the analysis code for the paper is available at the associated analysis website: https://phipsonlab.github.io/propeller-paper-analysis/. The speckle package, analysis scripts and datasets have been deposited at https://doi.org/10.5281/zenodo.7009042.
Supplementary information: Supplementary data are available at Bioinformatics online.
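As a rough illustration of the quantity being tested, the sketch below computes per-sample cell type proportions from per-cell labels. It is not the propeller method and does not use the speckle package; the function name `cell_type_proportions` and the toy data are assumptions made for this example, and the statistical test itself is left out.

```python
from collections import Counter, defaultdict

def cell_type_proportions(cells):
    """Return {sample_id: {cell_type: proportion}} from (sample_id, cell_type) pairs.

    Hypothetical helper for illustration only; one pair per sequenced cell.
    """
    counts = defaultdict(Counter)
    for sample_id, cell_type in cells:
        counts[sample_id][cell_type] += 1

    # Use the union of cell types so every sample reports every type,
    # including types with zero cells in that sample.
    cell_types = sorted({ct for c in counts.values() for ct in c})

    proportions = {}
    for sample_id, c in counts.items():
        total = sum(c.values())
        proportions[sample_id] = {ct: c[ct] / total for ct in cell_types}
    return proportions

# Toy data: two biological replicates with T and B cells.
cells = [
    ("s1", "T"), ("s1", "T"), ("s1", "B"),
    ("s2", "T"), ("s2", "B"), ("s2", "B"), ("s2", "B"),
]
props = cell_type_proportions(cells)
```

Each replicate contributes one proportion per cell type, which is the kind of replicate-level input a between-group test would then compare.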
Categories: Bioinformatics Trends

LipidMS 3.0: an R-package and a web-based tool for LC-MS/MS data processing and lipid annotation

Thu, 25/08/2022 - 5:30am
Abstract
Motivation: LipidMS was initially envisioned to use fragmentation rules and data-independent acquisition (DIA) for lipid annotation. However, data-dependent acquisition (DDA) remains the most widespread acquisition mode for untargeted LC-MS/MS-based lipidomics. Here we present LipidMS 3.0, an R package that not only adds DDA and new lipid classes to its pipeline, but also provides the functionalities required to cover the whole data analysis workflow from pre-processing (i.e., peak-picking, alignment and grouping) to lipid annotation.
Results: We applied the new workflow in the data analysis of a commercial human serum pool spiked with 68 representative lipid standards acquired in full scan, DDA and DIA modes. When focusing on the detected lipid standard features and total identified lipids, LipidMS 3.0 data pre-processing performance is similar to XCMS, whereas it complements the annotations returned by MS-DIAL, providing a higher level of structural information and a lower number of incorrect annotations. To extend and facilitate LipidMS 3.0 usage among less experienced R-programming users, the workflow was also implemented as a web-based application.
Availability: The LipidMS R-package is freely available at https://CRAN.R-project.org/package=LipidMS and as a website at http://www.lipidms.com.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Microbench: Automated metadata management for systems biology benchmarking and reproducibility in Python

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Computational systems biology analyses typically make use of multiple software packages and their dependencies, which are often run across heterogeneous compute environments. This can introduce differences in performance and reproducibility. Capturing metadata (e.g., package versions, GPU model) currently requires repetitious code and is difficult to store centrally for analysis. Even where virtual environments and containers are used, updates over time mean that versioning metadata should still be captured within analysis pipelines to guarantee reproducibility.
Results: Microbench is a simple and extensible Python package to automate metadata capture to a file or Redis database. Captured metadata can include execution time, software package versions, environment variables, hardware information, Python version, and more, with plugins. We present three case studies demonstrating Microbench usage to benchmark code execution and examine environment metadata for reproducibility purposes.
Availability: Install from the Python Package Index using "pip install microbench". Source code is available from https://github.com/alubbock/microbench.
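The general idea of automated metadata capture can be sketched with the standard library alone. The decorator below is a hypothetical stand-in written for this listing, not Microbench's API: the name `capture_metadata` and the record fields are assumptions, and it writes one JSON record per call to any file-like object.

```python
import functools
import io
import json
import platform
import sys
import time

def capture_metadata(outfile):
    """Hypothetical decorator: log timing and environment metadata per call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            record = {
                "function": func.__name__,
                "run_seconds": time.perf_counter() - start,
                "python_version": platform.python_version(),
                "platform": sys.platform,
            }
            # One JSON object per line keeps the log easy to append to and parse.
            outfile.write(json.dumps(record) + "\n")
            return result
        return wrapper
    return decorator

log = io.StringIO()  # stands in for a real output file

@capture_metadata(log)
def add(a, b):
    return a + b

total = add(2, 3)
record = json.loads(log.getvalue().splitlines()[0])
```

Keeping the capture logic in a decorator means the analysis function itself stays unchanged, which is the convenience the abstract describes.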
Categories: Bioinformatics Trends

Multi-way relation-enhanced hypergraph representation learning for anti-cancer drug synergy prediction

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Drug combinations have exhibited promise in treating cancers with less toxicity and fewer adverse reactions. However, in vitro screening of synergistic drug combinations is time-consuming and labour-intensive because of the combinatorial explosion. Although a number of computational methods have been developed for predicting synergistic drug combinations, the multi-way relations between drug combinations and cell lines existing in drug synergy data have not been well exploited.
Results: We propose a multi-way relation-enhanced hypergraph representation learning method to predict anti-cancer drug synergy, named HypergraphSynergy. HypergraphSynergy formulates synergistic drug combinations over cancer cell lines as a hypergraph, in which drugs and cell lines are represented by nodes and synergistic drug-drug-cell line triplets are represented by hyperedges, and leverages the biochemical features of drugs and cell lines as node attributes. Then, a hypergraph neural network is designed to learn the embeddings of drugs and cell lines from the hypergraph and predict drug synergy. Moreover, the auxiliary task of reconstructing the similarity networks of drugs and cell lines is considered to enhance the generalization ability of the model. In the computational experiments, HypergraphSynergy outperforms other state-of-the-art synergy prediction methods on two benchmark datasets for both classification and regression tasks, and is applicable to unseen drug combinations or cell lines. The studies revealed that the hypergraph formulation allows us to capture and explain complex multi-way relations of drug combinations and cell lines, and also provides a flexible framework to make the best use of diverse information.
Availability and implementation: The source data and codes of HypergraphSynergy can be freely downloaded from https://github.com/liuxuan666/HypergraphSynergy.
Supplementary information: Supplementary data are available at Bioinformatics online.
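The hypergraph formulation described above can be made concrete with a small incidence matrix, where rows are nodes (drugs and cell lines) and columns are hyperedges (synergistic drug-drug-cell line triplets). This is a minimal sketch under assumed names (`build_incidence`, the toy triplets); it is not taken from the HypergraphSynergy code and omits node attributes and the neural network.

```python
def build_incidence(triplets):
    """Return (nodes, H) where H[i][j] == 1 if node i belongs to hyperedge j.

    Hypothetical helper: each triplet (drug, drug, cell_line) is one hyperedge.
    """
    nodes = sorted({node for triplet in triplets for node in triplet})
    index = {node: i for i, node in enumerate(nodes)}
    H = [[0] * len(triplets) for _ in nodes]
    for j, triplet in enumerate(triplets):
        for node in triplet:
            H[index[node]][j] = 1
    return nodes, H

# Toy synergy data: two synergistic triplets sharing drugA.
triplets = [
    ("drugA", "drugB", "cell1"),
    ("drugA", "drugC", "cell2"),
]
nodes, H = build_incidence(triplets)
```

Unlike an ordinary graph edge, each column here connects three nodes at once, which is what lets the model treat a drug pair and its cell line as a single multi-way relation.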
Categories: Bioinformatics Trends

Predicting cross-tissue hormone-gene relations using balanced word embeddings

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Inter-organ/inter-tissue communication is central to multi-cellular organisms including humans, and mapping inter-tissue interactions can advance system-level whole-body modeling efforts. Large volumes of biomedical literature have fostered studies that map within-tissue or tissue-agnostic interactions, but literature mining studies that infer inter-tissue relations such as between hormones and genes are sorely missing.
Results: We present a first study to predict from biomedical literature the hormone-gene associations mediating inter-tissue signaling in the human body. Our BioEmbedS* models use neural network based Biomedical word Embeddings with a Support Vector Machine classifier to predict if a hormone-gene pair is associated or not, and whether an associated gene is involved in the hormone's production or response. Model training relies on our unified dataset HGv1 (Hormone-Gene version 1) of ground-truth associations between genes and endocrine hormones, which we compiled and carefully balanced in the embedded space to handle data disparities such as between poorly- vs. well-studied hormones. Our BioEmbedS model recapitulates known gene mediators of tissue-tissue signaling with 70.4% accuracy; predicts novel inter-tissue communication genes in humans which are enriched for hormone-related disorders; and generalizes well to mouse, thereby holding promise for its extension to other multi-cellular organisms as well.
Availability: Our model predictions and datasets are freely available at https://cross-tissue-signaling.herokuapp.com; all relevant code is at https://github.com/BIRDSgroup/BioEmbedS.
Supplementary information: Supplementary information is available at Bioinformatics online.
Categories: Bioinformatics Trends

Single-cell mutation calling and phylogenetic tree reconstruction with loss and recurrence

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Tumours evolve as heterogeneous populations of cells, which may be distinguished by different genomic aberrations. The resulting intra-tumour heterogeneity plays an important role in cancer patient relapse and treatment failure, so that obtaining a clear understanding of each patient's tumour composition and evolutionary history is key for personalised therapies. Single-cell sequencing now provides the possibility to resolve tumour heterogeneity at the highest resolution of individual tumour cells, but brings with it challenges related to the particular noise profiles of the sequencing protocols as well as the complexity of the underlying evolutionary process.
Results: By modelling the noise processes and allowing mutations to be lost or to reoccur during tumour evolution, we present a method to jointly call mutations in each cell, reconstruct the phylogenetic relationship between cells, and determine the locations of mutational losses and recurrences. Our Bayesian approach allows us to accurately call mutations as well as to quantify our certainty in such predictions. We show the advantages of allowing mutational loss or recurrence with simulated data and present its application to tumour single-cell sequencing data.
Availability: SCIϕN is available at https://github.com/cbg-ethz/SCIPhIN
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

ntHash2: recursive spaced seed hashing for nucleotide sequences

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Spaced seeds are robust alternatives to k-mers in analyzing nucleotide sequences with high base mismatch rates. Hashing is also crucial for efficiently storing abundant sequence data. Here, we introduce ntHash2, a fast algorithm for spaced seed hashing that can be integrated into various bioinformatics tools for efficient sequence analysis with applications in genome research.
Results: ntHash2 is up to 2.1x faster at hashing various spaced seeds than the previous version and 3.8x faster than conventional hashing algorithms with naïve adaptation. Additionally, we reduced the collision rate of ntHash for longer k-mer lengths and improved the uniformity of the hash distribution by modifying the canonical hashing mechanism.
Availability: ntHash2 is freely available online at github.com/bcgsc/ntHash under an MIT license.
Supplementary information: Supplementary data are available at Bioinformatics online.
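To make the terms concrete, the sketch below shows a rotate-and-XOR rolling hash over k-mers and a naive spaced seed hash that skips "don't care" positions. It is a simplified illustration, not ntHash2's recursive algorithm: the per-base constants are arbitrary values chosen for this example, canonical (reverse-complement) hashing is ignored, and all function names are assumptions.

```python
MASK = (1 << 64) - 1

# Arbitrary 64-bit constants for illustration; NOT the library's seed table.
SEED = {
    "A": 0x9E3779B97F4A7C15,
    "C": 0xC2B2AE3D27D4EB4F,
    "G": 0x165667B19E3779F9,
    "T": 0x85EBCA77C2B2AE63,
}

def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK

def hash_kmer(kmer):
    """Hash a k-mer from scratch: rotate, then XOR in each base's constant."""
    h = 0
    for base in kmer:
        h = rol(h, 1) ^ SEED[base]
    return h

def roll(h, k, base_out, base_in):
    """Update the hash in O(1) when the window slides by one base."""
    return rol(h, 1) ^ rol(SEED[base_out], k) ^ SEED[base_in]

def hash_spaced(window, pattern):
    """Naive spaced seed hash: only positions marked '1' contribute."""
    h = 0
    for base, care in zip(window, pattern):
        h = rol(h, 1)
        if care == "1":
            h ^= SEED[base]
    return h

seq, k = "ACGTTGCA", 4
rolled = [hash_kmer(seq[:k])]
for i in range(1, len(seq) - k + 1):
    rolled.append(roll(rolled[-1], k, seq[i - 1], seq[i + k - 1]))
direct = [hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1)]
```

The naive spaced version recomputes every window from scratch, which is the kind of cost a recursive spaced seed scheme aims to avoid.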
Categories: Bioinformatics Trends

GTFtools: a software package for analyzing various features of gene models

Wed, 24/08/2022 - 5:30am
Abstract
Motivation: Gene-centric bioinformatics studies frequently involve calculation or extraction of various features of genes such as splice sites, promoters, independent introns, and untranslated regions (UTRs) through manipulation of gene models. Gene models are often annotated in gene transfer format (GTF) files. The features are essential for subsequent analysis such as intron retention detection, DNA-binding site identification, and computing splicing strength of splice sites. Some features such as independent introns and splice sites are not provided in existing resources including the commonly used BioMart database. A package that implements and integrates functions to analyze various features of genes will greatly ease routine analysis for related bioinformatics studies. However, to the best of our knowledge, such a package is not available yet.
Results: In this work, we introduce GTFtools, a stand-alone command-line tool that provides a set of functions to calculate various gene features, including splice sites, independent introns, transcription start sites (TSS)-flanking regions, UTRs, isoform coordination and length, different types of gene lengths, etc. It takes ENSEMBL or GENCODE GTF files as input, and can be applied to both human and non-human gene models, such as those of the lab mouse. We compare the utilities of GTFtools with those of two related tools: Bedtools and BioMart. GTFtools is implemented in Python and does not depend on any third-party software, making it very easy to install and use.
Availability: GTFtools is freely available at www.genemine.org/gtftools.php as well as on PyPI and Bioconda.
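As a small example of deriving a feature from a gene model, the sketch below reads exon records from GTF text and computes the introns of each transcript as the gaps between consecutive exons. It is an illustration only, not GTFtools code: the function names and the toy records are assumptions, and these are plain per-transcript introns, not the "independent introns" the tool reports.

```python
import re
from collections import defaultdict

# Toy GTF records (9 tab-separated columns, 1-based inclusive coordinates).
GTF_TEXT = "\n".join([
    'chr1\tdemo\texon\t100\t200\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tdemo\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tdemo\texon\t500\t650\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
])

def exons_by_transcript(gtf_text):
    """Collect (start, end) exon intervals keyed by transcript_id."""
    exons = defaultdict(list)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))
    return exons

def introns(exon_list):
    """Return the gaps between consecutive exons, sorted by start position."""
    ordered = sorted(exon_list)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]

transcript_introns = {
    tid: introns(ex) for tid, ex in exons_by_transcript(GTF_TEXT).items()
}
```

Sorting exons by start position before pairing them keeps the calculation correct for both strands, since GTF coordinates always have start less than or equal to end.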
Categories: Bioinformatics Trends

Isoform function prediction by Gene Ontology embedding

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: High resolution annotation of gene functions is a central task in functional genomics. Multiple proteoforms translated from alternatively spliced isoforms from a single gene are actual function performers and greatly increase the functional diversity. The specific functions of different isoforms can decipher the molecular basis of various complex diseases at a finer granularity. Multi-instance learning (MIL) based solutions have been developed to distribute gene (bag)-level Gene Ontology (GO) annotations to isoforms (instances), but they simply presume that a particular annotation of the gene is attributable to only one isoform, neglect the hierarchical structures and semantics of massive GO terms (labels), or can only handle dozens of terms.
Results: We propose an effective approach, IsofunGO, to differentiate massive functions of isoforms by GO embedding. In particular, IsofunGO first introduces an attributed hierarchical network to model massive GO terms, and a GO network embedding strategy to learn compact representations of GO terms and project GO annotations of genes into compressed ones; this strategy not only explores and preserves hierarchy between GO terms but also greatly reduces the prediction load. Next, it develops an attention based multi-instance learning network to fuse genomics and transcriptomics data of isoforms and predict isoform functions by referring to compressed annotations. Extensive experiments on benchmark datasets demonstrate the efficacy of IsofunGO. Both the GO embedding and attention mechanism can boost the performance and interpretability.
Availability: The code of IsofunGO is available at http://www.sdu-idea.cn/codes.php?name=IsofunGO
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CLNN-loop: A deep learning model to predict CTCF-mediated chromatin loops in the different cell lines and CTCF-binding sites (CBS) pair types

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Three-dimensional (3D) genome organization is of vital importance in gene regulation and disease mechanisms. Previous studies have shown that CTCF-mediated chromatin loops are crucial to studying the 3D structure of cells. Although various experimental techniques have been developed to detect chromatin loops, they have been found to be time-consuming and costly. Nowadays, various sequence-based computational methods can capture significant features of 3D genome organization and help predict chromatin loops. However, these methods have low performance and poor generalization ability in predicting chromatin loops.
Results: Here, we propose a novel deep learning model, called CLNN-loop, to predict chromatin loops in different cell lines and CTCF-binding sites (CBS) pair types by fusing multiple sequence-based features. The analysis of a series of examinations based on the datasets in the previous study shows that CLNN-loop has satisfactory performance and is superior to the existing methods in terms of predicting chromatin loops. In addition, we apply the SHAP framework to interpret the predictions of different models, and find that CTCF motif and sequence conservation are important signs of chromatin loops in different cell lines and CBS pair types. The source code of CLNN-loop is freely available at https://github.com/HaoWuLab-Bioinformatics/CLNN-loop and the webserver of CLNN-loop is freely available at http://hwclnn.sdu.edu.cn.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Predicting cancer drug response using parallel heterogeneous graph convolutional networks with neighborhood interactions

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Due to cancer heterogeneity, the therapeutic effect may not be the same when a cohort of patients of the same cancer type receive the same treatment. The anticancer drug response prediction may help develop personalized therapy regimens to increase survival and reduce patients' expenses. Recently, graph neural network-based methods have aroused widespread interest and achieved impressive results on the drug response prediction task. However, most of them apply graph convolution to process cell line-drug bipartite graphs while ignoring the intrinsic differences between cell lines and drug nodes. Moreover, most of these methods aggregate node-wise neighbor features but fail to consider the element-wise interaction between cell lines and drugs.
Results: This work proposes a neighborhood interaction-based heterogeneous graph convolution network method, namely NIHGCN, for anticancer drug response prediction in an end-to-end way. First, it constructs a heterogeneous network consisting of drugs, cell lines and the known drug response information. Cell line gene expression and drug molecular fingerprints are linearly transformed and input as node attributes into an interaction module. The interaction module consists of a parallel graph convolution network (PGCN) layer and a neighborhood interaction (NI) layer, which aggregates node-level features from their neighbors through graph convolution operation and considers the element-level of interactions with their neighbors in the NI layer. Finally, the drug response predictions are made by calculating the linear correlation coefficients of feature representations of cell lines and drugs. We have conducted extensive experiments to assess the effectiveness of our model on the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) datasets. It has achieved the best performance compared with the state-of-the-art algorithms, especially in predicting drug responses for new cell lines, new drugs and targeted drugs. Furthermore, our model that was well trained on the GDSC dataset can be successfully applied to predict samples of PDX and TCGA, which verified the transferability of our model from cell line in vitro to the datasets in vivo.
Availability: The source code can be obtained from https://github.com/weiba/NIHGCN.
Supplementary information: Supplementary data are available at Bioinformatics online.
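The final scoring step, a linear correlation coefficient between a cell line's and a drug's feature representations, can be sketched as a plain Pearson correlation. This is an assumed reading of the abstract rather than the NIHGCN code; the function name `pearson` and the toy embedding vectors are made up for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Hypothetical learned embeddings for one cell line and two drugs.
cell_line = [0.2, 0.5, 0.9, 0.1]
drug_similar = [0.4, 1.0, 1.8, 0.2]      # proportional to cell_line
drug_opposite = [-v for v in cell_line]  # sign-flipped copy

score_similar = pearson(cell_line, drug_similar)
score_opposite = pearson(cell_line, drug_opposite)
```

Because the coefficient is bounded between -1 and 1 and ignores scale, it compares the shape of the two representations rather than their magnitude.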
Categories: Bioinformatics Trends

YAMACS: a graphical interface for GROMACS

Tue, 23/08/2022 - 5:30am
Abstract
Summary: A graphical user interface for the GROMACS program has been developed as a set of plugins for the YASARA molecular graphics suite. The most significant GROMACS methods can be run entirely via a windowed menu system, and the results are shown on screen in real time.
Availability and implementation: YAMACS is written in Python, is freely available for download at https://github.com/YAMACS-SML/YAMACS, and is supported on Linux. It has been released under the GPL-3.0 license.
Supplementary information: The YAMACS User Manual is available at https://github.com/YAMACS-SML/YAMACS
Categories: Bioinformatics Trends

MMGraph: a multiple motif predictor based on graph neural network and coexisting probability for ATAC-seq data

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Transcription factor binding sites (TFBSs) prediction is a crucial step in revealing functions of transcription factors (TFs) from high-throughput sequencing data. Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) provides insight into TFBSs and nucleosome positioning by probing open chromatin, and can simultaneously reveal multiple TFBSs compared to traditional technologies. The existing tools based on convolutional neural networks (CNNs) only find TFBSs of fixed length from ATAC-seq data. A graph neural network (GNN) can be considered an extension of a CNN, and has great potential in finding multiple TFBSs with different lengths from ATAC-seq data.
Results: We develop a motif predictor called MMGraph based on a three-layer GNN and the coexisting probability of k-mers for finding multiple motifs from ATAC-seq data. The results of an experiment conducted on 88 ATAC-seq datasets indicate that MMGraph achieved the best performance, with an area of eight metrics radar (AEMR) score of 2.31, and could find 207 higher quality multiple motifs than other existing tools.
Availability: MMGraph is provided as a Python package, which is available at https://github.com/zhangsq06/MMGraph.git
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

NetTIME: a Multitask and Base-pair Resolution Framework for Improved Transcription Factor Binding Site Prediction

Tue, 23/08/2022 - 5:30am
Abstract
Motivation: Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here we propose NetTIME, a multitask learning framework for predicting cell-type-specific transcription factor binding sites with base-pair resolution.
Results: We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method’s predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings.
Availability: NetTIME is freely available at https://github.com/ryi06/NetTIME
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

HNOXPred: a web tool for the prediction of gas sensing H-NOX proteins from amino acid sequence

Mon, 22/08/2022 - 5:30am
Abstract
Summary: HNOXPred is a webserver for the prediction of gas sensing H-NOX proteins from amino acid sequence. Heme-Nitric oxide/Oxygen (H-NOX) proteins are gas sensing hemoproteins found in diverse organisms ranging from bacteria to eukaryotes. Recently, gas sensing complex multi-functional proteins containing only the conserved amino acids at the heme centers of H-NOX proteins have been identified through a motif-based approach. Based on experimental data and H-NOX candidates reported in the literature, HNOXPred was created to automate and facilitate the identification of similar H-NOX centers across systems. The server reports HNOXSCORES, scaled from 0 to 1, whose calculation considers the physicochemical properties of the amino acids constituting the heme center in H-NOX in addition to the conserved amino acids within the center. From a user-supplied amino acid sequence, the server returns positive hits and their calculated HNOXSCORES, ordered from high to low confidence and accompanied by interpretation guides and recommendations. The utility of this server is demonstrated using the human proteome as an example.
Availability and implementation: The HNOXPred server is available at https://www.hnoxpred.com.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

The FASTQ+ format and PISA

Mon, 22/08/2022 - 5:30am
Abstract
Summary: The FASTQ+ format is designed for single-cell experiments. It extends various optional tags, including cell barcodes and unique molecular identifiers, to the sequence identifier, and is fully compatible with the FASTQ format. In addition, PISA implements various utilities for processing sequences in the FASTQ format and alignments in the SAM/BAM/CRAM format from single-cell experiments, such as converting FASTQ format to FASTQ+, annotating alignments, PCR deduplication, feature counting, and barcode correction. The software is open-source and written in C.
Availability: https://doi.org/10.5281/zenodo.6787430 or https://github.com/shiquan/PISA
Supplementary information: Supplementary data are available at Bioinformatics online.
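The idea of carrying optional tags inside the sequence identifier can be sketched as follows. The abstract does not give the exact syntax, so everything specific here is an assumption: the "|||" delimiter, the SAM-style TAG:TYPE:VALUE layout, the tag names CB and UR, and the function name `parse_read_name`. Consult the PISA documentation for the actual format.

```python
def parse_read_name(header, delimiter="|||"):
    """Split a FASTQ header into the read name and a dict of optional tags.

    Assumed layout (hypothetical): @name|||TAG:TYPE:VALUE|||TAG:TYPE:VALUE
    """
    name, *tag_fields = header.lstrip("@").split(delimiter)
    tags = {}
    for field in tag_fields:
        # Split at most twice so values that contain ':' stay intact.
        key, _value_type, value = field.split(":", 2)
        tags[key] = value
    return name, tags

# Hypothetical header carrying a cell barcode and a molecular identifier.
name, tags = parse_read_name("@read1|||CB:Z:ACGTACGT|||UR:Z:TTGCA")
plain_name, plain_tags = parse_read_name("@read2")
```

Because the tags live in the identifier line, a file written this way still has the four-line record structure that ordinary FASTQ readers expect, which matches the compatibility claim in the abstract.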
Categories: Bioinformatics Trends

scWMC: Weighted Matrix Completion-based Imputation of scRNA-seq Data via Prior Subspace Information

Fri, 19/08/2022 - 5:30am
Abstract
Motivation: Single-cell RNA sequencing (scRNA-seq) can provide insight into gene expression patterns at the resolution of individual cells, which offers new opportunities to study the behavior of different cell types. However, it is often plagued by dropout events, a phenomenon where the expression value of a gene tends to be measured as zero in the expression matrix due to various technical defects.
Results: In this paper, we argue that borrowing gene and cell information across column and row subspaces directly results in suboptimal solutions due to the noise contamination in imputing dropout values. Thus, to impute the dropout events in scRNA-seq data more precisely, we develop a regularization for leveraging that imperfect prior information to estimate the true underlying prior subspace and then embed it in a typical low-rank matrix completion-based framework, named scWMC. To evaluate the performance of the proposed method, we conduct comprehensive experiments on simulated and real scRNA-seq data. Extensive data analysis, including simulated analysis, cell clustering, differential expression analysis, functional genomic analysis, cell trajectory inference and scalability analysis, demonstrates that our method produces improved imputation results compared to competing methods, which benefits subsequent downstream analysis.
Availability: The source code is available at https://github.com/XuYuanchi/scWMC and test data is available at https://doi.org/10.5281/zenodo.6832477.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Beacon V2 Reference Implementation: a Toolkit to enable federated sharing of genomic and phenotypic data

Thu, 18/08/2022 - 5:30am
Abstract
Summary: Beacon v2 is an API specification established by the Global Alliance for Genomics and Health initiative (GA4GH) that defines a standard for federated discovery of genomic and phenotypic data. Here we present the Beacon v2 Reference Implementation (B2RI), a set of open-source software tools that allow lighting up a local Beacon instance “out-of-the-box”. Along with the software, we have created detailed “Read the Docs” documentation that includes information on deployment and installation.
Availability: The B2RI is released under GNU General Public License v3.0 and Apache License v2.0. Documentation and source code are available at: https://b2ri-documentation.readthedocs.io
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
