Improving Biomedical Named Entity Recognition by Dynamic Caching Inter-Sentence Information

Bioinformatics Oxford Journals - Mon, 27/06/2022 - 5:30am
Abstract
Motivation: Biomedical Named Entity Recognition (BioNER) aims to identify biomedical domain-specific entities (e.g., gene, chemical, and disease) from unstructured texts. Although deep learning-based methods for BioNER achieve satisfactory results, there is still much room for improvement. First, most existing methods use independent sentences as training units and ignore inter-sentence context, which usually leads to the labeling inconsistency problem. Second, previous document-level BioNER works have shown that inter-sentence information is essential, but what information should be regarded as context remains ambiguous. Moreover, few pre-training-based BioNER models have introduced inter-sentence information. Hence, we propose a cache-based inter-sentence model called BioNER-Cache to alleviate the aforementioned problems.
Results: We propose a simple but effective dynamic caching module to capture inter-sentence information for BioNER. Specifically, the cache stores recent hidden representations constrained by predefined caching rules. The model uses a query-and-read mechanism to retrieve similar historical records from the cache as the local context. An attention-based gated network is then adopted to generate context-related features with BioBERT. To dynamically update the cache, we design a scoring function and implement a multi-task approach to jointly train our model. We build a comprehensive benchmark on four biomedical datasets to evaluate the model performance fairly. Finally, extensive experiments clearly validate the superiority of our proposed BioNER-Cache compared with various state-of-the-art intra-sentence and inter-sentence baselines.
Availability: Code will be available at https://github.com/zgzjdx/BioNER-Cache.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
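
The query-and-read step described in the BioNER-Cache abstract can be illustrated with a minimal sketch. The abstract does not specify the caching rules, the scoring function, or the attention-based gating, so everything here is an assumption made for illustration: the class name RecentCache, first-in-first-out eviction, cosine similarity as the retrieval score, and the toy two-dimensional vectors standing in for hidden representations.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


class RecentCache:
    """Hypothetical bounded cache of recent (key, vector) records.

    Assumption: the oldest record is evicted first. The paper's own
    caching rules are not given in the abstract.
    """

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)

    def write(self, key, vector):
        # Appending to a full deque silently drops the oldest record.
        self.items.append((key, vector))

    def read(self, query, top_k=1):
        # Rank cached records by similarity to the query vector.
        scored = [(cosine(query, vec), key) for key, vec in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [key for _, key in scored[:top_k]]


cache = RecentCache(capacity=2)
cache.write("sent1:BRCA1", [1.0, 0.0])
cache.write("sent2:aspirin", [0.0, 1.0])
cache.write("sent3:BRCA2", [0.9, 0.1])  # evicts the sent1 record
nearest = cache.read([1.0, 0.0], top_k=1)
print(nearest)
```

In a real model the retrieved records would feed a learned gating network rather than being returned as keys; the sketch only shows the retrieval idea.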

TIMSCONVERT: A workflow to convert trapped ion mobility data to open data formats

Bioinformatics Oxford Journals - Mon, 27/06/2022 - 5:30am
Abstract
Motivation: Advances in mass spectrometry have led to the development of mass spectrometers with ion mobility spectrometry (IMS) capabilities and dual source instrumentation; however, the current software ecosystem lacks interoperability with downstream data analysis using open-source software and pipelines.
Results: Here, we present TIMSCONVERT, a high-throughput data conversion workflow from timsTOF Pro/fleX mass spectrometer raw data files to mzML and imzML formats that incorporates ion mobility data while maintaining compatibility with data analysis tools. We showcase several examples using data acquired across different experiments and acquisition modalities on the timsTOF fleX MS.
Availability: TIMSCONVERT and its documentation can be found at https://github.com/gtluu/timsconvert. It is available as a standalone command line interface tool for Windows and Linux, as a NextFlow workflow, and online in the Global Natural Products Social (GNPS) platform.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Predicting RNA distance-based contact maps by integrated deep learning on physics-inferred secondary structure and evolutionary-derived mutational coupling

Bioinformatics Oxford Journals - Sat, 25/06/2022 - 5:30am
Abstract
Motivation: Recently, AlphaFold2 achieved high experimental accuracy for the majority of proteins in Critical Assessment of Structure Prediction (CASP 14). This raises the hope that one day we may achieve the same feat for structured RNAs, whose structure prediction is as fundamentally and practically important as protein structure prediction. One major factor in the recent advancement of protein structure prediction is the highly accurate prediction of distance-based contact maps of proteins.
Results: Here we showed that by integrating deep learning with physics-inferred secondary structures, co-evolutionary information, and multiple sequence-alignment sampling, we can achieve RNA contact-map prediction at a level of accuracy similar to that in protein contact-map prediction. More importantly, highly accurate prediction for top L long-range contacts can be assured for those RNAs with a high effective number of homologous sequences (Neff > 50). The initial use of the predicted contact map as distance-based restraints confirmed its usefulness in 3D structure prediction.
Availability: SPOT-RNA-2D is available as a web server at https://sparks-lab.org/server/spot-rna-2d/ and as a standalone program at https://github.com/jaswindersingh2/SPOT-RNA-2D.
Categories: Bioinformatics Trends
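
For readers unfamiliar with the term, a distance-based contact map marks pairs of residues (or nucleotides) whose 3D coordinates lie within a cutoff distance. The sketch below shows only that definition, not the SPOT-RNA-2D predictor. The cutoff, the minimum sequence separation for a "long-range" contact, and the toy coordinates are all assumptions chosen for illustration.

```python
import math


def contact_map(coords, threshold):
    """Binary contact map: 1 where two distinct positions are closer than threshold."""
    n = len(coords)
    return [
        [
            1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


def long_range_contacts(cmap, min_separation):
    """Contacts (i, j) with j - i >= min_separation along the sequence."""
    n = len(cmap)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + min_separation, n)
        if cmap[i][j]
    ]


# Toy hairpin-like arrangement: positions 0 and 3 are far apart in
# sequence but close in space.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 0.0)]
cmap = contact_map(coords, threshold=6.0)       # assumed cutoff
long_range = long_range_contacts(cmap, min_separation=3)  # assumed separation
print(long_range)
```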

ClearCNV: CNV calling from NGS panel data in the presence of ambiguity and noise

Bioinformatics Oxford Journals - Sat, 25/06/2022 - 5:30am
Abstract
Motivation: While the identification of small variants in panel sequencing data can be considered a solved problem, the identification of larger, multi-exon copy number variants (CNVs) still poses a considerable challenge. Thus, CNV calling has not been established in all laboratories performing panel sequencing. At the same time, such laboratories have accumulated large data sets and thus need to identify copy number variants in their data to close the diagnostic gap.
Results: In this manuscript we present our method clearCNV, which addresses this need in two ways. First, it helps laboratories to properly assign data sets to enrichment kits. Second, based on homogeneous subsets of data, clearCNV identifies CNVs affecting the targeted regions. Using real-world data sets and validation, we show that our method is highly competitive with previous methods and preferable in terms of specificity.
Availability: The software is available for free under a permissive license at https://github.com/bihealth/clear-cnv
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CancerCellTracker: A Brightfield Time-lapse Microscopy Framework for Cancer Drug Sensitivity Estimation

Bioinformatics Oxford Journals - Sat, 25/06/2022 - 5:30am
Abstract
Motivation: Time-lapse microscopy is a powerful technique that relies on images of live cells cultured ex vivo that are captured at regular intervals of time to describe and quantify their behavior under certain experimental conditions. This imaging method has great potential in advancing the field of precision oncology by quantifying the response of cancer cells to various therapies and identifying the most efficacious treatment for a given patient. Digital image processing algorithms developed so far require high-resolution images involving very few cells originating from homogeneous cell line populations. We propose a novel framework that tracks cancer cells to capture their behavior and quantify cell viability to inform clinical decisions in a high-throughput manner.
Results: The brightfield microscopy images a large number of patient-derived cells in an ex vivo reconstruction of the tumor microenvironment treated with 31 drugs for up to six days. We developed a robust and user-friendly pipeline, CancerCellTracker, that detects cells in co-culture, tracks these cells across time, and identifies cell death events using changes in cell attributes. We validated our computational pipeline by comparing the timing of cell death estimates by CancerCellTracker from brightfield images and a fluorescent channel featuring ethidium homodimer. We benchmarked our results using a state-of-the-art algorithm implemented in ImageJ and previously published in the literature. We highlighted CancerCellTracker’s efficiency in estimating the percentage of live cells in the presence of bone marrow stromal cells.
Availability and implementation: https://github.com/compbiolabucf/CancerCellTracker
Categories: Bioinformatics Trends

Variational Bayes for high-dimensional proportional hazards models with applications within gene expression

Bioinformatics Oxford Journals - Sat, 25/06/2022 - 5:30am
Abstract
Motivation: Few Bayesian methods for analyzing high-dimensional sparse survival data provide scalable variable selection, effect estimation and uncertainty quantification. Such methods often either sacrifice uncertainty quantification by computing maximum a posteriori estimates, or quantify the uncertainty at high (unscalable) computational expense.
Results: We bridge this gap and develop an interpretable and scalable Bayesian proportional hazards model for prediction and variable selection, referred to as SVB. Our method, based on a mean-field variational approximation, overcomes the high computational cost of MCMC whilst retaining useful features, providing a posterior distribution for the parameters and offering a natural mechanism for variable selection via posterior inclusion probabilities. The performance of our proposed method is assessed via extensive simulations and compared against other state-of-the-art Bayesian variable selection methods, demonstrating comparable or better performance. Finally, we demonstrate how the proposed method can be used for variable selection on two transcriptomic datasets with censored survival outcomes, and how the uncertainty quantification offered by our method can be used to provide an interpretable assessment of patient risk.
Availability and implementation: Our method has been implemented as a freely available R package survival.svb (https://github.com/mkomod/survival.svb).
Supplementary information: Supplementary materials are available at Bioinformatics online.
Categories: Bioinformatics Trends

ggtranscript: an R package for the visualization and interpretation of transcript isoforms using ggplot2

Bioinformatics Oxford Journals - Sat, 25/06/2022 - 5:30am
Abstract
Motivation: The advent of long-read sequencing technologies has increased demand for the visualization and interpretation of transcripts. However, tools that perform such visualizations remain inflexible and lack the ability to easily identify differences between transcript structures. Here, we introduce ggtranscript, an R package that provides a fast and flexible method to visualize and compare transcripts. As a ggplot2 extension, ggtranscript inherits the functionality and familiarity of ggplot2, making it easy to use.
Availability: ggtranscript is an R package available at https://github.com/dzhang32/ggtranscript (DOI: https://doi.org/10.5281/zenodo.6374061) via an open-source MIT license. Further documentation is available at https://dzhang32.github.io/ggtranscript/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Deep learning models for RNA secondary structure prediction (probably) do not generalise across families

Bioinformatics Oxford Journals - Fri, 24/06/2022 - 5:30am
Abstract
Motivation: The secondary structure of RNA is of importance to its function. Over the last few years, several papers attempted to use machine learning to improve de novo RNA secondary structure prediction. Many of these papers report impressive results for intra-family predictions, but seldom address the much more difficult (and practical) inter-family problem.
Results: We demonstrate that it is nearly trivial with convolutional neural networks to generate pseudo-free energy changes, modeled after structure mapping data, that improve the accuracy of structure prediction for intra-family cases. We propose a more rigorous method for inter-family cross-validation that can be used to assess the performance of learning-based models. Using this method, we further demonstrate that intra-family performance is insufficient proof of generalisation despite the widespread assumption in the literature, and provide strong evidence that many existing learning-based models have not generalised inter-family.
Availability: Source code and data are available at https://github.com/marcellszi/dl-rna.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MIAMI: Mutual Information-based Analysis of Multiplex Imaging data

Bioinformatics Oxford Journals - Fri, 24/06/2022 - 5:30am
Abstract
Motivation: Studying the interaction or co-expression of the proteins or markers in the tumor microenvironment (TME) of cancer subjects can be crucial in the assessment of risks, such as death or recurrence. In the conventional approach, the cells need to be declared positive or negative for a marker based on its intensity. For multiple markers, manual thresholds are required for each marker, which can become cumbersome. The performance of the subsequent analysis relies heavily on this step and thus suffers from subjectivity and lacks robustness.
Results: We present a new method where different marker intensities are viewed as dependent random variables, and the mutual information (MI) between them is considered to be a metric of co-expression. Estimation of the joint density, as required in the traditional form of MI, becomes increasingly challenging as the number of markers increases. We consider an alternative formulation of MI which is conceptually similar but has an efficient estimation technique for which we develop a new generalization. With the proposed method, we analyzed a lung cancer dataset, finding the co-expression of the markers HLA-DR and CK to be associated with survival. We also analyzed a triple negative breast cancer dataset, finding the co-expression of the immuno-regulatory proteins PD1, PD-L1, Lag3 and IDO to be associated with disease recurrence. We demonstrated the robustness of our method through different simulation studies.
Availability: The associated R package can be found at https://github.com/sealx017/MIAMI.
Supplementary information: The Supplementary Material is attached.
Categories: Bioinformatics Trends
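
To make the idea of mutual information as a co-expression metric concrete, here is a minimal sketch of the traditional plug-in estimate for two markers, computed from a binned joint distribution. This is the joint-density form that the MIAMI abstract says becomes hard to estimate as markers are added; it is not the alternative formulation the authors generalize, which the abstract does not spell out. The equal-width binning, the bin count, and the toy intensities are assumptions.

```python
import math
from collections import Counter


def bin_index(value, low, high, n_bins):
    """Equal-width bin index in [0, n_bins - 1]; constant data goes to bin 0."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)


def mutual_information(x, y, n_bins=2):
    """Plug-in MI estimate (in nats) between two binned intensity vectors."""
    bx = [bin_index(v, min(x), max(x), n_bins) for v in x]
    by = [bin_index(v, min(y), max(y), n_bins) for v in y]
    n = len(x)
    joint = Counter(zip(bx, by))
    px = Counter(bx)
    py = Counter(by)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi


# Two markers that rise and fall together across four cells.
mi_dependent = mutual_information([0.1, 0.2, 0.9, 1.0], [0.2, 0.1, 1.0, 0.8])
# Two markers whose binned values are independent.
mi_independent = mutual_information([0.1, 0.1, 1.0, 1.0], [0.1, 1.0, 0.1, 1.0])
print(mi_dependent, mi_independent)
```

With more than two markers the joint histogram grows exponentially in the number of markers, which is the estimation difficulty the abstract refers to.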

XSI - A genotype compression tool for compressive genomics in large biobanks

Bioinformatics Oxford Journals - Fri, 24/06/2022 - 5:30am
Abstract
Motivation: Generation of genotype data has been growing exponentially over the last decade. With the large size of recent datasets comes a storage and computational burden with ever-increasing costs. To reduce this burden we propose XSI, a file format with reduced storage footprint that also allows computation on the compressed data, and we show how this can improve future analyses.
Results: We show that XSI allows for a file size reduction of 4-20x compared to compressed BCF and demonstrate its potential for “compressive genomics” on the UK Biobank whole genome sequencing genotypes with 8x faster loading times, 5x faster run of homozygosity computation, 30x faster dot products computation, and 280x faster allele counts.
Availability: The xSqueezeIt file format (XSI) specifications, API, and command line tool are released under an open-source (MIT) license and are available at https://github.com/rwk-unil/xSqueezeIt
Supplementary information: Supplementary materials are available at Bioinformatics online.
Categories: Bioinformatics Trends

EasyGDB, a low-maintenance and highly customizable system to develop genomics portals

Bioinformatics Oxford Journals - Fri, 24/06/2022 - 5:30am
Abstract
Summary: EasyGDB is an easy-to-implement, low-maintenance tool developed to create genomic data management web platforms. It can be used for any species, group of species, or multiple genome or annotation versions. EasyGDB provides a framework to develop a web portal that includes the general information about species, projects and members, and bioinformatics tools such as file downloads, BLAST, genome browser, annotation search, gene expression visualization, annotation and sequence download, and gene ID and ortholog lookup. The code of EasyGDB facilitates data maintenance and update for non-experienced bioinformaticians, using BLAST databases to store and retrieve sequence data in gene annotation pages and bioinformatics tools, and JSON files to customize metadata. EasyGDB is a highly customizable tool. Any section and tool can be enabled or disabled like a switch through a single configuration file. This tool aims to simplify the development of genomics portals in non-model species, providing a modern web style with embedded interactive bioinformatics tools to cover all the common needs derived from genomics projects.
Availability and Implementation: https://github.com/noefp/easy_gdb.
Categories: Bioinformatics Trends

The Practical Haplotype Graph, a platform for storing and using pangenomes for imputation

Bioinformatics Oxford Journals - Fri, 24/06/2022 - 5:30am
Abstract
Motivation: Pangenomes provide novel insights for population and quantitative genetics, genomics, and breeding not available from studying a single reference genome. Instead, a species is better represented by a pangenome or collection of genomes. Unfortunately, managing and using pangenomes for genomically diverse species is computationally and practically challenging. We developed a trellis graph representation anchored to the reference genome that represents most pangenomes well and can be used to impute complete genomes from low density sequence or variant data.
Results: The Practical Haplotype Graph (PHG) is a pangenome pipeline, database (PostgreSQL & SQLite), data model (Java, Kotlin, or R), and Breeding API (BrAPI) web service. The PHG has already been able to accurately represent diversity in four major crops including maize, one of the most genomically diverse species, with up to 1000-fold data compression. Using simulated data, we show that, at even 0.1X coverage, with appropriate reads and sequence alignment, imputation results in extremely accurate haplotype reconstruction. The PHG is a platform and environment for the understanding and application of genomic diversity.
Availability: All resources listed here are freely available. The PHG Docker image used to generate the simulation results is available at https://hub.docker.com/ as maizegenetics/phg:0.0.27. PHG source code is at https://bitbucket.org/bucklerlab/practicalhaplotypegraph/src/master/. The code used for the analysis of simulated data is at https://bitbucket.org/bucklerlab/phg-manuscript/src/master/. The PHG database of NAM parent haplotypes is in the CyVerse data store (https://de.cyverse.org/de/) and named /iplant/home/shared/panzea/panGenome/PHG_db_maize/phg_v5Assemblies_20200608.db.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Batch alignment via retention orders for preprocessing large-scale multi-batch LC-MS experiments

Bioinformatics Oxford Journals - Fri, 24/06/2022 - 5:30am
Abstract
Motivation: Meticulous selection of chromatographic peak detection parameters and algorithms is a crucial step in preprocessing LC-MS data. However, as mass-to-charge ratio (m/z) and retention time shifts are larger between batches than within batches, finding apt parameters for all samples of a large-scale multi-batch experiment with the aim of minimizing information loss becomes a challenging task. Preprocessing independent batches individually can curtail said problems but requires a method for aligning and combining them for further downstream analysis.
Results: We present two methods for aligning and combining individually preprocessed batches in multi-batch LC-MS experiments. Our developed methods were tested on six simulated and six real datasets. Furthermore, by estimating the probabilities of peak insertion, deletion, and swap between batches in authentic datasets, we demonstrate that retention order swaps are not rare in untargeted LC-MS data.
Availability: The kmersAlignment and rtcorrectedAlignment algorithms are made available as an R package with raw data at https://metabocombiner.img.cas.cz
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Functional Characterization of Co-Phosphorylation Networks

Bioinformatics Oxford Journals - Wed, 22/06/2022 - 5:30am
Abstract
Motivation: Protein phosphorylation is a ubiquitous regulatory mechanism that plays a central role in cellular signaling. According to recent estimates, up to 70% of human proteins can be phosphorylated. Therefore, characterization of phosphorylation dynamics is critical for understanding a broad range of biological and biochemical processes. Technologies based on mass spectrometry are rapidly advancing to meet the needs for high-throughput screening of phosphorylation. These technologies enable untargeted quantification of thousands of phosphorylation sites in a given sample. Many labs are already utilizing these technologies to comprehensively characterize signaling landscapes by examining perturbations with drugs and knockdown approaches, or by assessing diverse phenotypes in cancers, neurodegenerative diseases, infectious diseases, and normal development.
Results: We comprehensively investigate the concept of “co-phosphorylation”, defined as the correlated phosphorylation of a pair of phosphosites across various biological states. We integrate nine publicly available phosphoproteomics datasets for various diseases (including breast cancer, ovarian cancer and Alzheimer’s disease) and utilize functional data related to sequence, evolutionary histories, kinase annotations, and pathway annotations to investigate the functional relevance of co-phosphorylation. Our results across a broad range of studies consistently show that functionally associated sites tend to exhibit significant positive or negative co-phosphorylation. Specifically, we show that co-phosphorylation can be used to predict with high precision the sites that are on the same pathway or that are targeted by the same kinase. Overall, these results establish co-phosphorylation as a useful resource for analyzing phosphoproteins in a network context, which can help extend our knowledge on cellular signaling and its dysregulation.
Categories: Bioinformatics Trends
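
The definition of co-phosphorylation given in the abstract (correlated phosphorylation of a pair of phosphosites across biological states) can be sketched as a pairwise correlation over site profiles. The abstract does not say which correlation measure the authors use, so Pearson correlation is an assumption here, and the site names and intensities are invented toy values.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical phosphosite intensities measured across four biological states.
profiles = {
    "siteA": [1.0, 2.0, 3.0, 4.0],
    "siteB": [2.0, 4.0, 6.0, 8.0],
    "siteC": [4.0, 3.0, 2.0, 1.0],
}

# One co-phosphorylation score per unordered pair of sites.
cophos = {
    (s, t): pearson(profiles[s], profiles[t])
    for s, t in combinations(sorted(profiles), 2)
}
for pair, score in cophos.items():
    print(pair, round(score, 3))
```

Both strongly positive and strongly negative scores count as co-phosphorylation in the abstract's sense, so a downstream filter would typically look at the magnitude of the score.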

Identifying interactions in omics data for clinical biomarker discovery using symbolic regression

Bioinformatics Oxford Journals - Wed, 22/06/2022 - 5:30am
Abstract
Motivation: The identification of predictive biomarker signatures from omics and multi-omics data for clinical applications is an active area of research. Recent developments in assay technologies and machine learning (ML) methods have led to significant improvements in predictive performance. However, most high-performing ML methods suffer from complex architectures and lack interpretability.
Results: We present the application of a novel symbolic-regression-based algorithm, the QLattice, on a selection of clinical omics datasets. This approach generates parsimonious high-performing models that can both predict disease outcomes and reveal putative disease mechanisms, demonstrating the importance of selecting maximally relevant and minimally redundant features in omics-based machine-learning applications. The simplicity and high predictive power of these biomarker signatures make them attractive tools for high-stakes applications in areas such as primary care, clinical decision making and patient stratification.
Availability: The QLattice is available as part of a Python package (feyn), which is available at the Python Package Index (https://pypi.org/project/feyn/) and can be installed via pip. The documentation provides guides, tutorials, and the API reference (https://docs.abzu.ai/). All code and data used to generate the models and plots discussed in this work can be found at https://github.com/abzu-ai/QLattice-clinical-omics.
Supplementary information: Supplementary material is available at Bioinformatics online.
Categories: Bioinformatics Trends

Figbird: A probabilistic method for filling gaps in genome assemblies

Bioinformatics Oxford Journals - Wed, 22/06/2022 - 5:30am
Abstract
Motivation: Advances in sequencing technologies have led to the sequencing of genomes of a multitude of organisms. However, draft genomes of many of these organisms contain a large number of gaps due to the repeats in genomes, low sequencing coverage and limitations in sequencing technologies. Although several tools exist for filling gaps, many of these do not utilize all information relevant to gap filling.
Results: Here, we present a probabilistic method for filling gaps in draft genome assemblies using second generation reads based on a generative model for sequencing that takes into account information on insert sizes and sequencing errors. Our method is based on the expectation-maximization (EM) algorithm, unlike the graph-based methods adopted in the literature. Experiments on real biological datasets show that this novel approach can fill up large portions of gaps with a small number of errors and misassemblies compared to other state-of-the-art gap filling tools.
Availability and Implementation: The method is implemented using C++ in a software named “Filling Gaps by Iterative Read Distribution (Figbird)”, which is available at: https://github.com/SumitTarafder/Figbird.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

riboCleaner: a pipeline to identify and quantify rRNA read contamination from RNA-seq data in plants

Bioinformatics Oxford Journals - Wed, 22/06/2022 - 5:30am
Abstract
Motivation: Analysis of gene expression data can be crucial for elucidating biological relationships within living organisms. However, accurate quantification of gene expression relies directly upon the accuracy of the reference genome or transcriptome to which the expression data is mapped. Errors in gene annotation can lead to errors in quantification of gene expression. One source of gene annotation error in eukaryotes arises from incorrect predictions of mRNA gene models within ribosomal DNA (rDNA) regions.
Results: Here, we provide examples of how the presence of false gene models in rDNA regions can result in a handful of genes appearing to contribute to >50% of the total transcripts per million (TPM) values of entire RNA-seq datasets. To address this, we have created riboCleaner, a bioinformatics pipeline designed to identify misannotated gene models in rDNA regions and quantify rRNA-derived reads in RNA-seq data. We also show the applicability of riboCleaner in several plant genome assemblies.
Availability: We have implemented riboCleaner as a containerized Snakemake workflow. The workflow, instructions for building the container, and other documentation are available at https://github.com/basf. For convenience, a prebuilt Docker image containing riboCleaner is available at https://hub.docker.com/u/basfcontainers.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
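
The effect described in the riboCleaner abstract, where a few false gene models absorb most of the TPM, follows directly from how TPM is normalized. The sketch below uses the standard TPM definition (length-normalized counts rescaled to sum to one million) on invented numbers. The gene names, counts, and the assumption that the misannotated model is already flagged are hypothetical; how riboCleaner actually identifies such models is not described in the abstract.

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and transcript lengths."""
    rates = {gene: counts[gene] / lengths[gene] for gene in counts}
    total = sum(rates.values())
    return {gene: rate / total * 1e6 for gene, rate in rates.items()}


# Hypothetical dataset: one false gene model inside an rDNA region
# attracts most of the reads.
counts = {"rDNA_model_1": 9000, "geneA": 500, "geneB": 500}
lengths = {"rDNA_model_1": 1000, "geneA": 1000, "geneB": 1000}
flagged = {"rDNA_model_1"}  # assumed to be already identified as misannotated

before = tpm(counts, lengths)
rdna_share = sum(before[g] for g in flagged) / 1e6

# Recompute TPM after dropping the flagged model.
kept_counts = {g: c for g, c in counts.items() if g not in flagged}
after = tpm(kept_counts, lengths)

print(round(rdna_share, 3), before["geneA"], after["geneA"])
```

Because TPM values always sum to one million, removing the flagged model raises the TPM of every remaining gene, which is why such false models distort comparisons across datasets.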

matOptimize: A parallel tree optimization method enables online phylogenetics for SARS-CoV-2

Bioinformatics Oxford Journals - Wed, 22/06/2022 - 5:30am
Abstract
Motivation: Phylogenetic tree optimization is necessary for precise analysis of evolutionary and transmission dynamics, but existing tools are inadequate for handling the scale and pace of data produced during the COVID-19 pandemic. One transformative approach, online phylogenetics, aims to incrementally add samples to an ever-growing phylogeny, but there are no previously existing approaches that can efficiently optimize this vast phylogeny under the time constraints of the pandemic.
Results: Here, we present matOptimize, a fast and memory-efficient phylogenetic tree optimization tool based on parsimony that can be parallelized across multiple CPU threads and nodes, and provides orders of magnitude improvement in runtime and peak memory usage compared to existing state-of-the-art methods. We have developed this method particularly to address the pressing need during the COVID-19 pandemic for daily maintenance and optimization of a comprehensive SARS-CoV-2 phylogeny. matOptimize is currently helping refine on a daily basis possibly the largest-ever phylogenetic tree, containing millions of SARS-CoV-2 sequences.
Availability: The matOptimize code is freely available as part of the UShER package (https://github.com/yatisht/usher) and can also be installed via bioconda (https://bioconda.github.io/recipes/usher/README.html). All scripts we used to perform the experiments in this manuscript are available at https://github.com/yceh/matOptimize-experiments.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
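
As background for what "based on parsimony" means, the sketch below scores a fixed tree with the classic Fitch small-parsimony algorithm at a single site and compares two candidate topologies. It is not matOptimize's method: the abstract does not describe how the tool searches for better trees, and this toy recursion makes no attempt at parallelism or memory efficiency. The nested-tuple tree encoding and the sample names are assumptions.

```python
def fitch(tree, leaf_states):
    """Return (possible_states, mutation_count) for a binary tree at one site.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra mutation needed.
        return common, left_cost + right_cost
    # Children disagree: take the union and count one mutation.
    return left_set | right_set, left_cost + right_cost + 1


# Hypothetical nucleotide at one site for four samples.
leaf_states = {"s1": "A", "s2": "A", "s3": "G", "s4": "G"}

tree_grouped = (("s1", "s2"), ("s3", "s4"))  # identical samples are siblings
tree_mixed = (("s1", "s3"), ("s2", "s4"))    # identical samples are split up

score_grouped = fitch(tree_grouped, leaf_states)[1]
score_mixed = fitch(tree_mixed, leaf_states)[1]
print(score_grouped, score_mixed)
```

A parsimony-based optimizer prefers the topology with the lower total score summed over all sites, here the grouped tree.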

PrISM: Precision for Integrative Structural Models

Bioinformatics Oxford Journals - Mon, 20/06/2022 - 5:30am
Abstract
Motivation: A single precision value is currently reported for an integrative model. However, precision may vary for different regions of an integrative model owing to varying amounts of input information.
Results: We develop PrISM (Precision for Integrative Structural Models) to efficiently identify high- and low-precision regions for integrative models.
Availability: PrISM is written in Python and available under the GNU General Public License v3.0 at https://github.com/isblab/prism; benchmark data used in this paper is available at doi:10.5281/zenodo.6241200.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

A rarefaction-without-resampling extension of PERMANOVA for testing presence-absence associations in the microbiome

Bioinformatics Oxford Journals - Mon, 20/06/2022 - 5:30am
Abstract
Motivation: PERMANOVA (McArdle and Anderson, 2001) is currently the most commonly used method for testing community-level hypotheses about microbiome associations with covariates of interest. PERMANOVA can test for associations that result from changes in which taxa are present or absent by using the Jaccard or unweighted UniFrac distance. However, such presence-absence analyses face a unique challenge: confounding by library size (total sample read count), which occurs when library size is associated with covariates in the analysis. It is known that rarefaction (subsampling to a common library size) controls this bias, but at the potential cost of information loss and the introduction of a stochastic component into the analysis.
Results: Here we develop a non-stochastic approach to PERMANOVA presence-absence analyses that aggregates information over all potential rarefaction replicates without actual resampling, when the Jaccard or unweighted UniFrac distance is used. We compare this new approach to three possible ways of aggregating PERMANOVA over multiple rarefactions obtained from resampling: averaging the distance matrix, averaging the (element-wise) squared distance matrix, and averaging the F-statistic. Our simulations indicate that our non-stochastic approach is robust to confounding by library size and outperforms each of the stochastic resampling approaches. We also show that, when overdispersion is low, averaging the (element-wise) squared distance outperforms averaging the unsquared distance, currently implemented in the R package vegan. We illustrate our methods using an analysis of data on inflammatory bowel disease (IBD) in which samples from case participants have systematically smaller library sizes than samples from control participants.
Availability and Implementation: We have implemented all the approaches described above, including the function for calculating the analytical average of the squared or unsquared distance matrix, in our R package LDM, which is available on GitHub at https://github.com/yijuanhu/LDM.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
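
One way to see how information can be aggregated over all rarefaction replicates without resampling is that, if rarefaction draws reads without replacement, the chance that a taxon survives rarefaction has a closed form (a hypergeometric zero-count probability). The sketch below computes only that per-taxon presence probability. It is an illustration under that sampling assumption, not the analytical distance-matrix average implemented in LDM, whose exact formula is not given in the abstract. The taxon names and counts are invented.

```python
from math import comb


def presence_probability(count, library_size, depth):
    """P(taxon observed at least once) after rarefying to `depth` reads.

    Assumes reads are drawn uniformly without replacement, so the number of
    reads drawn from the taxon is hypergeometric.
    """
    if depth > library_size:
        raise ValueError("rarefaction depth exceeds library size")
    # comb returns 0 when fewer than `depth` reads belong to other taxa,
    # in which case the taxon is certain to be observed.
    absent = comb(library_size - count, depth) / comb(library_size, depth)
    return 1.0 - absent


# Hypothetical sample with 10 reads, rarefied to 5 reads.
sample = {"taxonA": 8, "taxonB": 2, "taxonC": 0}
library_size = sum(sample.values())
probs = {t: presence_probability(c, library_size, 5) for t, c in sample.items()}
print(probs)
```

A rare taxon such as taxonB is present in the full sample but has a probability below one of appearing in any single rarefied replicate, which is the stochastic component that the analytical approach averages out.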
