
Correction to: plotsr: visualizing structural similarities and rearrangements between multiple genomes

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
This is a correction to: Manish Goel and Korbinian Schneeberger, plotsr: visualizing structural similarities and rearrangements between multiple genomes, Bioinformatics, Volume 38, Issue 10, 15 May 2022, https://doi.org/10.1093/bioinformatics/btac196
Categories: Bioinformatics Trends

Estimation of Speciation Times Under the Multispecies Coalescent

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: The multispecies coalescent model is now widely accepted as an effective model for incorporating variation in the evolutionary histories of individual genes into methods for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent can be computationally expensive for large data sets, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and associated parameters, including speciation times and effective population sizes.
Results: We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site pattern probabilities can be computed under the assumption of a constant θ throughout the species tree. We demonstrate that the MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the nonparametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset for gibbons.
Availability and implementation: The method has been implemented in the PAUP* program, freely available at https://paup.phylosolutions.com for Macintosh, Windows, and Linux operating systems.
Supplementary information: Supplementary data are available at Bioinformatics online.
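The nonparametric bootstrap mentioned in this abstract can be illustrated generically: resample the data with replacement, recompute the estimator on each resample, and take the variance of the replicates. The sketch below is not the PAUP* implementation. The function name `bootstrap_variance`, the toy observations, and the use of the sample mean as the estimator are all assumptions made for illustration; in the MAPCL setting the resampled units would presumably be alignment sites and the estimator a speciation time.

```python
import random
import statistics


def bootstrap_variance(data, estimator, n_replicates=1000, seed=0):
    """Estimate the variance of `estimator` by resampling `data` with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    replicates = []
    for _ in range(n_replicates):
        # Draw n observations with replacement from the original data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # Sample variance of the estimator across bootstrap replicates.
    return statistics.variance(replicates)


# Hypothetical toy data; the estimator here is simply the sample mean.
observations = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.1]
var_hat = bootstrap_variance(observations, statistics.mean)
```

The cost scales with the number of replicates times the cost of one estimate, which is why the choice of estimator dominates the running time in practice.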
Categories: Bioinformatics Trends

E-SNPs&GO: Embedding of protein sequence and function improves the annotation of human pathogenic variants

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: The advent of massive DNA sequencing technologies is producing a huge number of human single-nucleotide polymorphisms occurring in protein-coding regions and possibly changing their sequences. Discriminating harmful protein variations from neutral ones is one of the crucial challenges in precision medicine. Computational tools based on artificial intelligence provide models for protein sequence encoding, bypassing database searches for evolutionary information. We leverage the new encoding schemes for an efficient annotation of protein variants.
Results: E-SNPs&GO is a novel method that, given an input protein sequence and a single amino acid variation, can predict whether the variation is related to diseases or not. The proposed method adopts an input encoding completely based on protein language models and embedding techniques, specifically devised to encode protein sequences and GO functional annotations. We trained our model on a newly generated dataset of 101,146 human protein single amino acid variants in 13,661 proteins, derived from public resources. When tested on a blind set comprising 10,266 variants, our method compares well with recent approaches released in the literature for the same task, reaching a Matthews Correlation Coefficient (MCC) score of 0.72. We propose E-SNPs&GO as a suitable, efficient and accurate large-scale annotator of protein variant datasets.
Availability: The method is available as a webserver at https://esnpsandgo.biocomp.unibo.it. Datasets and predictions are available at https://esnpsandgo.biocomp.unibo.it/datasets.
Supplementary information: Supplementary data are available at Bioinformatics online.
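For readers unfamiliar with the Matthews Correlation Coefficient reported above, the following sketch computes it from the four cells of a binary confusion matrix using the standard definition. The function name and the example counts are hypothetical and are not taken from the E-SNPs&GO evaluation.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts; ranges from -1 to 1."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts, for illustration only.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
inverted = matthews_corrcoef(tp=0, tn=0, fp=50, fn=50)
```

Unlike plain accuracy, MCC uses all four cells, so it stays informative when pathogenic and neutral variants are present in unequal numbers.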
Categories: Bioinformatics Trends

Biomedical Evidence Engineering for Data-Driven Discovery

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: With the rapid development of precision medicine, a large amount of health data (such as electronic health records, gene sequencing and medical images) has been produced. This has encouraged growing interest in data-driven insight discovery from these data. A reasonable way to verify the derived insights is by checking evidence from the biomedical literature. However, manual verification is inefficient and not scalable. Therefore, an intelligent technique is necessary to solve this problem.
Results: This paper introduces a framework for biomedical evidence engineering that addresses this problem more effectively. The framework consists of a biomedical literature retrieval module and an evidence extraction module. The retrieval module ensembles several methods and achieves state-of-the-art performance in biomedical literature retrieval. A BERT-based evidence extraction model is proposed to extract evidence from literature in response to queries. Moreover, we create a dataset with 1 million examples of biomedical evidence, 10,000 of which are manually annotated.
Availability: Datasets are available at https://github.com/SendongZhao.
Categories: Bioinformatics Trends

Evaluation of efficiency prediction algorithms and development of ensemble model for CRISPR/Cas9 gRNA selection

Bioinformatics Oxford Journals - Thu, 13/10/2022 - 5:30am
Abstract
Motivation: The CRISPR/Cas9 system is widely used for genome editing. The editing efficiency of CRISPR/Cas9 is mainly determined by the guide RNA (gRNA). Although many computational algorithms have been developed in recent years, it is still a challenge to select optimal bioinformatics tools for gRNA design in different experimental settings.
Results: We performed a comprehensive comparison analysis of fifteen public algorithms for gRNA design, using fifteen experimental gRNA datasets. Based on this analysis, we identified the top-performing algorithms, with which we further implemented various computational strategies to build ensemble models for performance improvement. Validation analysis indicates that the new ensemble model had improved performance over any individual algorithm alone at predicting gRNA efficacy under various experimental conditions.
Availability: The new sgRNA design tool is freely accessible as a web application via https://crisprdb.org. The source code and stand-alone version are available at Figshare (https://doi.org/10.6084/m9.figshare.21295863) and GitHub (https://github.com/wang-lab/CRISPRDB).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Leveraging a Pharmacogenomics Knowledge-base to Formulate a Drug Response Phenotype Terminology for Genomic Medicine

Bioinformatics Oxford Journals - Wed, 12/10/2022 - 5:30am
Abstract
Motivation: Despite the increasing evidence of the utility of genomic medicine in clinical practice, systematically integrating genomic medicine information and knowledge into clinical systems with a high level of consistency, scalability, and computability remains challenging. A comprehensive terminology is required for relevant concepts and the associated knowledge model for representing relationships.
Methods: In this study, we leveraged PharmGKB, a comprehensive pharmacogenomics (PGx) knowledgebase, to formulate a terminology for drug response phenotypes that can represent relationships between genetic variants and treatments. We evaluated coverage of the terminology through manual review of a randomly selected subset of 200 sentences extracted from genetic reports that contained concepts for “Genes and Gene Products” and “Treatments”.
Results: Our proposed drug response phenotype terminology could cover 96% of the drug response phenotypes in genetic reports. Among 18,653 sentences that contained both “Genes and Gene Products” and “Treatments”, 3,011 sentences could be mapped to a drug response phenotype in our proposed terminology, among which the most discussed drug response phenotypes were response (994), sensitivity (829), and survival (332). In addition, we were able to re-analyze genetic report context incorporating the proposed terminology and enrich our previously proposed PGx knowledge model to reveal relationships between genetic variants and treatments.
Conclusion: We proposed a drug response phenotype terminology that enhanced structured knowledge representation of genomic medicine.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

METAbolomics data Balancing with Over-sampling Algorithms (Meta-BOA): an online resource for addressing class imbalance

Bioinformatics Oxford Journals - Wed, 12/10/2022 - 5:30am
Abstract
Motivation: Class imbalance, or unequal sample sizes between classes, is an increasing concern in machine learning for metabolomic and lipidomic data mining, which can result in overfitting for the over-represented class. Numerous methods have been developed for handling class imbalance, but they are not readily accessible to users with limited computational experience. Moreover, there is no resource that enables users to easily evaluate the effect of different over-sampling algorithms.
Results: METAbolomics data Balancing with Over-sampling Algorithms (META-BOA) is a web-based application that enables users to select between four different methods for class balancing, followed by data visualization and classification of the sample to observe the augmentation effects. META-BOA outputs a newly balanced dataset, generating additional samples in the minority class, according to the user’s choice of Synthetic Minority Over-sampling Technique (SMOTE), Borderline-SMOTE (BSMOTE), Adaptive Synthetic (ADASYN), or Random Over-Sampling Examples (ROSE). To present the effect of over-sampling on the data, META-BOA further displays both principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) visualization of data pre- and post-over-sampling. Random forest classification is utilized to compare sample classification in both the original and balanced datasets, enabling users to select the most appropriate method for their further analyses.
Availability and implementation: META-BOA is available at https://complimet.ca/meta-boa.
Supplementary information: Supplementary material is available at Bioinformatics online.
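The core idea behind SMOTE, the first of the four over-sampling options listed above, is to create synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. The following is a minimal pure-Python sketch of that idea, not the META-BOA implementation; the function name `smote`, its parameters, and the toy feature vectors are assumptions for illustration.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority class.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2), (0.9, 1.9)]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two real minority samples, the new samples stay inside the region already occupied by the minority class, which is what the PCA and t-SNE views before and after over-sampling would be expected to show.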
Categories: Bioinformatics Trends

Prediction of Drug-likeness using Graph Convolutional Attention Network

Bioinformatics Oxford Journals - Wed, 12/10/2022 - 5:30am
Abstract
Motivation: Drug-likeness has been widely used as a criterion to distinguish drug-like molecules from non-drugs. Developing reliable computational methods to predict the drug-likeness of compounds is crucial to triage unpromising molecules and accelerate the drug discovery process.
Results: In this study, a deep learning method was developed to predict drug-likeness based on the graph convolutional attention network (D-GCAN) directly from molecular structures. Results showed that the D-GCAN model outperformed other state-of-the-art models for drug-likeness prediction. The combination of graph convolution and attention mechanism made an important contribution to the performance of the model. Specifically, the application of the attention mechanism improved accuracy by 4.0%, and the utilization of graph convolution improved accuracy by 6.1%. Results on the dataset beyond Lipinski's rule of five space and the non-US dataset showed that the model had good versatility. Then, the billion-scale GDB-13 database was used as a case study to screen SARS-CoV-2 3C-like protease inhibitors. Sixty-five drug candidates were screened out, most substructures of which are similar to those of existing oral drugs. Candidates screened from S-GDB13 have higher similarity to existing drugs and better molecular docking performance than those from the rest of GDB-13. The screening speed on S-GDB13 is significantly faster than screening directly on GDB-13. In general, D-GCAN is a promising tool to predict drug-likeness for selecting potential candidates and accelerating drug discovery by excluding unpromising candidates and avoiding unnecessary biological and clinical testing.
Availability: The source code, model, and tutorials are available at https://github.com/JinYSun/D-GCAN. The S-GDB13 database is available at https://doi.org/10.5281/zenodo.7054367.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Tree2GD: A Phylogenomic Method to Detect Large Scale Gene Duplication Events

Bioinformatics Oxford Journals - Tue, 11/10/2022 - 5:30am
Abstract
Motivation: Whole-genome duplication events have long been discovered throughout the evolution of eukaryotes, contributing to genome complexity and biodiversity and leaving traces in the descending organisms. Therefore, an accurate and rapid phylogenomic method is needed to identify the retained duplicated genes on various lineages across the target taxonomy.
Results: Here we present Tree2GD, an integrated method to identify large-scale gene duplication events by automatically performing multiple procedures, including sequence alignment, recognition of homologs, gene tree/species tree reconciliation, Ks distribution of gene duplicates and synteny analyses. Application of Tree2GD to two datasets, 12 metazoan genomes and 68 angiosperms, successfully identifies all reported whole-genome duplication events exhibited by these species, showing the effectiveness and efficiency of Tree2GD in phylogenomic analyses of large-scale gene duplications.
Availability and implementation: Tree2GD is written in Python and C++, and is available at https://github.com/Dee-chen/Tree2gd
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

GTDB-Tk v2: memory friendly classification with the Genome Taxonomy Database

Bioinformatics Oxford Journals - Tue, 11/10/2022 - 5:30am
Abstract
Motivation: The Genome Taxonomy Database (GTDB) and associated taxonomic classification toolkit (GTDB-Tk) have been widely adopted by the microbiology community. However, the growing size of the GTDB bacterial reference tree has resulted in GTDB-Tk requiring substantial amounts of memory (∼320 GB) which limits its adoption and ease of use. Here we present an update to GTDB-Tk that uses a divide-and-conquer approach where user genomes are initially placed into a bacterial reference tree with family-level representatives followed by placement into an appropriate class-level subtree comprising species representatives. This substantially reduces the memory requirements of GTDB-Tk while having minimal impact on classification.
Availability: GTDB-Tk is implemented in Python and licenced under the GNU General Public Licence v3.0. Source code and documentation are available at: https://github.com/ecogenomics/gtdbtk.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

CAMML with the Integration of Marker Proteins (ChIMP)

Bioinformatics Oxford Journals - Mon, 10/10/2022 - 5:30am
Abstract
Motivation: Cell typing is a critical task in the analysis of single cell data, particularly when studying complex diseased tissues. Unfortunately, the sparsity and noise of single cell data make accurate cell typing of individual cells difficult. To address these challenges, we previously developed the CAMML method for multi-label cell typing of single cell RNA-sequencing (scRNA-seq) data. CAMML uses weighted gene sets to score each profiled cell for multiple potential cell types. While CAMML outperforms other scRNA-seq cell typing techniques, it only leverages transcriptomic data and so cannot take advantage of newer multi-omic single cell assays that jointly profile gene expression and protein abundance (e.g., joint scRNA-seq/CITE-seq).
Results: We developed the ChIMP (CAMML with the Integration of Marker Proteins) method to support multi-label cell typing of individual cells jointly profiled via scRNA-seq and CITE-seq. ChIMP combines cell type scores computed on scRNA-seq data via the CAMML approach with discretized CITE-seq measurements for cell type marker proteins. The multi-omic cell type scores generated by ChIMP allow researchers to more precisely and conservatively cell type joint scRNA-seq/CITE-seq data.
Availability: An implementation of this work is available on CRAN at https://cran.r-project.org/web/packages/CAMML/.
Supplementary information: Supplementary methods and results are available at Bioinformatics online.
Categories: Bioinformatics Trends

SATINN: An automated neural network-based classification of testicular sections allows for high-throughput histopathology of mouse mutants

Bioinformatics Oxford Journals - Mon, 10/10/2022 - 5:30am
Abstract
Motivation: The mammalian testis is a complex organ with a cellular composition that changes smoothly and cyclically in normal adults. While testis histology is already an invaluable tool for identifying and describing developmental differences in evolution and disease, methods for standardized, digital image analysis of testis are needed to expand the utility of this approach.
Results: We developed SATINN (Software for Analysis of Testis Images with Neural Networks), a multi-level framework for automated analysis of multiplexed immunofluorescence images from mouse testis. This approach uses residual learning to train convolutional neural networks (CNNs) to classify nuclei from seminiferous tubules into 7 distinct cell types with an accuracy of 81.7%. These cell classifications are then used in a second-level tubule CNN, which places seminiferous tubules into one of 12 distinct tubule stages with 57.3% direct accuracy and 94.9% within ±1 stage. We further describe numerous cell- and tubule-level statistics that can be derived from wildtype testis. Finally, we demonstrate how the classifiers and derived statistics can be used to rapidly and precisely describe pathology by applying our methods to image data from two mutant mouse lines. Our results demonstrate the feasibility and potential of using computer-assisted analysis for testis histology, an area poised to evolve rapidly on the back of emerging, spatially-resolved genomic and proteomic technologies.
Availability: The source code to reproduce the results described here and a SATINN standalone application with graphic-user interface are available from http://github.com/conradlab/SATINN.
Supplementary information: Supplementary data are available at Bioinformatics online.
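The distinction between "direct" accuracy and accuracy "within ±1 stage" can be made concrete with a small sketch. It assumes that stages are numbered 1 to 12 and that the scale wraps around, so that stage 12 is adjacent to stage 1; the abstract describes the tissue as cycling, but whether SATINN's metric wraps in this way is an assumption here, as are the function name and the toy labels.

```python
def stage_accuracy(true_stages, predicted_stages, n_stages=12, tolerance=0):
    """Fraction of predictions within `tolerance` stages of the truth on a cyclic scale."""
    hits = 0
    for truth, predicted in zip(true_stages, predicted_stages):
        diff = abs(truth - predicted) % n_stages
        # Assumed wrap-around: the distance between stages 12 and 1 is 1, not 11.
        cyclic_distance = min(diff, n_stages - diff)
        if cyclic_distance <= tolerance:
            hits += 1
    return hits / len(true_stages)


# Hypothetical stage labels for four tubules.
truth = [1, 5, 12, 7]
predicted = [1, 6, 1, 10]
direct = stage_accuracy(truth, predicted)                    # exact matches only
within_one = stage_accuracy(truth, predicted, tolerance=1)   # off by at most one stage
```

Under these assumptions the toy example gives a direct accuracy of 0.25 and a within-one-stage accuracy of 0.75, showing how a large gap between the two figures can arise when most errors fall on neighbouring stages.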
Categories: Bioinformatics Trends

MSNet-4mC: Learning effective multi-scale representations for identifying DNA N4-methylcytosine sites

Bioinformatics Oxford Journals - Fri, 07/10/2022 - 5:30am
Abstract
Motivation: N4-methylcytosine (4mC) is an essential kind of epigenetic modification that regulates a wide range of biological processes. However, experimental methods for detecting 4mC sites are time-consuming and labor-intensive. As an alternative, computational methods that are capable of automatically identifying 4mC with data analysis techniques become a reasonable option. A major challenge is how to develop effective methods to fully exploit the complex interactions within the DNA sequences to improve the predictive capability.
Results: In this work, we propose MSNet-4mC, a lightweight neural network building upon convolutional operations with multi-scale receptive fields to perceive cross-element relationships over both short and long ranges of given DNA sequences. With strong imbalances in the number of candidates in different species in mind, we compute and apply class weights in the cross-entropy loss to balance the training process. Extensive benchmarking experiments show that our method achieves a significant performance improvement and outperforms other state-of-the-art methods.
Availability and implementation: The source code and models are freely available for download at https://github.com/LIU-CT/MSNet-4mC, implemented in Python and supported on Linux and Windows.
Supplementary information: Supplementary data are available at Bioinformatics online.
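Applying class weights in a cross-entropy loss, as described above, can be sketched in a few lines. The abstract does not state how MSNet-4mC computes its weights, so the inverse-frequency ("balanced") scheme below is an assumption, as are the function names and the toy labels; the sketch uses plain Python rather than a deep learning framework.

```python
import math
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights (an assumed scheme): n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_cross_entropy(probabilities, labels, weights):
    """Weighted mean of -log p(true class), normalized by the sum of applied weights."""
    total = sum(
        -weights[label] * math.log(probs[label])
        for probs, label in zip(probabilities, labels)
    )
    return total / sum(weights[label] for label in labels)


# Hypothetical imbalanced labels: three negatives (0) and one positive (1).
labels = [0, 0, 0, 1]
weights = class_weights(labels)
# A model that outputs 0.5 for each class on every sample.
uniform = [[0.5, 0.5]] * len(labels)
loss = weighted_cross_entropy(uniform, labels, weights)
```

With these weights each class contributes equally to the loss in aggregate, so a classifier cannot lower the loss much simply by favouring the majority class.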
Categories: Bioinformatics Trends

scHiCPTR: unsupervised pseudotime inference through dual graph refinement for single-cell Hi-C data

Bioinformatics Oxford Journals - Fri, 07/10/2022 - 5:30am
Abstract
Motivation: The emerging single-cell Hi-C technology provides opportunities to study the dynamics of chromosomal organization. Constructing a pseudotime path from single-cell Hi-C contact matrices to order cells along a developmental trajectory is challenging: the matrices are inherently high-dimensional and sparse, they suffer from noise and biases, and the topology of the underlying trajectory may be diverse.
Results: We present scHiCPTR, an unsupervised graph-based pipeline to infer pseudotime from single-cell Hi-C contact matrices. It provides a workflow consisting of imputation and embedding, graph construction, dual graph refinement, pseudotime calculation and result visualization. Going beyond the few existing methods, scHiCPTR tries to optimize graph structure through two parallel graph-pruning procedures, which help reduce spurious cell links resulting from noise and determine a global developmental directionality. In addition, it can handle developmental trajectories with multiple topologies, including linear, bifurcated and circular ones, and is competitive with methods developed for single-cell RNA-seq data. The comparative results show that scHiCPTR can achieve higher performance in pseudotime inference, and that the inferred developmental trajectory exhibits reasonable biological significance.
Availability: scHiCPTR is freely available at https://github.com/lhqxinghun/scHiCPTR.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

MDSCAN: RMSD-Based HDBSCAN Clustering of Long Molecular Dynamics

Bioinformatics Oxford Journals - Fri, 07/10/2022 - 5:30am
Abstract
Motivation: The term clustering designates a comprehensive family of unsupervised learning methods that group similar elements into sets called clusters. Geometrical clustering of Molecular Dynamics (MD) trajectories is a well-established analysis to gain insights into the conformational behavior of simulated systems. However, popular variants collapse when processing relatively long trajectories because of their quadratic memory or time complexity. From the arsenal of clustering algorithms, HDBSCAN stands out as a hierarchical density-based alternative that provides robust differentiation of intimately related elements from noise data. Although a very efficient implementation of this algorithm is available for programming-skilled users (HDBSCAN*), it cannot treat long trajectories under the de facto molecular similarity metric RMSD.
Results: Here, we propose MDSCAN, an HDBSCAN-inspired software specifically conceived for non-programmer users to perform memory-efficient RMSD-based clustering of long MD trajectories. Methodological improvements over the original version include the encoding of trajectories as a particular class of vantage-point tree (decreasing time complexity), and a dual-heap approach to construct a quasi-minimum spanning tree (reducing memory complexity). MDSCAN was able to process a trajectory of one million frames using the RMSD metric in about 21 hours with less than 8 GB of RAM, a task that would have taken a similar time but more than 32 TB of RAM with the accelerated HDBSCAN* implementation generally used.
Availability and implementation: The source code and documentation of MDSCAN are free and publicly available on GitHub (https://github.com/LQCT/MDScan.git) and as a PyPI package (https://pypi.org/project/mdscan/).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

RPPA SPACE: An R package for normalization and quantitation of Reverse-Phase Protein Array (RPPA) data

Bioinformatics Oxford Journals - Fri, 07/10/2022 - 5:30am
Abstract
Summary: Reverse Phase Protein Array (RPPA) is a robust, high-throughput, cost-effective platform for quantitatively measuring proteins in biological specimens. However, converting raw RPPA data into normalized, analysis-ready data remains a challenging task. Here, we present the RPPA SPACE R package, a substantially improved successor to SuperCurve, to meet that challenge. SuperCurve has been used to normalize over 170,000 samples to date. RPPA SPACE allows exclusion of poor-quality samples from the normalization process to improve the quality of the remaining samples. It also features a novel quality-control metric, “noise,” that estimates the level of random errors present in each RPPA slide. The noise metric can help to determine the quality and reliability of the data. In addition, RPPA SPACE has simpler input requirements and is more flexible than SuperCurve; it is also much faster, with greatly improved error reporting.
Availability and implementation: The standalone RPPA SPACE R package, tutorials and sample data are available via https://rppa.space/, CRAN (https://cran.r-project.org/web/packages/RPPASPACE/index.html), and GitHub (https://github.com/MD-Anderson-Bioinformatics/RPPASPACE).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Phytest: Quality Control for Phylogenetic Analyses

Bioinformatics Oxford Journals - Fri, 07/10/2022 - 5:30am
Abstract
Motivation: The ability to automatically conduct quality control checks on phylogenetic analyses is becoming more important with the increase of genetic sequencing and the use of real-time pipelines, e.g. in the SARS-CoV-2 era. Implementations of real-time phylogenetic analyses require automated testing to make sure that problems in the data are caught automatically within analysis pipelines and in a timely manner. Here we present Phytest (version 1.1), a tool for automating quality control checks on sequences, trees and metadata during phylogenetic analyses.
Results: Phytest is a phylogenetic analysis testing program that easily integrates into existing phylogenetic pipelines. We demonstrate the utility of Phytest with real-world examples.
Availability: Phytest source code is available on GitHub (https://github.com/phytest-devs/phytest) and can be installed via PyPI with the command 'pip install phytest'. Extensive documentation can be found at https://phytest-devs.github.io/phytest/.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

A penalized linear mixed model with generalized method of moments estimators for complex phenotype prediction

Bioinformatics Oxford Journals - Fri, 07/10/2022 - 5:30am
Abstract
Motivation: Linear mixed models have long been the method of choice for risk prediction analysis on high-dimensional data. However, it remains computationally challenging to simultaneously model a large number of variants that can be noise or have predictive effects of complex forms.
Results: In this work, we have developed a penalized linear mixed model with generalized method of moments (pLMMGMM) estimators for prediction analysis. pLMMGMM is built within the linear mixed model framework, where random effects are used to model the joint predictive effects from all variants within a region. Different from existing methods that focus on linear relationships and use empirical criteria for variable screening, pLMMGMM can efficiently detect regions that harbor genetic variants with both linear and non-linear predictive effects. In addition, unlike existing linear mixed models that can only handle a very limited number of random effects, pLMMGMM is much less computationally demanding. It can jointly consider a large number of regions and accurately detect those that are predictive. Through theoretical investigations, we have shown that our method has selection consistency and asymptotic normality. Through extensive simulations and the analysis of PET-imaging outcomes, we have demonstrated that pLMMGMM outperformed existing models and that it can accurately detect regions that harbor risk factors with various forms of predictive effects.
Availability: The R package is available at https://github.com/XiaQiong/GMMLasso.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Discovering Drug-Target interaction Knowledge from Biomedical Literature

Bioinformatics Oxford Journals - Fri, 07/10/2022 - 5:30am
Abstract
Motivation: The interaction between drugs and targets (DTI) in the human body plays a crucial role in biomedical science and applications. As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge, usually in the form of triplets of drugs, targets and their interaction, from the biomedical literature has become an urgent demand in industry. Existing methods of discovering biological knowledge are mainly extractive approaches that often require detailed annotations (e.g., all mentions of biological entities, relations between every two entity mentions, etc.). However, it is difficult and costly to obtain sufficient annotations due to the requirement of expert knowledge from biomedical domains.
Results: To overcome these difficulties, we explore an end-to-end solution for this task by using generative approaches. We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations. Further, we propose a semi-supervised method, which leverages the aforementioned end-to-end model to filter unlabeled literature and label it. Experimental results show that our method significantly outperforms extractive baselines on DTI discovery. We also create a dataset, KD-DTI, to advance this task and release it to the community.
Availability: Our code and data are available at https://github.com/bert-nmt/BERT-DTI.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

scSemiGAN: a single-cell semi-supervised annotation and dimensionality reduction framework based on generative adversarial network

Bioinformatics Oxford Journals - Tue, 04/10/2022 - 5:30am
Abstract
Motivation: Cell-type annotation plays a crucial role in single-cell RNA-seq (scRNA-seq) data analysis. As more and more well-annotated scRNA-seq reference data become publicly available, automatic label transfer algorithms are gaining popularity over manual marker gene-based annotation methods. However, most existing methods fail to unify cell-type annotation with dimensionality reduction, and are unable to generate deep latent representation from the perspective of data generation.
Results: In this article, we propose scSemiGAN, a semi-supervised cell-type annotation and dimensionality reduction framework based on generative adversarial network, to overcome these challenges, modeling scRNA-seq data from the aspect of data generation. Our proposed scSemiGAN is capable of performing deep latent representation learning and cell-type label prediction simultaneously. Through extensive comparison with four state-of-the-art annotation methods on diverse simulated and real scRNA-seq datasets, scSemiGAN achieves competitive or superior performance in multiple downstream tasks including cell-type annotation, latent representation visualization, confounding factor removal and enrichment analysis.
Availability: The code of scSemiGAN is available on GitHub: https://github.com/rafa-nadal/scSemiGAN.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
