Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning

Bioinformatics Oxford Journals - Mon, 07/08/2023 - 5:30am
Motivation: Few-shot learning (FSL), which can effectively perform named entity recognition in low-resource scenarios, has attracted growing attention, but it has not yet been widely studied in the biomedical field. In contrast to high-resource domains, biomedical named entity recognition (BioNER) often encounters limited human-labeled data in real-world scenarios, leading to poor generalization performance when training on only a few labeled instances. Recent approaches either leverage cross-domain high-resource data or fine-tune the pre-trained masked language model using limited labeled samples to generate new synthetic data, which either suffers from domain shift or yields low-quality synthetic data. Therefore, in this paper, we study a more realistic scenario, i.e., few-shot learning for BioNER.
Results: Leveraging the domain knowledge graph, we propose knowledge-guided instance generation for few-shot BioNER, which generates diverse and novel entities based on similar semantic relations of neighbor nodes. In addition, by introducing a question prompt, we cast BioNER as a question answering (QA) task and propose prompt contrastive learning to improve the robustness of the model by measuring the mutual information (MI) between query-answer pairs. Extensive experiments conducted on various few-shot settings show that the proposed framework achieves superior performance. In particular, in a low-resource scenario with only 20 samples, our approach substantially outperforms recent state-of-the-art (SoTA) models on four benchmark datasets, achieving an average improvement of up to 7.1% F1.
Availability: Our source code and data are available at https://github.com/cpmss521/KGPC.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
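
The few-shot BioNER abstract above mentions contrastive learning driven by the mutual information between query-answer pairs, without giving the loss itself. As a rough illustration only, the sketch below uses the widely known InfoNCE objective, for which log N minus the loss is a lower bound on the mutual information. InfoNCE, the function name `info_nce`, the toy vectors, and the temperature value are all assumptions made here, not details taken from the paper or its repository.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, answers, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    queries[i] and answers[i] form the positive pair; every other
    answer in the batch serves as a negative for queries[i].
    Assumes all vectors are non-zero (hypothetical toy embeddings).
    """
    q = [normalize(v) for v in queries]
    a = [normalize(v) for v in answers]
    n = len(q)
    total = 0.0
    for i in range(n):
        # Cosine similarities scaled by the temperature.
        logits = [dot(q[i], a[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy batch: two orthogonal embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(embeddings, embeddings)           # matching pairs
shuffled_loss = info_nce(embeddings, embeddings[::-1])    # mismatched pairs
```

With matching pairs the loss is close to zero, and with mismatched pairs it is large, which is the behaviour a contrastive objective is meant to reward and penalize respectively.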

TALAIA: A 3D visual dictionary for protein structures

Bioinformatics Oxford Journals - Mon, 07/08/2023 - 5:30am
Summary: Graphical analysis of the molecular structure of proteins can be very complex. Full-atom representations retain most geometric information but are generally crowded, and key structural patterns can be challenging to identify. Non-full-atom representations can be more instructive on physicochemical aspects but insufficiently detailed regarding shapes (e.g., entity beans-like models in coarse-grain approaches) or simple properties of amino acids (e.g., representation of superficial electrostatic properties). TALAIA aims to provide another layer of structural representations. It is a visual dictionary where a unique object, with differentiated shapes and colors, represents each amino acid. It makes it easier to spot crucial molecular information, including patches of amino acids or key interactions between side chains. Most conventions used in TALAIA are standard in chemistry and biochemistry, so experimentalists and modelers can rapidly grasp the meaning of any TALAIA depiction.
Motivation: The work aims to offer a visual grammar that combines simple representations of amino acids while retaining their general geometry and physicochemical properties.
Results: We propose a tool that renders protein structures and encodes structure and physicochemical aspects as a simple visual grammar. The approach is fast, highly informative, and intuitive, allowing the identification of possible interactions, hydrophobic patches, and other characteristic structural features at first glance.
Availability: https://github.com/insilichem/talaia
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Enhancing Cryo-EM Maps With 3D Deep Generative Networks For Assisting Protein Structure Modeling

Bioinformatics Oxford Journals - Mon, 07/08/2023 - 5:30am
Motivation: The tertiary structures of an increasing number of biological macromolecules have been determined using cryo-electron microscopy (cryo-EM). However, there are still many cases where the resolution is not high enough to model the molecular structures with standard computational tools. If the resolution obtained is near the empirical borderline (3–4.5 Å), improvement in the map quality facilitates improved structure modeling.
Results: We report EM-GAN, a novel approach that modifies an input cryo-EM map to assist protein structure modeling. The method uses a 3D generative adversarial network (GAN) that has been trained on high- and low-resolution density maps to learn the density patterns, and modifies the input map to enhance its suitability for modeling. The method was tested extensively on a dataset of 65 EM maps in the resolution range of 3 Å to 6 Å and showed substantial improvements in structure modeling using popular protein structure modeling tools.
Availability: https://github.com/kiharalab/EM-GAN, Google Colab: https://tinyurl.com/3ccxpttx
Categories: Bioinformatics Trends

NanopoReaTA: a user-friendly tool for nanopore-seq real-time transcriptional analysis

Bioinformatics Oxford Journals - Mon, 07/08/2023 - 5:30am
Summary: Oxford Nanopore Technologies' (ONT) sequencing platform offers an excellent opportunity to perform real-time analysis during sequencing. This feature allows for early insights into experimental data and accelerates a potential decision-making process for further analysis, which can be particularly relevant in the clinical context. Although some tools for the real-time analysis of DNA-sequencing data already exist, there is currently no application available for differential transcriptome data analysis designed for scientists or physicians with limited bioinformatics knowledge. Here we introduce NanopoReaTA, a user-friendly real-time analysis toolbox for RNA sequencing data from ONT. Sequencing results from a running or finished experiment are processed through an R Shiny-based graphical user interface (GUI) with an integrated Nextflow pipeline for whole transcriptome or gene-specific analyses. NanopoReaTA provides visual snapshots of a sequencing run in progress, thus enabling interactive sequencing and rapid decision-making that could also be applied to clinical cases.
Availability: GitHub https://github.com/AnWiercze/NanopoReaTA; Zenodo https://doi.org/10.5281/zenodo.8099825
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

PACT: A pipeline for analysis of circulating tumor DNA

Bioinformatics Oxford Journals - Mon, 07/08/2023 - 5:30am
Motivation: Detection of genomic alterations in circulating tumor DNA (ctDNA) is currently used for active clinical monitoring of cancer progression and treatment response. While methods for analysis of small mutations are more developed, strategies for detecting structural variants (SVs) in ctDNA are limited. Additionally, reproducibly calling small-scale mutations, copy number alterations, and SVs in ctDNA is challenging due to the lack of unified tools for these different classes of variants.
Results: We developed a unified pipeline for the analysis of ctDNA (PACT) that accurately detects SVs and consistently outperformed similar tools when applied to simulated, cell line, and clinical data. We provide PACT in the form of a Common Workflow Language pipeline which can be run by popular workflow management systems in high-performance computing environments.
Availability: PACT is freely available at https://github.com/ChrisMaherLab/PACT
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Towards in silico CLIP-seq: predicting protein-RNA interaction via sequence-to-signal learning

Genome Biology - BiomedCentral - Fri, 04/08/2023 - 5:30am
We present RBPNet, a novel deep learning method, which predicts CLIP-seq crosslink count distribution from RNA sequence at single-nucleotide resolution. By training on up to a million regions, RBPNet achieves ...
Categories: Bioinformatics Trends

CoCoNat: a novel method based on deep-learning for coiled-coil prediction

Bioinformatics Oxford Journals - Fri, 04/08/2023 - 5:30am
Motivation: Coiled-coil domains (CCD) are widespread in all organisms and perform several crucial functions. Given their relevance, the computational detection of coiled-coil domains is very important for protein functional annotation. State-of-the-art prediction methods include the precise identification of coiled-coil domain boundaries, the annotation of the typical heptad repeat pattern along the coiled-coil helices, as well as the prediction of the oligomerization state.
Results: In this paper we describe CoCoNat, a novel method for predicting coiled-coil helix boundaries, residue-level register annotation and oligomerization state. Our method encodes sequences with the combination of two state-of-the-art protein language models and implements a three-step deep learning procedure concatenated with a Grammatical-Restrained Hidden Conditional Random Field (GRHCRF) for CCD identification and refinement. A final neural network (NN) predicts the oligomerization state. When tested on a routinely adopted blind test set, CoCoNat obtains a performance superior to the current state-of-the-art for both residue-level and segment-level coiled-coil detection. CoCoNat significantly outperforms the most recent state-of-the-art methods on register annotation and prediction of oligomerization states.
Availability: The CoCoNat web server is available at https://coconat.biocomp.unibo.it. A standalone version is available on GitHub at https://github.com/BolognaBiocomp/coconat.
Categories: Bioinformatics Trends

ProtoCell4P: An Explainable Prototype-based Neural Network for Patient Classification Using Single-cell RNA-seq

Bioinformatics Oxford Journals - Fri, 04/08/2023 - 5:30am
Motivation: The rapid advance in single-cell RNA sequencing (scRNA-seq) technology over the past decade has provided a rich resource of gene expression profiles of single cells measured on patients, facilitating the study of many biological questions at the single-cell level. One intriguing research direction is to study the single cells that play critical roles in the phenotypes of patients, which has the potential to identify the cells and genes driving the disease phenotypes. To this end, deep learning models are expected to encode the single-cell information well and achieve precise prediction of patients’ phenotypes using scRNA-seq data. However, we face critical challenges in designing deep learning models for classifying patient samples because (1) the samples collected in the same dataset contain a variable number of cells: some samples might only have hundreds of cells sequenced while others could have thousands, and (2) the number of samples available is typically small and the expression profile of each cell is noisy and extremely high-dimensional. Moreover, the black-box nature of existing deep learning models makes it difficult for researchers to interpret the models and extract useful knowledge from them.
Results: We propose a prototype-based and cell-informed model for patient phenotype classification, termed ProtoCell4P, that can alleviate the problems of sample scarcity and the varying number of cells by leveraging cell knowledge with representatives of cells (called prototypes), and precisely classify patients by adaptively incorporating information from different cells. Moreover, this classification process can be explicitly interpreted by identifying the key cells for decision making and by further summarizing the knowledge of cell types to unravel the biological nature of the classification. Our approach is explainable at single-cell resolution and can identify the key cells in each patient’s classification. The experimental results demonstrate that our proposed method can effectively deal with patient classification using single-cell data and outperforms the existing approaches. Furthermore, our approach is able to uncover the association between cell types and biological classes of interest from a data-driven perspective.
Availability: https://github.com/Teddy-XiongGZ/ProtoCell4P
Categories: Bioinformatics Trends

SHEPHARD: a modular and extensible software architecture for analyzing and annotating large protein datasets

Bioinformatics Oxford Journals - Fri, 04/08/2023 - 5:30am
Motivation: The emergence of high-throughput experiments and high-resolution computational predictions has led to an explosion in the quality and volume of protein sequence annotations at proteomic scales. Unfortunately, sanity checking, integrating, and analyzing complex sequence annotations remains logistically challenging and introduces a major barrier to entry for even superficial integrative bioinformatics.
Results: To address this technical burden, we have developed SHEPHARD, a Python framework that trivializes large-scale integrative protein bioinformatics. SHEPHARD combines an object-oriented hierarchical data structure with database-like features, enabling programmatic annotation, integration, and analysis of complex datatypes. Importantly, SHEPHARD is easy to use and enables a Pythonic interrogation of large-scale protein datasets with millions of unique annotations. We use SHEPHARD to examine three orthogonal proteome-wide questions relating protein sequence to molecular function, illustrating its ability to uncover novel biology.
Availability: We provide SHEPHARD as both a stand-alone software package (https://github.com/holehouse-lab/shephard) and a Google Colab notebook with a collection of precomputed proteome-wide annotations (https://github.com/holehouse-lab/shephard-colab).
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Ionmob: A Python Package for Prediction of Peptide Collisional Cross-Section Values

Bioinformatics Oxford Journals - Fri, 04/08/2023 - 5:30am
Motivation: Including ion mobility separation (IMS) in mass spectrometry proteomics experiments is useful for improving coverage and throughput. Many IMS devices enable linking the experimentally derived mobility of an ion to its collisional cross-section (CCS), a highly reproducible physicochemical property dependent on the ion’s mass, charge and conformation in the gas phase. Thus, known peptide ion mobilities can be used to tailor acquisition methods or to refine database search results. The large space of potential peptide sequences, driven also by post-translational modifications (PTMs) of amino acids, motivates an in silico predictor for peptide CCS. Recent studies explored the general performance of varying machine-learning techniques; however, the workflow engineering part was of secondary importance. For the sake of applicability, such a tool should be generic, data-driven and easily adaptable to individual workflows for experimental design and data processing.
Results: We created ionmob, a Python-based framework for data preparation, training, and prediction of collisional cross-section values of peptides. It is easily customizable and includes a set of pretrained, ready-to-use models and preprocessing routines for training and inference. Using a set of ≈21,000 unique phosphorylated peptides and ≈17,000 MHC ligand sequences and charge state pairs, we expand upon the space of peptides that can be integrated into CCS prediction. Lastly, we investigate the applicability of in silico predicted CCS to increase confidence in identified peptides by applying methods of re-scoring and demonstrate that predicted CCS values complement existing predictors for that task.
Availability: The Python package is available on GitHub: https://github.com/theGreatHerrLebert/ionmob.
Categories: Bioinformatics Trends

Flame (v2.0): advanced integration and interpretation of functional enrichment results from multiple sources

Bioinformatics Oxford Journals - Fri, 04/08/2023 - 5:30am
Functional enrichment is the process of identifying implicated functional terms from a given input list of genes or proteins. In this article, we present Flame (v2.0), a web tool which offers a combinatorial approach through merging and visualizing results from widely used functional enrichment applications while also allowing various flexible input options. In this version, Flame utilizes the aGOtool, g:Profiler, WebGestalt and Enrichr pipelines and presents their outputs separately or in combination following a visual analytics approach. For intuitive representations and easier interpretation, it uses interactive plots such as parameterizable networks, heatmaps, barcharts and scatter plots. Users can also: (i) handle multiple protein/gene lists and analyze union and intersection sets simultaneously through interactive UpSet plots, (ii) automatically extract genes and proteins from free text through text-mining and Named Entity Recognition (NER) techniques, (iii) upload single nucleotide polymorphisms (SNPs) and extract their related genes or (iv) analyze multiple lists of differentially expressed proteins/genes after selecting them interactively from a parameterizable volcano plot. Compared to the previous version, which supported 197 organisms, Flame (v2.0) currently allows enrichment for 14,436 organisms.
Availability: Web application: http://flame.pavlopouloslab.info; Code: https://github.com/PavlopoulosLab/Flame; Docker: https://hub.docker.com/r/pavlopouloslab/flame
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

A syntelog-based pan-genome provides insights into rice domestication and de-domestication

Genome Biology - BiomedCentral - Thu, 03/08/2023 - 5:30am
Asian rice is one of the world’s most widely cultivated crops. Large-scale resequencing analyses have been undertaken to explore the domestication and de-domestication genomic history of Asian rice, but the ev...
Categories: Bioinformatics Trends

BEDwARS: a robust Bayesian approach to bulk gene expression deconvolution with noisy reference signatures

Genome Biology - BiomedCentral - Thu, 03/08/2023 - 5:30am
Differential gene expression in bulk transcriptomics data can reflect change of transcript abundance within a cell type and/or change in the proportions of cell types. Expression deconvolution methods can help...
Categories: Bioinformatics Trends

Boosting variant-calling performance with multi-platform sequencing data using Clair3-MP

BMC Bioinformatics - Thu, 03/08/2023 - 5:30am
With the continuous advances in third-generation sequencing technology and the increasing affordability of next-generation sequencing technology, sequencing data from different sequencing technology platforms ...
Categories: Bioinformatics Trends

Prediction of pathogenic single amino acid substitutions using molecular fragment descriptors

Bioinformatics Oxford Journals - Thu, 03/08/2023 - 5:30am
Motivation: Next Generation Sequencing technologies make it possible to detect rare genetic variants in individual patients. Currently, more than a dozen software tools and web services have been created to predict the pathogenicity of variants that change amino acid residues. Despite considerable efforts in this area, there is at present no ideal method to classify pathogenic and harmless variants, and assessments of pathogenicity are often contradictory. In this article, we propose describing amino acid substitutions with the structural formulas of protein peptides rather than a single-letter code. This allowed us to investigate the effectiveness of a chemoinformatics approach to assessing the pathogenicity of variants associated with amino acid substitutions.
Results: Structure-activity relationship analysis relying on protein-specific data and atom-centric substructural multilevel neighborhoods of atoms (MNA) descriptors of molecular fragments appeared to be suitable for predicting the pathogenic effect of single amino acid variants. An MNA-based Naïve Bayes classifier algorithm, together with ClinVar and humsavar data, was used to create structure-activity relationship models for 10 proteins. The performance of the models was compared with 11 different prediction tools: eight individual (SIFT 4G, Polyphen2 HDIV, MutationAssessor, PROVEAN, FATHMM, MVP, LIST-S2, MutPred) and three consensus (M-CAP, MetaSVM, MetaLR). The accuracy of the MNA-based method varies across proteins (AUC: 0.631-0.993; MCC: 0.191-0.891). It was similar in comparisons with both the other individual predictors and third-party protein-specific predictors. For several proteins (BRCA1, BRCA2, COL1A2, and RYR1), the performance of the MNA-based method was outstanding, capable of capturing the pathogenic effect of structural changes in amino acid substitutions.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

BERTrand—peptide:TCR binding prediction using Bidirectional Encoder Representations from Transformers augmented with random TCR pairing

Bioinformatics Oxford Journals - Thu, 03/08/2023 - 5:30am
Motivation: The advent of T cell receptor (TCR) sequencing experiments has allowed for a significant increase in the amount of peptide:TCR binding data available, and a number of machine learning models have appeared in recent years. High-quality prediction models for a fixed epitope sequence are feasible, provided enough known binding TCR sequences are available. However, their performance drops significantly for previously unseen peptides.
Results: We prepare a dataset of known peptide:TCR binders and augment it with negative decoys created using healthy donors’ T-cell repertoires. We employ deep learning methods commonly applied in Natural Language Processing (NLP) to train a peptide:TCR binding model with a degree of cross-peptide generalization (0.69 AUROC). We demonstrate that BERTrand outperforms the published methods when evaluated on peptide sequences not used during model training.
Availability: The datasets and the code for model training are available at https://github.com/SFGLab/bertrand
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

Block Aligner: an adaptive SIMD-accelerated aligner for sequences and position-specific scoring matrices

Bioinformatics Oxford Journals - Thu, 03/08/2023 - 5:30am
Motivation: Efficiently aligning sequences is a fundamental problem in bioinformatics. Many recent algorithms for computing alignments through Smith-Waterman-Gotoh dynamic programming exploit Single Instruction Multiple Data (SIMD) operations on modern CPUs for speed. However, these advances have largely ignored difficulties associated with efficiently handling complex scoring matrices or large gaps (insertions or deletions).
Results: We propose a new SIMD-accelerated algorithm called Block Aligner for aligning nucleotide and protein sequences against other sequences or position-specific scoring matrices. We introduce a new paradigm that uses blocks in the dynamic programming matrix that greedily shift, grow, and shrink. This approach allows regions of the dynamic programming matrix to be adaptively computed. Our algorithm is 5-10 times faster than some previous methods while incurring an error rate of less than 3% on protein and long read datasets, despite large gaps and low sequence identities.
Availability: Our algorithm is implemented for global, local, and X-drop alignments. It is available as a Rust library (with C bindings) at https://github.com/Daniel-Liu-c0deb0t/block-aligner.
Categories: Bioinformatics Trends
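
For readers unfamiliar with the Smith-Waterman-Gotoh dynamic programming named in the Block Aligner abstract above, here is a minimal scalar sketch of local alignment scoring with affine gaps. It is the plain textbook recurrence over the full matrix, not Block Aligner's adaptive block scheme and not SIMD-accelerated. The function name `swg_local_score` and the scoring parameters (match 2, mismatch -1, gap open -3, gap extend -1) are illustrative assumptions.

```python
def swg_local_score(a, b, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score of strings a and b with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Only two rows of the matrix are kept, so memory is O(len(b)).
    """
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    h_prev = [0] * (m + 1)        # H: best score of an alignment ending at the cell above
    f_prev = [neg_inf] * (m + 1)  # F: best score ending in a vertical gap, previous row
    best = 0
    for i in range(1, n + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # E: best score ending in a horizontal gap, current row
        for j in range(1, m + 1):
            # Open a new gap from H, or extend the gap already in progress.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            score = match if a[i - 1] == b[j - 1] else mismatch
            # Local alignment: a cell never drops below zero.
            h_cur[j] = max(0, h_prev[j - 1] + score, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best


identical = swg_local_score("ACGT", "ACGT")             # four matches
unrelated = swg_local_score("AAAA", "TTTT")             # nothing aligns
gapped = swg_local_score("AAAACCCC", "AAAAGGCCCC")      # one gap of length 2
```

Every cell of the n × m matrix is filled here, which is the cost that block-based or banded approaches try to avoid by computing only part of the matrix.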

NCOurd: Modelling length distributions of NCO events and gene conversion tracts

Bioinformatics Oxford Journals - Thu, 03/08/2023 - 5:30am
Motivation: Meiotic recombination is the main driving force of human genetic diversity, along with mutations. Recombinations split into crossovers, separating large chromosomal regions originating from different homologous chromosomes, and non-crossovers (NCOs), where a small segment from one chromosome is embedded in a region originating from the homologous chromosome. NCOs are much less studied than mutations and crossovers as NCOs are short and can only be detected at markers heterozygous in the transmitting parent, leaving most of them undetectable.
Results: The detectable NCOs, known as gene conversions, hide information about NCOs, including their number and length, waiting to be unveiled. We introduce NCOurd, software and algorithm, based on an expectation maximisation algorithm, to estimate the number of NCOs and their length distribution from gene conversion data.
Availability: https://github.com/DecodeGenetics/NCOurd
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends

KOunt—A reproducible KEGG orthologue abundance workflow

Bioinformatics Oxford Journals - Thu, 03/08/2023 - 5:30am
Summary: Accurate gene prediction is essential for successful metagenome analysis. We present KOunt, a Snakemake pipeline that precisely quantifies KEGG orthologue abundance.
Availability and implementation: KOunt is available on GitHub: https://github.com/WatsonLab/KOunt. The KOunt reference database is available on figshare: https://doi.org/10.6084/m9.figshare.21269715. Test data are available at https://doi.org/10.6084/m9.figshare.22250152 and version 1.2.0 of KOunt at https://doi.org/10.6084/m9.figshare.23607834.
Supplementary information: Supplementary data are available at Bioinformatics online.
Categories: Bioinformatics Trends
