CADD-SV – Scoring functional effects and deleteriousness of structural variants using machine learning

Project: DFG Projects › DFG Individual Projects

Project Details

Description

In light of recent advances in structural variant (SV) detection and the study of regulatory genome architectures, we propose a computational approach to estimate the effects of SVs across the human genome. Because of their size, SVs may encompass different types of genomic sequence: sequences encoding proteins and functional RNAs, sequences of a regulatory nature, and sequences not anticipated to be functional. In particular, SVs may interfere with the regulatory architecture of the genome and have therefore moved into the focus of research, as they can help explain previously unexplained disease phenotypes. In our preliminary work, we derived an unbiased and sufficiently large training dataset to differentiate functional SVs from neutral variants and to train machine learning models for insertions, deletions and duplications. This work also enables fast SV annotation and data summarization and allows us to combine a large collection of features in a machine learning model to identify functional and disease-relevant SVs. Here, we will further develop this idea and specifically address the following aims: (1) improving the scoring of SVs by integrating sequence-based scores, e.g. predicting the potential functional content of inserted sequences; (2) including new model features (e.g. SCREEN candidate regulatory elements and gene fusions) and applying convolutional neural networks (CNNs) to generalize functional data (e.g. across many cell types) or to predict molecular assay data for new sequences (e.g. Hi-C contacts with deepC); and (3) developing a robust and superior score for SVs across the whole genome, confirmed by an unbiased benchmark, together with model interpretation for the most relevant predictive features and an assessment of the contribution of mechanistic effects in pathogenic SVs (e.g. 3D architecture vs. coding sequence effects).
The result will be an improved general framework (Combined Annotation Dependent Depletion for Structural Variants, CADD-SV) for the computational scoring of structural variants, based on integrating diverse information ranging from regulatory genome architecture to coding sequence effects. We will develop an innovative machine learning tool and scoring website to make SV prioritization easily accessible to the community. The interpretation of our models can provide mechanistic insights into genome regulation and serve as a resource for the discovery of new genotype-phenotype effects.
Status: Active
Effective start/end date: 01.01.23 to 31.12.27

UN Sustainable Development Goals

In 2015, UN member states agreed to 17 global Sustainable Development Goals (SDGs) to end poverty, protect the planet and ensure prosperity for all. This project contributes towards the following SDG(s):

  • SDG 3 - Good Health and Well-being

Research Areas and Centers

  • Research Area: Medical Genetics

DFG Research Classification Scheme

  • 2.11-07 Bioinformatics and Theoretical Biology
  • 2.11-05 General Genetics and Functional Genome Biology
  • 2.22-03 Human Genetics

Funding Institution

  • DFG: German Research Foundation
