Project Details
Description
The majority of variants associated with common diseases, and an unknown proportion of causal variants for rare diseases, fall in non-coding regions of the genome. Although catalogs of regulatory elements are steadily improving, we have a limited understanding of the functional effects of variants within them. In the context of precision medicine, machine learning (ML) methods are developed and applied to prioritize and implicate deleterious variants in human disease. Their major focus has been on coding sequence, and the much larger non-coding part of the human genome remains underexplored. We believe that ML methods can create valuable models of regulatory variants, but obtaining comprehensive training and validation datasets remains a major challenge. Massively parallel reporter assays (MPRAs) can overcome this shortage, but their throughput is limited relative to the large universe of hundreds of millions of potential variants. Here, we aim to develop improved predictors of regulatory function using ML and high-throughput assays.

We propose an innovative variant selection approach on a genome-wide scale that will select more than 120,000 variants across more than 60,000 regions for MPRA testing in multiple cell types. We will use deep neural networks trained on active and non-active open chromatin sequences from multiple cell types to select potential high-effect and no-effect changes. This initial model will provide insights into the sequence encoding of regulatory variant effects in different cell types, and the resulting MPRA datasets will provide a better understanding of regulatory sequence function when analyzed in the context of available epigenomic datasets. Further, the obtained readouts can serve as the basis for iterative improvements to the selection strategy and will benefit future MPRA studies.

The derived training dataset will constitute a genome-wide gold standard of quantitative variant effects, which is urgently needed by modeling groups. Integrated with comprehensive sets of publicly available datasets across multiple cell types and tissues, it will enable us to establish a new generation of regulatory variant predictors. Integrating the new predictions into a genome-wide variant effect prediction framework (CADD) will improve the prioritization of disease-causing variants and make these predictions widely accessible to the clinical community. While initially limited to a small number of well-studied cell types with comprehensive experimental data, we are confident that the principles identified from our analyses and models will transfer to other cell types, for which datasets are becoming available through recent single-cell epigenetic assays.

We previously developed ML methods for different variant classes and have a long-standing interest in addressing the variant interpretation problem. With our established collaborations and expertise, we are uniquely positioned to develop improved variant effect predictors of regulatory sequences using MPRAs and ML methods.
Status | Active |
---|---|
Effective start/end date | 01.01.21 → 31.12.26 |
Research Areas and Centers
- Research Area: Medical Genetics
DFG Research Classification Scheme
- 2.11-07 Bioinformatics and Theoretical Biology
Funding Institution
- DFG: German Research Foundation