Date(s) - 29/05/2017
11:00 - 12:00
Complex phenotypes are influenced by many different single nucleotide polymorphisms (SNPs) and genes. Genome-wide association studies (GWAS) are a common approach for statistically associating tag SNPs with a given complex phenotype. A large number of associated loci fall into non-coding regions, which contribute to the phenotype through the regulation of target genes. SNPs in these regions are more difficult to analyze because of the heterogeneity of regulatory mechanisms. A number of bioinformatics tools exist to prioritize causal regulatory SNPs, but these methods are not able to distinguish associated SNPs, because most associated loci do not cause the phenotype and merely co-occur with an unknown causal SNP. Predictive models of associated loci might be useful to integrate the vast number of known phenotype-specific associated loci and to prioritize loci whose association is of borderline significance.
Here we present a method to train a supervised classification model using associated loci and regulatory features to predict likely associated loci in non-coding regions of the genome. Leave-one-chromosome-out cross-validation shows area under the curve (AUC) values between 0.71 and 0.79 for intronic and intergenic SNPs. Analysis of the learning matrices shows good agreement with the known roles of histone marks in the prediction of associated SNPs. We also find that crucial genes, such as cancer genes, often contain SNPs with positive scores, whereas unannotated genes, which are likely less important, mostly contain SNPs with negative scores.
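To make the evaluation scheme concrete, the following is a minimal sketch of leave-one-chromosome-out cross-validation with a rank-based AUC, written in plain Python. It is not the method presented in the talk: the centroid-based scorer is a toy stand-in for the supervised classifier, and all function names, feature values and chromosome labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def auc(labels, scores):
    """Rank-based AUC: chance that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_chromosome_out(chromosomes):
    """Yield (held_out, train_idx, test_idx) once per distinct chromosome."""
    by_chrom = defaultdict(list)
    for i, chrom in enumerate(chromosomes):
        by_chrom[chrom].append(i)
    for held_out, test_idx in by_chrom.items():
        train_idx = [i for i, c in enumerate(chromosomes) if c != held_out]
        yield held_out, train_idx, test_idx


def fit_centroid_scorer(X, y):
    """Toy stand-in classifier (an assumption, not the talk's model):
    score(x) = dot(x, mean of positives - mean of negatives).
    Requires both classes to be present in the training rows."""
    n_feat = len(X[0])

    def mean(rows):
        return [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]

    pos_mean = mean([x for x, t in zip(X, y) if t == 1])
    neg_mean = mean([x for x, t in zip(X, y) if t == 0])
    w = [p - n for p, n in zip(pos_mean, neg_mean)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def loco_auc(X, y, chromosomes):
    """Score every SNP with a model that never saw its chromosome,
    then compute a single AUC over the pooled held-out scores."""
    scores = [None] * len(y)
    for _, train_idx, test_idx in leave_one_chromosome_out(chromosomes):
        score = fit_centroid_scorer([X[i] for i in train_idx],
                                    [y[i] for i in train_idx])
        for i in test_idx:
            scores[i] = score(X[i])
    return auc(y, scores)


# Hypothetical toy data: two made-up regulatory features per SNP,
# label 1 = associated SNP, label 0 = background SNP.
X = [[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]
y = [1, 0, 1, 0, 1, 0]
chromosomes = ["chr1", "chr1", "chr2", "chr2", "chr3", "chr3"]

result = loco_auc(X, y, chromosomes)
```

Holding out whole chromosomes, rather than random SNPs, keeps nearby and potentially linked SNPs from appearing in both the training and the test set, which would otherwise inflate the estimated performance.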
In conclusion, this new method predicts and helps understand the function of the non-coding genome based on associated SNPs and regulatory feature data.