TY - JOUR
T1 - Comparison of parametric and machine methods for variable selection in simulated Genetic Analysis Workshop 19 data
AU - Holzinger, Emily R.
AU - Szymczak, Silke
AU - Malley, James
AU - Pugh, Elizabeth W.
AU - Ling, Hua
AU - Griffith, Sean
AU - Zhang, Peng
AU - Li, Qing
AU - Cropp, Cheryl D.
AU - Bailey-Wilson, Joan E.
PY - 2016
Y1 - 2016
AB - Current findings from genetic studies of complex human traits often do not explain a large proportion of the estimated variation of these traits due to genetic factors. This could be, in part, due to overly stringent significance thresholds in traditional statistical methods, such as linear and logistic regression. Machine learning methods, such as Random Forests (RF), are an alternative approach to identify potentially interesting variants. One major issue with these methods is that there is no clear way to distinguish between probable true hits and noise variables based on the importance metric calculated. To this end, we are developing a method called the Relative Recurrency Variable Importance Metric (r2VIM), an RF-based variable selection method. Here, we apply r2VIM to the unrelated Genetic Analysis Workshop 19 data with simulated systolic blood pressure as the phenotype. We compare the number of "true" functional variants identified by r2VIM with those identified by linear regression analyses that use a Bonferroni correction to calculate a significance threshold. Our results show that r2VIM performed comparably to linear regression. Our findings are proof-of-concept for r2VIM, as it identifies numbers of functional and nonfunctional variants similar to those identified by a more commonly used technique when the optimal importance score threshold is used.
UR - http://www.scopus.com/inward/record.url?scp=85016050093&partnerID=8YFLogxK
U2 - 10.1186/s12919-016-0021-1
DO - 10.1186/s12919-016-0021-1
M3 - Journal articles
AN - SCOPUS:85016050093
SN - 1753-6561
VL - 10
JO - BMC Proceedings
JF - BMC Proceedings
M1 - 52
ER -