PaRFR - Parallel Random Forest Regression for Hadoop
Random Forest (RF) is among the best-performing machine learning algorithms for classification tasks and has been successfully applied to the identification of genome-wide associations in case-control studies. RF can also be applied to population association studies with multivariate quantitative traits, in which case the classification task is replaced by a regression task. For instance, high-dimensional traits arise naturally in recent neuroimaging genetics studies, in which the phenotypic variability in the human brain is measured by means of 3D neuroimaging data. We have developed a parallel version of RF for regression tasks with both univariate and multivariate responses, called PaRFR (Parallel Random Forest Regression), to support multivariate quantitative trait loci mapping in unrelated subjects. PaRFR takes advantage of the MapReduce programming model and is deployed on Hadoop. Notable speed-ups have been obtained by introducing a distance-based criterion for node splitting.
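The description above does not spell out the distance-based splitting criterion, so the following is only a minimal sketch of one way such a criterion can work, assuming Euclidean distances between trait vectors. It relies on the standard identity that the sum of squared distances from the points in a node to their centroid equals the sum of squared pairwise distances divided by the node size. Under that assumption, the pairwise distance matrix can be computed once, and each candidate split is then scored without revisiting the high-dimensional traits. All function names here (`pairwise_sq_distances`, `within_node_scatter`, `best_split`) are hypothetical and are not taken from PaRFR.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two trait vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pairwise_sq_distances(Y):
    """Precompute the n x n matrix of squared distances between responses.

    Y is a list of n trait vectors (one per subject). This is done once,
    so later split evaluations do not depend on the trait dimension.
    """
    n = len(Y)
    D = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        D[i][j] = D[j][i] = sq_dist(Y[i], Y[j])
    return D


def within_node_scatter(D, idx):
    """Sum of squared distances to the node centroid, from distances only.

    Uses the identity: sum_i ||y_i - mean||^2 = (1/n) * sum_{i<j} d_ij^2.
    """
    n = len(idx)
    if n == 0:
        return 0.0
    return sum(D[i][j] for i, j in combinations(idx, 2)) / n


def best_split(x, D):
    """Pick the threshold on predictor x that minimizes total scatter.

    x holds one predictor value per subject (for example a genotype
    coded 0/1/2, which is an assumption made for illustration).
    Returns (threshold, cost); threshold is None if x is constant.
    """
    best_t, best_cost = None, float("inf")
    for t in sorted(set(x))[:-1]:  # last value would leave the right child empty
        left = [i for i, v in enumerate(x) if v <= t]
        right = [i for i, v in enumerate(x) if v > t]
        cost = within_node_scatter(D, left) + within_node_scatter(D, right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost
```

For example, with four subjects whose trait vectors are `[0, 0]`, `[0, 1]`, `[5, 5]`, `[5, 6]` and predictor values `[0, 0, 1, 2]`, calling `best_split(x, pairwise_sq_distances(Y))` selects the threshold that separates the first two subjects from the last two. In a MapReduce setting, a natural (but here assumed) arrangement would be to grow trees independently in mappers and combine them in a reducer.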
Wang Y., Goh W., Wong L. and Montana G. (2013) Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes. BMC Bioinformatics, 14(Suppl 16):S6.