LAPACK routines are thread-safe... depending on how you build them

August 14, 2012Posted by Maurice Berk

 
I spent last week testing my R package 'sme' for race conditions using the helgrind tool. I was already aware that the LAPACK routines built-in to R, based on the now fairly old version 3.1 released in 2006, were not thread-safe but I wasn't fully aware how pervasive the problem was. DLAMCH is the offending routine, which is called pretty much all over the place, regardless of whether you want to carry out a singular-value decomposition, find a least-squares solution or invert a matrix.

The solution was to bundle up-to-date versions of the routines my code calls (DLMACH has been fixed since LAPACK version 3.3.0) but the story does not end there. Depending on how you compile these routines, they may still not be thread-safe. I had problems using gfortran to compile DORMQR which declares a local two-dimensional double-precision array T of size LDT by NBMAX where NBMAX = 64 and LDT = NBMAX + 1. The problem is that if doubles on your platform are 8-bytes in size, the array occupies 65 * 64 * 8 = 33,280 bytes. Referring to the gfortran man page:

-fmax-stack-var-size=n

This option specifies the size in bytes of the largest array that will be put on the stack; if the size is exceeded static memory is used (except in procedures marked as RECURSIVE ). Use the option -frecursive to allow for recursive procedures which do not have a RECURSIVE attribute or for parallel programs. Use -fno-automatic to never use the stack.

This option currently only affects local arrays declared with constant bounds, and may not apply to all character variables. Future versions of GNU Fortran may improve this behavior.

The default value for n is 32768.


T is slightly too big to fit on the stack; it therefore ends up as a static variable and race conditions ensue. The solution? Use the -frecursive flag or in the case of OpenMP programs, -fopenmp. In the context of building OpenMP enabled R packages such as 'sme', simply don't forget to set PKG_FFLAGS=$(SHLIB_OPENMP_FFLAGS) in your Makevars file...

sme R package now available on CRAN

Jun 26, 2012Posted by Maurice Berk

 
I've recently released my first R package, 'sme', for fitting and visualising smoothing-splines mixed-effects models. It's available on CRAN so you can install it from R using the install.packages() function as normal. There's a tutorial accompanying the package as a vignette and a full paper describing the package is in the works. If you have any feedback or need help please don't hesitate to email me.

PhD thesis now available online

Feb 16, 2012Posted by Maurice Berk

 
I'm pleased to report that I've passed my PhD viva, completed my corrections and submitted the final version of my thesis to Imperial College. You can download a PDF copy here, although be warned that it's 14Mb so you might want to save the link rather than open it in your browser.

In brief, my thesis is entitled "Statistical Methods for Replicated, High-Dimensional Biological Time Series" and covers the application of functional data analysis (FDA) techniques to genomics and metabolomics time course experiments. I start with simpler "single-level" models that are used to fit each gene probe or NMR spectral bin independently. This independence assumption is unlikely to hold in practice but it greatly simplifies the analysis and reduces the computational burden of fitting the model, so it is extremely proliferative within the literature. I then discuss the various issues surrounding hypothesis testing in a FDA context, and introduce my own interpretation of a "functional t-test", which performed well in simulation studies compared to existing approaches. These first two topics are covered in Berk et al. (2011). I then present more complex "multi-level" models capable of modelling all gene probes or NMR spectral bins simultaneously, leading to better parameter estimates and a useful low dimensional representation of the data. A pre-print of a paper describing one of these multi-level models can be found here.

I'm on Twitter

Jan 15, 2012Posted by Maurice Berk

 
I've finally taken the plunge and signed up for Twitter as @MauriceBerk. Follow me if you'd like more frequent updates on my research, reading list and outside interests.

Statistica paper now available on arXiv.org

Dec 14, 2011Posted by Maurice Berk

 
A couple of years ago I wrote an invited paper with my PhD supervisor, Giovanni Montana, for a small Italian journal called Statistica. This paper was entitled "Functional modelling of microarray time series with covariate curves" and described an extension I had developed for the functional mixed-effects model in order to incorporate covariates as functions of time. The paper is now finally available online here, thanks to arXiv.org.

Amazon Web Services in Education research grant awarded!

Dec 9, 2011Posted by Maurice Berk

 
I've been awarded an Amazon Web Services in Education research grant. Briefly, the project aims to build on the functional models and software for 'omics time series data analysis that I developed during the course of my PhD. Leveraging both Elastic MapReduce and Elastic Cloud Compute GPU instances I'll initially develop more efficient implementations of the existing models that can handle the very high dimensional data sets we're starting to see from fields such as medical imaging and next generation sequencing. Then I'll focus on developing more sophisticated models that are better able to estimate and summarise the temporal dynamics of the data sets. You can read the detailed project proposal here.

Preprint of latest paper now available on arXiv.org

Dec 6, 2011Posted by Maurice Berk

 
A preprint of my latest paper "A Skew-t-Normal Multi-Level Reduced-Rank Functional PCA Model with Applications to Replicated 'Omics Time Series Data Sets" is now available here. This paper describes a novel model I developed as part of my PhD for replicated metabolomics and genomics time series data sets.

Most existing methods for these type of data sets, such as Storey et al. (2005) or Liu and Yang (2009) model each variable (e.g. gene transcript) independently as this greatly simplifies the analysis and reduces the computational burden of fitting the data. On the other hand, this is biologically implausible as the variables are known to be highly correlated. The idea of multi-level models which could model all variables simultaneously are rare but not unheard of - see Zhou et al. (2010) for example - however, none of these have been illustrated on 'omics data sets.

Furthermore, multi-level functional principal component models such as those proposed by Zhou et al. (2010) generally assume that the principal component loadings are Normally distributed. During the course of our research, we discovered that in 'omics data sets this assumption is violated, which motivated the development of a model in which the loadings were instead Skew-t-Normally distributed. The Skew-t-Normal distribution is a flexible, four parameter distribution that can account for both skewness and heavy tails. Such an assumption complicates the analysis as a closed-form expression for the likelihood is challenging to obtain. Instead we turn to the Monte-Carlo Expectation-Maximisation (MCEM) algorithm for parameter estimation. Full technical details are given in the paper, particularly in the supplementary material.

About Me section added

Dec 1, 2011Posted by Maurice Berk

 
I've added an About Me section where you can read about my background and research interests.

Welcome!

Nov 25, 2011Posted by Maurice Berk

 
Right now this website is very much a work in progress. You're probably here to see my list of publications. You can also connect with me on LinkedIn or send me an email.