posted on 08:16 PM on Tuesday 17 September 2013
I have been working with pretty large data sets for statistical testing recently. Large as in millions of comparisons, which takes a lot of time. In the past I used plyr with the doMC parallelization backend, and that took a long time: the tests themselves finished fast enough, but the aggregation took nearly forever. I then switched to foreach with doMC, which was better, but not by much. The initial plyr code took more than 2 hours; switching to foreach and optimizing brought that down to about 1 hour. Considering I was using 10 cores, this was still slow, when my laptop running another program could finish twice as many comparisons in a bit over 3 hours on 1 core. Finally, I split the jobs using Python and ran them as separate processes, which brought it down to 15 minutes on 10 cores. That was at least acceptable performance.

R has some really great libraries, but getting scalable performance seems so difficult. One careless use of a data structure can result in huge slowdowns. With big data becoming more commonplace, is there a better statistical alternative to R? I tried pqR, and it was not too bad, definitely faster than standard R, but probably not fast enough. I have workloads of more than 3 billion comparisons that still require more than a week of compute using Java with threading. Using R for that would be insane, but coding in R is so much quicker because the libraries are all there, whereas in Java I have to write everything myself. It is a serious dilemma.

For now, I will stick to writing single-core R code and then parallelizing it with old-fashioned processes in Python. That seems to be the best-performing approach so far.
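For the curious, the "split the jobs in Python and run separate processes" step can be sketched with the standard library's multiprocessing module. The worker below is a hypothetical stand-in for the per-chunk statistical test so the sketch stays runnable on its own; in my actual setup each worker would instead launch a single-core R script on its chunk (e.g. via `subprocess.run(["Rscript", ...])`). The chunking scheme and the toy workload are my own illustration, not a fixed recipe.

```python
from multiprocessing import Pool

def run_chunk(chunk):
    # Hypothetical stand-in for the real per-chunk work (in practice,
    # this would invoke a single-core Rscript on the chunk). Here it
    # just "tests" each pair by summing it, so the sketch is runnable.
    return [a + b for a, b in chunk]

def split(pairs, n_chunks):
    # Deal the comparisons round-robin into n_chunks roughly equal chunks,
    # so each process gets a similar amount of work.
    chunks = [[] for _ in range(n_chunks)]
    for i, pair in enumerate(pairs):
        chunks[i % n_chunks].append(pair)
    return chunks

if __name__ == "__main__":
    pairs = [(i, i + 1) for i in range(1000)]  # toy comparison list
    with Pool(processes=10) as pool:           # one process per core
        results = pool.map(run_chunk, split(pairs, 10))
    flat = [r for chunk in results for r in chunk]
    print(len(flat))
```

Because each chunk runs in its own OS process, there is no shared R state and no aggregation bottleneck; the only coordination cost is collecting the per-chunk results at the end.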