Our long runs are rooted in large data sets rather than in any particular module being inefficient. The longer-running modules have tended to be motion correction/image registration, deconvolution, and some of the stats. If the goal is to run this in near real time, it might still be necessary to look at all the typical steps, including things like temporal and spatial smoothing, even though they are rarely major time consumers. (Besides - they look pretty easy to do :) )
Having read ahead in this thread: if we used MPI and ifdef'ed the calls in, it would be benign to the traditional single-CPU AFNI user community, and yet be portable to clusters, SMP systems, and massively parallel machines. MPI is a widely adopted de facto standard for this kind of processing. That would also leave the door open for future work to support grids. MPI is a message passing interface as well as a control structure, so both the "embarrassingly parallel" cases and the more complex fine-grained parallelism can be handled.
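
To make the ifdef idea concrete, here is a rough sketch of what I have in mind. It is not AFNI code: the USE_MPI macro, process_slice(), slice_range(), and process_volume() are all names made up for illustration, and the per-slice "work" is a placeholder. Built without -DUSE_MPI it is plain serial C; built with it (e.g. via mpicc), each rank takes a contiguous block of slices and the partial results are combined with MPI_Allreduce.

```c
#include <assert.h>
#include <stdio.h>
#ifdef USE_MPI            /* hypothetical build flag, not an existing AFNI option */
#include <mpi.h>
#endif

/* Placeholder for real per-slice work (smoothing, registration, ...). */
static double process_slice(int slice)
{
    return (double)slice * 2.0;
}

/* Split nslices into nprocs contiguous blocks; the first
   (nslices % nprocs) ranks get one extra slice.
   The block for this rank is [*first, *last). */
static void slice_range(int nslices, int rank, int nprocs,
                        int *first, int *last)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int base  = nslices / nprocs;
    int extra = nslices % nprocs;
    *first = rank * base + (rank < extra ? rank : extra);
    *last  = *first + base + (rank < extra ? 1 : 0);
}

/* Sum the per-slice results over the whole volume.
   In an MPI build the caller is assumed to have called MPI_Init(). */
static double process_volume(int nslices)
{
    int rank = 0, nprocs = 1;   /* serial defaults */
#ifdef USE_MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
#endif

    int first, last;
    slice_range(nslices, rank, nprocs, &first, &last);

    double local = 0.0;
    for (int s = first; s < last; s++)
        local += process_slice(s);

    double total = local;       /* serial build: local result is the total */
#ifdef USE_MPI
    /* Combine partial sums so every rank ends up with the full result. */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
#endif
    return total;
}
```

In an MPI build, main() would wrap the call in MPI_Init(&argc, &argv) and MPI_Finalize(), also under the same ifdef, so the single-CPU build never sees an MPI symbol. This only covers the embarrassingly parallel case; anything needing neighbor exchange between slices would need explicit sends and receives on top of it.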