The calculation of the local neighborhood (nbhd) FWHM estimates is what takes the most time. I suppose this could be parallelized, but I'm not likely to do it this weekend.
The other difference (besides edge/mask effects) between 3dmerge blurring and 3dBlurToFWHM is that the latter blurs until the global FWHM estimate of the data reaches the target value, whereas the former just adds a fixed amount of smoothness (including past the brain edge).
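To make "just adds smoothness" concrete, here is a rough sketch of the arithmetic, assuming the data's smoothness is approximately Gaussian so that FWHMs combine in quadrature. The helper name fwhm_after_blur and the starting smoothness values (4 and 7 mm) are made up for illustration; real data will only follow this approximately.

```shell
#!/bin/sh
# Hypothetical helper: approximate FWHM after adding a Gaussian blur,
# assuming Gaussian smoothness (FWHMs add in quadrature).
# Usage: fwhm_after_blur INTRINSIC_FWHM_MM ADDED_FWHM_MM
fwhm_after_blur() {
    awk -v a="$1" -v k="$2" 'BEGIN { printf "%.2f\n", sqrt(a*a + k*k) }'
}

# The same 6 mm added blur ends at different final smoothness,
# depending on how smooth the data were to begin with.
fwhm_after_blur 4 6   # prints 7.21
fwhm_after_blur 7 6   # prints 9.22
```

So two datasets given the same added blur can end up with different final smoothness, whereas blurring to a target aims for the same final value in both.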
So we have the following hierarchy:
1) 3dBlurToFWHM with local estimates == respects a mask, does local and global smoothness estimates/control
2) 3dBlurToFWHM with '-nbhd NULL' == respects a mask, does global smoothness estimate/control
3) 3dBlurInMask == respects a mask
4) 3dmerge -1blur_fwhm == just does blurring
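For reference, the four levels above might map to command lines roughly as follows. This is only a sketch: the helper blur_cmd, the dataset names, and the 6 mm value are placeholders, and the use of -automask (rather than an explicit -mask dataset) is my assumption. Level 1 relies on what I believe is the default local neighborhood in 3dBlurToFWHM. Check each program's -help output before relying on any of the options. The function only prints the command, it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a command line for one level of the hierarchy.
# Usage: blur_cmd LEVEL INPUT PREFIX FWHM_MM
blur_cmd() {
    level=$1; input=$2; prefix=$3; fwhm=$4
    case "$level" in
        1) # mask + local and global smoothness control (default neighborhood assumed)
           echo "3dBlurToFWHM -input $input -prefix $prefix -FWHM $fwhm -automask" ;;
        2) # mask + global smoothness control only
           echo "3dBlurToFWHM -input $input -prefix $prefix -FWHM $fwhm -automask -nbhd NULL" ;;
        3) # mask only; FWHM here is the amount of blur added
           echo "3dBlurInMask -input $input -prefix $prefix -FWHM $fwhm -automask" ;;
        4) # plain blurring, no mask
           echo "3dmerge -1blur_fwhm $fwhm -doall -prefix $prefix $input" ;;
        *) echo "unknown level: $level" >&2; return 1 ;;
    esac
}

# Example: show the command for level 2 with placeholder names.
blur_cmd 2 epi+orig epi_blur 6
```

Note that in levels 1 and 2 the FWHM value is the smoothness to be reached, while in levels 3 and 4 it is the amount of blurring applied.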