You could try 3dREMLfit with the -usetemp option, and a brain-only mask (from 3dAutomask) to cut down the number of voxels being worked on. An up-to-date version of 3dDeconvolve should generate the 3dREMLfit command into a file for you, which you can then edit to add various options.
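As a rough sketch of that suggestion, the two steps might look like the commands below. The dataset, matrix, and prefix names are made-up placeholders, and the +orig view is an assumption; substitute your own. The script only prints the commands unless DRY_RUN=0 is set, so you can check them before running anything.

```shell
#!/bin/sh
# Hypothetical names; replace with your own files.
epi="epi_run1+orig"     # input time series dataset (assumed name and view)
xmat="X.xmat.1D"        # design matrix saved by 3dDeconvolve (assumed name)
mask="brainmask"        # prefix for the mask dataset (assumed name)

# Step 1: make a brain-only mask from the EPI data.
mask_cmd="3dAutomask -prefix $mask $epi"

# Step 2: run 3dREMLfit inside the mask, with -usetemp to reduce memory use.
# -Obuck asks for the OLSQ statistics bucket only.
reml_cmd="3dREMLfit -matrix $xmat -input $epi -mask ${mask}+orig -usetemp -Obuck stats_OLSQ"

# Dry run by default: print the commands; set DRY_RUN=0 to execute them.
if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$mask_cmd"
    echo "$reml_cmd"
else
    $mask_cmd && $reml_cmd
fi
```

If 3dDeconvolve has already written a 3dREMLfit command file for you, it is probably easier to edit that file and add -mask and -usetemp there than to type the command from scratch.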
3dREMLfit can do OLSQ (ordinary least squares) like 3dDeconvolve (the '-O' output file options), and also pre-whitened least squares (the '-R' output file options). If you don't use any '-R' options, then its OLSQ computations should be essentially identical to those of 3dDeconvolve (although it uses a different algorithm for solving the linear systems) and just as fast. It should also use somewhat less memory when -usetemp is turned on.
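To illustrate the difference between the '-O' and '-R' output options, here is a small sketch that builds two variants of the command: one OLSQ-only, and one that also requests the pre-whitened results. File names and prefixes are placeholders I made up, and the commands are only printed, not run.

```shell
#!/bin/sh
# Hypothetical names; replace with your own files.
epi="epi_run1+orig"     # input time series dataset (assumed name)
xmat="X.xmat.1D"        # design matrix saved by 3dDeconvolve (assumed name)

base="3dREMLfit -matrix $xmat -input $epi"

# OLSQ only: no '-R' options, so results should match 3dDeconvolve closely.
olsq_cmd="$base -Obuck stats_OLSQ"

# OLSQ plus pre-whitened least squares: add a '-R' output as well.
both_cmd="$base -Obuck stats_OLSQ -Rbuck stats_REML"

echo "$olsq_cmd"
echo "$both_cmd"
```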
However, 3dREMLfit cannot yet handle a few options that 3dDeconvolve allows; in particular, -jobs, -allzero_OK, and -iresp/-sresp aren't available in 3dREMLfit.