The problem comes from the fact that the output dataset is created in memory and only written to disk at the end. So if the output dataset gets HUGE, you run out of memory (which is usually what "killed" means).
To get around this, you could do (as you suggest) a separate 3dNwarpApply for each TR. However, that would be slow, with a lot of file I/O.
You could instead process a bunch of TRs at a time, say 50. For example (in tcsh syntax):
set nvals = `3dnvals pb02.$subj.r01.tshift+orig`   # number of time points
@ nvm1 = $nvals - 1                                # index of the last volume
set jbot = 0 ; set jstep = 50                      # chunk start index and chunk size
while( $jbot < $nvals )
  set ppp = `count -dig 6 $jbot $jbot`             # zero-padded label for this chunk
  @ jtop = $jbot + $jstep - 1
  if( $jtop > $nvm1 ) set jtop = $nvm1             # don't run past the last volume
  3dNwarpApply \
    -nwarp "$anat_warp whole_matrix_volreg_epi2anat.aff12.1D{$jbot..$jtop}" \
    -source pb02.$subj.r01.tshift+orig"[$jbot..$jtop]" \
    -dxyz 3.5 -prefix rm.epi.nomask.$ppp -master ~/abin/MNI152_T1_2009c+tlrc
  @ jbot = $jtop + 1
end
3dTcat -prefix rm.epi.nomask "rm.epi.nomask.0*.HEAD"
In the above (which is not guaranteed to work, since I haven't tested it), I'm using row selectors "{a..b}" on the matrix file to select a subset of rows to read in, and the corresponding volume (sub-brick) selectors on the input dataset "[a..b]" to select a subset of 3D volumes to process. At the end, all the partial results are 3dTcat-ed together.
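If you want to sanity-check the chunking indices before launching the (slow) AFNI commands, the loop arithmetic can be exercised on its own. Here is the same index logic as a sketch in plain sh, with printf standing in for AFNI's count program and the 3dNwarpApply call replaced by an echo; the 144-volume count is just a made-up example, not from your data:

```shell
#!/bin/sh
# Same chunking arithmetic as the tcsh loop above, minus the AFNI calls.
# nvals would come from `3dnvals <dataset>`; 144 here is a made-up example.
nvals=144
nvm1=$((nvals - 1))
jbot=0
jstep=50
while [ "$jbot" -lt "$nvals" ]; do
  ppp=$(printf '%06d' "$jbot")      # zero-padded label, like `count -dig 6`
  jtop=$((jbot + jstep - 1))
  [ "$jtop" -gt "$nvm1" ] && jtop=$nvm1   # don't run past the last volume
  # Here is where 3dNwarpApply would process volumes [$jbot..$jtop]
  echo "chunk $ppp: volumes $jbot..$jtop"
  jbot=$((jtop + 1))
done
```

For 144 volumes this prints three chunks (0..49, 50..99, 100..143), so you can confirm the last chunk is correctly truncated before committing to the real run.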
I hope the above helps (and works).