Hi Nic,
The program estimates the delay between each voxel time series and the reference time series you used. The references you compare have slightly differing shapes but most importantly, have different time offset.
Since the program looks for positive delays, I suggest you do the following. Start from a square wave that starts a couple of TRs earlier than your actual stimulus timing square wave, convolve it as you did with 3dDeconvolve to get a more hemodynamic looking function, then use that function in the 3ddelay program. I don't quite see a point in using a square wave even if others have used it before.
cheers,
Ziad