3dttest++ -Clustsim has now been tested against 1-sample problems. The false positive rates (FPRs) were pretty close to the nominal 5%. Sometime "soon" I will add those tables to our bioRxiv manuscript.
I also ran 2-sample tests with the -unpooled option, and these FPRs were also close to the nominal 5%.
With -covariates (1- and 2-sample), the main effect FPRs were again in the 5% range. However, the FPRs for the covariate effect were not always so good -- in some cases well above 5% (10-11%), and in some cases well below 5% (1-2%) -- that is, applying the cluster-size thresholds to the covariate statistics can be too liberal or too conservative.
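A rough sanity check on how far an estimated FPR can drift from 5% by chance alone. This is only a sketch, and it assumes each reported FPR is the fraction of 1000 independent tests that produced a false positive (the normal-approximation binomial interval is my choice here, not something taken from 3dttest++). The function name is made up for illustration.

```python
import math

def binomial_fpr_interval(nominal=0.05, n_tests=1000, z=1.96):
    """Approximate 95% range for an FPR estimated from n_tests independent
    tests when the true rate equals the nominal one (normal approximation)."""
    se = math.sqrt(nominal * (1.0 - nominal) / n_tests)
    return nominal - z * se, nominal + z * se

lo, hi = binomial_fpr_interval()
print(f"expected range under the null: {100*lo:.2f}% to {100*hi:.2f}%")
```

Under that assumption the range is roughly 3.6% to 6.4%, so covariate FPRs of 10-11% or 1-2% are well outside what sampling noise would explain.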
This effect is probably due to the fact that the cluster-size thresholds are derived from the main effect t-statistics in the randomized samples, and not from the covariate effect t-statistics. Unfortunately, this implies that different cluster-size thresholds need to be used for different statistics, even when they come from the same data.
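To make the point concrete, here is a minimal sketch of how a cluster-size threshold falls out of a set of null statistic maps. It is NOT the 3dttest++ -Clustsim code: it uses 1-D "images" where a cluster is just a run of adjacent supra-threshold voxels, and every name and number in it is hypothetical. The point is only that the threshold is a property of whichever null maps you feed in, so main-effect and covariate null maps would each need their own pass.

```python
import numpy as np

def max_cluster_size_1d(supra):
    """Length of the longest run of True values in a 1-D boolean sequence."""
    best = run = 0
    for above in supra:
        run = run + 1 if above else 0
        best = max(best, run)
    return best

def cluster_size_threshold(null_maps, voxel_thresh, alpha=0.05):
    """Smallest cluster size C such that at most a fraction alpha of the
    null maps contain a cluster of size >= C."""
    max_sizes = np.array(
        [max_cluster_size_1d(np.abs(m) > voxel_thresh) for m in null_maps]
    )
    c = 1
    while np.mean(max_sizes >= c) > alpha:
        c += 1
    return c

# Toy usage: smoothed Gaussian noise standing in for randomized statistic maps
# (assumption: a 5-point boxcar blur, purely for illustration).
rng = np.random.default_rng(0)
noise = rng.standard_normal((1000, 200))
kernel = np.ones(5) / np.sqrt(5.0)          # keeps unit variance after blurring
null_maps = np.array([np.convolve(row, kernel, mode="same") for row in noise])
print(cluster_size_threshold(null_maps, voxel_thresh=2.0, alpha=0.05))
```

Running the same function on null maps built from a different statistic would generally give a different number, which is the problem described above.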
Each set of tests above is with the 16,000 cases outlined in the Eklund PNAS paper -- 4 levels of blurring, times 4 pseudo-stimuli, times 1000 t-tests; each set of tests took about 3 days to run on the NIH compute cluster.
At this time, I have not modified 3dttest++ to allow the generation of cluster-size thresholds from covariate statistics. Not that this would be hard, but to use this in the AFNI Clusterize GUI will also require revamping the internals of AFNI, which at present only allows for 1 set of cluster-size threshold tables to apply to the entire dataset -- I'd have to allow for multiple cluster-size threshold sets and tag each such set to apply to a specific statistic sub-brick. My reaction to this is -- UGH.
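For what it's worth, the bookkeeping being described might look something like the sketch below. This is a Python illustration of the idea only; AFNI's internals are not written this way, and the class, field names, and threshold values are all invented for the example. Each statistic sub-brick can carry its own threshold table, with a fallback to the single dataset-wide table that exists today.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# (per-voxel p threshold, alpha) -> minimum cluster size in voxels
ThresholdTable = Dict[Tuple[float, float], int]

@dataclass
class DatasetThresholds:
    """Hypothetical container: one default table plus optional per-sub-brick tables."""
    default_table: Optional[ThresholdTable] = None
    per_subbrick: Dict[int, ThresholdTable] = field(default_factory=dict)

    def table_for(self, subbrick: int) -> Optional[ThresholdTable]:
        # Prefer the table tagged to this sub-brick; otherwise use the dataset-wide one.
        return self.per_subbrick.get(subbrick, self.default_table)

    def min_cluster_size(self, subbrick: int, pthr: float, alpha: float) -> Optional[int]:
        table = self.table_for(subbrick)
        return None if table is None else table.get((pthr, alpha))

# Made-up numbers: sub-brick 1 = main effect t-stat, sub-brick 3 = covariate t-stat.
thr = DatasetThresholds(default_table={(0.01, 0.05): 40})
thr.per_subbrick[3] = {(0.01, 0.05): 55}
print(thr.min_cluster_size(1, 0.01, 0.05), thr.min_cluster_size(3, 0.01, 0.05))
```

The data structure is the easy part; the work is in getting the GUI to pick the right table whenever the threshold sub-brick changes.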
I am now working on developing an idea to allow for spatially variable cluster-size thresholds, to deal with the effects caused by non-uniform smoothness in the FMRI noise. So far, I'm just scratching on paper, trying to formalize ideas I developed while backpacking in the Rockies 2 weeks ago, and trying to figure out how these moderately complex ideas can be implemented semi-efficiently. Not close to creating code yet, much less testing it for FPR or (even more complex) for power. Coding will take several weeks, and testing will take more weeks -- at best (and with software, "best" never happens). In other words, don't hold your breath -- unless you are Michael Phelps or Katie Ledecky.