Re: Modeling the BOLD5000 dataset

RWCox

June 10, 2021 11:17AM


If you wish to analyze such a large dataset as a whole, there are some options in AFNI. Most of them will require running on a system with enough RAM to hold the entire dataset easily. However, if you are industrious enough, you can break the dataset into pieces to reduce the memory footprint required at any given point.

No matter what you do, you will have to manage the analysis yourself, as I doubt that you can get afni_proc.py to do the work for you -- unless you have access to a computer system with a LOT of memory.

The NIH Biowulf cluster, for example, has several systems with over 1 TB of RAM. Probably that would be more than you need, but a few hundred GB would be comforting to have. If you have access to a system with 200+ GB of RAM, that is what I would try first, using the "Big Dataset" recommendations below.
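To get a feel for how much RAM is at stake, here is a back-of-the-envelope sketch in Python. The grid size is a made-up placeholder, not the real BOLD5000 grid; only the 30000 time points figure comes from this thread. Keep in mind that a program usually needs room for its outputs and working arrays on top of the input, so treat the result as a lower bound.

```python
# Rough RAM estimate for holding one 3D+time dataset in memory.
# The grid size used below is a hypothetical placeholder, NOT the
# actual BOLD5000 grid; substitute the real dimensions.

def dataset_gigabytes(nx, ny, nz, nt, bytes_per_value=4):
    """Size in GB (10**9 bytes) of an nx * ny * nz * nt array."""
    return nx * ny * nz * nt * bytes_per_value / 1e9

# Hypothetical 100 x 100 x 70 grid with 30000 time points
as_floats = dataset_gigabytes(100, 100, 70, 30000, bytes_per_value=4)
as_shorts = dataset_gigabytes(100, 100, 70, 30000, bytes_per_value=2)
print(f"float32: {as_floats:.0f} GB, int16: {as_shorts:.0f} GB")
```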

Using an SSD (solid state drive) to store the data and all intermediate files will help with processing, as access to such data is a lot faster than to data stored on rotating physical disks. On the NIH cluster, each node has an SSD that can be used for temp storage as a job runs. In our usual processing script for running on the cluster, we copy data over to the SSD at the start of a job, process it there, and then copy the results back to permanent storage at the end of the job. This procedure can speed things up significantly.
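The copy-in, process, copy-out pattern can be sketched as follows. This is not our actual cluster script, just a minimal Python illustration; the function name, the "results" subdirectory convention, and the scratch_root argument (which would point at the SSD on your system) are all assumptions made for the example.

```python
import shutil
import tempfile
from pathlib import Path

def run_on_scratch(input_dir, results_dir, process, scratch_root=None):
    """Copy input_dir to fast scratch storage, run process(workdir) there,
    then copy the contents of workdir/results back to results_dir.

    scratch_root is a hypothetical path to the SSD; None means the
    system default temporary directory.
    """
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        work = Path(tmp) / "work"
        shutil.copytree(input_dir, work)        # stage data onto the SSD
        (work / "results").mkdir(exist_ok=True)
        process(work)                           # heavy I/O happens on scratch
        # copy results back to permanent storage before scratch is deleted
        shutil.copytree(work / "results", results_dir, dirs_exist_ok=True)
```

Here process is whatever callable runs your AFNI commands inside the working directory and leaves its outputs in the results subdirectory. The scratch copy is deleted automatically when the function returns.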

Two pieces of advice for the case where you try to process this collection as one giant dataset:

One Big Dataset Recommendation 1: Create the dataset in AFNI .HEAD/.BRIK uncompressed format and process it when it is stored on an SSD. AFNI programs can load such datasets into memory using Unix memory-mapping (mmap) of the .BRIK file, which will reduce the RAM required. But you will still need a fair amount of memory.
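To illustrate why memory-mapping helps, here is a toy Python sketch. It is not AFNI's code. It writes a tiny file laid out volume after volume (the way sub-bricks are stored in a .BRIK file) and then reads one voxel's time series through a memory map, so only the pages holding the requested values need to be brought into RAM. The sizes, the float32 little-endian layout, and the file name are assumptions for the example; a real .BRIK's data type and byte order are described in its .HEAD file.

```python
import mmap
import os
import struct
import tempfile

# Toy file: nt volumes of nvox float32 values, stored volume after volume.
# Sizes are tiny placeholders for illustration only.
nvox, nt = 1000, 50
path = os.path.join(tempfile.mkdtemp(), "toy.BRIK")
with open(path, "wb") as f:
    for t in range(nt):
        # every voxel in volume t holds the value t
        f.write(struct.pack(f"<{nvox}f", *([float(t)] * nvox)))

def voxel_time_series(path, voxel, nvox, nt):
    """Read one voxel's time series through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # byte offset of this voxel in volume t is 4 * (t * nvox + voxel)
            return [struct.unpack_from("<f", mm, 4 * (t * nvox + voxel))[0]
                    for t in range(nt)]

series = voxel_time_series(path, voxel=123, nvox=nvox, nt=nt)
```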

One Big Dataset Recommendation 2: For time series regression, use 3dREMLfit with a mask to restrict processing to brain voxels. Also use the -usetemp option if the program complains about running out of memory, or if it dies with the cryptic message "Killed". That message comes not from the program but from the Unix system itself, which will kill a program that is causing trouble -- and that trouble is almost always caused by demanding way too much memory.
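As a rough sketch of what such a call might look like, the Python snippet below assembles a 3dREMLfit command line with a mask and -usetemp. All dataset, mask, and matrix names are placeholders; the matrix file is assumed to have been generated beforehand (typically by 3dDeconvolve). The snippet only builds and prints the command; pass the list to subprocess.run to execute it.

```python
import shlex

def remlfit_command(input_dset, matrix, mask, prefix, use_temp=True):
    """Build a 3dREMLfit argument list. All file names are placeholders."""
    cmd = ["3dREMLfit",
           "-input", input_dset,
           "-matrix", matrix,
           "-mask", mask,                    # restrict fitting to brain voxels
           "-Rbuck", f"{prefix}_REML",       # statistics output
           "-Rvar", f"{prefix}_REMLvar"]     # noise model parameters output
    if use_temp:
        cmd.append("-usetemp")               # use temp files to reduce RAM use
    return cmd

cmd = remlfit_command("all_runs+orig", "X.xmat.1D", "brain_mask+orig", "stats")
print(shlex.join(cmd))
```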

Two pieces of advice for how to deal with the collection in pieces. This will require a lot of management from you, as afni_proc.py will not manage the process outlined below. You'll have to develop your own processing script (perhaps inspired by afni_proc.py's script) to deal with the Recommendations below in the midst of the processing stream.

Smaller Datasets Recommendation 1: If you want to blur the dataset (spatially), this operation does not deal with data along the time axis -- so you can create pieces that are full in space (the entire volume) but small in time (segments). Blur these pieces in a script separately. Any other spatial-only processing can be done on these pieces as well.
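One way to script the time-segment blurring is sketched below in Python. It computes inclusive sub-brick ranges and builds one 3dmerge blur command per segment, using AFNI's sub-brick selector syntax on the input. The dataset name, output prefixes, segment length, and blur width are placeholders, and 3dmerge is only one of several AFNI blurring programs you could substitute. How you recombine or further cut up the blurred pieces is left out.

```python
def time_segments(nt, seg_len):
    """Inclusive (first, last) sub-brick index pairs covering 0..nt-1."""
    return [(start, min(start + seg_len, nt) - 1)
            for start in range(0, nt, seg_len)]

def blur_commands(dset, nt, seg_len, fwhm_mm):
    """One 3dmerge blur command per time segment (names are placeholders)."""
    cmds = []
    for i, (first, last) in enumerate(time_segments(nt, seg_len)):
        cmds.append(["3dmerge",
                     "-1blur_fwhm", str(fwhm_mm),
                     "-doall",                       # blur every sub-brick
                     "-prefix", f"blur_seg{i:03d}",
                     f"{dset}[{first}..{last}]"])    # sub-brick selector
    return cmds

for cmd in blur_commands("all_runs+orig", nt=30000, seg_len=1000, fwhm_mm=6):
    pass  # e.g. subprocess.run(cmd, check=True)
```

Passing each command to subprocess.run as a list avoids having to quote the square brackets for the shell.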

Smaller Datasets Recommendation 2: The time series regression processes each voxel separately. So you can break the dataset into individual slices, and process the full time series of 30000 points for each slice dataset separately. The programs 3dZcutup and 3dZcat can be used to manage the processes of cutting up the slices and re-gluing them (the 3dREMLfit outputs).
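The slice-by-slice scheme might be scripted along these lines. This Python sketch only builds the command lists; dataset names, prefixes, the "+orig" view, and the matrix file name are placeholders, and the mask is omitted for brevity (if you use one, it has to be cut into matching slices too).

```python
def slice_pipeline(dset, nz, matrix, view="+orig"):
    """Commands to cut a dataset into slices, fit each, and re-glue the stats.

    All names are placeholders; view is the assumed AFNI view suffix.
    """
    cmds, stats = [], []
    for k in range(nz):
        piece = f"slice{k:03d}"
        stat = f"stats_slice{k:03d}"
        # extract slice k (all time points) into its own dataset
        cmds.append(["3dZcutup", "-keep", str(k), str(k),
                     "-prefix", piece, dset])
        # same regression matrix for every slice
        cmds.append(["3dREMLfit", "-input", piece + view,
                     "-matrix", matrix, "-Rbuck", stat])
        stats.append(stat + view)
    # glue the per-slice statistics back together, in slice order
    cmds.append(["3dZcat", "-prefix", "stats_all", *stats])
    return cmds

commands = slice_pipeline("all_runs_blur+orig", nz=3, matrix="X.xmat.1D")
```

Because the slices are independent, the per-slice 3dREMLfit jobs could also be run in parallel if enough cores are available.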

Best of luck. Unless you are adroit at scripting, the "Big Dataset" approach is going to be the way to go. All it requires is a big computer.

