AFNI Pre-Built Binaries that Use OpenMP
Several sets of AFNI binaries are pre-compiled with OpenMP support. At this time [Jan 2011], these are:
Introduction
In the last couple months, I have explored using the parallelization toolkit OpenMP
in AFNI. OpenMP is built into several compilers, including gcc 4.2. OpenMP comprises
a set of #pragma-s, functions, etc., that are used to control parallel threads
on a shared-memory multi-CPU system (e.g., my octo-core Mac).
An OpenMP tutorial can be found at https://computing.llnl.gov/tutorials/openMP/ and the OpenMP 2.5 spec can be found at http://www.openmp.org/mp-documents/spec25.pdf -- both of these are valuable documents with sample code. Another interesting document is http://www.hpcwire.com/features/17884424.html, which discusses OpenMP vs. MPI and an extended version of OpenMP called Cluster OpenMP from Intel.
OpenMP should not be confused with Open MPI, which is a completely different piece of software for parallelization! MPI (= Message Passing Interface) lets you distribute work across multiple CPUs and multiple systems by explicitly passing messages back and forth between processes/threads. MPI is more flexible than OpenMP in that it allows a whole network of computers to be harnessed into one job.
MPI is also harder to use than OpenMP, in that the programmer has to manage the inter-thread data flow very explicitly. In OpenMP, the data model is that some data is explicitly in shared memory, visible to be read and written by each thread by normal C statements, and some data is explicitly private to each thread, and when modified in one thread that change will not be visible to the other threads. In OpenMP, the principal burden on the programmer is to make sure that the shared data is written to in a coherent manner by the different threads. In MPI, the burden on the programmer is to explicitly determine when data held in one thread is communicated to another thread. Converting an existing program to use OpenMP is generally much simpler than making it work with MPI.
Later: I now keep this document up-to-date with the list of OpenMP-ized AFNI programs, and I add comments about newish issues that arise as I do more with OpenMP. (These later comments are not dated, so the future historians of AFNI will have to suffer.)
Memory Concepts
The two most important things to realize about OpenMP's parallel model for computation are:
Compiling to use OpenMP
In your AFNI Makefile, you have to set the variable OMPFLAG to the flag(s) the
compiler/linker needs for OpenMP, plus you have to enable OpenMP inside the AFNI
code by defining the C macro USE_OMP. For gcc 4.2, my Makefile contains
the single extra line
OMPFLAG = -fopenmp -DUSE_OMP
Some AFNI Makefile.*-s now contain definitions of OMPFLAG
so that the binary builds include some parallelized programs. At this
moment, these programs are:
All of the above programs, except for 3dGroupInCorr [Dec 2009] and 3dClustSim [Jul 2010], were originally written as serial programs, and later converted to OpenMP. Some of the problems arising in these conversions are outlined below. It is much easier to write a program from scratch to use OpenMP than to retrofit it later!
In AFNI's Makefile.INCLUDE, files that are to be compiled/linked with OpenMP use CCOMP instead of CC or CCFAST for the compilation/linking. You'll have to add an extra make rule for each file that needs this special handling. N.B.: Any program that calls an OpenMP-enabled function, even if the main program knows nothing about OpenMP, must be linked with CCOMP to get the proper libraries included.
At the top of any C source file that's going to use OpenMP #pragma-s or function calls, you should put a code block like so:
#ifdef USE_OMP
#include <omp.h>
#endif

By default, OpenMP will use all CPUs available in any parallel block of code. This behavior can be changed by setting the environment variable OMP_NUM_THREADS to some smaller integer value. (There are also OpenMP library functions that let you control this from within a program, but I've not used these.)
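For illustration, here is a minimal sketch (not taken from any AFNI program) of the library-function route mentioned above, using the standard OpenMP calls omp_get_max_threads(), omp_set_num_threads(), omp_get_thread_num(), and omp_get_num_threads():

#ifdef USE_OMP
#include <omp.h>
#endif
#include <stdio.h>

int main( void )
{
#ifdef USE_OMP
   printf("max threads available = %d\n" , omp_get_max_threads() ) ;
   omp_set_num_threads(2) ;      /* cap subsequent parallel blocks at 2 threads */
#pragma omp parallel
   { printf("hello from thread %d of %d\n" ,     /* lines from different threads */
            omp_get_thread_num() , omp_get_num_threads() ) ; }  /* may interleave */
#else
   printf("this copy was compiled without OpenMP\n") ;
#endif
   return 0 ;
}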
Example #1 -- Something Simple
My first example is a function from mri_genalign_util.c. The function
below does linear interpolation from the input image fim at
npp output points whose index coordinates are given in input
arrays ip, jp, and kp, storing the results
into user-allocated array vv. (Rather than make up a trivial
sample case, I'm showing a real piece of code that's used in
3dAllineate.) The two OpenMP directives are the #pragma omp lines in the code below:
#define FAR(i,j,k) far[(i)+(j)*nx+(k)*nxy]
#define CLIP(mm,nn) if(mm < 0)mm=0; else if(mm > nn)mm=nn

void GA_interp_linear( MRI_IMAGE *fim ,
                       int npp, float *ip, float *jp, float *kp, float *vv )
{
ENTRY("GA_interp_linear") ;

 #pragma omp parallel if(npp > 9999)
 {
   int nx=fim->nx , ny=fim->ny , nz=fim->nz , nxy=nx*ny , pp ;
   float nxh=nx-0.501f , nyh=ny-0.501f , nzh=nz-0.501f , xx,yy,zz ;
   float fx,fy,fz ;
   float *far = MRI_FLOAT_PTR(fim) ;
   int nx1=nx-1,ny1=ny-1,nz1=nz-1 ;
   float ix,jy,kz ;
   int ix_00,ix_p1 ;       /* interpolation indices */
   int jy_00,jy_p1 ;
   int kz_00,kz_p1 ;
   float wt_00,wt_p1 ;     /* interpolation weights */
   float f_j00_k00, f_jp1_k00, f_j00_kp1, f_jp1_kp1, f_k00, f_kp1 ;

 #pragma omp for
   for( pp=0 ; pp < npp ; pp++ ){
     xx = ip[pp] ; if( xx < -0.499f || xx > nxh ){ vv[pp]=outval; continue; }
     yy = jp[pp] ; if( yy < -0.499f || yy > nyh ){ vv[pp]=outval; continue; }
     zz = kp[pp] ; if( zz < -0.499f || zz > nzh ){ vv[pp]=outval; continue; }

     ix = floorf(xx) ;  fx = xx - ix ;   /* integer and       */
     jy = floorf(yy) ;  fy = yy - jy ;   /* fractional coords */
     kz = floorf(zz) ;  fz = zz - kz ;

     ix_00 = ix ; ix_p1 = ix_00+1 ; CLIP(ix_00,nx1) ; CLIP(ix_p1,nx1) ;
     jy_00 = jy ; jy_p1 = jy_00+1 ; CLIP(jy_00,ny1) ; CLIP(jy_p1,ny1) ;
     kz_00 = kz ; kz_p1 = kz_00+1 ; CLIP(kz_00,nz1) ; CLIP(kz_p1,nz1) ;

     wt_00 = 1.0f-fx ; wt_p1 = fx ;   /* weights for ix_00 and ix_p1 points */

#undef  XINT
#define XINT(j,k) wt_00*FAR(ix_00,j,k)+wt_p1*FAR(ix_p1,j,k)

     /* interpolate to location ix+fx at each jy,kz level */

     f_j00_k00 = XINT(jy_00,kz_00) ; f_jp1_k00 = XINT(jy_p1,kz_00) ;
     f_j00_kp1 = XINT(jy_00,kz_p1) ; f_jp1_kp1 = XINT(jy_p1,kz_p1) ;

     /* interpolate to jy+fy at each kz level */

     wt_00 = 1.0f-fy ; wt_p1 = fy ;
     f_k00 = wt_00 * f_j00_k00 + wt_p1 * f_jp1_k00 ;
     f_kp1 = wt_00 * f_j00_kp1 + wt_p1 * f_jp1_kp1 ;

     /* interpolate to kz+fz to get output */

     vv[pp] = (1.0f-fz) * f_k00 + fz * f_kp1 ;
   }
 } /* end OpenMP */

 EXRETURN ;
}
The directive
#pragma omp parallel if(npp > 9999)
is at the head of a "structured block" of code that will be executed
in parallel. (A "structured block" in OpenMP-lingo means a C
{...} block that contains no way out (e.g., no return)
except to fall through the bottom.)
You should imagine all the code inside this block being executed simultaneously
on multiple CPUs -- with each thread, by default, doing exactly the same thing.
To get different things done on different CPUs, we need the second directive,
which specifies the "work-sharing".
In the above code, I've declared all the internal variables used in the
function inside the parallel block. This means that these variables
are all private to each thread. Assignments to any one of these in one thread
will have no impact on the other threads. Declaring variables like this is
the easy way to make sure they are thread-private and won't accidentally conflict.
It is also possible to declare outside variables to be thread-private
in the parallel #pragma, but I'd rather skip that -- doing it
with private declarations, as above, is simpler to think about and to program.
Thus, for example, the pointer
float *far = MRI_FLOAT_PTR(fim) ;
will have N identical copies spread around amongst the N threads.
This is slightly inefficient with respect to memory usage -- since far
is never changed after the initial assignment, it could be a shared variable
declared and initialized outside the parallel block -- but unless the
amount of memory duplication is huge, the rule "for most variables used and
assigned to in the parallel block, declare them inside the parallel block"
is the simplest to code with. The only exception would be variables whose
values you wish to preserve when the parallel block ends and normal
(sequential) program execution resumes -- typically, these variables are
output arrays.
Note that the if(npp > 9999) part of the parallel directive means that the code will actually only be parallelized if the number of points to interpolate at one shot is 10,000 or more. There is no point in parallelizing at too fine a level -- the thread startup and management overhead will be too large to get any net program speedup from the OpenMP-ization. (You might well ask where I got the number 9999 from -- the answer is that I just made it up -- I didn't actually test the function to see where the breakeven point for npp might lie.)
The directive
#pragma omp for
indicates that the next for statement should be parallelized across the
threads that were started with the parallel directive -- that is,
that different threads should get different subsets of the index pp as
it ranges over 0..npp-1. In this code, pp is the voxel
index into the input arrays ip, jp, and kp, and into the
output array vv. The goal of the loop body is to compute vv[pp].
Each different value of pp writes to a different output location, so there is no
conflict possible even if two threads were executing the same statement at exactly the
same time (something you always have to think about).
For a for statement to be parallelizable, the number of iterations must be
easily determinable when the loop is started. In this case, it is obviously npp.
For example, a loop of the form
for( pp=0 ; pp < npp && vv[pp] != 666.0f ; pp++ ){ ... }
cannot be parallelized by OpenMP since the terminal condition is not determinable
when the loop is started. Similarly, the loop body cannot contain any break statements.
This function (and its analogs for NN, cubic, quintic, and wsinc5 interpolation) was pretty easy to adapt to OpenMP. I simply moved all the local variables into the parallel block, and that was about it. The only write to a variable visible outside the parallel block is to vv[pp], and there is obviously no possible thread conflict there. The only external functions called are from the C library, and these are pretty much all supposed to be "thread-safe" (the technical term is "re-entrant"), unless the man page specifies otherwise.
Note that the parallel for loop will not be executed in sequential order of the control variable pp, even within a single thread. OpenMP chooses the order of execution, so the externally visible results (in this example, the vv[pp] values) should not depend on the order in which the pp values are chosen. (It is possible for the programmer to have some control over the division and sequence of labor in the different threads, but I've not used this feature of OpenMP, nor do I plan to.)
When parallelizing 3dDespike, I chose to parallelize the voxel loop in the main program. The first thing was to identify all the variables that receive assignments in this loop, and move the declarations of those that are purely internal to the loop (not containing output data) from the top of main down to be inside the parallel block. These variables include a number of work arrays for processing the voxel time series. Each thread gets its own instance of each of these work arrays, since the malloc call was moved inside the parallel block. I simply had to examine the code carefully to ensure that every variable that received an assignment was local to the thread -- since this loop wasn't self-contained inside a function, the work of scrutinizing the code was a little more tedious than for the interpolation functions described above. At the end of the voxel loop, the results are put into the output dataset. Each voxel calculation and assignment is independent of all others, so there is no potential thread conflict. However, a couple more issues arose after I got the code running -- these are described below in Examples #3 and #4.
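In schematic form, the restructured 3dDespike voxel loop looks something like the sketch below. The names (nvox, ntime, tser, fser) are illustrative only, not the actual 3dDespike variables; the point is that the work arrays are malloc-ed inside the parallel block, so each thread automatically gets its own private copies.

#include <stdlib.h>

/* schematic only -- not the real 3dDespike code */

void despike_all_voxels( int nvox , int ntime )
{
#pragma omp parallel
 {
   int iv ;
   float *tser = (float *)malloc(sizeof(float)*ntime) ;  /* per-thread work arrays: */
   float *fser = (float *)malloc(sizeof(float)*ntime) ;  /* each thread has its own */

#pragma omp for
   for( iv=0 ; iv < nvox ; iv++ ){
     /* extract voxel #iv time series into tser ,
        compute the despiked series into fser ,
        store fser back into the output dataset at voxel #iv ;
        each iteration writes only to its own voxel, so no conflict */
   }

   free(fser) ; free(tser) ;   /* each thread frees its own copies */
 } /* end OpenMP */
}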
Example #2 -- A Little Bit of Restructuring
My second example comes from program 3dTcorrMap.c and shows how a small
change to the logic of the program helped OpenMP-ize it. In this program, the
innermost loop is computing the sum (or other combination) of a lot of calculations.
Clearly, when the code is adding the new result into the accumulating sum, there is the
potential for conflict between two threads executing this summation at the same time.
One way to avoid this conflict is to mark this statement as being "critical" -- to be only
executed by one thread at a time. The other way to avoid this conflict is to modify
the code to put each iteration's result into a temporary array, and then add the
results up afterwards, outside the parallel block. The second way is
what I chose to do. Here is the parallelized code:
float *ccar = (float *)malloc(sizeof(float)*nmask) ;  /* temporary array */

for( ii=0 ; ii < nmask ; ii++ ){   /* outer loop over voxels: */
                                   /* time series to correlate with */

  xsar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,ii) ) ;  /* ii-th time series */

#pragma omp parallel
 {
   int vv,uu ; float *ysar ; float qcc ;
#pragma omp for
   for( vv=0 ; vv < nmask ; vv++ ){   /* inner loop over voxels */
     if( vv==ii ){ ccar[vv] = 0.0f ; continue ; }
     ysar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,vv) ) ;
     /** dot products (unrolled by 2 on 29 Apr 2009) **/
     if( isodd ){
       for( qcc=xsar[0]*ysar[0],uu=1 ; uu < ntime ; uu+=2 )
         qcc += xsar[uu]*ysar[uu] + xsar[uu+1]*ysar[uu+1] ;
     } else {
       for( qcc=0.0f,uu=0 ; uu < ntime ; uu+=2 )
         qcc += xsar[uu]*ysar[uu] + xsar[uu+1]*ysar[uu+1] ;
     }
     ccar[vv] = qcc ;   /* save correlation in ccar for later (OpenMP mod) */
   } /* end of inner loop over voxels (vv) */
 } /* end OpenMP */

 /* below here, combine results from ccar[] to get output for voxel #ii */

} /* end of outer loop over voxels (ii) */

Note that all other variables (besides ccar[]) on the receiving end of an assignment inside the parallel block are local variables inside that block, and so are private to each thread. Note also that only the loop over vv is parallelized -- the innermost loops over uu run sequentially in each thread. In principle, you can nest parallel blocks, but OpenMP version 2.5 does not require an implementation to support nested parallelism, and I've not tried to use this feature.
The original code for the above fragment read like so:
for( ii=0 ; ii < nmask ; ii++ ){

  /* time series to correlate with */

  xsar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,ii) ) ;
  Tcount = Mcsum = Zcsum = Qcsum = 0.0f ;

  for( jj=0 ; jj < nmask ; jj++ ){   /* loop over other voxels, correlate w/ii */
    if( jj==ii ) continue ;
    ysar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,jj) ) ;
    /** dot products (unrolled by 2 on 29 Apr 2009) **/
    if( isodd ){
      for( cc=xsar[0]*ysar[0],kk=1 ; kk < ntime ; kk+=2 )
        cc += xsar[kk]*ysar[kk] + xsar[kk+1]*ysar[kk+1] ;
    } else {
      for( cc=0.0f,kk=0 ; kk < ntime ; kk+=2 )
        cc += xsar[kk]*ysar[kk] + xsar[kk+1]*ysar[kk+1] ;
    }
    Mcsum += cc ;
    Zcsum += 0.5f * logf((1.0001f+cc)/(1.0001f-cc)) ;
    Qcsum += cc*cc ;
    if( fabsf(cc) >= Thresh ) Tcount++ ;
  } /* end of loop over jj */

  if( Mar != NULL ) Mar[indx[ii]] = Mcsum / (nmask-1.0f) ;
  if( Zar != NULL ) Zar[indx[ii]] = tanh( Zcsum / (nmask-1.0f) ) ;
  if( Qar != NULL ) Qar[indx[ii]] = sqrt( Qcsum / (nmask-1.0f) ) ;
  if( Tar != NULL ) Tar[indx[ii]] = Tcount ;

} /* end of loop over ii */

Note the summation into Mcsum and other variables inside the jj loop -- these are not thread-safe. Instead of storing the components of Mcsum into an array ccar as done in the OpenMP-ization above, another way would be to force the OpenMP thread manager to ensure that only one thread at a time updates these variables. This could be done with the following construction:
#pragma omp critical (TcorrMap)
 { Mcsum += cc ;
   Zcsum += 0.5f * logf((1.0001f+cc)/(1.0001f-cc)) ;
   Qcsum += cc*cc ;
   if( fabsf(cc) >= Thresh ) Tcount++ ;
 }

This OpenMP construction would block more than one thread at a time from entering the structured block that follows the critical directive. So why didn't I use this approach when modifying 3dTcorrMap.c? The answer is simple -- I wasn't really aware of critical when I made the changes to the code in the distant past (3+ weeks ago now). (I probably would have tried that first, but I'm not going to go back and patch things up just for fun.) Of course, it's important not to put too much code inside a critical block, or that will slow the program down as threads are forced to wait.
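For completeness: OpenMP also has a reduction clause (not used in 3dTcorrMap.c), which gives each thread a private copy of the named accumulators and combines them when the loop finishes. A sketch of how the jj loop might then look:

#pragma omp parallel for private(cc,kk,ysar) reduction(+:Mcsum,Zcsum,Qcsum,Tcount)
   for( jj=0 ; jj < nmask ; jj++ ){
     if( jj==ii ) continue ;
     ysar = MRI_FLOAT_PTR( IMARR_SUBIM(timar,jj) ) ;
     /* ... compute the dot product cc , as in the original code ... */
     Mcsum += cc ;
     Zcsum += 0.5f * logf((1.0001f+cc)/(1.0001f-cc)) ;
     Qcsum += cc*cc ;
     if( fabsf(cc) >= Thresh ) Tcount++ ;
   }

One caveat with reduction: the per-thread partial sums are combined in a thread-dependent order, so the floating-point results may differ in the last bits from the sequential code (cf. Lesson B in Example #4 below).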
Example #3 -- OpenMP and malloc
The C library functions malloc etc. are re-entrant, so that's good -- they can be
used inside parallel blocks with no real worries. However, the
tracking wrappers I wrote, mcw_malloc etc., are not re-entrant,
since they use and modify a static structure to keep track of what's been allocated.
Therefore, any OpenMP-ized function that might call (even indirectly in another function)
an mcw_malloc function must put this call inside a critical block.
This is pretty annoying. You can find some examples of this in 3dDespike.c
and other places.
However, about a day later I realized that this is unneeded.
All you need to do is not turn on
mcw_malloc in OpenMP-enabled main programs. At the top of many AFNI main
programs is code like so:
#ifdef USING_MCW_MALLOC
  enable_mcw_malloc() ;
#endif

This code snippet should be replaced with
#if defined(USING_MCW_MALLOC) && !defined(USE_OMP)
  enable_mcw_malloc() ;
#endif

In this way, a program that gets compiled with OpenMP will not turn on the mcw_malloc functions, and life will be cool.
Unfortunately, the same issue arises with the NI_malloc functions in the NIML library. I've not decided what to do about those just yet. These functions (in sub-directory niml) are designed to be compiled without reference to other AFNI code -- my (unfulfilled) dream was that other people might like to use these functions for inter-process communication and data storage, so I didn't want them to be dependent on the complicated forest of AFNI headers, macros, etc.
Example #4 -- OpenMP and static variables
Writing to static variables is a potential thread conflict problem.
When OpenMP-izing 3dDespike, I found that there were differences in
the output between the old and new versions in a tiny number of voxels. This was
pretty annoying. After some thought, I traced the problem down to the
Fortran-to-C translated function in cl1.c. All the local variables in
the translated code are declared static, since that is the semantics
of Fortran-77 (local variable values persist across function calls, unlike in C).
However, the function doesn't actually re-use any values from old calls, so
I removed all the static local variable declarations in this translated
code. The two versions of 3dDespike now agreed exactly.
Lesson A: when OpenMP-izing, scrutinize all functions being called for static variables that might get written into. If you find any, you'll have to de-static the code, or critical-ize the call to that function. (In the case of cl1.c, the latter choice wasn't an option, since 3dDespike spends most of its CPU time in that function.)
Lesson B: Be sure to run the program with and without OpenMP and diff the result files to make sure they are identical. Any differences need to be investigated, understood, and fixed.
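To make Lesson A concrete, here is a minimal made-up example of the kind of function that looks harmless but is not thread-safe:

float scale_sum( float *x , int n )   /* NOT thread-safe! */
{
   static float sum ;                 /* one copy, shared by ALL threads */
   int ii ;

   sum = 0.0f ;
   for( ii=0 ; ii < n ; ii++ ) sum += x[ii] ;
   return sum / n ;                   /* another thread may have scribbled on sum */
}

Two threads calling scale_sum() at the same time will both write into the single shared sum, so each can corrupt the other's result. Changing static float sum to a plain automatic float sum (which lives on each thread's private stack) removes the conflict.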
A potential static problem arises with the function call traceback macros ENTRY and RETURN/EXRETURN defined in debugtrace.h. These macros use a static stack to keep track of function calls. Clearly, this stack could get confused with OpenMP. To 'fix' this problem, I've defined a macro that should be placed as the first executable statement in a parallel block, and another to be placed as the last executable statement; for example:
#pragma omp parallel if(npp > 9999)
 {
   int fred ;  /* other declarations ... */
   AFNI_OMP_START ;
   /* parallel code ... */
   AFNI_OMP_END ;
 } /* end OpenMP */

At present, all these macros do is stop and re-start (respectively) the ENTRY/RETURN stack operations. In the future, they may do more things, which is why I wrote them as macros, to be more flexible.
Example #5 -- OpenMP and random numbers -- AlphaSim
A problem arose when I tried to OpenMP-ize the AlphaSim program.
This code generates a lot of random 3D volumes and processes them to
get some statistics. As it turns out, it spends most of its time
(80+%) generating the random numbers. The routine that is used boils
down to the C library function drand48() and its ilk.
Parallelizing across instances of the volumes seems pretty straightforward
-- but the program becomes much slower. After some playing around,
I found the reason: the drand48() set of functions uses
an internal "state" to generate the sequence of random variates -- and
this state data is static. To be thread-safe,
these library functions simply block on re-entrancy -- that is, they are all
critical. Not very good for speedup!
But there is a solution that doesn't involve writing my own thread-safe random number generator (which was my first thought, and which gave me a headache). Instead of drand48() to get a U(0,1) random variate, I used the function erand48(unsigned short x[3]), where the random generation state is stored in the array x[]. I create a separate state array for each thread (at the start of the parallel block, before the parallelized for), and initialize them separately (so each thread gets a different sequence of random variates!). This solves the problem, and the program now runs with nearly perfect speedup. For the tedious details, see the AlphaSim.c source code -- including the file zgaussian.c, where the Gaussian N(0,1) random number generator resides.
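In outline, the per-thread generator state looks like the sketch below. The details (function and variable names, the seeding scheme) are simplified and illustrative; the real code is in AlphaSim.c and zgaussian.c. The key points are one 3-element erand48() state array per thread, seeded differently for each thread, and indexed by omp_get_thread_num().

#include <stdlib.h>    /* erand48() */
#ifdef USE_OMP
#include <omp.h>
#endif

void random_volumes_demo( int niter , long seed )   /* illustrative only */
{
   int nthr=1 , it ;
   unsigned short (*xran)[3] ;        /* one 3-element state array per thread */

#ifdef USE_OMP
   nthr = omp_get_max_threads() ;
#endif
   xran = (unsigned short (*)[3])malloc(sizeof(unsigned short)*3*nthr) ;
   for( it=0 ; it < nthr ; it++ ){    /* seed each thread's state differently */
     xran[it][0] = (unsigned short)(seed)       ;
     xran[it][1] = (unsigned short)(seed >> 16) ;
     xran[it][2] = (unsigned short)(it + 1013)  ;
   }

#pragma omp parallel
 {
   int iter , ithr=0 ;
#ifdef USE_OMP
   ithr = omp_get_thread_num() ;      /* which state array this thread owns */
#endif
#pragma omp for
   for( iter=0 ; iter < niter ; iter++ ){
     double u = erand48( xran[ithr] ) ; /* U(0,1) from this thread's private state */
     (void)u ;  /* placeholder: the real code builds and processes a random volume */
   }
 } /* end OpenMP */

   free(xran) ;
}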
Example #6 -- OpenMP and malloc() and memcpy() -- 3dREMLfit
Parallelizing 3dREMLfit effectively and correctly was much harder than the programs
above. The first problem was where to parallelize. I tried various places,
but none of them gave much speedup -- in some cases, OpenMP slowed the program down significantly.
After a long time, I finally thought of using Apple's Shark profiler program, and figured out
that the program was spending a vast amount of time spinning threads waiting for malloc
and free to do their work. Aha! The big CPU time sink in the "REML voxel loop",
REML_func(), allocates and frees workspace arrays.
But this function is called hundreds of times per voxel, so the threads
were interfering with each other -- because the system memory allocator is thread-safe simply
by blocking -- it is not parallelized itself. I didn't want to install a parallelized
version of malloc, so I rearranged the code to provide the workspace from outside,
so that it wouldn't constantly be created and destroyed. Speedup!!!
... But ... In a sample run with about 133K voxels, about 40-100 voxels did not agree with the single-threaded case. Arrrgghhh!!! After printing out vast amounts of information from each thread's loop through the voxels, I finally found that these traitorous voxels in fact had the wrong data passed to REML_func(). This data time series vector was copied using memcpy. For whatever reason, this wasn't completely thread-safe -- I don't know why, and can't find any other info about this on the Web. But by critical-izing this statement, the problems vanished. Simple .. after 2.5 days of frustration, that is.
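In schematic form, the critical-ized copy looks like the fragment below. The names (vec, datar, iv, ntime, and the critical section label) are made up for illustration; the real code is in 3dREMLfit.c.

#include <string.h>    /* memcpy */

/* copy voxel #iv's time series into this thread's private vector vec[] ;
   the critical block forces the threads to do the copy one at a time    */

void get_voxel_series( float *vec , float *datar , int iv , int ntime )
{
#pragma omp critical (GETSERIES)
 { memcpy( vec , datar + (size_t)iv*ntime , sizeof(float)*ntime ) ; }
}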
Other minor details in parallelizing this program: the -usetemp option writes intermediate data to disk and then re-reads it later, with the assumption that things are written in voxel order. This won't be true for OpenMP, so a special simpler loop, with all the -usetemp stuff deleted, was written for the parallel section of the code. This change makes it easier to deal with the parallelization, but of course means that any changes to the "REML voxel loop" must be made twice in the future, since there are now two different versions of this loop's code extant. C'est la soft-guerre.
Example #7 -- Looping over voxels in 3dAutoTcorrelate
This program correlates the time series from all voxels inside a mask with all other
voxels (inside the same mask, or all voxels). In the original sequential code,
the outer loop over voxel index ii just tested the voxel index to see if it
was in the mask -- if not, then it just continue-d to the next voxel.
But in OpenMP, this means that
some parallel sub-loops get more work than others, if they happen to fall into a
denser part of the mask. Speedup was not all it should be. So by pre-making an
index table imap of voxel indexes in the mask, I could just loop over that
table and be sure each loop iteration was doing the same amount of labor.
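In schematic form (with illustrative names -- this is not the actual 3dAutoTcorrelate code), the change is from testing the mask inside the parallelized loop to looping over a pre-built table of in-mask voxel indexes:

#include <stdlib.h>

void correlate_masked_voxels( unsigned char *mask , int nvox )
{
   int *imap = (int *)malloc(sizeof(int)*nvox) ;
   int  nmask = 0 , ii ;

   for( ii=0 ; ii < nvox ; ii++ )            /* build the index table: */
     if( mask[ii] ) imap[nmask++] = ii ;     /* in-mask voxels only    */

#pragma omp parallel
 {
   int kk , iv ;
#pragma omp for
   for( kk=0 ; kk < nmask ; kk++ ){   /* every iteration now does real work, */
     iv = imap[kk] ;                  /* so the threads stay well balanced   */
     /* ... correlate voxel #iv's time series with all the others ... */
   }
 } /* end OpenMP */

   free(imap) ;
}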
Another issue that arose, that had a bigger impact, was the fact that extracting the time series from the space-then-time ordered input dataset was inefficient -- since each voxel time series would be extracted many many times as the inner voxel loop was traversed. Inverting the input dataset into an MRI_vectim struct (time-then-space ordered) provided a big speedup, making memory accesses much more local. Also, each time series could be pre-detrended, avoiding repeating this operation many many times.
With these 2 changes, 3dAutoTcorrelate became much faster, and in fact it now spends a significant fraction of its time just writing the lengthy results to disk. I suppose this would be the next target for speedup -- probably by writing each completed sub-brick to the output .BRIK file via mmap() when it is completed.
OpenMP Library Functions
I've only used a couple of these -- see the specification for the real scoop.
Printing -help for OpenMP-ized programs
In mrilib.h, the macro PRINT_AFNI_OMP_USAGE(pnam,extra)
is defined. When USE_OMP is defined (i.e., on the compile
command line via CCOMP), then this macro prints out some extra
help relevant to OpenMP usage. The argument pnam is intended
to be the program name, and the argument extra is any extra
text to be put at the end of the standard boilerplate (extra
can be NULL). If USE_OMP is not defined, then
PRINT_AFNI_OMP_USAGE(pnam,extra) prints a message that the
program is OpenMP-compatible, but this binary copy is not compiled
with OpenMP.
Thus, this macro is designed to be inserted at the end of the '-help'
output section in any OpenMP-ized program (cf. AlphaSim.c).
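For example, the -help processing in a hypothetical program (names made up; mrilib.h assumed to be included) would end something like this:

   if( argc < 2 || strcmp(argv[1],"-help") == 0 ){
     printf("Usage: 3dSomething [options] dataset\n"
            " ... description of the options ...\n"  ) ;
     PRINT_AFNI_OMP_USAGE("3dSomething",NULL) ;  /* OpenMP boilerplate, or the       */
     exit(0) ;                                   /* 'not compiled with OpenMP' note */
   }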