RBA


Welcome to RBA

Region-Based Analysis Program through Bayesian Multilevel Modeling
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Version 1.1.8, Feb 26, 2025
Author: Gang Chen (gangchen@mail.nih.gov)
Website - https://afni.nimh.nih.gov/gangchen_homepage
SSCC/NIMH, National Institutes of Health, Bethesda MD 20892
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Introduction
RBA (Region-Based Analysis) is performed via a shell script, as demonstrated in the
examples below. The method is detailed in the manuscript: https://rdcu.be/bhhJp.
Input data must be formatted as a plain-text table specifying regions and variables.
The response variable represents an effect at the individual subject level.

Special thanks to Paul-Christian Bürkner and the Stan/R communities for their
invaluable support.

Citation
If you wish to cite RBA, consider the following references:

Chen G, Xiao Y, Taylor PA, Riggins T, Geng F, Redcay E (2019). Handling Multiplicity
in Neuroimaging through Bayesian Lenses with Multilevel Modeling. Neuroinformatics.
https://rdcu.be/bhhJp

Chen G, Taylor PA, Cox RW, Pessoa L (2020). Fighting or embracing multiplicity in
neuroimaging? Neighborhood leverage versus global calibration. NeuroImage, 206,
116320. https://doi.org/10.1016/j.neuroimage.2019.116320

Chen G, Taylor PA, Stoddard J, Cox RW, Bandettini PA, Pessoa L (2022). Sources of
Information Waste in Neuroimaging: Mishandling Structures, Thinking Dichotomously,
and Over-Reducing Data. Aperture Neuro, 2021, 46.
https://doi.org/10.52294/2e179dbf-5e37-4338-a639-9ceb92b055ea

Data Format Requirements
A properly formatted plain-text data table is required for RBA. The table must
contain at least three columns specifying subjects, regions, and response variable
values. Column names are case-sensitive, but their order does not matter.

Example Format:
Subj   ROI       Y      Age
S1     Amyg    0.2643    11
S2     BNST    0.3762    16

Key Guidelines
1. Bayesian Approach:

Unlike frequentist methods, Bayesian analysis provides direct probability estimates
for effects rather than p-values.

2. Variable Naming & Encoding:

* Avoid using pure numbers for categorical variables.
* Default names (Subj, ROI, Y) simplify scripting but are not required.

3. Incorporating Explanatory Variables:

* Only between-subject variables (e.g., sex, age, patient/control status) are
  currently supported.
* Within-subject/repeated measures support may be added in the future.
* If encoding categorical variables numerically, ensure correct factor coding.
* Reference: Factor coding strategies.

4. Standardization for Improved Convergence:
* Use the -stdz option to standardize continuous explanatory variables.
* If grouping factors (e.g., sex) are present, standardize within each group if their
  means differ significantly.
* Do not standardize categorical variables or the response variable if the intercept
  is of interest.

5. Handling Within-Subject Variables:
* Express them as contrasts or linear combinations of factor levels.

6. Interpretation of Results:
* RBA estimates effects per region, with slight variations across runs due to Monte
  Carlo sampling.
* The key output metric P+ represents the probability of an effect being positive
  under the given model and data.
* Unlike NHST, we discourage rigid significance thresholds and advocate full results
  reporting.

7. Homogenization Warning:
* If results appear overly uniform across regions, cross-region variability may be too
  low, leading to excessive pooling.
* This suggests the need for more data to resolve subtle effects.

Installation Requirements
R & Required Packages
* Ensure you have an up-to-date R installation. The brms package is required:

* Installation via Terminal:
rPkgsInstall -pkgs "brms" -site http://cran.us.r-project.org

* Or within R:
install.packages("brms")

Parallelization for Performance
* For better performance, install cmdstan and cmdstanr and use the -WCP option in MBA.

* Installing cmdstanr in R:
install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))

* Installing cmdstan in R:
cmdstanr::install_cmdstan(cores = 2)

* Installation guide: https://mc-stan.org/cmdstanr/articles/cmdstanr.html
If installed outside the home directory, specify the path using -StanPath '~/my/stan/path'.

* Additional Packages for Ridge Plots
If using -ridgePlot, install the following R packages:
install.packages(c("data.table", "ggplot2", "ggridges", "dplyr", "tidyr", "scales"))

Running RBA
Once the RBA script is ready, execute it via the terminal.

Recommended Execution (tcsh shell)
Save the script as myRBA.txt, then run:

nohup tcsh -x myRBA.txt > diary.txt &
nohup tcsh -x myRBA.txt |& tee diary.txt &

The output is saved in diary.txt for debugging.
The nohup command allows the script to continue running even if the terminal session is closed.

Examples:

Example 1 --- Simplest scenario. Values from regions are the input from
  each subject. No explanatory variables are considered. Research
  interest is about the population effect at each region.

RBA -prefix output -dataTable myData.txt  \

The above script is equivalent to

RBA -prefix myResult -chains 4 -iterations 1000 -model 1 -EOI 'Intercept' \
-dataTable myData.txt  \

The 2nd version above is recommended because of its explicit specifications.

If the data are skewed or have outliers, use Student's t-distribution:

RBA -prefix myResult -chains 4 -iterations 1000 -model 1 -EOI 'Intercept' \
-distY 'student' -dataTable myData.txt  \

If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
...), a speedup feature can be adopted through within-chain parallelization
with the option -WCP. For example, the script assumes a computer with 24 CPUs
(6 CPUs per chain):

RBA -prefix myResult -chains 4 -WCP 6 \
-iterations 1000 -model 1 -EOI 'Intercept' -distY 'student' \
-dataTable myData.txt  \

The input file 'myData.txt' is a data table in pure text format as below:

Subj  ROI          Y
S01   lFFA       0.162
S02   lAmygdala -0.598
S03   DMNLAG     0.249
S04   DMNPCC     0.568

If t-statistic (or standard error) values corresponding to the response variable
Y are available, add the t-statistic (or standard error) values as a column in the input
data table so that they can be incorporated into the BML model using the option -tstat
or -se with the following script (assuming the tstat column is named as 'tvalue'),

RBA -prefix myResult -chains 4 -WCP 6 \
-iterations 1000 -model 1 -EOI 'Intercept' -distY 'student' -tstat tvalue \
-dataTable myData.txt  \

or (assuming the se column is named as 'SE'),

RBA -prefix myResult -chains 4 -WCP 6 \
-iterations 1000 -model 1 -EOI 'Intercept' -distY 'student' -se SE \
-dataTable myData.txt  \

Example 2 — 2 between-subjects factors (sex and group):

RBA -prefix output -Subj subject -ROI region -Y zscore -ridgePlot 10 8 \
-chains 4 -iterations 1000 -model '1+sex+group' \
-cVars 'sex,group' -EOI 'Intercept,sex,group' \
-dataTable myData.txt

If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
...), a speedup feature can be adopted through within-chain parallelization
with the option -WCP. For example, consider adding
'-WCP 6' on a computer with 24 CPUs.

The input file 'myData.txt' is formatted as below:

subject region  zscore  sex group
S1      DMNLAG  0.274    F  patient
S1      DMNLHC  0.443    F  patient
S2      DMNRAG  0.455    M  control
S2      DMNRHC  0.265    M  control

Notice that the interaction between 'sex' and 'group' is not modeled in
this case. The option -ridgePlot generates a stacked list of posterior
distributions in a sequential order among the regions for each effect of
interest specified through -EOI. The two numbers of 10 and 8 associated
with the option -ridgePlot specifies the figure window size with 10" wide
and 8" high.


Example 3 --- one between-subjects factor (sex), one within-subject factor (two
conditions), one between-subjects covariate (age), and one quantitative

variable:

RBA -prefix result -ridgePlot 8 6 -Subj Subj -ROI region -Y value \
-chains 4 -iterations 1000 -model '1+sex+age+SA' -qVars 'sex,age,SA' \
-EOI 'Intercept,sex,age,SA' -dataTable myData.txt

If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
...), a speedup feature can be adopted through within-chain parallelization
with the option -WCP. For example, consider adding '-WCP 6' to the script
on a computer with 24 CPUs.

The input file 'myData.txt' is formatted as below:

Subj   region   value sex  age   SA
S1    DMNLAG    0.274  1  1.73  1.73
S1    DMNLHC    0.443  1  1.73  1.73
S2    DMNRAG    0.455 -1 -0.52  0.52
S2    DMNRHC    0.265 -1 -0.52  0.52

Notice

1) The 'Y' column is the contrast between the two conditions.
2) Since we want to model the interaction between 'sex' and 'age', 'sex' is
   coded through deviation coding.
3) 'age' has already been standardized within each sex due to large age
   difference between the two sexes.
4) The 'SA' column codes for the interaction between 'sex' and 'age', which
   is the product of the two respective columns.


Example 4 --- a more flexible way to specify a model.

RBA -prefix test -chains 4 -iterations 1000 -mean 'score~1+(1|roi)+(1|subj)' \
-sigma '1+(1|roi)+(1|subj)' -ROI 'roi' -EOI 'Intercept' -WCP 8
-dataTable test.tbl

The input file 'test.tbl' is formatted as below:

subj    roi     score
S1    DMNLAG    0.274
S1    DMNLHC    0.443
S2    DMNLAG    0.455
S2    DMNLHC    0.265

Notice

1) The -mean option specifies the formulation for the mean of the likelihood (Gaussian
   in this case).
2) The -sigma option specifies the formulation for the standard deviation of likelihood
   (Gaussian in this case).
3) It is important to identify the pivotal variable as 'roi' since the label is different
   from the default ('ROI').


Options in alphabetical order:

   -chains N: Specify the number of Markov chains. Make sure there are enough
         processors available on the computer. Most of the time 4 cores are good
         enough. However, a larger number of chains (e.g., 8, 12) may help achieve
         higher accuracy for posterior distribution. Choose 1 for a single-processor
         computer, which is only practical only for simple models.

   -cVars variable_list: Identify categorical (qualitive) variables (or
         factors) with this option. The list with more than one variable
         has to be separated with comma (,) without any other characters such
         as spaces and should be surrounded within (single or double) quotes.
         For example, -cVars "sex,site"

   -dataTable TABLE: List the data structure in a table of long format (cf. wide
         format) in R with a header as the first line.

         NOTE:

         1) There should have at least three columns in the table. These minimum
         three columns can be in any order but with fixed and reserved with labels:
         'Subj', 'ROI', and 'Y'. The column 'ROI' is meant to code the regions
         that are associated with each value under the column Y. More columns can
         be added in the table for explanatory variables (e.g., groups, age, site)
         if applicable. Only subject-level (or between-subjects) explanatory variables
         are allowed now. The labels for the columns of 'Subj' and 'ROI'
         can be any identifiable characters including numbers.
         2) Each row is associated with one and only one 'Y' value, which is the
         response variable in the table of long format (cf. wide format) as
         defined in R. With n subjects and m regions, there should have totally mn
         rows, assuming no missing data.
         3) It is fine to have variables (or columns) in the table that are not used
         in the current analysis.
         4) The context of the table can be saved as a separate file, e.g., called
         table.txt. In the script specify the data with '-dataTable table.txt'.
         This option is useful when: (a) there are many rows in the table so that
         the program complains with an 'Arg list too long' error; (b) you want to
         try different models with the same dataset.

   -dbgArgs: This option will enable R to save the parameters in a file called
         .RBA.dbg.AFNI.args in the current directory so that debugging can be
         performed.

   -distROI distr_name: Use this option to specify the distribution for the ROIs.
        The default is Gaussian when this option is not invoked. When the number of
         regions is small (e.g., less than 20), consider adopting the Student's
        t-distribution by using this option with 'student'.

   -distSubj distr_name: Use this option to specify the distribution for the subjects.
        The default is Gaussian when this option is not invoked. When the number of
         regions is small (e.g., less than 20), consider adopting the Student's
        t-distribution by using this option with 'student'.

   -distY distr_name: Use this option to specify the distribution for the response
        variable. The default is Gaussian when this option is not invoked. When
        skewness or outliers occur in the data, consider adopting the Student's
        t-distribution or exGaussian by using this option with 'student' or
        'exgaussian'.

   -EOI variable_list: Identify effects of interest in the output by specifying the
         variable names separated with comma (,). For example, -EOI "sex,age".
         By default, the Intercept is considered to be an effect of interest.
         Currently only variables, not their interactions, can be directly
         requested for output. However, most interaction effects can be obtained by
         either properly coding the variables (see example 3) or post processing.

   -help: this help message

   -iterations N: Specify the number of iterations per Markov chain. Choose 1000 (default)
         for simple models (e.g., one or no explanatory variables). If convergence
         problem occurs as indicated by Rhat being great than 1.1, increase the number of
         iterations (e.g., 2000) for complex models, which will lengthen the runtime.
         Unfortunately, there is no way to predict the optimum iterations ahead of time.

   -mean FORMULA: Specify the formulation for the mean of the likelihood (sampling
          distribution).

   -model FORMULA: This option specifies the effects associated with explanatory
         variables. By default, (without user input) the model is specified as
         1 (Intercept). Currently only between-subjects factors (e.g., sex,
         patients vs. controls) and quantitative variables (e.g., age) are
         allowed. When no between-subject factors are present, simply put 1
         (default) for FORMULA. The expression FORMULA with more than one
         variable has to be surrounded within (single or double) quotes (e.g.,
         '1+sex', '1+sex+age'. Variable names in the formula should be consistent
         with the ones used in the header of data table. A+B represents the
         additive effects of A and B, A:B is the interaction between A
         and B, and A*B = A+B+A:B. Subject as a variable should not occur in
         the model specification here.

   -PDP nr nc: Specify the layout of posterior distribution plot (PDP) with nr rows
         and nc columns among the number of plots. For example, with 16 regions,
         you can set nr = 4 and nc = 4. The region names will be shown in each plot.
         So, label the regions concisely.

   -prefix PREFIX: The prefix option specifies the base name for output files. A
         directory path can be included in the file name to specify a designated
         location for storing output files. The primary output is a text file named
         <prefix>.txt, which contains inference results for effects of interest in a
         tabulated format based on the selected options. In addition to the text file,
         the prefix is also used for other output files, including visualization plots
         and an R data file saved in binary format (<prefix>.RData). The .RData file
         allows for post hoc analysis, such as customized processing and plotting in
         R. If disk space is a concern, you can safely remove the .RData file once it
         is no longer needed, as it is primarily for further exploratory analyses
         rather than essential results.

   -qContr contrast_list: Identify comparisons of interest between quantitative
         variables in the output separated with comma (,). It only allows for
         pair-wise comparisons between two quantitative variables. For example,
         -qContr "age vs IQ, age vs weight, IQ vs weight", where V1, V2, and V3 are three
         quantitative variables and three comparisons, V1 - V2, V1 - V3 and V2 - V3
         will be provided in the output. Make sure that such comparisons are
         meaningful (e.g., with the same scale and unit. This can be used to
         formulate comparisons among factor levels if the user quantitatively
         codes the factor levels.

   -qVars variable_list: Identify quantitative variables (or covariates) with
         this option. The list with more than one variable has to be
         separated with comma (,) without any other characters such as
         spaces and should be surrounded within (single or double) quotes.
         For example, -qVars "Age,IQ"

   -r2z: This option performs Fisher transformation on the response variable
         (column Y) if it is correlation coefficient.

   -ridgePlot width height: This option will plot the posterior distributions stacked
         together in a sequential order, likely preferable to the one generated
         with option -PDP. The size of the figure window is specified through the
         two parameters of width and height in inches. You can fine-tune the plot
         yourself by loading up the *.RData file if you know the tricks.

   -ROI var_name: var_name is used to specify the column name that is designated as
        as the region variable. The default (when this option is not invoked) is
        'ROI'.

   -scale d: Specify a multiplier for the Y values. When the values for response
         are too small or large, it may create a convergence problem for MCMC. To
         avoid the problem, set a scaling factor so that the range of value is
         around 1-10. The results will be adjusted back to the original scale.

   -se: This option indicates that standard error for the response variable is
         available as input, and a column is designated for the standard error
         in the data table. If effect estimates and their t-statistics are the
         output from preceding analysis, standard errors can be obtained by
         dividing the effect estimates ('betas') by their t-statistics. The
         default assumes that standard error is not part of the input.

   -show_allowed_options: list of allowed options

   -sigma FORMULA: Specify the formulation for the standard deviation (sigma) of the
          likelihood (sampling distribution). When this option is absent in the
          script, it is assumed to be 1, meaning a single parameter for the variance
          (homogeneity).

   -stdz variable_list: Identify quantitative variables (or covariates) to be
         standardized. To obtain meaningful and interpretable results and to
         achieve better convergence of Markov chains with reasonable iterations,
         it is recommended that all quantitative variables be standardized
         except for the response variable and indicator variables that code for
         factors. For example, -stdz "Age,IQ". If the mean of a quantitative
         variable varies substantially between groups, it may make sense to
         standardize the variable within each group before plugging the values
         into the data table. Currently RBA does not offer the option to perform
         within-group standardization.

   -Subj var_name: var_name is used to specify the column name that is designated as
        as the measuring unit variable (usually subject). The default (when this
        option is not invoked) is 'Subj'.

   -tstat var_name: var_name is used to specify the column name that lists
        the t-statistic values, if available, for the response variable 'Y'.
        In the case where standard errors are available for the effect
        estimates of 'Y', use the option -se.

   -verb VERB: Specify verbose level.

   -WCP k: This option will invoke within-chain parallelization to speed up runtime.
         To take advantage of this feature, you need the following: 1) at least 8
         or more CPUs; 2) install 'cmdstan'; 3) install 'cmdstanr'. The value 'k'
         is the number of threads per chain that is requested. For example, with 4
         chains on a computer with 24 CPUs, you can set 'k' to 6 so that each
         chain will be assigned with 6 threads.

   -Y var_name: var_name is used to specify the column name that is designated as
        as the response/outcome variable. The default (when this option is not
        invoked) is 'Y'.