:orphan:

.. _ahelp_RBA:

***
RBA
***

.. contents:: :local:


| 

    
Welcome to RBA
==============

.. code-block:: none

    Region-Based Analysis Program through Bayesian Multilevel Modeling 
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Version 1.1.2, Oct 1, 2022 
    Author: Gang Chen (gangchen@mail.nih.gov)
    Website - https://afni.nimh.nih.gov/gangchen_homepage
    SSCC/NIMH, National Institutes of Health, Bethesda MD 20892
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    

Usage:
======

.. code-block:: none

    RBA performs region-based analysis (RBA) as theoretically elaborated in the
    manuscript: https://rdcu.be/bhhJp and is conducted with a shell script (as
    shown in the examples below). The input data should be formulated in a
    pure-text table that codes the regions and variables. The response variable
    is some effect at the individual subject level.
    
    Thanks to Paul-Christian Bürkner and the Stan/R communities for the strong support.
    

Citation:
=========


If you want to cite the approach for RBA, consider the following
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. code-block:: none

    
    Chen G, Xiao Y, Taylor PA, Riggins T, Geng F, Redcay E, 2019. Handling Multiplicity
    in Neuroimaging through Bayesian Lenses with Multilevel Modeling. Neuroinformatics.
    https://rdcu.be/bhhJp
    
    Chen, G., Taylor, P.A., Cox, R.W., Pessoa, L., 2020. Fighting or embracing 
    multiplicity in neuroimaging? neighborhood leverage versus global calibration. 
    NeuroImage 206, 116320. https://doi.org/10.1016/j.neuroimage.2019.116320
    
    Chen, G., Taylor, P.A., Stoddard, J., Cox, R.W., Bandettini, P.A., Pessoa, L., 
    2022. Sources of Information Waste in Neuroimaging: Mishandling Structures, 
    Thinking Dichotomously, and Over-Reducing Data. Aperture Neuro 2021, 46. 
    https://doi.org/10.52294/2e179dbf-5e37-4338-a639-9ceb92b055ea
    
    Read the following carefully!
    A data table in pure text format is needed as input for an RBA script. The
    data table should contain at least 3 columns that specify the information
    about subjects, regions and the response variable values with the following
    fixed header. The header labels are case-sensitive, and their order does not
    matter.
    
    Subj   ROI        Y      Age
    S1     Amyg    0.2643    11
    S2     BNST    0.3762    16
    
    0) You are performing Bayesian analysis. So, you will directly obtain the
    probability of the respective effect being positive or negative with your 
    data and adopted model, instead of looking for the p-value (weirdness of 
    your data under the modeling assumptions when pretending that absolutely 
    no effect exists).
    
    1) Avoid using pure numbers to code the labels for categorical variables. The
    column order does not matter. You can specify those column names as you
    prefer, but it saves a little bit scripting if you adopt the default naming
    for subjects ('Subj'), regions ('ROI') and response variable ('Y').
    
    2) Add more columns if explanatory variables are considered in the model. Currently
    only between-subjects variables (e.g., sex, patients vs. controls, age) are
    allowed. Capability of modeling within-subject or repeated-measures variables
    may be added in the future. Each label in a between-subjects factor (categorical
    variable) should be coded with at least 1 character (labeling with pure numbers
    is fine but not recommended). If preferred, you can quantitatively code the
    levels of a factor yourself by creating k-1 columns for a factor with k levels.
    However, be careful with your coding strategy because it would impact how to
    interpret the results. Here is a good reference about factor coding strategies:
    https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
    
    3) It is strongly suggested that a quantitative explanatory variable be
    standardized with option -stdz; that is, remove the mean and scale by
    the standard deviation. This will improve the chance of convergence
    with each Markov chain. If a between-subjects factor (e.g., sex) is
    involved, it may be better to standardize a quantitative variable
    within each group in terms of interpretability if the mean value differs
    substantially. However, do not standardize a between-subjects factor if
    you quantitatively code it. And do not standardize the response variable
    if the intercept is of interest!
    
    4) For within-subject variables, try to formulate the data as a contrast
    between two factor levels or as a linear combination of multiple levels.
    
    5) The results from RBA are effect estimates for each region. They can be
    slightly different across different runs or different computers and R
    package versions due to the nature of randomness involved in Monte Carlo
    simulations.
    
    6) The evidence for region effects in the output can be assessed through P+,
    the probability that the effect is positive conditional on the current
    model and dataset. Unlike the NHST convention, we emphasize that a decision 
    about the strength of statistical evidence should not be purely based on an
    artificial threshold. Instead, we encourage full results reporting:
    highlight some results with strong evidence and literature support (if
    available) without hiding the rest.
    
    7) WARNING: If the results are unexpectedly homogenized across regions, it is an
    indication that presumably partial pooling becomes full pooling. Most
    likely the cross-region variability is so negligible that the model
    renders the overall average as individual effects for all regions. When
    this occurs, you may need much more data for the model to differentiate
    the subtle effects.
    
    
Installation requirements:
==========================

.. code-block:: none

    In addition to R installation, the R package "brms" is required for RBA. Make
    sure that you have the most recent version of R. To install "brms", run the following
    command at the terminal:
    
    rPkgsInstall -pkgs "brms" -site http://cran.us.r-project.org"
    
    Alternatively, you may install them in R:
    
    install.packages("brms")
    
    *** To take full advantage of parallelization, install both 'cmdstan' and 
    'cmdstanr' and use the option -WCP in MBA. However, extra stpes are required: 
    both 'cmdstan' and 'cmdstanr' have to be installed. To install 'cmdstanir',
    execute the following command in R:
    
    install.packages('cmdstanr', repos = c('https://mc-stan.org/r-packages/', getOption('repos')))
    
    Then install 'cmdstan' using the following command in R:
    
    cmdstanr::install_cmdstan(cores = 2)
    # Follow the instruction here for the installation of 'cmdstan': 
    #    https://mc-stan.org/cmdstanr/articles/cmdstanr.html
    # If 'cmdstan' is installed in a directory other than home, use option -StanPath 
    # to specify the path (e.g., -StanPath '~/my/stan/path').
    
    In addition, if you want to show the ridge plots of the posterior distributions
    through option -ridgePlot, make sure that the following R packages are installed:
    
    data.table
    ggplot2
    ggridges
    dplyr
    tidyr
    scales
    

Running:
========

.. code-block:: none

    Once the RBA command script is constructed, it can be run by copying and
    pasting to the terminal. Alternatively (and probably better) you save the 
    script as a text file, for example, called myRBA.txt, and execute it with the 
    following  (assuming on tcsh shell),
    
    nohup tcsh -x myRBA.txt > diary.txt &
    nohup tcsh -x myRBA.txt |& tee diary.txt &
    
    The advantage of the commands above is that the progression is saved into
    the text file diary.txt and, if anything goes awry, can be examined later.
    The 'nohup' command allows the analysis running in the background even if
    the terminal is killed.
    

Examples:
=========

.. code-block:: none

    
    Example 1 --- Simplest scenario. Values from regions are the input from
      each subject. No explanatory variables are considered. Research
      interest is about the population effect at each region.
    
    RBA -prefix output -dataTable myData.txt  \
    
    The above script is equivalent to
    
    RBA -prefix myResult -chains 4 -iterations 1000 -model 1 -EOI 'Intercept' \
    -r2z -dataTable myData.txt  \
    
    The 2nd version above is recommended because of its explicit specifications.
    
    If the data are skewed or have outliers, use Student's t-distribution:
    
    RBA -prefix myResult -chains 4 -iterations 1000 -model 1 -EOI 'Intercept' \
    -distY 'student' -dataTable myData.txt  \
    
    If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
    ...), a speedup feature can be adopted through within-chain parallelization
    with the option -WCP. For example, the script assumes a computer with 24 CPUs
    (6 CPUs per chain):
    
    RBA -prefix myResult -chains 4 -WCP 6 \
    -iterations 1000 -model 1 -EOI 'Intercept' -distY 'student' \
    -dataTable myData.txt  \
    
    The input file 'myData.txt' is a data table in pure text format as below: 
    						     
    Subj  ROI          Y
    S01   lFFA       0.162
    S02   lAmygdala -0.598
    S03   DMNLAG     0.249
    S04   DMNPCC     0.568
    
    If t-statistic (or standard error) values corresponding to the response variable
    Y are available, add the t-statistic (or standard error) values as a column in the input
    data table so that they can be incorporated into the BML model using the option -tstat
    or -se with the following script (assuming the tstat column is named as 'tvalue'),
    
    RBA -prefix myResult -chains 4 -WCP 6 \
    -iterations 1000 -model 1 -EOI 'Intercept' -distY 'student' -tstat tvalue \
    -dataTable myData.txt  \
    
    or (assuming the se column is named as 'SE'),
    
    RBA -prefix myResult -chains 4 -WCP 6 \
    -iterations 1000 -model 1 -EOI 'Intercept' -distY 'student' -se SE \
    -dataTable myData.txt  \
    
    
Example 2 --- 2 between-subjects factors (sex and group):
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. code-block:: none

    
    RBA -prefix output -Subj subject -ROI region -Y zscore -ridgePlot 10 8 \
    -chains 4 -iterations 1000 -model '1+sex+group' \
    -cVars 'sex,group' -EOI 'Intercept,sex,group' \
    -dataTable myData.txt
    
    If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
    ...), a speedup feature can be adopted through within-chain parallelization
    with the option -WCP. For example, consider adding 
    '-WCP 6' on a computer with 24 CPUs.
    
    The input file 'myData.txt' is formatted as below:
    
    subject region  zscore  sex group
    S1      DMNLAG  0.274    F  patient
    S1      DMNLHC  0.443    F  patient
    S2      DMNRAG  0.455    M  contorl
    S2      DMNRHC  0.265    M  control
    
    Notice that the interaction between 'sex' and 'group' is not modeled in
    this case. The option -ridgePlot generates a stacked list of posterior
    distributions in a sequential order among the regions for each effect of
    interest specified through -EOI. The two numbers of 10 and 8 associated
    with the option -ridgePlot specifies the figure window size with 10" wide
    and 8" high.
    
    
    Example 3 --- one between-subjects factor (sex), one within-subject factor (two
    conditions), one between-subjects covariate (age), and one quantitative

variable:
+++++++++

.. code-block:: none

    
    RBA -prefix result -ridgePlot 8 6 -Subj Subj -ROI region -Y value \
    -chains 4 -iterations 1000 -model '1+sex+age+SA' -qVars 'sex,age,SA' \
    -EOI 'Intercept,sex,age,SA' -dataTable myData.txt
    
    If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
    ...), a speedup feature can be adopted through within-chain parallelization
    with the option -WCP. For example, consider adding '-WCP 6' to the script
    on a computer with 24 CPUs.
    
    The input file 'myData.txt' is formatted as below:
    
    Subj   region   value sex  age   SA
    S1    DMNLAG    0.274  1  1.73  1.73
    S1    DMNLHC    0.443  1  1.73  1.73
    S2    DMNRAG    0.455 -1 -0.52  0.52
    S2    DMNRHC    0.265 -1 -0.52  0.52
    
    Notice
    
    1) The 'Y' column is the contrast between the two conditions.
    2) Since we want to model the interaction between 'sex' and 'age', 'sex' is
       coded through deviation coding.
    3) 'age' has already been standardized within each sex due to large age
       difference between the two sexes.
    4) The 'SA' column codes for the interaction between 'sex' and 'age', which
       is the product of the two respective columns.
    
    
    Example 4 --- a more flexible way to specify a model. 
    
    RBA -prefix test -chains 4 -iterations 1000 -mean 'score~1+(1|roi)+(1|subj)' \
    -sigma '1+(1|roi)+(1|subj)' -ROI 'roi' -EOI 'Intercept' -WCP 8 
    -dataTable test.tbl
    
    The input file 'test.tbl' is formatted as below:
    
    subj    roi     score 
    S1    DMNLAG    0.274 
    S1    DMNLHC    0.443
    S2    DMNLAG    0.455 
    S2    DMNLHC    0.265 
    
    Notice
    
    1) The -mean option specifies the formulation for the mean of the likelihood (Gaussian
       in this case).
    2) The -sigma option specifies the formulation for the standard deviation of likelihood
       (Gaussian in this case).
    3) It is important to identify the pivotal variable as 'roi' since the label is different
       from the default ('ROI').
    
    
    Options in alphabetical order:
    
       -chains N: Specify the number of Markov chains. Make sure there are enough
             processors available on the computer. Most of the time 4 cores are good
             enough. However, a larger number of chains (e.g., 8, 12) may help achieve
             higher accuracy for posterior distribution. Choose 1 for a single-processor
             computer, which is only practical only for simple models.
    
       -cVars variable_list: Identify categorical (qualitive) variables (or
             factors) with this option. The list with more than one variable
             has to be separated with comma (,) without any other characters such
             as spaces and should be surrounded within (single or double) quotes.
             For example, -cVars "sex,site"
    
       -dataTable TABLE: List the data structure in a table of long format (cf. wide
             format) in R with a header as the first line. 
    
             NOTE:
    
             1) There should have at least three columns in the table. These minimum
             three columns can be in any order but with fixed and reserved with labels:
             'Subj', 'ROI', and 'Y'. The column 'ROI' is meant to code the regions
             that are associated with each value under the column Y. More columns can
             be added in the table for explanatory variables (e.g., groups, age, site)
             if applicable. Only subject-level (or between-subjects) explanatory variables
             are allowed now. The labels for the columns of 'Subj' and 'ROI'
             can be any identifiable characters including numbers.
             2) Each row is associated with one and only one 'Y' value, which is the
             response variable in the table of long format (cf. wide format) as
             defined in R. With n subjects and m regions, there should have totally mn
             rows, assuming no missing data. 
             3) It is fine to have variables (or columns) in the table that are not used
             in the current analysis.
             4) The context of the table can be saved as a separate file, e.g., called
             table.txt. In the script specify the data with '-dataTable table.txt'.
             This option is useful when: (a) there are many rows in the table so that
             the program complains with an 'Arg list too long' error; (b) you want to
             try different models with the same dataset.
    
       -dbgArgs: This option will enable R to save the parameters in a file called
             .RBA.dbg.AFNI.args in the current directory so that debugging can be
             performed.
    
       -distROI distr_name: Use this option to specify the distribution for the ROIs.
            The default is Gaussian when this option is not invoked. When the number of
             regions is small (e.g., less than 20), consider adopting the Student's
            t-distribution by using this option with 'student'.
    
       -distSubj distr_name: Use this option to specify the distribution for the subjects.
            The default is Gaussian when this option is not invoked. When the number of
             regions is small (e.g., less than 20), consider adopting the Student's
            t-distribution by using this option with 'student'.
    
       -distY distr_name: Use this option to specify the distribution for the response
            variable. The default is Gaussian when this option is not invoked. When
            skewness or outliers occur in the data, consider adopting the Student's
            t-distribution or exGaussian by using this option with 'student' or
            'exgaussian'.
    
       -EOI variable_list: Identify effects of interest in the output by specifying the
             variable names separated with comma (,). For example, -EOI "sex,age".
             By default, the Intercept is considered to be an effect of interest.
             Currently only variables, not their interactions, can be directly
             requested for output. However, most interaction effects can be obtained by
             either properly coding the variables (see example 3) or post processing.
    
       -help: this help message
    
       -iterations N: Specify the number of iterations per Markov chain. Choose 1000 (default)
             for simple models (e.g., one or no explanatory variables). If convergence
             problem occurs as indicated by Rhat being great than 1.1, increase the number of
             iterations (e.g., 2000) for complex models, which will lengthen the runtime.
             Unfortunately, there is no way to predict the optimum iterations ahead of time.
    
       -MD: This option indicates that there are missing data in the input. With n
             regions, at least n(n-1)/2 values are assumed from each subject in the
             input with no missing data (default). When missing data are present,
             invoke this option so that the program will handle it properly.
    
       -mean FORMULA: Specify the formulation for the mean of the likelihood (sampling
              distribution).
    
       -model FORMULA: This option specifies the effects associated with explanatory
             variables. By default, (without user input) the model is specified as
             1 (Intercept). Currently only between-subjects factors (e.g., sex, 
             patients vs. controls) and quantitative variables (e.g., age) are
             allowed. When no between-subject factors are present, simply put 1
             (default) for FORMULA. The expression FORMULA with more than one
             variable has to be surrounded within (single or double) quotes (e.g.,
             '1+sex', '1+sex+age'. Variable names in the formula should be consistent
             with the ones used in the header of data table. A+B represents the
             additive effects of A and B, A:B is the interaction between A
             and B, and A*B = A+B+A:B. Subject as a variable should not occur in
             the model specification here.
    
       -PDP nr nc: Specify the layout of posterior distribution plot (PDP) with nr rows
             and nc columns among the number of plots. For example, with 16 regions,
             you can set nr = 4 and nc = 4. The region names will be shown in each plot.
             So, label the regions concisely.
    
       -prefix PREFIX: Prefix is used to specify output file names. The main output is
            a text with prefix appended with .txt and stores inference information 
            for effects of interest in a tabulated format depending on selected 
            options. The prefix will also be used for other output files such as 
            visualization plots and for saved R data in binary format. The .RData can
            be used for post hoc processing such as customized processing and plotting.
            Remove the .RData file to save disk space once you deem such a file is no
            longer useful.
    
       -qContr contrast_list: Identify comparisons of interest between quantitative
             variables in the output separated with comma (,). It only allows for
             pair-wise comparisons between two quantitative variables. For example,
             -qContr "age vs IQ, age vs weight, IQ vs weight", where V1, V2, and V3 are three
             quantitative variables and three comparisons, V1 - V2, V1 - V3 and V2 - V3
             will be provided in the output. Make sure that such comparisons are
             meaningful (e.g., with the same scale and unit. This can be used to
             formulate comparisons among factor levels if the user quantitatively
             codes the factor levels.
    
       -qVars variable_list: Identify quantitative variables (or covariates) with
             this option. The list with more than one variable has to be
             separated with comma (,) without any other characters such as
             spaces and should be surrounded within (single or double) quotes.
             For example, -qVars "Age,IQ"
    
       -r2z: This option performs Fisher transformation on the response variable
             (column Y) if it is correlation coefficient.
    
       -ridgePlot width height: This option will plot the posterior distributions stacked
             together in a sequential order, likely preferable to the one generated
             with option -PDP. The size of the figure window is specified through the
             two parameters of width and height in inches. You can fine-tune the plot
             yourself by loading up the *.RData file if you know the tricks.
    
       -ROI var_name: var_name is used to specify the column name that is designated as
            as the region variable. The default (when this option is not invoked) is
            'ROI'.
    
       -scale d: Specify a multiplier for the Y values. When the values for response
             are too small or large, it may create a convergence problem for MCMC. To
             avoid the problem, set a scaling factor so that the range of value is
             around 1-10. The results will be adjusted back to the original scale.
    
       -se: This option indicates that standard error for the response variable is
             available as input, and a column is designated for the standard error
             in the data table. If effect estimates and their t-statistics are the
             output from preceding analysis, standard errors can be obtained by
             dividing the effect estimates ('betas') by their t-statistics. The
             default assumes that standard error is not part of the input.
    
       -show_allowed_options: list of allowed options
    
       -sigma FORMULA: Specify the formulation for the standard deviation (sigma) of the
              likelihood (sampling distribution). When this option is absent in the
              script, it is assumed to be 1, meaning a single parameter for the variance
              (homogeneity).
    
       -stdz variable_list: Identify quantitative variables (or covariates) to be
             standardized. To obtain meaningful and interpretable results and to
             achieve better convergence of Markov chains with reasonable iterations,
             it is recommended that all quantitative variables be standardized
             except for the response variable and indicator variables that code for
             factors. For example, -stdz "Age,IQ". If the mean of a quantitative
             variable varies substantially between groups, it may make sense to
             standardize the variable within each group before plugging the values
             into the data table. Currently RBA does not offer the option to perform
             within-group standardization.
    
       -Subj var_name: var_name is used to specify the column name that is designated as
            as the measuring unit variable (usually subject). The default (when this
            option is not invoked) is 'Subj'.
    
       -tstat var_name: var_name is used to specify the column name that lists
            the t-statistic values, if available, for the response variable 'Y'.
            In the case where standard errors are available for the effect
            estimates of 'Y', use the option -se.
    
       -verb VERB: Specify verbose level.
    
       -WCP k: This option will invoke within-chain parallelization to speed up runtime.
             To take advantage of this feature, you need the following: 1) at least 8
             or more CPUs; 2) install 'cmdstan'; 3) install 'cmdstanr'. The value 'k'
             is the number of threads per chain that is requested. For example, with 4
             chains on a computer with 24 CPUs, you can set 'k' to 6 so that each
             chain will be assigned with 6 threads.
    
       -Y var_name: var_name is used to specify the column name that is designated as
            as the response/outcome variable. The default (when this option is not
            invoked) is 'Y'.