:orphan:

.. _ahelp_MBA:

***
MBA
***

.. contents:: :local:


| 

    
Welcome to MBA
==============

.. code-block:: none

        Matrix-Based Analysis Program through Bayesian Multilevel Modeling 
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    Version 1.0.6, Feb 13, 2022
    Author: Gang Chen (gangchen@mail.nih.gov)
    Website - https://afni.nimh.nih.gov/gangchen_homepage
    SSCC/NIMH, National Institutes of Health, Bethesda MD 20892
    #+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    

Usage:
======

.. code-block:: none

     MBA performs matrix-based analysis (MBA) as theoretically elaborated in the
     manuscript: https://www.biorxiv.org/content/10.1101/459545v1
     MBA is conducted with a shell script (as shown in the examples below). The
     input data should be formulated in a pure-text table that codes the regions
     and variables. The response variable is usually correlation values (with or
     without Fisher-transformation) or white-matter properties (e.g., fractional
     anisotropy, mean diffusivity, radial diffusivity, axial diffusivity, etc.),
     but it can also be any values from a symmetric matrix (e.g., coherence,
     mutual information, entropy). In other words, the effects are assumed to be
     non-directional or non-causal. Diagonals can be included in the input if
     sensible.
    
     Thanks to Zhihao Li for motivating me to start the MBA work, and to 
     Paul-Christian Bürkner and the Stan/R communities for their strong support.
    

Citation:
=========


If you want to cite the approach for MBA, consider the following
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. code-block:: none

    
     Chen, G., Bürkner, P.-C., Taylor, P.A., Li, Z., Yin, L., Glen, D.R., Kinnison, J.,
     Cox, R.W., Pessoa, L., 2019. An Integrative Approach to Matrix-Based Analyses in
     Neuroimaging. Human Brain Mapping. In press. https://doi.org/10.1101/459545
    
     Read the following carefully!!!
     A data table in pure text format is needed as input for an MBA script. The
     data table should contain at least 4 columns that specify the information
     about subjects, region pairs and the response variable values with the
     following fixed header. The header lables are case-sensitive, and their order
     does not matter.
    
     Subj   ROI1   ROI2   Y      Age 
     S1     Amyg   SMA   0.2643  11
     S2     BNST   MPFC  0.3762  16 
    
     0) You are performing Bayesian analysis!!! So, you will directly obtain
        the probability of an effect being positive or negative with your data,
        instead of witch hunt - hunting the straw man of p-value (weirdness of your
        data when pretending that absolutely nothing exists).
    
     1) Avoid using pure numbers to code the labels for categorical variables. The
        column order does not matter. . You can specify those column names as you
        prefer, but it saves a little bit scripting if you adopt the default naming
        for subjects ('Subj'), regions ('ROI1' and 'ROI2') and response variable ('Y'). 
        The column labels ROI1 and ROI2 are meant to indicate the two regions
        associated with each response value, and they do not mean any sequence or
        directionality.
    
     2) Only provide half of the off-diagonals in the table (no duplicates allowed).
        Missing data are fine (e.g., white-matter property deemed nonexistent).
    
     3) Simple analysis can be done in a few minutes, but computational cost can be
        very high (e.g., weeks or even months) when the number of regions or subjects
        is large or when a few explanatory variables are involved. Be patient: there
        is hope in the near future that further parallelization can be implemented.
    
     4) Add more columns if explanatory variables are considered in the model. Currently
        only between-subjects variables (e.g., sex, patients vs. controls, age) are
        allowed. Each label in a between-subjects factor (categorical variable)
        should be coded with at least 1 character (labeling with pure numbers is fine
        but not recommended). If preferred, you can quantitatively code the levels of a
        factor yourself by creating k-1 columns for a factor with k levels. However, be
        careful with your coding strategy because it would impact how to interpret the
        results. Here is a good reference about factor coding strategies:
        https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
    
     5) It is strongly suggested that a quantitative explanatory variable be
        standardized with option -stdz; that is, remove the mean and scale by
        the standard deviation. This will improve the chance of convergence
        with each Markov chain. If a between-subjects factor (e.g., sex) is
        involved, it may be better to standardize a quantitative variable
        within each group in terms of interpretability if the mean value differs
        substantially. However, do not standardize a between-subjects factor if
        you quantitatively code it. And do not standardize the response variable
        if the intercept is of interest!
    
     6) With within-subject variables, try to formulate the data as a contrast
        between two factor levels or as a linear combination of multiple levels.
    
     7) The results from MBA are effect estimates for each region pair and at each
        region. They can be slightly different across different runs or different
        computers and R package versions due to the nature of randomness involved
        in Monte Carlo simulations.
    
     8) The range in matrix plot may vary across different effects within an analysis.
        It is possible to force the same range for all plots through fine-tuning
        within R using the output of .RData. The criteria of color coding for the
        strength of evidence in matrix plots in the output is as follows:
         Green  - two-tailed 95% compatible/uncertainty interval (or probability of effect
                  being positive >= 0.975 or <= 0.025)
         Yellow - one-tailed 95% compatible/uncertainty interval (or probability of effect
                  being positive >= 0.95 or <= 0.05)
         Gray   - one-tailed 90% compatible/uncertainty interval (or probability of effect
                  being positive >= 0.90 or <= 0.10)
         white  - anything else
    
     
Installation requirements:
==========================

.. code-block:: none

     In addition to R installation, the R package "brms" is required for MBA. Make
     sure you have the most recent version of R. To install "brms", run the following
     command at the terminal:
    
     rPkgsInstall -pkgs "brms" -site http://cran.us.r-project.org"
    
     Alternatively you may install them in R:
    
     install.packages("brms")
    
     *** To take full advantage of parallelization, install both 'cmdstan' and 
     'cmdstanr' and use the option -WCP in MBA. However, extra stpes are required: 
     both 'cmdstan' and 'cmdstanr' have to be installed. To install 'cmdstanir',
     execute the following command in R:
     
     install.packages('cmdstanr', repos = c('https://mc-stan.org/r-packages/', getOption('repos')))
    
     Then install 'cmdstan' using the following command in R:
    
     cmdstanr::install_cmdstan(cores = 2)
    # Follow the instruction here for the installation of 'cmdstan': 
    #    https://mc-stan.org/cmdstanr/articles/cmdstanr.html
    # If 'cmdstan' is installed in a directory other than home, use option -StanPath 
    # to specify the path (e.g., -StanPath '~/my/stan/path').
     

Running:
========

.. code-block:: none

     Once the MBA command script is constructed, it can be run by copying and
     pasting to the terminal. Alternatively (and probably better) you save the 
     script as a text file, for example, called myMBA.txt, and execute it with the 
     following  (assuming on tcsh shell),
     
     nohup tcsh -x myMBA.txt > diary.txt &
     nohup tcsh -x myMBA.txt |& tee diary.txt &
    
     The advantage of the commands above is that the progression is saved into
     the text file diary.txt and, if anything goes awry, can be examined later.
     The 'nohup' command allows the analysis running in the background even if
     the terminal is killed.
    

Examples:
=========

.. code-block:: none

    
    Example 1 --- Simplest scenario. Values from region pairs are the input from
              each subject. No explanatory variables are considered. Research
    	  interest is about the population effect at each region pair plus
    	  the relative strength of each region.
    
       MBA -prefix output -r2z -dataTable myData.txt  \
    
       The above script is equivalent to
    
       MBA -prefix myWonderfulResult -chains 4 -iterations 1000 -model 1 -EOI 'Intercept' \
       -r2z -dataTable myData.txt  \
    
       The 2nd version is recommended because of its explicit specifications.
    
       If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
       ...), a speedup feature can be adopted through within-chain parallelization
       with the option -WCP. For example, the script assumes a computer with 24 CPUs
       (6 CPUs per chain):
    
       MBA -prefix myWonderfulResult -chains 4 -WCP 6 \
           -iterations 1000 -model 1 -EOI 'Intercept' -r2z -dataTable myData.txt  \
    
       The input file 'myData.txt' is a data table in pure text format as below: 
                                                                 
         Subj  ROI1   ROI2      Y
         S01   lFFA lAmygdala  0.162
         S02   lFFA lAmygdala -0.598
         S03   lFFA lAmygdala  0.249
         S04   lFFA lAmygdala  0.568
    
       If the data is skewed or has outliers, consider using the Student t-distribution
       through the option -distY:
    
       MBA -prefix myWonderfulResult -chains 4 -WCP 6 \
           -iterations 1000 -model 1 -EOI 'Intercept' -distY 'student' -dataTable myData.txt  \ 
    
       If t-statistic (or standard error) values corresponding to the response variable
       Y are available, add the t-statistic (or standard error) values as a column in the input
       data table so that they can be incorporated into the BML model using the option -tstat
       or -se with the following script (assuming the tstat column is named as 'tvalue),
    
       MBA -prefix myWonderfulResult -chains 4 -WCP 6 \
           -iterations 1000 -model 1 -EOI 'Intercept' -tstat tvalue -dataTable myData.txt  \
     
       or (assuming the se column is named as 'SE'),
    
       MBA -prefix myWonderfulResult -chains 4 -WCP 6 \
           -iterations 1000 -model 1 -EOI 'Intercept' -se SE -dataTable myData.txt  \
    
     
Example 2 --- 2 between-subjects factors (sex and group):
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

.. code-block:: none

    
       MBA -prefix output -Subj subject -ROI1 region1 -ROI2 region2 -Y zscore\
       -chains 4 -iterations 1000 -model '1+sex+group' \
       -cVars 'sex,group' -r2z -EOI 'Intercept,sex,group' \
       -dataTable myData.txt
    
       If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
       ...), a speedup feature can be adopted through within-chain parallelization
       with the option -WCP. For example, For example, consider 
       adding '-WCP 6' on a computer with 24 CPUs.
    
       The input file 'myData.txt' is formatted as below:
       
       subject region1 region2  zscore  sex group
       S1      DMNLAG  DMNLHC 0.274   F  patient
       S1      DMNLAG  DMNPCC 0.443   F  patient
       S2      DMNLAG  DMNRAG 0.455   M  contorl
       S2      DMNLAG  DMNRHC 0.265   M  control
    
       Notice that the interaction between 'sex' and 'group' is not modeled in this case.
    
    
    Example 3 --- one between-subjects factor (sex), one within-subject factor (two

conditions), and one quantitative variable:
+++++++++++++++++++++++++++++++++++++++++++

.. code-block:: none

        
       MBA -prefix result -chains 4 -iterations 1000 -model '1+sex+age+SA' \
       -qVars 'sex,age,SA' -r2z -EOI 'Intercept,sex,age,SA' \
       -dataTable myData.txt
    
       If a computer is equipped with as many CPUs as a factor 4 (e.g., 8, 16, 24,
       ...), a speedup feature can be adopted through within-chain parallelization
       with the option -WCP. For example, For example, consider adding 
       '-WCP 6' on a computer with 24 CPUs.
    
       The input file 'myData.txt' is formatted as below:
       
       Subj ROI1  ROI2    Y   sex  age    SA
       S1 DMNLAG DMNLHC 0.274  1  1.73  1.73
       S1 DMNLAG DMNPCC 0.443  1  1.73  1.73
       S2 DMNLAG DMNRAG 0.455 -1 -0.52 -0.52
       S2 DMNLAG DMNRHC 0.265 -1 -0.52 -0.52
    
       Notice
    
       1) the 'Y' column is the contrast between the two conditions.
       2) since we want to model the interaction between 'sex' and 'age', 'sex' is
          coded through deviation coding.
       3) 'age' has already been standardized within each sex due to large age
          difference between the two sexes.
       4) the 'SA' column codes for the interaction between 'sex' and 'age', which
          is the product of the two respective columns.
       

Options:
========

.. code-block:: none

       
    Options in alphabetical order:
    
       -chains N: Specify the number of Markov chains. Make sure there are enough
             processors available on the computer. Most of the time 4 cores are good
             enough. However, a larger number of chains (e.g., 8, 12) may help achieve
             higher accuracy for posterior distribution. Choose 1 for a single-processor
             computer, which is only practical only for simple models.
    
       -cVars variable_list: Identify categorical (qualitive) variables (or
             factors) with this option. The list with more than one variable
             has to be separated with comma (,) without any other characters such
             as spaces and should be surrounded within (single or double) quotes.
             For example, -cVars "sex,site"
    
       -dataTable TABLE: List the data structure in a table of long format (cf. wide
             format) in R with a header as the first line. 
    
             NOTE:
    
             1) There should have at least four columns in the table. These minimum
             four columns can be in any order but with fixed and reserved with labels:
             'Subj', 'ROI1', 'ROI2', and 'Y'. The two columns 'ROI1' and 'ROI2' are
             meant to code the two regions that are associated with each value under the
             column Y, and they do not connotate any indication of directionality other
             than you may want to keep track of a consistent order, for example, in the
             correlation matrix. More columns can be added in the table for explanatory
             variables (e.g., groups, age, site) if applicable. Only subject-level
             (or between-subjects) explanatory variables are allowed at the moment. The 
             columns of 'Subj', 'ROI1' and 'ROI2' code each subject and the two regions 
             associated with each region pair, and these labels that can be any identifiable
             characters including numbers. The column 'Y' can be correlation value, 
             Fisher-transformed correlation value, or white-matter property between 
             grey-matter regions (e.g., mean diffusivity, fractional anisotropy, radial
             diffusivity and axial diffusivity).
             2) Each row is associated with one and only one 'Y' value, which is the
             response variable in the table of long format (cf. wide format) as
             defined in R. In the case of correlation matrix or white-matter property
             matrix, provide only half of the off-diagonals. With n regions, there
             should have at least n(n-1)/2 rows per subject, assuming no missing data. 
             3) It is fine to have variables (or columns) in the table that are
             not used in the current analysis.
             4) The context of the table can be saved as a separate file, e.g., called
             table.txt. In the script specify the data with '-dataTable table.txt'.
             This option is useful when: (a) there are many rows in the table so that
             the program complains with an 'Arg list too long' error; (b) you want to
             try different models with the same dataset.
    
       -dbgArgs: This option will enable R to save the parameters in a file called
             .MBA.dbg.AFNI.args in the current directory so that debugging can be
             performed.
    
       -distY distr_name: Use this option to specify the distribution for the response
            variable. The default is Gaussian when this option is not invoked. When
            skewness or outliers occur in the data, consider adopting the Student's
            t-distribution or exGaussian by using this option with 'student' or
            'exgaussian'.
    
       -EOI variable_list: Identify effects of interest in the output by specifying the
             variable names separated with comma (,). For example, -EOI "sex,age".
             By default the Intercept is considered to be an effect of interest.
             Currently only variables, not their interactions, can be directly
             requested for output. However, most interaction effects can be obtained by
             either properly coding the variables (see example 3) or post processing.
    
       -fullRes: Use the option to indicate that a full set of results is shown in the
             the report. When option is not invoked (default), only those region pairs
             whose effect reaches at least 90% quantile are shown. 
    
       -help: this help message
    
       -iterations N: Specify the number of iterations per Markov chain. Choose 1000 (default)
             for simple models (e.g., one or no explanatory variables). If convergence
             problem occurs as indicated by Rhat being great than 1.1, increase the number of
             iterations (e.g., 2000) for complex models, which will lengthen the runtime.
             Unfortunately there is no way to predict the optimum iterations ahead of time.
    
       -MD: This option indicates that there are missing data in the input. With n
             regions, at least n(n-1)/2 values are assumed from each subject in the
             input with no missing data (default). When missing data are present,
             invoke this option so that the program will handle it properly.
    
       -model FORMULA: This option specifies the effects associated with explanatory
             variables. By default (without user input) the model is specified as
             1 (Intercept). Currently only between-subjects factors (e.g., sex, 
             patients vs. controls) and quantitative variables (e.g., age) are
             allowed. When no between-subject factors are present, simply put 1
             (default) for FORMULA. The expression FORMULA with more than one
             variable has to be surrounded within (single or double) quotes (e.g.,
             '1+sex', '1+sex+age'. Variable names in the formula should be consistent
             with the ones used in the header of data table. A+B represents the
             additive effects of A and B, A:B is the interaction between A
             and B, and A*B = A+B+A:B. Subject as a variable should not occur in
             the model specification here.
    
       -prefix PREFIX: Prefix is used to specify output file names. The main output is
            a text with prefix appended with .txt and stores inference information 
            for effects of interest in a tabulated format depending on selected 
            options. The prefix will also be used for other output files such as 
            visualization plots such as matrix plot, and saved R data in binary
            mode. The .RData can be used for post hoc processing such as customized
            processing and plotting. Remove the .RData file to save disk space once
            you deem such a file is no longer useful.
    
       -qContr contrast_list: Identify comparisons of interest between quantitative
             variables in the output separated with comma (,). It only allows for
             pair-wise comparisons between two quantitative variables. For example,
             -qContr "age vs IQ, age vs weight, IQ vs weight", where V1, V2, and V3 are three
             quantitative variables and three comparisons, V1 - V2, V1 - V3 and V2 - V3
             will be provided in the output. Make sure that such comparisons are
             meaningful (e.g., with the same scale and unit. This can be used to
             formulate comparisons among factor levels if the user quantitatively
             codes the factor levels.
    
       -qVars variable_list: Identify quantitative variables (or covariates) with
             this option. The list with more than one variable has to be
             separated with comma (,) without any other characters such as
             spaces and should be surrounded within (single or double) quotes.
             For example, -qVars "Age,IQ"
    
       -r2z: This option performs Fisher transformation on the response variable
             (column Y) if it is correlation coefficient. Do not invoke the option
             if the transformation has already been applied or the variable is
             not correlation coefficient.
    
       -ROI1 var_name: var_name is used to specify the column name that is designated as
            as the region variable for the first set of each region pair. The default
            (when this option is not invoked) is 'ROI1'.
    
       -ROI2 var_name: var_name is used to specify the column name that is designated as
            as the region variable for the second set of each region pair. The default
            (when this option is not invoked) is 'ROI2'.
    
       -ROIlist file: List all the regions in a text file with one column in an order
             preferred in the the plots. When the option is not invoked, the region
             order in the plots may not be in a preferred order.
    
       -se: This option indicates that standard error for the response variable is
             available as input, and a column is designated for the standard error
             in the data table. If effect estimates and their t-statistics are the
             output from preceding analysis, standard errors can be obtained by
             dividing the effect estimatrs ('betas') by their t-statistics. The
             default assumes that standard error is not part of the input.
    
       -show_allowed_options: list of allowed options
    
       -stdz variable_list: Identify quantitative variables (or covariates) to be
             standardized. To obtain meaningful and interpretable results and to
             achieve better convergence of Markov chains with reasonable iterations,
             it is recommended that all quantitative variables be standardized
             except for the response variable and indicator variables that code for
             factors. For example, -stdz "Age,IQ". If the mean of a quantitative
             variables varies substantially between groups, it may make sense to
             standardize the variable within each group before plugging the values
             into the data table. Currently MBA does not offer the option to perform
             within-group standardization.
    
       -Subj var_name: var_name is used to specify the column name that is designated as
            as the measuring unit variable (usually subject). The default (when this
            option is not invoked) is 'Subj'.
    
       -tstat var_name: var_name is used to specify the column name that lists
            the t-statistic values, if available, for the response variable 'Y'.
            In the case where standard errors are available for the effect
            estiamtes of 'Y', use the option -se.
    
       -verb VERB: Speicify verbose level.
    
       -WCP k: This option will invoke within-chain parallelization to speed up runtime.
             To take advantage of this feature, you need the following: 1) at least 8
             or more CPUs; 2) install 'cmdstan'; 3) install 'cmdstanr'. The value 'k'
             is the number of thread per chain that is requested. For example, with 4
             chains on a computer with 24 CPUs, you can set 'k' to 6 so that each
             chain will be assigned with 6 threads.
    
       -Y var_name: var_name is used to specify the column name that is designated as
            as the response/outcome variable. The default (when this option is not
            invoked) is 'Y'.