================== Welcome to PTA ==================
Program for Profile Tracking Analysis (PTA)
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Version 0.0.5, Oct 11, 2023
Author: Gang Chen (gangchen@mail.nih.gov)
Website - https://afni.nimh.nih.gov/gangchen_homepage
SSCC/NIMH, National Institutes of Health, Bethesda MD 20892, USA
#+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Introduction
------
Profile Tracking Analysis (PTA) estimates nonlinear trajectories or profiles
through smoothing splines. Currently the program PTA only works through a
command-line scripting mode. Check the examples below: find one close to your
specific scenario and use it as a template. The underlying theory is covered in
the following paper:
Chen, G., Nash, T.A., Cole, K.M., Kohn, P.D., Wei, S.-M., Gregory, M.D.,
Eisenberg, D.P., Cox, R.W., Berman, K.F., Shane Kippenhan, J., 2021. Beyond
linearity in neuroimaging: Capturing nonlinear relationships with application to
longitudinal studies. NeuroImage 233, 117891.
https://doi.org/10.1016/j.neuroimage.2021.117891
To be able to run PTA, one needs to have the R package "mgcv" installed with
the following command at the terminal:
rPkgsInstall -pkgs "mgcv"
Alternatively, you may install it in R:
install.packages("mgcv")
When a factor (e.g., groups, conditions) is involved, numerical coding is
required in formulating the data information. See Examples 3 and 4. The
following website provides some explanations regarding factor coding that
might be useful for modeling formulation:
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
There are two output files generated by PTA: one (with the affix -stat.txt)
contains the information about the statistical evidence for various effects
while the other (with the affix -prediction.txt) tabulates the predicted
values and their standard errors which can be utilized to illustrate the
inferred trajectories or trends (e.g., using graphical tools such as ggplot2
in R).
Example 1 --- simplest case: one group of subjects with a between-subject
quantitative variable that does not vary within subject. Analysis is
set up to model the trajectory or trend along age:
PTA -prefix age \
-input data.txt \
-model 's(age)' \
-Y height \
-prediction pred.txt
The function 's(age)' indicates that 'age' is modeled via a smooth curve.
No empty space is allowed in the model formulation.
The file pred.txt lists all the explanatory variables (excluding lower-level variables
such as subject) for prediction. The file should be in a data.frame format as below:
age
10
12
14
20
22
24
...
The age step in the above example is 2 years. To obtain a smoother graphical appearance
in plotted profiles, one can set the age values in pred.txt with a smaller grid size of,
for example, 0.5.
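A prediction table with a fine grid is tedious to type by hand. The Python sketch below is one possible way to generate it; it is not part of PTA. The helper names, the age range (10 to 24) and the step (0.5) are illustrative assumptions.

```python
# Sketch: build a one-column prediction table with an evenly spaced age grid.
# The range and step are assumptions chosen to match the example above.

def age_grid(start, stop, step):
    """Return evenly spaced values from start to stop (inclusive)."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 6) for i in range(n + 1)]

def pred_table_text(ages, header="age"):
    """Format the grid as a table: header row first, one value per line."""
    lines = [header] + [f"{a:g}" for a in ages]
    return "\n".join(lines) + "\n"

ages = age_grid(10, 24, 0.5)
text = pred_table_text(ages)

if __name__ == "__main__":
    # Write the table to the file name used with -prediction in Example 1.
    with open("pred.txt", "w") as f:
        f.write(text)
```

The resulting pred.txt can then be passed to PTA through the option -prediction.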
The file data.txt stores the information for all the variables and input data in a
data.frame format as below:
Subj age height
S1 24 175
S2 14 163
...
The subject labels in the above table can be characters or mixtures of characters
and numbers, but they cannot be pure numbers.
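The restriction on subject labels can be checked before running PTA. The following Python sketch is not part of PTA. It assumes a whitespace-separated table with a header row and a subject column named Subj, and it treats any label that parses as a floating-point number as "pure numbers".

```python
# Sketch: list subject labels that would be read as pure numbers.
# The column name "Subj" and the whitespace-separated layout are assumptions.

def is_pure_number(label):
    """Return True if the label parses as a number (e.g., '12' or '3.5')."""
    try:
        float(label)
        return True
    except ValueError:
        return False

def numeric_subject_labels(table_text, subj_col="Subj"):
    """Return the subject labels in the table that are purely numeric."""
    rows = [line.split() for line in table_text.strip().splitlines()]
    header, body = rows[0], rows[1:]
    idx = header.index(subj_col)
    return [r[idx] for r in body if is_pure_number(r[idx])]

sample = "Subj age height\nS1 24 175\n2 14 163\n"
bad_labels = numeric_subject_labels(sample)  # labels that need renaming
```

Any label returned by the check could be fixed by adding a letter prefix such as S.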
There will be two output files, one age-stat.txt and the other age-prediction.txt:
the former shows the statistical evidence; the latter contains a predicted value
for each age plus the associated uncertainty (standard error), which can be
plotted using tools such as ggplot2.
Example 2 --- Largely same as Example 1, but with 'age' as a within-subject
quantitative variable (varying within each subject). The model is now
specified by replacing the line of -model in Example 1 with the following
two lines:
-model 's(age)+s(Subj,bs="re")' \
-vt Subj 's(Subj)' \
The second term 's(Subj,bs="re")' in the model specification means that
each subject is allowed to have a varying intercept or random effect ('re').
To estimate the smooth trajectory through the option -prediction, the option
-vt has to be included in this case to indicate the varying term (usually
subjects). That is, if prediction is desirable, one has to explicitly
declare the variable (e.g., Subj) that is associated with the varying term
(e.g., s(Subj)). No empty space is allowed in the model formulation or in
the varying term.
The full script version is
PTA -prefix age2 \
-input data.txt \
-model 's(age)+s(Subj,bs="re")' \
-vt Subj 's(Subj)' \
-Y height \
-prediction pred.txt
All the rest remains the same as Example 1.
Example 3 --- two groups and one quantitative variable (age). The analysis is
set up to compare the trajectory or trend along age between the two groups,
which are quantitatively coded as -1 and 1. For example, if the two groups
are females and males, you can code females as -1 and males as 1. The following
script applies to the situation when the quantitative variable age does not vary
within subject,
PTA -prefix age3a \
-input data.txt \
-model 's(age)+s(age,by=MvsF)' \
-prediction pred.txt
The input table in the file data.txt contains the following structure:
Subj age grp MvsF
S1 27 M 1
S2 21 M 1
S3 28 F -1
S4 18 F -1
...
The column grp above is not necessary for modeling, but it is included to
be more indicative for the prediction values in the output file
age3a-prediction.txt
Similarly, the prediction file pred.txt looks like (set the age values with
a small grid so that the graphical illustration would be smooth):
age grp MvsF
10 M 1
12 M 1
...
28 M 1
30 M 1
10 F -1
12 F -1
...
28 F -1
30 F -1
Note that the age values for prediction have a gap of 2 years: the smaller
the gap, the smoother the plotted predictions.
On the other hand, go with the script below when the quantitative variable age
varies within subject,
PTA -prefix age3b \
-input data.txt \
-model 's(age)+s(age,by=MvsF)+s(Subj,bs="re")' \
-vt Subj 's(Subj)' \
-prediction pred.txt
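The two-group prediction table of Example 3 repeats the same age grid for each group together with its numerical code. A minimal Python sketch of that layout follows; it is not part of PTA. The grid (10 to 30 in steps of 0.5) is an assumption, while the column names and the coding (M as 1, F as -1) follow the tables shown in the example.

```python
# Sketch: build the rows of a two-group prediction table (age, grp, MvsF).
# The age grid is an assumption; the coding M=1, F=-1 follows Example 3.

def two_group_pred_rows(ages, coding):
    """Return the header plus one row per (group, age) combination."""
    rows = [("age", "grp", "MvsF")]
    for grp, code in coding.items():
        for a in ages:
            rows.append((f"{a:g}", grp, str(code)))
    return rows

coding = {"M": 1, "F": -1}
ages = [10 + 0.5 * i for i in range(41)]  # 10, 10.5, ..., 30
rows = two_group_pred_rows(ages, coding)

if __name__ == "__main__":
    # Write the rows as a space-separated table for use with -prediction.
    with open("pred.txt", "w") as f:
        for r in rows:
            f.write(" ".join(r) + "\n")
```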
Example 4 --- This example demonstrates the situations where more than two
levels are involved in a between-individual factor. Suppose that there are
three groups and one quantitative variable (age). The analysis is
set up to compare the trajectory or trend along age between the three groups,
A, B and C, that are quantitatively represented using dummy coding.
PTA -prefix age4a \
-input data.txt \
-model 's(age)+s(age,by=AvC)+s(age,by=BvC)' \
-prediction pred.txt
The input table in the file data.txt contains the following structure:
Subj age grp AvC BvC
S1 27 A 1 0
S2 21 A 1 0
S3 17 B 0 1
S4 24 B 0 1
S5 28 C 0 0
S6 18 C 0 0
...
The column grp above is not necessary for modeling, but it is included to
be more indicative for the prediction values in the output file
age4a-prediction.txt
On the other hand, go with the script below when the quantitative variable age
varies within subject,
PTA -prefix age4b \
-input data.txt \
-model 's(age)+s(age,by=AvC)+s(age,by=BvC)+s(Subj,bs="re")' \
-vt Subj 's(Subj)' \
-prediction pred.txt
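The dummy-coded columns of Example 4 can be derived from the group label rather than typed by hand. The Python sketch below is not part of PTA. It assumes group C is the reference level and names the columns AvC and BvC as in the -model line.

```python
# Sketch: dummy coding for a three-level factor with C as the reference.
# The helper name and its arguments are illustrative assumptions.

def dummy_code(grp, reference="C", levels=("A", "B", "C")):
    """Return the dummy-coded columns (e.g., AvC, BvC) for one group label."""
    if grp not in levels:
        raise ValueError(f"unknown group label: {grp}")
    return {f"{lv}v{reference}": int(grp == lv)
            for lv in levels if lv != reference}

# One row per subject: label, age, group, then the two dummy columns.
subjects = [("S1", 27, "A"), ("S3", 17, "B"), ("S5", 28, "C")]
table = [("Subj", "age", "grp", "AvC", "BvC")]
for subj, age, grp in subjects:
    codes = dummy_code(grp)
    table.append((subj, age, grp, codes["AvC"], codes["BvC"]))
```

With this coding the reference group C has 0 in both columns, so each by= term in the model describes how one group deviates from C.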
Example 5 --- Suppose that we compare the profiles between two conditions
across space or time that is expressed as a variable x. In this case
profile estimation and statistical inference are separated into two steps.
First, estimate the profile for each condition using Example 1 or Example 2
as a template. Then, make inference about the contrast between the two
conditions. Obtain the contrast at each value of x for each individual, and
use the difference values as input. Specify the model as below if there are
multiple individuals:
-model 's(x)+s(id,bs="re")' \
-vt id 's(id)' \
For one individual, change the model to
-model 's(x)' \
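The difference values used as input in Example 5 can be computed as sketched below in Python. This is not part of PTA. The record layout (id, x, condition, value) and the condition labels are assumptions, and values of x at which an individual lacks either condition are skipped.

```python
# Sketch: per-individual contrast between two conditions at each value of x.
# Records are (id, x, condition, value); this layout is an assumption.

def condition_contrast(records, cond_a, cond_b):
    """Return (id, x, value_a - value_b) wherever both conditions exist."""
    values = {(subj, x, cond): y for subj, x, cond, y in records}
    out = []
    for subj, x, cond in sorted(values):
        if cond == cond_a and (subj, x, cond_b) in values:
            diff = values[(subj, x, cond_a)] - values[(subj, x, cond_b)]
            out.append((subj, x, diff))
    return out

records = [
    ("S1", 1, "A", 2.0), ("S1", 1, "B", 0.5),
    ("S1", 2, "A", 3.0), ("S1", 2, "B", 1.0),
    ("S2", 1, "A", 1.0),  # no condition B at x=1, so this row is skipped
]
contrast = condition_contrast(records, "A", "B")
```

The resulting rows, written out with a header such as "id x Y", would form the input table for the contrast model.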
Options in alphabetical order:
------------------------------
-dbgArgs: This option will enable R to save the parameters in a
file called .PTA.dbg.AFNI.args in the current directory
so that debugging can be performed.
-h: this help message
-help: this help message
-input file: input file in a table format (same as the data frame structure of long
format in R). Use the first row to specify the column names. The subject column,
if applicable, should not be purely numbers. On the other hand, factors (groups,
tasks) should be numerically coded using convenient coding methods such as
deviation or dummy coding.
-interactive: Currently unavailable.
-model FORMULA: Specify the model formulation through multilevel smoothing splines.
The expression FORMULA with more than one variable has to be surrounded within
(single or double) quotes. Variable names in the formula should be
consistent with the ones used in the header of the input file.
The nonlinear trajectory is specified through the expression of s(x,k=?)
where s() indicates a smooth function, x is a quantitative variable with
which one would like to trace the trajectory and k is the number of smooth
splines (knots). The default (when k is missing) for k is 10, which is good
enough most of the time when there are more than 10 data points of x. When
there are fewer than 10 data points of x, choose a value of k slightly less
than the number of data points.
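The rule of thumb for k can be turned into a small helper that builds the smooth term. The Python sketch below is not part of PTA and rests on several assumptions: it counts distinct values of x as "data points", it takes "slightly less" to mean one less, and it also reduces k when there are exactly 10 distinct values, a case the rule above does not cover.

```python
# Sketch: pick the smooth term for x following the rule of thumb for k.
# Counting distinct values and using a margin of 1 are assumptions.

def suggest_smooth_term(var, x_values, default_k=10, margin=1):
    """Return 's(var)' or 's(var,k=N)' with no empty space in the term."""
    n = len(set(x_values))
    if n > default_k:
        return f"s({var})"          # the default k is kept
    return f"s({var},k={n - margin})"

term_many = suggest_smooth_term("age", range(20))
term_few = suggest_smooth_term("age", [1, 2, 3, 4, 5, 6, 7])
```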
-prediction TABLE: Provide a data table so that predicted values could be generated for
graphical illustration. Usually the table should have a structure similar to that of the input
file except that columns for those varying smoothing terms (e.g., subject) and response
variable (i.e., Y) should not be included. Try to specify equally-spaced values with a small
grid size for the quantitative variable of modeled trajectory (e.g., age) so that smooth curves
could be plotted after the analysis. See Examples in the help for a couple of specific tables
used for predictions.
-prefix PREFIX: Prefix for output files.
-show_allowed_options: list of allowed options
-verb VERB: VERB is an integer specifying verbosity level.
0 for quiet (Default). 1 or more: talkative.
-vt var formulation: This option is for specifying varying smoothing terms. Two components
are required: the first one 'var' indicates the variable (e.g., subject) around
which the smoothing will vary while the second component specifies the smoothing
formulation (e.g., s(age,subject)). When there are no varying smoothing terms (e.g.,
no within-subject variables), do not use this option.
-Y var_name: var_name is used to specify the column name that is designated
as the response/outcome variable. The default (when this option is not
invoked) is 'Y'.