Regularized Hypothesis Testing in Random Fields with Applications to Neuroimaging

The task of determining for which elements of a random field (e.g., pixels in an image) a certain null hypothesis may be rejected is a relevant problem in several scientific areas. In the current contribution, we introduce a new method for performing this task, the regularized hypothesis testing (RHT) method, focusing on its use in neuroimaging research. RHT is based on the formulation of the hypothesis testing task as a Bayesian estimation problem with a Markov random field prior. The latter allows for the incorporation of local spatial information and accommodates different noise models, including spatially correlated noise. In tests on synthetic data showing regular activation levels on uncorrelated noise fields, RHT furnished a true positive rate (TPR) of 0.97, overcoming the state-of-the-art morphology-based hypothesis testing (MBHT) method and the traditional family-wise error rate (FWER) method, which afforded 0.93 and 0.58, respectively. For fields with highly correlated noise, the TPR provided by RHT was 0.65, while that of MBHT and FWER was 0.35 and 0.29, respectively. For tests utilizing real functional magnetic resonance imaging (fMRI) data, RHT managed to locate the activation regions when 60% of the original signal was removed, while MBHT located only one region and FWER located none.


INTRODUCTION
In areas of scientific research where imaging is involved (e.g., neuroimaging, remote sensing, etc.), it is often necessary to test statistical hypotheses at each element of a 2 or 3-dimensional field of sites (pixels or voxels). The purpose is to determine the set of sites at which the response to a given experiment may be different from baseline, or whether it is significantly correlated with another parameter.
For instance, researchers in the area of neuroscience typically conduct studies to identify the area of the brain responsible for a certain cognitive task. The experiments are composed of stimulus and rest periods applied to a single subject or several people [1]-[5], using a functional imaging approach such as positron emission tomography (PET) or functional magnetic resonance imaging (fMRI). Subsequently, the regions of voxels with a significant degree of activation have to be detected by performing simultaneous hypothesis tests over 2- or 3-dimensional measurements.
Because hundreds of thousands of comparisons are made at the same time, the well-known problem of multiple comparisons appears [6], [7]. The researcher is thus obliged to seek a solution to the resulting increase in the percentage of false positives (type I errors). A popular family of approaches to deal with this problem is the so-called pointwise (PW) methods, which utilize some type of thresholding technique to control the family-wise error rate (FWER) [6], [8]. Although they present a simple and easy-to-interpret solution, they suffer from a high rate of type II errors.
Some authors address the problem through Gaussian random field (GRF) theory [9], [10], under the assumption that the spatial correlation of the data is known or can be estimated. Since this is not true in practice [11], a smoothing filter is applied to the raw images to ensure that these assumptions are met. Such pre-processing causes a loss of spatial resolution [5]. On the other hand, threshold-free methodologies [12] employ erosion and dilation morphological operators with a set of structuring elements of various sizes to detect regions exhibiting moderate activation levels and wide spatial extent. However, their results depend on the shape of the structuring elements, which is determined arbitrarily.
In the current contribution, we propose a new method, termed the regularized hypothesis testing (RHT) method. It is based on the formulation of the hypothesis testing task as a Bayesian estimation problem using a Markov random field (MRF) to incorporate local spatial information. Firstly, the state-of-the-art methods available for hypothesis testing in 2- and 3-dimensional fields are reviewed. Thereafter, RHT is explained along with related theoretical considerations. The problem of parameter selection is addressed by proposing two algorithms for automatic calibration.
Having laid out the new method, it is validated by experiments with simulated and real data. Finally, the results are discussed and conclusions are drawn.

Theoretical framework
The problem of testing statistical hypotheses at each element of a 2- or 3-dimensional field can be conceived as the following general problem: Given a set of sites L, there is a statistic T(u) defined for each site u ∈ L for which one wishes to test a null hypothesis (H0). According to this hypothesis, all elements T(u) are furnished by the distribution P0(T) (the null distribution). H0 is assumed to be of the form H0 = ⋂_{u∈L} H0u, where H0u is a marginal null hypothesis about the probability distribution of the measurements at site u.
Under H0u, consequently, T(u) is generated by P0. The active region A is defined as the set of sites u where H0u does not hold, in which case the alternative hypothesis H1u is affirmed. The problem then is to find an estimate Â for the set A (note: A and Â may be empty).
Once a method is selected, its performance must be evaluated with standard tools. Some very common tools that will be used presently are described.
1. The false positive rate (FPR) is defined by FPR = E[|Â ∩ A^c|] / |A^c|, where A^c is the complement of the active region A, |⋅| denotes the cardinality of a set and E[⋅] refers to the expected value of a random variable.
It is also possible to define FPR_r, the FPR restricted to the elements of A^c that lie at a distance greater than r from the boundary of the active region. This measure is defined for two reasons. Firstly, the boundary of the active region is usually not well localized, since the activation level often decreases slowly as one moves away from A. Secondly, the methods that consider the neighborhood of each element tend to produce false positives mainly near that boundary. Thus, FPR_r could be a better performance measure in such a case.
2. The true positive rate (TPR), also called sensitivity, is defined by TPR = E[|Â ∩ A|] / |A|, denoting the expected proportion of sites correctly estimated as belonging to the activation region.
3. The family-wise error rate (FWER) is defined as the probability of having at least one false positive, given that the activation region is empty: FWER = P(|Â| ≥ 1 | A = ∅).
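As a concrete illustration, these three measures can be estimated from binary masks. The following sketch (the function names are ours, not the paper's) takes `est` as the estimated region Â and `true` as the region A:

```python
import numpy as np

def rates(est, true):
    """FPR and TPR for an estimated active-region mask vs. the true one.

    est, true: boolean arrays of the same shape (True = active).
    """
    est = np.asarray(est, bool)
    true = np.asarray(true, bool)
    neg = ~true
    fpr = (est & neg).sum() / max(neg.sum(), 1)    # false positives / |A^c|
    tpr = (est & true).sum() / max(true.sum(), 1)  # true positives  / |A|
    return fpr, tpr

def fwer(null_fields, threshold):
    """FWER estimated over fields generated under H0 (empty true region):
    the fraction of null fields with at least one supra-threshold site."""
    hits = [np.any(f > threshold) for f in null_fields]
    return float(np.mean(hits))
```

In practice `null_fields` would come from simulation or permutation of the data, as discussed below.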
Some of these measures can be combined. For example, a widely accepted way of characterizing the performance of a method is through the receiver operating characteristic (ROC) curve [13], which indicates, for a fixed A, the maximum attainable TPR for any given maximum allowable FPR. For one-sided tests, the maximum TPR is generally an increasing function of the maximum allowable FPR.
To construct the curve, the true region A and the corresponding activation level a = E[T(u)], u ∈ A, must be known. Then, the study of a method that depends on a parameter θ involves setting a value for θ to obtain a point on the ROC curve. The parameter acts as a cut-off point to distinguish between the sites considered positives or negatives. Take as an example the methods based on the following computation: Â = {u ∈ L : pv(u) ≤ θ}, where pv(u) denotes the p-value of T(u), and therefore pv(u) = 1 − P0(T(u)). For the FWER method [14], with significance level α_FWER, the threshold is obtained from P0^M, the distribution of max_{u∈L} T(u) under H0.
Consequently, FWER can be controlled by choosing as the threshold the value located at the (1 − α_FWER) portion of the right side of P0^M. For an elaborate discussion of the association between FWER and the maximum statistic, see Pantazis [15].
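When samples of the field under H0 are available (e.g., from permutations), this threshold can be estimated as the empirical (1 − α_FWER) quantile of the maximum statistic. A minimal sketch (the function name is ours):

```python
import numpy as np

def max_stat_threshold(null_fields, alpha=0.05):
    """FWER-controlling threshold: the (1 - alpha) quantile of the
    distribution of the field maximum under H0, estimated from samples."""
    maxima = np.array([f.max() for f in null_fields])
    return np.quantile(maxima, 1.0 - alpha)
```

Sites with T(u) above this threshold are then declared active, which bounds the probability of any false positive by roughly α.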
In the case of FDR, with a significance level α_FDR [16], the procedure for finding the threshold θ begins with ordering the individual p-values of sites u ∈ L from the largest to the smallest. Accordingly, pv(u1) ≥ pv(u2) ≥ ⋯ ≥ pv(uN), with N = |L|. Let k be the index of the first site on the list at which the p-value is less than or equal to the desired FDR proportion; θ is then set as θ = pv(uk). For more details on the method, consult Benjamini [17].
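This step can be sketched as follows; here the p-values are sorted in ascending order, which is equivalent to the descending ordering above, and the helper name is ours:

```python
import numpy as np

def bh_threshold(pvals, alpha_fdr=0.05):
    """Benjamini-Hochberg threshold: the largest ordered p-value p_(k)
    satisfying p_(k) <= (k / N) * alpha_fdr; 0.0 if no site qualifies."""
    p = np.sort(np.ravel(pvals))                      # ascending order
    N = p.size
    crit = alpha_fdr * np.arange(1, N + 1) / N        # per-rank criterion
    ok = p <= crit
    if not ok.any():
        return 0.0                                    # no rejections
    return p[ok.nonzero()[0].max()]
```

All sites with pv(u) at or below the returned value are declared positive.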
Alternatively, given a desired value ϵ for FPR_max, θ can simply be set as ϵ (e.g., ϵ = 10^-5). If the local hypotheses H0u are independent, this is equivalent to the application of the standard PW method without correction for multiple hypotheses, but with a low significance level α = ϵ. Since (as mentioned) the maximum TPR is an increasing function of the maximum allowable FPR, the value θ* = ϵ will be the one that maximizes the TPR while keeping the FPR under control, where ϵ is a user-specified small positive number.
Although the actual ROC curve for a given problem is unknown (because it depends on the values of T(u) for u ∈ A), it is possible to specify the value of ϵ. Thus, the optimal PW method, according to Eq. (11) with θ = θ*, will correspond to the UC method with α = ϵ. It may be applied if the field of p-values, or equivalently the field P0(T(u)), u ∈ L, is known. Such fields can be estimated either theoretically or by non-parametric empirical means (e.g., permutation or re-sampling procedures) [18], [19]. The optimal PW method will be denoted by PW*(ϵ).
The signal-to-noise ratio (SNR) [20] is defined as the ratio between the minimum activation level in T, which has been arbitrarily assigned as the information signal, and σ0, the variance of T under H0. This represents the worst-case scenario in which the entire region has the minimum activation value. On the other hand, the mean activation level or the amplitude could also be employed, as discussed by Welvaert and Rosseel [17] and Acosta-Franco et al. [5]. For low values of SNR, the TPR obtained with PW*(ϵ) on the corresponding PW ROC curve is usually low for reasonable values of ϵ.
Consequently, PW methods have low sensitivity and the estimated is too conservative.
Hence, more sensitive methods must be developed that are able to accurately reflect the active regions, which in most cases consist of clusters of several contiguous elements of and not of isolated elements.
Customarily, these methods use the field T as a starting point for the generation of a new field T̃. The new statistic takes the spatial correlation of T into account, and T̃ is then processed with a PW method as before. For example, T̃(u) may refer to the size or mass of the cluster of arc-connected elements containing u with values of T above a certain threshold [21].
However, this approach has some drawbacks, one being that the results depend strongly on the value of the selected threshold, and in general no principled way exists to make such a selection. Some variants of the method alleviate the problem to a certain extent by computing T̃ as a weighted combination of cluster sizes obtained with different thresholds. Another problem is the loss of interpretability, since in many cases the significance of the statistic sought is directly related to the activation level (i.e., the value of T) and to the extent of supra-threshold clusters.
A distinct approach was developed to address these problems: the morphology-based hypothesis testing (MBHT) method [12]. It involves the computation of T̃ as a combination of the results found when applying a set of K morphological erosions with different sizes of structuring elements to the field T. The statistics are calculated as: T_k(u) = min_{v ∈ W_k(u)} T(v), (13) where W_k(u) denotes the structuring element k (e.g., a circle of radius r_k) centered at u. Then, these values are integrated into the statistic T̃ by means of the null distributions P_0k of the statistics T_k.
Afterwards, the optimal PW procedure can be applied to the field T̃, as detailed above.
The results of this approach are competitive with those based on supra-threshold cluster statistics [12] and allow a clearer interpretation. Once again, the disadvantage is that the results may depend on the shape of the structuring elements, for whose selection no principled solution exists.
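Assuming the erosions in Eq. (13) are grey-scale erosions (i.e., minima of T over the structuring element), the fields T_k can be sketched with scipy; square structuring elements are used here for simplicity, and the function name is ours:

```python
import numpy as np
from scipy.ndimage import grey_erosion

def erosion_statistics(T, radii):
    """T_k(u) = min of T over a structuring element of 'radius' r_k
    centered at u (grey-scale erosion); returns one field per radius."""
    out = []
    for r in radii:
        size = 2 * r + 1                 # square element with side 2r + 1
        out.append(grey_erosion(T, size=size))
    return out
```

The effect is the intended one: a wide plateau of moderate activation survives erosion, while an isolated high-valued pixel does not.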
In the current contribution, the approach introduced is capable of overcoming these difficulties. It formulates the hypothesis testing task as a Bayesian estimation problem, with an MRF prior on the active regions, thus imposing a prior constraint on the spatial contiguity of A. For the application of a Gaussian Markov random field (GMRF) to fMRI data analysis, see Mejia et al. and the references therein [3].
The new scheme proved to have better performance than PW methods, while maintaining interpretability.
It also has better performance than MBHT, and does not require the selection of any particular shape for the structuring elements.

The regularized hypothesis testing method
In RHT, the hypothesis testing problem is written in terms of an image segmentation problem, which is solved by using a Bayesian estimation framework.
First, the hypothesis testing formulation is explained.
The procedure for calculating the prior distribution of the sites and the likelihood that they belong to an active region is then laid out. Subsequently, an approximation algorithm for finding the maximum a posteriori (MAP) estimate is established. Finally, the parameter selection problem is addressed.

Hypothesis testing formulation
Hypothesis testing may be formulated as a binary segmentation problem. Accordingly, given the set of sites L, the problem is reduced to partitioning L into non-overlapping cohesive regions R0 and R1, such that the activation level in region R0 is zero, and in region R1 it is greater than or equal to a known constant a1. Hence, R0 ∩ R1 = ∅ and L = R0 ∪ R1, where A = R1 is the active region, and therefore R0 = A^c.
Let c be an unknown discrete label field that identifies the partitions of L, with c(u) ∈ {0,1} and c(u) = k if u ∈ Rk. Thus, a_c(u) depicts the activation level at site u (note: a0 = 0).
Without loss of generality, the activation level in R1 is assumed to be equal to a1, since this represents a worst case in terms of the desired false positive control in the estimation of A. Considering T as the field formed by the statistics T(u) for all u ∈ L, the following observation model is proposed: T(u) = a1 c(u) + n(u), (15) where n is a noise field with distribution P_n(n), with: P_n(n) = (1/Z_n) exp(−U_n(n)), (16) where Z_n is a normalization constant and U_n(n) is a so-called energy function. Then, the likelihood P(T|c) is furnished as: P(T|c) = (1/Z_(T|c)) exp(−U(T|c)), (17) where Z_(T|c) is a normalization constant. It is important to find an estimate that considers the spatial correlation in both the noise field n and the label field c.

Noise field model
For the noise field n, a GMRF model was employed, taking the form of Eq. (16) with: U_n(n) = γ ( Σ_u n(u)^2 / 2 + τ1 Σ_⟨u,v⟩ (n(u) − n(v))^2 ), (18) where γ > 0 is a scale parameter, n(u) is the value of the field n at site u, and n(v) is its value at a neighboring site v. The parameter τ1 ≥ 0 is related to the spatial correlation of the field n, and the second summation is taken over all pairs of neighboring sites ⟨u, v⟩ in the image.
When τ1 = 0 and γ = 1, the formulation reduces to a Gaussian white noise model with zero mean and unit variance (the "Parameter Selection" section explains other types of noise fields). With such a model, the likelihood takes the form of Eq. (17), with: U(T|c) = γ ( Σ_u [ (1 − c(u)) T(u)^2 + c(u) (T(u) − a1 c(u))^2 ] / 2 + τ1 Σ_⟨u,v⟩ ( (T(u) − a1 c(u)) − (T(v) − a1 c(v)) )^2 ). (19) In the above equation, c(u) is used as an indicator function. Accordingly, if its value is 1 (i.e., u ∈ R1), the term (1 − c(u)) T(u)^2 becomes zero, simplifying Eq. (19) to the expression obtained by substituting T(u) − a1 c(u) for n(u) in Eq. (18): the activation level is eliminated from T(u), keeping only the noise component, as in the observation model proposed in (15).
Moreover, if c(u) = 0 (i.e., u ∈ R0), the term c(u)(T(u) − a1 c(u))^2 becomes zero and the term (T(u) − a1 c(u)) in the second summation simplifies to T(u). This is equivalent to substituting T(u) for n(u) in Eq. (18), since in this case T(u) corresponds to pure noise.
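Under our reading of Eqs. (18)-(19) (the exact functionals above are our reconstruction), the equivalence between the likelihood energy and the noise energy evaluated at the residual T − a1 c can be checked numerically:

```python
import numpy as np

def gmrf_energy(n, gamma=1.0, tau1=0.0):
    """GMRF noise energy in the style of Eq. (18):
    gamma * ( sum_u n(u)^2 / 2 + tau1 * sum_<u,v> (n(u) - n(v))^2 ),
    with <u,v> ranging over horizontal/vertical neighbor pairs."""
    quad = 0.5 * np.sum(n ** 2)
    pair = (np.sum((n[1:, :] - n[:-1, :]) ** 2)
            + np.sum((n[:, 1:] - n[:, :-1]) ** 2))
    return gamma * (quad + tau1 * pair)

def likelihood_energy(T, c, a1, gamma=1.0, tau1=0.0):
    """U(T | c): the noise energy evaluated at the residual T - a1*c."""
    return gmrf_energy(T - a1 * c, gamma, tau1)
```

With τ1 = 0 and γ = 1 this reduces to the white-noise case, and for any T generated as T = a1 c + n the likelihood energy equals the noise energy of n, as the text describes.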
Eq. (19) may thus be rewritten as a quadratic function of a vector-valued field b defined by b1(u) = c(u) and b0(u) = 1 − c(u), with a0 = 0.

Bayesian formulation of the hypothesis testing problem
For the field b, we propose an MRF prior model [22], [23]. In order to introduce the prior constraint that the active regions are spatially cohesive, b is modeled with Ising potentials, furnishing a distribution of the form P(b) ∝ exp( τ2 Σ_⟨u,v⟩ Σ_k b_k(u) b_k(v) ), subject to the constraints: Σ_k b_k(u) = 1, ∀u ∈ L, (21) and b_k(u) ∈ {0,1}, ∀u ∈ L, k ∈ {0,1}, (22)
where τ 2 ≥ 0 is a parameter that controls the granularity of field b.
The MAP estimate for field b is obtained by minimizing the resulting cost function (28), subject to the constraints (21)-(22). This is a combinatorial optimization problem that is, in general, very difficult to solve. Several methods have been proposed, such as the iterated conditional modes (ICM) algorithm [24], stochastic relaxation [25], the graph-cuts procedure [26], etc.
Here, because of the form of the cost function, which includes the noise correlation term, we prefer the approach of [22], which is based on the relaxation of constraint (22) so that b_k(u) ∈ [0,1].
This is a quadratic minimization problem, subject to linear constraints, which can be solved efficiently by employing a projected gradient descent method.
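The idea can be illustrated with a simplified relaxed cost: a data-fidelity term plus a quadratic smoothness term standing in for the Ising prior. The cost, step size and names below are our assumptions for illustration, not the paper's exact functional:

```python
import numpy as np

def relaxed_map(T, a1, lam=1.0, step=0.05, iters=200):
    """Projected gradient descent on an illustrative relaxed cost:
    F(p) = sum_u (T(u) - a1*p(u))^2 + lam * sum_<u,v> (p(u) - p(v))^2,
    with p(u) in [0, 1] playing the role of the relaxed label b1(u)."""
    p = np.full_like(T, 0.5)
    for _ in range(iters):
        grad = -2.0 * a1 * (T - a1 * p)          # gradient of the data term
        # gradient of the pairwise term: sum over 4-neighbors of (p(u)-p(v))
        lap = np.zeros_like(p)
        lap[1:, :] += p[1:, :] - p[:-1, :]
        lap[:-1, :] += p[:-1, :] - p[1:, :]
        lap[:, 1:] += p[:, 1:] - p[:, :-1]
        lap[:, :-1] += p[:, :-1] - p[:, 1:]
        grad += 2.0 * lam * lap
        p = np.clip(p - step * grad, 0.0, 1.0)   # projection onto [0, 1]
    return p
```

The projection step (`np.clip`) enforces the relaxed box constraint after each gradient update, which is what makes the quadratic program tractable.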
Consequently, each b(u) (to simplify the notation, the dependence on ν, a1, λ is omitted) may now be interpreted as an approximation of the posterior marginal distribution of the indicator functions 1_{c(u)=k}.

Parameter selection
A general method can now be introduced to allow the model (31)-(33) to be used for any type of noise. The parameters a1, ν, λ can be automatically selected by controlling the false positive proportion while maximizing the performance of the method with respect to the TPR.

Standardization of the noise field
The derivation of functional (31) assumes Gaussian noise; an arbitrary noise field, however, can be transformed into a Gaussian one by exploiting two well-known properties:
• If x is a random variable with cumulative distribution function (CDF) Q, then the distribution of the random variable Q(x) is uniform [27], [28].
• If u is a uniform random variable and G is the CDF of the standard normal distribution, then the random variable G^-1(u) is normally distributed [29].
Combining both properties, it follows that for any random variable x, the random variable G^-1(Q(x)) is normally distributed. Consequently, the only requirement for transforming a random variable x into a normal one is to know the corresponding CDF Q, or to estimate it via the empirical CDF (ECDF) Q̂ obtained from samples of the particular distribution.
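This standardization can be sketched as follows; the rank-based ECDF with an n + 1 denominator (to avoid the endpoints 0 and 1) is our choice, and the function name is ours:

```python
import numpy as np
from scipy.stats import norm

def gaussianize(x):
    """Map samples x to standard normal scores via z = G^{-1}(ECDF(x)),
    where G^{-1} is the standard normal quantile function (norm.ppf)."""
    x = np.ravel(x)
    n = x.size
    ranks = np.argsort(np.argsort(x)) + 1    # rank of each sample, 1..n
    u = ranks / (n + 1.0)                    # ECDF values strictly in (0, 1)
    return norm.ppf(u)
```

Applied to samples from any continuous distribution, the output is approximately standard normal, which is all the functional derived above requires.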
Assuming that random field samples T0 can be generated under H0, it is possible to estimate the corresponding null distribution P0(⋅). Given a field T derived from the observations (e.g., the F-test for the generalized linear model (GLM) computed at each voxel), the sample can be mapped to a standard normal distribution via Φ^-1, where Φ is the standard normal CDF, expressible in terms of the error function erf(y) [30], i.e., the probability that a random variable Z, normally distributed with mean μ = 0 and variance σ² = 1/2, falls in the range [−y, y]. The field T̃, with a standard normal distribution, is afforded by the transformation T̃(u) = Φ^-1(P̂0(T(u))). The pseudo-likelihood is defined as the product of the conditional distributions for n(u) at each pixel u, given the values at its neighboring sites [31].
2. According to our experimental work (see Fig. 2 and Experiment 1 below), the present method has the advantage that false positives do not extend, in a significant way, more than a couple of pixels away from the boundary of the active region, since they are due to boundary effects close to that region. For fixed a1 and λ, therefore, it is possible to approximate the constraint FPR 2 ≤ ϵ by a constraint on FPR 0 (i.e., the FPR computed in fields generated under H0), which is denoted by FPR 0 (a1, λ).
3. The calculation can be further simplified based on the observation that TPR is an increasing function of FPR (i.e., the usual shape of ROC curves). Consequently, for a fixed λ, it suffices to find a1(λ) such that FPR 0 (a1(λ), λ) = ϵ. Given that a1 is obtained from fields produced under H0, and these fields are standardized as described above with the noise model (18), the values of a1 as a function of ν, λ, ϵ do not depend on the data.
Hence, it is possible to pre-compute them from fields generated with model (16), with U afforded by (18), using the Gibbs sampler algorithm [25].
where P(a) is taken as the uniform distribution on the set S. With these simplifications, the parameter estimation algorithm simply consists of sweeping the plausible λ values in a given set Λ (a discretization of an interval [0, λmax]), followed by finding, for each λ, the value â1(λ) (such that FPR 0 (â1(λ), λ) = ϵ) and the corresponding estimated TPR(λ), as explained above. The final λ* is then found as the maximizer of the estimated TPR(λ), affording a1* = â1(λ*). This procedure is summarized in Algorithm 1.
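The sweep can be sketched generically as follows; `fpr_for` and `tpr_for` stand for the (pre-computed or simulated) estimates of FPR 0 and TPR, the bisection assumes FPR 0 is decreasing in a1, and all names and search bounds are our illustrative assumptions:

```python
import numpy as np

def calibrate(lambdas, fpr_for, tpr_for, eps=1e-3, a1_max=10.0):
    """For each lambda, solve FPR0(a1, lambda) = eps for a1 by bisection,
    then keep the lambda that maximizes the estimated TPR."""
    best_lam, best_a1, best_tpr = None, None, -1.0
    for lam in lambdas:
        lo, hi = 0.0, a1_max
        for _ in range(60):              # bisection: FPR0 decreasing in a1
            mid = 0.5 * (lo + hi)
            if fpr_for(mid, lam) > eps:
                lo = mid
            else:
                hi = mid
        a1 = hi
        tpr = tpr_for(a1, lam)
        if tpr > best_tpr:
            best_lam, best_a1, best_tpr = lam, a1, tpr
    return best_lam, best_a1, best_tpr
```

Because the FPR 0 evaluations involve only fields generated under H0, this whole sweep can be performed offline, before any data are observed.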

Data description
To study the behavior of the present method, we include various experiments based on both synthetic and real data.

Synthetic data:
The synthetic data was generated dynamically and consisted of two components. Firstly, T0 = {n1, n2, …, n40} is a set of 40 noise fields on a regular lattice of 50×50 pixels, without any activation region.
That is, the images were produced under H0, using Eqs. (16) and (18),
where a denotes the activation level and its value is specified in each experiment.
Real data: Experiments with real data were based on the auditory dataset [32]. The data corresponded to 96 volumes of a single subject. The results are summarized in Table 1. As can be appreciated, by calibrating the parameters of the method to control FPR 0, the appropriate control over FPR 2 is also achieved.

Performance of the method
The first set of the following experiments was designed to compare RHT with FWER [8] and MBHT [12], the latter two being the state-of-the-art methods that take the spatial extent of the active region into account. Accordingly, the set S_0 was processed with each method. RHT models the spatial correlation in the active region (through the MRF prior) as well as the spatial correlation in the noise field (controlled by the parameter ν). Contrarily, the rest of the algorithms assume that these two components of the statistic T are formed by independent variables.

Experiments with fMRI data
In the second set of experiments, Algorithm 2 of the proposed method was applied to the real fMRI data described at the beginning of this section. RHT was not applied directly to the original data, but rather to a field T (an F-test field computed from the data). The first step was to standardize the field in order to obtain T̃ with Eq. (36), which is the input of the segmentation algorithm. Hence, it was necessary to calculate the ECDF from the data. Hereafter, some details about the aforementioned steps are explained.
Data pre-processing. The aim of the pre-processing was to remove artifacts in the data and to prepare it for the statistical analysis. Here we use the spatial pre-processing provided in the script auditory_spm12_batch.m, which comprises realignment, co-registration, segmentation and normalization.
At each voxel, the observed time series y_t is modeled with the GLM: y_t = β0 + β1 (h ∗ d)_t + ϵ_t, (48) where h_t stands for the hemodynamic response function and d_t ∈ {0,1} is an impulse train that indicates whether the stimulus was present at time t.
Then, the neural activity is modeled as a convolution, in which the hemodynamic response function acts as a filter.
The F-test compares the variance explained by the full model (48) against that of the reduced model y_t = β0 + ϵ_t (without the stimulus effect).
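The regressor construction and the model comparison can be sketched as follows; the helper names are ours, and h stands for whatever HRF model one adopts:

```python
import numpy as np

def convolved_regressor(d, h):
    """Neural activity model: stimulus train d convolved with the HRF h,
    truncated to the length of the scan."""
    return np.convolve(d, h)[: d.size]

def f_statistic(y, x):
    """F-test comparing the full model y = b0 + b1*x + e against the
    reduced model y = b0 + e (one extra regressor, so df1 = 1)."""
    n = y.size
    X_full = np.column_stack([np.ones(n), x])
    beta, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    rss_full = np.sum((y - X_full @ beta) ** 2)
    rss_red = np.sum((y - y.mean()) ** 2)    # intercept-only reduced model
    df1, df2 = 1, n - 2
    return (rss_red - rss_full) / df1 / (rss_full / df2)
```

Computing `f_statistic(y, convolved_regressor(d, h))` at every voxel yields the F-test field T used as input to the standardization step.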

RHT algorithm robustness with respect to SNR
In order to investigate the stability of the present method, the algorithm was tested while varying the SNR of the data. RHT displayed the best sensitivity, which was particularly high for low SNR.
For simplicity, in the current analysis we focused on two-dimensional data. However, it is possible to directly extend the method to 3D simply by considering an extended neighborhood (e.g., 6 or 26 neighbors for each voxel) in the prior MRF models for the active region and the noise, at the expense of an increased computational complexity.
Although the example application here corresponds to fMRI, the RHT method may be applied to any situation involving the testing of a field of local hypotheses, such as the ones that are common in neuroimaging, remote sensing, and so on.