Schizophrenia Prediction Using Integrated Imaging Genomic Networks

In order to increase the diagnosis accuracy of schizophrenia (SCZ) disease, it is essential to comprehensively employ complementary information from multiple types of data. It is well known that a network is a general method for analyzing relationships between patients, with its nodes representing patients and its edges showing relationships between them. In this study, we constructed a fused network using three types of data including genetic, epigenetic and neuroimaging data from a study of schizophrenia patients. We developed a network-based prediction approach taking advantage of the whole network of patients rather than just individual clusters in the network. The majority neighborhood of a node in the network was exploited as a feature for discriminating SCZ from healthy controls. Compared with other 9 graph-based label prediction methods, our network-fusion based label prediction method shows the best performance according to the prediction accuracy. The prediction power of our proposed method was also tested under di ff erent parameters settings and an optimal parameter was found for achieving the best performance. The method is also computationally e f fi cient and can be extended to identify other clinical outcomes.


Introduction
This paper is an extension of work originally presented in the IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2016) [1].
Currently, there is a lack of objective ways for the diagnosis of schizophrenia (SCZ) [2]. It is generally diagnosed according to criteria set in either the American Psychiatric Associations fifth edition of the Diagnostic and Statistical Manual of Mental Disorders (DSM 5), or the World Health Organizations International Statistical Classification of Diseases and Related Health Problems (ICD-10). These criteria use the self-reported experiences of the patients and report abnormalities in their behaviors, followed by a clinical assessment by a mental health professional. However, symptoms associated with schizophrenia occur along a continuum in the population and must reach a certain severity before a diagnosis is made [3]. Although in psychiatry as in all of general medicine there is an irreducible element of the subjective [4], we should try our best to reduce the subjective part of medical and psychiatry practice. This goal is well recognized by the National Institutes of Mental Health as well [5].
Like many other complex diseases, SCZ is caused by a combination of genetic, biological, and environmental factors, creating a difficult challenge for diagnosis and defining subtypes. In addition, SCZ is a highly heritable disorder [6]. The current belief is that there are a number of genes that contribute to susceptibility or pathology of SCZ, but none of them exhibits full responsibility for the disease. It is believed that SCZ is caused by a number of genetic and environmental factors [7].
Recently, high-throughput studies are rapidly accumulating a wealth of 'omics'-scale data, which enable comprehensive monitoring of a biological system. Meanwhile, there are remarkable medical images accumulated in hospitals, such as images from functional magnetic resonance imaging (fMRI). fMRI is a fairly new tool that has been used to measure brain activation utilising the dependency of the magnetic properties of haemoglobin on the amount of oxygen it carries. Blood-oxygen-level-dependent (BOLD) signals measure the alterations in cerebral blood flow that mark functional brain activity [8]. The intrinsic BOLD contrast makes fMRI an invaluable noninvasive instrument for the investigation of the underlying physiological disturbances that lead to manifest psychiatric disorders. It is anticipated that successful application and analysis of fMRI in neurological disorders can be used to characterise and diagnose mental illnesses such as Alzheimers disease, SCZ, bipolar disorder, mild traumatic brain injury and addiction. It is a valuable way to combine genetic, epigenetic and imaging data for the diagnosis of SCZ disease. Many studies have demonstrated that the integration of fMRI and SNP information will give a more comprehensive analysis of schizophrenia. For example, Yang et al. [9] proposed a hybrid machine learning method to identify schizophrenia patients from healthy controls by combining genetic and fMRI data, and this method achieved better classification accuracy than using either data alone. Castro et al. [10] developed a multiple kernel learning approach that employed both the phase and the magnitude of complex-valued fMRI data for the classification, showing improved classification accuracy compared with using only the magnitude of fMRI data. Cao et al. proposed a sparse representation clustering (SRC) model [11,12] to simultaneously select SNPs and fMRI voxels as biomarkers for schizophrenia followed by a classifier such as support vector machine (SVM). The integrative model was tested on 20 SCZ patients and 20 healthy controls, demonstrating that the complementary information from both fMRI and SNP data can be utilized to improve diagnosis [13].
The above mentioned SCZ studies, however, only used two types of data: SNPs and fMRI data. DNA methylation is one of several epigenetic mechanisms and can be used by cells to control gene expression. There are a number of mechanisms controlling gene expression in eukaryotes, but DNA methylation is a frequently used epigenetic signaling tool, which can fix genes in the "off " position [14]. In addition, the integration methods used in the above mentioned research directly used features extracted from data as predictors. Alternatively, network-based classification methods mainly employ the topological features as predictors. With the rapid development of systems biology, network-based classification or prediction methods have been widely used [15,16,17,18,19,20,21].
In our study, we combined three types of data including SNPs, DNA methylation and fMRI to construct networks, which were subsequently be fused for diagnosis of SCZ. Specifically, for each type of data, we first constructed a patient network, called a single network. Thus for three types of data, we constructed three single networks. Then we fused two or three single patient networks into one subject network. As a result, we created four fused networks: three fused networks from the combinations of two data types and one fused network from three data types. Finally, we employed the neighborhood majority of the nodes in the network to predict unknown labels, that is, identifying whether a subject is a schizophrenia patient.

Data Preparation
In this study, participant recruitment and data collection were conducted at the Mind Research Network. Three types of data (SNP, DNA methylation and fMRI) were collected from 208 subjects including 96 schizophrenia patients (age: 34 ± 11, 22 females) and 112 healthy controls (age: 32 ± 11, 44 females). All of them were provided written informed consents. Healthy participants were free of any medical, neurological or psychiatric illnesses and had no history of substance abuse. By the clinical interview of patients for DSM IV-TR Disorders or the Comprehensive Assessment of Symptoms and History, patients met criteria for DSM-IV-TR schizophrenia. Antipsychotic history was collected as part of the psychiatric assessment. After a series of quality controls, we selected 184 subjects, including 80 SZ cases (age: 34 ± 11, 20 females and 60 males) and 104 healthy controls (age:32 ± 11, 38 females and 66 males). After pre-processing, 27,508 DNA methylation sites, 41,s236 fMRI voxels and 722,177 SNP loci were obtained for the subsequent biomarker selections [22,23,24,25,26,27].

SNPs data collection and preprocessing
A blood sample was obtained for each participant and DNA was extracted. Genotyping for all participants was performed at the Mind Research Network using the Illumina Infinium HumanOmni1-Quad assay covering 1,140,419 SNP loci.
Bead Studio was used to make the final genotype calls.
Next, the PLINK software package(http://pngu.mgh.harvard.edu/∼purcell/plink) was used to perform a series of standard quality control procedures, resulting in the final dataset spanning 722,177 SNP loci. Each SNP was categorized into three clusters based on their genotype and was represented with discrete numbers: 0 for 'BB'(no minor allele), 1 for 'AB' (one minor allele) and 2 for 'AA' (two minor alleles).

DNA methylation data collection and preprocessing
DNA from blood samples was assessed by the Illumina Infinium Methylation27 Assay. A methylation value, beta (β), represents the ratio of the methylated probe intensity to the total probe intensity. A series of quality controls (QC) on the beta values were applied to remove bad samples and probes, such as 1) Beta value QC: Change any beta value to NaN, if p>0.05. 2) Bad sample/ bad marker removing: Samples with >5% of missing (NaN) values; markers with >5% of missing (NaN) values. This resulted in the identification of good methylation data from 224 subjects, 27,508 markers (some have missing values <5%). After QC, we used the K nearest neighbor (KNN) method [28]to impute for the missing values.

fMRI data collection and preprocessing
The fMRI data were collected during a sensorimotor task, a block-design motor response to auditory stimulation. During the on-block, 200msec tones were presented with a 500 msec stimulus onset asynchrony (SOA ). Data were pre-processed in SPM5 (http://www.fil.ion.ucl.ac.uk/spm) and were realigned, spatially normalized and resliced to 3 × 3 × 3mm 3 , smoothed with a 10×10×10mm 3 Gaussian kernel to reduce spatial noise, and analyzed by multiple regression considering the stimulus and their temporal derivatives plus an intercept term as regressors. Finally the stimulus-on versus stimulus-off contrast images were extracted with 53×63×46 voxels and all the voxels with missing measurements were excluded.

Overview
We employed a network-based semi-supervised learning (SSL) scheme, which improves the predictive power by using unlabeled data [29,30,31,32]. It is compuationally efficient and the learning time depends nearly linearly on the number of network edges. Also, the accuracy is comparable with other methods such as kernel-based methods with a longer learning time [18,33]. Moreover, the network structure can facilitate the interpretation of gene-gene interactions and/or region-region connections in the brain [34,35,36], which is one of the advantages of network-based SSL.
Based on a fused network, our label prediction method mainly consists of three procedures: network construction, network fusion and label prediction using network-derived features.

Network Construction
Based on the features selected from SNPs, DNA methylation and fMRI data, a subject network can be constructed. Suppose we have n samples and m measurements (e.g., DNA methylation). A subject similarity network is represented as a graph G = (V, E). The vertices V correspond to the subjects {x 1 , x 2 , ..., x n } and the strengths of the edges E are the weighted value of the similarity between subjects. The edge weights are represented by an n × n similarity matrixW, with each W ij indicating the similarity between subjects x i and x j . ρ ( x i , x j ) is represented as the Euclidean distance between subjects x i and x j . In order to calculate edge strength, a Gaussian function was taken on the Euclidean distance between subjects: If i is in j s K-nearest-neighborhood and vice versa, nodes i, j can be connected by an edge. In our study, we constructed three individual networks for SNPs, DNA methylation, and fMRI data.

Network Fusion
We fuse multiple networks into one network, using mean fusion in our study. Multiple networks Using this graph fusion method, we fused two or three single patient networks into one subject network. As a result, we created four fused networks: three fused networks from the combinations of two data types and one fused network for three data types.

Label Prediction Based on Neighborhood Majority
A network is a map of interactions, with the links measuring the association between objects or subunits. A labeled subject is marked either by 0 or 1, representing two possible labels: healthy or disease, respectively. The edges have an essential role in influencing the propagation between the subjects to predict the true label of the unknown subject. We assumed that the labels of two subjects are more likely to be the same if the two subjects are more closely related to each other. Therefore, the labels can be predicted based on the similarities between subjects with their genomic profiles and/or imaging features. Edges in network-based SSL represent the similarities between subjects that can be extracted from different data including SNPs, DNA methylation and fMRI data (Figure). Network-based classification methods can be classified into two types [21]: direct annotation schemes, which infer the class label of a node based on its connections in the network, and module-assisted schemes, which first identify modules of related nodes and then annotate each module based on the known labels of its members. Neighborhood counting is a simple and direct method for label prediction, which determines the label of a subject based on the known label of subjects lying in its immediate neighborhood. Here, we used neighborhood majority to predict the unknown labels. We first calculated the neighborhood majority of a query subject. Then, the subject is assigned the same type of label as the one of its immediate neighbors with greatest neighborhood majority ( Figure). For a query subject i, the majority of its immediate neighbors can be defined as: where k represents subject class index and N k is the number of subjects in class k. Then the label of subject i is defined as:

Results and Discussions
We employed a network-based classification method to predict whether new subjects are schizophrenia patients. After preprocessing the three types of collected data: SNPs, DNA methylation and function MRI data, we chose 184 subjects with all three types of data. We used 5-fold cross validation to evaluate our prediction method. A two-step feature selection method was applied to the labeled data sets. At first, for SNPs and DNA methylation data sets, we preselected the features based on the genes in KEGG pathways and this procedure yielded 14,875 SNPs and 6,935 DNA methylation sites. Then, for three types of data, we utilized the t-test to select the significantly differential features (p≤0.01) between normal and disease status. The preselected features were used to construct a subject-subject network. A single-type network was built for each data type. That is, three single-type networks were generated. Next, a network fusion method was used to build fused networks from two or three single-type networks. Thus we can get four fused networks: three fused networks from the pairwise combinations of two data types and one fused network from three data types. In our previous study [37], it has been shown that the technique of data fusion will improve the classification performance of network based classifiers. Therefore, in this paper, we focus on the study of the fused network from three types of data. Based on the fused network, we utilized nodes' features in a network to predict the label of a testing subject. This process is illustrated in Figure. Our prediction method is called MMN, which is a majority-neighborhood-based classification method by mean fusion.
We also compared the performance of our model with that of several other network based classification methods, with fixed parameter settings (σ 2 = 510, andK = 20). We compared 10 network based methods, whose differences consist in the procedure of network fusion and the algorithm of classification. For network fusion, two alternative methods are the mean fusion as used by our model, and Similarity Network Fusion (SNF) [38]. For classification procedure, we used five different classification methods: majority-neighborhood based classification (MN), which is adopted by our model, spectral clustering (SPC) [39,40], support vector machine (SVM) [41], local and global consistency (LGC) [32], and label propagation (LP) [42]. SPC is a module-assisted classification approach based on spectral clustering, which is efficient in capturing global structure of a graph [43]. SVM utilizes 11 node features extracted from the network for label classification, and these 11 node features are: betweenness centrality, closeness centrality, degree centrality, Bonacich Power centrality scores, the (Harary) graph centrality, information centrality, the load centrality, the vertex prestige scores, the stress centrality, and clustering coefficients [44].
LGC utilizes both the local and the global consistency of a new patient in a patient-patient network for classification. LP predicts the clinical outcome of the new patient based on a network structure called label propagation. Thus, we have 10 different models, which are MMN (our model), SMN, MSPC, SSPC, MSVM, SSVM, MLGC, SLGC, MLP, and SLP, respectively. To evaluate the performances of these 10 network-based label prediction methods, we computed some widely used metrics, e.g., true positive rate (TPR), true negative rate (TNR), negative predictive value (NPV), false positive rate (FPR), false negative rate (FNR), false discovery rate (FDR), accuracy (ACC), and F1 (F1 = 2TP/2TP+FP+FN) of each method. The classification performances of all these models are listed in Table 1.
From these results it is evident that our model, MMN performs significantly better than other network based classification methods, in terms of some key metrics, e.g. ACC, FDR, etc. Also, it seems that those mean fusion based methods (MLP, MLGC, MMN, MSPC, MSVM) works slightly better than thef corresponding SNF based methods (SLP, SLGC, SMN, SSPC, SSVM), which demonstrates the advantage of mean fusion over SNF. In summary, mean fusion is more suitable than SNF for classification, and MMN works significantly better than other graph-based   At first, we want to compare two different network fusion methods: mean fusion and SNF [38]. For each classification approach, we used both network fusion methods. So according to the network fusion, the 10 prediction methods implemented in our study are all grouped into pairwise comparisons. For example, in SLP and MLP, the first letter S in the method symbol SLP denotes SNF fusion method and the first M in the method symbol MLP means mean fusion approach. It is the same for other pairs of prediction methods: SLGC and MLGC, SMN and MMN, SSPC and MSPC, SSVM and MSVM. Table 1 shows the performances of different prediction algorithms between two network fusion methods. As can be seen in Table 2, for the prediction methods based on LP and LGC, there are no differences in performances between two network  Figure 1 The Schematic of Schizophrenia Diagnosis fusion approaches. For SSPC and MSPC, the only one group of module-assisted prediction methods, their accuracies are close to each other and neither got good performance. The results are similar to another pair of prediction methods: SSVM and MSVM. The major difference between SPC-based pair is that the mean-fusion-based method MSVM obtained higher accuracy than SNF-based method SSVM. However, the comparison results between SMN and MMN differ greatly from other graph-based prediction method pairs. For each of the performance metrics, MMN performs better than SMN, especially the ACC. Actually, the network fusion method used in this study uses uniform weight, i.e., equal weight for each of multiple individual networks. In the future, we will try to develop a weighted mean fusion method, which can assign an optimal weight for each individual network.
Above all, our prediction method MMN performs the best among all those graph-based label prediction methods and it is computationally efficient.
When computing the edge strength for constructing networks, there are two parameters σ , K related to the prediction accuracy. When selecting an optimal σ , we fixed the parameter K (K=20) and evaluated the performances with σ 2 ranging from 100 to 950. The results are shown in Table 2 and we can see that when σ 2 is too small or too large, the prediction accuracy decreases and MMN performs best with σ 2 = 510. When selecting an optimal K, σ 2 was fixed at 510 with K changing in the range 7,9,12,18,20,37 corresponding to 1/25, 1/20, 1/15, 1/10, 1/9, 1/5 of the number of the samples. As shown in Table 3, our methods achieves the best accuracy with K = 20, which is about 1/10 of whole sample (nodes) size.

Conclusions
We combined SNPs, DNA methylation and fMRI data into a single comprehensive network using a simple network fusion approach. A network-based label prediction method was applied to the network for predicting schizophrenia patient. Compared with other 9 graph-based label prediction approaches, our prediction method shows the best performance. However, our network fusion method used a uniform weight based combination of each network and in the future we will find an optimal weight based method to improve the prediction accuracy. In addition, Kim et al. [19] incorporated genomic knowledge when integrating multi-omics data for predicting the clinical outcome of cancer, which improved the predictive power. Therefore, we will also incorporate other prior knowledge into the construction of patient connection networks.

Conflict of Interest
The authors declare no conflict of interest.