A Circular Invariant Convolution Model-Based Mapping for Multimodal Change Detection

The large and ever-increasing variety of remote sensing sensors used in satellite imagery today explains why detecting changes between images of the same location, captured at two separate times by heterogeneous acquisition systems, has become a major and challenging research problem in the field of satellite imaging for the fast and accurate determination of temporal changes. This work presents an original concentric circular invariant convolution model that aims at projecting the first satellite image into the imaging modality of the second image. This allows the two images to have identical statistics so that one can then effectively use a classical monomodal change detection method. The invariant circular convolution kernel parameters are estimated in the least squares sense using a conjugate gradient routine whose optimal direction is determined by a quadratic line search algorithm. After the projection of the before image into the imaging modality domain associated with the after image is achieved, a basic pixel-by-pixel difference permits the estimation of a relevant soft difference map, which is segmented into two classes by a multilevel Markovian technique. A series of experiments conducted on several pairs of satellite images acquired under different imaging modalities, resolution scales, noise characteristics, change types and events validates the effectiveness of this strategy. The experimentation shows that the proposed model can process different image pairs, coming from different sensors or from the same sensor, with fewer restrictions on the source images and on the natural events to be detected.


Introduction
The ever-increasing number of Earth observation satellites, equipped today with new high-tech sensors, produces imagery that is often technologically quite different from the satellite images archived so far. This heterogeneity of satellite image data has thus lately contributed to the advent and development of a novel and rapidly growing research interest in remote sensing and geoscience imaging, commonly known as multimodal (or heterogeneous) change detection (MCD). The purpose of MCD [1,2] is to detect and precisely locate any change in land cover between at least two satellite images taken of the same place, at different times and under different acquisition conditions. It usually concerns CD models that process heterogeneous satellite images, i.e. images provided by different satellite sensor types, which may combine active and passive sensors such as SAR/optical; by one sensor type with two heterogeneous SAR, optical or other acquisitions; or by different satellites using the same sensor type but under different specifications, looks, wavelengths, or calibrations [3].
Recently, MCD has aroused a lot of attention and interest in satellite imagery, since the increasing heterogeneity of the available images makes it consistent with how our environment is now observed. Nevertheless, it is a difficult procedure because MCD has to be flexible enough to treat a multimodal image pair while solving the problems usually addressed by traditional single-modality CD methods [4,5], including damage assessment, land and environmental monitoring, or city planning.
In the literature, the proposed MCD approaches can be classified into four classes. The first class includes the simplest methods, which use local descriptors [6] or similarity measures [6]-[9] exploiting invariance properties relative to the processed imaging modality. The second category gathers methods that make no assumptions about the data distributions and are thus non-parametric. This includes algorithms using machine learning techniques that learn from training samples [10]-[19], or unsupervised, parameter-free schemes that do not require any training phase, for example the least squares (LSQ) multidimensional scaling (MDS) and energy-based model, integrating data-driven pixel pairwise constraints, introduced in [20]. The third group relies on parametric methods trying to model common statistical features or relationships between the different imaging modalities, or between the different multisource data, via a set of multivariate distributions [21]-[27] or through a pixel pairwise modeling integrated in a Markov Random Field (MRF) model [28]. Finally, the last category regroups algorithmic procedures essentially relying on a projective transformation of the heterogeneous images to a common intermediate feature space, where the projected satellite images ultimately share the same statistical characteristics and on which any conventional homogeneous CD model can subsequently be used [29]-[38]. This category also encompasses the procedures projecting the first image into the imaging modality of the second image, or inversely [39]. The method presented in this work fits fully into this latter sub-category.
More precisely, this work presents a concentric circular invariant convolution mapping which aims at projecting the before image onto the imaging modality of the after image. In this way, we ensure that the statistics of the pre-event and post-event satellite images are nearly identical and that a classical monomodal CD procedure can then be efficiently used, as easily as if the two images came from the same imaging modality or sensor with the same settings (specifications/calibrations/wavelength).
To this end, we will show that, once the mapping is done by the proposed convolutive model, the pixel-wise difference image estimated from the image pair exhibits good likelihood distribution properties for the "change" and "no change" label classes; that is to say, it presents a mixture of distributions not too far from normal distributions and not too flat (i.e. wide or non-informative), so that a binary segmentation with an unsupervised Markovian model can be efficient, despite the large differences between imaging modalities that can be encountered in multimodal satellite imagery and that will be evaluated in this paper. As an improvement of our convolution model-based mapping [1], in the present research we propose a circular invariant model. Furthermore, a three-dimensional (3D) convolution-based mapping strategy is adopted in the specific case of MCD using two color (or multispectral) images. A battery of tests was conducted to quantify the benefits of these improvements.
It is worth mentioning that convolution model estimation has also been widely studied and analyzed in different image processing problems, such as image restoration [40] or 2D and 3D deconvolution issues [41]-[44].
The remainder of this paper is structured as follows: Section 2 describes the proposed projection model, based on a concentric circular invariant convolution representation, and the unsupervised Markovian approach. Section 3 presents the results and an experimental comparison with state-of-the-art multimodal CD methods. Section 4 concludes the paper.

Proposed MCD Model
Let I_b and I_a be the bi-temporal remote sensing image pair of N pixels, captured over the same area at two consecutive times (i.e., before and after a given event) and coming from different imaging systems or sensors. Let S denote the informational part of the image, or the semantic information of the scene, concretely representing the set of real objects and materials imaged in I_b and I_a. The acquisition systems related to the pre-event and post-event images can then be modeled by the following linear models: I_b = S * u_b and I_a = S * u_a, where "*" is the convolution operator and u_b and u_a mathematically model the underlying Point Spread Functions (PSF) of the pre-event and post-event data acquisition systems [45].
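As a minimal numerical sketch of this linear observation model (the latent scene and the Gaussian PSFs below are illustrative assumptions standing in for the unknown u_b and u_a, not values from the paper):

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)

# Hypothetical latent scene (the "informational part" shared by both images).
scene = rng.random((64, 64))

def gaussian_psf(size, sigma):
    """Illustrative normalized Gaussian kernel used as a stand-in PSF."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

u_b = gaussian_psf(9, 1.0)   # pre-event acquisition system
u_a = gaussian_psf(9, 2.0)   # post-event acquisition system

# Linear observation model: each image is the same scene convolved
# with its own PSF, so the two modalities have different statistics.
I_b = convolve2d(scene, u_b, mode="same", boundary="symm")
I_a = convolve2d(scene, u_a, mode="same", boundary="symm")
```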
This modeling framework defines a reliable and efficient way to project the before-change image into the domain or imaging system of the after-change image. It consists in applying to the pre-event image a convolution filter u_{b→a} satisfying u_b * u_{b→a} = u_a (equivalently, u_{b→a} = u_b^{-1} * u_a, where u_b^{-1} is the inverse filter such that u_b * u_b^{-1} = δ, with δ the Dirac impulse function). By commutativity and associativity of the convolution product, applying this operator gives I_{b→a} = I_b * u_{b→a}, whose statistics match those of I_a wherever the underlying scene is unchanged. This operation allows us to convert the original MCD issue into a conventional monomodal CD problem, i.e. the case of a pair of images having the same acquisition mode and therefore the same statistics.
In order to reliably find u_{b→a}, the simplest strategy is to search for the convolution filter parameters that minimize, in the LSQ sense, the energy function given by Equation 1:

E(u_{b→a}) = Σ_{(x,y) ∈ S_nc} [ (I_b * u_{b→a})(x, y) − I_a(x, y) ]²   (1)

where S_nc is the set of pixels that do not belong to the class label "change", which must be evaluated in the final change detection map. As a first approximation for û_{b→a}, we can take for S_nc the set of all pixels in the image. This can be justified because the area of change is usually quite small in proportion to the image, and can thus be considered (in a very first approximation) as noisy observations or outliers in the image data, with which the LSQ estimator is known to perform well (especially thanks to the fact that the function E(u_{b→a}) to be minimized is convex with respect to the parameter vector of the convolution filter u_{b→a}). As a result, different efficient minimization procedures can be adopted [47].
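A possible sketch of this LSQ estimation on synthetic data, using SciPy's conjugate-gradient minimizer as a stand-in for the paper's CGD with quadratic line search (the image, the small filter size and the ground-truth filter are illustrative assumptions of this sketch):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.signal import convolve2d

rng = np.random.default_rng(1)
sz = 5  # small filter size for the sketch (the paper uses sz = 9)

# Synthetic stand-ins for I_b and I_a linked by a known ground-truth filter.
I_b = rng.random((48, 48))
u_true = np.zeros((sz, sz))
u_true[sz // 2, sz // 2] = 0.6
u_true += 0.4 / (sz * sz)  # a slightly blurred Dirac, for illustration only
I_a = convolve2d(I_b, u_true, mode="same", boundary="symm")

def energy(u_flat):
    """LSQ energy E(u) = sum over S_nc of [(I_b * u)(x,y) - I_a(x,y)]^2,
    with S_nc taken as the whole image (the first approximation)."""
    u = u_flat.reshape(sz, sz)
    r = convolve2d(I_b, u, mode="same", boundary="symm") - I_a
    return np.sum(r * r)

# Start from the identity (Dirac) filter and minimize the convex energy.
u0 = np.zeros(sz * sz)
u0[(sz * sz) // 2] = 1.0
res = minimize(energy, u0, method="CG")
u_hat = res.x.reshape(sz, sz)
```

Because the energy is a convex quadratic in the filter coefficients, conjugate gradient with a line search converges to the global minimizer.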
In this work, we use a combination of conjugate gradient descent and a line minimization search routine, guaranteeing a considerably increased convergence speed compared to other minimization methods. Another advantage of such a minimization technique is its ability to integrate some hard constraints on u_{b→a}. In our case, since the two imaging modalities are circular invariant, we constrain u_{b→a} to be circular invariant as well. To this end, at each iteration k of the conjugate gradient descent (CGD), we simply average each of the (sz × sz) parameter values of u^{[k]}_{b→a} that are equidistant (in the L1 distance sense) from the center of the convolution filter. This allows us to integrate this constraint with a linear complexity with respect to the (sz × sz) parameters of the convolution filter. This overall minimization routine is presented in our basic model [1].

Algorithm 1 Concentric Circular Constraint
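The concentric circular constraint described above can be rendered as the following NumPy sketch (the function name and looping strategy are our own; only the L1-ring averaging rule comes from the text):

```python
import numpy as np

def circular_constraint(u):
    """Project a (sz x sz) filter onto concentric circular invariance:
    replace every set of coefficients lying at the same L1 distance from
    the filter center by their mean, as applied at each CGD iteration."""
    sz = u.shape[0]
    c = sz // 2
    yy, xx = np.indices(u.shape)
    d = np.abs(yy - c) + np.abs(xx - c)   # L1 distance to the center
    out = np.empty_like(u, dtype=float)
    for r in np.unique(d):                # one "ring" per distance value
        mask = d == r
        out[mask] = u[mask].mean()        # average within the ring
    return out
```

Note that the projection preserves the total filter mass (each ring keeps its sum) and makes the filter invariant under 90-degree rotations and flips.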
In order to improve this approximation of û_{b→a}, we can formulate the LSQ estimation problem of the convolution filter parameters as a fixed point problem involving a contraction mapping. Technically speaking, the first estimation of û_{b→a} allows us to obtain I_{b→a}(x, y) = I_b(x, y) * u_{b→a}(x, y), corresponding to the pre-event image projected onto the modality of the post-event image. A simple pixel-by-pixel difference between I_{b→a}(x, y) and I_a(x, y) then enables us to produce an output image of absolute differences that is subsequently binarized into the "change" and "no change" class labels. To this end, a Gaussian kernel is used to model the distribution of each class, and the Expectation Maximization (EM) [48] algorithm is used both to estimate the parameters of this weighted distribution mixture and to give an approximate estimate of S_nc (i.e., the pixels that do not belong to the class label "change") with a binarization scheme in the Maximum Likelihood (ML) sense. This in turn allows us to better estimate û_{b→a} (see Eq. 1). The process is repeated until the stability of the algorithmic fixed point thus defined is reached. This fixed-point strategy yields the following iterative procedure:

û^{[k+1]}_{b→a} = arg min_u Σ_{(x,y) ∈ S^{[k]}_nc} [ (I_b * u)(x, y) − I_a(x, y) ]²
D^{[k+1]}(x, y) = | (I_b * û^{[k+1]}_{b→a})(x, y) − I_a(x, y) |
S^{[k+1]}_nc = { (x, y) : D^{[k+1]}(x, y) is classified as "no change" in the ML sense under Φ^{[k+1]}_D }

where S^{[0]}_nc stands for the whole set of image pixels and Φ^{[k+1]}_D is the parameter vector of the two weighted normal kernels evaluated by the EM procedure (at iteration k + 1) on the difference image D(x, y). A second enhancement can be incorporated in this iterative procedure by simply inverting the direction of temporality of the two (before and after) satellite images in the algorithm introduced in our basic model [1]. Concretely, the estimation of û_{a→b} allows us to compute I_{a→b}(x, y) = I_a(x, y) * u_{a→b}(x, y), which yields another difference map, between I_{a→b}(x, y) and I_b(x, y).
The reverse estimate can be used along with the original one to obtain an average estimate ensuring an improved (less noisy) difference image D(x, y) (see Algorithm 2).
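The EM-based ML binarization used inside the fixed-point loop can be sketched as follows for the vector of absolute difference values; the initialization by a percentile split is an assumption of this sketch, not a detail specified by the paper:

```python
import numpy as np

def em_two_gaussians(d, n_iter=12):
    """Fit a two-component 1-D Gaussian mixture to the difference values d
    with EM (n_iter mirrors the paper's L_EMmax = 12), then return the
    ML "no change" mask and the mixture parameters."""
    d = d.ravel().astype(float)
    # Initialization (assumed): low values -> "no change", high -> "change".
    thr = np.percentile(d, 90)
    lo, hi = d[d <= thr], d[d > thr]
    mu = np.array([lo.mean(), hi.mean()])
    var = np.array([lo.var() + 1e-8, hi.var() + 1e-8])
    pi = np.array([lo.size, hi.size], dtype=float) / d.size

    def likelihoods():
        # Weighted normal densities of every sample under both components.
        return pi / np.sqrt(2 * np.pi * var) * \
               np.exp(-(d[:, None] - mu) ** 2 / (2 * var))

    for _ in range(n_iter):
        r = likelihoods()
        r /= r.sum(axis=1, keepdims=True)          # E-step: responsibilities
        nk = r.sum(axis=0)                         # M-step: update parameters
        pi = nk / d.size
        mu = (r * d[:, None]).sum(axis=0) / nk
        var = (r * (d[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-8

    lik = likelihoods()
    no_change = lik[:, 0] >= lik[:, 1]             # ML binarization
    return no_change, (pi, mu, var)
```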
In the case where the before or after image is a single-band image, the second image (possibly multi-band) is converted into gray levels and this pair of grayscale images is processed as explained in Algorithm 2. A third improvement is proposed in the particular case where the two before and after images are both in color (or with b > 1 bands). In this case, the mapping and a difference map D(b, x, y) are estimated for each of the b bands individually and sequentially. A final difference map is then defined as the L∞ distance, for each pixel (x, y), between the b components of each pixel (D(x, y) ← max_b D(b, x, y)), as illustrated in Algorithm 3.
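The L∞ fusion rule D(x, y) = max_b D(b, x, y) amounts to the following one-liner (the function name and the (bands, H, W) array layout are our own conventions for this sketch):

```python
import numpy as np

def fuse_bands(I_ba, I_a):
    """I_ba, I_a: arrays of shape (bands, H, W).
    Returns the (H, W) difference map taking, for each pixel, the
    maximum absolute band-wise difference (the L-infinity distance)."""
    D = np.abs(I_ba - I_a)   # one absolute difference map per band
    return D.max(axis=0)     # pixel-wise max over the b bands
```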

After the fixed point becomes stable (concretely, when S^{[k]}_nc ≈ S^{[k+1]}_nc, or after a maximal number of iterations L_Pmax is reached), the parameter vector of the Gaussian kernel mixture Φ_D (given by the EM procedure) is used to classify the difference map into two classes, once again, in the Sequential Maximum A Posteriori (SMAP) sense, by using the multiscale, fine-to-coarse and coarse-to-fine segmentation procedure [49].
This overall minimization routine is summarized in Algorithm 2 and in Algorithm 3 for a N-band image pair.

Segmentation Step

Heterogeneous Dataset Description
Our strategy is validated by conducting a set of evaluation scenarios on thirteen dissimilar heterogeneous datasets exhibiting different multimodal imaging data, such as cross-sensor or multisensor (#1, #5, #12, #13), multisource (#2, #3, #4, #7, #8), multi-look (#6, #9, #11) or multispectral (#10) images, with different resolution levels and image sizes, imaging a wide variety of change events and degraded by a wide variety of noise types and levels, as depicted in Table 1.
In all the experimental results, we recall that if one of the two images is in grayscale, we must convert the other image (in color or with b (> 1) bands) into grayscale in order to apply Algorithm 2. Otherwise, if the two images contain the same number of bands (b), we rely on Algorithm 3.
In addition, for computational reasons, we have reduced the image size so that its widest side is at most 500 pixels. For the estimation stage, the size sz of the (square) convolution filter was set to 9, with a maximal iteration number L_Gmax set to 200 and a fixed-point iteration number L_Pmax equal to 2. The iteration number L_EMmax used in the EM algorithm is 12 (see our basic model [1]). For the segmentation stage, the depth of the SMAP was set to d = 9 [49]. The regularization parameter θ was fixed to 0.9 (a value commonly used for this SMAP fine-to-coarse, coarse-to-fine segmentation algorithm [49,54]).

Convolution Filter Estimation Step Result
For all the test scenarios, the maximum iteration number used in the conjugate gradient descent, combined with the line minimization process, is sufficient to ensure the convergence of the minimization procedure. As an example, we show in Figure 1 how the error rate evolves along the minimization process. In addition, for all the scenarios, we estimate for u_{b→a} or u_{a→b} a convolution filter (which can be likened to a point spread function (PSF)) that is quite different from one dataset to another, since the frequency response (or modulation transfer function) of the imaging modality is also quite different for each pair of imaging modalities. Besides, the convolution filter size sz is also justified, since small parameter values are estimated at the edges of the square filter in all the tested cases (see Figure 2 for an example of a convolution filter estimate).

Results & Evaluation
We evaluate and compare the obtained results using the classification accuracy ACC, which measures the percentage of correctly classified changed and unchanged pixels, and the F-measure Fm, defined in Equations 5 and 6:

ACC = (TP + TN) / (TP + TN + FP + FN)   (5)

Fm = 2 TP / (2 TP + FP + FN)   (6)

where TP and TN are the true positives and true negatives, and FN and FP designate the false negatives and false positives. Table 2 shows a comparison between our segmentation results and different supervised and unsupervised state-of-the-art methods [9,12,16,17,20,23]-[27,32,38,51]-[53]. Figures 3, 4 and 5 compare the obtained results qualitatively. The average accuracy obtained by our proposed MCD model on the thirteen heterogeneous datasets is 87.78%, with an F-measure equal to 0.444, higher than the supervised and unsupervised state-of-the-art methods. Compared to these results, our preliminary (non-circular invariant) model [1] gave essentially the same accuracy but with a much lower F-measure, which did not exceed 0.389.
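Equations 5 and 6 translate directly into code (assuming the standard definitions of accuracy and F-measure from the confusion counts):

```python
def acc_fmeasure(tp, tn, fp, fn):
    """Accuracy and F-measure from the confusion counts (Eqs. 5 and 6)."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct pixels
    fm = 2 * tp / (2 * tp + fp + fn)        # harmonic mean of precision/recall
    return acc, fm
```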
Globally, we notice that our approach works efficiently when a) the considered MCD problem tends towards the simple monomodal CD problem, in other words when the imaging modality is not too different between the pre-event and post-event images (e.g., when similar sensors involve only slightly different noise laws between the before and after images); or b) in the case of a pair of high-resolution images; or c) in the case of simple change events with a simple uniform texture, for example a city flooded by a river (inducing a uniform texture for the new zone), or a building or structure under construction with a uniform texture very different from the rest of the image. Conversely, our approach is a little less suitable for processing bi-temporal images in which the variability between the before and after images is more important; in other words, when the imaging modality results from the combination of active and passive sensors, thus inducing a combination of multiplicative and additive noises in the before and after images with dissimilar noise laws. The same holds for image pairs with a low spatial resolution showing complex change events, for example an urban construction site (replacing a complex textured zone by a new, differently complex textured area).
The MCD model requires about 54 seconds to process an image pair (about 660 seconds for the set of 13 image pairs considered in this validation study). The processing time depends on the complexity of the estimation problem, the image size, and whether the mono- or multi-channel estimation is used. These timings were obtained with a non-optimized C++ implementation running on an Intel i7-930 CPU at 2.8 GHz under Linux.

Conclusion
We have presented in this work an efficient mapping model for the change detection problem in multimodal imagery, relying on the parametric estimation of a circular invariant convolution model. The optimization was performed on a convex quadratic error function using a conjugate gradient routine, whose optimal direction is determined by a quadratic line minimization algorithm, and formulated as a fixed point problem involving a contraction mapping. This allows us to project the before image into the imaging modality space of the second image, and inversely, so that a difference change map can be computed pixel by pixel and efficiently binarized with a multiscale Markov model. Our proposed strategy seems well suited to the multimodal change detection issue and robust enough against images provided at different resolution levels, with various types and levels of noise, and exhibiting a variety of natural events.