20th Iranian Conference on Electrical Engineering (ICEE2012), May 15-17, 2012, Tehran, Iran
Fusion of Visual and Infrared Signals in Visual Sensor Network for
Night Vision
Sajjad Ghaeminejad1, Ali Aghagolzadeh2, and Hadi Seyedarabi1
1- Faculty of Electrical and Computer Eng., University of Tabriz, Tabriz, Iran. 2- Faculty of Electrical and Computer Eng., Babol Nooshirvani University of Technology, Babol, Iran.
[email protected], [email protected], [email protected]
Abstract: Multisensor fusion has become an area of intense research activity in the past few years. The goal of this paper is to present a technique for fusing infrared and visible videos. In this technique we propose a fusion method that quickly fuses infrared and visible frames and gives better performance. This is done by first decomposing the inputs using DWT, extracting two maps (resulting from the Choose Max rule) from the approximation sub-frames, and then fusing the detail sub-frames according to these maps. Compared to some of the popular fusion methods, the experimental results demonstrate that not only does the proposed method have superior fusion performance, it can also be easily implemented in visual sensor networks, in which speed and simplicity are of critical importance.
Keywords: Fusion, Visual Sensor Networks, Night
Vision, Multi-scale Transformation, Visual awareness
1. Introduction
Visual sensor networks (VSN) are networks of smart cameras capable of local image processing and data communication. Cameras in VSNs form a distributed system performing information extraction and collaborating on application-specific tasks. The network generally consists of the cameras themselves, which have some local image processing, communication and storage capabilities, and possibly one or more central computers where visual data from multiple cameras is further processed and fused. Visual sensor networks are most useful in applications involving area surveillance, tracking, medicine and environmental monitoring.
Video fusion is one branch of multi-sensor data fusion. It is a technique to integrate information from multiple videos. These videos may come from one sensor or multiple sensors. Moreover, the sensors can be of different kinds or the same.
In recent years, it has been widely applied in machine vision, remote sensing, medical imaging, military applications, etc.
Infrared sensors are sensitive to differences of temperature in the scene, which is why IR videos have low definition and much of the content in their data is hard for the human eye to recognize. On the contrary, the visual sensor is sensitive to the reflecting properties of targets in the scene. Unfortunately, if a target is behind an obstacle that does not allow reflected light to come through and reach the sensor (e.g. a person in smoke), or when there is not enough light in the environment (e.g. a pedestrian walking in a dark street), the visible sensor does not give any information about these targets, since it is not receiving any. If IR and visible videos are fused, the advantages of both sensors (IR and visible) will be retained in the fused image.
Visual data fusion algorithms can be categorized into low, mid, and high levels; in some literature, this is referred to as pixel, feature, and symbolic levels. Pixel-level algorithms work either in the spatial domain or in the transform domain. By changing a single coefficient in the transformed fused image, all (or a whole neighborhood of) pixel values in the spatial domain will change. As a result, in the process of enhancing features in some image areas, undesirable artifacts may be introduced into other image areas. Algorithms that work in the spatial domain, however, have the ability to focus on the desired image areas, limiting unwanted changes elsewhere.
This paper is primarily concerned with pixel-level fusion. There are a number of pixel-based fusion schemes, ranging from simply averaging the pixel values of the inputs to more complex multi-resolution (MR) methods such as pyramid methods [3, 9], wavelet methods [1, 5, 6, 10, 17, 18], or Principal Component Analysis (PCA). A useful review of MR fusion schemes is given in [11]. MR pixel-based fusion methods, shown in Eq. (1), generally involve transforming each of the registered input frames I_1, I_2, ..., I_N from normal image space into some other domain by applying an MR transform, ω. The transformed frames are fused using some fusion rule, φ, and the fused image F is reconstructed by performing the inverse transform, ω^{-1}:

F = ω^{-1}( φ( ω(I_1), ω(I_2), ..., ω(I_N) ) )    (1)
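The transform-fuse-invert pattern of Eq. (1) can be sketched in a few lines; here a toy one-level 1-D Haar pair stands in for ω and an element-wise maximum for φ. Both are illustrative simplifications, not the DBSS wavelet and fusion rules used later in the paper:

```python
import numpy as np

def omega(x):
    """Toy one-level Haar pair: pairwise mean (approximation) + difference (detail)."""
    return (x[0::2] + x[1::2]) / 2.0, (x[0::2] - x[1::2]) / 2.0

def omega_inv(approx, detail):
    """Inverse of omega: perfect reconstruction of the even/odd samples."""
    y = np.empty(2 * approx.size)
    y[0::2] = approx + detail   # even samples
    y[1::2] = approx - detail   # odd samples
    return y

def mr_fuse(frames, phi=np.maximum.reduce):
    """F = omega^{-1}(phi(omega(I_1), ..., omega(I_N))) -- the shape of Eq. (1)."""
    coeffs = [omega(f) for f in frames]
    fused_approx = phi([c[0] for c in coeffs])
    fused_detail = phi([c[1] for c in coeffs])
    return omega_inv(fused_approx, fused_detail)
```

Any invertible MR transform and any coefficient-combining rule can be slotted into the same three-step skeleton.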
978-1-4673-1148-9/12/$31.00 ©2012 IEEE
A common wavelet-based transform used for fusion is the discrete wavelet transform (DWT). The 2-dimensional discrete wavelet transform and the Laplacian pyramid decompose an image into its multi-scale edge representation. They are based on the fact that the human visual system is primarily sensitive to local contrast changes, i.e. edges.
Visual data fusion in the spatial domain can also be classified into three categories based on the type of chosen neighborhood: pixel-based, window-based, and region-based [12, 13]. Pixel-based fusion is characterized by simplicity and the highest popularity. Because pixel-based and window-based methods fail to take into account the relationship between points, the fused image produced with either of them might lose some grey-level and feature information. Region-based fusion, on the contrary, can obtain the best fusion results by considering the nature of the points in each region altogether, and therefore has advantages over its two counterparts. This, of course, does not come without a cost: increased complexity and computational load. For video fusion in a VSN this is even worse, since bandwidth and power constraints must also be taken into consideration; more complex methods require more power and more time.

The fusion method proposed in this paper tries to capture the good characteristics of region-based methods in a faster manner. Therefore, a simple, yet efficient, fusion scheme is introduced in which the fusion rule for the detail sub-frames of the DWT of a frame is replaced by a logical coefficient-choosing map. This means the whole segmentation phase is bypassed, reducing complexity and computational load. We used the Daubechies Spline (DBSS) wavelet as the DWT in our work. In this paper we will use the terms image and frame interchangeably; in the case of video fusion these terms are technically the same.
3. Proposed Fusion Technique for Videos
Block diagram of the proposed fusion technique is
shown in figure 1. Infrared and visual videos are considered to be spatially and temporally registered. The
steps of the proposed technique are as follows:
I. Get IR and Visual frames
II. Extract moving objects and still background from
IR frame
III. Calculate the difference between extracted
background from the current IR frame and
the background extracted from the previous frame.
IV. Calculate the difference between the current visual
frame and the previous one.
V. If either of the two differences calculated in steps III
and IV exceeds its threshold, fuse the IR
background and visual frames using the
proposed method of section 4 and save the
result for future use.
VI. If neither difference exceeds the predefined
thresholds, just put the new object map on
the previous fusion of IR background and
visual frame and save it for future use.
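The six steps above can be sketched as a per-frame routine. Here `extract_objects` (section 3.1) and `fuse` (section 4) are passed in as placeholders, and the `state` dictionary caches the last fused background; all names and thresholds are illustrative:

```python
import numpy as np

def frame_difference(a, b):
    """Sum of absolute grey-level differences, as in equation (2)."""
    return np.abs(a.astype(float) - b.astype(float)).sum()

def process_frame(ir, vis, state, thr_ir, thr_vis, extract_objects, fuse):
    """One pass of steps I-VI for a registered IR/visual frame pair."""
    obj_mask, ir_bg = extract_objects(ir)                              # step II
    changed = (frame_difference(ir_bg, state["prev_ir_bg"]) > thr_ir   # step III
               or frame_difference(vis, state["prev_vis"]) > thr_vis)  # step IV
    if changed or state["fused_bg"] is None:                           # step V
        state["fused_bg"] = fuse(ir_bg, vis)
        state["fusions"] += 1                                          # bookkeeping only
    out = state["fused_bg"].copy()                                     # step VI: reuse the
    out[obj_mask] = ir[obj_mask]                                       # cached fusion and
    state["prev_ir_bg"], state["prev_vis"] = ir_bg, vis                # overlay objects
    return out
```

When the scene is static, the expensive fusion of step V runs only once, and subsequent frames pay only for differencing and the overlay.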
3.1 Background Subtraction
Since our main targets are usually warmer than their
environment (like a person walking in a street), first
moving objects and background are extracted from the
current infrared frame. There are different ways for
doing this ranging from simple frame differencing to
more complicated ones like mixture of Gaussians. In
Visual Sensor Networks, power and memory are two
critical factors that must be taken into account. Therefore,
more complicated methods are less desirable for our application.
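A minimal sketch of such a lightweight scheme: a running-average background model with simple thresholding, rather than a full mixture of Gaussians. The update rate `alpha` and the threshold are illustrative values, not parameters taken from the paper:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background estimate: one frame of memory and a
    multiply-add per pixel -- far cheaper than a mixture-of-Gaussians model."""
    return (1.0 - alpha) * bg + alpha * frame.astype(float)

def extract_moving_objects(frame, bg, thresh=25.0):
    """Pixels deviating from the background estimate are treated as moving
    (warm) objects; returns the object mask and a background-only frame
    with object pixels filled from the model."""
    mask = np.abs(frame.astype(float) - bg) > thresh
    background = np.where(mask, bg, frame.astype(float))
    return mask, background
```

This trades robustness for exactly the power and memory savings the VSN setting demands.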
3.2 Change Detection
At this stage, we see how different the current frames
of the infrared and visible videos are from their previous counterparts. There is a wide variety of methods for this. A
simple one is pixel-by-pixel differencing, summing the
absolute values of the differences as in equation (2),
where A and B are two frames, A(i,j) and B(i,j) are the
grey values at pixel (i,j), and the frames are of size m × n:

d(A, B) = Σ_{i=1}^{m} Σ_{j=1}^{n} |A(i,j) − B(i,j)|    (2)
Histogram differencing is another simple method for
measuring the similarity between two frames. If h_i(A) is the
value of the histogram of frame A at grey value i, the
similarity measure based on histogram differencing will be:

Hd(A, B) = Σ_{i=1}^{k} |h_i(A) − h_i(B)|    (3)

where k is the number of grey levels in the input frames.
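Both change measures are a few lines of NumPy, and the sketch below also illustrates their complementary behaviour: pixel differencing reacts to any motion, while histogram differencing ignores motion that leaves the grey-level distribution unchanged (an 8-bit grey range is assumed here):

```python
import numpy as np

def pixel_difference(a, b):
    """d(A, B) of equation (2): sum of absolute grey-value differences."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def histogram_difference(a, b, k=256):
    """Hd(A, B) of equation (3): L1 distance between the k-bin
    grey-level histograms of the two frames."""
    ha, _ = np.histogram(a, bins=k, range=(0, k))
    hb, _ = np.histogram(b, bins=k, range=(0, k))
    return int(np.abs(ha - hb).sum())
```

A pure translation of the scene leaves Hd at zero while d grows, so the two thresholds of step V guard against different kinds of change.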
4. Proposed Fusion Method
If either of the two differences calculated in steps III and IV
of section 3 exceeds its threshold (meaning
there is some important change), the IR background and
visual frames are fused using the proposed method shown in
Fig. 2. First, the extracted IR background and the visual frame
are decomposed by DWT. Here we treat their
approximation and detail sub-frames differently. To
afford some insight into our method, we apply each step
to a pair of frames from the Dune dataset [14], shown in
figure 3. Decomposing each frame using DWT (DBSS),
the approximation and detail sub-frames are as shown in
figures 4 and 5.
For the approximation sub-frames, we use the "choose max"
(CM) rule, which simply chooses the pixels with the larger grey
value from the two approximation sub-frames.
Fig. 1: The proposed technique for video fusion
Fig. 2: The proposed method for fusion of frames
Fig. 3: Frame no. 7406 from the Dune dataset; a) infrared frame, b) visible frame
Fig. 4: Detail sub-frames of the frames in figure 3; a-c) detail sub-frames of figure 3a, d-f) detail sub-frames of figure 3b.
Fig. 5: Approximation sub-frames of the frames in figure 3; a) for the infrared frame, b) for the visible frame.
To do so, a logical map is formed by applying the CM rule
to the approximation parts.
Let A and B be the two input frames, with approximation
sub-frames LL_A and LL_B and detail sub-frames
HL_A, HH_A, LH_A, HL_B, HH_B, LH_B. For frame A, the
'fusion choosing map' will be:
map_A(i,j) = 1 if LL_A(i,j) ≥ LL_B(i,j), and 0 otherwise    (4)

and for frame B:

map_B(i,j) = 1 − map_A(i,j)    (5)
Now, considering F, LL_F, HL_F, HH_F and LH_F as the
fused frame and its sub-frames, the approximation band of the
fused frame is produced according to equation (6):

LL_F = LL_A × map_A + LL_B × map_B    (6)
Therefore, using the CM rule for the approximation
sub-frames gives us two partial approximations and
two maps for fusing the detail sub-frames. These are shown in
figure 6. In the maps given in figures 6c and 6d, the white
parts represent the pixels chosen by the CM rule;
in other words, the white parts of each map indicate
the pixels that contribute to the approximation sub-frame
of the combined frame. The fused approximation sub-frame,
built from figures 6a and 6b, is shown in figure 6e.
For the detail sub-frames, as mentioned earlier, we do not
use conventional rules such as CM, averaging, or
weighted averaging, as used by researchers so far. Instead, the
maps resulting from fusing the approximation sub-frames are used as fusion choosing maps for the detail sub-frames.
Using the two maps 'map_A' and 'map_B', we form the details of the fused frame:
HL_F = HL_A × map_A + HL_B × map_B    (7)

LH_F = LH_A × map_A + LH_B × map_B    (8)

HH_F = HH_A × map_A + HH_B × map_B    (9)
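Equations (4)-(9) together amount to a short routine: build the CM maps from the approximation bands, then let those maps select every coefficient. A self-contained sketch, with a single-level Haar transform standing in for the DBSS wavelet used in the paper:

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar transform (a stand-in for the DBSS wavelet)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0            # row-pair averages
    d = (x[0::2, :] - x[1::2, :]) / 2.0            # row-pair differences
    ll, lh = (a[:, 0::2] + a[:, 1::2]) / 2.0, (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl, hh = (d[:, 0::2] + d[:, 1::2]) / 2.0, (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

def haar_idwt2(ll, bands):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    lh, hl, hh = bands
    h, w = ll.shape
    a = np.empty((h, 2 * w)); d = np.empty((h, 2 * w))
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    x = np.empty((2 * h, 2 * w))
    x[0::2, :], x[1::2, :] = a + d, a - d
    return x

def fuse_frames(a, b):
    """CM on the approximation bands; the resulting maps steer the details."""
    ll_a, (lh_a, hl_a, hh_a) = haar_dwt2(a)
    ll_b, (lh_b, hl_b, hh_b) = haar_dwt2(b)
    map_a = (ll_a >= ll_b).astype(float)           # Eq. (4)
    map_b = 1.0 - map_a                            # Eq. (5)
    ll_f = ll_a * map_a + ll_b * map_b             # Eq. (6): choose-max
    details = (lh_a * map_a + lh_b * map_b,        # Eqs. (7)-(9): every detail
               hl_a * map_a + hl_b * map_b,        # coefficient follows the frame
               hh_a * map_a + hh_b * map_b)        # that won the CM test
    return haar_idwt2(ll_f, details)
```

No segmentation step appears anywhere: the two boolean maps are the entire "region" machinery, which is what makes the method cheap.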
Fig. 6: Approximation sub-frames and CM maps for the frames of figure 3.
a) approximation sub-frame of figure 3a after applying the CM rule, b)
approximation sub-frame of figure 3b after applying the CM rule, c) CM
map for the infrared approximation sub-frame of figure 3a, d) CM map for
the visible approximation sub-frame of figure 3b, e) fused approximation
sub-frame resulting from combining figures 6a and 6b
As can be clearly seen from equations (7)-(9), a
detail pixel is chosen from an input only if its grey
value in the approximation sub-frame exceeds that of the
other input. Figures 7a and 7b show the detail
sub-frames of figure 4 after applying the choosing maps of
figures 6c and 6d, and figure 7c shows the fused detail
sub-frames. Finally, to get the fusion result of the frames
given in figure 3, the inverse DWT (IDWT) is applied to
figures 6e and 7c. The result is shown in figure 8.
Fig. 7: Detail sub-frames of frames in figure 3. a) detail sub-frames
of the visible frame of figure 3 after applying the map of figure 6d,
b) detail sub-frames of the infrared frame of figure 3 after applying
the map of figure 6c, c) fused detail sub-frames
Fig. 8: Fusion result of frames shown in figure 3
5. Experimental Results
Unfortunately, there is no video sequence long enough to
exercise the full technique described in section 3:
all publicly available sequences are shorter than 50 frames. Our
fusion method alone, however, can be thoroughly evaluated
using the currently available sequences. We applied
our method to three popular video sequence datasets,
namely Dune, Trees, and UN Camp [14]. Three criteria are
used to evaluate the performance of our proposed
method: mutual information (MI), Q^{AB/F} [15], and
FMI [16]. FMI, one of the newest measures, is a
modified MI; in [16] it is proven that this measure
matches the subjective measures more closely than all
existing ones. Our proposed method is compared to seven other popular methods: DWT, Laplacian
Pyramid (Lap.), PCA, FSD pyramid (FSD), Contrast
pyramid (Con.), Contourlet Transform (Cont.), and Non-Subsampled
Contourlet Transform with Pulse-Coupled
Neural Network (NT_P). For simplicity, we will refer to
our proposed method as "Prop." in the tables.
To begin, we need to find the best number of
decomposition levels for these methods. Since we need a fast
algorithm, we are not interested in high scales of
decomposition. Table I shows the results for the
DWT and Laplacian Pyramid methods on a frame of UN
Camp at 3 scales.
TABLE I: Results of UN Camp Video

Method              DWT                    Lap.
Scales        1      2      3        1      2      3
Q^{AB/F}      0.487  0.460  0.371    0.448  0.420  0.415
As can be seen, the best results are achieved with one scale
of decomposition, which is perfect for our application.
We chose CM for both approximation and detail sub-frames
in the pyramid-based methods. For DWT, CM was
used for the approximation and averaging was applied to the
detail sub-bands. These rules were chosen because they
have been shown to give the best results for these methods. The
results of applying these methods and our proposed one
to the three datasets mentioned earlier are shown in table II. We
applied each method 10 times, and the numbers in table
II are the average of all obtained results.
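For reference, the MI criterion used in table II can be computed from grey-level histograms. The bin count below and the MI(F,A) + MI(F,B) combination are common conventions assumed for this sketch, not parameters specified in the paper:

```python
import numpy as np

def mutual_information(x, y, bins=64):
    """MI between two grey images via their normalized joint histogram."""
    joint, _, _ = np.histogram2d(x.ravel(), y.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # zero cells contribute nothing
    return float((pxy[nz] * np.log2(pxy[nz] / (px * py)[nz])).sum())

def fusion_mi(fused, src_a, src_b):
    """Fusion score: information the fused frame shares with both sources."""
    return mutual_information(fused, src_a) + mutual_information(fused, src_b)
```

A higher score means the fused frame preserves more of the grey-level statistics of both inputs.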
TABLE II: Results on the Dune, Trees and UN Camp Videos

            Dune                Trees               UN Camp
         MI     Q^{AB/F}     MI     Q^{AB/F}     MI     Q^{AB/F}
DWT      2.702  0.541        3.550  0.462        3.347  0.483
Lap.     3.378  0.504        4.127  0.413        4.086  0.444
PCA      2.389  0.521        2.742  0.456        4.817  0.578
FSD      2.727  0.475        3.549  0.385        3.299  0.413
Con.     3.348  0.500        3.810  0.396        3.962  0.426
Cont.    1.009  0.360        1.521  0.298        1.205  0.254
NT_P     3.013  0.587        3.225  0.505        2.216  0.453
Prop.    4.483  0.614        6.497  0.585        5.727  0.564
The results shown in table II indicate that our
proposed fusion method outperforms the other popular
methods. Only in one measurement did one of the criteria
rate PCA's performance better than our proposed
method. Even though PCA is rated best in one of
nine measurements (3 measurements for 3 datasets), the
quality of its fused frames is not subjectively acceptable.
This is obvious in figure 9e. The reason for
this inconsistency between objective and subjective
measures is that some criteria in some cases do not match
subjective evaluation: a frame rated by a metric as a
high-quality output may not be rated highly by a human
observer. Figure 9 shows the result of fusing
frame 1826 from the UN Camp dataset. It is obvious that in
the result of the PCA method, the person is hardly visible.
The other methods, even those consistently evaluated as weaker,
clearly show the person.
Time efficiency is of great importance when dealing
with videos. Table III shows the average time, in seconds,
each algorithm needed to fuse a single frame of the three
sequences mentioned earlier. As the table
suggests, except for the NT_P method, which takes about 46
seconds to fuse one frame, the methods are
relatively fast and therefore acceptable
for video processing applications. Among these, DWT,
FSD, Contrast and the proposed method are the best ones
from the time-efficiency standpoint.
In [16], the authors prove that the FMI criterion matches
subjective measures best. Therefore we rely on FMI to
evaluate the results shown in figure 9.
However, we do not take NT_P into account, since it is too
slow for video fusion applications. Table IV shows the
FMI evaluation of the results in figure 9. FMI
rates the result of PCA as the worst, which matches
the subjective expectation, since the information about the
person is not transferred from the infrared frame to the
fused frame. Our proposed method is also rated better than the other methods by FMI.
TABLE III: Average time required to fuse one single frame for different
algorithms, in seconds

       DWT   Lap.  PCA   FSD   Con.  Cont.  NT_P  Prop.
sec    0.05  0.49  0.12  0.03  0.04  0.22   46    0.05
Fig. 9: Fusion of frame 1826 of UN Camp dataset. a) Original infrared
frame, b) Original visual frame, c) DWT, d) Laplacian pyramid, e) PCA,
f) FSD, g) Contrast pyramid, h) Contourlet Transform, i) Non-Subsampled
Contourlet Transform with Pulse-Coupled Neural Network,
j) proposed method
TABLE IV: Objective Evaluations of frame 1826 from UN Camp
dataset using FMI
DWT Lap. PCA FSD Con. Cont. Prop.
FMI 0.5023 0.4991 0.4573 0.4952 0.4912 0.3581 0.5024
Now, in the last part of our experimental results, we
evaluate these methods using FMI on our three
datasets. The results are shown in table V. Again, the
proposed method outperforms all the other methods.
TABLE V: FMI Evaluations on three datasets

          DWT    Lap.   PCA    FSD    Con.   Cont.  Prop.
Dune      0.537  0.537  0.526  0.536  0.536  0.392  0.539
Trees     0.502  0.490  0.430  0.488  0.479  0.385  0.503
UN Camp   0.577  0.577  0.575  0.576  0.575  0.344  0.598
From figure 9 and tables (II-V), it is clearly seen that
our method delivers better performance than the other
methods. The reason for this improvement is
that in previous methods, when a CM rule is applied
separately to the detail sub-frames, inconsistencies can
occur: a pixel's approximation may be chosen from one
frame by the CM rule, while not all three of its detail
coefficients are chosen from that same frame. In our
method, exploiting the nature of the night-vision
application, detail pixels are always taken from the same
frame that supplied the corresponding approximation
pixel. This means that if a pixel wins the CM test in the
approximation sub-frame, we pick the detail
coefficients of the MR transform for the fused result from
that same frame, no matter what their values are. In fact,
the proposed method is a kind of fast region-based fusion
method, since the grey value in grey-scale images can be used as a measure of affiliation between pixels of the image. In
our case, all pixels with high grey values can be clustered
into target regions and therefore treated alike.
Although some may argue that this does not hold in all cases,
it does hold in night-vision applications. The numerical
results also support this idea.
6. Conclusion
In this paper, we have described a video fusion technique that utilizes similarity between consecutive frames in a video sequence. Also a method for fusing video frames is proposed. The comparative analysis between the proposed method and seven existing methods has shown the merit of our approach.
7. Acknowledgement
This research was partially supported by the Iranian Telecommunication Research Center (ITRC), whose support is appreciated.
8. References
[1] S. Li, J. T. Kwok, Y. Wang, "Using the discrete wavelet frame transform to merge Landsat TM and SPOT panchromatic images," Information Fusion, Vol. 3, Pages 17-23, 2002.
[2] A. Goshtasby, "2-D and 3-D Image Registration for Medical, Remote Sensing and Industrial Applications," Wiley Press, 2005.
[3] A. Toet, "Hierarchical image fusion," Machine Vision and Applications, Vol. 3, Pages 1-11, 1990.
[4] S. K. Rogers, C. W. Tong, M. Kabrisky, J. P. Mills, "Multisensor fusion of ladar and passive infrared imagery for target segmentation," Optical Engineering, Vol. 28, Issue 8, Pages 881-886, 1989.
[5] J. J. Lewis, R. J. O'Callaghan, S. G. Nikolov, D. R. Bull and N. Canagarajah, "Pixel- and Region-Based Image Fusion with Complex Wavelets," Information Fusion, Elsevier, Vol. 8, Issue 2, Pages 119-130, 2007.
[6] A. Petrosian, F. Meyer, "Wavelets in Signal and Image Analysis," Kluwer Academic Publishers, the Netherlands, Pages 213-244, 2001.
[7] D. L. Hall, J. Llinas, "An introduction to multisensor data fusion," Proceedings of the IEEE, Vol. 85, Issue 1, Pages 6-23, 1997.
[8] G. Piella, "A general framework for multiresolution image fusion," Information Fusion, Elsevier, Vol. 4, Issue 4, Pages 259-280, 2003.
[9] A. Toet, L. V. Ruyven, J. Velaton, "Merging thermal and visual images by a contrast pyramid," Optical Engineering, Vol. 28, Issue 7, Pages 789-792, 1989.
[10] H. Li, B. Manjunath, S. Mitra, "Multisensor Image Fusion Using the Wavelet Transform," Graphical Models and Image Processing, Elsevier, Vol. 57, Issue 3, Pages 235-245, May 1995.
[11] Z. Zhang, R. S. Blum, "A Categorization of Multiscale-Decomposition-Based Image Fusion Schemes with a Performance Study for a Digital Camera Application," Proceedings of the IEEE, Vol. 87, Issue 8, Pages 1315-1326, 1999.
[12] N. Cvejic, J. Lewis, D. Bull, et al., "Region-based multimodal image fusion using ICA bases," IEEE Sensors Journal, Vol. 7, Issue 5, Pages 743-751, 2007.
[13] G. Piella, "A region-based multiresolution image fusion algorithm," Proceedings of the Fifth International Conference on Information Fusion, Pages 1557-1564, 2002.
[14] http://www.imagefusion.org/images/toet2/toet2.html (Last accessed on 11/11/11)
[15] C. Xydeas and V. Petrovic, "Objective pixel-level image fusion performance measure," Proceedings of SPIE, Pages 88-99, 2000.
[16] M. B. A. Haghighat, A. Aghagolzadeh, H. Seyedarabi, "A Non-Reference Image Fusion Metric Based on Mutual Information of Image Features," Computers and Electrical Engineering, Elsevier, Vol. 37, Issue 5, Pages 744-756, 2011.
[17] W. Cai, M. Li, X. Y. Li, "Infrared and Visible Image Fusion Based On Contourlet Transform," Fifth International Conference on Image and Graphics, 2009.
[18] X. B. Qu, J. W. Yan, H. Z. Xiao, Z. Q. Zhu, "Image Fusion Algorithm Based on Spatial Frequency-Motivated Pulse Coupled Neural Networks in Nonsubsampled Contourlet Transform Domain," Acta Automatica Sinica, Vol. 34, Pages 1508-1514, 2008.