DETECTION OF HUMAN INTRUSION IN A SMART HOME
A Project
Presented to the faculty of the Department of Computer Science
California State University, Sacramento
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
Computer Science
by
Beerdwinder Deep Kaur
FALL
2021
DETECTION OF HUMAN INTRUSION IN A SMART HOME
A Project
by
Beerdwinder Deep Kaur
Approved by:
__________________________________, Committee Chair Dr. Jun Dai
__________________________________, Second Reader Dr. Xuyu Wang
____________________________
Date
Student: Beerdwinder Deep Kaur
I certify that this student has met the requirements for format contained in the University format
manual, and this project is suitable for electronic submission to the library and credit is to be
awarded for the project.
__________________________, Graduate Coordinator ___________________
Dr. Jinsong Ouyang Date
Department of Computer Science
Abstract
of
DETECTION OF HUMAN INTRUSION IN A SMART HOME
by
Beerdwinder Deep Kaur
This project develops an AI-based system that helps detect human intrusion in a
smart home equipped with cameras and a network setup. The system is designed so
that once a camera detects a stranger holding a knife or gun, it captures a picture
of the stranger and directs all the cameras to detect and track that stranger anywhere
in the home, continually sending updates about the person to the user's mobile device.
Compared to traditional and similar systems, this project introduces a secondary layer
in which all the cameras cooperate to track the previously detected intruder.
In summary, the system is capable of face recognition and object detection
(weapons). This human intrusion detection system is developed using two models:
LBPH and Faster R-CNN.
_______________________, Committee Chair
Dr. Jinsong Ouyang
_______________________
Date
ACKNOWLEDGEMENTS
I am grateful to Dr. Jun Dai for giving me an opportunity to work on this project. His
in-depth technical inputs and guidance helped me shape this project. As a teacher,
he has taught me a great deal about AI/Machine Learning and also Software
Development Methodologies. The concepts I learned from the project have helped me
in my professional career. I would also like to thank Dr. Xuyu Wang for his
willingness to serve as a second reader for this project and provide valuable
feedback.
TABLE OF CONTENTS
Page
Acknowledgements ......................................................................................................... vi
List of Tables ................................................................................................................... ix
List of Figures .................................................................................................................. x
Chapter
1. INTRODUCTION ........................................................................................................ 1
1.1 Problem Statement .................................................................................................. 1
1.2 Existing Solution .................................................................................................... 2
1.3 Proposed Solution ................................................................................................... 3
2. BACKGROUND AND LITERATURE REVIEW ...................................................... 4
3. PROPOSED FRAMEWORK (DESIGN) .................................................................. 10
4. IMPLEMENTATION AND DEVELOPMENT ........................................................ 12
4.1 Data Collection: .................................................................................................... 12
4.2 Models used .......................................................................................................... 15
4.2.1 LBPH: ............................................................................................................ 15
4.2.2 Faster R-CNN ................................................................................................ 18
4.3 Implementing Face Detection and Recognition ................................................... 20
4.4 Implementation for Object Detection: .................................................................. 23
4.5 Primary Layer: ...................................................................................................... 27
4.6 Secondary Layer: .................................................................................................. 28
4.7 Impact Factor ........................................................................................................ 30
5. RESULTS AND EXPERIMENTATION .................................................................. 32
5.1 Results obtained for the first person: .................................................................... 34
5.2 Results obtained for the second person: ............................................................... 36
5.3 Results obtained from the third person. ................................................................ 39
6. FUTURE WORK ....................................................................................................... 42
7. CONCLUSION .......................................................................................................... 43
References: ..................................................................................................................... 44
LIST OF FIGURES
Figure Page
1: Functioning of the project from user's perspective .................................................... 10
2: Flowchart describing the working of the project ....................................................... 11
3: Different houses labelled in one image ........................................................ 14
4: Xml file generated after annotating an image. ........................................................... 14
5: Picture of knife held in hand ...................................................................................... 15
6: Working of the LBP operator ...................................................................... 17
7: Showing the working of LBP on an image ..................................................... 17
8: Architecture of Faster R-CNN ..................................................................... 19
9: Flowchart of face recognition..................................................................................... 21
10: Training faces ........................................................................................................... 22
11: Face recognition of a person .................................................................................... 22
12: Face recognition of unknown person ....................................................................... 23
13: Knife detection with 98% accuracy ......................................................................... 26
14: Knife detection with 100% accuracy ....................................................................... 27
15: Email sent to the user when an intruder is detected ................................................. 28
16: Code shows the working of secondary layer ............................................................ 30
17: User with the knife ................................................................................................... 32
18: Unknown user with the knife ................................................................................... 33
19: Email screenshot ...................................................................................................... 33
20: Screenshot sent along with the email ....................................................................... 34
21: First video tracked and sent in email ........................................................................ 34
22: Second video tracked in camera zone 2 ................................................................... 35
23: Third video tracked in camera zone 2 ...................................................................... 35
24: Fourth video tracked in camera zone 2 .................................................................... 35
25: Camera screen (zone 1), detection of thief ............................................................... 36
26: Image sent along with the email ............................................................................... 36
27: Same thief detected in camera zone 1 ...................................................................... 37
28: Same thief detected in camera zone 1 ...................................................................... 37
29: Thief moves to second location ................................................................................ 38
30: Tracking video of camera zone 2 ............................................................................. 38
31: Tracking video of camera zone 2 ............................................................................. 38
32: Tracking video of camera zone 2 ............................................................................. 39
33: Thief detected and image sent .................................................................................. 39
34: Thief tracked in camera zone 1 ................................................................................ 40
35: Thief tracked again in camera zone 1 ....................................................................... 40
36: Thief tracked in camera zone 1 ................................................................................ 41
Chapter 1. Introduction
1.1 Problem Statement
Home security is one of the main concerns for people, and numerous alternatives can
provide it. CCTV cameras, for instance, can capture and store video for the whole day.
However, additional effort is required to go through the video feed and check for any
malicious activity, and a CCTV system provides no alert mechanism; these are two major
drawbacks of that approach. Other systems, such as ADT, bundle burglary alarms, fire
alarms, and more. However, behind the scenes ADT links a protection agency to guard
against theft, and the system offers little privacy: every moment of daily life in the
house is recorded. Systems like Ring provide human-detection alerts outside the home,
but they will still raise an alert when the homeowner is detected outside the house.
Keeping all these concerns in mind, a better security system is required that can meet
a family's expectations. This motivation leads to building a system that:
• Distinguishes between the legitimate user and an unknown person.
• Sends an alert to the user whenever a thief is detected.
• After a thief/intruder is detected, forms a second layer in which all the cameras
track the thief.
• Involves no external agency, so the data stays secured with the user.
1.2 Existing Solution
Khan et al. [21] proposed a lightweight intrusion detection system that uses a
Raspberry Pi camera to detect a robbery: an attacker holding a knife or gun in
hand. An event-driven approach is utilized so that the camera sensors provide
better surveillance services [21]. Patterns are monitored for better results, and
activity is supervised within the field of view [21]. With the event-driven
approach, the cameras remain on standby [21], which yields an energy-efficient
framework that responds only when activity matching the target event is observed
[21]. The authors deployed a CNN model to solve this problem statement. The
framework integrates image processing, AI computer vision, and network
communication methods for real-time crime event detection [21]. The CNN model is
trained on images of knives and guns, with the data split in an 8:2 ratio between
training and testing.
Kepin Yu et al. proposed a framework that can recognize an object (knife, gun, or any
metal) carried by a moving person [22]. In the traditional system, people have to wait
for the security check at the airport gate [22], which results in bottlenecks and
congestion. Therefore, the paper proposed an AI-based W-band suspicious-object
detection system comprising two layers: a primary level and a secondary level. In
primary-level screening, a suspicious object is inspected using W-band radar if it is
within 15 meters [22]. In secondary-level screening, the person flagged as suspicious
in the primary layer is tracked and the suspicious object is recognized. A Recurrent
Convolutional Neural Network (R-CNN) methodology is used for image recognition and
classification [22]. Four types of suspicious objects (knife, scissors, fork, bottle)
are used to train and evaluate the CNN model, and blurred and noisy images are included
in training to reduce the false rate (FR) [22].
1.3 Proposed Solution
The proposed solution incorporates two layers: a primary layer and a secondary layer.
In the primary layer, face recognition and object detection take place to establish
whether the person is unknown and whether they are holding a knife. The user receives
an email alert, along with the intruder's image, when both conditions are satisfied.
For face recognition, an LBPH classifier is deployed; for object detection, a Faster
R-CNN model is trained on the data. Once the thief is identified, the secondary layer
is established, in which the thief is tracked: a model is trained with images of the
thief, and the face recognizer then looks for that same unknown person. If the thief
is detected in the secondary layer, a video is sent to the user through email, and the
system continues sending emails as long as the thief is present in any camera zone.
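The email-alert step described above can be sketched with Python's standard library. This is a minimal sketch, assuming nothing about the project's actual code: the addresses, subject line, and SMTP details are illustrative, not the system's real configuration.

```python
import smtplib
from email.message import EmailMessage

def build_alert(sender, recipient, image_bytes):
    """Build the intruder-alert email with the captured frame attached."""
    msg = EmailMessage()
    msg["Subject"] = "ALERT: intruder detected"   # illustrative subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("An unknown person holding a knife was detected.")
    # Attach the snapshot taken by the primary layer.
    msg.add_attachment(image_bytes, maintype="image", subtype="jpeg",
                       filename="intruder.jpg")
    return msg

if __name__ == "__main__":
    msg = build_alert("home@example.com", "owner@example.com", b"\xff\xd8fake")
    # Actually sending requires real credentials (e.g. an app password),
    # so the SMTP step is shown only as a comment:
    # with smtplib.SMTP_SSL("smtp.gmail.com", 465) as s:
    #     s.login("home@example.com", "app-password")
    #     s.send_message(msg)
```

For real delivery, an app-specific password or similar SMTP credential would be needed; the message-building part is independent of the mail provider.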
Chapter 2. Background and Literature Review
Home security has always been a major concern among people. With the help of an
intelligent system, however, a home can be secured using CCTV cameras, video
doorbells, smart locks, and more. Life would become much easier if a system could
predict a crime by processing CCTV footage [17]. The research in [17] focuses on an
image dataset for detecting any gun, knife, gun in hand, or blood. In that dataset,
guns were further classified as shotguns, machine guns, and revolvers, giving six
classes in total: knife, blood, shotgun, machine gun, revolver, and gun in hand. The
images in the dataset were 150x100 pixels. Max pooling reduced the original image to
75x50, and a second round of max pooling brought the dimensions to 35x27.
A multi-layered CNN model was used to detect the crime scene. To arrive at a
detection result, the model employed convolutional layers, the Rectified Linear Unit
(ReLU), a fully connected layer, and the CNN dropout function [17]. The CNN was
developed using TensorFlow (an open-source platform) to obtain the desired results
[17]. The proposed model achieved an accuracy of 90.2% on the test dataset [17].
There were, however, some test cases in which the model detected any color similar
to blood as blood. Moreover, shotguns and revolvers look quite similar; because they
were trained with the CNN model after the machine-gun class, the model had some
accuracy problems differentiating the gun types [17]. These were some of the flaws
noted in the research work of [17].
A weapon detection system should be robust in order to provide enhanced security;
the ability of an automated surveillance system to detect weapons raises the level of
security [18]. In [18], Deep Convolutional Neural Networks (DCNNs) are used in a
novel way to classify weapons. To initialize the weights of the architecture's
convolutional layers, the proposed approach employs the idea of transfer learning
[18], a form of learning in which what is learned in one setting is applied to help
learning in another [18]. In [18], the number of neurons in the fully connected
layer was varied, yielding two models that helped determine the effect of neuron
count on weapon-classification accuracy [18].
Experiments were carried out on three classes of images: knife, gun, and no-weapon
[18]. Various varieties of knives were retrieved from the internet, and some of the
images were taken with the lab's camera [18]. The gun class consisted of several
photos of pistols downloaded from the internet [18], while humans, automobiles,
chairs, and other objects made up the no-weapon class [18]. Because the dataset used
in [18] is small, some changes were made to ensure proper training: the number of
parameters was reduced and the dropout value was increased [18]. After building the
two models with increasing numbers of neurons, the analysis showed that true
positives do not increase with more neurons [18], and an increase in dropout may or
may not decrease true positives [18]. This suggests that simply adding neurons to a
convolutional neural network does not lead to better or more accurate results.
In-depth research on different algorithms for object detection has been done in
[19]. The algorithms are classified as non-deep-learning and deep-learning
algorithms [19]. Color segmentation, interest points, shapes, and edge detectors are
commonly used in non-deep-learning methods [19]. The dependence of these methods on
image quality and angle is their major drawback [19]: noise and occlusion in images
can be difficult to handle [19], and if the target object shares the same color as
its surrounding objects, color segmentation will struggle to separate it from the
image [19]. In non-deep algorithms, image quality is crucial [19]; poor-quality
images yield poor weapon-detection results.
On the other hand, deep-learning algorithms require no feature engineering, which
gives them the upper hand over non-deep-learning algorithms [19]. The study in [19]
compares the accuracy obtained by different deep-learning algorithms: R-CNN, Faster
R-CNN, YOLO, SSD, and OverFeat [19].
In the conclusion of [19], the comparison of results differs based on the dataset
used: some algorithms used the ImageNet dataset and some used custom datasets [19].
To resolve this, [19] takes the ImageNet dataset as the standard for generating
results, and under that assumption Faster R-CNN gives the best speed [19]. This
conclusion about Faster R-CNN providing fast and accurate results led to using this
model for object detection in this project.
The research work in [17], [18], and [19] detected objects and crime scenes based on
images in the testing dataset rather than a live video feed.
However, some research work has focused on both real-time and non-real-time
systems. In [20], a support system is built to provide security: face recognition
and anomaly detection are performed to recognize an individual's actions [20].
These two are the key aspects of the system suggested in [20], and as a result it
can, if necessary, be used to send alerts to the relevant authorities and family
members [20]. The proposed system does not take a dataset of the homeowner's face
[20].
The primary goal of that project is to develop an image-processing-based security
solution [20]. The system's flow begins with the capture of a face image, which is
then subjected to face recognition [20]. If the face is recognized, image capture is
carried out once more; otherwise, the body-tracking component is started [20]. If an
anomaly is discovered, an alert is sent to the cloud along with the location where
the anomaly was discovered [20]. Face recognition, anomaly detection, and
transferring data to the cloud are the three primary components of the system
software [20].
The first component employs a webcam to collect people's faces and predict their
identities [20]. A PCA approach is used by the application to identify a face from
the database [20]: the eigenfaces are discovered by projecting the principal
components onto the eigenspace [20], and the lowest Euclidean distance of the
projection determines the unknown faces [20].
If the detected person matches a face in the unknown dataset, anomaly detection is
carried out [20]. The anomaly detection focuses on activity patterns, based on an
unsupervised method, the Hidden Markov model [20]. The activity pattern is related
to burglary position: when such a pattern is recognized, a picture is taken and
saved in the cloud with the location of the camera, and a siren starts ringing to
alert the owner about the activity. There must be at least two anomaly detections
that pass above the configured threshold [20].
Whenever an anomaly is detected, its time and location are reported to ThingSpeak, a
cloud platform [20], and the image of the detected anomaly is also delivered to the
cloud as a 1-D array [20]. The image saved in the cloud database can later be
retrieved by the owner for further investigation.
The research carried out in [20] was tested on both real-time and non-real-time
datasets. The proposed model turned out to be more efficient and effective on the
non-real-time dataset, with an efficiency of 100%, but it was still able to achieve
an efficiency of 70% in real-time surveillance. [20] concludes that a larger dataset
will lead to more efficiency in real-time surveillance [20].
In conclusion to the research work in [17], [18], and [19]: the kind of weapon used
determines the nature and scope of a crime [18], and if an automated video
surveillance system can raise an early alarm, the effects can be reduced to the
greatest extent possible [18]. The effectiveness of video surveillance can be
considerably enhanced by a comprehensive weapon-categorization system, and an
automated surveillance system can also benefit from weapon classification [18]. It
has been observed that guns and/or knives are frequently employed in criminal
activity due to their portability [18]; as a result, the work in [18] focuses on
recognizing and classifying these two weapon types in any image [18].
The research work in [17] detected crime scenes based on images in the testing
dataset rather than a live video feed, and it is not always achievable for a person
to pay attention to all the video feeds recorded by many cameras on a single screen
[18].
Hence, there is a need for active surveillance systems that can send an alert to the
user whenever malicious activity is recognized.
Chapter 3. Proposed Framework (Design)
Design is necessary to lay out the project before taking a deep dive into it: it
helps in understanding the rudiments of the project and gives a clear picture of the
functionality.
The figure below describes the functionality of the project from the user's
perspective. If a person unknown to the system is detected and, moreover, he or she
is holding a knife, an alert is sent from the system to the user by email.
Furthermore, some pictures of the person are taken to keep track of them. The
cameras are placed at different locations; every camera is active, and detection of
the unknown person and the knife is done simultaneously. This gives a general idea
of the project.
Figure 1: Functioning of the project from user's perspective
The flowchart below provides the layout, or design, in more detail. As described in
the flowchart, an email alert is sent only if the person is unknown and has a knife;
an unknown person without a knife is not treated as a thief. Once the thief is
detected, tracking takes place. Before tracking begins, images of the thief are
stored and used to track that same unknown person.
Figure 2: Flowchart describing the working of the project
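The alert condition from the flowchart (unknown face AND knife present) can be sketched as a small decision helper. The function and label names here are illustrative, not taken from the project's code.

```python
def classify_event(face_known: bool, knife_detected: bool) -> str:
    """Primary-layer decision: alert only when BOTH conditions hold."""
    if not face_known and knife_detected:
        return "alert"        # unknown person with a knife -> email + start tracking
    if not face_known:
        return "ignore"       # unknown but unarmed: not treated as a thief
    return "authorized"       # known resident, regardless of what they hold
```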
Chapter 4. Implementation and Development
4.1 Data Collection:
A dataset plays a very important role in any AI project and system. The availability
of data has a significant impact on how the system is put together and which AI
approaches are deployed [5], and the amount and quality of available data affect the
final product's quality [5]. In this sense, one could argue that data availability
(whether data exists) and accessibility (whether data can be obtained) are the
primary motivators for the development of AI-based solutions [5].
ML is extremely reliant on data; without it, an "AI" is unable to train [6]. Data is
the most important factor that allows algorithm training to take place [6]; any AI
project will fail, regardless of how smart the AI team is, if the dataset isn't good
enough [6].
For this project, two types of datasets were required for carrying out face detection and
object detection.
1. Face detection and recognition: For face recognition, 120 images of the
legitimate user are used in the dataset to differentiate between user and thief.
To capture the 120 images, a Python program is written that takes pictures and
saves them in the dataset folder. The haarcascade_frontalface_default.xml
classifier is used for detecting the face, and the pictures are saved in
grayscale. The program takes roughly 20 seconds to capture 120 images, during
which the user tilts their face in the four directions; this helps improve the
accuracy of the face recognition.
2. Object detection: As mentioned before, data plays a very important role in this
project; the bigger the dataset, the more accurate the results. A large dataset
provides great benefit in research on object detection and recognition [1].
LabelMe, a web-based tool for annotating images by classifying them into
different classes [2], is used for building the dataset in this project. Such a
tool is necessary because current segmentation and interest-point techniques
cannot detect the outlines of many object parts, which are often small and
obscure in natural photos, so annotated data is required [2]; it is vital to have
a "center of attention" [2]. The objective of LabelMe is to contribute
high-quality labelling and diverse classes with detailed information such as
bounding boxes, polygons, and more [1].
Thus, the purpose of the annotation tool is to provide a drawing interface that
works across several platforms, is simple to use, and enables fast data sharing [1].
A picture appears when the user first visits the page [1]; the picture is part of a
much larger image database that includes hundreds of object types and a wide range
of situations [1]. By tapping control points around the perimeter of the object, the
user can label a new object [1]: the user starts labelling by drawing a shape that
connects the end points of the object. Here, a rectangle is used for the knife. Once
this is done, a dialog box pops up asking for the object name; after entering it,
the same image can be labelled again for any other object. This dataset comprises
one class, knife, with 1,346 images in total. The images are labelled one by one
along with the name of the object, and the labels are saved in an XML file, making
the format extensible [1].
Figure 3: Different houses labelled in one image [3].
For every labelled image, an XML file is created. Below is the XML file for one of
the images in the dataset.
Figure 4: XML file generated after annotating an image.
This XML file contains the position of the knife (xmin, xmax, ymin, ymax) in the
image, along with the image name, width, height, and depth, and the object name
'knife'. To make the system more accurate at detecting knives, images of a person
holding a knife in hand are also used in the dataset to train the model, since while
holding a knife, the knife might be covered up to 50% by the hand (as shown in the
picture below).
Figure 5: Picture of knife held in hand
To label this picture, the red bounding box is the correct representation for the
annotation. This ensures not only that the knife is detected, but also that it was
held by a human rather than simply lying on the floor, a desk, or somewhere else.
The dataset is divided into training and testing sets; 234 pictures are used for
testing.
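The annotation files described above follow the Pascal-VOC-style XML layout shown in Figure 4. A minimal sketch of reading a bounding box back with Python's standard library, assuming that tag layout (the sample XML below is illustrative, not a real file from the dataset):

```python
import xml.etree.ElementTree as ET

def read_annotation(xml_text):
    """Extract (label, xmin, ymin, xmax, ymax) boxes from a VOC-style XML file."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")               # e.g. "knife"
        box = obj.find("bndbox")
        boxes.append((name,
                      int(box.findtext("xmin")), int(box.findtext("ymin")),
                      int(box.findtext("xmax")), int(box.findtext("ymax"))))
    return boxes

# Illustrative annotation in the format Figure 4 shows.
SAMPLE = """
<annotation>
  <filename>knife_001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object>
    <name>knife</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>260</xmax><ymax>310</ymax></bndbox>
  </object>
</annotation>
"""
```

A training script would call `read_annotation` once per XML file to pair each image with its knife box.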
4.2 Models used
4.2.1 LBPH:
The most difficult problem in computer vision during the last decade has been
automatic individual face identification [7]. However, law enforcement agencies are
still unable to effectively classify and observe a person using video surveillance
cameras [32]; lighting, blur, resolution, and illumination are all influential
issues in facial recognition [7], [32]. This is where the Local Binary Pattern
Histogram comes in, which is used here to handle human face identification [7],
[32].
LBPH is a face recognition algorithm based on the local binary operator [8]. It is a
widely employed algorithm thanks to its two main features: its discriminative power
and its computational simplicity [8]. Let's take a closer look at the LBPH method;
there are various steps to this approach.
The first is the parameters, which are divided into four categories: neighbors,
radius, grid X, and grid Y [9]. The radius is deployed to locate the image's center
and is set to 1 here. The number of neighbors cannot exceed 8 pixels, as that would
be too expensive to compute [9]. The cells of grid X run in the horizontal
direction, while those of grid Y run vertically; given the cost of computing power,
grid X and grid Y are both set to 8 [9]. The dataset must then be trained for the
method: each user has a distinct identity, and photographs are kept according to
that identification value [9].
Every image of size M x N is split into regions according to this grid.
The local binary operator is applied in each region. When used on pictures, this
operator compares a pixel to its eight closest neighbors [8]: if the intensity of a
neighboring pixel is greater than the intensity of the center pixel, the comparison
returns a value of '1'; a '0' is returned in all other cases [8]. When this
technique is applied to all eight neighbors, eight binary values are obtained [8],
which are then combined to generate an 8-bit binary number [8]. The resulting binary
number can be converted to a decimal value in the range 0-255, known as the pixel's
LBP value [8]. In the figure below, the operation is performed on each region [8].
Figure 6: Working of the LBP operator [8].
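The operator shown in Figure 6 can be sketched for a single 3x3 neighborhood in pure Python; a real implementation slides this window over every pixel of each region. The clockwise bit order starting at the top-left neighbor is one common convention, assumed here for illustration.

```python
def lbp_value(patch):
    """LBP code for a 3x3 patch: threshold the 8 neighbours against the
    centre pixel and read the results as an 8-bit binary number (0-255)."""
    c = patch[1][1]
    # Clockwise walk around the centre, starting at the top-left neighbour.
    neighbours = [patch[0][0], patch[0][1], patch[0][2],
                  patch[1][2], patch[2][2], patch[2][1],
                  patch[2][0], patch[1][0]]
    bits = ["1" if n >= c else "0" for n in neighbours]
    return int("".join(bits), 2)
```

For example, a patch whose neighbours are all at least as bright as the centre yields the maximum code 255; the per-region histogram then counts how often each of the 256 codes occurs.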
The current region's histogram is then constructed by counting the number of times each
LBP value appears in that region [8]. As a result, each region's histogram consists of 256
bins [8]. Then, by concatenating the histograms of all regions, we generate a new, larger
histogram [9]. For an 8x8 grid, the resulting histogram has a total of 8 x 8 x 256 = 16,384
bins [9]. This histogram represents the characteristics of the original image [9]. Face
recognition is the final phase, for which the algorithm has already been trained [9]. Each
image in the training dataset is represented by a histogram [9]. For each new image, all of
the previous steps are repeated to construct the histogram that represents it [9]. The
histogram generated for the new image is then compared with the stored histograms [9],
and the closest histogram is considered a match [9]. The image that corresponds to that
histogram is returned as the result of the algorithm [9].
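A minimal sketch of this histogram-matching step, using Euclidean distance between concatenated region histograms (the images, names, and grid size here are hypothetical):

```python
import numpy as np

def image_histogram(lbp_image, grid=(8, 8)):
    """Concatenate the 256-bin LBP histograms of each grid region."""
    rows = np.array_split(lbp_image, grid[0], axis=0)
    hists = []
    for row in rows:
        for region in np.array_split(row, grid[1], axis=1):
            hist, _ = np.histogram(region, bins=256, range=(0, 256))
            hists.append(hist)
    return np.concatenate(hists)  # 8 * 8 * 256 = 16,384 bins

# Hypothetical "training" images of LBP values, one per user.
rng = np.random.default_rng(0)
train = {name: rng.integers(0, 256, (64, 64)) for name in ["user1", "user2"]}
train_hists = {name: image_histogram(img) for name, img in train.items()}

# A query identical to user1's image matches user1 with distance 0.
query = train["user1"].copy()
distances = {name: np.linalg.norm(image_histogram(query) - h)
             for name, h in train_hists.items()}
best = min(distances, key=distances.get)  # → "user1"
```

The closest stored histogram wins; its distance is what the algorithm reports as the confidence value discussed next.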
Figure 7: Showing the working of LBP on Image
18
In addition, the algorithm returns a confidence value, which is equal to the calculated
distance [10]. A lower confidence value is preferable: the shorter the distance, the better
the accuracy of the algorithm [9]. The algorithm can then be given a threshold value, and
by comparing the confidence value against this threshold, it can determine whether it has
identified the picture or not [9].
So, essentially, the face recognizer checks the confidence value. If the confidence value is
greater than the threshold value, the face is not recognized, and the recognizer returns -1
or a negative value.
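The threshold check described above can be sketched as a small helper; the threshold value and labels here are hypothetical, not taken from the report:

```python
def recognize(label, confidence, threshold=70.0):
    """Return the predicted label if the distance-based confidence is
    below the threshold; otherwise return -1 (face not recognized)."""
    if confidence > threshold:
        return -1
    return label

# E.g. OpenCV's LBPH recognizer predict() returns (label, confidence);
# a low confidence (short distance) means a match.
assert recognize(3, 42.5) == 3      # close match: accepted
assert recognize(3, 95.0) == -1     # distance too large: rejected
```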
4.2.2 Faster R-CNN
Convolutional neural networks have great properties in the field of object detection [11].
Their weight-sharing network structure reduces the complexity of the network model.
The Fast R-CNN method is comparable to the R-CNN algorithm, but instead of feeding
region proposals to the CNN, the whole image is fed to the CNN to generate a
convolutional feature map [14]. The region proposals are then selected from this
convolutional feature map. These region proposals come in different sizes; to feed them
into the fully connected layer, an RoI pooling layer is used to reshape them into a fixed
size [14]. After being reshaped, a softmax layer is employed to predict the class and the
bounding-box offset values from the RoI feature vector [14]. However, the region
proposals used in Fast R-CNN become a bottleneck, as generating them slows down the
algorithm and ultimately affects performance [14].
In this project, Faster R-CNN is employed to detect objects [11]. Faster R-CNN combines
Fast R-CNN with an RPN for object detection.
The RPN is a fully convolutional deep learning network [13]. At each position in the input
image, the RPN can estimate the target area frame and a target score (the probability of an
actual target) [11], [12]. The RPN is used to build high-quality region proposal boxes; it
shares the full image's convolutional features with the detection network and eliminates
the speed problem of the original Selective Search [11], [12], considerably improving the
object detection effect [11].
Figure 8: Architecture of Faster RCNN [14]
These region proposals are of different sizes [30], which leads to CNN feature maps of
different sizes [30]. It would be a struggle to come up with an efficient structure that works
on feature maps of varying sizes [30]. To make these feature maps the same size, region of
interest (RoI) pooling is used [30]: the input feature map is divided into a fixed number of
nearly equal regions, and max pooling is then applied to every region [30]. Because the
number of regions is fixed, max pooling over them produces an output of fixed size
regardless of the input size [30].
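As an illustration (not the report's code), RoI pooling over hypothetical feature maps of different sizes can be sketched with NumPy:

```python
import numpy as np

def roi_pool(feature_map, out_size=(2, 2)):
    """Divide the map into a fixed out_size grid of nearly equal
    regions and max-pool each one, yielding a fixed-size output."""
    rows = np.array_split(feature_map, out_size[0], axis=0)
    pooled = [[region.max()
               for region in np.array_split(row, out_size[1], axis=1)]
              for row in rows]
    return np.array(pooled)

# Two proposals of different sizes both pool down to the same 2x2 shape.
small = np.arange(16).reshape(4, 4)   # 4x4 proposal
large = np.arange(49).reshape(7, 7)   # 7x7 proposal
print(roi_pool(small).shape, roi_pool(large).shape)  # → (2, 2) (2, 2)
```

Because the output grid is fixed, downstream fully connected layers always receive inputs of the same size.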
After passing through two fully connected layers, the features are sent into the sibling
classification and regression branches [31]. It is worth noting that these classification and
detection branches differ from the RPN's [31]. Classification scores estimate the class of
each proposal; they can be computed in two ways: either the probability of a proposal
belonging to each class is calculated, or the features are passed through a softmax layer
[31]. The regression layer coefficients are employed to refine the predicted bounding boxes
[31]. The regressor in this case is size-agnostic (unlike the RPN's), but it is unique to each
class [31]: in the regression layer, each class has its own regressor with four parameters
[31].
4.3 Implementing Face Detection and Recognition
For face detection, the OpenCV DNN module is used [15]. The detector is constructed on
the 'Single Shot Multi-Box Detector' (SSD) and is based on the 'ResNet-10' architecture
[15]. The Single Shot Multi-Box Detector is similar to the YOLO technique in that it uses
MultiBox to detect numerous items in a single shot [15]. It is a substantially faster object
detection system with good accuracy [15].
Below is the flowchart of the face recognition.
Figure 9: Flowchart of face recognition [7]
Face recognition: To recognize the face of the legitimate user, a dataset of user images is
required to train the model. A Python script is written to capture 120 images of the user; it
automatically starts taking pictures and saves them in a dataset folder. The number of users
can be increased according to the number of family members. Once the dataset has
acquired 120 images per user, the LBPH model is trained with the help of this dataset.
Below is the code for training on the user images using LBPH. Images taken from the
dataset are converted to grayscale. Here, 'id' is a distinct value assigned to each user; while
running the recognizer program, the id refers to an index of the names array, for instance,
arr[id] = "Beer".
Figure 10: Training faces
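As an illustration only (the filename pattern "User.&lt;id&gt;.&lt;count&gt;.jpg" is an assumption, not necessarily the report's), each user's distinct id can be recovered from the dataset filenames and mapped back to a name:

```python
def parse_user_id(filename):
    """Extract the numeric user id from names like 'User.1.42.jpg'."""
    return int(filename.split(".")[1])

# The recognizer later maps each id back to a display name via an array,
# e.g. arr[id] = "Beer" as described in the report.
names = ["", "Beer"]
dataset = ["User.1.1.jpg", "User.1.2.jpg"]
ids = [parse_user_id(f) for f in dataset]
print(names[ids[0]])  # → Beer
```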
Below are screenshots of the user and of an unknown person recognized after training the
model.
Figure 11: Face Recognition of a known person
Unknown user:
Figure 12: Face Recognition of unknown person
4.4 Implementation for Object Detection:
The foremost step is the construction of the object detection classifier [16]. This is
accomplished by training our own convolutional neural network for the object 'knife' [16].
Below are the steps followed to accomplish this purpose.
1. Install Anaconda: Anaconda is software that builds virtual Python environments,
which allows us to install and use Python libraries without fear of conflicting
with current installations [16]. Anaconda is a Python distribution and provides a
straightforward way to set up TensorFlow [16].
2. Set up the TensorFlow directory.
• The TensorFlow Object Detection API repository will be downloaded from
GitHub [16]. A folder named "tensorflow1" will be created under the C:
directory; all the training images, classifiers, and training data will be stored
in this directory [16]. The TensorFlow object detection download will be
extracted into the 'models' folder.
• In this step, the Faster-RCNN-Inception-V2-COCO model will be downloaded.
Faster R-CNN provides higher accuracy at the cost of slightly slower speed.
• Set up new folders: all the data (training and testing) will be put under this
directory. Further, directories for the inference graph and the label map will be
created.
• Download the packages described in the table below.
Table 1: Packages downloaded
NumPy: Used to compute numerous operations on arrays and multidimensional
objects [29].
TensorFlow: An artificial intelligence software library that can be used for a variety
of applications [27].
OpenCV-Python: OpenCV is a large open-source library with a lot of functionality;
most importantly, it can be used for image processing, computer vision, and machine
learning [28].
Matplotlib: A plotting library for the Python programming language [26].
Pandas: A library that is useful for analyzing data [25].
Contextlib2: Works with context managers for managing resources [23].
lxml: A Python module that works with XML and HTML files; it is also capable of
web scraping [24].
• Hereafter, we set up a virtual environment for tensorflow-gpu in Anaconda
[16]. Search for the Anaconda Prompt application in Windows' Start menu,
right-click it, and select "Run as Administrator" [16]. If Windows asks
whether to allow it to make modifications to your machine, select Yes [16].
A new virtual environment named "tensorflow1" is then created in the
command terminal [16].
• Configure the PYTHONPATH variable: a PYTHONPATH variable pointing
to \models and \models\research should be created [16]. This PYTHONPATH
is set in Anaconda after activating the tensorflow1 virtual environment.
• Compile Protobufs and run setup.py: this step includes commands to compile
the protobuf files; after compiling, these files can be used for training the
parameters [16]. Then the setup.py files are built and installed using
commands.
3. Generate training data: after collecting the data and annotating the images, all the
XML files are converted into a CSV file. This CSV file is used to generate TFRecords,
which serve as the input for the TensorFlow training model. Therefore, a tf_record file
is generated for both the training and the test images.
4. Configure training and the label map: the label map file is generated in this section.
The file will have the following code.
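As a reconstruction (the report's exact file is not reproduced here), a label map for the single 'knife' class typically looks like:

```
item {
  id: 1
  name: 'knife'
}
```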
5. Run training: last but not least, the object detection training pipeline must be set up
[16]. It specifies which model and parameters will be utilized in the training process
[16]. This is the final step before running the training.
6. Export the inference graph: after training has finished, the frozen inference graph
(.pb file) needs to be created [16]. This file packages the trained network's structure
and weights into a single graph that can later be loaded to run detection.
7. Test the object detection classifier: after completing all the steps to create the custom
object classifier, it was run to test the results. The picture below shows the classifier
being tested.
Figure 13: Knife Detection with 98% Accuracy
Figure 14:Knife Detection with 100% Accuracy
This object detection classifier was successfully able to detect the knife with good
accuracy.
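Step 3 of the setup above (converting XML annotations to CSV rows) can be sketched as follows; the Pascal VOC-style annotation content, filename, and helper name are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical Pascal VOC annotation, as produced by labeling tools.
xml_text = """
<annotation>
  <filename>knife_01.jpg</filename>
  <size><width>640</width><height>480</height></size>
  <object>
    <name>knife</name>
    <bndbox><xmin>120</xmin><ymin>80</ymin><xmax>260</xmax><ymax>300</ymax></bndbox>
  </object>
</annotation>
"""

def annotation_to_rows(text):
    """Flatten one annotation into CSV-style rows, one row per object."""
    root = ET.fromstring(text)
    filename = root.findtext("filename")
    rows = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        rows.append([filename, obj.findtext("name")] +
                    [int(box.findtext(t))
                     for t in ("xmin", "ymin", "xmax", "ymax")])
    return rows

print(annotation_to_rows(xml_text))
# → [['knife_01.jpg', 'knife', 120, 80, 260, 300]]
```

Rows like these are what the CSV file collects before TFRecords are generated from it.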
4.5 Primary Layer:
In this layer, both face recognition and object detection are carried out to detect the thief.
If a thief is detected, a warning email is sent to the user's address attached to the system.
There are three main cases, based on which the system decides whether or not to send the
warning email:
• If the person is unknown to the system and is not holding a knife, no warning
will be sent.
• If the person is unknown to the system and is holding a knife, a warning will
be sent to the user.
• If the person is known to the system and is holding a knife, no warning will
be sent, as he/she is a legitimate user.
Here, the flag intruder21 is used for knife presence and intruder11 for unknown-person
presence. If a knife is detected, intruder21 is set to true; if an unknown person is detected,
intruder11 is set to true. If both intruder21 and intruder11 are true, the system sends the
warning email to the user along with the image of the unknown person (saved when the
unknown person was detected).
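The three cases above reduce to a single boolean condition. A minimal sketch, reusing the report's intruder11/intruder21 flag names (the helper function itself is hypothetical):

```python
def should_alert(intruder11, intruder21):
    """intruder11: unknown person detected; intruder21: knife detected.
    Alert only when an unknown person is holding a knife."""
    return intruder11 and intruder21

assert should_alert(True, True) is True        # unknown person with knife
assert should_alert(True, False) is False      # guest without a knife
assert should_alert(False, True) is False      # legitimate user with knife
```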
Figure 15: Email sent to the user when an intruder is detected
The primary layer's working is limited to detecting the thief and sending the warning to
the user along with the image. To keep tracking the thief afterwards, however, a secondary
layer is used.
4.6 Secondary Layer:
The secondary layer is used for tracking the thief. When the thief is first detected in the
primary layer, some pictures (nearly 20) are stored in a folder, and those images are then
used to train the model. A variable named 'lastSent' keeps track of whether an unknown
person was detected before; in the primary layer, it is set to a non-None value.
There are two cases after the thief is detected.
Case 1: Once training is done, if the intruder is detected along with the knife and lastSent
is not None, the person's face is checked against the previously stored unknown person's
face. If a match occurs, a five-second video is sent to the user, and such five-second videos
continue to be sent as long as the same thief is detected by the camera. The main reason
for using five-second videos is that if the system instead kept recording while checking
every frame, the thief could remain in one spot for a long time. For instance, if the thief is
detected and tracked at camera location 1 and stays in the same location for 7 minutes, the
video would be recorded for 7 minutes and only then sent to the user, which would be too
late to track the activity of the thief.
Case 2: Once training is done, what if the intruder detected with the knife does not match
the image of the previous thief? In this scenario, pictures of the second thief should be
taken and another face recognizer trained for the second thief. In the code, lastSent is set
to None, so that if a second thief is detected, the primary layer is run again to detect the
thief and store the images. The primary layer runs only when lastSent is None.
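A simplified sketch of the two cases (the helper function and the string-equality stand-in for face matching are hypothetical):

```python
def secondary_layer(face, known_thief, last_sent):
    """Decide what to do once a knife-holding intruder is seen again.
    Returns (action, new_last_sent)."""
    if last_sent is not None and face == known_thief:
        # Case 1: same thief re-detected -> send a five-second video.
        return "send_video", last_sent
    # Case 2: a different thief -> reset lastSent so the primary layer
    # re-runs, stores new pictures, and retrains the recognizer.
    return "retrain", None

assert secondary_layer("thief_A", "thief_A", "sent") == ("send_video", "sent")
assert secondary_layer("thief_B", "thief_A", "sent") == ("retrain", None)
```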
Figure 16: code shows the working of secondary layer
For all the other scenarios, no alert is sent: for instance, if a legitimate user with a knife is
recognized, no alert is sent, and if an unknown person without a knife in his hands is
detected, there is no alert either. The latter scenario is taken into consideration to make
sure a guest (a person unknown to the system and not holding a knife) is not flagged as a
thief.
4.7 Impact Factor
• If the thief has turned his back to the camera, the system is unable to determine
whether he is a thief. So, if there is an unknown person with a knife within camera
range who has his back towards the camera, the system will not be able to detect
any face. Face recognition requires a detected face to compare against the user's
face; if no face is found, the system cannot confirm that this person is a thief.
• If the person is too far from the camera (more than about 8.5 feet), it is hard for
the system to make the distinction, because smaller items may lose too much
signal during downsampling in the pooling layers of the convolutional network,
making detection from too far away difficult.
Chapter 5. Results and experimentation
To test this project, two cameras were deployed at different locations. The project is
extendable: more cameras can be added to the current module. To track the felon, a
five-second video is sent to the user. The main reason for employing a short video is that
if the thief were to stay in the same camera zone for 5 minutes straight, this would lead to
a 5-minute recording that is only sent after 5 minutes, which could be too late to deal with
or forestall any serious incident.
The picture below shows face detection along with the knife. In this scenario, an alert
will not be sent, as the recognized person is the user.
Figure 17: User with the knife
At the second camera spot, an unknown person is detected with a knife. Therefore, an alert
is sent to the user through email, along with the image as an attachment. The picture below
shows an unknown person detected with a knife.
Figure 18: Unknown user with the knife
The email alert sent to the user looks like the image in Fig. 19. The system uses
[email protected] to send alerts to the user. The most recent email
contains all the videos and images that were previously sent to the user for the same
detected thief.
Figure 19: Email screenshot
5.1 Results obtained for the first person:
Figure 20: Screenshot sent along with the email
An unknown person was found in camera zone 1 with a knife, and the system identified
the person as an intruder. Figure 20 shows the image of the intruder that was attached to
the alert email. After detection of the intruder, the secondary layer starts tracking the thief:
pictures of the unknown person holding the knife are taken and used for training, so that
the thief can be tracked to prevent any mishap. The intruder was then detected again in
camera zone 1; Figure 21 shows the video that tracks the thief when the thief is found in
camera zone 1.
Figure 21: First video tracked and sent in email
Figures 22, 23, and 24 contain the videos sent to the user as a warning while tracking the
same thief in camera zone 2.
Figure 22: Second video tracked in camera zone 2
Figure 23: Third video tracked in camera zone 2
Figure 24: Fourth video tracked in camera zone 2
5.2 Results obtained for the second person:
Figure 25 shows camera zone 1, where the criminal is recognized along with the knife.
After the thief is identified, an email alert is sent to the user along with the picture in
Figure 26.
Figure 25: Camera Screen (Zone 1), detection of thief
Figure 26: Image sent along with the email
The same criminal is found again in camera zone 1. Figures 27 and 28 show the videos
sent to the legitimate user's email address while tracking down the thief found in camera
zone 1.
Figure 27: Same thief detected in camera zone 1
Figure 28: Same thief detected in camera Zone 1
The intruder then changes location and moves to camera zone 2. The thief is detected
again, and the system starts tracking and sending alerts along with the videos for the same
thief. Figures 29, 30, 31, and 32 contain the videos for the same burglar found in camera
zone 2.
Figure 29: Thief moves to second location
Figure 30: Tracking video of camera zone 2
Figure 31: Tracking Video of camera zone 2
Figure 32: Tracking video of camera zone 2
5.3 Results obtained from the third person.
A different intruder is observed at location 1. Therefore, an alert is sent to the user along
with the image shown in Figure 33.
Figure 33: Thief detected and image sent
Furthermore, the same person with the knife is detected again at location 1. As a result,
the secondary layer starts tracking the intruder. Whenever the same person is detected at
any of the camera locations, a five-second tracking video is recorded and sent to the user.
These videos are stored in the same location as the project directory: a separate folder
named 'thief_vidoes' is created and used for storage. In addition, the thief in Figure 33 is
standing in the same spot at location 1, and videos are sent to the user as long as the thief
is in the camera zone, as shown in Figure 34.
Figure 34: Thief tracked in camera zone 1
The same intruder is spotted again at location 1, as shown in Figures 35 and 36. Figure 36
clearly shows the intruder moving out of the camera's view; therefore, tracking is finished
and no more videos are sent to the user.
Figure 35: Thief tracked again in camera zone 1
Chapter 6: Future work
1. A web application using HTML and CSS frameworks could be produced, providing
a user interface for adding family members (collecting images of their faces) and
the email address for alerts, as well as the ability to watch the live camera feed.
2. Instead of sending videos, the Flask framework could efficiently generate a website
link to share with the user. The link would act as a tracking link, which does not
require downloading a video; once the thief is out of the camera's view, that live
link can be converted into a video.
3. In addition, an external device such as a siren could be appended. The siren would
sound an audio alert, like a ring, in the house, which would further help in reducing
crime: the sound of the siren would trigger fear in intruders, so instead of carrying
out the crime, they would try to escape the house in the agitation of getting seized.
Chapter 7: Conclusion
A system is built that helps ensure the security of the home. With the help of video
surveillance, a thief can be tracked by this system to prevent future mishaps. The system
distinguishes between a legitimate user and an unknown person. Weapon detection in the
video surveillance helps provide alerts and further prevents crime; Faster R-CNN is used
for the object detection. The system accurately sends alerts to the user along with the
tracking videos. The system is a little slow, as the laptop used for this project does not
fulfil the GPU requirements; however, if it were run on a laptop with CUDA installed and
the GPU requirements met, it would be much faster at reading frames and detecting the
weapon. Furthermore, the system provides an accuracy of greater than 97% for object
detection (knife). Overall, when tested for detecting a thief, the system provided an
accuracy of 94.7%, which indicates that it can be usefully employed for smart home
security.
References:
1. Russell, B.C., Torralba, A., Murphy, K.P. et al. LabelMe: A Database and Web-
Based Tool for Image Annotation. Int J Comput Vis 77, 157–173 (2008).
https://doi.org/10.1007/s11263-007-0090-8
2. Russell, B.C., Torralba, A., Murphy, K.P. et al. LabelMe: A Database and Web-
Based Tool for Image Annotation. Int J Comput Vis 77, 157–173 (2008).
https://doi.org/10.1007/s11263-007-0090-8
3. Solawetz, J. (2020, December 9). Getting started with LabelMe - Computer Vision
Annotation. Roboflow Blog. Retrieved November 13, 2021, from
https://blog.roboflow.com/labelme/.
4. Yue Han, Guoliang Li, Haitao Yuan, Ji Sun, "An Autonomous Materialized View
Management System with Deep Reinforcement Learning", Data Engineering
(ICDE) 2021 IEEE 37th International Conference on, pp. 2159-2164, 2021.
5. Digital Curation Centre Trilateral Research, The role of data in AI - GPAI. [Online].
Available: https://www.gpai.ai/projects/data-governance/role-of-data-in-ai.pdf.
[Accessed: 26-Oct-2021].
6. A. Gonfalonieri, “How to Build A Data Set For Your Machine Learning Project,”
Medium, 14-Feb-2019. [Online]. Available: https://towardsdatascience.com/how-
to-build-a-data-set-for-your-machine-learning-project-5b3b871881ac. [Accessed:
26-Oct-2021].
7. A. Ahmed, J. Guo, F. Ali, F. Deeba and A. Ahmed, "LBPH based improved face
recognition at low resolution," 2018 International Conference on Artificial
Intelligence and Big Data (ICAIBD), 2018, pp. 144-147, doi:
10.1109/ICAIBD.2018.8396183.
8. N. Stekas and D. Van Den Heuvel, "Face Recognition Using Local Binary Patterns
Histograms (LBPH) on an FPGA-Based System on Chip (SoC)," 2016 IEEE
International Parallel and Distributed Processing Symposium Workshops
(IPDPSW), 2016, pp. 300-304, doi: 10.1109/IPDPSW.2016.67.
9. A. M. Jagtap, V. Kangale, K. Unune and P. Gosavi, "A Study of LBPH, Eigenface,
Fisherface and Haar-like features for Face recognition using OpenCV," 2019
International Conference on Intelligent Sustainable Systems (ICISS), 2019, pp.
219-224, doi: 10.1109/ISS1.2019.8907965.
10. Face Recognition: Understanding LBPH Algorithm, 2017, [online] Available:
https://towardsdatascience.com/face-recognition-how-lbph-works-90ec258c3d6b.
11. J. Zou and R. Song, "Microarray camera image segmentation with Faster-RCNN,"
2018 IEEE International Conference on Applied System Invention (ICASI), 2018,
pp. 86-89, doi: 10.1109/ICASI.2018.8394403.
12. JRR Uijlings, KEAVD Sande, T Gevers et al., "Selective Search for Object
Recognition[J]", International Journal of Computer Vision, vol. 104, no. 2, pp. 154-
171, 2013.
13. J. Long, E. Shelhamer and T. Darrell, "Fully convolutional networks for semantic
segmentation", 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 3431-3440, 2015.
14. R. Gandhi, “R-CNN, fast R-CNN, Faster R-CNN, YOLO - object detection
algorithms,” Medium, 09-Jul-2018. [Online]. Available:
https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-
detection-algorithms-36d53571365e. [Accessed: 31-Oct-2021].
15. P. Nagrath, R. Jain, A. Madan, R. Arora, P. Kataria, and J. Hemanth, “SSDMNV2:
A real time DNN-based face mask detection system using single shot multibox
detector and mobilenetv2,” Sustainable cities and society, 31-Dec-2020. [Online].
Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7775036/. [Accessed:
31-Oct-2021].
16. Evan, Lodeiro, M., & Zylinski, A. (2020, December 15). Tensorflow-object-
detection-API-tutorial-train-multiple-objects-windows-10/readme.md at master ·
EdjeElectronics/tensorflow-object-detection-API-tutorial-train-multiple-objects-
windows-10. GitHub. Retrieved November 13, 2021, from
https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-
Train-Multiple-Objects-Windows-10/blob/master/README.md.
17. M. Nakib, R. T. Khan, M. S. Hasan and J. Uddin, "Crime Scene Prediction by
Detecting Threatening Objects Using Convolutional Neural Network," 2018
International Conference on Computer, Communication, Chemical, Material and
Electronic Engineering (IC4ME2), 2018, pp. 1-4, doi:
10.1109/IC4ME2.2018.8465583.
18. N. Dwivedi, D. K. Singh and D. S. Kushwaha, "Weapon Classification using Deep
Convolutional Neural Network," 2019 IEEE Conference on Information and
Communication Technology, 2019, pp. 1-5, doi:
10.1109/CICT48419.2019.9066227.
19. A. Warsi, M. Abdullah, M. N. Husen and M. Yahya, "Automatic Handgun and
Knife Detection Algorithms: A Review," 2020 14th International Conference on
Ubiquitous Information Management and Communication (IMCOM), 2020, pp. 1-
9, doi: 10.1109/IMCOM48794.2020.9001725.
20. K. Lashmi and A. S. Pillai, "Ambient Intelligence and IoT Based Decision Support
System for Intruder Detection," 2019 IEEE International Conference on Electrical,
Computer and Communication Technologies (ICECCT), 2019, pp. 1-4, doi:
10.1109/ICECCT.2019.8869327.
21. T. Sultana and K. A. Wahid, "IoT-Guard: Event-Driven Fog-Based Video
Surveillance System for Real-Time Security Management," in IEEE Access, vol.
7, pp. 134881-134894, 2019, doi: 10.1109/ACCESS.2019.2941978.
22. K. Yu et al., "Design and Performance Evaluation of an AI-Based W-Band
Suspicious Object Detection System for Moving Persons in the IoT Paradigm," in
IEEE Access, vol. 8, pp. 81378-81393, 2020, doi:
10.1109/ACCESS.2020.2991225.
23. Contextlib - Utilities for with-statement contexts. Python 2.7.10 Documentation.
(n.d.). Retrieved November 13, 2021, from
https://documentation.help/Python-2.7.10/contextlib.html.
24. Contributor, G. (2019, April 10). Introduction to the python LXML Library. Stack
Abuse. Retrieved November 13, 2021, from https://stackabuse.com/introduction-
to-the-python-lxml-library/.
25. Pandas: Python library - mode. Mode Resources. (2016, May 23). Retrieved
November 13, 2021, from https://mode.com/python-tutorial/libraries/pandas/.
26. Wikipedia. (2021, October 25). Matplotlib. Wikipedia. Retrieved November 13,
2021, from https://en.wikipedia.org/wiki/Matplotlib.
27. Wikipedia. (2021, November 12). Tensorflow. Wikipedia. Retrieved November 13,
2021, from https://en.wikipedia.org/wiki/TensorFlow.
28. GeeksForGeeks. (2021, March 28). OpenCV python tutorial. GeeksforGeeks.
Retrieved November 13, 2021, from https://www.geeksforgeeks.org/opencv-
python-tutorial/.
29. Wikimedia Foundation. (2021, October 27). NumPy. Wikipedia. Retrieved
November 13, 2021, from https://en.wikipedia.org/wiki/NumPy.
30. Gao, H. (2017, October 5). Faster R-CNN explained. Medium. Retrieved
November 13, 2021, from https://medium.com/@smallfishbigsea/faster-r-cnn-
explained-864d4fb7e3f8.
31. Ananth, S. (2020, October 1). Faster R-CNN for object detection. Medium.
Retrieved November 13, 2021, from https://towardsdatascience.com/faster-r-cnn-
for-object-detection-a-technical-summary-474c5b857b46.
32. IEEE Xplore. (n.d.). Retrieved November 30, 2021, from
https://ieeexplore.ieee.org/.