
A Method of Query over Encrypted Data in HBase Database by Using Bloom Filter Algorithm

Farrokh Shahriari 1, Ahmad Baraani-Dastjerdi 2

1 Computer Engineering Department, Engineering Faculty, University of Isfahan, Isfahan, Iran
[email protected], [email protected]

2 Computer Engineering Department, Engineering Faculty, University of Isfahan, Isfahan, Iran
[email protected]

Abstract

Encryption is one of the key issues in information security. Encryption and data protection have become an active field of research and have attracted growing attention in recent years, because many organizations and companies store their data in databases. This paper presents a method of querying over encrypted data in the HBase database using the bloom filter algorithm. HBase is one of the various NOSQL (Not Only SQL) databases, and it differs from relational databases. In our scheme, the bloom filter compression algorithm is used to create a signature for each stored data item. In this way, queries can be run over the encrypted data without any additional encryption/decryption operations. The experimental results show that our encryption model retrieves data from HBase with robust performance.

Keywords: Security, Encryption, HBase (Distributed Database), Bloom Filter Algorithm

1- Introduction

With the rapid growth of the internet and its widespread use, the number of users and organizations and the amount of information they produce keep increasing. Large companies such as Google, Facebook, Amazon, and Yahoo therefore face new challenges in storing and processing terabytes, or sometimes petabytes, of data within a reasonable time. Relational databases such as Oracle, SQL Server, and MySQL are no longer adequate for such companies because of their growing processing needs [1]. Consequently, another kind of database has been proposed for storing large amounts of information, whose structure differs considerably from that of relational databases. This kind of database is known as a NOSQL database; Google Bigtable and Apache HBase are examples.

Relational databases, in which data is stored as tables, lack the power to store and process large amounts of unstructured, non-relational data in an acceptable time. In addition, scaling these databases is difficult, because the hardware required to build such systems is scarce and expensive. Distributed databases are therefore preferred to relational ones, since they do not share these weaknesses of traditional databases [2].

Google Bigtable is an example of a distributed database built on the Google File System (GFS); Google has not published its implementation details, and only its overall structure has been described [2]. The HBase database is the open-source counterpart of Google Bigtable, introduced by the Apache Software Foundation [3]. HBase can be deployed on top of the Hadoop Distributed File System, which increases its reliability, and it can also use MapReduce, the framework for parallel processing of data on the Hadoop distributed file system [4].

On the other hand, security and data protection have been concerns ever since databases came into existence. Access control prevents unauthorized users from reaching the data through authentication and grants each user access proportional to their needs. The concepts of encrypting data, storing the encrypted data in databases, and querying over encrypted data were introduced in the last decade. The ability to query encrypted data without decrypting it leads to better efficiency: there is no need to encrypt or decrypt every row for each query in order to compare its value with the query condition, so the whole procedure becomes faster.

The method introduced in this paper makes it possible to encrypt data in the HBase database and to query it while it remains encrypted. The method extends the bloom filter algorithm of [5]; it has been changed substantially so that it can be used in HBase, whose structure differs from that of relational databases. In our method, the data is encrypted with AES-128, and the signature of the encrypted data is stored in a new field. When a query is executed, the signature of the query condition is calculated and compared with all stored signatures: a bitwise AND is applied to each row signature and the query signature, and the rows whose result equals the query signature are returned as output. Applying this method to the HBase database shows that it retrieves information with adequate efficiency.

The rest of the paper is organized as follows. Section 2 briefly reviews querying over encrypted data in relational databases. Section 3 describes the bloom filter encryption method for relational databases [5]. Section 4 briefly explains the architecture of the HBase database. Section 5 introduces the proposed method; it covers the changes made to the bloom filter algorithm used in relational databases, the use of the HBase structure for encryption, and the performance optimization of the bloom filter algorithm in HBase. Section 6 presents the experimental study, and Section 7 gives the concluding remarks.

2- Related Works

This section reviews related work on querying over encrypted data. In 2002, Hacigumus et al. [6] proposed a method that encrypts each row and stores the related signature in a new field named etuple. The method partitions the attribute values, assigns an identifier to each partition, and uses a block cipher for encryption. It is very time-consuming for large amounts of data, because it creates a signature for each value, and it supports only numeric data, not character strings.

In 2004, Zheng-Fei Wang et al. [7] proposed a two-phase query method for encrypting numeric and character data. The first phase filters out false records by using signatures; the second phase decrypts the results retrieved in the first phase and queries them to obtain the final result. The method supports queries over encrypted data through an encryption/decryption layer between the database and the application and by creating a new attribute for the signature; according to its designers, however, it does not respond well to queries over long character strings [7].

In 2007, Yong Zhang et al. [8] proposed a bucket index on character data that converts character data into numeric values to speed up processing. In 2008, they improved their method [9]: two new attributes were added to the table, one for the index and one for an id that is unique for each row. Since different rows could receive the same index, a new per-row index was created in the last step of encryption by applying a reversible function to the index and the id. The drawback of this method is its excessive computational complexity, which becomes noticeable for large volumes of data such as terabytes and petabytes.

In 2009, Lianzhang Liu and Jingfan Gai [5] proposed a method that creates a bucket index for numeric data, uses the bloom filter compression algorithm [10] for character strings, and encrypts a particular column. The method has no false negatives but can produce some false positives. Briefly, the procedure is as follows: first, the words are isolated; second, the characters of each word are divided into groups of two letters, three letters, and so on; third, the bloom filter algorithm is used to convert the character value into a numeric value. As a result, the method does not perform well when encrypting long strings.

In 2010, Nian Liu et al. [11] proposed a method for DAS (Database as a Service) databases whose index consists of two parts: the first part keeps the set of characters occurring in the original string, and the second part stores the positions of those characters. In 2011, M. Hussain and Atiq ur Rehman [12] proposed another way to encrypt data in DAS databases: the metadata containing the encryption/decryption keys and the obfuscation/de-obfuscation information is created on the client side, meaning that obfuscation and encryption are performed on the client machines and the encrypted data is then sent to the server for storage. Note that these last two works target cloud databases, which differ from distributed databases.

In 2012, Al Derawi and Mohammed Alhanjouri [13] proposed a method to encrypt data and query over the encrypted data. Its basis is similar to the previous ones in that it performs encryption/decryption through a middle layer and creates a new attribute for the signature; what distinguishes it is the use of a hash function to generate the signatures. The method is suitable only for exact where-clause queries and not for fuzzy queries such as like "%string%".

3- Bloom Filter Algorithm for Relational Databases

The procedure in [5] works as follows. First, a character string is encrypted with the AES-128 algorithm and stored in the table; then signatures are created and kept in separate columns, so that queries can later be run over the encrypted data. Creating a signature consists of the following steps. At the beginning, the string is split into words at special characters such as the blank, comma, and full stop. Then, for each word, the MD5 hash function is computed four times over every pair of adjacent characters; to obtain four different outputs for the same input, it is enough to append four different values to the string before hashing. These appended values must be the same in every phase, because the query string is processed in exactly the same way. The hash outputs are then mapped into the numeric range (0, m), where m = 32, and these numbers are called num. Finally, the signature is computed as the sum of 2^num over all num values obtained for the string, i.e. signature = Σ 2^num, and is stored in a separate column. For example, if a word yields the positions 3, 7, 12, and 20, its contribution to the signature is 2^3 + 2^7 + 2^12 + 2^20. The details are available in [14].

To answer a query over the encrypted data, the query condition is first converted into a signature, called signature_q, with the bloom filter algorithm, and this signature is compared with all the signatures stored in the table. The rows that satisfy the following condition are returned as the final result:

bitand(signature_i, signature_q) = signature_q
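Since the signatures are plain integers, this check is a single bitwise AND. A minimal Java sketch, assuming both signatures are held as BigInteger bit maps as in the pseudo-code of Figure 2, is:

import java.math.BigInteger;

public class SignatureMatch {
    // A row is a candidate result when every bit set in the query signature is
    // also set in the row signature, i.e. bitand(signature_i, signature_q) = signature_q.
    public static boolean matches(BigInteger rowSignature, BigInteger querySignature) {
        return rowSignature.and(querySignature).equals(querySignature);
    }
}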

4- HBase: The Hadoop Database

Because relational databases cannot handle unstructured data or run queries on very large volumes of data, such as terabytes or petabytes, a new kind of database has been introduced under the name NOSQL; Apache HBase and Google Bigtable are the best-known examples.

HBase is the open-source version of Google Bigtable and is based on the Hadoop Distributed File System (HDFS). Each row can have millions of different columns per column family, and HBase is suitable for sparse tables because its structure resembles a linked list, so null values are not stored [15].

Since HBase is a distributed database, its architecture consists of a single master and several region servers. Data is distributed over regions, and the region servers are responsible for hosting them. The master's main tasks are to monitor the regions, coordinate the region servers, and recover data if a region server goes down [4]. Client requests, such as adding or deleting a value, are sent to the master server, which forwards them to the appropriate region servers. Another feature of HBase is that it can run on top of HDFS; this increases its reliability and also enables parallel processing with the MapReduce framework, which is based on the Hadoop file system. Figure 1 shows the physical layout of regions and region servers.

Figure 1. Physical schema of regions and region servers [16]

5- Proposed Method

This section describes the proposed method, consisting of the encryption algorithm used and its optimization. The novelty of this paper is querying over encrypted data in NOSQL databases, which has not been proposed before. Section 5-1 describes the changes made to the bloom filter algorithm; Section 5-2 explains how the algorithm is applied given HBase's structure; Section 5-3 describes the work done to optimize the efficiency of the bloom filter algorithm in HBase.

5-1 The Changes Made to the Bloom Filter in Order to Use It in HBase

The bloom filter method is implemented in HBase under the following major assumptions:

- All rows have identical columns.
- All rows are placed in one column family for encryption.
- The same AES key is used for all encrypted rows.

As explained before, the original bloom filter method uses the MD5 hash function. Despite its good speed, MD5 has a major disadvantage that makes it unsuitable for the encryption method in HBase: its output is only 128 bits, collisions can be found in a short time, and according to [17, 18] it is not recommended for security purposes. Therefore, the SHA-256 hash function is used instead of MD5.

HBase is written in Java, so Java is also used to implement the bloom filter in HBase. Because of the limitations of types such as int and long, the BigInteger type is used for the signature (step 4). With this type there is no limitation on the choice of _lengthOfIndex, and any value can be assigned to it (step 6); according to our tests, a value of 1024 is appropriate for _lengthOfIndex. It should be noted that the long type is faster than the BigInteger type. With the BigInteger type and _lengthOfIndex = 1024, the algorithm reaches the required accuracy. The following section explains how this algorithm is used in HBase. Figure 2 shows the pseudo-code of the bloom filter algorithm used.

Figure 2. The Proposed Bloom Filter Algorithm

5-2 Storing the Encrypted Data and Signatures in HBase

Among the available encryption methods, AES is widely regarded as one of the most secure, which is why AES is used to encrypt the data in HBase [19, 20]. The following paragraphs explain how the bloom filter algorithm is applied.

For each row in HBase, the column family and the columns in which the data is stored must be determined. The data is encrypted with AES-128, the related signature is generated as described in Section 5-1, and then the encrypted data and its signature are stored in one column family in two different columns. For example, the encrypted data and the related signature can be stored in a column family called cf1 with two columns named field1, which keeps the encrypted data, and field1_signature, which keeps the signature (Figure 3).

Because of the way HBase stores and retrieves data, explained in more detail in Section 5-3, all data in a column family is stored in a single file called an HFile. Therefore, when a query is executed, all the data, both encrypted values and signatures, is sent to the client machine, which reduces transfer speed and efficiency and consumes a large amount of bandwidth. For example, if there are 50 data columns in a column family, another 50 signature columns are created in the same column family, for a total of 100 columns. As a result, 100 columns are transferred to the client machine even though only the 50 signature columns need to be compared with the query. With this kind of storage, the query over encrypted data is slow, because all the data in the column family must first be transferred to the client before the query condition can be compared with the retrieved signatures.

To fix this problem, the encrypted data is stored in a column family different from the signature column family. For example, the encrypted data is kept in a column family named cf1 and the related signatures are kept in another column family named cf2; the signatures are linked to the encrypted data through the row keys. With this separation, only the rows of the signature column family (just the 50 signature columns) are transferred to the client for each query, which is faster than the previous layout (Figure 4).

Figure 5 simply shows the procedure of the

proposed method.
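As an illustration of the encryption step, a minimal sketch of encrypting a single cell value with AES-128 through the standard Java cryptography API is given below. The cipher mode and padding are assumptions made for this sketch; the paper only states that AES-128 with one shared key is used.

import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

public class AesCell {
    // Illustrative only: encrypts one cell value with AES-128 before it is written to HBase.
    // Key management is outside the scope of the paper, which assumes a single AES key
    // shared by all encrypted rows.
    public static byte[] encrypt(String plaintext, byte[] key128) throws Exception {
        SecretKeySpec key = new SecretKeySpec(key128, "AES");        // 16-byte key = AES-128
        Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");  // mode/padding are assumptions
        cipher.init(Cipher.ENCRYPT_MODE, key);
        return cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
    }
}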

Figure 3. Encrypted data and signatures stored in one column family named ColFam1

(1) Algorithm: Create Signature Value by Bloom Filter
(2) Input: original string, length of split unit, number of hash functions, length of index
(3) Output: numeric value of the signature
(4) BigInteger _targetValue = BigInteger.ZERO;
(5) int _splitUnit = 2;
(6) int _lengthOfIndex = 1024;
(7) int _numOfHashFunc = 4;
(8) String[] _allWords = split(_originalString); // split the string into words
(9) SortedSet<BigInteger> s = new TreeSet<BigInteger>(); // sorted set of hash positions
(10) for (int i = 0; i < _allWords.length; i++) {
(11)   if (_allWords[i].length() <= _splitUnit) {
(12)     BigInteger[] tt = multiHashFuncSHA256(_allWords[i]);
(13)     for (int r = 0; r < tt.length; r++) // put the position values into s
(14)       s.add(tt[r]);
(15)   }
(16)   else {
         // split _allWords[i] into units of _splitUnit characters and build the bloom filter on them
(17)     for (int j = 0; j < _allWords[i].length() - _splitUnit + 1; j++) {
(18)       String temp = _allWords[i].substring(j, j + _splitUnit);
(19)       BigInteger[] tt = multiHashFuncSHA256(temp);
(20)       for (int r = 0; r < tt.length; r++) // put the position values into s
(21)         s.add(tt[r]);
(22)     }
(23)   }
(24) }
     // convert the set of bloom filter positions s into a numeric value
(25) Iterator<BigInteger> it = s.iterator();
(26) while (it.hasNext()) {
(27)   BigInteger tt = it.next();
(28)   BigInteger pow = new BigInteger("2");
(29)   _targetValue = _targetValue.add(pow.pow(tt.intValue()));
(30) }
(31) return _targetValue;
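The multi-hash routine multiHashFuncSHA256 used in steps 12 and 19 is not listed in the paper. A minimal sketch of one possible implementation, assuming the four hash values are obtained by appending four fixed salt values to the token (as described in Section 3) and reducing each SHA-256 digest modulo _lengthOfIndex, is given below; the salt format is an assumption of this sketch.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class MultiHash {
    // Hypothetical helper: derives numOfHashFunc bit positions in [0, lengthOfIndex)
    // by salting the token with fixed suffixes and reducing each SHA-256 digest.
    public static BigInteger[] multiHashFuncSHA256(String token) throws NoSuchAlgorithmException {
        int numOfHashFunc = 4;
        BigInteger lengthOfIndex = BigInteger.valueOf(1024);
        BigInteger[] positions = new BigInteger[numOfHashFunc];
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        for (int k = 0; k < numOfHashFunc; k++) {
            // The same salts must be used when the query signature is built,
            // otherwise stored and query signatures would not be comparable.
            byte[] digest = sha256.digest((token + "#" + k).getBytes(StandardCharsets.UTF_8));
            positions[k] = new BigInteger(1, digest).mod(lengthOfIndex);
        }
        return positions;
    }
}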

Figure 4. Encrypted data and signatures stored in two different column families named ColFam1 and ColFam2

Figure 5. The schema of storing encrypted data and signatures
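To make the storage layout of Figures 4 and 5 concrete, the following sketch writes one encrypted value and its signature into the two column families using the HBase Java client API of that era (roughly HBase 0.9x). The table name and qualifiers are illustrative: cf1 holds the ciphertext and cf2 the signature, linked by the same row key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import java.math.BigInteger;

public class StoreEncryptedCell {
    // Illustrative sketch: writes the AES ciphertext into cf1:field1 and the
    // bloom filter signature into cf2:field1_signature under the same row key.
    public static void store(String tableName, String rowKey,
                             byte[] ciphertext, BigInteger signature) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, tableName);
        try {
            Put put = new Put(Bytes.toBytes(rowKey));
            put.add(Bytes.toBytes("cf1"), Bytes.toBytes("field1"), ciphertext);
            put.add(Bytes.toBytes("cf2"), Bytes.toBytes("field1_signature"), signature.toByteArray());
            table.put(put);
        } finally {
            table.close();
        }
    }
}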

5-3 Optimization to Improve Performance

As described in the previous section, part of the implementation of the bloom filter algorithm depends on the scan operation in HBase, so the scan operation is explained in some detail before the optimizations are described. A scan is carried out over RPC: the query condition of the scan is sent to the regions, the search is performed on each region, the results are returned to the region servers, and finally they are sent to the master server, which here acts as the client machine. Another advantage of the scan operation is that specific filters can be set on it; for example, a scan can be restricted to a specific column family or to specific columns. Filters can be set on parameters such as the column family, specific columns within a column family, row keys, and the timestamps of the data. Setting the startRowKey and stopRowKey is another distinguishing feature of the scan operation; this feature is used to implement the parallel scan in the bloom filter algorithm (Figure 6).

Figure 6. Filters are created on the client side, sent through RPC, and executed on the server side [16]
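For illustration, the scan restrictions mentioned above can be expressed with the HBase client Scan object; the column family name cf2 (the signature family of Section 5-2) and the row-key bounds below are placeholders.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanConfig {
    // Builds a scan bounded by a start and stop row key and restricted to the
    // signature column family, so only signatures travel back to the client.
    public static Scan signatureScan(byte[] startRowKey, byte[] stopRowKey) {
        Scan scan = new Scan(startRowKey, stopRowKey);  // startRowKey inclusive, stopRowKey exclusive
        scan.addFamily(Bytes.toBytes("cf2"));           // only the signature column family
        return scan;
    }
}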

In the bloom filter algorithm in HBase, all the signatures in a table must be compared with the query signature, so no selective filter can be used for the scan, except that, once the encrypted data and the signatures have been separated into different column families, the scan can be restricted to the signature column family.

Because the scan proceeds over row keys and regions, and the bloom filter algorithm compares the query signature with every stored signature, it can be time-consuming. To optimize the bloom filter algorithm, the scan is implemented in parallel, as explained in the following.

Each region server in HBase hosts a number of regions, so the scan is carried out through the different region servers simultaneously; one thread is created per region server to obtain better parallelism. Each thread is responsible for a scan bounded by the startRowKey and stopRowKey of its regions (Figure 7). Note that this method is implemented under the assumption that the number of regions equals the number of region servers and that each region is large enough that it does not need to be split further.
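A sketch of this per-region parallel scan is shown below, assuming the region boundaries are read from the table with getStartEndKeys() and that each thread opens its own HTable instance (HTable is not thread-safe); the column family name cf2 is again illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelSignatureScan {
    // Illustrative sketch: one scan per region, each bounded by that region's
    // start and end row keys and limited to the signature column family cf2.
    public static void scanAllRegions(final String tableName) throws Exception {
        HTable meta = new HTable(HBaseConfiguration.create(), tableName);
        Pair<byte[][], byte[][]> keys = meta.getStartEndKeys();   // region boundaries
        meta.close();
        ExecutorService pool = Executors.newFixedThreadPool(keys.getFirst().length);
        for (int i = 0; i < keys.getFirst().length; i++) {
            final byte[] start = keys.getFirst()[i];
            final byte[] stop = keys.getSecond()[i];
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        HTable table = new HTable(HBaseConfiguration.create(), tableName); // one HTable per thread
                        Scan scan = new Scan(start, stop);
                        scan.addFamily(Bytes.toBytes("cf2"));
                        ResultScanner scanner = table.getScanner(scan);
                        for (Result row : scanner) {
                            // compare the stored signature in this row with the query signature
                        }
                        scanner.close();
                        table.close();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}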

For good performance, it is better for each region to hold roughly the same number of rows, so the table is divided into regions by row key in advance, which is known as pre-splitting regions. Pre-splitting optimizes the parallel scan by distributing the data uniformly over the regions.

The method above, however, creates a bottleneck on the master side and can crash the database: the master server must receive a large amount of data, including the signatures, from the different region servers at the same time and compare it with the query signature. Because the master server cannot process all of this data simultaneously, HBase can crash.

Figure 7. Parallel scanning in HBase

To solve this problem, the coprocessor framework of HBase can be used. This feature allows arbitrary code to be executed on a specific region or region server [16]. When it is enabled, the volume of data transmitted to the master server drops sharply, because the signature comparison is performed on each region server. The bloom filter algorithm can therefore be implemented in HBase without bottlenecks or failures. Figure 8 shows how the coprocessor framework works.

Figure 8. Coprocessors executed sequentially, in their environment, and per region [16]

6- Performance Analysis

The experiments were run on three virtual machines:

- The first machine had a single-core 2.2 GHz processor, 2 GB of RAM, and a 15 GB hard disk. The Hadoop NameNode, a Hadoop DataNode, and the HBase master ran on this machine.
- The second machine had a single-core 2.2 GHz processor, 1.5 GB of RAM, and a 10 GB hard disk. A Hadoop DataNode and an HBase region server ran on this machine.
- The third machine had a single-core 2.2 GHz processor, 1.5 GB of RAM, and a 10 GB hard disk. A Hadoop DataNode and an HBase region server ran on this machine.

The proposed method was evaluated against three other configurations: the implementation with just one column family, the parallel implementation of that configuration, and a simple encryption/decryption implementation. The results show that the proposed method performs well.

Section 6-1 tests whether the proposed method works correctly and examines its performance and precision with random values. Section 6-2 tests the proposed method with real values, namely the logs of a SOC (Security Operation Center), which are described later.

6-1 Analysis of the Proposed Method with Arbitrary Values

In this part, the proposed method is tested with arbitrary random values. Table 1 compares the query times of the different configurations, and the corresponding diagrams are presented in Figures 9, 10, and 11 for tables with 5, 10, and 50 columns, respectively. Each configuration was tested at least three times, and the average is reported as the query time.

In another experiment, the efficiency of the proposed method for numeric values was tested by inserting different IP values into the table. Each IP value has four parts: the first and second parts are chosen randomly from (192, 60, 120, 230) and (168, 69, 212, 123), respectively, and the third and fourth parts are random numbers in the range 0-255. The results are shown in Figures 12 and 13.

To test the accuracy of the proposed method, in the first experiment 90 random values and 10 specific values were added to a table with 5 columns:

Field1 = "this is a test for field[1]"
Field2 = "this is a test for field[2]"
. . .
Field5 = "this is a test for field[5]"


Then, to test the precision, different queries were run on that table:

Query 1: field1="test", field2="test", field3="test", field4="test", field5="test"
Query 2: field1="field", field2="field", field3="field", field4="field", field5="field"
Query 3: field1="this", field2="is", field3="for", field4="test", field5="field"
Query 4: field1="hello to this test", field2="field", field3="field", field4="field", field5="field"

For the first three queries, all 10 specific rows are returned as output, and there is no result for the last query.

In the second experiment, 90 random values of the form

"this test is: random characters"

were added to a table with 5 columns, where the random part is a string of 10 characters drawn from "a"-"z" and "0"-"9". Ten specific strings of the form "this is a test for field[i]" were also added to the table. Then the following queries were run on the table:

Query 1: field1="test", field2="test", field3="test", field4="test", field5="test"
Query 2: field1="field", field2="field", field3="field", field4="field", field5="field"
Query 3: field1="this test is", field2="this test is", field3="this test is", field4="this test is", field5="this test is"
Query 4: field1="this is a test for", field2="this is a test for", field3="this is a test for", field4="this is a test for", field5="this is a test for"

These queries return 97, 10, 88, and 10 rows, respectively.

Figure 9. Comparison of query time - for 5 columns

Figure 10. Comparison of query time - for 10 columns

Figure 11. Comparison of query time - for 50 columns

Figure 12. Comparison of query time for numeric values - for 10 columns

[Figures 9-11 plot the query times of Table 1, in seconds, as bar charts for 5, 10, and 50 columns; Figure 12 plots the query times for numeric (IP) values with 10 columns. Each chart compares the four configurations: BloomFilter encryption with one column family, parallel BloomFilter encryption with one column family, simple encryption/decryption, and the proposed method.]

Figure 13. Comparison of query time for numeric values - for 50 columns

Table 1. Comparison of query times for random values

Rows      Columns   Simple Enc/Dec   BloomFilter, 1 column family   Parallel BloomFilter, 1 column family   Proposed method
20,000    5         38 sec           24 sec                         15 sec                                  13 sec
20,000    10        42 sec           35 sec                         13 sec                                  12 sec
20,000    50        70.5 sec         81 sec                         27 sec                                  20 sec
40,000    5         51 sec           40 sec                         20 sec                                  10 sec
40,000    10        62 sec           49 sec                         15 sec                                  15 sec
40,000    50        125 sec          152 sec                        63 sec                                  46 sec
60,000    5         75 sec           57 sec                         16 sec                                  11 sec
60,000    10        83 sec           69 sec                         18 sec                                  15 sec
60,000    50        175 sec          223 sec                        228 sec                                 66 sec
80,000    5         94 sec           70 sec                         20 sec                                  12 sec
80,000    10        100 sec          86 sec                         24 sec                                  21 sec
80,000    50        227 sec          289 sec                        cluster crash                           186 sec
100,000   5         117 sec          87 sec                         21 sec                                  17 sec
100,000   10        125 sec          143 sec                        26 sec                                  24 sec
100,000   50        563 sec          443 sec                        cluster crash                           233 sec

6-2 Analysis of the Proposed Method with Real Values

This section presents the experiments that were run on SOC logs. A Security Operation Center (SOC) is a real-time security platform whose main aim is to detect and prevent attacks and to react to them immediately. To achieve this goal, a SOC uses special log files in the IDMEF (Intrusion Detection Message Exchange Format) format; because of its flexibility, IDMEF is usually realized in extensible markup language (XML). A simple instance of an IDMEF file is presented in Figure 14.


Processing an entire IDMEF file is clearly too time-consuming, so the IDMEF files are preprocessed first. Some useful columns are extracted from the IDMEF files in order to test the proposed method: messageId, sourceIp, targetIp, effect, mechanism, resource, and rawlog. In line with the assumptions of the proposed method, these seven columns are placed in one column family.
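A minimal sketch of such a preprocessing step, assuming the standard Java DOM parser and the element names that appear in the IDMEF instance of Figure 14, might look as follows.

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.File;

public class IdmefPreprocessor {
    // Illustrative sketch: pulls a few of the seven columns out of one IDMEF alert.
    public static void extract(File idmefFile) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(idmefFile);
        // the message id is carried as an attribute of the Alert element
        String messageId = ((Element) doc.getElementsByTagName("Alert").item(0)).getAttribute("messageid");
        // source and target IPs are nested under Source/Target -> Node -> Address -> address
        NodeList addresses = doc.getElementsByTagName("address");
        String sourceIp = addresses.item(0).getTextContent();
        String targetIp = addresses.item(1).getTextContent();
        // effect, mechanism, resource, and rawlog appear as AdditionalData elements
        NodeList extra = doc.getElementsByTagName("AdditionalData");
        for (int i = 0; i < extra.getLength(); i++) {
            Element e = (Element) extra.item(i);
            System.out.println(e.getAttribute("meaning") + " = " + e.getTextContent().trim());
        }
        System.out.println(messageId + " " + sourceIp + " " + targetIp);
    }
}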

The effect, mechanism, and resource columns take a small set of distinct values. The effect column has four values: System Compromised, Reconnaissance, Unknown, and Access. The mechanism column has four values: Host Sweep, Spyware, Trojan, and Network Http. The resource column has three values: Host, Web, and Remote Service. The rawlog column contains a long string such as "Feb 02 07:41:31 snort snort[1]: [1:2101918:7] GPL SCAN SolarWinds IP scan attempt [Classification: Detection of a Network Scan] [Priority: 3] {ICMP} 194.146.150.40 -> 194.146.151.73".

This test was run on tables of IDMEF log records containing between 20,000 and 100,000 records. The following query was run on the table and compared with all records:

Query: messageId="111", sourceIP="192.168.1.1", targetIp="192.168.1.1", effects="Unknown", mechanisms="Host Sweep", resources="Host", rawlog="Priority"

Figure 15 shows the efficiency and query time of the proposed method on the SOC logs.

Figure 15. Comparison of query time for SOC logs

According to Figure 15, the proposed method performs better than the others. Although the difference between the proposed method and the BloomFilter-with-one-column-family configuration is small, as in Figures 9 and 10, the proposed method behaves much better when there are many columns or long strings.

Two further experiments were run to evaluate the accuracy and precision of the proposed method on the SOC logs. In the first experiment, 20,000 records are stored in a table with one column family and seven columns; 310 of them have the same values, shown below:

messageId="555", sourceIp="192.168.1.1", targetIp="192.168.1.1", effects="unknown", mechanisms="Host Sweep", resources="Host", rawlog="Wed Aug 07 15:13:09 EDT 2013 snort snort[1]: [1:150990:13] ET POLICY TeamViewer Dyngate User-Agent [Classification: Attempted Information Leak] [Priority:4] {TCP} 192.168.1.1:2636 -> 192.168.1.1:1272"

Then, to test the precision, the following query was run on the table:

Query: messageId="555", sourceIP="192.168", targetIp="192.168", effects="Unknown", mechanisms="Host Sweep", resources="Host", rawlog="Priority"

All 310 records are retrieved correctly. Table 2 shows the relevant and non-relevant retrieved records. From Table 2, the precision and recall of the proposed method are:

Precision = tp/(tp + fp) => Precision = 310/310 = 100%
Recall = tp/(tp + fn) => Recall = 310/310 = 100%

Table 2. Relevant and non-relevant retrieved records

                 Relevant            Non-relevant
Retrieved        310 records (TP)    0 records (FP)
Not retrieved    0 records (FN)      19,690 records (TN)


<IDMEF-Message xmlns:idmef="http://iana.org/idmef" version="1.0">
  <Alert messageid="135256225841578901">
    <Analyzer analyzerid="snort snort[1]" class="1"/>
    <Assessment>
      <Impact type="other" severity="low">GPL SCAN SolarWinds IP scan attempt</Impact>
    </Assessment>
    <Classification ident="1:2101918" text="Detection of a Network Scan"/>
    <CreateTime ntpstamp="3560413487">2012-10-28T11:44:47</CreateTime>
    <Source spoofed="unknown">
      <Node><Address category="ipv4-addr"><address>194.146.150.40</address></Address></Node>
      <Service iana_protocol_name="ICMP"><port></port></Service>
    </Source>
    <Target decoy="unknown">
      <Node><Address category="ipv4-addr"><address>194.146.151.86</address></Address></Node>
      <Service iana_protocol_name="ICMP"><port></port></Service>
    </Target>
    <AdditionalData type="string" meaning="effect"><string>Reconnaissance</string></AdditionalData>
    <AdditionalData type="string" meaning="resource"><string>Host</string></AdditionalData>
    <AdditionalData type="string" meaning="mechanism"><string>Network Sweep</string></AdditionalData>
    <AdditionalData type="string" meaning="RawLog"><string>Oct 28 11:44:47 snort snort[1]: [1:2101918:7] GPL SCAN SolarWinds IP scan attempt [Classification: Detection of a Network Scan] [Priority: 3] {ICMP} 194.146.150.40 -> 194.146.151.86</string></AdditionalData>
    <AdditionalData type="integer" meaning="Number of Repeats"><integer>1</integer></AdditionalData>
  </Alert>
</IDMEF-Message>

Figure 14. An instance of an IDMEF file

The second experiment differs slightly from the first one. There are again 20,000 records in a table with one column family and seven columns, but 88 records have the following specific values for the sourceIp and targetIp columns:

sourceIp="192.168.170.166", targetIp="192.168.170.165"

and another 146 records have the following values for those two columns:

sourceIp="166.170.168.192", targetIp="165.170.168.192"

So 234 of the 20,000 records have specific IP values in the sourceIp and targetIp columns. Two queries with different conditions were then run on the table:

Query 1: messageId="", sourceIP="170.166", targetIp="170.165", effects="", mechanisms="", resources="", rawlog=""
Query 2: messageId="", sourceIP="166.170.168.192", targetIp="165.170.168.192", effects="", mechanisms="", resources="", rawlog=""

An empty value means that no condition is placed on that column. For the first query, 234 records are retrieved, of which 88 are true results and 146 are false results. Table 3 shows the relevant and non-relevant retrieved records. From Table 3, the precision and recall of the proposed method are:

Precision = tp/(tp + fp) => Precision = 88/234 = 38%
Recall = tp/(tp + fn) => Recall = 88/88 = 100%

Table 3. Relevant and non-relevant retrieved records

                 Relevant           Non-relevant
Retrieved        88 records (TP)    146 records (FP)
Not retrieved    0 records (FN)     19,766 records (TN)

For the second query, 234 records are again retrieved as the final result. The precision and recall for this query are:

Precision = tp/(tp + fp) => Precision = 146/234 = 62%
Recall = tp/(tp + fn) => Recall = 146/146 = 100%

This test shows that, in the worst case, the precision of the proposed method is only mediocre, although its recall is very good: it returns some false positive records but no false negative records, because the algorithm does not keep the positions of the characters, which is a characteristic of the bloom filter algorithm.

Overall, the evaluation and experiments show that the proposed method has good efficiency and sufficient accuracy to retrieve records, although it produces some false positives. The precision of the proposed method depends on the query condition: the more selective the query condition, the better the precision.

7- Conclusion

In this paper, a solution for encrypting data in the HBase database using the bloom filter algorithm is proposed. To make the bloom filter efficient, some changes, such as using the SHA-256 hash function and the BigInteger type, are applied to the algorithm to obtain better accuracy for long strings. The proposed method is further improved by exploiting the structure and features of HBase, such as the coprocessor framework. The evaluation of the obtained results shows that the proposed method achieves 100% recall: it always retrieves the related records, although it also returns some false records whose number depends on the quality of the query condition. As shown, the efficiency of the proposed method, which uses two column families and parallel execution, is much better than that of the basic method with only one column family. In future work, the proposed method could be improved by using the MapReduce framework or by supporting different column families and different columns for each row, which are not provided yet.

References

[1] R. P. Padhy, M. R. Patra, and S. C. Satapathy, "RDBMS to NoSQL: Reviewing Some Next-Generation Non-Relational Databases," International Journal of Advanced Engineering Sciences and Technologies, vol. 11, no. 1, pp. 15-30, 2011.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," ACM SIGOPS Operating Systems Review, vol. 37, no. 5, p. 29, Dec. 2003.
[3] D. Carstoiu, A. Cernian, and A. Olteanu, "Hadoop HBase-0.20.2 Performance Evaluation," in New Trends in Information Science and Service Science (NISS), 2010, pp. 84-87.
[4] D. Carstoiu, E. Lepadatu, and M. Gaspar, "HBase - non SQL Database, Performances Evaluation," International Journal of Advancements in Computing Technology, vol. 2, no. 5, pp. 42-52, 2010.
[5] L. Liu and J. Gai, "Bloom Filter Based Index for Query over Encrypted Character Strings in Database," in 2009 WRI World Congress on Computer Science and Information Engineering, 2009, pp. 303-307.
[6] H. Hacigümüş, B. Iyer, C. Li, and S. Mehrotra, "Executing SQL over encrypted data in the database-service-provider model," in Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD '02), 2002, p. 216.
[7] Z.-F. Wang, J. Dai, B. Shi, and W. Wang, "Fast Query over Encrypted Character Data in Database," Communications in Information and Systems, vol. 4, no. 4, pp. 289-300, 2004.
[8] Y. Zhang, W. Li, and X. Niu, "A Method of Bucket Index over Encrypted Character Data in Database," in Third International Conference on Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP 2007), 2007, vol. 1, pp. 186-189.
[9] Y. Zhang, W. Li, and X. Niu, "Secure cipher index over encrypted character data in database," in 2008 International Conference on Machine Learning and Cybernetics, 2008, pp. 1111-1116.
[10] M. Mitzenmacher, "Compressed Bloom filters," IEEE/ACM Transactions on Networking, vol. 10, no. 5, pp. 604-612, Oct. 2002.
[11] N. Liu, Y. Zhou, X. Niu, and Y. Yang, "Querying Encrypted Character Data in DAS Model," in 2010 International Conference on Networking and Digital Society, 2010, pp. 402-405.
[12] A. Rehman and M. Hussain, "Efficient Cloud Data Confidentiality for DaaS," International Journal of Advanced Science and Technology, vol. 35, pp. 1-10, 2011.
[13] M. Alhanjouri and A. M. Al Derawi, "A New Method of Query over Encrypted Data in Database using Hash Map," International Journal of Computer Applications, vol. 41, no. 4, pp. 46-51, Mar. 2012.
[14] L. Liu and J. Gai, "A Method of Query over Encrypted Data in Database," in 2009 International Conference on Computer Engineering and Technology, 2009, pp. 23-27.
[15] J. Cryans, A. April, and A. Abran, "Criteria to Compare Cloud Computing with Current Database Technology," in Proceedings of the International Conferences on Software Process and Product Measurement, vol. 5338, 2008, pp. 114-126.
[16] L. George, HBase: The Definitive Guide. O'Reilly Media, 2011.
[17] F. Mendel, C. Rechberger, and M. Schläffer, in Advances in Cryptology - ASIACRYPT 2009, LNCS vol. 5912. Berlin, Heidelberg: Springer, 2009, pp. 144-161.
[18] M. Stevens, "On collisions for MD5," M.Sc. thesis, Eindhoven University of Technology, 2007.
[19] NIST, "Announcing the Advanced Encryption Standard (AES)," Federal Information Processing Standards Publication 197, 2001.
[20] P. Hamalainen, T. Alho, M. Hannikainen, and T. D. Hamalainen, "Design and Implementation of Low-Area and Low-Power AES Encryption Hardware Core," in 9th EUROMICRO Conference on Digital System Design (DSD'06), 2006, pp. 577-583.
