A Method of Query over Encrypted Data in HBase Database by Using Bloom Filter Algorithm
Farrokh Shahriari 1, Ahmad Baraani-Dastjerdi 2
1, 2 Computer Engineering Department, Engineering Faculty, University of Isfahan, Isfahan, Iran
[email protected], [email protected]
Abstract
Encryption is one of the key issues in information security. Encryption and data protection have become an active field of research and have received growing attention in recent years, because many organizations and companies store their data in databases. This paper presents a method of querying over encrypted data in the HBase database using the bloom filter algorithm. HBase is one of the various NOSQL (Not Only SQL) databases, whose structure differs from that of relational databases. In our scheme, the bloom filter algorithm is used to create a signature for each stored value; in this way, queries can be run over encrypted data without any additional encryption/decryption operations. Finally, the experimental results reveal that our encryption model retrieves data from HBase with robust performance.
Keywords: Security, Encryption, HBase (Distributed Database), Bloom Filter Algorithm
1- Introduction
With the fast growth of the internet and its widespread use, the number of users and organizations, and the amount of their information, keep increasing. Therefore, big companies such as Google, Facebook, Amazon and Yahoo face new challenges in maintaining and processing terabytes, or sometimes petabytes, of data within a reasonable time. Relational databases such as Oracle, SQL Server and MySQL are no longer adequate for such companies, due to their growing processing needs [1]. Thus, another kind of database has been proposed for storing large amounts of information, whose structure is far different from that of the relational database. This kind of database is known as a NOSQL database; Google Bigtable and Apache HBase are examples.
Relational databases, in which data is saved as tables, lack the power to store and process large amounts of unstructured, non-relational data in a reasonable time. In addition, scaling these databases out is a hard task, because the specialized hardware needed to build such systems is both scarce and expensive. Therefore, distributed databases are preferred to relational ones, since they do not have the weaknesses of the traditional databases [2].
Google Bigtable is an example of a distributed database; it is based on the Google File System (GFS), but its details have not been published by Google, and only its overall structure has been described [2]. The HBase database is the open-source version of Google Bigtable, introduced by the Apache Software Foundation [3]. This database can be deployed on top of the Hadoop distributed file system, which gives it higher reliability. Moreover, it supports MapReduce, the framework for parallel processing of data on the Hadoop distributed file system [4].
On the other hand, security and data protection have been concerns since the beginning of databases. Access control prevents unauthorized people from reaching the data through authentication, and grants each user access proportional to their needs. The concepts of encrypting data, storing it in databases, and querying over the encrypted data were introduced in the last decade. The ability to query encrypted data directly yields better efficiency: there is no need to decrypt every row for every query in order to compare its value with the query condition, so the whole procedure becomes faster.
The method introduced in this paper makes it possible to encrypt data in the HBase database and to query it while it remains encrypted. The method extends the bloom filter algorithm of [5]; it changes it considerably so that it can be used in HBase, because the HBase structure differs from relational databases. In our method, the data is encrypted with AES-128, and the signature of the encrypted data is stored in a new field. During querying, the signature of the query condition is calculated and compared with all stored signatures, using bitwise logical operations on the row signatures and the query signature. Finally, the rows whose signatures match the query signature are returned as output. Applying this method to the HBase database shows that it retrieves information with adequate efficiency.
The rest of the paper is organized as follows: Section 2 provides a brief review of querying over encrypted data in relational databases. The bloom filter encryption method for relational databases [5] is described in Section 3. The architecture of the HBase database is briefly explained in Section 4. The proposed method is introduced in Section 5; this section covers the changes applied to the bloom filter algorithm used in relational databases, the use of the HBase structure for encryption, and the performance optimization of the bloom filter algorithm in HBase. Section 6 presents the experimental study and, finally, concluding remarks are presented in Section 7.
2- Related Works
This section presents the related works that concentrate on querying over encrypted data. In 2002, Hakan Hacigumus [6] proposed encrypting each row and storing the related signature in a new field named etuple. This method used a block cipher for encryption after partitioning the values and giving an id to each partition. It is very time consuming for large amounts of data, since it creates a signature for each value, and it supported only numeric data, not character strings.
In 2004, Zheng-Fei Wang et al. [7] proposed a two-phase query method for encrypting numeric and character data. The first phase filtered out false records using signatures; the second phase decrypted the results retrieved in the first phase and queried them to obtain the final results. This method enabled querying over encrypted data by placing an encryption/decryption layer between the database and the application, and by creating a new attribute for the signatures. But according to its designers, the method did not respond well to queries on long character strings [7].
In 2007, Yong Zhang et al. [8] proposed a bucket index on character data, which converted character data into numeric values to speed up processing. In 2008, they improved their method [9]: two new attributes were added to the table, an index and an id that was unique for each row. Since different rows could share the same index value, a new index distinct for each row was created in the last step of encryption by a reversible function taking the index and the id as input parameters. The drawback of this method is its excessive computational complexity, which becomes noticeable for large volumes of data such as terabytes and petabytes.
In 2009, Lianzhang Liu and Hingfan Gai [5] proposed a new method that creates a bucket index for numeric data, uses the bloom filter algorithm [10] for character strings, and encrypts a particular column. This method has no false negatives, but can have some false positives. Briefly, the procedure is as follows: first, the words are isolated; second, the characters of each word are divided into groups of two letters, three letters, and so on; third, the bloom filter algorithm is used to convert the character value into a numeric value. Therefore, this method does not perform well when encrypting long strings.
In 2010, Nian Liu et al. [11] proposed a new method for DAS (Database as a Service) databases whose index is built from two parts: the first part keeps the set of characters present in the main string, and the second part stores the positions of those characters. In 2011, M. Hussain and Atiq ur Rehman [12] proposed another way to encrypt data in DAS databases. In this method, the metadata containing the encryption/decryption keys and the obfuscation/de-obfuscation information is created on the client side; that is, obfuscation and encryption are performed on the client machines and the encrypted data is then sent to the server for storage. It must be noted that these last two cases concern cloud databases, which are different from distributed databases.
In 2012, Al Derawi and Mohammed Alhanjouri [13] proposed a method to encrypt data and query over the encrypted data. The basis of their method is similar to the previous ones in that it performs the encryption/decryption operations through a middle layer and creates a new attribute as a signature. What distinguishes this approach from the others is the use of a hash function to generate the signatures. The method is suitable only for exact-match where-clause queries, not for fuzzy queries such as like "%string%".
3- Bloom Filter Algorithm for Relational Databases
The procedure of [5] is as follows. First, a string of characters is encrypted with the AES-128 algorithm and stored in the table. Then the signatures are created and kept in separate columns, so that the algorithm is later able to query over the encrypted data. Creating a signature consists of the following steps. At the beginning, the string is split at special characters such as blank, comma and full stop into a set of words; then, for each word, the MD5 hash function is computed four times for every two adjacent characters. To obtain four different outputs for the same string, it is enough to append four different constant values to it before hashing; these appended values must be the same in all phases, because the query string is processed in the same way. The hash outputs are then mapped to the numerical range (0, m), where m = 32; these numbers are called num. At last, the final signature is calculated as

signature = Σ 2^num  (summed over all distinct num values)

and is stored in a separate column. The details are available in [14].
Finally, when there is a query over the encrypted data, the query condition is first converted into a signature, called signature_q, by the bloom filter algorithm, and then compared with all signatures stored in the table. The rows satisfying the condition below are returned as the final results:

bitand(signature_i, signature_q) = signature_q
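As a concrete illustration, the signature construction and the matching rule above can be sketched in Java with BigInteger bit operations. The bit positions and row values below are invented for the example, not taken from the paper:

```java
import java.math.BigInteger;

// Minimal sketch of the signature scheme of [5]: a signature is a bitmap in
// which bit `num` is set for every hashed fragment of the stored value.
public class SignatureMatch {
    // Build a signature from precomputed bit positions (the `num` values).
    static BigInteger signature(int... nums) {
        BigInteger sig = BigInteger.ZERO;
        for (int num : nums) sig = sig.setBit(num); // sig += 2^num over distinct nums
        return sig;
    }

    // A row matches when every bit of the query signature is also set in the
    // row signature: bitand(signature_i, signature_q) = signature_q.
    static boolean matches(BigInteger sigRow, BigInteger sigQuery) {
        return sigRow.and(sigQuery).equals(sigQuery);
    }

    public static void main(String[] args) {
        BigInteger row = signature(3, 7, 12, 25); // hypothetical row bits
        BigInteger q1 = signature(7, 25);         // subset of row bits: match
        BigInteger q2 = signature(7, 26);         // bit 26 not in row: no match
        System.out.println(matches(row, q1));
        System.out.println(matches(row, q2));
    }
}
```

Because the comparison is a pure bit test, a match can be a false positive (different fragments can hash to the same bits), but never a false negative, which agrees with the error behavior reported for [5].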
4- HBase: The Hadoop Database
Today, due to the inability of relational databases to keep unstructured data and to query large volumes of data such as terabytes or petabytes, a new kind of database, the NOSQL database, has been introduced; Apache HBase and Google Bigtable are the most famous ones.
HBase is the open-source version of Google Bigtable, based on the Hadoop distributed file system (HDFS). Each row can have millions of different columns per column family, and HBase is suitable for sparse tables because its structure resembles a linked list, so it does not store null values [15]. Since HBase is a distributed database, its architecture is composed of a single master and some region servers. Region servers are responsible for keeping regions, because the data is distributed across regions. The main tasks of the master are to monitor the regions, to coordinate the region servers, and to recover data if a region server goes down [4]. Any kind of client request, such as adding or deleting a value, is sent to the master server, whose task is then to forward the request to the region servers. Another feature of HBase is that it can run on top of HDFS; this increases the reliability of HBase and also provides parallel processing through the MapReduce framework, which is based on the Hadoop file system. Figure 1 shows the physical schema of regions and region servers.
Figure 1. Physical schema of regions and region servers [16]
5- Proposed Method
This section describes the proposed method, consisting of the encryption algorithm used and its optimization. The novelty of this paper is querying over encrypted data in NOSQL databases, which has not yet been proposed. Section 5-1 describes the changes made to the bloom filter algorithm; section 5-2 expresses the procedure of the algorithm with respect to HBase's structure; and section 5-3 describes the work done to optimize the efficiency of the bloom filter algorithm used in HBase.
5-1 The Changes Made to Bloom Filter in order to Use It in HBase
The bloom filter method in HBase is implemented under these major assumptions:
- All columns are identical for all rows.
- All rows are placed in one column family for encryption.
- The AES key is the same for all encrypted rows.
As explained before, the bloom filter method used the MD5 hash function. Despite its good speed, MD5 has a major disadvantage that makes it unusable for the encryption method in HBase: its output is only 128 bits, collisions can be found in a short time and, according to [17, 18], MD5 is not recommended for security purposes. Therefore, the SHA-256 hash function is used instead of MD5.
HBase is written in Java, hence the Java language is used to implement the bloom filter in HBase. Due to the range restrictions of primitive types such as int and long, the BigInteger type is used in the implementation (step 4). With this type there is no limitation on the choice of _lengthOfIndex, and any value can be assigned to it (step 6); according to the tests performed, a value of 1024 seems proper for the parameter _lengthOfIndex. It should be noted that the long type is faster than the BigInteger type. After choosing the BigInteger type and setting _lengthOfIndex = 1024, the algorithm obtained the required accuracy. The following section explains how this algorithm is used in HBase. Figure 2 shows the pseudo code of the bloom filter algorithm used.
Figure 2. The Proposed Bloom Filter Algorithm
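The MultiHashFunc step of the algorithm can be sketched as follows: SHA-256 replaces MD5, and four different outputs for the same fragment are obtained by appending four fixed values before hashing. The appended values ("0" through "3") and the mapping into the index range via mod are illustrative assumptions, since the paper does not fix them:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch of a SHA-256-based multi-hash: hash the fragment with four fixed
// suffixes and map each digest into [0, _lengthOfIndex) to obtain four
// bit positions for the bloom filter.
public class MultiHash {
    static final int LENGTH_OF_INDEX = 1024; // _lengthOfIndex
    static final int NUM_OF_HASH_FUNC = 4;   // _numOfHashFunc

    static BigInteger[] multiHashSha256(String fragment) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        BigInteger[] positions = new BigInteger[NUM_OF_HASH_FUNC];
        for (int k = 0; k < NUM_OF_HASH_FUNC; k++) {
            // Appending a different fixed value per round yields four
            // different digests for the same fragment.
            byte[] digest = sha256.digest((fragment + k).getBytes(StandardCharsets.UTF_8));
            // Map the 256-bit digest into the range [0, _lengthOfIndex).
            positions[k] = new BigInteger(1, digest).mod(BigInteger.valueOf(LENGTH_OF_INDEX));
        }
        return positions;
    }

    public static void main(String[] args) throws Exception {
        for (BigInteger p : multiHashSha256("te")) System.out.println(p);
    }
}
```

As the paper notes, the appended values must be identical when indexing and when querying, so that the same fragment always yields the same positions.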
5-2 Storing the Encrypted Data and Signatures in HBase
Among the existing encryption methods, AES is the most secure; that is why AES is used to encrypt the data in HBase [19, 20]. The next passage explains how the bloom filter algorithm is used.
For each row in HBase, the column family and the columns in which the data is stored must be determined. The data is then encrypted with AES-128 and the related signature is generated as described in section 5-1; the encrypted data and its signature are stored in one column family and two different columns. For example, the encrypted data and the related signature are stored in a column family called cf1 with two columns, field1 keeping the encrypted data and field1_signature keeping the signature (Figure 3).
According to the HBase storage and data-retrieval model, which will be explained completely in section 5-3, all data in a column family is stored in a file named HFile. Therefore, when a query is executed, all data, including the encrypted values and the signatures, is sent to the client machine, which decreases transfer speed and efficiency and consumes large bandwidth. For example, if there are 50 columns in a column family, another 50 signature columns are created in that column family, for a total of 100 columns. As a result, all 100 columns are transferred to the client machine, although only the 50 signature columns need to be compared with the query. This kind of storage prevents the query over encrypted data from being fast, because all data in the column family must first be transferred to the client before the query condition can be compared with the retrieved signatures.
To fix this problem, the encrypted data is stored in a column family different from the signature column family. For example, the encrypted data is kept in a column family named cf1 and the related signatures in another column family named cf2; the signatures are linked to the encrypted data by the row keys. With this separation, only the rows of the signature column family (just the 50 signature columns) are transferred to the client for each query, which is faster than the previous scheme (Figure 4).
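The encryption half of this step can be sketched with the JDK's own crypto API. The 16-byte key and the cipher transformation ("AES", the JDK default of ECB with PKCS5 padding) are illustrative assumptions; the paper fixes only AES-128, not the mode:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;

// Sketch of the per-value encryption: the ciphertext would go to one column
// family (e.g. cf1:field1) and the bloom filter signature of the plaintext
// to another (e.g. cf2:field1_signature), linked by the row key.
public class AesField {
    // Hypothetical 16-byte (AES-128) key; the paper assumes one key for all rows.
    static final SecretKeySpec KEY =
        new SecretKeySpec("0123456789abcdef".getBytes(StandardCharsets.UTF_8), "AES");

    static byte[] encrypt(String plain) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, KEY);
        return c.doFinal(plain.getBytes(StandardCharsets.UTF_8));
    }

    static String decrypt(byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, KEY);
        return new String(c.doFinal(cipherText), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] ct = encrypt("this is a test for field[1]");
        // ct is what would be written to cf1:field1; querying never needs to
        // decrypt it, since matching is done on the signature in cf2.
        System.out.println(decrypt(ct));
    }
}
```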
Figure 5 simply shows the procedure of the
proposed method.
Figure 3. Encrypted data and signatures are stored in a column family
named ColFam1
(1) Algorithm: Create Signature Value by Bloom Filter
(2) Input: original string, length of split unit, number of hash functions, length of index
(3) Output: numeric value of the signature
(4) BigInteger _targetValue = BigInteger.ZERO;
(5) _splitUnit = 2;
(6) _lengthOfIndex = 1024;
(7) _numOfHashFunc = 4;
(8) String[] _allWords = split(_originalString); // split the string into words
(9) SortedSet<BigInteger> s = new TreeSet<BigInteger>(); // sorted set of hash positions
(10) for (int i = 0; i < _allWords.length; i++) {
(11)     if (_allWords[i].length() <= _splitUnit) {
(12)         BigInteger[] tt = MultiHashFunc_SHA256(_allWords[i]);
(13)         for (int r = 0; r < tt.length; r++) // put the position values into s
(14)             s.add(tt[r]);
(15)     }
(16)     else { // split _allWords[i] into units of length _splitUnit, building the bloom filter on them
(17)         for (int j = 0; j < _allWords[i].length() - _splitUnit + 1; j++) {
(18)             String temp = _allWords[i].substring(j, j + _splitUnit);
(19)             BigInteger[] tt = MultiHashFunc_SHA256(temp);
(20)             for (int r = 0; r < tt.length; r++) // put the position values into s
(21)                 s.add(tt[r]);
(22)         }
(23)     }
(24) }
// convert the bloom filter s into numeric data
(25) Iterator<BigInteger> it = s.iterator();
(26) while (it.hasNext()) {
(27)     BigInteger tt = it.next();
(28)     BigInteger pow = new BigInteger("2");
(29)     _targetValue = _targetValue.add(pow.pow(tt.intValue()));
(30) }
(31) return _targetValue;
Figure 4. Encrypted data and signatures are stored in two different
column families named ColFam1 and ColFam2
Figure 5. The schema of storing encrypted data and signatures
5-3 Optimization to Improve Performance
As described in the previous section, part of the implementation of the bloom filter algorithm depends on the scan operation in HBase. Thus, the scan operation is first explained in detail, and then some optimization methods are illustrated.
The scan operation is performed through RPC, which sends the query condition of the scan to the regions. Searching is then done on each region, the results are collected by the region servers and finally sent to the master server, which here acts as the client machine. Another advantage of the scan operation is that a specific filter can be set on it; for example, a scan can be restricted to a specific column family or to specific columns. Filters can be set on parameters such as the column family, specific columns within a column family, row keys, and the timestamps of the data. Setting the startRowKey and stopRowKey is another distinguishing feature of the scan operation; this feature is used to implement the parallel scan in the bloom filter algorithm (Figure 6).
Figure 6. The filters created on the client side, sent through the RPC,
and executed on the server side [16]
Regarding the bloom filter algorithm in HBase, all signatures in a table must be compared with the query signature. Therefore, no specific filter can be used for the scan operation, except that a specific column family can be filtered once the column families have been separated into encrypted data and signatures.
Since the scan operation is performed based on row keys and regions, and the bloom filter algorithm compares the query signature with all other signatures, it can be time consuming. To optimize the bloom filter algorithm, the scan is implemented in parallel, as explained in the following.
Each region server controls a number of regions in HBase, so the scan operation is performed through the different region servers simultaneously: a thread is created for each region server, and each thread is responsible for a scan bounded by the startRowKey and stopRowKey of its regions (Figure 7). It should be noted that this method is implemented under the assumption that the number of regions equals the number of region servers and that each region is large enough that no region needs to be split.
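The parallel scan described above can be sketched without a cluster by modelling the table as a sorted map from row key to signature, presplit into [startRowKey, stopRowKey) ranges, with one thread per range standing in for one bounded scan per region server. Each thread returns its signatures to the collector (the master), which applies the bitand comparison. The in-memory map and the split points are illustrative assumptions:

```java
import java.math.BigInteger;
import java.util.*;
import java.util.concurrent.*;

// Sketch of the parallel bounded scan; comparison still happens at the
// "master", as in the first design discussed in the paper.
public class ParallelScan {
    // One bounded scan: rows with start <= key < stop.
    static Map<String, BigInteger> scanRange(NavigableMap<String, BigInteger> table,
                                             String start, String stop) {
        return new TreeMap<>(table.subMap(start, true, stop, false));
    }

    static List<String> query(NavigableMap<String, BigInteger> table,
                              String[] splits, BigInteger sigQ) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(splits.length - 1);
        List<Future<Map<String, BigInteger>>> parts = new ArrayList<>();
        for (int i = 0; i < splits.length - 1; i++) {
            final String start = splits[i], stop = splits[i + 1];
            parts.add(pool.submit(() -> scanRange(table, start, stop)));
        }
        List<String> hits = new ArrayList<>();
        for (Future<Map<String, BigInteger>> f : parts)       // master-side compare:
            for (Map.Entry<String, BigInteger> e : f.get().entrySet())
                if (e.getValue().and(sigQ).equals(sigQ))      // bitand(sig_i, sig_q) = sig_q
                    hits.add(e.getKey());
        pool.shutdown();
        return hits;
    }
}
```

Note that every signature still travels to the collector before being compared, which is exactly the master-side bottleneck the paper describes next.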
To obtain good performance, it is better for each region to hold the same number of rows, so the table can be divided into regions by row keys, which is known as presplitting regions. Presplitting optimizes the parallel scan by distributing the data uniformly over the regions.
The above method, however, creates a bottleneck on the master side and can crash the database: the master server must receive a large amount of data, including the signatures, from the different region servers at the same time and compare it with the query signature. Receiving so much data at once, the master cannot process it all simultaneously, which leads to an HBase crash.
Figure 7. Parallel scanning in HBase
To solve this problem, the Coprocessor framework of HBase can be used. This feature makes it possible to run arbitrary code on a specific region or region server [16]. With this feature enabled, the volume of data transmitted to the master server is greatly reduced, because the signature comparisons are performed on each region server. Therefore, the bloom filter algorithm can be implemented in HBase without any bottlenecks or problems. Figure 8 shows how the coprocessor framework works.
Figure 8. Coprocessors executed sequentially, in their environment, and
per region [16]
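The effect of the coprocessor approach can be sketched by moving the bitand comparison into the per-region worker, so that only matching row keys travel back to the master. The in-memory "regions" and their contents are illustrative assumptions, not HBase coprocessor API calls:

```java
import java.math.BigInteger;
import java.util.*;
import java.util.concurrent.*;

// Coprocessor-style evaluation: the comparison runs where the data lives,
// and the master only merges small per-region result sets.
public class CoprocessorStyleQuery {
    // One "region": row key -> signature; filter locally.
    static List<String> localFilter(Map<String, BigInteger> region, BigInteger sigQ) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, BigInteger> e : region.entrySet())
            if (e.getValue().and(sigQ).equals(sigQ)) // bitand(sig_i, sig_q) = sig_q
                hits.add(e.getKey());
        return hits;
    }

    static List<String> query(List<Map<String, BigInteger>> regions,
                              BigInteger sigQ) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(regions.size());
        List<Future<List<String>>> parts = new ArrayList<>();
        for (Map<String, BigInteger> region : regions)
            parts.add(pool.submit(() -> localFilter(region, sigQ)));
        List<String> all = new ArrayList<>();
        for (Future<List<String>> f : parts) all.addAll(f.get()); // only matches travel
        pool.shutdown();
        return all;
    }
}
```

Compared with the previous sketch, the per-region workers now return only the matching keys instead of every signature, which is the data reduction that removes the master-side bottleneck.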
6- Performance Analysis
The experiments and tests were run on three virtual machines, as follows:
- The first machine had a single-core 2.2 GHz processor, 2 GB of RAM and a 15 GB hard disk; the Hadoop Namenode, a Hadoop Datanode and the HBase Master ran on this machine.
- The second machine had a single-core 2.2 GHz processor, 1.5 GB of RAM and a 10 GB hard disk; a Hadoop Datanode and an HBase Region Server ran on this machine.
- The third machine had a single-core 2.2 GHz processor, 1.5 GB of RAM and a 10 GB hard disk; a Hadoop Datanode and an HBase Region Server ran on this machine.
The proposed method was tested and evaluated against three other configurations: the implementation with just one column family, the parallel implementation of that configuration, and a simple encryption/decryption implementation; the results show that the proposed method performs well.
In section 6-1, the correctness, precision and efficiency of the proposed method are tested with random values. In section 6-2, the proposed method is tested with real values, namely the logs of a SOC (Security Operation Center), which will be described later.
6-1 Analysis of the Proposed Method with Arbitrary Values
In this part, the proposed method is tested with different arbitrary random values. Table 1 shows the query-time comparison between the different configurations, and the related diagrams are presented in Figures 9, 10 and 11 for tables with 5, 10 and 50 columns, respectively. Each configuration was tested at least three times and the average was taken as the query time.
In another experiment, the efficiency of the proposed method for numeric values was tested by inserting different IP values into the table. The IP values have four parts; the first and second parts were chosen randomly from (192, 60, 120, 230) and (168, 69, 212, 123), respectively, and the third and fourth parts were random numbers in the range (0-255). The results are shown in Figures 12 and 13.
To test the accuracy of the proposed method, in the first experiment 90 random values and 10 specific values were added to a table with 5 columns:
Field1="this is a test for field[1]"
Field2="this is a test for field[2]"
. . .
Field5="this is a test for field[5]"
Then, to test the precision, different queries were run on that table:
Query1: field1="test", field2="test", field3="test", field4="test", field5="test"
Query2: field1="field", field2="field", field3="field", field4="field", field5="field"
Query3: field1="this", field2="is", field3="for", field4="test", field5="field"
Query4: field1="hello to this test", field2="field", field3="field", field4="field", field5="field"
For the first three queries, all 10 rows were returned as output, but there was no result for the last query.
In the second experiment, 90 random values of the following form were added to a table with 5 columns:
"this test is: random characters"
where the random part consists of 10 characters between "a" and "z" or digits between "0" and "9". The following 10 specific strings were also added to the table:
"this is a test for field[i]"
Then the following queries were run on the table:
Query1: field1="test", field2="test", field3="test", field4="test", field5="test"
Query2: field1="field", field2="field", field3="field", field4="field", field5="field"
Query3: field1="this test is", field2="this test is", field3="this test is", field4="this test is", field5="this test is"
Query4: field1="this is a test for", field2="this is a test for", field3="this is a test for", field4="this is a test for", field5="this is a test for"
For the above queries, 97, 10, 88 and 10 rows were returned, respectively.
Figure 9. Comparison of query time - for 5 columns
Figure 10. Comparison of query time - for 10 columns
Figure 11. Comparison of query time - for 50 columns
Figure 12. Comparison of query time for numeric values - for 10 columns
Figure 13. Comparison of query time for numeric values - for 50 columns
Table 1. Comparison of query times for random values

Number    Number of   Simple     BloomFilter with   Parallel BloomFilter    Proposed
of Rows   Columns                1 column family    with 1 column family    Method
20000     5           38 sec     24 sec             15 sec                  13 sec
20000     10          42 sec     35 sec             13 sec                  12 sec
20000     50          70.5 sec   81 sec             27 sec                  20 sec
40000     5           51 sec     40 sec             20 sec                  10 sec
40000     10          62 sec     49 sec             15 sec                  15 sec
40000     50          125 sec    152 sec            63 sec                  46 sec
60000     5           75 sec     57 sec             16 sec                  11 sec
60000     10          83 sec     69 sec             18 sec                  15 sec
60000     50          175 sec    223 sec            228 sec                 66 sec
80000     5           94 sec     70 sec             20 sec                  12 sec
80000     10          100 sec    86 sec             24 sec                  21 sec
80000     50          227 sec    289 sec            cluster crash           186 sec
100000    5           117 sec    87 sec             21 sec                  17 sec
100000    10          125 sec    143 sec            26 sec                  24 sec
100000    50          563 sec    443 sec            cluster crash           233 sec
6-2 Analysis of the Proposed Method with Real Values
This section presents the results of experiments performed on SOC logs. A Security Operation Center (SOC) is a real-time security platform whose main aim is to detect and prevent attacks and to react suitably and immediately. To achieve this goal, the SOC uses special log files in the IDMEF (Intrusion Detection Message Exchange Format) format. IDMEF is commonly implemented in extensible markup language (XML) because of its flexibility. A simple instance of an IDMEF file is presented in Figure 14.
It is obvious that processing a whole IDMEF file is too time consuming; therefore, the IDMEF files must be preprocessed. To test the proposed method, some useful columns are extracted from the IDMEF files: messageId, sourceIp, targetIp, effect, mechanism, resource and rawlog. According to the assumptions of the proposed method, these seven columns are placed in one column family.
The effect, mechanism and resource columns have distinct values. The effect column has four different values: System Compromised, Reconnaissance, Unknown and Access. The mechanism column includes four values: Host Sweep, Spyware, Trojan and Network Http. The resource column has three different values: Host, Web and Remote Service. The rawlog column includes a long string such as "Feb 02 07:41:31 snort snort[1]: [1:2101918:7] GPL SCAN SolarWinds IP scan attempt [Classification: Detection of a Network Scan] [Priority: 3] {ICMP} 194.146.150.40 -> 194.146.151.73".
This test was performed on a table of IDMEF log records containing between 20,000 and 100,000 records. The query run on the table and compared with all records is:
Query: messageId="111", sourceIP="192.168.1.1", targetIp="192.168.1.1", effects="Unknown", mechanisms="Host Sweep", resources="Host", rawlog="Priority"
Figure 15 shows the efficiency and the query time of the proposed method, tested on the SOC logs.
Figure 15. Comparison of query time for SOC's logs
According to Figure 15, the performance of the proposed method is better than that of the others. Although the difference between the proposed method and the BloomFilter-with-one-column-family method is small, as in Figures 9 and 10, the proposed method behaves more desirably when there are many columns or long strings.
Also two other experiments are done to evaluate the
accuracy and precision of the proposed method on
the SOC logs. In the first experiment, there are
20,000 records which are stored in a table with one
column family and seven columns. 310 records have
the same values as shown below:
"messageId=555", "sourceIp=192.168.1.1", "targetIp=192.168.1.1", "effects=unknown", "mechanisms=Host Sweep", "resources=Host", "rawlog=Wed Aug 07 15:13:09 EDT 2013 snort snort[1]: [1:150990:13] ET POLICY TeamViewer Dyngate User-Agent [Classification: Attempted Information Leak] [Priority: 4] {TCP} 192.168.1.1:2636 -> 192.168.1.1:1272"
Then to test the precision, a query is run on the
table:
Query: messageId=”555”, sourceIP=”192.168”,
targetIp=”192.168”, effects=”Unknown”, mechanisms=”Host
Sweep”, resources=”Host”, rawlog=”Priority”
All 310 records are retrieved correctly. Table 2 shows the relevant and non-relevant retrieved records. According to Table 2, the precision and recall of the proposed method are measured as follows:
Precision = TP/(TP + FP) => Precision = 310/310 = 100%
Recall = TP/(TP + FN) => Recall = 310/310 = 100%
Table 2. Relevant and non-relevant retrieved records
                 Relevant            Non-relevant
Retrieved        310 records (TP)    0 records (FP)
Not retrieved    0 records (FN)      19,690 records (TN)
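These measures follow directly from the confusion counts. A quick check of the arithmetic, using plain precision/recall helpers (not code from the paper):

```python
# Precision and recall computed from confusion counts, following the
# formulas used in the text (TP = true positives, FP = false positives,
# FN = false negatives).
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

# Counts from Table 2 (first SOC-log experiment): every relevant record
# is retrieved and no irrelevant record slips in.
tp, fp, fn = 310, 0, 0
print(precision(tp, fp), recall(tp, fn))  # 1.0 1.0
```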
[Chart residue of Figure 15: query-time comparison for SOC's logs; series compared: BloomFilter Enc with 1 ColumnFamily, Proposed Method, Simple Enc/Dec, Parallel BloomFilter Enc with 1 ColumnFamily]
<IDMEF-Message xmlns:idmef="http://iana.org/idmef" version="1.0">
  <Alert messageid="135256225841578901">
    <Analyzer analyzerid="snort snort[1]" class="1"/>
    <Assessment>
      <Impact impacttype="other" severity="low">GPL SCAN SolarWinds IP scan attempt</Impact>
    </Assessment>
    <Classification ident="1:2101918" text="Detection of a Network Scan"/>
    <CreateTime ntpstamp="3560413487">2012-10-28T11:44:47</CreateTime>
    <Source spoofed="unknown">
      <Node><Address category="ipv4-addr"><address>194.146.150.40</address></Address></Node>
      <Service iana_protocol_name="ICMP"><port></port></Service>
    </Source>
    <Target decoy="unknown">
      <Node><Address category="ipv4-addr"><address>194.146.151.86</address></Address></Node>
      <Service iana_protocol_name="ICMP"><port></port></Service>
    </Target>
    <AdditionalData type="string" meaning="effect"><string>Reconnaissance</string></AdditionalData>
    <AdditionalData type="string" meaning="resource"><string>Host</string></AdditionalData>
    <AdditionalData type="string" meaning="mechanism"><string>Network Sweep</string></AdditionalData>
    <AdditionalData type="string" meaning="RawLog"><string>Oct 28 11:44:47 snort snort[1]: [1:2101918:7] GPL SCAN SolarWinds IP scan attempt [Classification: Detection of a Network Scan] [Priority: 3] {ICMP} 194.146.150.40 -> 194.146.151.86</string></AdditionalData>
    <AdditionalData type="integer" meaning="Number of Repeats"><integer>1</integer></AdditionalData>
  </Alert>
</IDMEF-Message>
Figure 14. An instance of IDMEF File
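The experiments load IDMEF alerts like the one above into the table's seven columns. A hypothetical sketch of that extraction follows; the element paths mirror the instance in Figure 14, but the function name and parsing rules are illustrative, not taken from the paper.

```python
import xml.etree.ElementTree as ET

def parse_idmef(xml_text: str) -> dict:
    """Extract the columns used in the experiments from one IDMEF alert.
    effect, mechanism, resource, and rawlog travel as AdditionalData."""
    root = ET.fromstring(xml_text)
    alert = root.find("Alert")
    row = {
        "messageId": alert.get("messageid"),
        "sourceIp": alert.findtext("Source/Node/Address/address"),
        "targetIp": alert.findtext("Target/Node/Address/address"),
    }
    for extra in alert.findall("AdditionalData"):
        meaning = extra.get("meaning")
        if meaning in ("effect", "mechanism", "resource", "RawLog"):
            row[meaning] = extra.findtext("string")
    return row

# A reduced sample alert (fields trimmed from the Figure 14 instance).
sample = (
    '<IDMEF-Message version="1.0"><Alert messageid="135256225841578901">'
    '<Source><Node><Address category="ipv4-addr">'
    '<address>194.146.150.40</address></Address></Node></Source>'
    '<Target><Node><Address category="ipv4-addr">'
    '<address>194.146.151.86</address></Address></Node></Target>'
    '<AdditionalData type="string" meaning="effect">'
    '<string>Reconnaissance</string></AdditionalData>'
    '</Alert></IDMEF-Message>'
)
row = parse_idmef(sample)
print(row["sourceIp"])  # 194.146.150.40
```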
The second experiment differs slightly from the first. As before, 20,000 records are stored in a table with one column family and seven columns, but 88 records have specific values for the sourceIp and targetIp columns, as below:
sourceIp=”192.168.170.166”, targetIp=”192.168.170.165”
In addition, 146 records have specific values for those two columns, as follows:
sourceIp=”166.170.168.192”, targetIp=”165.170.168.192”
Therefore, 234 of the 20,000 records have specific IP values in the sourceIp and targetIp columns. Two queries with different conditions are then run on the table:
Query 1: messageId=” ”, sourceIP=”170.166”,
targetIp=”170.165”, effects=” ”, mechanisms=” ”, resources=”
”, rawlog=” ”
Query 2: messageId=” ”, sourceIP=”166.170.168.192”,
targetIp=”165.170.168.192”, effects=” ”, mechanisms=” ”,
resources=” ”, rawlog=” ”
It should be noted that an empty value means there is no condition on that column. For the first query, 234 records are retrieved: 88 are true results of the query and 146 are false results.
Table 3 shows the relevant and non-relevant retrieved records. According to Table 3, the precision and recall of the proposed method are measured as follows:
Precision = TP/(TP + FP) => Precision = 88/234 ≈ 38%
Recall = TP/(TP + FN) => Recall = 88/88 = 100%
Table 3. Relevant and non-relevant retrieved records
                 Relevant            Non-relevant
Retrieved        88 records (TP)     146 records (FP)
Not retrieved    0 records (FN)      19,766 records (TN)
For the second query, 234 records are retrieved as the final result. The precision and recall for this query are as follows:
Precision = TP/(TP + FP) => Precision = 146/234 ≈ 62%
Recall = TP/(TP + FN) => Recall = 146/146 = 100%
This test shows that, in the worst case, the precision of the proposed method is mediocre, while its recall remains perfect: the method produces some false-positive records but no false negatives. The false positives arise because the bloom filter algorithm, by design, records only which characters occur, not their positions.
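The effect of discarding positions can be seen directly with the two IPs of the second experiment, which are octet-reversals of each other and therefore contain exactly the same characters. A minimal sketch (illustrative character-level signature, not the paper's exact implementation):

```python
import hashlib

def signature(value: str, bits: int = 256, hashes: int = 3) -> int:
    """Position-agnostic bloom signature: one bit set per (hash, char) pair.
    Bit width and hash count are illustrative assumptions."""
    sig = 0
    for ch in value:
        for i in range(hashes):
            d = hashlib.sha256(f"{i}:{ch}".encode()).digest()
            sig |= 1 << (int.from_bytes(d, "big") % bits)
    return sig

# The two source IPs from the second experiment contain the same
# characters in a different order, so their signatures are identical:
# a query matching one signature necessarily matches the other.
a = signature("192.168.170.166")
b = signature("166.170.168.192")
print(a == b)  # True
```

This is exactly why the 146 reversed-IP records are retrieved as false positives alongside the 88 true matches.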
Overall, the evaluation and experiments show that the proposed method retrieves records efficiently and with sufficient accuracy, although it produces some false-positive errors. The precision of the proposed method depends on the query condition: the more selective the condition, the better the precision.
7- Conclusion
In this paper, a solution for data encryption in the HBase database is proposed using the bloom filter algorithm. To use the bloom filter efficiently, some changes, such as the SHA1-256 hash function and the BigInteger type, are applied to the algorithm to improve its accuracy for long strings. The proposed method is further improved by exploiting the structure and features of HBase, such as the coprocessor framework. The evaluation of the test results shows that the proposed method achieves 100% recall: it always retrieves the related records correctly, although it also returns some false records, whose number depends on the quality of the query condition. As shown, the efficiency of the proposed method, which uses two column families and parallel execution, is much better than that of the basic method with only one column family. In future work, the proposed method could be improved by implementing the MapReduce framework or by supporting different column families and different columns for each row, which are not provided yet.
References
[1] R. P. Padhy, M. R. Patra, and S. C. Satapathy, “RDBMS to
NoSQL: Reviewing Some Next-Generation Non-Relational
Database’s,” International Journal of Advanced Engineering Sciences and Technologies, vol. 11, no. 1, pp. 15–30, 2011.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, “The Google file system,” ACM SIGOPS Operating Systems Review, vol. 37,
no. 5, p. 29, Dec. 2003.
[3] D. Carstoiu, A. Cernian, and A. Olteanu, “Hadoop Hbase-
0.20.2 Performance Evaluation,” in New Trends in
Information Science and Service Science NISS, 2010, pp. 84–87.
[4] D. Carstoiu, E. Lepadatu, and M. Gaspar, “Hbase - non SQL Database , Performances Evaluation,” International Journal
of Advancements in Computing Technology, vol. 2, no. 5, pp.
42–52, 2010.
[5] L. Liu and J. Gai, “Bloom Filter Based Index for Query over
Encrypted Character Strings in Database,” in 2009 WRI World Congress on Computer Science and Information Engineering,
2009, pp. 303–307.
[6] H. Hacigümüş, B. Iyer, C. Li, and S. Mehrotra, “Executing
SQL over encrypted data in the database-service-provider
model,” in Proceedings of the 2002 ACM SIGMOD international conference on Management of data -
SIGMOD ’02, 2002, p. 216.
[7] Z.-F. Wang, J. Dai, B. Shi, and W. Wang, “Fast Query over
Encrypted Character Data in Database,” Communication in
Information and Systems, vol. 4, no. 4, pp. 289–300, 2004.
[8] Y. Zhang, W. Li, and X. Niu, “A Method of Bucket Index
over Encrypted Character Data in Database,” in Third International Conference on Intelligent Information Hiding
and Multimedia Signal Processing (IIH-MSP 2007), 2007,
vol. 1, pp. 186–189.
[9] Y. Zhang, W. Li, and X. Niu, “Secure cipher index over
encrypted character data in database,” in 2008 International Conference on Machine Learning and Cybernetics, 2008, no.
July, pp. 1111–1116.
[10] M. Mitzenmacher, “Compressed Bloom filters,” IEEE/ACM
Transactions on Networking, vol. 10, no. 5, pp. 604–612, Oct.
2002.
[11] N. Liu, Y. Zhou, X. Niu, and Y. Yang, “Querying Encrypted
Character Data in DAS Model,” in 2010 International
Conference on Networking and Digital Society, 2010, pp.
402–405.
[12] A. Rehman and M. Hussain, “Efficient Cloud Data
Confidentiality for DaaS,” International Journal of Advanced
Science and Technology, vol. 35, pp. 1–10, 2011.
[13] M. Alhanjouri and A. M. Al Derawi, “A New Method of
Query over Encrypted Data in Database using Hash Map,” International Journal of Computer Applications, vol. 41, no.
4, pp. 46–51, Mar. 2012.
[14] L. Liu and J. Gai, “A Method of Query over Encrypted Data
in Database,” in 2009 International Conference on Computer Engineering and Technology, 2009, pp. 23–27.
[15] J. Cryans, A. April, and A. Abran, “Criteria to Compare Cloud Computing with Current Database Technology,”
Proceedings of the International Conferences on Software
Process and Product Measurement, vol. 5338, no. 3, pp. 114–126, 2008.
[16] L. George, HBase The Definitive Guide. O’Reilly Media, 2011, p. 556.
[17] F. Mendel, C. Rechberger, and M. Schl, Advances in Cryptology – ASIACRYPT 2009, vol. 5912. Berlin,
Heidelberg: Springer Berlin Heidelberg, 2009, pp. 144–161.
[18] M. Stevens, “On collisions for MD5,” Eindhoven University
of Technology, 2007.
[19] National Institute of Standards and Technology, “Announcing the Advanced Encryption Standard (AES),” Federal Information Processing Standards Publication 197, 2001.
[20] P. Hamalainen, T. Alho, M. Hannikainen, and T. D.
Hamalainen, “Design and Implementation of Low-Area and Low-Power AES Encryption Hardware Core,” in 9th
EUROMICRO Conference on Digital System Design
(DSD’06), 2006, pp. 577–583.