Covert channel detection in VoIP streams
Gonzalo Garateguy
Department of Electrical and
Computer Engineering
University of Delaware
Newark, DE 19711
Gonzalo R. Arce
Department of Electrical and
Computer Engineering
University of Delaware
Newark, DE 19711
Juan Pelaez
U.S. Army Research Laboratory
Adelphi, MD 20783
Abstract: This paper presents two approaches to detect VoIP covert channel communications using compressed versions of the data packets. Both approaches are based on specialized random projection matrices that take advantage of prior knowledge about the normal traffic structure. The reduction scheme relies on the assumption that normal traffic packets either belong to a subspace of smaller dimension or can be enclosed in a convex set. We show that, by incorporating this information in the design of the random projection matrices, the detection of anomalous traffic packets can be performed in the compressed domain with only a slight performance loss with respect to the uncompressed domain. The detection algorithm is validated on real data captured on a test bed designed to that end.
I. INTRODUCTION
VoIP is one of the most popular services in IP networks and
is being used not only at the user level but also for inter and
intra company communications as well as to replace traditional
analog land lines. With the increase in traffic volume due
to VoIP services, its suitability for steganographic
purposes has become an important threat to network security.
Recent studies [1]–[4] have proposed many techniques to
disguise steganographic information in media and call
signaling protocols, showing a surprising capacity to
exfiltrate large amounts of information. Considering the
inevitable convergence of voice, video and data communications
in both commercial and tactical environments, new techniques
to uncover VoIP covert channels are of high interest to prevent
exfiltration of sensitive information. Among all the protocols
used in VoIP communications (e.g. SIP, SDP, RTP, RTCP,
etc) media protocols represent the biggest threat to network
security. In an average call, media traffic packets correspond
to 99% of the total number of packets transmitted. The method
proposed in this paper focuses on the analysis of the RTP
media transport protocol but the same technique can be applied
to other signaling and control protocols. Since a typical VoIP
call is in the order of minutes, considerable amounts of data
have to be analyzed to effectively detect potential covert
channel communications. In this context, an in-depth inspection
of the packets becomes a very computationally intensive task.
The analysis of packets can degrade the quality of a VoIP
call or even make the communication impractical due to the
sensitivity of voice traffic to latency. Our proposed solution
uses compressive sensing techniques to acquire sketches of the
data packets and then performs the detection and classification
in the compressed domain, thus reducing processing time and
storage requirements. We present a procedure to design
specialized random matrices, as in [5], [6], which take advantage
of prior knowledge about the normal traffic structure. Once the
data dimensionality is reduced, classification is performed using
support vector machines trained on samples from both
anomalous and normal data. We show that classification using
support vector machines performs well at high
compression ratios, as stated in [7].

Fig. 1. Test bed used to capture the VoIP traffic data (figure labels: SIP Signaling Server (OpenSIPS), SIP client A, SIP client B, Sniffing station (wireshark), 10.43.42.2/24).
II. DATA CAPTURE AND FORMATTING
The traffic used to test the detection algorithms was generated
in a test bed built to that end. A diagram of the setup
is depicted in Figure 1. The test bed consists of a signaling
server running OpenSIPS [8], two clients capable
of running several types of softphones in both Windows
and Linux environments, and a sniffing station used to capture
signaling and media traffic. The sniffing station runs Wireshark,
which includes a module to detect SIP signaling and
identify the individual RTP streams associated with each of the
active VoIP calls. The VoIP analysis module of Wireshark
can extract all the packets associated with a call and save
them in the RTPdump format specified in [9]. The client
stations are dual-boot machines capable of running the Windows
and Linux operating systems; the clients used are the X-Lite
softphone for Windows and the Twinkle softphone for Linux. In
addition, an exfiltration client is installed at Client A,
allowing it to transmit non-voice data to Client B. These packets
are considered attacks and are the ones we try to identify.

978-1-4244-9848-2/11$26.00©2011 IEEE

Fig. 2. RTP packet format

bit offset   0-1    2   3   4-7   8   9-15   16-31
0            Ver.   P   X   CC    M   PT     Sequence Number
32           Timestamp
64           SSRC identifier
96           CSRC identifiers (optional)
             RTP header extension (optional)
             Payload
             RTP padding | RTP count
             SRTP master key identifier (MKI, optional)
             Authentication tag
A. Data matrix formation
The data used in the simulations is prepared as follows. The
stream of captured packets is divided into groups, and each
group of packets is arranged as a column of the data matrix
D. Since the size of the packets might change during a single
call, we group a number of packets that does not exceed a
preset number m of bytes per column. If the number of bytes
exceeds this limit, the last packet in the group is mapped
onto the following column and the remaining entries of the
present column are filled with 0. In this way the data matrix D
always has dimensions m × n, where m is fixed and n
changes according to the number of groups formed from the
captured packets. The standard fields of the RTP packet headers
(see Figure 2) are mapped to the beginning of the columns.
Each of the remaining bytes in the packet (including the payload)
is mapped to one entry immediately after the headers (see
Figure 3). All these values are normalized to the interval [0, 1]
by dividing them by 255. Given the format of
the RTP packets, most of the fields take values in a small
set: Ver ∈ {2}, P ∈ {0, 1}, X ∈ {0, 1}, CC ∈ {0, ..., 15}.
Some fields increase monotonically from one packet to the next,
for example the Timestamp and Sequence
number fields, and some remain fixed for the whole call, such as
the SSRC field. Since the Sequence number, Timestamp
and SSRC fields take large values with respect to the other
fields, only the difference between the present and the previous
packet is stored in the data matrix.
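The grouping rule above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' code: each packet is assumed to be already reduced to a list of byte-valued entries (the hypothetical `packets` argument), packets are concatenated in order rather than with all headers placed first as in Figure 3, and the differencing of the Sequence number, Timestamp and SSRC fields is omitted.

```python
import numpy as np

def build_data_matrix(packets, m):
    """Pack per-packet byte lists into zero-padded columns of length m.

    packets: list of sequences of ints in 0..255 (hypothetical input format).
    m:       preset number of rows (bytes) per column.
    Returns an (m, n) array scaled to [0, 1].
    """
    columns, current = [], []
    for pkt in packets:
        pkt = list(pkt)
        if len(pkt) > m:
            raise ValueError("a single packet does not fit in one column")
        if len(current) + len(pkt) > m:
            # Packet would overflow: zero-fill this column, start the next.
            columns.append(current + [0] * (m - len(current)))
            current = []
        current.extend(pkt)
    if current:
        columns.append(current + [0] * (m - len(current)))
    if not columns:
        return np.zeros((m, 0))
    # Each group becomes one column; bytes are divided by 255.
    return np.array(columns, dtype=float).T / 255.0
```

With `m = 5`, two packets of 3 and 2 entries share the first column, and a following packet of 4 entries starts the second column, whose last entry is zero-filled.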
B. Normal and anomalous data
The exfiltration attacks were carried out using an attack
tool developed by Salare Security [10]. This tool injects the
steganographic content in the payload of the RTP packets
while keeping the header fields at their typical
values. Different types of data were exfiltrated, namely JPG
images, PDF files and text files. The codec declared in the headers
of the exfiltration packets was G.711, which was also the codec
used in the normal traffic generated. Several minutes of normal
traffic calls were recorded using speech audio content.

Fig. 3. Mapping of the stream of RTP packets to the data matrix (m rows, n columns); in each column the headers occupy nh·L rows and the payloads (np − nh)·L rows. nh is the number of fields in the header that are mapped at the beginning of each column, np is the maximum number of bytes per packet allowed and L is the number of packets per column. If a group of L packets exceeds np bytes the last one is mapped to the next column.
III. DIMENSIONALITY REDUCTION AND SIGNAL SEPARATION
In the proposed algorithm the data used to classify the
traffic is sampled and compressed while taking advantage
of prior knowledge about the normal signal structure.
Random projections have shown a great capacity to capture the
fundamental characteristics of signals that have considerable
structure, making it possible to perform filtering, detection and
classification in a dimensionally reduced space. It has been shown
that the loss incurred by dimensionality reduction via random
projections, with respect to classification in the high-dimensional
space, can be bounded with arbitrary precision as
long as the dimension of the reduced space and the
random matrices are designed properly [11]–[14]. Moreover,
if prior information about the normal behaviour of the
signal is known (e.g. a basis of a subspace containing the normal
signals), the performance of classification and detection can
be further improved. If the data is represented as a vector
$x \in \mathbb{R}^N$ (i.e. one of the columns of our data matrix), the
random projections are calculated as $y = \Phi x$, with $\Phi$ being
a random matrix of size $M \times N$, $M \ll N$. In this work we
take advantage of the prior knowledge about the normal traffic
structure in the design of the random matrix $\Phi$. We consider
two basic models: the first is the subspace or affine space
model and the second is a convex set covering model.
In the first case we assume that normal traffic belongs to a
particular subspace S of smaller dimension than the ambient
space $\mathbb{R}^N$, and that anomalous vectors lie in a different
subspace or affine space which may have a small intersection
with the normal traffic space. In the second case, the model
assumes that there exists a convex set that contains only normal
traffic vectors and excludes all the anomalous vectors.
A. Subspace model
Under the subspace model the random matrix $\Phi$ is designed
as the composition of two linear transformations. The first
transformation projects the data vectors onto $S^\perp$, and then
a random matrix is used to reduce the dimensionality. The
orthogonal projection matrix is defined as
$P_{S^\perp} = I - B(B^T B)^{-1} B^T$, where B is a matrix whose columns span
the subspace S. The random projection is performed by means
of a random matrix G of dimensions $M \times N$, with $M < N$,
whose entries are distributed as
$$
g_{i,j} =
\begin{cases}
+\sqrt{3/M} & \text{with probability } 1/6 \\
0 & \text{with probability } 2/3 \\
-\sqrt{3/M} & \text{with probability } 1/6
\end{cases}
\tag{1}
$$
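As a side calculation that is not part of the original derivation, the scaling in (1) can be checked directly. Assuming the entries of G are drawn independently, the entries have zero mean and variance 1/M, so G preserves squared norms on average:

```latex
\begin{align*}
\mathbb{E}[g_{i,j}] &= \tfrac{1}{6}\sqrt{3/M} + \tfrac{2}{3}\cdot 0 - \tfrac{1}{6}\sqrt{3/M} = 0, \\
\mathbb{E}[g_{i,j}^2] &= \tfrac{1}{6}\cdot\tfrac{3}{M} + \tfrac{1}{6}\cdot\tfrac{3}{M} = \tfrac{1}{M}, \\
\mathbb{E}\,\|Gx\|_2^2 &= \sum_{i=1}^{M}\sum_{j=1}^{N} \mathbb{E}[g_{i,j}^2]\, x_j^2 = \|x\|_2^2 .
\end{align*}
```

The cross terms in the last line vanish because independent zero-mean entries are uncorrelated.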
The specialized random matrix for the subspace model
is then $\Phi = G P_{S^\perp}$. Matrices of this type are
known to satisfy the restricted isometry property (RIP) if the
dimension of the subspace $S^\perp$ is small enough [15], which
ensures that distances are approximately preserved in the
low-dimensional space. To estimate
a basis of the subspace we use a training sample $\{x_i\}_{i=1}^T$,
with $T > N$, taken from a segment of normal traffic. Before
the estimation, the sample mean is subtracted,
$x_i \leftarrow x_i - \mu$, where $\mu = \frac{1}{T}\sum_{i=1}^{T} x_i$. We use a variation of
MacQueen's algorithm [16] to train a set of $N_q$ points, with
$N_q \le N$, that span the space of normal traffic. The
algorithm is the following.
1) Take $N_q$ points at random from the training sample
$A = \{x_i\}_{i=1}^T$ and denote that set by $Q = \{z_i\}_{i=1}^{N_q}$.
2) Initialize the indices $j_i = 1$ for all $i = 1, \dots, N_q$.
3) Take a new point x from $A \setminus Q$.
4) Find the $z_i$ that is closest to x in the $L_1$ norm and update
it by $z_i \leftarrow (j_i z_i + x)/(j_i + 1)$.
5) Update the $j_i$ associated with that $z_i$ by setting $j_i \leftarrow j_i + 1$.
6) Repeat steps 3 to 5 until there are no remaining points
in $A \setminus Q$.
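The six steps above can be sketched in Python with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `macqueen_quantizer`, the column-wise layout of the training sample and the `seed` parameter are assumptions made for the example.

```python
import numpy as np

def macqueen_quantizer(X, n_q, seed=0):
    """One pass of the MacQueen-style update over the training columns.

    X:   array of shape (N, T); each column is one mean-removed sample.
    n_q: number of quantization points to train.
    Returns Z of shape (N, n_q) whose columns are the trained points.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[1])
    # Step 1: n_q randomly chosen training points form the initial set Q.
    Z = X[:, order[:n_q]].astype(float)
    # Step 2: one counter j_i per quantization point.
    counts = np.ones(n_q)
    # Steps 3 and 6: visit every remaining training point exactly once.
    for t in order[n_q:]:
        x = X[:, t]
        # Step 4: nearest point in L1 norm, then running-mean update.
        i = np.argmin(np.abs(Z - x[:, None]).sum(axis=0))
        Z[:, i] = (counts[i] * Z[:, i] + x) / (counts[i] + 1)
        # Step 5: increment the counter of the updated point.
        counts[i] += 1
    return Z
```

Because every update is a weighted average of vectors already in the sample, the trained points never leave the span of the training data, which is the property the subspace estimate relies on.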
This algorithm is similar to the k-means algorithm for vector
quantization and allows us to find a vector quantizer matched
to the distribution of the data for a given distance metric.
In particular, since our data is contained in a
subspace and each update is a weighted average of points of
that subspace, the quantization points are also contained in
it. If the number of quantization vectors $N_q$ is
equal to the dimension of the subspace S and the resulting
vectors are linearly independent, they form a basis of the
subspace, while if $N_q$ is smaller than that dimension the
subspace they span is contained in S. Changing the number
of vectors $N_q$ allows us to control the dimension of the
spanned subspace and therefore the sparsity of the vectors
$P_{S^\perp} x$. The drawback of choosing $N_q$
smaller than the dimension of the subspace is that some of
the normal traffic components will appear in the projection,
reducing the maximum achievable compression ratio. One way
to estimate the dimension of the subspace is to set $N_q = N$
and compute the singular value decomposition (SVD) of the
matrix Z whose columns are the elements of $Q = \{z_i\}_{i=1}^{N_q}$. Then select
$N_q$ such that $1 - \sum_{i=1}^{N_q}\sigma_i(Z) / \sum_{i=1}^{N}\sigma_i(Z) \approx 0.001$, where
$\{\sigma_1(Z), \dots, \sigma_N(Z)\}$ are the singular values of Z. With this
value of $N_q$, run the clustering algorithm again to obtain a
spanning set of S. By orthonormalizing the elements of Q to
form the columns of B, we can
simplify the computation of the orthogonal projection matrix
to $P_{S^\perp} = I - BB^T$.
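The construction of the operator for the subspace model can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the singular-value rule is read as "the smallest count whose residual share is at most 0.001", QR factorization is used as one possible way to orthonormalize, and the matrix `Z` below is random stand-in data rather than trained quantization points.

```python
import numpy as np

def sparse_random_matrix(M, N, rng):
    """Draw G with entries +-sqrt(3/M) (prob. 1/6 each) or 0 (prob. 2/3)."""
    values = np.array([np.sqrt(3.0 / M), 0.0, -np.sqrt(3.0 / M)])
    return rng.choice(values, size=(M, N), p=[1 / 6, 2 / 3, 1 / 6])

def select_nq(Z, tol=1e-3):
    """Smallest number of singular values holding a 1 - tol share of their sum."""
    s = np.linalg.svd(Z, compute_uv=False)
    share = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(share, 1.0 - tol) + 1)

def complement_projector(Z):
    """Return I - B B^T, with B an orthonormal basis of span(Z)."""
    # Reduced QR orthonormalizes the columns of Z; this assumes the
    # columns are linearly independent.
    B, _ = np.linalg.qr(Z)
    return np.eye(Z.shape[0]) - B @ B.T

rng = np.random.default_rng(0)
N, M = 8, 3
Z = rng.standard_normal((N, 2))       # stand-in for the trained points
P = complement_projector(Z)
Phi = sparse_random_matrix(M, N, rng) @ P   # Phi = G P
```

Any vector in the span of `Z` is mapped to (numerically) zero by `Phi`, so only components outside the estimated normal subspace survive the compression.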
B. Convex set covering model
Even though the subspace model appears to be a good
model for the type of VoIP traffic tested in our experiments,
it may happen that attacks and normal traffic lie in
the same or almost the same subspace but are still separable.
If we can find a convex set that includes most of the normal
vectors while keeping the anomalous vectors outside the set,
we can take advantage of it to improve the performance of
the classifier in the compressed domain. In a similar fashion
to the projection onto a subspace, we can define
the projection onto the orthogonal complement of the convex
set as $P_{C^\perp}(x) = x - P_C(x)$, where $P_C(x) = \arg\min_{y \in C} \|x - y\|_2$.
The calculation of the projection $P_C(x)$ for a general convex
set can be very complex; however, there are some sets, such as
half-spaces, balls or ellipsoids, for which it can be easily
defined. In this work we use a closed ellipsoidal convex set
oriented along the principal vectors of a training sample, with
scale parameters along those axes given by the singular values
of the sample.
From the training sample of normal traffic $\{x_i\}_{i=1}^T$ we form
a matrix $A \in \mathbb{R}^{N \times T}$ whose columns are the elements of the
sample. The SVD of this matrix is $A = U \Sigma V^T$,
where U and V are square unitary matrices and $\Sigma$ is a
rectangular matrix with nonzero elements only on the diagonal.
The principal directions along which the data is distributed
are given by the columns of U, while the scales along each
of the directions are given by the associated singular values
$\sigma_i(A)$. Without loss of generality we can assume that $N < T$
and then form the matrix $\Sigma_N$ by eliminating the last $T - N$
columns of $\Sigma$. If we also restrict the matrix V in the same
manner to obtain $V_N$, we have $A = U \Sigma_N V_N^T$. Based on
this restricted representation, the projection onto the ellipsoid
is defined by
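$$
P_{E_\rho}(x) = U \Sigma_N^{1/2}\, P_{B_\rho}\!\left(\Sigma_N^{-1/2} U^T x\right) \tag{2}
$$
where the operator $P_{B_\rho}$ is the projection onto the centered
ball of radius $\rho$ in $\mathbb{R}^N$,
$$
P_{B_\rho}(x) =
\begin{cases}
x, & \text{if } \|x\|_2 \le \rho \\
\rho\, \dfrac{x}{\|x\|_2}, & \text{if } \|x\|_2 > \rho.
\end{cases}
\tag{3}
$$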
The projection defined in equation (2) is the composition
of several transformations. First we rotate the vector x by
means of the unitary matrix $U^T$, which aligns the principal
vectors with the coordinate axes. Then we rescale the vector by
means of $\Sigma_N^{-1/2}$. The composition of these two transformations
maps vectors inside the ellipsoid to vectors inside a ball,
which allows us to use $P_{B_\rho}$ to compute the projection onto
a ball and then invert the scaling and the rotation. Clearly the
size of the ellipsoid is controlled by the parameter $\rho$. In the
following we will see that a correct selection of this parameter
is fundamental to the performance of the classifiers as the
compression ratio increases.
Since the matrix A is usually low rank, say $\mathrm{rank}(A) = r < \min(N, T)$,
there are $N - r$ elements that are zero or
near zero on the diagonal of $\Sigma_N$. This presents a problem in
the calculation of $\Sigma_N^{-1/2}$, but it can be easily overcome by
replacing $\Sigma_N$ with $\Sigma_N + \delta I$, with $\delta > 0$ being a
small value.
After calculating the projection onto the orthogonal complement
of the convex set, the dimensionality is reduced using
a random matrix of the class defined in equation (1). The final
dimensionality reduction operator is $\Phi(x) = G\, P_{E_\rho^\perp}(x)$,
with $P_{E_\rho^\perp}(x) = x - P_{E_\rho}(x)$.
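The ellipsoid operator of equations (2) and (3) can be sketched in NumPy as follows. This is an illustration under assumptions rather than the authors' code: the function names and the default value of `delta` are hypothetical, and the training matrix is assumed to have more columns than rows (N < T) so that U is square.

```python
import numpy as np

def fit_ellipsoid(A, delta=1e-6):
    """Precompute the two matrices of eq. (2) from training columns A (N x T)."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    s = s + delta                     # regularization Sigma_N + delta * I
    to_ball = (U / np.sqrt(s)).T      # Sigma_N^{-1/2} U^T
    from_ball = U * np.sqrt(s)        # U Sigma_N^{1/2}
    return to_ball, from_ball

def project_ball(v, rho):
    """Projection onto the centered ball of radius rho, eq. (3)."""
    norm = np.linalg.norm(v)
    return v if norm <= rho else rho * v / norm

def ellipsoid_residual(x, to_ball, from_ball, rho):
    """Return x - P_E(x), the component of x left outside the ellipsoid."""
    return x - from_ball @ project_ball(to_ball @ x, rho)
```

The compressed measurement is then `G @ ellipsoid_residual(x, to_ball, from_ball, rho)` for a random matrix `G` of the class in equation (1). With `rho = 0` the ellipsoid collapses to the origin and the residual is x itself, while vectors inside the ellipsoid give a residual of zero.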
C. Classification
Our main goal is to determine the feasibility of
classifying the data packet sketches given that we have a
priori information about the structure of normal and anomalous
traffic. To that end, we employ compressed training samples
from normal and anomalous traffic to determine the best
separating surface between the two classes. A simple and
fast classifier for this purpose is a linear-kernel support
vector machine.
Even though other kernels might yield better results, we
leave this analysis for future work and focus on the basic
linear kernel here. Given a training sample of normal traffic
$\{x_i^n\}_{i=1}^T$ and a sample of attacks $\{x_i^a\}_{i=1}^T$, both with elements
in $\mathbb{R}^N$, we obtain the compressed training samples $\{y_i^n\}_{i=1}^T$
and $\{y_i^a\}_{i=1}^T$ by applying the dimensionality reduction operator
to the original elements: $y_i^n = G P(x_i^n)$ and $y_i^a = G P(x_i^a)$.
Here, the operator $P(\cdot)$ refers either to the projection onto the
orthogonal subspace $S^\perp$ or to the projection onto the orthogonal
complement of the convex set, $C^\perp$. Additionally, we associate
the labels $l_i^n = 1$ with the samples in the first group and the
labels $l_i^a = -1$ with the samples in the second group, forming the
sets of labeled and compressed training samples $\{(y_i^n, l_i^n)\}_{i=1}^T$
and $\{(y_i^a, l_i^a)\}_{i=1}^T$. The optimal separating hyperplane is given
by two parameters, the normal vector to the plane w and a
scalar bias b. These parameters are the solution of
the following problem
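$$
\begin{aligned}
\underset{w,\, b,\, \xi}{\text{minimize}} \quad & \tfrac{1}{2} w^T w + C \sum_{i=1}^{T} \left(\xi_i^n + \xi_i^a\right) \\
\text{subject to} \quad & l_i^n (w^T y_i^n + b) \ge 1 - \xi_i^n, \quad i = 1, \dots, T \\
& l_i^a (w^T y_i^a + b) \ge 1 - \xi_i^a, \quad i = 1, \dots, T \\
& \xi_i^n \ge 0, \ \xi_i^a \ge 0
\end{aligned}
\tag{4}
$$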
which finds the hyperplane with maximal margin and minimal
misclassification over the selected training samples. The
classification function based on this hyperplane is given by
$g(x) = \mathrm{sign}(\langle w, x \rangle + b)$.

Fig. 4. Average probability of correct classification (vertical axis, 0.82 to 1) versus compression ratio in % (horizontal axis, 10 to 90) for Nq = 315, 300, 280, 150 and 40. The different graphs present the results using different numbers of vectors Nq to estimate a basis of the subspace S. Each point was calculated by averaging the results of 10 different classifiers learned using different training samples. The compression ratio is the quotient between the number of rows and the number of columns of the random matrix G used in that series of experiments.
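A minimal sketch of training and applying the linear classifier on compressed samples is given below. The paper does not state which solver was used; this example swaps in plain subgradient descent on the equivalent hinge-loss form of problem (4), and the names `train_linear_svm`, `classify`, `epochs` and `lr` are hypothetical.

```python
import numpy as np

def train_linear_svm(Y, labels, C=1.0, epochs=200, lr=0.01):
    """Minimize 0.5 * w.w + C * sum(hinge loss) by subgradient descent.

    Y:      array (n_samples, M); one compressed sample y = Phi x per row.
    labels: array (n_samples,) with +1 (normal) or -1 (anomalous).
    """
    w = np.zeros(Y.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = labels * (Y @ w + b)
        active = margins < 1            # samples with nonzero slack
        signed = labels[active][:, None] * Y[active]
        grad_w = w - C * signed.sum(axis=0)
        grad_b = -C * labels[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def classify(y, w, b):
    """Decision rule g(y) = sign(<w, y> + b)."""
    return np.sign(y @ w + b)
```

At run time only the product `Phi @ x` and one inner product are needed per data vector, which is what makes classification in the compressed domain inexpensive.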
IV. EXPERIMENTAL RESULTS
To test the two approaches for compressed classification
we recorded several minutes of conversations in our test bed.
The anomalous packets were generated by injecting .pdf, .txt
and .jpg files in the payloads of the RTP packets. The data
matrices were then generated from these streams as described
in Section II-A. The number of rows in the matrix required
to store a normal traffic packet is 9 for the header fields
and 160 for the payload bytes, while an anomalous packet requires
9 for the header and 161 for the payload. Taking that into
account, we set the number of rows of the data matrices to
340, which suffices to accommodate 2 normal or 2 anomalous
packets. As a consequence of the different packet lengths, normal
and anomalous data matrices differ in the last 2 rows, with the
normal data matrix having lower rank than the anomalous data
matrix. For this reason we generate 2 different data sets: the
first one corresponds to the original data matrices with 340
rows and the second one to the restriction of
these matrices to the first 338 rows. The classification approach
based on the subspace model was tested on the first data set,
while the convex set covering approach was tested on the
second data set. Both data sets are available for download at
http://www.ece.udel.edu/~garategu/CISS2011-data/.
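The row counts follow from simple arithmetic, assuming each column holds L = 2 packets as stated above:

```latex
\begin{align*}
\text{normal column:} \quad & 2 \times (9 + 160) = 338 \text{ rows used}, \\
\text{anomalous column:} \quad & 2 \times (9 + 161) = 340 \text{ rows used}.
\end{align*}
```

A normal column therefore leaves the last 340 − 338 = 2 rows zero-filled, which is why the two data matrices differ only in those rows.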
Figure 4 depicts the results of the classification using the
subspace model on the first data set. The probability of correct
detection for each level of compression was calculated by
averaging the correct classification rate of 10 different
classifiers trained from 1000 normal and anomalous samples. For
each of the classifiers we calculate the rate of correct
classification using a sample of 4500 points different from the
ones used in the training stage. It can be seen that the
performance of the classification decreases with the compression
rate if the dimension of the estimated subspace is smaller than
the true dimension, which is 321 in this case. When the number of
vectors in the basis approaches the dimension of the subspace,
all the normal vectors are mapped to 0 by $P_{S^\perp}$ while the
anomalous vectors still have components outside the subspace.
The good performance achieved here can be attributed to the
fact that the subspace assumption clearly holds: normal traffic
vectors have the last 2 components equal to 0 while anomalous
vectors do not.

Fig. 5. Average probability of correct classification (vertical axis, 0.82 to 1) versus compression ratio in % (horizontal axis, 10 to 90) for Nq = 2, 10, 30, 90 and 180. The graphs show the classification performance for the subspace model in the second data set, varying the level of compression and the number of vectors Nq used in the basis estimation.
In the case of the second data set, the problem is more
challenging since it is not obvious that the anomalous and
normal vectors belong to different subspaces. Figure 5 shows
the results of repeating the experiment of Figure 4, but
this time using the second data set. We can see that the
projection onto the orthogonal subspace actually degrades the
average performance of the classifier as $N_q$ approaches the
dimensionality of the normal traffic subspace. These results
confirm that the subspace model assumption does not
hold for this data set. If we use the convex set covering model
instead (see Figure 6), the classification accuracy improves but
depends strongly on the selection of the parameter $\rho$, which
defines the size of the ellipsoid. In this simulation the average
probability of correct classification was calculated by averaging
the results of 10 different classifiers for each compression
ratio.
V. DISCUSSION AND CONCLUSIONS
We have presented two simple methods for the classification
of VoIP data packets that take advantage of knowledge
about the structure of normal traffic. The subspace model
method has the advantage of being very simple and yields
excellent performance if the vectors are clearly separable. It
only requires the multiplication of one column of the data
matrix by the projection matrix $\Phi$ and the computation
of the discriminant function $g(\Phi x)$ to label each new data
vector. The computation of $\Phi$ involves the estimation of a basis
of the normal subspace, but this operation can be completed
offline and repeated periodically to
account for variations in the normal traffic structure. On
the other hand, if the anomalous traffic packets are in the
same or nearly the same subspace as the normal traffic, the
model becomes too broad (see Figure 5) and we actually lose
separability by using the subspace information. The convex set
covering model, in contrast, is more powerful, but at the
same time requires more computation in the high-dimensional
space. Even though the matrices $U \Sigma_N^{1/2}$ and $\Sigma_N^{-1/2} U^T$ can
be precomputed offline from a training sample of normal
traffic, the projection operator requires the computation of the
norm of each data vector, a comparison operation and possibly
one multiplication of a scalar by a vector before multiplying
by the random matrix G. We think that this work shows
promising results and demonstrates that the incorporation of
prior knowledge makes it possible to compress network traffic
data while keeping relevant information that can be used for
classification, detection or analysis of statistical behaviour.
Future directions of this research include the incorporation of
non-linear kernels in the support vector machine, which might
help to improve separability between classes. Another possibility
to improve the separability is simply to increase the dimension
of the data vectors. Augmenting the columns of the data matrix
may be sufficient to separate the subspaces enough that the
subspace model can be easily applied.

Fig. 6. Average probability of correct classification (vertical axis, 0.82 to 1) versus compression ratio in % (horizontal axis, 10 to 90) for ρ = 0, 0.4, 0.66667, 0.8 and 1.0667. The graphs show the classification performance for different scales of the ellipsoid used to calculate the projections, using the second data set.
REFERENCES
[1] T. Takahashi and W. Lee, “An assessment of VoIP covert channel threats,” in Security and Privacy in Communications Networks and the Workshops, 2007. SecureComm 2007. Third International Conference on, pp. 371–380, 2007.
[2] J. Lubacz, W. Mazurczyk, and K. Szczypiorski, “Vice over IP,” IEEE Spectrum, vol. 47, no. 2, pp. 42–47, 2010.
[3] J. Lubacz, W. Mazurczyk, and K. Szczypiorski, “Hiding data in VoIP,” in Proceedings of the Army Science Conference (26th), 2008.
[4] W. Mazurczyk and K. Szczypiorski, “Steganography of VoIP streams,” in On the Move to Meaningful Internet Systems: OTM 2008, pp. 1001–1018, Springer, 2008.
[5] Z. Wang, J. Paredes, and G. R. Arce, “Adaptive subspace compressed detection of sparse signals,” submitted for publication, 2010.
[6] J. Paredes, Z. Wang, G. Arce, and B. Sadler, “Compressive matched subspace detection,” European Signal Processing Conf., 2009.
[7] R. Calderbank, S. Jafarpour, and R. Schapire, “Compressed learning: Universal sparse dimensionality reduction and learning in the measurement domain,” http://dsp.rice.edu/files/cs/cl.pdf, 2009.
[8] OpenSIPS, available at http://www.opensips.org/.
[9] RTPdump, “Format specification,” available at http://www.cs.columbia.edu/irt/software/rtptools/.
[10] Salare-Security webpage, http://www.salaresecurity.com/.
[11] M. Davenport, M. Wakin, and R. Baraniuk, “Detection and estimation with compressive measurements,” Dept. of ECE, Rice University, Tech. Rep., 2006.
[12] D. L. Donoho, “Compressed sensing,” IEEE Transactions on Information Theory, vol. 52, no. 4, pp. 1289–1306, 2006.
[13] M. Duarte, M. Davenport, M. Wakin, and R. Baraniuk, “Sparse signal detection from incoherent projections,” IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May 2006, pp. 305–308, 2006.
[14] J. Haupt, R. Castro, R. Nowak, G. Fudge, and A. Yeh, “Compressive sampling for signal classification,” Signals, Systems and Computers, 2006. ACSSC’06. Fortieth Asilomar Conference on, pp. 1430–1434, 2006.
[15] R. Baraniuk, M. Davenport, R. DeVore, and M. Wakin, “A simple proof of the restricted isometry property for random matrices,” Constructive Approximation, vol. 28, no. 3, pp. 253–263, 2008.
[16] Q. Du and T.-W. Wong, “Numerical studies of MacQueen’s k-means algorithm for computing the centroidal Voronoi tessellations,” Computers and Mathematics with Applications, vol. 44, no. 3, pp. 511–523, 2002.