Opportunistic Spectrum Access for Energy-constrained
Cognitive Radios
Anh Tuan Hoang, Ying-Chang Liang, David Tung Chong Wong,
Yonghong Zeng, and Rui Zhang
Institute for Infocomm Research, Singapore
Abstract
This paper considers a scenario in which a secondary user makes opportunistic use of a channel allocated
to some primary network. The primary network operates in a time-slotted manner and switches between idle and
active states according to a stationary Markovian process. At the beginning of each time slot, the secondary user
can choose to stay idle or to carry out spectrum sensing to detect if the primary network is idle or active. If the
primary network is detected as idle, the secondary user can carry out data transmission. Spectrum sensing consumes
time and energy and introduces false alarms and mis-detections. Given the delay cost associated with staying idle,
the energy costs associated with spectrum sensing and data transmission, and the throughput gain associated with
successful transmissions, the objective is to decide, for each time slot, whether the secondary user should stay idle
or carry out sensing, and if so, for how long, to maximize the expected net reward. We formulate this problem
as a partially observable Markov decision process (POMDP) and prove several structural properties of the optimal
spectrum sensing/accessing policies. Based on these properties, heuristic control policies with low complexity and
good performance are proposed.
I. INTRODUCTION
The traditional approach of fixed radio spectrum allocation leads to under-utilization. It has been reported
in recent studies by the US Federal Communications Commission (FCC) that there are vast temporal and
spatial variations in the usage of allocated spectrum [1]. This motivates the concept of opportunistic
spectrum access (OSA), which allows secondary cognitive radio (CR) systems to opportunistically exploit
the under-utilized spectrum.
One of the core components of an OSA system is the spectrum-sensing module, which examines a
spectrum of interest to determine whether the primary network which owns the spectrum is currently
active or idle. Spectrum sensing, therefore, is a binary hypothesis test or a series of binary hypothesis
tests. Spectrum sensing can have several effects on the OSA system, which are highlighted as follows.
• Energy consumption: carrying out spectrum sampling and subsequent signal processing consumes
energy. In general, the longer the sensing time, the more energy is consumed.
• Time consumption: to avoid interfering with its own sensing process, the OSA system may need to
suspend its data communications when carrying out spectrum sensing. This means spectrum sensing
consumes communication time.
• False alarms: a false alarm occurs when the spectrum sensing module mistakes an idle primary
network as active. This leads to a spectrum access opportunity being overlooked.
• Mis-detections: a mis-detection occurs when the spectrum sensing module mistakes an active primary
network as idle. This leads to possible collision between secondary and primary transmissions.
In this paper, we take the above effects into account when designing an energy-constrained OSA system.
A. General Control Problem
We consider a scenario in which a secondary user makes opportunistic use of a channel allocated to
some primary network. The primary network operates in a time-slotted manner and switches between
idle and active states according to a stationary Markovian process. Within each time slot, the state of the
primary network remains unchanged. At the beginning of each time slot, the instantaneous state of the
primary network is not directly observable and the secondary user needs to decide whether to stay idle
or to carry out spectrum sensing. If the secondary user chooses to carry out spectrum sensing, it needs to
decide the duration of the sensing period and to configure related parameters to meet a minimum detection
probability. Subsequently, if spectrum sensing indicates that the primary network is idle, the secondary
user proceeds to transmit data during the rest of the time slot.
There are important trade-offs when the secondary user makes the above control decisions. By staying
idle in a particular time slot, the secondary user conserves energy, but at the same time suffers an increase in
delay and a reduction in throughput. By carrying out spectrum sensing, the secondary user consumes time
and energy to acquire knowledge of the state of the primary network, and stands a chance to transmit
data if the primary network is idle. Furthermore, there are trade-offs involving energy consumption,
sensing accuracy, and transmission time when the duration of sensing periods is varied. When the required
probability of detection is fixed, increasing the sensing time can reduce the probability of false alarms and
therefore increase the probability of transmission for the secondary user. However, increasing the sensing
time also reduces the time available for transmission.
For the secondary user, given the delay cost associated with staying idle in a time slot, the energy
costs associated with spectrum sensing and data transmission, and the throughput gain associated with
a successful transmission, we consider the problem of finding an optimal policy which decides the idle
and sensing modes, together with spectrum sensing time, to maximize the expected net reward. Here, the
reward is defined as a function of delay and energy costs and throughput gain.
B. Contributions
The main contributions of this paper are as follows.
• We formulate the control problem that captures important throughput and delay/energy trade-offs for
OSA systems. The problem involves important decisions for the secondary user, i.e., to stay idle or
to carry out spectrum sensing, and to determine the optimal duration of each sensing period.
• We analyze the problem using the framework of partially observable Markov decision processes
(POMDP) and prove important structural properties of the optimal control policies that maximize the
expected net reward.
• Based on theoretical characterization of the optimal policies, we propose heuristic control policies
that can be obtained at lower complexity while achieving good performance. One of these policies
is based on grid approximation of POMDP solutions.
• Finally, we obtain numerical results to support our theoretical analysis.
C. Related Work
There has been a series of recent works on optimizing spectrum-sensing activities in OSA systems
[2]–[6]. These works can roughly be classified into two groups, i.e., those that focus on the control within
each time slot, when the status of a primary network is more or less static [2]–[4] and those that focus
on the time dynamics of the control problem [5], [6].
In [2]–[4], the sensing duration within each time slot can be varied, i.e., the spectrum-sensing module
can operate at different receiver operating characteristic (ROC) curves. The objective then is to trade off between sensing accuracy and time
available for communications. Assuming the mis-detection probability is fixed, the longer the sensing
duration, the lower the false alarm probability. However, the longer the sensing duration, the less time is
available for communications. As the focus is on control within each time slot, the dynamics of primary
networks is not taken into account in [2]–[4].
In [5], Zhao et al. consider a spectrum access scenario similar to ours, where primary networks switch
between idle and active states in a Markovian manner and a secondary user carries out spectrum sensing
prior to opportunistic data transmission. An important result of [5] is the separation principle, which
decouples the design of the sensing policy from that of the spectrum sensor and the access policy. Unlike in our model,
in [5], the energy cost of sensing is not of concern and the secondary user carries out sensing in every
time slot. Furthermore, in [5], determining the sensing duration is not part of the control problem, rather,
it is assumed that a fixed Receiver Operating Characteristic (ROC) that defines the relationship between
false-alarm and mis-detection probabilities is given. In our model, varying the spectrum-sensing duration
results in different ROC curves. The work in [6] does take into account the energy and power consumption
when scheduling spectrum sensing. However, in [6], spectrum sensing is assumed perfect, i.e., there are
no false alarms and mis-detections.
To some extent, this paper bridges the gap between the two classes of problems considered in [2]–[4]
and in [5], [6]. In particular, our control problem recognizes the energy costs of sensing and transmission,
allows the variation of the spectrum-sensing duration within each time slot, and incorporates the dynamics
of the primary network over time. All these factors are taken into account for the final objective of
maximizing the expected long-term net reward received by the secondary user.
It is also interesting to note that some important results in this paper bear significant similarities to
those in [7] and [8], even though the control scenarios are totally different. In [7], [8], the problem of
scheduling packet transmission over time-varying channels with memory is considered, with the objective
of balancing throughput and energy consumption. The state of the channel is not directly observed and
the authors also formulate and analyze the problem using a POMDP framework.
D. Paper Organization
The rest of this paper is organized as follows. In Section II, we describe the system model and the
control problem. Important properties of the optimal control policies are proved and discussed in Section
III. In Section IV, heuristic policies are proposed. Numerical results and discussion are presented in
Section V. Finally, we conclude the paper and highlight future directions in Section VI.
II. SYSTEM MODEL
A. Primary Network
We consider a channel being allocated to a primary network operating in a time-slotted manner. In each
time slot of duration T, the primary network is either active (on) or idle (off). From one time slot to the
next, the primary network switches between active and idle states according to a stationary Markovian
process specified by the following state-transition matrix:

M = [ 1 − b     b   ]
    [   g     1 − g ],   0 < g, b < 1,                                                        (1)

where b is the probability that the primary network becomes active in the next time slot, given that it is
idle in the current slot, and g is the probability that the primary network becomes idle in the next time
slot, given that it is active in the current time slot. The stationary probabilities of being idle and active
for the primary network are πi = g/(b + g) and πa = b/(b + g), respectively.
From the secondary user’s point of view, when the primary network is idle, the user has a ‘good’ channel
to exploit. On the other hand, an active primary network results in a ‘bad’ channel for the secondary user.
This leads to an interesting observation that M can be regarded as the state-transition matrix for a virtual
Gilbert-Elliot (GE) channel of the secondary user. In [9], for a GE channel with state-transition matrix
M, the channel memory is defined as µ = 1 − b − g. When µ > 0, the channel is said to have positive
memory, i.e., the probability of remaining in a particular state is greater than or equal to the stationary
probability of that state. In this paper, we also assume that µ = 1 − b − g > 0.
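The two-state model above is simple enough to transcribe directly. The following sketch (ours, not from the paper) encodes the transition matrix (1), the stationary probabilities πi and πa, and the channel memory µ:

```python
# Illustrative sketch (not from the paper): the two-state primary-network
# model of Eq. (1), with states 0 = idle and 1 = active.
def transition_matrix(b, g):
    """State-transition matrix M: row = current state, column = next state."""
    return [[1 - b, b],
            [g, 1 - g]]

def stationary_probs(b, g):
    """Stationary probabilities (pi_idle, pi_active) = (g/(b+g), b/(b+g))."""
    return g / (b + g), b / (b + g)

def channel_memory(b, g):
    """Memory of the virtual Gilbert-Elliot channel: mu = 1 - b - g."""
    return 1 - b - g
```

For instance, with b = g = 0.05 the chain has memory µ = 0.9 and is idle half the time; that the stationary distribution is unchanged by one step of M is easy to check numerically.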
B. Opportunistic Spectrum Access
A secondary user opportunistically accesses the channel when the primary network is idle by first
synchronizing with the slot structure of the primary network and then carrying out the following mechanism
(illustrated in Fig. 1).
1) Spectrum Sensing: If the secondary user wishes to transmit in a particular slot, it will first spend a
time duration τ at the beginning of the slot to carry out spectrum sensing. This basically involves sampling
the channel and carrying out a binary hypothesis test:

H0: the primary network is idle,   versus   H1: the primary network is active.                (2)

Let θ denote the outcome of the above binary hypothesis test, where θ = 0 means H0 is detected and θ = 1
otherwise. Associated with the spectrum sensing activity are the probability of false alarm, i.e., mistaking H0
for H1, and the probability of mis-detection, i.e., mistaking H1 for H0. In this paper, we assume that the
secondary network must carry out spectrum sensing to meet a fixed probability of detection Pd. Then,
the probability of false alarm is a function of the sensing time τ and is denoted by Pfa(τ). The sensing
duration τ must be within the interval [τmin, τmax], where 0 < τmin ≤ τmax < T. It is assumed that for the
given range of τ, 0 < Pfa(τ) < Pd < 1. This is in fact a reasonable assumption, as in practical cognitive
radio systems [10], we normally have Pd > 90% and Pfa < 10%. Furthermore, we assume that Pfa(τ) is
continuous, differentiable, and decreasing in τ ∈ [τmin, τmax].
2) Data transmission: If the spectrum sensing results in θ = 0, the secondary user proceeds to transmit
data in the rest of the time slot. Otherwise, if θ = 1, the secondary user must stay quiet and wait until
the next time slot to try again.
3) Acknowledgment: Even though the spectrum sensing outcome indicates θ = 0, this can be due to
a mis-detection. Mis-detections result in collisions between primary and secondary transmissions. In this
paper, we assume that if collision happens due to mis-detection, a negative acknowledgment (NAK) is
returned. On the other hand, if the secondary transmission is carried out when the primary network is
actually idle, a positive acknowledgment (ACK) is returned.
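The per-slot mechanism of steps 1)–3) can be sketched as follows. This is our illustration (function and variable names are ours), with the sensing outcome drawn according to Pd and Pfa(τ):

```python
import random

def run_slot(primary_idle, pfa, pd, rng=random):
    """Simulate one time slot: sense, then transmit if theta = 0.
    Returns (transmitted, feedback), where feedback is 'ACK', 'NAK',
    or None (the user stayed quiet after theta = 1)."""
    if primary_idle:
        theta = 1 if rng.random() < pfa else 0   # false alarm w.p. Pfa(tau)
    else:
        theta = 1 if rng.random() < pd else 0    # correct detection w.p. Pd
    if theta == 1:
        return False, None                       # stay quiet, wait for next slot
    # theta = 0: transmit; the ACK/NAK reveals the true state of the primary
    return True, 'ACK' if primary_idle else 'NAK'
```

Note that a NAK is returned exactly when a mis-detection led to a collision, which is the information the belief update of Case 4 below relies on.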
C. POMDP Formulation
At the beginning of each time slot, the secondary user decides whether or not to carry out spectrum
sensing, and if so, for how long. As the instantaneous state of the primary network is not directly observed,
our control problem can be classified as a discrete-time POMDP with the following components.
1) Belief State: In a discrete-time POMDP, the decision maker selects an action and receives some
reward, together with some observation that reveals information about the actual system state. It is well
known [11] that for each POMDP, all information that is useful for making decisions can be encapsulated
in the posterior distribution vector of the system states. In our control problem, at the beginning of each
time slot, based on previous actions and observations, the secondary user can calculate the probability
that the primary network is idle in the time slot. We denote this probability by p and name it the ‘belief
state’. After each time slot, depending on the action taken by the secondary user and the corresponding
outcome, the belief statep can be updated according to one of the following four cases.
Case 1: The secondary user stays idle and does not carry out spectrum sensing. Then, the next belief
state, i.e., the probability that the primary network is idle in the next time slot can be derived as:
L1(p) = p(1 − b) + (1 − p)g = p(1 − b − g) + g. (3)
Case 2: The secondary user senses the channel for the duration τ, obtains the outcome θ = 1, i.e., the
primary network is detected as active, and therefore needs to keep quiet in the rest of the time slot. Using
Bayes’ rule, the belief state in the next time slot can be derived as:

L2(p, τ) = [p Pfa(τ)(1 − b) + (1 − p)Pd g] / [p Pfa(τ) + (1 − p)Pd].                          (4)
Case 3: The secondary user senses the channel for the duration τ, obtains θ = 0, i.e., the primary
network is detected as idle, carries out transmission, and subsequently receives an ACK at the end of the slot. The
ACK implies that the primary network is actually idle during the current time slot and therefore, the belief
state in the next time slot is L3 = 1 − b.
Case 4: All the same as Case 3, except that an NAK is received at the end of the slot. This implies that
mis-detection happens during spectrum sensing, the primary network is actually active during the current
time slot and therefore, the belief state in the next time slot is L4 = g.
As can be noted, L3 and L4 do not depend on p and τ. However, in the rest of this paper, in order
to simplify the notation, we use L1(p, τ), L3(p, τ), L4(p, τ) interchangeably with L1(p), L3, L4 defined
above. Also, letting Qi(p, τ), i = 2, 3, 4, denote the probability that Case i above happens, we have:

Q2(p, τ) = p Pfa(τ) + (1 − p)Pd,   Q3(p, τ) = p(1 − Pfa(τ)),   and   Q4(p, τ) = (1 − p)(1 − Pd).   (5)
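The belief updates (3)–(5) can be written out directly; the sketch below (our code) also lets one check the identity Σ_{i=2,3,4} Qi(p, τ)Li(p, τ) = L1(p), which is used later in the proof of Proposition 3:

```python
# Belief updates of Cases 1-4; the argument pfa stands for the value Pfa(tau).
def belief_idle(p, b, g):
    """Case 1, Eq. (3): no sensing, one Markov step."""
    return p * (1 - b - g) + g

def belief_theta1(p, pfa, pd, b, g):
    """Case 2, Eq. (4): sensing returned theta = 1 (Bayes' rule + one step)."""
    num = p * pfa * (1 - b) + (1 - p) * pd * g
    return num / (p * pfa + (1 - p) * pd)

def case_probs(p, pfa, pd):
    """Eq. (5): probabilities (Q2, Q3, Q4) of Cases 2-4."""
    return (p * pfa + (1 - p) * pd,
            p * (1 - pfa),
            (1 - p) * (1 - pd))
```

For any p, the three post-sensing beliefs average back to the no-sensing update: Q2·L2 + Q3·(1 − b) + Q4·g = L1(p).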
2) Properties of Li(p, τ): From (3) and the assumption that the state-transition matrix M has positive
memory, i.e., 1 − b − g > 0, it follows that L1(p) is increasing in p. Also, it can be verified that:

∂L2(p, τ)/∂p = (1 − b − g)Pd Pfa(τ) / [Pd − p(Pd − Pfa(τ))]².                                 (6)

As 1 − b − g > 0 and Pd > Pfa(τ) > 0, ∂L2(p, τ)/∂p is positive and increasing in p. Therefore, L2(p) is convex
and increasing in p. At the same time:

L2(p) = [(1 − b)(p Pfa(τ) + (1 − p)Pd) − (1 − p)Pd(1 − b − g)] / [p Pfa(τ) + (1 − p)Pd] ≤ 1 − b.   (7)

Similarly, it can be shown that L2(p) ≥ g. So

L4(p) ≤ L2(p) ≤ L3(p).                                                                        (8)
3) Costs, Reward, and Control Objective: When carrying out spectrum sensing, the secondary user
spends energy in channel sampling and signal processing. We assume that the energy cost of carrying
out spectrum sensing for τ units of time is a continuous, non-negative, and increasing function of τ and
is denoted by cs(τ). If the sensing outcome is θ = 0, the secondary user proceeds to transmit during the
rest of the time slot. Let rt and ct respectively be the gain in throughput and the energy cost, both
measured per unit of transmission time. It is reasonable to assume that rt > ct ≥ 0; otherwise, there is
no justification for the secondary user to carry out transmission.
Note that the secondary user can also choose to stay idle, i.e., neither carry out spectrum sensing nor
data transmission, during a time slot to conserve energy. However, doing so results in negative effects
such as lower throughput and longer latency. We assume that, for the secondary user, the cost of staying
idle during each time slot is ci, with ci ≥ 0.
In a particular time slot, if the probability of the primary network being idle is p and the secondary
user carries out spectrum sensing for τ units of time, then the expected net gain can be calculated as:

G(p, τ) = p(1 − Pfa(τ))(T − τ)rt − cs(τ) − [p(1 − Pfa(τ)) + (1 − p)(1 − Pd)](T − τ)ct
        = p(1 − Pfa(τ))(T − τ)(rt − ct) − (1 − p)(1 − Pd)(T − τ)ct − cs(τ).                    (9)

As, by assumption, rt > ct, T > τ, and Pfa(τ) < Pd < 1, G(p, τ) is continuous and increasing in p.
G(p, τ) is also continuous in τ.
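A direct transcription of (9) (our sketch; argument names mirror the text, with pfa = Pfa(τ) and cs = cs(τ) passed in as numbers):

```python
def net_gain(p, tau, pfa, pd, T, rt, ct, cs):
    """Expected net gain G(p, tau) of Eq. (9): throughput gain of a correct
    transmission, minus the energy cost of a (possibly colliding)
    transmission, minus the sensing cost."""
    return (p * (1 - pfa) * (T - tau) * (rt - ct)
            - (1 - p) * (1 - pd) * (T - tau) * ct
            - cs)
```

With rt > ct and Pfa(τ) < Pd < 1, the coefficient multiplying p is positive, so G(p, τ) is increasing in p, as claimed.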
To simplify the notation, we also useτ = 0 to represent that the secondary user chooses to stay idle.
We then have the following expected reward when the sensing decision is set to τ:

R(p, τ) = −ci        if τ = 0,
          G(p, τ)    if τmin ≤ τ ≤ τmax.                                                      (10)
We are interested in the following problem.
Definition 1: Let pn denote the probability that the primary network is idle during time slot n. Select
the sensing time τn, τn ∈ {0} ∪ [τmin, τmax], to maximize the following discounted reward function:

E{ Σ_{n=0}^{N−1} α^n R(pn, τn) | p0 = p },                                                    (11)

where 0 < α < 1 is a discounting factor and 1 ≤ N ≤ ∞ is the control horizon.
III. STRUCTURE OF OPTIMAL POLICIES
A. Monotonicity and Convexity of Value Functions
When N < ∞, let V^N(p) denote the maximum achievable discounted reward function in Definition 1.
V^N(p) satisfies the following Bellman equation [11]:

V^N(p) = max_{τ∈[τmin,τmax]} { −ci + α V^{N−1}(L1(p)),  G(p, τ) + α Σ_{i=2,3,4} Qi(p, τ) V^{N−1}(Li(p, τ)) },  N > 1,   (12)

where

V^1(p) = max_{τ∈[τmin,τmax]} { −ci, G(p, τ) }.                                                (13)
When N = ∞, let V(p) denote the maximum achievable discounted reward function in Definition 1.
V(p) satisfies the following Bellman equation [11]:

V(p) = max_{τ∈[τmin,τmax]} { −ci + α V(L1(p)),  G(p, τ) + α Σ_{i=2,3,4} Qi(p, τ) V(Li(p, τ)) }.   (14)

Note that in (14), G(p, τ) is the immediate gain obtained by sensing for duration τ, while the expected
discounted future gain given this sensing duration is α Σ_{i=2,3,4} Qi(p, τ) V(Li(p, τ)). As can be seen, both
immediate and future gains depend on τ.

It can be shown [11] that lim_{n→∞} V^n(p) = V(p). It can also be verified that V^N(p) and V(p) are
continuous in p. Let us now prove some important structural results for V^N(p) and V(p).
Proposition 1: V^N(p) and V(p) are nondecreasing in p.

Proof: First, let us prove the property for V^N(p), N < ∞. The proof proceeds by induction. As
G(p, τ) is increasing in p, from (13), it follows that V^1(p) is nondecreasing in p. Now, assuming that
V^n(p) is nondecreasing in p for some value of n ≥ 1, we have

V^{n+1}(p) = max_{τ∈[τmin,τmax]} { −ci + α V^n(L1(p)),  G(p, τ) + α Σ_{i=2,3,4} Qi(p, τ) V^n(Li(p, τ)) }.   (15)

Let E1 = −ci + α V^n(L1(p)) and E2 = G(p, τ) + α Σ_{i=2,3,4} Qi(p, τ) V^n(Li(p, τ)). As L1(p) and V^n(p)
are both nondecreasing in p, it follows that E1 is nondecreasing in p. The first term in E2, i.e., G(p, τ),
is nondecreasing in p. For the second term in E2, letting 0 < q < p, we have:

Σ_{i=2,3,4} Qi(p, τ) V^n(Li(p, τ)) − Σ_{i=2,3,4} Qi(q, τ) V^n(Li(q, τ))
  ≥ Σ_{i=2,3,4} (Qi(p, τ) − Qi(q, τ)) V^n(Li(p, τ)),   as L2(p, τ) > L2(q, τ),
  = V^n(L3)(Q3(p, τ) − Q3(q, τ)) + V^n(L2(p, τ))(Q2(p, τ) − Q2(q, τ))
    + V^n(L4)(Q3(q, τ) + Q2(q, τ) − Q3(p, τ) − Q2(p, τ)),   as Σ_{i=2,3,4} Qi(p, τ) = Σ_{i=2,3,4} Qi(q, τ) = 1,
  = (V^n(L3) − V^n(L2(p, τ)))(Q3(p, τ) − Q3(q, τ))
    + (V^n(L2(p, τ)) − V^n(L4))((Q3(p, τ) + Q2(p, τ)) − (Q3(q, τ) + Q2(q, τ)))
  ≥ 0,                                                                                        (16)

where the last inequality is due to the fact that Q3(p, τ) = p(1 − Pfa(τ)) is increasing in p and

Q3(p, τ) + Q2(p, τ) = p(1 − Pd) + Pd                                                          (17)

is also increasing in p. So the second term in E2 is nondecreasing in p, which implies that E2 is also nondecreasing
in p. As both E1 and E2 are nondecreasing in p, it follows that V^{n+1}(p) is nondecreasing in p.
As lim_{n→∞} V^n(p) = V(p), it follows that V(p) is also nondecreasing in p.
Proposition 2: V^N(p) and V(p) are convex in p.
Please refer to Appendix A for the proof.
Remark 1: Proposition 1 states, intuitively, that the higher the probability p that the primary network is
idle at the beginning of the control process, the higher the maximum achievable expected rewards V^N(p)
and V(p). Proposition 2 then indicates how fast V^N(p) and V(p) increase in p. As V^N(p) and V(p) are
convex, they increase at least linearly in p.
B. Properties of Optimal Policies
Let us explore some useful structural properties of the optimal control policies. Letting

G*(p) = max_{τ∈[τmin,τmax]} G(p, τ),                                                          (18)

as G(p, τ) is increasing in p, so is G*(p). We state the following property of the optimal control policies.

Proposition 3: Let p* be the minimum value of p such that G*(p*) > −ci. If, in a particular time slot,
the probability of the primary network being idle is p with p ≥ p*, then an optimal policy must carry
out spectrum sensing in that time slot.
Proof: Let us prove the case when N < ∞. We have V^1(p) = max_{τ∈[τmin,τmax]} { −ci, G(p, τ) } =
max{ −ci, G*(p) }, where G*(p) ≥ G*(p*) > −ci; therefore, spectrum sensing should be carried out when
N = 1. For N > 1, we have

V^N(p) = max_{τ∈[τmin,τmax]} { −ci + α V^{N−1}(L1(p)),  G(p, τ) + α Σ_{i=2,3,4} Qi(p, τ) V^{N−1}(Li(p, τ)) }.   (19)

Notice that Σ_{i=2,3,4} Qi(p, τ) = 1 and Σ_{i=2,3,4} Qi(p, τ) Li(p, τ) = L1(p). Then, as V^{N−1}(p) is convex in
p, it follows that

V^{N−1}(L1(p)) ≤ Σ_{i=2,3,4} Qi(p, τ) V^{N−1}(Li(p, τ)),   ∀ τmin ≤ τ ≤ τmax.                 (20)

This, together with the fact that G*(p) ≥ G*(p*) > −ci, implies that the system should carry out spectrum
sensing in the current time slot. The proof for N = ∞ is similar.
Remark 2: Proposition 3 gives a sufficient condition on the value of p, i.e., p ≥ p*, for carrying out
spectrum sensing in a particular time slot. However, this may not be a necessary condition. In general,
the optimal control policies for our POMDP model may not possess a threshold-based structure (in
the value of p). To the best of our knowledge, there have been a limited number of works that prove
the threshold-based characteristic of the optimal control policies for some specific POMDP models (see
[12]–[14]). Unfortunately, our problem does not directly fit into these models.
A natural question to ask is, given that sensing is carried out, how the optimal sensing time τ would
vary with p. Let

F^N(p, τ) = α Σ_{i=2,3,4} Qi(p, τ) V^{N−1}(Li(p, τ)),   N > 1,   τmin ≤ τ ≤ τmax,             (21)

be the expected discounted future reward if the probability of the primary network being idle in a
particular time slot is p and sensing is carried out for a duration τ. Similarly, for the case of an infinite
control horizon, define

F(p, τ) = α Σ_{i=2,3,4} Qi(p, τ) V(Li(p, τ)),   τmin ≤ τ ≤ τmax.                              (22)
The following result highlights the effect of increasing the spectrum sensing timeτ .
Proposition 4: F^N(p, τ) and F(p, τ) are nondecreasing in τ.
Please refer to Appendix B for the proof.
Remark 3: Essentially, Proposition 4 makes concrete the intuition that the more sensing is carried
out in the current time slot, the better the expected reward in the future time slots. This is because
increasing the sensing time gives the secondary user more accurate knowledge of the state of the primary
network, which in turn improves future control.
To further study the effect of varying the sensing time, we need to make the following assumptions.
• A1: The probability of false alarm, i.e., Pfa(τ), is convex and decreasing in τ, τmin ≤ τ ≤ τmax.
• A2: The energy cost of sensing, i.e., cs(τ), is convex and increasing in τ.
For the justifications of assumptions A1 and A2, please refer to Appendix E.
Lemma 1: Given assumptions A1 and A2, the function G(p, τ) is concave in τ, τmin ≤ τ ≤ τmax.
Please refer to Appendix C for the proof.
As the function G(p, τ) is concave, continuous, and has a strictly decreasing first-order derivative for
all τ in the interval [τmin, τmax], there exists a unique maximum point of this function. Letting

τ*(p) = arg max_{τmin ≤ τ ≤ τmax} G(p, τ),                                                    (23)
the following proposition relates the optimal sensing time to the value of τ*(p) defined in (23).

Proposition 5: Given assumptions A1 and A2, if at the beginning of a particular time slot the probability
of the primary network being idle is p and sensing is carried out, then the optimal sensing time
τ_opt is greater than or equal to τ*(p).

Proof: We prove for N = ∞; the case when N < ∞ is similar. The proof is by contradiction.
Suppose the optimal sensing time τ_opt < τ*(p). Due to the concavity of G(p, τ) in τ, we have G(p, τ_opt) <
G(p, τ*(p)). Furthermore, from Proposition 4, we have F(p, τ_opt) ≤ F(p, τ*(p)); therefore

G(p, τ_opt) + F(p, τ_opt) < G(p, τ*(p)) + F(p, τ*(p)),                                        (24)

which contradicts the fact that τ_opt is the optimal sensing time given p. This completes the proof.
Proposition 5 gives a lower bound for the optimal sensing time τ_opt. To see how this lower bound
varies with p, consider the first-order derivative of G(p, τ) with respect to τ:

∂G/∂τ = p[(rt − ct)(Pfa(τ) − (T − τ)P′fa(τ) − 1) − (1 − Pd)ct] − [c′s(τ) − (1 − Pd)ct]
      = p u(τ) − v(τ),                                                                        (25)

where u(τ) = (rt − ct)(Pfa(τ) − (T − τ)P′fa(τ) − 1) − (1 − Pd)ct and v(τ) = c′s(τ) − (1 − Pd)ct. We have
argued in the proof of Lemma 1 that u(τ) is strictly decreasing in τ. Furthermore, as cs(τ) is convex,
v(τ) is nondecreasing in τ. The following lemma characterizes how τ*(p) varies in p.
Lemma 2: Given assumptions A1 and A2, we have the following cases.
• Case 1: If v(τmin) ≥ 0, then τ*(p) is nondecreasing in p.
• Case 2: If v(τmax) < 0, then τ*(p) is nonincreasing in p.
• Case 3: If v(τmin) < 0 ≤ v(τmax), let τ_v be such that v(τ_v) = 0, τmin < τ_v ≤ τmax. Then:
  a) if u(τ_v) > 0, τ*(p) is nondecreasing in p;
  b) if u(τ_v) ≤ 0, τ*(p) is nonincreasing in p.
Please refer to Appendix D for the proof.
Remark 4: Proposition 5 states that when the probability of the primary network being idle is p and
sensing is carried out, the lower bound of the optimal sensing time is τ*(p). Lemma 2 further shows
that this lower bound τ*(p) is always monotonic in p.
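Lemma 2 is easy to probe numerically. The sketch below is ours: it uses an illustrative convex decreasing curve Pfa(τ) = 0.5e^{−5τ} and linear cost cs(τ) = sτ (neither taken from the paper) and grid-searches τ*(p) of (23). With these numbers, v(τ) = c′s(τ) − (1 − Pd)ct = s − (1 − Pd)ct < 0 for all τ, which is Case 2, so τ*(p) should be nonincreasing in p:

```python
import math

def tau_star(p, T=10.0, rt=2.0, ct=1.0, pd=0.9, s=0.05,
             tau_min=0.1, tau_max=2.0, steps=2000):
    """Grid search for tau*(p) = argmax_tau G(p, tau) over [tau_min, tau_max]."""
    def gain(tau):
        pfa = 0.5 * math.exp(-5.0 * tau)          # illustrative Pfa(tau), our choice
        return (p * (1 - pfa) * (T - tau) * (rt - ct)
                - (1 - p) * (1 - pd) * (T - tau) * ct
                - s * tau)                         # linear sensing cost, our choice
    grid = [tau_min + i * (tau_max - tau_min) / steps for i in range(steps + 1)]
    return max(grid, key=gain)
```

Since G(p, ·) is concave here (convex Pfa, convex cs, Lemma 1), the grid maximizer approximates the unique τ*(p) to within the grid resolution.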
Remark 5: If the energy costs of sensing and transmission are negligible, then the gain of sensing for
τ units of time when the probability of the primary network being idle is p simplifies to:

G(p, τ) = p(1 − Pfa(τ))(T − τ)rt.                                                             (26)

Then the value of τ that maximizes G(p, τ) does not depend on p. Therefore, it can be verified
that the policy that carries out sensing for this fixed duration in every time slot maximizes the expected
reward. Our optimization problem is then equivalent to the problem considered in [2], [3], which focus
on maximizing throughput within each time slot.
IV. HEURISTIC POLICIES
Directly solving the POMDP described in Section III can be computationally challenging. In this section,
suboptimal control policies that can be obtained at lower complexity are discussed.
A. Grid-based Approximation
Grid-based approximation is a widely used approach for approximating solutions to POMDPs. In this
approach, the value function is approximated at a finite number of belief points on a grid. The value
function at belief points not belonging to the grid is evaluated using interpolation. In this paper, we
employ the fixed-resolution, regular-grid approach proposed by Lovejoy [15]. Applied to our POMDP,
the range of the belief state is first divided into P equally spaced points p0, p1, ..., p_{P−1}. Then, the value
function at these grid points is calculated using the following iteration:

v(pj) = max_{τ∈[τmin,τmax]} { −ci + α V̂(L1(pj)),  G(pj, τ) + α Σ_{i=2,3,4} Qi(pj, τ) V̂(Li(pj, τ)) },   (27)

where, for each value of p, we find j such that pj ≤ p ≤ p_{j+1} and calculate

V̂(p) = [(p − pj)/(p_{j+1} − pj)] v(p_{j+1}) + [(p_{j+1} − p)/(p_{j+1} − pj)] v(pj).           (28)

As pointed out in [15], the iteration in (27) is guaranteed to converge, and the value V̂(p) in (28) is an
upper bound of the optimal value function V(p) (as V(p) is convex). After obtaining the approximate value
function V̂(p), we can substitute it into the Bellman equation (14) to obtain the corresponding sensing time.
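A minimal version of the grid iteration (27)–(28) can be coded directly. The sketch below is ours: the Pfa curve, cost numbers, and the discretization of τ are illustrative choices, not the paper's; only the structure (interpolated Bellman backups on a regular belief grid) follows (27)–(28):

```python
import math

def grid_value_iteration(b=0.1, g=0.1, pd=0.9, T=10.0, rt=2.0, ct=1.0,
                         ci=0.5, s=0.05, alpha=0.9, P=101, iters=300):
    """Fixed-resolution grid approximation of the value function."""
    taus = [0.1 + 0.1 * k for k in range(20)]        # sensing times in [0.1, 2.0]
    pfa = [0.5 * math.exp(-5.0 * t) for t in taus]   # illustrative Pfa(tau)
    grid = [j / (P - 1) for j in range(P)]
    v = [0.0] * P

    def interp(p):                                    # Eq. (28): linear interpolation
        j = min(int(p * (P - 1)), P - 2)
        w = (p - grid[j]) / (grid[j + 1] - grid[j])
        return (1 - w) * v[j] + w * v[j + 1]

    for _ in range(iters):                            # Eq. (27): Bellman backups
        new = []
        for p in grid:
            best = -ci + alpha * interp(p * (1 - b - g) + g)   # stay idle
            for t, f in zip(taus, pfa):
                q2 = p * f + (1 - p) * pd
                q3 = p * (1 - f)
                q4 = (1 - p) * (1 - pd)
                l2 = (p * f * (1 - b) + (1 - p) * pd * g) / q2
                gain = (p * (1 - f) * (T - t) * (rt - ct)
                        - (1 - p) * (1 - pd) * (T - t) * ct - s * t)
                best = max(best, gain + alpha * (q2 * interp(l2)
                                                 + q3 * interp(1 - b)
                                                 + q4 * interp(g)))
            new.append(best)
        v = new
    return grid, v
```

Each backup is an α-contraction, so 300 iterations with α = 0.9 put the iterates well within floating-point noise of the fixed point; the resulting grid values should be nondecreasing in p, matching Proposition 1.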
B. Myopic Policy ζm(.)
Proposition 3 identifies a sufficient condition on the probability that the primary network is idle for
sensing to be carried out. Furthermore, Proposition 5 gives the lower bound on the optimal sensing time,
given that sensing is carried out. Based on these two propositions, we consider the following policy:

τn = ζm(pn) = 0         if pn < p*,
              τ*(pn)    if pn ≥ p*,                                                           (29)

where ζm(.) maps the belief state pn into the sensing time for the secondary user. Note that setting the
sensing time τ = ζm(p) myopically maximizes the instantaneous gain when the belief state is p.
C. Static Policy ζs(.)
Consider a static spectrum access policy that always carries out sensing with a fixed sensing duration.
Note that the stationary probability of the primary network being idle is πi = g/(b + g). We can calculate

τ*(πi) = arg max_{τmin ≤ τ ≤ τmax} G(πi, τ).                                                  (30)

Then, for every time slot, sensing is carried out for the duration τ*(πi). We term this policy ζs(.).
D. Genie’s Policy
To facilitate comparison, we normalize the performance of different control policies to that of the
following so-called ‘Genie’s policy’. Suppose that at the beginning of each time slot, a genie tells the
secondary user exactly what the state of the primary network is. Then, the best action for the secondary
user is to carry out transmission if the primary network is idle, and to stay idle if the primary network
is active. The expected discounted reward for the secondary user can be calculated as:

V_genie = Σ_{n=0}^{∞} α^n [ (g/(b + g)) T(rt − ct) − (b/(b + g)) ci ]
        = [g T(rt − ct) − b ci] / [(1 − α)(b + g)].                                           (31)
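Since the genie's per-slot reward is constant, (31) is just a geometric series; a quick numeric check (our code):

```python
def genie_value(b, g, T, rt, ct, ci, alpha):
    """Closed form (31): discounted reward under perfect state knowledge."""
    per_slot = (g * T * (rt - ct) - b * ci) / (b + g)
    return per_slot / (1 - alpha)
```

Summing α^n times the per-slot reward over a long horizon should reproduce the closed form, since Σ α^n = 1/(1 − α).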
V. NUMERICAL RESULTS AND DISCUSSION
In this section, we present numerical results that illustrate our theoretical analysis. We also compare the
performance of the optimal policies to that of the suboptimal policies described in Section IV. We focus
on the infinite horizon control scenario, i.e., whenN = ∞.
A. Model for Numerical Studies
We assume that the primary network operates on a channel with bandwidth B = 6 MHz. The time
slot duration is T = 10 msec. The secondary user employs energy detection for spectrum sensing, with
the channel being oversampled at rate fs = (7/8)B. Given the required probability of detection Pd, the SNR
of the active primary signal at the receiver of the secondary user being γ, and the sensing time τ, the
probability of false alarm can be calculated as [2]:

Pfa(τ) = Q( sqrt(2γ + 1) Q^{−1}(Pd) + sqrt(τ fs) γ ),                                         (32)

where Q(.) is the complementary distribution function of a standard Gaussian variable. We set γ = −15
dB and Pd = 90%. We assume that the sensing cost is linear in the sensing time, i.e., cs(τ) = sτ, s > 0. This
satisfies the assumption that cs(τ) is convex and increasing in τ. The memory of the primary network
switching process, i.e., µ = 1 − b − g, is varied from 0.1 to 0.95. While doing so, we always set
b = g = (1 − µ)/2.
B. Characteristics of Optimal Policies
To obtain optimal solutions to our POMDP, we use the solver package provided by A. Cassandra [16].
1) Optimal reward function is convex and nondecreasing: In Fig. 2, we plot the optimal reward function V(p) versus the initial belief state p. As proved in Propositions 1 and 2, V(p) is convex and nondecreasing in p. As can be seen, for high values of p, V(p) is close to a linear function. This can be explained by the fact that the optimal control action does not vary much for high values of p (refer to Figs. 5 and 6).
2) Convexity of Pfa(τ) and concavity of G(p, τ): In Figs. 3 and 4, we respectively plot the probability of false alarm Pfa(τ) and the instantaneous gain G(p, τ) as functions of the sensing time τ when other parameters are fixed. As shown in Appendix E, for energy detection, Pfa(τ) is convex and decreasing in the region where 0 < Pfa(τ) ≤ 0.5. Also, from Lemma 1, when Pfa(τ) is convex and decreasing while cs(τ) is convex and increasing, the instantaneous gain G(p, τ) is concave in τ. These effects are illustrated in Figs. 3 and 4.
3) Optimal sensing time τ_opt(p): In Figs. 5 to 8, given different system parameters, we plot the sensing time versus the probability of the primary network being idle for different control policies, i.e., optimal, grid-based approximation, myopic, and static. From these plots, it can be observed that there is a threshold value for the belief state p above which it is optimal for the secondary user to carry out spectrum sensing and below which it is optimal to stay idle. In other words, the optimal sensing strategies tend to exhibit a 'threshold-based' characteristic. Note that this conjecture is stronger than Proposition 3, where it is shown that there is a minimum value of p above which sensing should be carried out.
From Fig. 5, it can also be observed that for a relatively low value of the memory of the primary network switching process (µ = 0.5), both the grid-based and myopic policies approximate the optimal sensing durations well. However, in Fig. 6, when the memory increases (µ = 0.9), only the grid-based policy is close to optimal, while the myopic policy differs significantly from the optimal policy. This difference can be explained by the fact that the myopic policy only focuses on instantaneous gain and ignores future effects. As a result, it is not able to exploit the memory (correlation) in the primary network states. We note also that the grid-based policy plotted in Figs. 5 and 6 is obtained using only 10 discretized points. This shows that, for our POMDP control problem, small-sized grid-based approximations are good enough.

Comparing Figs. 5 and 6, it is evident that when the memory of the primary network switching process increases from µ = 0.5 to µ = 0.9, the optimal sensing time also increases. This is because the higher the memory, the slower the change in status of the primary network and, therefore, the more useful the sensing activity for predicting the future state of the primary network. In Fig. 7, the cost of staying idle in a time slot is increased to ci = 1000. As can be seen, this discourages the secondary user from staying idle, i.e., sensing is carried out for smaller values of the probability that the primary network is idle.

In Figs. 5 to 7, it can be observed that τ*(p) is nondecreasing in p. This is because, by setting s = 1 and ct = 2, we have v(τ) = c's(τ) − (1 − Pd)ct = 0.8 > 0; therefore, Case 1 in Lemma 2 applies. On the other hand, in Fig. 8, it can be observed that τ*(p) is nonincreasing in p. This is because, by setting s = 1 and ct = 50, we have v(τ) = c's(τ) − (1 − Pd)ct = −4 < 0; therefore, Case 2 in Lemma 2 applies.
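The two sign computations can be reproduced directly. A small sketch, taking the detection probability as Pd = 0.9 so that the arithmetic matches, and cs(τ) = sτ so that c's(τ) = s:

```python
# v(tau) = cs'(tau) - (1 - Pd)*ct, which is constant in tau for linear cs.
Pd, s = 0.9, 1.0
v_case1 = s - (1 - Pd) * 2.0     # ct = 2  (Figs. 5-7): positive -> Case 1, tau* nondecreasing
v_case2 = s - (1 - Pd) * 50.0    # ct = 50 (Fig. 8): negative -> Case 2, tau* nonincreasing
```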
C. Performance Comparison

To facilitate comparison, we normalize the performance of different control policies to that of the 'Genie's policy' described in Section IV-D.

1) Rewards: In Fig. 9, we plot the normalized expected reward versus the memory µ of the primary network switching process for the optimal, grid-based, myopic, and static policies. As can be seen, the performance of the optimal, grid-based, and myopic policies all improves with increasing memory µ. This is because the higher the memory, the more slowly the primary network switches state, which makes sensing activities more useful for future control. On the other hand, the performance of the static policy (ζs(.)) does not change with the memory µ. This is because when µ is varied, we always keep b = g, which leaves the stationary probability of the primary network being active or idle unchanged (πi = 0.5). This means that when the memory changes, in every time slot, ζs(.) always carries out sensing for a fixed duration. Therefore, the performance of ζs(.) does not change with increasing memory.
In Fig. 9, it can be observed that for the low range of memory (µ < 0.7), the performance of both the grid-based and myopic policies is close to that of the optimal policy. On the other hand, for higher values of memory, i.e., when the primary network switches state more slowly, only the grid-based policy can approximate the performance of the optimal policy. The explanation for this is the same as the one given for the difference in sensing times in Figs. 5 and 6. In particular, for low memory, both the myopic and grid-based policies can approximate the optimal decisions well. However, for high values of memory, only the grid-based policy has decisions close to optimal.

In Fig. 9, it is evident that the performance of a 10-point grid-based approximation is almost the same as that of the optimal policy. The performance of a 5-point grid-based approximation is also very close to optimal. This shows that we can employ the grid-based approximation approach without loss of performance in our POMDP problem.
In Fig. 10, we set ci = 1000 to examine the effect of introducing the cost of staying idle. As can be seen, the performance trends are similar to those in Fig. 9. The extra effect that can be observed is that the performance of the myopic policy (ζm(.)) is closer to that of the optimal and grid-based policies in this scenario. This can be explained by referring back to Fig. 7, where the sensing time of the myopic policy is quite close to that of the optimal and grid-based policies.
VI. CONCLUSION

In this paper, we study spectrum-sensing policies that take into account the dynamics of the primary networks and determine the spectrum sensing duration in order to optimize secondary users' performance. As such, this paper bridges the gap between the two groups of existing spectrum-sensing/control literature that focus either on adapting to the dynamics of the primary networks or on optimizing the spectrum-sensing duration, but not on both. We present detailed theoretical analysis and numerical studies for the optimization problem.
We are currently extending this work in two directions. One is to study different trade-offs when some constraints in this paper are relaxed; e.g., instead of enforcing a fixed probability of detection, a time-average interference constraint can be introduced to protect the primary networks' operation. The other direction is to extend this work to multi-user scenarios and consider cooperative and/or distributed spectrum-sensing policies.
APPENDIX
A. Proof of Proposition 2

Proof: We proceed by induction. From (13),

    V^1(p) = max_{τ∈[τmin,τmax]} { −ci, G(p, τ) },

where G(p, τ) is linear in p. As the maximum of convex functions is convex, V^1(p) is convex in p. Now, assume that V^n(p) is convex in p for some value n ≥ 1. From (12), if we can prove that V^n(L1(p)) and Qi(p, τ)V^n(Li(p, τ)), i = 2, 3, 4, are all convex, then it follows that V^{n+1}(p) is convex.
As L1(p) is linear, L1(p) is convex. By assumption, V^n(p) is convex, and from Proposition 1, V^n(p) is nondecreasing. It can be shown ([17]) that if h(x) and g(x) are two functions such that h(x) is convex and nondecreasing and g(x) is convex, then the composition f(x) = h(g(x)) is also convex. Applying this here, it follows that V^n(L1(p)) is convex.

We have Q3(p, τ)V^n(L3(p, τ)) = p(1 − Pfa(τ))V^n(1 − b), i.e., it is linear and convex in p. Similarly, Q4(p, τ)V^n(L4(p, τ)) = (1 − p)(1 − Pd)V^n(g) is linear and convex in p.
To check whether Q2(p, τ)V^n(L2(p, τ)) is convex in p, we look at its second-order derivative with respect to p. It can be verified that

    ∂²(Q2(p, τ)V^n(L2(p, τ)))/∂p² = V^n''(L2(p, τ)) (1 − b − g)² Pd² Pfa²(τ) / [Pd − p(Pd − Pfa(τ))]³.        (33)

We have shown earlier that L2(p, τ) is convex and increasing in p; this, coupled with the fact that V^n(p) is convex and nondecreasing, implies that V^n(L2(p, τ)) is convex and nondecreasing. Therefore, V^n''(L2(p, τ)) ≥ 0. It then follows from (33) that the second-order derivative of Q2(p, τ)V^n(L2(p, τ)) with respect to p is nonnegative, which implies that Q2(p, τ)V^n(L2(p, τ)) is convex.
We have proved that V^n(L1(p)) and Qi(p, τ)V^n(Li(p, τ)), i = 2, 3, 4, are all convex in p; therefore, V^{n+1}(p) is convex.

As lim_{n→∞} V^n(p) = V(p), it follows that V(p) is also convex in p. This completes the proof.
B. Proof of Proposition 4

Proof: We prove the property of F(p, τ); the proof for F^N(p, τ) is similar. Let τmin ≤ τ1 < τ2 ≤ τmax. It can be verified that

    (F(p, τ2) − F(p, τ1))/α = (Q3(p, τ2) − Q3(p, τ1))V(L3) + Q2(p, τ2)V(L2(p, τ2)) − Q2(p, τ1)V(L2(p, τ1)).        (34)

From (5) and the fact that Pfa(τ) is decreasing in τ, we have:

    Q3(p, τ2) − Q3(p, τ1) > 0,  Q2(p, τ2) > 0,  and
    Q3(p, τ2) − Q3(p, τ1) + Q2(p, τ2) = Q2(p, τ1).        (35)
Furthermore,

    (Q3(p, τ2) − Q3(p, τ1))L3 + Q2(p, τ2)L2(p, τ2)
        = (pPfa(τ1) − pPfa(τ2))(1 − b) + (pPfa(τ2) + (1 − p)Pd) · [pPfa(τ2)(1 − b) + (1 − p)Pd g] / [pPfa(τ2) + (1 − p)Pd]
        = pPfa(τ1)(1 − b) + (1 − p)Pd g = Q2(p, τ1)L2(p, τ1).        (36)

From (35) and (36) and the fact that V(p) is convex in p, it follows that F(p, τ2) − F(p, τ1) is nonnegative. This completes the proof.
C. Proof of Lemma 1

Proof: We have:

    G(p, τ) = p(1 − Pfa(τ))(T − τ)(rt − ct) − (1 − p)(1 − Pd)(T − τ)ct − cs(τ).        (37)

As cs(τ) is convex, −cs(τ) is concave. Also, the second term in (37) is linear in τ. So all we have to prove is that the first term in (37) is concave in τ. Let w(p, τ) = p(1 − Pfa(τ))(T − τ)(rt − ct). Then

    ∂w/∂τ = p(rt − ct)( Pfa(τ) − (T − τ)P'fa(τ) − 1 ),        (38)

where P'fa(τ) is the first-order derivative of Pfa(τ). From assumption A1, it follows that P'fa(τ) is negative and nondecreasing in τ. Furthermore, T − τ is positive and decreasing in τ. That leads to (T − τ)P'fa(τ) being negative and nondecreasing in τ. This, together with the fact that Pfa(τ) is decreasing in τ and rt > ct, implies that ∂w/∂τ is decreasing in τ. Therefore, w(p, τ) is concave in τ and so is G(p, τ).
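Lemma 1 can also be checked numerically via second differences. The sketch below, with illustrative parameter values (Pd = 0.9, γ = −15 dB, fs = (8/7)·6 MHz, p = 0.5), uses the energy-detection Pfa of Eq. (43) and verifies that G(p, τ) is concave on a grid of sensing times where Pfa(τ) ≤ 0.5:

```python
import math

T, Pd, gamma, fs = 10e-3, 0.9, 10 ** (-15 / 10), (8 / 7) * 6e6
rt, ct, s, p = 3.0, 2.0, 1.0, 0.5

def Q(x):                            # Gaussian tail probability
    return 0.5 * math.erfc(x / math.sqrt(2))

def Qinv(q):                         # inverse of Q via bisection
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Q(mid) > q else (lo, mid)
    return (lo + hi) / 2

def Pfa(tau):                        # Eq. (43)
    return Q(math.sqrt(2 * gamma + 1) * Qinv(Pd) + math.sqrt(tau * fs) * gamma)

def G(tau):                          # Eq. (37) with the belief p fixed, cs(tau) = s*tau
    return (p * (1 - Pfa(tau)) * (T - tau) * (rt - ct)
            - (1 - p) * (1 - Pd) * (T - tau) * ct - s * tau)

taus = [5e-4 + k * 1e-5 for k in range(250)]   # region where Pfa(tau) <= 0.5
second_diffs = [G(taus[k - 1]) - 2 * G(taus[k]) + G(taus[k + 1]) for k in range(1, 249)]
concave = all(d <= 1e-12 for d in second_diffs)   # nonpositive second differences
```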
D. Proof of Lemma 2

Proof: The proof is straightforward and based mainly on the fact that u(τ) is decreasing in τ while v(τ) is nondecreasing in τ. Due to limited space, we only give the proof for Case 3a.

Given that v(τmin) < 0 ≤ v(τmax), v(τv) = 0, u(τv) > 0, and letting 0 < q < p < 1, we need to prove that τ*(p) ≥ τ*(q). There are the following scenarios to consider:

Scenario 1: τ*(q) = τmin. It immediately follows that τ*(p) ≥ τ*(q).

Scenario 2: τ*(p) = τmax. It immediately follows that τ*(p) ≥ τ*(q).

Scenario 3: τ*(q) = τmax. This implies that qu(τmax) − v(τmax) ≥ 0. Furthermore, as v(τmax) ≥ 0 by assumption, we must have u(τmax) ≥ 0. This leads to:

    pu(τmax) − v(τmax) ≥ qu(τmax) − v(τmax) ≥ 0,        (39)

which implies that τ*(p) = τmax.
Scenario 4: τ*(p) = τmin. This implies that pu(τmin) − v(τmin) < 0. Furthermore, from the assumption that u(τv) > 0 and the fact that u(τ) is decreasing in τ, we must have u(τmin) > 0. Therefore,

    qu(τmin) − v(τmin) < pu(τmin) − v(τmin) < 0,        (40)

which implies that τ*(q) = τmin.
Scenario 5: τmin < τ*(p), τ*(q) < τmax. We then have

    pu(τ*(p)) − v(τ*(p)) = qu(τ*(q)) − v(τ*(q)) = 0.        (41)

If τ*(p) ≤ τv, then from the assumption that u(τv) > 0 = v(τv) and the fact that u(.) is decreasing while v(.) is nondecreasing, we have pu(τ*(p)) − v(τ*(p)) > 0, which contradicts (41). Therefore, τ*(p) > τv. Similarly, τ*(q) > τv. Now suppose, for contradiction, that τ*(p) < τ*(q). Since v(.) is nondecreasing and positive beyond τv, and q < p, (41) gives

    u(τ*(p)) = v(τ*(p))/p < v(τ*(q))/q = u(τ*(q)),        (42)

which, as u(.) is decreasing, implies τ*(p) > τ*(q), a contradiction. Hence τ*(p) ≥ τ*(q).

Combining the arguments in Scenarios 1-5 completes the proof for Case 3a. The proof for the other cases is similar.
E. Convexity of Pfa(τ) and cs(τ)

Let us justify the assumption that Pfa(τ) is convex and decreasing and cs(τ) is convex and increasing for the case of spectrum sensing based on energy detection.

1) Convexity of Pfa(τ): Assume that the primary signal is complex-valued PSK and the noise is circularly symmetric complex Gaussian. Given the required probability of detection Pd and the SNR of the primary signal at the secondary receiver γ, the probability of false alarm can be calculated as ([2]):

    Pfa(τ) = Q(√(2γ + 1) Q^{-1}(Pd) + √(τ fs) γ).        (43)

Differentiating with respect to τ gives:

    P'fa(τ) = dPfa/dτ = − (γ√fs)/(2√(2π)) τ^{-1/2} exp( −(√(2γ + 1) Q^{-1}(Pd) + γ√(τ fs))²/2 ).        (44)

For τ > 0, it is clear that P'fa(τ) < 0, so Pfa(τ) is decreasing in τ. Furthermore, when Pfa(τ) ≤ 0.5, from (43) we have √(2γ + 1) Q^{-1}(Pd) + γ√(τ fs) ≥ 0. This, together with (44), implies that P'fa(τ) is monotonically increasing in τ when Pfa(τ) ≤ 0.5, i.e., Pfa(τ) is convex in τ when Pfa(τ) ≤ 0.5.
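These monotonicity and convexity claims can be verified by finite differences. A sketch with illustrative parameters (Pd = 0.9, γ = −15 dB, fs = (8/7)·6 MHz), restricted to a range of τ where Pfa(τ) ≤ 0.5:

```python
import math

Pd, gamma, fs = 0.9, 10 ** (-15 / 10), (8 / 7) * 6e6

def Q(x):                            # Gaussian tail probability
    return 0.5 * math.erfc(x / math.sqrt(2))

def Qinv(q):                         # inverse of Q via bisection
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if Q(mid) > q else (lo, mid)
    return (lo + hi) / 2

def Pfa(tau):                        # Eq. (43)
    return Q(math.sqrt(2 * gamma + 1) * Qinv(Pd) + math.sqrt(tau * fs) * gamma)

taus = [5e-4 + k * 1e-5 for k in range(200)]
vals = [Pfa(t) for t in taus]
decreasing = all(b < a for a, b in zip(vals, vals[1:]))                      # Pfa' < 0
convex = all(vals[k - 1] - 2 * vals[k] + vals[k + 1] > 0 for k in range(1, 199))  # Pfa'' > 0
```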
2) Convexity of cs(τ): Energy detection comprises three main steps: i) take K samples of the signal; ii) calculate the average power of the K samples; and iii) compare the average power to a certain threshold. The complexity of the first and second steps is linear in K. Furthermore, K is linear in τ, so cs(τ) can be assumed linear in τ when spectrum sensing is based on energy detection.
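The three steps can be sketched as follows (a toy illustration, not the authors' implementation; the function name `energy_detect` and the threshold value are made up for the example):

```python
import math
import random

def energy_detect(samples, threshold):
    """Energy detection: average the power of K samples, compare to a threshold."""
    K = len(samples)
    avg_power = sum(abs(x) ** 2 for x in samples) / K   # step ii): O(K) work
    return avg_power > threshold                         # step iii): declare 'active' if above

# Step i): take K samples; here, noise-only circularly symmetric complex Gaussian.
random.seed(0)
noise_var = 1.0
K = 4096                                                 # K grows linearly with tau
samples = [complex(random.gauss(0, math.sqrt(noise_var / 2)),
                   random.gauss(0, math.sqrt(noise_var / 2))) for _ in range(K)]
decision = energy_detect(samples, threshold=1.2 * noise_var)
```

With noise-only input, the averaged power concentrates around noise_var, so a threshold of 1.2·noise_var is very unlikely to be crossed. The per-sample work in steps i) and ii) is what makes the sensing cost roughly linear in K, and hence in τ.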
REFERENCES
[1] FCC, "Spectrum policy task force report," FCC 02-155, Nov. 2002.
[2] Y.-C. Liang, Y. H. Zeng, E. Peh, and A. T. Hoang, "Sensing-throughput tradeoff for cognitive radio networks," in Proc. IEEE ICC'07, Glasgow, Jun. 2007.
[3] A. Ghasemi and E. Sousa, "Optimization of spectrum sensing for opportunistic spectrum access in cognitive radio networks," in Proc. 4th IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, USA, Jan. 2007.
[4] Y. Pei, A. T. Hoang, and Y.-C. Liang, "Sensing-throughput tradeoff in cognitive radio networks: How frequently should spectrum sensing be carried out?" in Proc. 18th Annual IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Athens, Greece, Sep. 2007.
[5] Q. Zhao, L. Tong, A. Swami, and Y. Chen, "Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework," IEEE Journal on Selected Areas in Communications: Special Issue on Adaptive, Spectrum Agile and Cognitive Wireless Networks, vol. 25, no. 3, pp. 589-600, Apr. 2007.
[6] A. T. Hoang and Y.-C. Liang, "Adaptive scheduling of spectrum sensing periods in cognitive radio networks," to appear in Proc. 50th IEEE Global Telecommunications Conference (Globecom), Washington, DC, USA, Nov. 2007.
[7] D. Zhang and K. Wasserman, "Transmission schemes for time-varying wireless channels with partial state observations," in Proc. IEEE INFOCOM'02, New York, Jun. 2002, pp. 467-476.
[8] L. A. Johnston and V. Krishnamurthy, "Opportunistic file transfer over a fading channel: A POMDP search theory formulation with optimal threshold policies," IEEE Transactions on Wireless Communications, vol. 5, no. 2, pp. 394-405, Feb. 2006.
[9] M. Mushkin and I. Bar-David, "Capacity and coding for the Gilbert-Elliott channels," IEEE Transactions on Information Theory, vol. 35, no. 6, pp. 1277-1290, Nov. 1989.
[10] IEEE 802.22 Wireless RAN, "Functional requirements for the 802.22 WRAN standard," IEEE 802.22-05/0007r46, Oct. 2005.
[11] D. P. Bertsekas, Dynamic Programming and Optimal Control, 2nd ed., vols. 1 and 2. Athena Scientific, 2001.
[12] S. C. Albright, "Structural results for partially observable Markov decision processes," Operations Research, vol. 27, no. 5, pp. 1041-1053, Sep.-Oct. 1979.
[13] W. S. Lovejoy, "Some monotonicity results for partially observable Markov decision processes," Operations Research, vol. 35, no. 5, pp. 736-743, Sep.-Oct. 1987.
[14] I. MacPhee and B. Jordan, "Optimal search for a moving target," Probability in the Engineering and Informational Sciences, vol. 9, pp. 159-182, 1995.
[15] W. Lovejoy, "Computationally feasible bounds for partially observed Markov decision processes," Operations Research, vol. 39, no. 1, pp. 162-175, Jan.-Feb. 1991.
[16] A. R. Cassandra, "Tony's POMDP page," http://www.cs.brown.edu/research/ai/pomdp/.
[17] S. Boyd and L. Vandenberghe, Convex Optimization, 1st ed. Cambridge University Press, 2004.
[Figure: timeline of PU states (idle/active) and SU actions (sense, transmit, ACK/NAK, quiet, idle) over slots 1-5]
Fig. 1. Operations of a primary network and secondary user. The primary network switches between active and idle according to a Markovian process. The secondary user must carry out spectrum sensing before transmitting in each time slot. Sensing can introduce false alarms (e.g., in slot 2) and mis-detections (e.g., in slot 5). Positive acknowledgments (ACK) and negative acknowledgments (NAK) are returned for successful and failed transmissions, respectively.
[Plot: normalized V(p) versus p (prob. of primary network being idle)]
Fig. 2. Normalized optimal reward function. The optimal reward function V(p) is convex and increasing in p. Pd = 90%, ci = 0, s = 1, ct = 2, rt = 3, and µ = 0.9. The optimal reward function is normalized by the reward of the Genie's policy.
[Plot: Pfa(τ) versus τ/T]
Fig. 3. Probability of false alarm for energy detection. Given that energy detection is used for spectrum sensing, for the range of τ in which 0 ≤ Pfa(τ) ≤ 0.5, Pfa(τ) is convex and decreasing in τ. Parameters: channel bandwidth = 6 MHz, oversampling ratio = 8/7, SNR = −15 dB, Pd = 90%, s = 1, ct = 2, rt = 3.
[Plot: G(0.5, τ)/T versus τ/T]
Fig. 4. Instantaneous gain as a function of sensing time. For the range of τ in which Pfa(τ) is convex and decreasing, G(p, τ) is concave in τ. Parameters: channel bandwidth = 6 MHz, oversampling ratio = 8/7, SNR = −15 dB, Pd = 90%, s = 1, ct = 2, rt = 3.
[Plot: τ/T versus p (prob. of primary network being idle) for the optimal, grid, ζm(.), and ζs(.) policies; µ = 0.5]
Fig. 5. Sensing times for different policies when the memory of the primary user switching process is set to µ = 0.5. Pd = 90%, ci = 0, s = 1, ct = 2, rt = 3.
[Plot: τ/T versus p (prob. of primary network being idle) for the optimal, grid, ζm(.), and ζs(.) policies; µ = 0.9]
Fig. 6. Sensing times for different policies when the memory of the primary user switching process is set to µ = 0.9. Pd = 90%, ci = 0, s = 1, ct = 2, rt = 3.
[Plot: τ/T versus p (prob. of primary network being idle) for the optimal, grid, ζm(.), and ζs(.) policies; µ = 0.9, ci = 1000]
Fig. 7. Effect of delay cost on sensing time. Sensing times for different policies when the memory of the primary user switching process is set to µ = 0.9. Pd = 90%, ci = 1000, s = 1, ct = 2, rt = 3.
[Plot: τ/T versus p (prob. of primary network being idle) for the optimal, ζm(.), and ζs(.) policies; µ = 0.9]
Fig. 8. A scenario where τ*(p) (sensing time of the myopic policy ζm(.)) is non-increasing in p when sensing is carried out. µ = 0.9, Pd = 90%, ci = 0, s = 1, ct = 50, rt = 80.
[Plot: normalized reward versus µ (memory of primary network switching process) for the optimal, 10-point grid, 5-point grid, ζm(.), and ζs(.) policies]
Fig. 9. Normalized rewards achieved by different control policies. The rewards achieved by different policies are normalized by that of the Genie's policy. ci = 0, s = 1, ct = 2, rt = 3.
[Plot: normalized reward versus µ (memory of primary network switching process) for the optimal, 20-point grid, 10-point grid, ζm(.), and ζs(.) policies; ci = 1000]
Fig. 10. Effect of increasing delay cost on rewards achieved by different control policies. The rewards achieved by different policies are normalized by that of the Genie's policy. ci = 1000, s = 1, ct = 2, rt = 3.