IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 56, NO. 3, MARCH 2008 895

Maximum-Likelihood Estimation, the Cramér–Rao Bound, and the Method of Scoring With Parameter Constraints

Terrence J. Moore, Brian M. Sadler, Fellow, IEEE, and Richard J. Kozick, Senior Member, IEEE

Abstract—Maximum-likelihood (ML) estimation is a popular approach to solving many signal processing problems. Many of these problems cannot be solved analytically and so numerical techniques such as the method of scoring are applied. However, in many scenarios, it is desirable to modify the ML problem with the inclusion of additional side information. Often this side information is in the form of parametric constraints, which the ML estimate (MLE) must now satisfy. We unify the asymptotic constrained ML (CML) theory with the constrained Cramér–Rao bound (CCRB) theory by showing the CML estimate (CMLE) is asymptotically efficient with respect to the CCRB. We also generalize the classical method of scoring using the CCRB to include the constraints, satisfying the constraints after each iterate. Convergence properties and examples verify the usefulness of the constrained scoring approach. As a particular example, an alternative and more general CMLE is developed for the complex parameter linear model with linear constraints. A novel proof of the efficiency of this estimator is provided using the CCRB.

Index Terms—Asymptotic normality, Cramér–Rao bound, iterative methods, maximum-likelihood (ML), method of scoring, optimization, parameter estimation, parametric constraints.

I. INTRODUCTION

MAXIMUM-LIKELIHOOD (ML) estimation is a popular approach in solving signal processing problems, especially in scenarios with a large data set, where the maximum-likelihood estimator (MLE) is in many ways optimal due to its asymptotic characteristics. The procedure relies on maximizing the likelihood function, and, in analytically intractable cases, the MLE can still be obtained iteratively through methods of optimization, e.g., Newton's method or the method of scoring. However, in many signal processing problems, it is desirable or necessary to perform ML estimation when side information is available. Often this additional information is in the form of parametric equality or inequality constraints on a subset of the parameters. Examples of side information include the constant modulus property, some known signal values (semiblind problems), restricted power levels (e.g., in networks), known angles of arrival, array calibration, precoding, etc. With the addition of

Manuscript received May 26, 2006; revised May 30, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ta-Hsin Li.

T. J. Moore and B. M. Sadler are with the Army Research Laboratory, AMSRD-ARL-CI-CN, Adelphi, MD 20783 USA (e-mail: [email protected]; [email protected]).

R. J. Kozick is with the Department of Electrical Engineering, Bucknell University, Lewisburg, PA 17837 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSP.2007.907814

these parametric constraints, this procedure is now called a constrained maximum-likelihood (CML) problem.

As a measure of performance for the MLE, the Cramér–Rao bound (CRB), obtained via the inverse of the Fisher information matrix (FIM), is a lower bound on the error covariance of any unbiased estimator. However, it is desirable to measure performance of estimators that satisfy the side information constraints. Gorman and Hero used the Chapman–Robbins bound to develop a constrained version of the CRB that lower bounds the error covariance for constrained, unbiased estimators for the case when the unconstrained model has a nonsingular FIM [1]. Marzetta simplified their derivation and formulation for the same nonsingular FIM case [2]. Then, Stoica and Ng constructed a more general formulation of the CRB that incorporates the constraint information without the assumption of a full-rank FIM [3]. Their constrained CRB (CCRB) was also shown to subsume the previous cases which require a nonsingular FIM.

The MLE is an optimal choice for an estimator in the sense that asymptotically the MLE is both consistent and efficient [4]. Asymptotic properties of the CMLE can be found in Crowder [5]. For the case of linear equality constraints, Osborne in [6] derived the asymptotic error covariance matrix of the CMLE independently from [5], which we show is equivalent to the CCRB. Osborne's result also was obtained independently from [3] and is further confirmation of the CCRB result in [3], although the CCRB is not discussed in [6]. In the context of the more general nonlinear constraint case, we unify the connection between the covariance of the asymptotic distribution of the CMLE developed by Crowder and the CCRB developed by Stoica and Ng. Specifically, we show that the CMLE is asymptotically efficient with respect to the CCRB.

Although the ML problem is easy to express, obtaining the MLE is often a difficult task. Fortunately, iterative techniques, such as the method of scoring [4], [7], are available which reach the MLE under certain conditions. Typically, scoring is applied on top of a suboptimal method which provides an initialization; for example, see [8] and [9]. However, those schemes must be adjusted when constraints have been added to the model. There are a significant number of contributions in the literature relating to constrained optimization. Some prior research has focused on developing iterative techniques based on Lagrangian methods to obtain the CMLE, e.g., [6], [10], and [11]. Amongst these are those that utilize the sequential quadratic programming (SQP) method, seeking the minimization of a cost function formulated to minimize the objective while still maintaining proximity to the constraint space; see, for instance, [12] and [13].



The constrained scoring algorithm (CSA) developed here is a generalization of the classical method of scoring to obtain the CMLE under certain conditions. The scheme relies on alternating between a projection step, similar to gradient-based descent iterations, and a restorative step that ensures that the solution satisfies the parametric constraints. The projection step is similar to the gradient projection (GP) algorithm developed by Jamshidian [14], except that the FIM replaces the negative Hessian (which in the GP scheme is sufficiently diagonally loaded to ensure positive definiteness). This departs from the usual SQP approach by replacing the negative Hessian in the quadratic model with its expected value (the FIM), although our approach still iteratively seeks the solution to a QP problem. We detail several convergence properties associated with the CSA. Convergence of iterative techniques is always dependent on the initialization, but our results show that this method obtains at least a local MLE. Thus, with a sufficiently accurate initialization, the CSA will in fact obtain the CMLE.

To demonstrate the effectiveness of the CSA, we first examine the classical CMLE problem when imposing linear constraints on a linear model. The CSA analytically solves this problem in a single step and provides an equivalent alternative to the traditional solution (e.g., see [4, p. 252] and [15, p. 299]), where our new solution is applicable under weaker conditions. We also demonstrate that our CMLE and the traditional solution are both unbiased and efficient. Furthermore, efficiency of the CMLE for the linearly constrained linear model is shown directly using the CCRB. Second, we find the CMLE after imposing nonlinear, nonconvex constraints on the signal modulus in a complex-valued linear model. Provided the initialization is sufficiently close, our simulations show evidence of unbiasedness and efficiency for this particular choice of constraint as well. Third, we consider a nonlinear model parameterization from [16]. In this case, constant modulus and semiblind constraints are applied on the signal that passes through an instantaneous mixing channel. We compare the results with those produced by Jamshidian's GP scheme when combined with the same restorative step. While the likelihood is maximized faster with GP, the CSA performs better with regard to mean-square error (MSE) on the unknown signal parameters when their directions of arrival are close.

The key contributions of the paper are: the connection and unification of the asymptotic CML theory with the CCRB, including the variance and asymptotic distribution of the CMLE; the extension of the classical scoring algorithm to a unified formulation with the CCRB and CMLE; the similarities and differences with SQP methods, in both theory and simulations; and several examples with simulations, including the special case of the linear model. This paper is developed as follows. In the following section, we formally state the CML problem and give definitions for necessary terms used throughout. In Section III, we determine the asymptotic normality properties of the CMLE, showing both consistency and asymptotic efficiency. Next, in Section IV, we develop the CSA via the method of Lagrange multipliers and natural projections. In Section V, we discuss the convergence properties of the given CSA. In Sections VI and VII, we provide some examples that illustrate the effectiveness of the CSA.

II. PRELIMINARIES

In this and subsequent sections, we use the following notation. Vectors are in bold font and matrices in uppercase bold font, where A^T, A^*, and A^H denote the transpose, the conjugate, and the conjugate transpose of A, respectively. For matrices, A^{-1} is the matrix inverse and A^† the pseudoinverse. Unless otherwise stated, all sets will be a subset of a Euclidean metric space, i.e., Θ ⊂ R^N for some positive integer N, and ||·|| is the ℓ2-norm. When applied to matrices, ||·|| is the Frobenius norm. The derivative of a scalar or vector h with respect to θ will be expressed either as ∂h(θ)/∂θ or as ∇_θ h. All expectations will be with respect to the appropriate distribution of the likelihood, i.e., E[·] = E_{p(x;θ)}[·], where p(x; θ) is the likelihood density.

A. Problem Statement & Definitions

We have a vector of observations x in a sample space satisfying the likelihood function of a known parametric form p(x; θ). We want to estimate the unknown parameter vector θ under the assumption that θ is restricted to a closed, convex set Θ_f ⊂ R^N, which we will assume can be defined parametrically, e.g., Θ_f = {θ : f(θ) = 0, g(θ) ≤ 0}.1 Let θ° ∈ Θ_f be the true vector of parameters. Then, the constrained maximum-likelihood estimator (CMLE) is given by

θ̂ = arg max_{θ ∈ Θ_f} p(x; θ).    (1)

Since −log(·) is strictly monotone decreasing, this CMLE can alternately be viewed as the solution to the following constrained optimization problem:

minimize  −log p(x; θ)    (2)

subject to  f(θ) = 0,  g(θ) ≤ 0    (3)

where the negative log-likelihood function is the objective function, and f(θ) = 0 and g(θ) ≤ 0 are the functional constraints which define the constraint set Θ_f. We make the assumption that log p(x; θ), f(θ), and g(θ) all have continuous second derivatives with respect to θ. We also require that the functional constraints be consistent, i.e., Θ_f ≠ ∅.

A point that satisfies the functional constraints is said to be feasible, and Θ_f is referred to as the feasible region. The jth inequality constraint g_j(θ) ≤ 0 is said to be active at a feasible point θ if g_j(θ) = 0; otherwise it is inactive. Thus, the equality constraints are always considered active. A feasible point θ is a regular point of the constraints in f and the active constraints in g if the vectors ∇f_1(θ), …, ∇f_K(θ), ∇g_1(θ), …, ∇g_L(θ) are linearly independent, when only the first L constraints of g are active. Thus, a regular point requires no redundancy in the active constraints. Properties for these definitions can be found in [17] and [18].

1The condition that Θ_f be convex is not so much a strict requirement as it is a convenient one. More discussion on this issue is found in Section IV-C.


Define the gradient matrices of f and g by the continuous functions

F(θ) = ∂f(θ)/∂θ^T,   G(θ) = ∂g(θ)/∂θ^T.    (4)

Note that, assuming θ is regular in the active constraint set, F(θ) has full row rank K, whereas this is not necessarily true for G(θ). Define U(θ) to be a continuous function such that for each θ, U(θ) is a matrix whose columns form an orthonormal basis for the null space of the row space of F(θ), i.e., such that

F(θ)U(θ) = 0,   U^T(θ)U(θ) = I    (5)

for every θ ∈ Θ_f. If θ is regular then F(θ) has full row rank K, so U(θ) is N × (N − K). Also, note that U(θ) is independent of θ whenever F(θ) is. This occurs in the linearly constrained case, when f(θ) = Aθ − b for some matrix A and vector b, in which case the gradient F(θ) = A is a constant and, thus, U(θ) = U.
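For illustration, the following is a minimal numerical sketch of how an orthonormal null-space matrix satisfying (5) can be computed from the constraint gradient via a singular value decomposition; the function name, the tolerance, and the use of NumPy are illustrative choices and are not part of the development above.

```python
import numpy as np

def null_space_basis(F, tol=1e-10):
    """Return U with orthonormal columns spanning the null space of the
    constraint gradient F (K x N), so that F @ U = 0 and U.T @ U = I."""
    _, s, Vh = np.linalg.svd(F)
    rank = int(np.sum(s > tol * s[0]))       # numerical rank of F
    return Vh[rank:].T                        # N x (N - rank)

# Linear constraints f(theta) = A @ theta - b have constant gradient F = A.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, -1.0]])
U = null_space_basis(A)
print(np.allclose(A @ U, 0.0), np.allclose(U.T @ U, np.eye(U.shape[1])))
```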

B. Constrained Cramér–Rao Bound

The error covariance of any unbiased estimator θ̂ of θ is lower-bounded by the Cramér–Rao bound. The classical development of the Cramér–Rao lower bound is well known [4], [19]. The bound is expressed as

Cov(θ̂) ≥ I^{-1}(θ)    (6)

where I(θ) is the FIM given by

I(θ) = E[ (∂ log p(x; θ)/∂θ)(∂ log p(x; θ)/∂θ)^T ].    (7)

The CRB and FIM exist provided certain regularity2 conditions hold. One such condition, from (6), is the requirement of a nonsingular FIM. When the model is not locally identifiable, however, the FIM is singular and the CRB is not always defined [20], [21]. To obtain a CRB, the model must include sufficient constraints on the parameters to achieve local identifiability and, hence, a nonsingular FIM. Even when the original model is locally identifiable, the classical Fisher information theory ignores the contribution of the side information.

To incorporate side information, Gorman and Hero [1] and then Marzetta [2] developed formulations of the CRB that include constraint information for the case where the FIM is full rank. Improving on their work, Stoica and Ng [3] formulated a CCRB that explicitly incorporates the active constraint information with the original FIM, be it singular or nonsingular.

2In the statistical inference literature a point θ that satisfies these regularity conditions is sometimes said to be regular, or information-regular [20]. To avoid the confusion between that definition and the optimization literature's definition in the prior subsection, all instances of regular in this paper will be in the optimization context and all instances of regularity will be in the statistical inference context.

Theorem 1 (Stoica and Ng [3]): Assume we know the active constraints and that these are all incorporated into f. Let θ̂ be an unbiased estimate of θ satisfying f(θ̂) = 0. Then, under certain regularity conditions, if U^T(θ)I(θ)U(θ) is nonsingular,

Cov(θ̂) ≥ U(θ)(U^T(θ)I(θ)U(θ))^{-1}U^T(θ) ≜ B(θ)    (8)

where equality is achieved if and only if θ̂ − θ = B(θ) ∂ log p(x; θ)/∂θ (in the mean square sense).3

An interpretation of the CCRB is given in [22]. The model is transformed into the constrained (reduced) parameter space, where the FIM is evaluated and inverted to obtain the CRB, and then transformed back into the original parameter space. In the evaluation of this CCRB, the inactive inequality constraints do not contribute any side information or affect the estimator outcome [1]. Also, rather than requiring a nonsingular FIM I(θ), the alternate condition is that U^T(θ)I(θ)U(θ) must be nonsingular. Thus, the unconstrained model might not be locally identifiable, as long as the constrained model is so. In addition, it has been shown in [22] that we may replace U(θ) with W(θ) in (8), where F(θ)W(θ) = 0 and rank W(θ) = rank U(θ).
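As a concrete illustration of Theorem 1, the following sketch evaluates the bound in (8) from a (possibly singular) FIM and a null-space matrix of the constraint gradient; the toy numerical values are arbitrary and serve only to show that the bound exists even when the unconstrained FIM is singular.

```python
import numpy as np

def ccrb(fim, U):
    """Stoica-Ng CCRB of Theorem 1: B = U (U^T I U)^{-1} U^T.
    The FIM may be singular; only U^T I U needs to be invertible."""
    M = U.T @ fim @ U
    return U @ np.linalg.inv(M) @ U.T

# Toy example: a singular FIM made identifiable by the constraint
# theta_1 = theta_2 (values are arbitrary, for illustration only).
fim = np.array([[2.0, 2.0, 0.0],
                [2.0, 2.0, 0.0],
                [0.0, 0.0, 1.0]])            # rank 2, singular
A = np.array([[1.0, -1.0, 0.0]])             # F(theta) = A for this constraint
_, _, Vh = np.linalg.svd(A)
U = Vh[1:].T                                  # orthonormal null-space basis of A
B = ccrb(fim, U)                              # constrained bound despite singular FIM
```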

III. ASYMPTOTIC NORMALITY OF THE CMLE

Asymptotic properties of the MLE can be found in [4] and [19]. Asymptotic properties of the CMLE can be found in Crowder [5]. Next, we unify Crowder's results and Theorem 1.

Theorem 2: Assuming the pdf p(x; θ) satisfies certain regularity conditions, e.g., as in Theorem 1, then the CMLE θ̂_M, where M is the sample size, is asymptotically distributed according to

θ̂_M ∼ AsN(θ°, B(θ°))    (9)

where θ° is the true parameter vector and B(θ°) is the CCRB in (8).

Proof: Crowder shows that the CMLE is asymptotically

distributed according to

(10)

where

(11)

and holds for any arbitrary positive-definite matrix W. (It is shown in [5] that (11) is unique independent of W, i.e., it is the same for all possible W.) Note that this variance matrix has the same structure as the Marzetta form of the CCRB when W is the FIM and the FIM is full rank.

3The natural extensions to the classical model, e.g., the CRB on differentiable transformations [4, p. 45] and the CRB for biased estimators, also can be applied to this constrained formulation [16].


Applying an algebraic identity in Khatri [23, Lemma 1], which is also useful in equating the Marzetta and Stoica forms of the CCRB, and noting that F(θ)U(θ) = 0, we have

(12)

(13)

(14)

(15)

Theorem 2 encompasses the classical asymptotic normality property for the MLE [4] provided the FIM is nonsingular, in the absence of equality constraints (in which case U(θ) may be taken as the identity and B(θ) = I^{-1}(θ), since F(θ) is null). And, this is another verification of the CCRB result in (8).

IV. SCORING WITH CONSTRAINTS

An analytic, closed-form solution of the MLE is sometimes found from the first-order conditions on the log-likelihood by solving for θ in

∂ log p(x; θ)/∂θ = 0.    (16)

Solutions are the stationary points, and if a unique solution exists, it is the MLE. Similarly, a solution for the CMLE can be found by solving for θ on the feasible arc satisfying

U^T(θ) ∂ log p(x; θ)/∂θ = 0.    (17)

Again, if a unique solution exists, it is the CMLE [24].

In general, however, analytic solutions of the MLE and CMLE problems are unavailable, motivating iterative procedures to attain the CMLE. In this section, we derive a generalization of the classical scoring algorithm by incorporating the side information contained in the constraints.

A. Projection Step

Iterative linear projection updates are typically developed from gradient methods of the form θ_{k+1} = θ_k + β_k d_k, where {β_k} and {d_k} are suitably chosen sequences of step sizes and directions [17], [18], [25]–[27]. Of particular importance in the class of gradient methods is the subclass of quasi-Newton methods of the form

θ_{k+1} = θ_k − β_k D_k ∇_θ J(θ_k)    (18)

where {D_k} is a sequence of positive definite symmetric matrices and the gradient is of the objective function J, which in our case is the negative log-likelihood function. Since the gradient of the objective function is the direction of steepest ascent, its negative is the direction of steepest descent. The matrix D_k then projects this direction vector to some purpose, dictated by the model. This class of methods includes the method of steepest descent (D_k = I), Newton's method (D_k the inverse of the Hessian of J), as well as the method of scoring (D_k = I^{-1}(θ_k)). None of these methods, however, consider the information and restrictions from the constraints.

Given the general iterative form, we derive a CSA via a Taylor expansion.4 For any iterate θ_k satisfying the constraints, the row space of F(θ_k) spans the directions of steepest descent and ascent of f at θ_k. Thus, for the step to be feasible (or nearly so), it is necessary that d_k be orthogonal to this gradient matrix, i.e., d_k ∈ span U(θ_k). Let b_k be such that d_k = U(θ_k)b_k, where U(θ) is defined in (5). So, the iterative form is now

θ_{k+1} = θ_k + β_k U(θ_k) b_k.    (19)

Now, consider the Taylor expansion of the log-likelihood about θ_k along the vector U(θ_k)b. This is given by

log p(x; θ_k + U(θ_k)b) ≈ log p(x; θ_k) + b^T U^T(θ_k) ∂ log p(x; θ_k)/∂θ + (1/2) b^T U^T(θ_k) [∂² log p(x; θ_k)/∂θ ∂θ^T] U(θ_k) b    (20)

≈ log p(x; θ_k) + b^T U^T(θ_k) ∂ log p(x; θ_k)/∂θ − (1/2) b^T U^T(θ_k) I(θ_k) U(θ_k) b.    (21)

Note above that we replace the negative Hessian with the Fisher information as well as drop the higher order terms, which is the identical step taken in the development of the method of scoring. Using the FIM departs from the usual SQP methods, but it is hoped that this replacement improves the stability of the iterations [4]. One benefit is the elimination of the effect of the noisy observations on the negative Hessian, which is desirable if the goal is to optimize the parameter estimation regardless of the noise realization. Another benefit is that the FIM is always positive semidefinite, even when the Hessian is not. We have, approximately, a quadratic log-likelihood model restricted to the subspace spanned by the columns of U(θ_k), i.e., the space which locally preserves the active constraints f(θ) = 0. The maximum of this quadratic model is

(22)

where b* is a stationary point of the approximation. Hence, differentiating (22) and setting the result to zero, the optimal b must satisfy

U^T(θ_k) I(θ_k) U(θ_k) b = U^T(θ_k) ∂ log p(x; θ_k)/∂θ.    (23)

Note that the Hessian of the log-likelihood approximation must be negative definite as well for b* to maximize this function. Indeed, in this case, the Hessian of the approximation is −U^T(θ_k)I(θ_k)U(θ_k), so we must have that U^T(θ_k)I(θ_k)U(θ_k) is positive definite. The matrix is already positive semidefinite, so nonsingularity is equivalent to positive definiteness here.

4There are a number of ways to incorporate the constraints into an iterative method, e.g., utilizing a general Taylor expansion of the Lagrangian optimality conditions as in [24], applying the method of Newton elimination, or utilizing a directional Taylor approximation of the likelihood as discussed here.


Also, recall that nonsingularity of U^T(θ)I(θ)U(θ) is a regularity condition for existence of the CCRB and a requirement for local identifiability. Solving for b in (23) and substituting into (19), we obtain the projection step of the CSA

θ_{k+1} = θ_k + β_k U(θ_k)(U^T(θ_k)I(θ_k)U(θ_k))^{-1}U^T(θ_k) ∂ log p(x; θ_k)/∂θ = θ_k + β_k B(θ_k) ∂ log p(x; θ_k)/∂θ    (24)

which, instead of utilizing the CRB as the projection matrix, utilizes the CCRB.5 This is a generalization of the classical method of scoring, since in the absence of constraints F(θ) is the null matrix, U(θ) may be taken as the identity, and thus B(θ) = I^{-1}(θ). It is necessary to include the coefficient β_k to ensure each step is also usable, i.e., that the resulting iterate reduces the objective or, in this case, the negative log-likelihood.
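For illustration, a minimal sketch of the projection step (24) is given below, assuming user-supplied routines for the score, the FIM, and the null-space matrix of (5); the names and interfaces are illustrative assumptions.

```python
import numpy as np

def csa_projection_step(theta, beta, score, fim, null_basis):
    """One projection step of constrained scoring, cf. (24):
    theta_new = theta + beta * U (U^T I U)^{-1} U^T s(theta)."""
    U = null_basis(theta)            # columns span the locally feasible directions
    s = score(theta)                 # gradient of the log-likelihood at theta
    b = np.linalg.solve(U.T @ fim(theta) @ U, U.T @ s)
    return theta + beta * (U @ b)
```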

For linear constraints, we can note the similarity between this update scheme and that of Jamshidian [14], where the linearly constrained maximum-likelihood problem is considered, i.e., maximizing log p(x; θ) subject to Aθ = b. The update for Jamshidian's GP algorithm is given by

θ_{k+1} = θ_k + β_k [W^{-1} − W^{-1}A^T(AW^{-1}A^T)^{-1}AW^{-1}] ∂ log p(x; θ_k)/∂θ    (25)

where W is some positive definite matrix. It is suggested that a good choice for W, with regard to the local rate of convergence, is a diagonally loaded version of the negative Hessian, W = −∂² log p(x; θ_k)/∂θ ∂θ^T + γI, where γ is chosen sufficiently large to ensure the positive definiteness of W. The FIM is always positive semidefinite, so when it is nonsingular, we might set W = I(θ_k). In this case, the GP projection step is identical to CSA, since the matrix coefficient in (25) is then the Marzetta formulation of the CCRB [2], which is equivalent to B(θ_k) [3]. This inspires a generalization of Jamshidian's update, given by

θ_{k+1} = θ_k + β_k U(θ_k)(U^T(θ_k) W U(θ_k))^{-1}U^T(θ_k) ∂ log p(x; θ_k)/∂θ    (26)

where U(θ) is defined as in (5) and where W now only needs to be positive semidefinite. For the linear model, the Hessian and the FIM are identical, and so if the FIM is nonsingular, the GP update and CSA update are identical. In any case, an alternative to diagonally loading a positive semidefinite matrix is given by (26).

It is important to note here that the CSA is not another quasi-Newton method, as the corresponding premultiplying matrix B(θ_k), corresponding to D_k in (18), is not positive definite, but rather positive semidefinite. Indeed, for B(θ) to be positive definite, U(θ) would necessarily be full row rank, which requires that F(θ) be null, i.e., an unconstrained model, corresponding to the classical method of scoring. Thus, B(θ) will always be singular in the presence of constraints on θ.

5A special case of (24) for linear constraints was presented in [6], and a nonstatistical formulation similar to (24) exists for the conventional optimization problem, again with linear constraints, in [25, p. 178].

Our constrained scoring formulation (24), as experienced with the Newton or classical scoring algorithm, does not necessarily lead to convergent sequences for any fixed step size β. Since we lack knowledge of the Lipschitz condition satisfied by the objective, β must be varied by some appropriate step-size rule that will guarantee usability and will stabilize the convergence. If we did have that knowledge, we could possibly choose a fixed step size that enables the iteration to be a local contraction mapping. Instead, we must employ a rule with a variable step size [27], such as a minimization rule, a diminishing step-size rule, or a successive step-size rule (e.g., the Armijo rule).
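To make the successive step-size option concrete, the following is a minimal backtracking (Armijo-type) sketch for a maximization problem; the acceptance test and the constants are illustrative choices rather than the specific rule analyzed later.

```python
import numpy as np

def armijo_beta(theta, direction, grad, loglik,
                beta0=1.0, sigma=1e-4, tau=0.5, max_halvings=30):
    """Backtracking (Armijo-type) step-size rule for maximizing loglik:
    shrink beta until the step gives a sufficient increase."""
    f0 = loglik(theta)
    slope = float(grad @ direction)          # nonnegative for an ascent direction
    beta = beta0
    for _ in range(max_halvings):
        if loglik(theta + beta * direction) >= f0 + sigma * beta * slope:
            break
        beta *= tau                          # shrink the step and retry
    return beta
```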

B. Restoration Step

In the development of (24), we assumed θ_k was feasible, but the update is only guaranteed to be feasible if the constraints are linear. Thus, while the sequence generated by (24) may converge, it might not converge to a point in the constraint set Θ_f. The process to correct this error is the restoration step.

We restore the second iterate back onto the constraint set using a projection. Since Θ_f is convex, the Projection Theorem [27] favors using the natural projection. When the projection is the natural one, the Projection Theorem says it is uniquely determined. Let Π[·] be the natural projection onto Θ_f. Then, the method of scoring with constraints, or the CSA, is given by

θ_{k+1} = Π[ θ_k + β_k B(θ_k) ∂ log p(x; θ_k)/∂θ ]    (27)

where β_k satisfies one of the previously listed step-size rules. This is similar to a two-metric projection method, which typically has improved performance over simple gradient projections [27]. A generalization of Jamshidian's GP update, which we will call GP2, is similarly defined by

θ_{k+1} = Π[ θ_k + β_k U(θ_k)(U^T(θ_k) H_γ(θ_k) U(θ_k))^{-1} U^T(θ_k) ∂ log p(x; θ_k)/∂θ ]    (28)

where H_γ(θ) = −∂² log p(x; θ)/∂θ ∂θ^T + γI, and γ ≥ 0 is such that H_γ(θ) is at least positive semidefinite. This amounts to the replacement of the Fisher information I(θ_k) in (27) with the diagonally loaded Hessian of the negative log-likelihood H_γ(θ_k) in (28). We will compare CSA and GP2 in the examples in Sections IV-C5, VI-B, and VII.
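A compact sketch of the full iteration (27) is given below, alternating the CCRB-based projection step with a restoration onto the constraint set; the stopping test on the reduced score, the diminishing step-size rule, and the callable interfaces are illustrative assumptions.

```python
import numpy as np

def constrained_scoring(theta0, score, fim, null_basis, project,
                        max_iter=100, tol=1e-8):
    """CSA sketch, cf. (27): alternate the CCRB-based projection step
    with a restoration (projection back onto the constraint set)."""
    theta = project(theta0)                      # start from a feasible point
    for k in range(max_iter):
        U = null_basis(theta)
        s = score(theta)
        reduced = U.T @ s                        # score along the feasible directions
        if np.linalg.norm(reduced) < tol:        # constrained stationarity, cf. (17)
            break
        b = np.linalg.solve(U.T @ fim(theta) @ U, reduced)
        beta = 1.0 / (k + 1)                     # diminishing step-size rule
        theta = project(theta + beta * (U @ b))  # restoration step
    return theta
```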

C. Implementation of the Constrained Scoring Algorithm

1) Initialization: Finding a proper initialization for any particular adaptive/iterative scheme is important, but always dependent upon the characteristics of the particular objective function the scheme is applied to. If the objective function is convex (e.g., the quadratic negative log-likelihood case), then any initial value will suffice. But this is not true for many objectives, and convergence to the global minimum requires that the initialization be within a set containing the global minimum where the objective function is locally convex (of course, this information may be very hard to come by in some cases).


In our examples (see Sections VI and VII), we consider linear and nonlinear problems. In the nonlinear case with a constant modulus constraint, we utilize a very nice (suboptimal) algorithm for initialization called the algebraic constant modulus algorithm (ACMA) [28]. This algorithm is not too computationally complex and explicitly utilizes the constant modulus constraint, but makes no claims with regard to being maximum likelihood. We are interested in scenarios with constraints, such as training symbols, in addition to the constant modulus assumption, which ACMA (or a similar scheme) does not take advantage of. We show how further optimization improves this initialization.

Our approach is therefore similar to many signal processing problems, in that we use a suboptimal algorithm for an initialization, and then apply the locally converging CSA. And, if we have a convex problem, then we can make stronger statements about global convergence.

2) Restoration: It is desirable that the projection be a somewhat simple operation, e.g., planar or spherical restrictions, so as not to be a computational burden. However, constraint sets may arise where the projection cannot be expressed analytically. For these scenarios, there exist schemes to minimize the error away from Θ_f to a predetermined acceptable level [25], e.g., updating the iterate to a point closer to the constraint set via

(29)

Alternatively, a common approach is to minimize the error away from Θ_f via the minimization of a cost function [6], [12], e.g.,

(30)

for some penalty weight and where the step direction is still given by (24).

3) Complexity: The CSA is in the class of null-space algorithms, exploiting U(θ) directly. Instead of employing the inverse of the Hessian or the FIM as done in quasi-Newton methods, we employ the always singular CCRB matrix to minimize the constrained score function.6 In fact, given the unconstrained FIM I(θ), (24) and (27) provide a simple algorithm to verify CML performance. If the quadratic negative log-likelihood model holds and the constraints are linear, then when initialized with a feasible point, the CSA solves the CMLE in a single step (see Section VI-A).

The higher complexity components of the CSA (see Fig. 1) include the restoration step (if the projection is not simple), the matrix inverse component of the CCRB in (27), and the evaluation of an expectation (an integration) in the FIM. The complexity of some components in the algorithm is given in Table I. However, in many cases, it is possible to find explicit expressions of, e.g., U(θ) and I(θ) (see the examples in Sections VI-B and VII), which reduce the complexity.

6The CCRB is only positive semidefinite, not positive definite, since the null space matrix U(θ) can never be full row rank when constraints are applied.

Fig. 1. Constrained Scoring Algorithm.

TABLE I: COMPLEXITY OF THE CSA PER ITERATION (WHERE N IS THE NUMBER OF PARAMETERS AND J IS THE DIMENSION OF THE CONSTRAINED SPACE Θ_f)

Also, we have shown that it is sufficient to use a matrix whose columns are only linearly independent and span the same space as the columns of U(θ) [22]. Otherwise, standard complexity reduction techniques may be applied to the CSA, e.g., a variation of the modified Newton's method where the FIM component is computed only once or only every few iterations, or a DFP-like (Davidon–Fletcher–Powell) update for the FIM.

4) Convexity: Until now, we have also always assumed that the projection has been onto a convex set. This condition guarantees that each projection in the restoration step is unique; this is shown via the Projection Theorem [27], [29]. However, note that this condition was not used in Section III, nor was convexity used in the development of the projection step. In either case, only connectivity of the parameter space was required. This is important to point out, since many regions defined by a nonlinear equality constraint are nonconvex. However, these nonlinear equality constraints are often the ones of practical interest. So, if we simply let Θ_f be a connected set, we can make the following enhancement in the definition of Π[·] in (27): let Π[θ] be the natural projection of θ onto Θ_f with the minimal distance. For practical sets, e.g., a sphere in R^N, the projection is almost everywhere unique.

5) Similarities of CSA and GP2: The CSA and GP2 schemes are equivalent under the general complex parameter linear model, as well as asymptotically equivalent. For the linear model case, the Fisher information is exactly the Hessian of the negative log-likelihood, i.e., I(θ) = −∂² log p(x; θ)/∂θ ∂θ^T (see Section VI). Hence, with no diagonal load, the steps are equivalent. Furthermore, with a small diagonal load, e.g., 0 < γ ≪ 1, then

(31)

(32)

(33)

so that the steps are nearly equivalent. To obtain the second equation above, we used the matrix inversion lemma [4, p. 571].

In the asymptotic case, as the number of samples M → ∞, where x = (x_1, …, x_M) and M is the sample size, the Hessian of the negative log-likelihood on the M samples converges almost surely as

−(1/M) ∂² log p(x; θ°)/∂θ ∂θ^T → I_1(θ°)    (34)

by the law of large numbers, where by I_1(θ) we mean the Fisher information on any single sample. Likewise, the Fisher information on the M samples converges in probability as

(1/M) I(θ̂_M) → I_1(θ°)    (35)

by the additive property of information and consistency of the CMLE [19]. So, as M → ∞, the steps for GP and CSA are nearly equivalent (near θ°) as the Hessian and the FIM are asymptotically equivalent.

In the next section, it will be shown that the CSA at least converges to a local maximum of the likelihood. The only requirement is that U^T(θ)I(θ)U(θ) be positive definite. However, this is also a requirement for the existence of the CCRB [3].

V. CONVERGENCE PROPERTIES

In this section, we examine convergence properties of the CSA, motivated by the properties of projected gradients presented in [30]. The obvious statement regarding convergence is that the sequence of iterates {θ_k} has the CMLE as its limit provided the initial guess is sufficiently close. For convenience, we will choose to analyze only the convergence properties of the CSA with the Armijo step-size rule, although it is not too difficult to modify these proofs for a similar rule. By construction, we shall show the algorithm is one that converges to a local stationary point if there is indeed a local maximum of the likelihood.

Let {θ_k} be any sequence generated by the CSA. Thus, θ_0 is an arbitrarily chosen feasible point, and successive iterates are determined by the CSA in (27). (Of course, θ_0 would not be chosen so randomly, but we ignore this aspect of the convergence for the moment.) Define L_k ≜ {θ ∈ Θ_f : log p(x; θ) ≥ log p(x; θ_k)}, the level set of the kth iterate.7 Now, by definition of the chosen Armijo rule [27],

7In this section, we will assume Θ_f is convex so that Π[·] is the natural projection.

(36)

for some σ ∈ (0, 1). Thus

log p(x; θ_{k+1}) ≥ log p(x; θ_k)    (37)

for every positive integer k, and we have the following fact about {log p(x; θ_k)}.

Property 1: The sequence {log p(x; θ_k)} is a monotone increasing sequence. Furthermore, if it is bounded above, then it converges.

Thus, given an iterate θ_k, all successive iterates are contained in L_k, i.e., θ_j ∈ L_k for all j ≥ k. The second statement is simply the monotone convergence principle from analysis. We will assume that the likelihood is bounded above, so that this sequence always converges. This leads to the following result.

Property 2: If {log p(x; θ_k)} is bounded above, the sequence of updates vanishes as k → ∞.

Proof: Again from (36), the update is bounded by the product of a term from a sequence bounded below and the increase in the log-likelihood. Since {log p(x; θ_k)} converges, the bound is made arbitrarily small for sufficiently large k. Thus, the same is true for the update.

This does not imply that the sequence {θ_k} converges. However, if {θ_k} is bounded, then the Bolzano–Weierstrass theorem [31] implies the existence of a cluster, or accumulation, point, i.e., the existence of a subsequence that does converge.

Property 3: If Θ_f is compact, then cluster points of the sequence {θ_k} are also stationary points.

Recall, a point is stationary if and only if it satisfies the Karush–Kuhn–Tucker (KKT) conditions of the Lagrangian [18], [29]. The initial iterate is feasible and the restoration preserves feasibility. Since the active constraints are in f, the only additional condition is ∂ log p(x; θ)/∂θ + F^T(θ)λ = 0, where λ is a vector of Lagrange multipliers. So θ is a stationary point if and only if

∂ log p(x; θ)/∂θ = −F^T(θ)λ.    (38)

Hence U^T(θ) ∂ log p(x; θ)/∂θ = −U^T(θ)F^T(θ)λ = 0. Thus, a point is stationary if and only if

U^T(θ) ∂ log p(x; θ)/∂θ = 0.    (39)

By definition of the Armijo rule, we also have θ_{k+1} = θ_k at stationary points. While this step is necessary in the method of gradient projection, it is not so here as can be seen above, since the end result would hold regardless.


Proof of Property 3: Let θ̄ be a cluster point of the sequence {θ_k}. Then, there exists a convergent subsequence {θ_{k_j}} which has θ̄ as its limit. Note that

(40)

(41)

where the first inequality (40) is due to the triangle inequality and the inequality (41) is due to projections on convex sets. (Note that (41) is only an approximation, for small steps, when Θ_f is nonconvex.) Continuing,

(42)

(43)

where (42) is the triangle inequality again. The first term in (43) is bounded as

(44)

(45)

by the triangle inequality again in (44), and the property that ||Av|| ≤ ||A|| ||v|| for any matrix A and vector v [32] in (45). Substituting (45) into (43), we have

(46)

Since Θ_f is compact, the terms appearing in (46) are bounded. Note the inequality holds for all k_j, so let j → ∞. Then we have that θ_{k_j} → θ̄ and, by continuity, the corresponding terms converge. Now the final term of (46) vanishes by Property 2. Thus, the right side of (46) can be made arbitrarily small and, therefore,

U^T(θ̄) ∂ log p(x; θ̄)/∂θ = 0    (47)

i.e., θ̄ is stationary. The statement holds true as well if L_k is compact for any k.

Property 4: If L_0 is compact for all sequences initialized in a set and there is a unique cluster point θ̄ for all such sequences, then θ_k → θ̄ for every such sequence. Also, θ̄ is the minimum of the negative log-likelihood over that set.

Essentially, if only one accumulation point exists for all such sequences, it must be the CMLE, i.e., θ̄ = θ̂.

VI. LINEAR MODEL WITH CONSTRAINTS

Suppose now we have a vector of observations x of a complex-valued parameter vector θ given by the following linear model:

x = Hθ + n    (48)

where H is a known complex-valued observation matrix and n is random noise from a complex normal distribution with known covariance C. The true parameter vector will be θ°. The MLE of this problem is well known [4, p. 528], and also happens to be the best linear unbiased estimator (BLUE) and the minimum variance unbiased (MVU) estimator

θ̂ = (H^H C^{-1} H)^{-1} H^H C^{-1} x.    (49)

This MLE is also efficient, i.e., it has a covariance matrix equal to the complex CRB

Cov(θ̂) = (H^H C^{-1} H)^{-1} = I^{-1}    (50)

where I = H^H C^{-1} H is the complex FIM [4, p. 529] of the model. Note that the log-likelihood is quadratic with respect to θ; thus, the Fisher score and Hessian are given by8

(51)

(52)

Hence, the FIM and CRB are constant in θ. Thus, for convenience in this section, we will simply denote the complex FIM as I.

8There are several definitions of the complex derivative. We define it to be ∂/∂θ = (1/2)(∂/∂(Re θ) − j ∂/∂(Im θ)) for complex-valued θ. A benefit of this definition is that numerous results are preserved for strictly real-valued parameters.
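For illustration, the following sketch computes the MLE (49) and the complex CRB (50) for a toy instance of the linear model; the random values are arbitrary, and the scaling convention of the complex FIM follows the usual H^H C^{-1} H form cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 6, 3
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
C = np.eye(M)                                    # known noise covariance
theta_true = rng.standard_normal(N) + 1j * rng.standard_normal(N)
x = H @ theta_true + 0.1 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))

Cinv = np.linalg.inv(C)
fim = H.conj().T @ Cinv @ H                      # complex FIM, constant in theta
theta_ml = np.linalg.solve(fim, H.conj().T @ Cinv @ x)   # MLE / BLUE, cf. (49)
crb = np.linalg.inv(fim)                         # complex CRB, cf. (50)
```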


The complex model can be described in terms of real-valued parameters as well. This representation is necessary for the application of the CSA later. Define the real parameter vector θ_R = [Re θ^T, Im θ^T]^T; then the Fisher score and Hessian with respect to θ_R are given by

(53)

(54)

Naturally, the real-valued FIM and CRB are still constant with respect to θ_R.

A. Linear Constraints

Now, suppose that a linear constraint is assumed on the complex parameter vector θ, i.e.,

Aθ = b    (55)

where A and b are known. Then F(θ) = A for all θ. If we assume that the constraint is regular, i.e., A is full row rank, and that H is full column rank, then the CMLE can be expressed as the constrained least squares estimator (CLSE) [4, p. 252]

θ̂_c = θ̂ − I^{-1}A^H(A I^{-1} A^H)^{-1}(Aθ̂ − b)    (56)

= [I^{-1} − I^{-1}A^H(A I^{-1} A^H)^{-1}A I^{-1}] H^H C^{-1} x + I^{-1}A^H(A I^{-1} A^H)^{-1} b    (57)

where θ̂ is the unconstrained MLE (49) and I = H^H C^{-1}H. This second form of the CMLE in (57) uses the Marzetta derivation of the CCRB [2] for the expression in the brackets, albeit in complex form. Given this knowledge, it can be shown that this estimator is unbiased and efficient, a result that appears to be undocumented in the literature, even for the real parameter case. (A more general result will be proved in Theorem 3.) The second term in (57) is a specific feasible point of Θ_f. However, this formulation requires both a nonsingular complex FIM

(or, equivalently, a full column rank H) and a full row rank A, conditions which might not hold. It is interesting to compare the GP update in (25) and the CLSE (57) and their use of the constrained bound. And, as we show next, using the CSA to circumvent these conditions essentially turns the problem into a standard null-space quadratic exercise [17].

First, we need to convert the problem to real-valued parameters. Note that (55) is equivalent to the following real-parameter constraint set:

where

(58)

Let U be the matrix defined by (5) for this constraint. It can be shown that if Ũ is defined to be the matrix whose columns form an orthonormal basis of the null space of A, so that AŨ = 0 and Ũ^H Ũ = I, then

(59)

Let θ_1 be any vector satisfying the constraints so that Aθ_1 = b. For example, if the space is still regular (A is full row rank), then A^H(AA^H)^{-1}b is such a vector; otherwise let θ_1 = A^†b. However, any feasible vector will work. Note, the corresponding point in the real parameterization is [Re θ_1^T, Im θ_1^T]^T.

As stated earlier, since the log-likelihood is quadratic and the constraints linear, the CSA finds the CMLE in one step to be the linear estimator given by

(60)

(61)

where B = Ũ(Ũ^H I Ũ)^{-1} Ũ^H is defined similarly as in (8) to be the complex CCRB for this constrained problem.9

Comparing this CMLE result with the prior CLSE result in (56), we see the only requirement now is that HŨ be full column rank, whereas (56) required both H to be full column rank and A to be full row rank. In other words, the prior solution requires information-regularity and a regular point solution in the original problem; the solution given here only requires the alternative information-regularity condition (see Theorem 1, or [3]). This leads to the following result.10

Theorem 3: If the complex-valued observations are described by a linear model of the form

x = Hθ + n    (62)

where H is a known matrix, θ is an unknown parameter subject to the linear constraint Aθ = b with the true parameter being θ°, and n is a noise vector with pdf CN(0, C), then provided HŨ is full column rank, where Ũ is an orthonormal basis for the null space of A as above, the CMLE of θ is given by

θ̂_c = θ_1 + B H^H C^{-1}(x − Hθ_1)    (63)

where B = Ũ(Ũ^H H^H C^{-1} H Ũ)^{-1} Ũ^H is the complex CCRB and θ_1 is any arbitrary feasible point (e.g., θ_1 = A^†b). The estimator θ̂_c is unbiased and is an efficient estimator which attains the CCRB

9It should be emphasized that the complex CCRB is of the form given by B only for this particular model and constraint. For general constraints, the covariance matrix of the real-parameter estimator may not assume the necessary form [4, pp. 524–532].

10It is suspected by the authors that this result may be previously known, but a reference for the result could not be found. It can be shown by a transformation of the model into a reduced parameter space linear model, with efficiency shown with respect to the CRB in the reduced parameter space. To the best of our knowledge, the approach in the proof of Theorem 3, proving the efficiency of the CMLE directly using the CCRB, is new.


and, therefore, is the BLUE and the MVU estimator. Furthermore, when H is full column rank and A is full row rank, then θ̂_c coincides with the CLSE of (56).

Proof: First note that since AB = 0, the estimator satisfies the constraints

Aθ̂_c = Aθ_1 + AB H^H C^{-1}(x − Hθ_1) = Aθ_1 = b.    (64)

Next, θ° − θ_1 = Ũα for some α, since by a Taylor expansion A(θ° − θ_1) = 0. Hence, we have the following convenient property:

B H^H C^{-1} H (θ° − θ_1) = Ũ(Ũ^H H^H C^{-1} H Ũ)^{-1} Ũ^H H^H C^{-1} H Ũα = Ũα = θ° − θ_1.    (65)

Thus, the CMLE is also unbiased,

E[θ̂_c] = θ_1 + B H^H C^{-1} H (θ° − θ_1) = θ°    (66)

and the estimator is efficient, i.e., its covariance matrix is the CCRB

Cov(θ̂_c) = B H^H C^{-1} Cov(x) C^{-1} H B^H = B H^H C^{-1} H B = B.    (67)

So this more general CMLE is also the BLUE as well as the MVU estimator in the general linear model under the linear constraint. To see equivalence to the CLSE expression (56), assume that the FIM is nonsingular and the constraint space is regular. Then B = I^{-1} − I^{-1}A^H(A I^{-1} A^H)^{-1}A I^{-1}, i.e., the Stoica CCRB formulation and the Marzetta formulation are identical [3], and so

(68)

In addition to being an alternative CMLE, (63) is also an alternative CLSE.
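The following is a minimal sketch of the estimator in (63), assuming consistent linear constraints and a nonsingular reduced FIM as required by Theorem 3; the choice of feasible point via least squares and the function interface are illustrative assumptions.

```python
import numpy as np

def constrained_linear_cmle(x, H, Cinv, A, b):
    """CMLE of Theorem 3, cf. (63), for x = H theta + n under A theta = b:
    theta_hat = theta1 + B H^H C^{-1} (x - H theta1), with theta1 any
    feasible point and B the complex CCRB built from the null space of A."""
    theta1 = np.linalg.lstsq(A, b, rcond=None)[0]     # a feasible point (constraints consistent)
    _, s, Vh = np.linalg.svd(A)
    rank = int(np.sum(s > 1e-12 * s[0]))
    U = Vh[rank:].conj().T                            # orthonormal null space of A
    fim = H.conj().T @ Cinv @ H
    B = U @ np.linalg.inv(U.conj().T @ fim @ U) @ U.conj().T
    return theta1 + B @ (H.conj().T @ Cinv @ (x - H @ theta1))
```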

B. Nonlinear Constraints

As an example of nonlinear parametric constraints imposed on (48), we consider the constraint set Θ_f = {θ : |θ_i| = 1, i = 1, …, N}, where all the elements of θ are restricted to be of unit modulus. This was one of the constraints applied in [16] in the evaluation of performance bounds with a different model. It should be noted that this set is not convex, but natural projections onto Θ_f are almost everywhere (a.e.) unique, i.e., unique except on a set of measure zero. The gradient matrix of the constraints is then

(69)

where D(θ) = diag(θ), i.e., a diagonal matrix with ith row-column element θ_i. A matrix whose columns form an orthonormal null space of this gradient was found in [16] to be

(70)

This results in the following CCRB:

(71)

Thus, the CSA is given by

(72)

where Π[·] is the projection onto Θ_f and the definition holds for all θ.

For simulation, the elements of a 7 × 8 observation matrix H were chosen from a complex normal distribution, and the elements of θ were generated as 8-phase shift keying (8-PSK) symbols chosen at random; both are fixed throughout the example. We consider the average performance of the MSE of θ̂ versus the SNR over 5000 realizations of the spatially white noise n. We employ the diminishing step-size rule (all steps were usable) in the CSA and GP2, with the reasonably close initial estimate of the CMLE being the projection of the MLE (49) onto Θ_f, i.e.,

(73)

This initial value, as one would expect, turns out in general not to be the CMLE. However, the estimate is often (but not always) a sufficiently adequate initialization.
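For completeness, a sketch of the natural projection onto the unit-modulus set used in (73) is given below; the handling of (measure-zero) zero-valued elements is an arbitrary convention, as noted above.

```python
import numpy as np

def project_unit_modulus(theta, eps=1e-12):
    """Natural projection onto the unit-modulus set {theta : |theta_i| = 1}:
    scale each element to the unit circle (a.e. unique; the convention
    at theta_i = 0 is arbitrary)."""
    theta = np.asarray(theta, dtype=complex)
    mags = np.abs(theta)
    return np.where(mags > eps, theta / np.maximum(mags, eps), 1.0 + 0j)
```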


Fig. 2. Average mean-square error of the CSA and a projected GP2 compared with the CCRB. The P-MLE curve is actually the projection of the MLE, or LSE, solution onto the constraint space and is the initial value of both the CSA and GP2 methods for this case.

In this scenario, the Hessian and the FIM are identical, so one would expect there to be minimal differences between CSA and GP2 (which requires a slight diagonal loading due to a zero eigenvalue). Fig. 2 demonstrates the result. As expected, both algorithms converge to the CCRB at the same rate and over the same range of SNR values. The slight diagonal load for GP2 has virtually no impact on the difference between the GP2 and CSA linear updates. Significantly higher diagonal loads, although improving the condition number of the Hessian and the rate of convergence of the iterations, lead to a slight degradation in MSE performance compared with the CCRB for the GP2 scheme at higher SNR values.

VII. NONLINEAR MODEL EXAMPLE

By a nonlinear model, we mean a model which is nonlinear with respect to the parameter vector θ or θ_R. Unlike in the previous section, there is no inherent guarantee that maximizing the likelihood also minimizes the MSE of the parameter vector. It depends on the model. However, we still have the property that if an efficient estimator exists for the constrained problem then it must be the CMLE. Recall the condition for equality with the CCRB in Theorem 1:

θ̂ − θ = B(θ) ∂ log p(x; θ)/∂θ    (74)

(in the mean square sense). This implies that B(θ̂) ∂ log p(x; θ̂)/∂θ = 0. When U^T(θ̂)I(θ̂)U(θ̂) is positive definite, this implies that ∂ log p(x; θ̂)/∂θ ∈ span F^T(θ̂), which satisfies the Lagrangian. Thus, as with the ML method, the CML method also produces an efficient estimator if it exists [22].

In this section we examine a scenario given in [16].

A. MIMO Instantaneous Mixing Model

Consider the multi-input, multi-output (MIMO) instantaneous mixing model, where x(t) is a vector of observations of a linear mixing of unknown parameters at time t, given by the model

x(t) = A s(t) + n(t)    (75)

where A is an unknown complex-valued channel matrix, each s(t) is a complex-valued data symbol vector, and the additive noise vector n(t) is spatially and temporally white with variance σ². We define a vector of unknown parameters by

(76)

where a_k is the kth column of A and the symbol vectors s(t) are stacked over the observation interval. Then, the complex FIM was found in [33] to be

(77)

As detailed in [33], this FIM is singular due to the multiplicative ambiguity inherent in the model. The model can be alternatively described in terms of the real-valued parameter vector θ_R, which has a real-valued FIM given by

(78)

By the structure of this matrix, its nullity follows from that of the complex FIM, so information regularity cannot be achieved without constraints. Sadler et al. applied constant modulus and semiblind constraints on (75) to obtain a locally identifiable model [16]. The constraint set is

(79)

where the training symbols are known. For this constraint set, the gradient matrix of the constraints was found in [8] to be

(80)

where

and

(81)

with D(θ) defined as in Section VI-B. (Note that this construction of the gradient matrix is not regular since it contains redundancy in restricting some signal values to be unit modulus as well as known values. It might be reformulated so the points are regular, but this is unnecessary.) A matrix whose columns form an orthonormal null space of this gradient matrix is given by

(82)

where

(83)

where the second matrix is the first with its initial rows removed. This results in the following CCRB:

(84)

Note the similarity in the structure of the CCRB with that in Section VI-B. This is, generally, the form of the CCRB when the appropriate null-space matrix can be found. Also note that the matrix in (83) is not the null space of that in (80); rather, these matrices are convenient expressions used to formulate the null-space matrix and the gradient matrix, respectively.

For a simulation example, we considered a scenario of time samples from two sources impinging on an array of sensors through a multipath channel, where the spatial signature of each source is expressed as a weighted sum of steering vectors, with complex amplitudes and directions-of-arrival (DOAs) for each multipath of each source [16]. The source elements are randomly drawn from an 8-PSK alphabet. The channel elements are generated from a complex normal distribution and then normalized to reflect the signal powers (20 dB and 15 dB). We obtain an initialization via the zero-forcing variant of the analytical constant modulus algorithm (ZF-ACMA) found in [28]. This estimate is projected onto the constraint set and then we apply both GP2 and CSA separately using a successive half step-size scheme (halving the step until a usable step results). The constraints assumed are the constant modulus constraint as well as knowledge of the first symbols for each source. We calculate the average MSE (per real coefficient) at each iteration over 5000 trials and compare with the mean CCRB for each of the sources and channels.

Fig. 3. Average MSE of (a) the two sources and (b) the combined channel parameters for ZF-ACMA, GP, and CSA compared with the mean CCRB over varying DOA separation. The source SNRs are SNR(s_1) = 20 dB and SNR(s_2) = 15 dB.

The numerical simulations show the improvement in the average MSE of the signals (Fig. 3) as a function of the DOA separation. We set the stopping criterion to be, e.g., a modification of the Newton decrement [29]

(85)

Another possible convergence criterion is the norm of the projected score, ||U^T(θ_k) ∂ log p(x; θ_k)/∂θ||. Typically, CSA requires an extra iteration to satisfy the stopping criterion compared with GP2. And, both algorithms converged for every run and always improved the initialization provided by ZF-ACMA, especially for close DOAs for the sources. In terms of the likelihood, its value when using the CSA estimate was numerically close to


that of GP2, with GP2 converging slightly faster in terms of the likelihood value due to a better behaved condition number resulting from the diagonal loading (the minimum eigenvalue of the loaded Hessian was bounded away from zero). However, as Fig. 3 demonstrates, CSA slightly outperforms GP with regard to the MSE on the parameter estimates when the two signals' DOAs become close (i.e., as the DOA separation nears 0). This illustrates, as noted earlier, that maximizing the likelihood does not always guarantee minimizing the MSE on the unknown parameters when the estimator is not efficient. We can improve the performance of GP2 by decreasing the diagonal loading (thereby decreasing the minimum eigenvalue nearer to 0), but this is at a slight cost in the rate of convergence. Finally, we note that results for estimating the channel elements in A were similar to those for estimating signal elements in s, as shown in Fig. 3.

VIII. CONCLUSION

We provided a unification of the asymptotic normality properties of a parametrically constrained maximum-likelihood problem derived by Crowder and the CCRB derived by Stoica and Ng. In particular, we showed that the CMLE is asymptotically efficient with respect to the CCRB. We then derived a generalization of the classical method of scoring, the constrained scoring algorithm. This unifies and connects features of the MLE and the CRB to those of the CMLE and the CCRB, including the asymptotic normality of the CMLE and the constrained scoring algorithm. The CSA preserves desirable convergence properties, and convergence proofs were shown. To verify the theory and value of the approach, we examined several problems of interest, including the classical linear model and an instantaneous linear mixing model that arises in many signal processing applications. For the linear model, we found an alternative CMLE that requires weaker conditions, which we have shown is efficient, and which generalizes previous well-known results. We also considered a nonlinear model, nonlinear constraint example, where the CSA outperforms a modification of Jamshidian's GP in terms of MSE on the parameters of interest.

REFERENCES

[1] J. D. Gorman and A. O. Hero, "Lower bounds for parametric estimation with constraints," IEEE Trans. Inf. Theory, vol. 36, no. 6, pp. 1285–1301, Nov. 1990.

[2] T. L. Marzetta, "A simple derivation of the constrained multiple parameter Cramer–Rao bound," IEEE Trans. Signal Process., vol. 41, no. 6, pp. 2247–2249, Jun. 1993.

[3] P. Stoica and B. C. Ng, "On the Cramér–Rao bound under parametric constraints," IEEE Signal Process. Lett., vol. 5, no. 7, pp. 177–179, Jul. 1998.

[4] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[5] M. Crowder, "On constrained maximum likelihood estimation with non-i.i.d. observations," Ann. Inst. Stat. Math., vol. 36, pt. A, pp. 239–249, 1984.

[6] M. R. Osborne, "Scoring with constraints," ANZIAM J., vol. 42, no. 1, pp. 9–25, Jul. 2000.

[7] M. R. Osborne, "Fisher's method of scoring," Int. Stat. Rev., vol. 60, pp. 99–117, 1992.

[8] R. J. Kozick, B. M. Sadler, and T. J. Moore, "Performance of MIMO: CM and semi-blind cases," in Proc. 4th IEEE Workshop on Signal Processing Advances in Wireless Communications, 2003, pp. 309–313.

[9] A. Leshem, "Maximum likelihood separation of constant modulus signals," IEEE Trans. Signal Process., vol. 48, no. 10, pp. 2948–2952, Oct. 2000.

[10] J. Aitchison and S. D. Silvey, "Maximum-likelihood estimation of parameters subject to restraints," Ann. Math. Statist., vol. 29, pp. 813–828, 1958.

[11] J. Aitchison and S. D. Silvey, "Maximum-likelihood estimation procedures and associated tests of significance," J. R. Stat. Soc. B, vol. 22, pp. 154–171, 1960.

[12] S. P. Han, "A globally convergent method for nonlinear programming," J. Optim. Theory Appl., vol. 22, pp. 297–309, 1977.

[13] Z. F. Li, M. R. Osborne, and T. Prvan, "Numerical algorithms for constrained maximum likelihood estimation," ANZIAM J., vol. 45, pp. 91–114, 2003.

[14] M. Jamshidian, "On algorithms for restricted maximum likelihood estimation," Comput. Stat. Data Anal., vol. 45, pp. 137–157, 2004.

[15] S. M. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. Englewood Cliffs, NJ: Prentice-Hall, 1998.

[16] B. M. Sadler, R. J. Kozick, and T. J. Moore, "On the performance of source separation with constant modulus signals," Army Research Lab., Adelphi, MD, Tech. Rep. ARL-TR-3462, Mar. 2005.

[17] P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization. New York: Academic, 1981.

[18] D. G. Luenberger, Introduction to Linear and Nonlinear Programming. Reading, MA: Addison-Wesley, 1973.

[19] J. Shao, Mathematical Statistics. New York: Springer-Verlag, 2003.

[20] B. Hochwald and A. Nehorai, "On identifiability and information-regularity in parameterized normal distributions," Circuits, Syst., Signal Process., vol. 16, no. 1, pp. 83–89, 1997.

[21] P. Stoica and T. L. Marzetta, "Parameter estimation problems with singular information matrices," IEEE Trans. Signal Process., vol. 49, no. 1, pp. 87–90, Jan. 2001.

[22] T. Moore, R. Kozick, and B. Sadler, "The constrained Cramér–Rao bound from the perspective of fitting a model," IEEE Signal Process. Lett., vol. 14, no. 8, pp. 564–567, Aug. 2007.

[23] C. G. Khatri, "A note on a MANOVA model applied to problems in growth curve," Ann. Inst. Stat. Math., vol. 18, pp. 75–86, 1966.

[24] T. Moore and B. Sadler, "Maximum-likelihood estimation and scoring under parametric constraints," Army Research Lab., Adelphi, MD, Tech. Rep. ARL-TR-3805, May 2006.

[25] R. T. Haftka and Z. Gürdal, Elements of Structural Optimization. Norwell, MA: Kluwer Academic, 1992.

[26] T. K. Moon and W. C. Stirling, Mathematical Methods and Algorithms for Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 2000.

[27] D. P. Bertsekas, Nonlinear Programming. Belmont, MA: Athena Scientific, 1995.

[28] A.-J. van der Veen, "Asymptotic properties of the algebraic constant modulus algorithm," IEEE Trans. Signal Process., vol. 49, no. 8, pp. 1796–1807, Aug. 2001.

[29] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.

[30] A. A. Goldstein, "Convex programming in Hilbert space," Bull. Amer. Math. Soc., vol. 70, no. 5, pp. 709–710, 1964.

[31] J. R. Kirkwood, An Introduction to Analysis. Boston, MA: PWS-Kent, 1995.

[32] J. N. Franklin, Matrix Theory. New York: Dover, 1993.

[33] T. J. Moore, B. M. Sadler, and R. J. Kozick, "Regularity and strict identifiability in MIMO systems," IEEE Trans. Signal Process., vol. 50, no. 8, pp. 1831–1842, Aug. 2002.

Terrence J. Moore received the B.S. and M.A. degrees in mathematics from the American University, Washington, DC, in 1998 and 2000, respectively.

Since 2000, he has been a Research Scientist with the Army Research Laboratory, Adelphi, MD, and is currently in the mathematics Ph.D. program at the University of Maryland, College Park. He has written papers on reconstructive sampling theory and performance bound analysis. His current research interests are statistical signal processing, optimization techniques applied to communications, and statistical inference.


Brian M. Sadler (M'90–SM'02–F'06) received the B.S. and M.S. degrees from the University of Maryland, College Park, and the Ph.D. degree from the University of Virginia, Charlottesville, all in electrical engineering.

He is a Senior Research Scientist at the Army Research Laboratory (ARL), Adelphi, MD. He was a Lecturer at the University of Maryland and since 1994 has been lecturing at The Johns Hopkins University, Baltimore, MD, on statistical signal processing and communications. His research interests include signal processing for mobile wireless and ultra-wideband systems, sensor signal processing and networking, and associated security issues.

Dr. Sadler is an Associate Editor for the IEEE SIGNAL PROCESSING LETTERS and the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He has been a Guest Editor for several journals, including the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, and the IEEE Signal Processing Magazine. He is a member of the IEEE Signal Processing Society Sensor Array and Multichannel Technical Committee, and received a Best Paper Award (with R. Kozick) from the Signal Processing Society in 2006.

Richard J. Kozick (S'85–M'86–SM'07) received the B.S. degree from Bucknell University, Lewisburg, PA, in 1986, the M.S. degree from Stanford University, Stanford, CA, in 1988, and the Ph.D. degree from the University of Pennsylvania, Philadelphia, in 1992, all in electrical engineering.

From 1986 to 1989 and from 1992 to 1993, he was a Member of Technical Staff at AT&T Bell Laboratories. Since 1993, he has been with the Electrical Engineering Department at Bucknell University, where he is currently Professor and holds the T. Jefferson Miers Chair (2006–2011). His research interests are in the areas of statistical signal processing, sensor networking, sensor fusion, and communications.

Dr. Kozick received a 2006 Best Paper Award from the IEEE Signal Processing Society and the Presidential Award for Teaching Excellence from Bucknell University in 1999. He serves on the editorial boards of the IEEE SIGNAL PROCESSING LETTERS, the Journal of the Franklin Institute, and the EURASIP Journal on Wireless Communications and Networking. He is a member of the American Society for Engineering Education (ASEE), Tau Beta Pi, and Sigma Xi.