Seven trade-offs in measuring nonprofit performance and effectiveness

ORIGINAL PAPER

Seven Trade-offs in Measuring Nonprofit Performance and Effectiveness

Jurgen Willems • Silke Boenigk • Marc Jegers

Published online: 6 March 2014

© International Society for Third-Sector Research and The Johns Hopkins University 2014

Abstract To complement contemporary nonprofit literature, which mainly offers

theory-driven recommendations for measuring nonprofit effectiveness, performance,

or related concepts, this article presents seven trade-offs for researchers and

practitioners to consider before engaging in a nonprofit effectiveness measurement

project. For each trade-off, we offer examples and suggestions to clarify the

advantages and disadvantages of methodological choices that take various contex-

tual elements into account. In particular, we address the differences between for-

mative and reflective approaches, as well as the differences between unit of interest,

unit of data collection, and unit of analysis. These topics require more in-depth

attention in the nonprofit effectiveness literature to avoid misinterpretations and

measurement biases. Finally, this article concludes with five avenues for further

research to help address key challenges that remain in this research area.

J. Willems (✉) · S. Boenigk
Department of Nonprofit & Public Management, University of Hamburg, Von-Melle-Park 5, 20146 Hamburg, Germany
e-mail: [email protected]

J. Willems · M. Jegers
Department of Applied Economics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium

Voluntas (2014) 25:1648–1670
DOI 10.1007/s11266-014-9446-1

Résumé To complement the contemporary literature on nonprofit organizations, which mainly offers theory-based recommendations for evaluating their effectiveness, their performance, or related concepts, this article presents seven trade-offs that researchers and practitioners can take into consideration before engaging in a project to evaluate the effectiveness of a nonprofit organization. For each trade-off, we give examples and suggestions that highlight the advantages and disadvantages of methodological choices that take various contextual elements into account. In particular, we address the differences between formative and reflective approaches, as well as between unit of interest, unit of data collection, and unit of analysis. These topics need to be explored in greater depth in the literature on nonprofit effectiveness to avoid misinterpretations and evaluation biases. Finally, this article concludes with five avenues to explore, with the aim of meeting the important challenges that remain in this area of research.

Zusammenfassung The conduct of empirical nonprofit research has increased strongly in recent years, both in academic research and in nonprofit practice. However, whereas methodological discussion and questions of measurement are debated very intensively and sometimes critically in other disciplines and journals, this discourse has not yet begun in nonprofit research. This contribution therefore pursues the general aim of stimulating the methodological discourse among nonprofit researchers and practitioners, in order to contribute in the medium term to raising the measurement quality of empirical nonprofit studies. To this end, seven decision areas are presented and discussed in depth that should receive more attention when conducting empirical studies in nonprofit management, in particular when measuring nonprofit success and effectiveness. Specifically, these decisions are: (1) uni- versus multidimensionality, (2) formative versus reflective measurement, (3) individual versus group opinion, (4) internal versus external measurement, (5) leading versus lagging, (6) distinct versus overlapping measurement, and finally (7) additive versus multiplicative measurement approach. Each decision area is explained with its advantages and disadvantages and discussed by way of example, in order to derive recommendations for the nonprofit community as to which measurement approach is preferable in which study situations.

Resumen To complement the contemporary published literature on nonprofit organizations, which mainly offers theory-driven recommendations for measuring the effectiveness, performance, or related concepts of nonprofit organizations, this article presents seven trade-offs for researchers and practitioners to consider before engaging in a project to measure the effectiveness of nonprofit organizations. For each trade-off, we offer examples and suggestions to clarify the advantages and disadvantages of methodological choices that take various contextual elements into account. In particular, we address the differences between formative and reflective approaches, as well as the difference between unit of interest, unit of data collection, and unit of analysis. These topics require more in-depth attention in the published literature on nonprofit effectiveness to avoid misinterpretations and measurement biases. Finally, this article concludes with five avenues for further research to help address the key challenges that remain in this research area.

Keywords Nonprofit performance · Nonprofit effectiveness · Measurement · Formative or reflective specifications · Measurement biases

Introduction

Measuring nonprofit performance and effectiveness has remained a topic of

substantial debate for more than half a century. Nonprofit organizations are defined

solely by their lack of profit distributions (Hansmann 1987), so they can have various

goals (Perrow 1961; DiMaggio 2001). Therefore, a one-size-fits-all solution to

measure the achievement of these goals is unlikely. As Lecy et al. (2012) and Jun and

Shiau (2012) note in their extensive historical overviews of nonprofit effectiveness

literature, multiple studies acknowledge the complexity and the plethora of theoretical

elements that constitute the very concept of effectiveness (Osborne and Tricker 1995;

Plantz et al. 1997; Rojas 2000; DiMaggio 2001; Sawhill and Williamson 2001; Sowa

et al. 2004; Herman and Renz 2008; Carman and Fredericks 2010).

Although previous research has used the terms nonprofit performance and

nonprofit effectiveness synonymously, we clarify their usage for this study. In line

with nonprofit research and organizational performance science (e.g., Richard et al.

2009), nonprofit performance encompasses four concrete areas of interest:

(a) financial performance (e.g., donations raised in a year, state funding),

(b) stakeholder performance (e.g., volunteer satisfaction, donor loyalty, stakeholder

identification), (c) market performance (e.g., nonprofit image, nonprofit brand

reputation, service quality), and (d) mission performance (achieving the mission of

the organization). Nonprofit effectiveness is closely related to, but broader than,

nonprofit performance, in that it focuses more on the balanced input and output

achieved through the combination of processes, projects, and programs imple-

mented by the nonprofit organization to reach its predefined goals. These nonprofit

goals are framed by the organization’s mission, which generally focuses on creating

particular effects for various stakeholder groups (Frumkin 2002).

Herman and Renz (1997, 1999, 2000, 2008) offer an intensive discussion of

nonprofit effectiveness, from which they derive nine general theses. Some of these

theses regard important consequences of how to measure nonprofit effectiveness.

However, as pointed out by Lecy et al. (2012) and Jun and Shiau (2012), substantial

challenges remain with respect to the empirical verification of nonprofit perfor-

mance and effectiveness insights. In particular, they note the paradoxical

observation that proposed guidelines for measuring nonprofit effectiveness are

seldom all met in a single study.

Therefore, this study takes a more empirically driven perspective on nonprofit

effectiveness measurements, and we add an important but missing methodological

part to the mainly theory-driven recommendations available in contemporary

literature (as summarized by Jun and Shiau 2012; Lecy et al. 2012). We focus on

how the contextual factors of a measurement project, from both research and

practitioner perspectives, can be addressed more clearly as a means to reach

appropriate nonprofit effectiveness measurements. Specifically, we aim to provide

researchers and practitioners with a detailed overview of criteria to be taken into

account when measuring nonprofit effectiveness. Furthermore, we formulate

concrete avenues for research that can enhance both the methodological robustness

and the theoretical depth of the ongoing discussion about nonprofit effectiveness.

As we strongly want to avoid adding another list of theory-driven requirements to

the broad literature that already exists, we purposefully adopt a distinct approach.

We identify from our literature review seven trade-offs that researchers and

practitioners should consider when they initiate a project to measure nonprofit

effectiveness. For each trade-off, we note the advantages and disadvantages of each

choice, devoting special attention to the context in which the measurements would

be used (e.g., target group, unit of analysis, type of research question). From

this perspective, we define a trade-off in the context of this study as an evaluation of

several context-dependent advantages and disadvantages that results in a choice

between two methodological options, or a well-considered combination of these

options. Discussing trade-offs, rather than presenting a supposed one-size-fits-all list

of recommendations, enables us to focus on the importance of the contextual

elements on which decisions could be based in order to come to valuable and

adjusted measurement designs. We build on empirical studies in the domain of

nonprofit effectiveness, which provide examples of advantages and disadvantages of

the various choices. In addition, we use methodological contributions from outside

the traditional nonprofit research domain, such as organizational research,

marketing, psychology, and/or sociology, which offer more substantial experience

in terms of quantifying complex concepts.

Trade-off Decisions in Measuring Effectiveness

Before discussing each trade-off, we stress two points. First, we structure our

discussion in accordance with these seven trade-offs, because the classification

allows us to address stand-alone elements. That is, each element has to be

considered by a researcher or practitioner before engaging in a nonprofit

effectiveness measurement project. Furthermore, this classification enables us to

single out elements from the overall discussion of nonprofit effectiveness, on which

we base our recommendations for further research. We consider each of these trade-offs

equally important to weigh in a measurement project. Our explanations of them

nevertheless differ in length, largely because some aspects have not been discussed

previously to the extent they require.

Second, because we rely on broader literature to frame our methodological

considerations (i.e., organizational research, marketing, psychology, and sociology),

our focal topics may be relevant in other research areas too (e.g., corporate social

responsibility in a for-profit or public setting). Nevertheless, to maintain a clear

focus and provide a targeted contribution, beyond the theoretical overviews already

offered by Jun and Shiau (2012) and Lecy et al. (2012), we use mainly nonprofit-

related examples and theoretical insights from the nonprofit research domain. The

seven trade-offs are summarized in Table 1, and explained in detail in the following

sections.

Table 1 Seven trade-offs when measuring nonprofit effectiveness and performance (advantages and disadvantages)

1. Uni- versus multidimensional measurement

   Unidimensional measurement (one item, criterion, or dimension to measure effectiveness)
     + Can be applied in heterogeneous samples
     + To test relationships in the context of a particular theory, but applicable across many different settings (higher relevance for specific research projects)
     + Easier to find larger samples for which a unidimensional measurement is relevant for all sample entities (e.g., different organizations, stakeholders)
     + Less cost/effort needed to obtain measurements
     - Focus on a single aspect of effectiveness
     - Conclusions might remain general
     - Lower potential content validity and/or reliability of measurements

   Multidimensional measurement (multiple criteria and/or dimensions)
     + Gives more detailed insights in the true complexity of nonprofit effectiveness
     + Can be used to investigate complex relationships of elements with other concepts (detailed analyses)
     + Higher practical relevance
     + Use of multidimensional measurements conforms with contemporary expectations in nonprofit effectiveness literature
     - Suitable for particular contexts (more homogeneous samples)
     - Potential data samples might be naturally constrained (statistical power to test relationships of interest might be low)
     - More complex analysis methods might be necessary

2. Formative versus reflective measurements

   Formative measurement (criteria define nonprofit effectiveness)
     + Conforms with a normative approach to effectiveness measurement (dominant rationale in contemporary literature)
     + Can support theoretical claims that different dimensions of effectiveness have unique impacts
     + Can investigate complex relationships of elements of effectiveness with other concepts (detailed analyses)
     + Appropriate for action-oriented assessments of effectiveness
     - When wrongly specified, tested, or validated, might result in severe misinterpretations

   Reflective measurement (criteria summarize nonprofit effectiveness)
     + Conforms with a social constructionist approach to effectiveness measurement
     + Can be used to improve reliability of measurements
     + Appropriate for perception-based evaluations of effectiveness
     + Validation methods are better known and more used in contemporary literature
     - When wrongly specified, tested, or validated, might result in severe misinterpretations

3. Individual versus group measurements

   Individual measurement (single source is consulted to measure effectiveness)
     + Larger samples can be composed at the level of the unit of interest
     + A broad range of criteria can be probed from the source with the most access to information
     - When unit of interest differs from unit of data collection or analysis, substantial measurement biases might exist due to (a) social desirability, (b) different background of raters, or (c) different personal reference frameworks
     - Interpretation of results should be at the atomistic level (i.e., level of data collection, rather than level of interest)

   Group measurement (multiple sources are consulted to measure effectiveness)
     + More reliable measurements might be obtained (countering measurement biases)
     + A more holistic view from different perspectives can be acquired
     - Data collection might be costly and complex
     - Assumptions necessary to aggregate data, which might result in few data points at the unit of interest
     - Control variable at individual (source) level might be necessary

4. Internal versus external measurements

   Internal measurement (internal sources)
     + More detailed information is available
     - Measurement biases might exist due to social desirability and ivory tower judgments

   External measurement (external sources, such as customers, other organizations, donors)
     + Information of higher relevance might be obtained (real world)
     - Comparability of samples across organizations should exist
     - Differences across sources in each organization should be accounted for (e.g., control variables, weights), to avoid biases due to the (a) different background of raters or (b) different personal reference frameworks

5. Leading versus lagging measurements

   Leading measurement (focused on actions and their direct outcomes)
     + More objectively observable
     + Generalizable across more heterogeneous samples of organizations
     - Assumptions should be realistic regarding the effects associated with these measurements

   Lagging measurement (focused on the effects of organizational actions and outcomes)
     + Focus on the elements that are of real interest (effects: higher practical relevance)
     - More subjective and dependent on personal backgrounds and reference frameworks
     - Less generalizable across organizations with various missions
     - Less generalizable across stakeholder groups with various interests and needs

6. Distinct versus overlapping measurements

   Distinct concepts measured (to investigate causes and effect of nonprofit effectiveness)
     + To investigate factors and conditions that determine differences in nonprofit effectiveness
     + To identify management actions that should be in place or improved to obtain high effectiveness
     + Enables strong theorization
     - Extensive pretesting necessary to determine distinctness of concepts
     - More complex and costly data collection processes

   Overlapping concepts measured (to investigate mental models on nonprofit effectiveness)
     + To investigate how concepts are perceptually related, and/or how mental models are constituted among managers or stakeholders (e.g., research on managerial sensemaking)
     - Risk of over-investigation and interpretation of nonexistent relations (Type I errors)

7. Additive versus multiplicative measurements

   Additive measurement (various criteria aggregated in additive way, adding or averaging)
     + To reach a more reliable measurement (reflective criteria) or when combining different sources (e.g., inter-rater agreement)
     + To obtain evaluative measurements when criteria can compensate for one another
     - Unique contributions of separate criteria to overall effectiveness measurement not clearly observable

   Multiplicative measurement (various criteria aggregated in a multiplicative way)
     + To make conditionality of particular criteria more prominent
     + To identify criteria, dimensions, or stakeholder groups that require most urgent management actions
     + Useful for the selection of highly effective organizations (e.g., qualitative research analysis)
     - Less consistent with contemporary uses of variables and quantification, so assumptions about distribution, variances, and range might be violated (i.e., not suitable for commonly used scientific research methods)

Uni- Versus Multidimensional Measurements

The first trade-off is between a unidimensional or a multidimensional measurement approach. In their extensive chronological literature review, Jun and Shiau (2012) distinguish between unidimensional and framework-based approaches. Unidimensional approaches, typical of the first generation of nonprofit effectiveness studies, focus on a single dimension of nonprofit performance and are commonly applied according to a particular theory. Framework-based approaches, or the second generation of studies, apply an additional classification between multidimensional and multi-constituency approaches. Multidimensional approaches take more than one dimension of effectiveness into account; for example, Sowa et al. (2004) differentiate management effectiveness from program effectiveness. A multi-constituency approach instead addresses the different interests and perspectives of separate stakeholder groups.

We consider the advantages and disadvantages of unidimensional versus

multidimensional measurements in this section. The reasons listed in contemporary

literature for using a multidimensional effectiveness assessment reflect mainly

theoretical considerations (Lecy et al. 2012). Most nonprofit organizations (1) have

multiple goals, (2) share goals with other organizations, (3) pursue subjective

outcomes, and (4) involve various stakeholders with different interests, so a single

criterion likely could not quantify an organization’s effectiveness sufficiently

(Perrow 1961; DiMaggio 2001; Kaplan 2001; Herman and Renz 2008; Moxham

2009). Therefore, considering various criteria would help researchers gain a more

holistic view of effectiveness, such that they could make better inferences about the

complexity of the ‘‘real world’’ (Micheli and Kennerley 2005; Gordon et al. 2010).

As another important benefit, a multidimensional scale can investigate the

differential impacts of various factors on different dimensions (Jackson and

Holland 1998; Brown 2005; Dart 2010).

However, at least three considerations involving multidimensional measurements

highlight their potential disadvantages. The first relates to the generalizability of the

findings. Because of the many differences that exist across nonprofit organizations,

it is hard, or even impossible, to find a set of performance dimensions that would be

broadly applicable to many different types of organizations and their varied

stakeholders (DiMaggio 2001; Micheli and Kennerley 2005; Baruch and Ramalho

2006; Eckerd and Moulton 2011). To ensure that the criteria adopted in a

multidimensional approach are relevant for all organizations and stakeholders

addressed, the data collection would need to be constrained to a particular context of

similar organizations and/or stakeholders (i.e., homogeneous sample). When

researchers seek generalizable findings or if practitioners want to make an overall

assessment across stakeholder groups, they might prefer measurements with a single

dimension or criterion, relevant for all the different entities. Thus they could

investigate more heterogeneous samples, in which the organizations or stakeholders

differ widely. A useful suggestion in this context comes from Baruch and Ramalho

(2006), who propose including both context-specific and generalist measurements

(items and/or dimensions) of effectiveness in every study. Such an approach could

offer a detailed analysis of a particular research question (context-specific

measures), while also framing and comparing results within a broader investigation

of effectiveness (general measurements).

A second consideration is the challenges that multidimensional measures create

for data analysis. Despite the growing availability of various statistical applications

(Sowa et al. 2004), to date, few contributions apply them to the study of

multidimensional measurements of effectiveness (Lynn et al. 2000; Lecy et al.

2012). These methods require substantial sets of observations (Maas and Hox 2005;

Herman and Renz 2008), but in most cases, either the populations are naturally

constrained (i.e., there are no unlimited contexts in which all measurement

dimensions make sense), or the groups of target respondents are too heterogeneous

(i.e., requiring many control variables, reducing the statistical power for testing the

real relationships of interest). The use of multidimensional measurements thus

seems more appropriate for testing the potentially complex relatedness of a few

variables in a controlled environment. In contrast, a unidimensional measure could

test a more generalizable relationship, across more heterogeneous contexts, based

on a general theory (Jun and Shiau 2012).

Finally, from a scientific perspective, some nonprofit researchers pursue high

measurement quality and internal consistency by using items and criteria that are

very similar in nature. Despite their high internal reliability (e.g., high Cronbach’s

alpha values) and good model fit, the second or third item in such multi-item

constructs often contributes very little beyond the information obtained from the

first item (Drolet and Morrison 2001). Nor do multiple-item measurements,

compared with single-item measures, necessarily have better predictive validity

(Bergkvist and Rossiter 2007). When several similar criteria are combined, efforts

and costs increase for data collection, yet less relevant information might emerge,

due to the measurement of redundant items and criteria. Therefore, nonprofit

effectiveness could be measured with a single-item indicator if (1) it is used for

measuring an overall personal perception, (2) a consistent methodological approach

is applied, and (3) the interpretation of results sufficiently takes individual

respondents’ characteristics into account (Pandey et al. 2007; Stazyk and Goerdely

2010). Furthermore, a perception-based reflective effectiveness item can be useful

for explaining individual behavior, because it relates to multiple, more objective

organizational effectiveness criteria, such as giving donations, being committed to

the organization, or volunteering for the organization (Forbes 1998; Padanyi and Gainer

2003; Yoo and Brooks 2005; Daellenbach et al. 2006; Sarstedt and Schloderer 2010;

Mews and Boenigk 2012). For example, such an item might ask, ‘‘On an overall

basis, rank the effectiveness of your agency in accomplishing its core mission

(0 = not effective at all; 10 = extremely effective)’’ (Pandey et al. 2007, p. 406).
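The redundancy argument in this paragraph can be illustrated with a small simulation. The following is a minimal sketch and not an analysis from this study: it assumes three hypothetical survey items that are near-duplicates of one latent perception (the noise level of 0.3, the sample size of 500, and the seed are arbitrary choices), computes Cronbach's alpha with the standard formula, and correlates the first item alone with the three-item scale mean.

```python
import random
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of scores per item)."""
    k = len(items)
    item_variances = [statistics.variance(col) for col in items]
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(item_variances) / statistics.variance(totals))


def pearson(x, y):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


rng = random.Random(42)
n = 500  # hypothetical number of respondents

# Latent overall perception of effectiveness (assumed standard normal).
perception = [rng.gauss(0, 1) for _ in range(n)]

# Three hypothetical items that differ from the perception only by small noise.
items = [[p + rng.gauss(0, 0.3) for p in perception] for _ in range(3)]

alpha = cronbach_alpha(items)
scale_mean = [statistics.fmean(scores) for scores in zip(*items)]
r_first_vs_scale = pearson(items[0], scale_mean)

print(f"alpha = {alpha:.2f}")
print(f"r(item 1, three-item mean) = {r_first_vs_scale:.2f}")
```

Under these assumptions, alpha is high while the first item on its own already tracks the scale mean closely, which is the pattern that Drolet and Morrison (2001) describe. With less similar items, a single item would carry less of the information in the scale.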

Considering the three key elements, and potentially in contrast with a dominant

rationale in previous literature, we propose that a unidimensional or even single-

criterion perspective on effectiveness deserves more attention, especially if the aim

is to (1) improve generalizability across contexts, (2) enhance analytical robustness,

(3) deal with naturally constrained data availability, and/or (4) reduce costs or

efforts because they are not worth the minimal extra information provided by an

additional item, criterion, or dimension.

Formative Versus Reflective Measurements

Another trade-off pertains to whether to apply a formative or reflective approach.

Each approach inherently encompasses very different assumptions, and when

wrongly applied, it could have severe consequences in terms of Type I errors

(Diamantopoulos 1999). These errors occur when the results suggest that a

relationship exists between two concepts, whereas in reality no such relationship

exists, or for example, when findings recommend that practitioners should invest in

particular practices, even though those practices produce no returns in reality.
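To make the notion of a Type I error concrete, the following minimal sketch simulates many studies in which two variables are generated independently, so that no relationship exists, and counts how often a conventional significance test nevertheless reports one. It illustrates the general definition only, not the specific mechanism by which a misspecified formative or reflective model produces such errors. The sample size, number of trials, seed, and the approximate critical value are assumptions chosen for illustration.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


rng = random.Random(2014)
n = 100        # hypothetical respondents per study
trials = 2000  # hypothetical number of simulated studies
t_critical = 1.984  # approximate two-sided 5% critical value for 98 degrees of freedom

false_positives = 0
for _ in range(trials):
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # drawn independently of x
    r = pearson(x, y)
    t = r * ((n - 2) / (1 - r * r)) ** 0.5  # t statistic for a correlation
    if abs(t) > t_critical:
        false_positives += 1  # a "significant" relationship that does not exist

rate = false_positives / trials
print(f"share of simulated studies with a Type I error: {rate:.3f}")
```

The share should be close to the nominal 5% level, because the test is applied correctly here. The concern raised in the text is that a wrongly applied measurement specification can produce such false findings more often than the nominal level suggests.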

A reflective approach assumes that the latent variable (i.e., the concept of

interest, which would be nonprofit effectiveness in our context) causes the different

indicators being measured (DeVellis 2003; Sarstedt and Schloderer 2010; Boenigk

et al. 2011). For example, if a donor is satisfied with an organization (which would

mean that the latent variable is ‘‘donor satisfaction’’), the reflective approach

includes the assumption that the donor will evaluate the overall service delivered by

the organization positively and believes that his or her personal expectations have

been fulfilled. As a result, reflective indicators of donor satisfaction might be: ‘‘I am

happy with the service delivered by this organization’’ or ‘‘My expectations are met

by this organization.’’

A formative approach instead takes the opposite assumption: All the indicators

together define or conceptually cause the latent construct (Bollen and Lennox 1991).

An example in the nonprofit context comes from Willems et al. (2012a). For the

latent variable ‘‘nonprofit governance quality,’’ whether organizational governance

quality is considered high depends on whether several distinct criteria are high. For

example, to consider an organization well governed, stakeholders would need to be

sufficiently involved in the organization’s decision processes, its internal structures

and procedures must be well developed, recurrent evaluations should review prior

achievements and outputs, etc. (together, all formative criteria for ‘‘governance

quality’’; Willems et al. 2012a).
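The opposite causal directions of the two approaches can be sketched with simulated data. This is an illustrative sketch under stated assumptions, not the authors' procedure: the reflective indicators are generated from a single latent score plus noise, the formative criteria are drawn independently and then combined into an index, and the weights, noise level, sample size, and seed are all hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return cov / (sxx * syy) ** 0.5


rng = random.Random(7)
n = 1000  # hypothetical number of observations

# Reflective: the latent variable causes every indicator (latent -> items).
latent = [rng.gauss(0, 1) for _ in range(n)]
reflective = [[value + rng.gauss(0, 0.5) for value in latent] for _ in range(3)]

# Formative: distinct criteria together define the construct (items -> index).
formative = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
weights = [0.5, 0.3, 0.2]  # hypothetical importance weights
index = [sum(w * v for w, v in zip(weights, row)) for row in zip(*formative)]

r_reflective = pearson(reflective[0], reflective[1])
r_formative = pearson(formative[0], formative[1])
r_index = pearson(index, formative[0])

print(f"reflective items correlate: {r_reflective:.2f}")
print(f"formative criteria correlate: {r_formative:.2f}")
print(f"index vs. first criterion: {r_index:.2f}")
```

In this setup the reflective items share most of their variance, so they are largely interchangeable, whereas the formative criteria are nearly uncorrelated and each contributes its own part to the index. Dropping a reflective item would change little, but dropping a formative criterion would change what the index means.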

For most constructs, the choice between a formative and a reflective approach

should be obvious (Petter et al. 2007; Sarstedt and Schloderer 2010). Yet the

different perspectives applied in the literature to nonprofit effectiveness make the

choice less obvious. Errors might result when methodological specifications are

inconsistent with theoretical claims and interpretations, so this second trade-off

deserves substantially more attention in the nonprofit literature. We address two

perspectives that appear in contemporary nonprofit literature and that each support

either a formative or a reflective approach. Depending on the context of their

particular study, researchers and practitioners can choose a formative, reflective,

or a combined approach. These perspectives are the social construction and the

normative nature of nonprofit effectiveness.

If effectiveness is seen as a social construction for a particular group of people—

that is, they hold a shared, common perception of the concept nonprofit

effectiveness (Herman and Renz 1997; Forbes 1998; Liao-Troth and Dunn

1999)—a reflective approach is appropriate. If people consider an organization

‘‘effective,’’ they might designate it as a good example for other organizations or

cite the organization as a best practice example. As a result, the reflective items in a

scale measuring overall effectiveness might include, ‘‘This organization is a good

example for other organizations’’ or ‘‘I would talk about this organization to

illustrate good practices.’’ From a reliability perspective, several such items,

preferably with high common variance, could be combined, such that the items

together measure a single latent variable (DeVellis 2003). Therefore, a reflective

approach is appropriate when the measurements focus on perceptions of effective-

ness or on reputational effectiveness (Smith and Shen 1996; Forbes 1998; Liao et al.

2001).
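
The reliability logic for reflective items can be illustrated with Cronbach's alpha, which increases as the items share more common variance. The following Python sketch is only an illustration under stated assumptions: the item names and the ratings of five respondents are hypothetical and are not taken from any of the cited studies.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a reflective scale.

    items: list of equally long score lists, one list per item,
    each ordered by respondent.
    """
    k = len(items)
    # Total scale score per respondent (sum across all items).
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 5-point ratings by five respondents on three reflective items.
good_example = [4, 5, 3, 5, 2]
illustrates_good_practice = [4, 4, 3, 5, 1]
overall_effective = [5, 5, 2, 4, 2]

alpha = cronbach_alpha(
    [good_example, illustrates_good_practice, overall_effective]
)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Because the made-up ratings move together closely across respondents, the resulting alpha is high, which is the pattern a reflective specification expects.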

In contrast, from a normative perspective, nonprofit performance or effectiveness

requires a formative measurement approach (Petter et al. 2007; Willems et al.

2012a). This means that the theoretical framework requires a combined consider-

ation of multiple items or dimensions to provide a comprehensive assessment of the

organization’s effectiveness (e.g., Kaplan 2001). The assessment of the various

criteria thus leads to the conclusion about whether the nonprofit organization is

effective (i.e., causal relationship from items to latent variable). In this case, the

common variance of the items is less important, while their unique variances

become critical. A formative construct is useful if little common variance exists

among its indicators, as each indicator then has a relevant contribution to the overall

concept (which is usually assessed by the level of multicollinearity) (Diamanto-

poulos and Siguaw 2006). Specialized literature offers a more extensive overview of

the steps necessary to develop and test the appropriateness of formative constructs

(Jöreskog and Goldberger 1975; Bollen and Lennox 1991; Diamantopoulos and

Winklhofer 2001; Jarvis et al. 2003; Petter et al. 2007; Sarstedt and Schloderer

2010).
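
One common way to assess the multicollinearity mentioned above is the variance inflation factor (VIF), which equals 1 when an indicator shares no variance with the other indicators and grows as the overlap increases. The sketch below is a minimal illustration in plain Python, using the fact that the VIFs are the diagonal of the inverse of the indicators' correlation matrix. The function names and the example scores are assumptions made for illustration only; they are not part of the cited procedures.

```python
def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(indicators):
    """VIF per indicator: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in indicators] for a in indicators]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(len(indicators))]


# Hypothetical scores of four organizations on two formative criteria.
stakeholder_involvement = [1, 2, 3, 4]
structures_and_procedures = [2, 1, 4, 3]

vifs = variance_inflation_factors(
    [stakeholder_involvement, structures_and_procedures]
)
print(vifs)
```

Low values suggest that each indicator contributes information that the others do not, which is the situation in which a formative specification is useful.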

Arguments in existing literature that suggest the need to measure nonprofit

effectiveness with multiple dimensions also tend to take a normative perspective

(Herman and Renz 1999, 2008; Rojas 2000; DiMaggio 2001; Sowa et al. 2004;

Shilbury and Moore 2006). This rationale states that researchers and practitioners

should investigate multiple, distinct dimensions, because a single dimension cannot

sufficiently represent what effectiveness is. This argument inherently endorses the

idea that each dimension contains a piece of relevant information that cannot be

captured by other dimensions. Thus, a formative measurement model approach is

appropriate for a normative perspective. Unfortunately, several (frequently cited)

examples in nonprofit literature assert the need for multiple dimensions and propose

multiple criteria or dimensions, yet the results they report suggest that the

dimensions proposed are not multiple or distinct effectiveness criteria. For example,

principal component analysis (PCA), validation reporting (e.g., Cronbach’s alpha),

and interpretations all build on high common variances among items or dimensions

(e.g., Gill et al. 2005; Sowa et al. 2004; Shilbury and Moore 2006). Because such

approaches thus contradict the theoretical claims, they create the potential for

Type I errors (Diamantopoulos and Siguaw 2006). Therefore, we suggest that the

items proposed in such seemingly multidimensional scales could be used in further

research to quantify a single, unidimensional, ‘‘social constructionist’’ latent

variable that quantifies overall effectiveness (rather than measuring distinct

dimensions), or that new criteria and items be proposed that truly measure a

unique element of overall nonprofit effectiveness. We also suggest avoiding

substantive explanations of the superficial relatedness of arbitrarily chosen

dimensions (with one another or other concepts). Such efforts grant distinct

interpretations to relations between concepts that are not inherently different, which

again creates Type I errors (Diamantopoulos and Siguaw 2006). Furthermore, we

suggest re-validations, using formative approaches, of existing scales that initially

were developed to measure distinct dimensions but that led to high internal

correlations. These new evaluations also could clarify the extent to which the

proposed dimensions actually measure distinct aspects. If no such proof is

forthcoming, new scale development should be undertaken, to uncover multiple,

unrelated dimensions that accord with the issued theoretical claims.

Individual Versus Group Measurements

Although virtually no empirical tests address the extent to which we can rely on a

single rater’s opinion to assess an organization’s effectiveness (Lynn et al. 2000),

opinions among individuals can differ substantially (Green and Griesinger 1996;

Herman and Renz 2008; Willems et al. 2012b). First, raters engage in various types

of relationships with the organization, so they judge the usefulness of the

organizational outcomes differently depending on their personal needs (Krashinsky

1997; Balser and McClusky 2005; Mistry 2007; Babiak 2009). Second, a

comparative view is inherent to assessments of concepts that are difficult to

quantify objectively (Herman and Renz 1999, 2008; DiMaggio 2001), so the

personal reference frameworks of raters play important roles. That is, people rely on

their personal experiences with other organizations to judge the effectiveness of an

organization (Herman and Renz 1999, 2008). Third, the social desirability of the

reported output might result in (unsystematically) biased assessments that differ

across stakeholder types (Green and Griesinger 1996). For example, a fundraiser

might present organizational effectiveness higher than it actually is, because she or

he hopes to create a positive image of the organization and thereby obtain more

funding.

Considering these three sources of potential measurement biases in individual

assessments of organizational characteristics, it might be useful to distinguish

between the (1) unit of interest, (2) unit of data collection, and (3) unit of analysis

(though all three often are referred to with the common term ‘‘unit of analysis’’).

The unit of interest pertains to the level on which the theoretical considerations are

focused. In nonprofit effectiveness literature, the unit of interest is often the

organization (e.g., ‘‘Do particular practices in an organization result in better effects

for its stakeholders?’’). The unit of data collection is the level at which data are

collected. For example, raters might be asked to judge the effectiveness of an

organization, or researchers might review various annual reports by the organiza-

tion. In these cases, the respective units of data collection would be persons in the

organizations and years in which the organization was active. These data can be

analyzed in their original form, aggregated, or related across levels on the basis of

existing assumptions. These choices in turn determine the unit of analysis. If the unit

of interest (e.g., organization) differs from the unit of data collection or unit of

analysis (e.g., individual opinions; change in reported donors), important method-

ological considerations are necessary with respect to robustness, reliability, the

research design, and the contribution.

From a reliability perspective, different opinions or measurements can be

gathered and aggregated for a single organization, which requires the use of

inter-rater reliability measures (Boyer and Verma 2000). However, this method

requires an extensive data collection, which might result in high costs and reduce

the number of organizations in the sample, leading to less statistical power for

analyzing the unit of interest (Groves and Heeringa 2006; Herman and Renz 2008).

Thoughtful decisions also are necessary with respect to how to aggregate the data

(especially when the numbers of respondents differ across organizations). It should

be possible to find comparable groups of respondents across organizations (e.g.,

board members are likely more comparable across organizations than the

beneficiaries of different types of organizations). In contrast, researchers could

use the opinions of single respondents with comparable positions across organi-

zations to gather comparable answers from a relatively large sample of organiza-

tions, such as the chair of the board or the general manager (Brown 2005; Gill et al.

2005; LeRoux and Wright 2010). In this way, larger samples at the level of the unit

of interest can be composed, but biases due to differing backgrounds, social

desirability, and subjective measurements can induce misinterpretations of the

results, particularly for studies that rely on individual perceptions (see

‘‘Leading versus lagging measurements’’ and ‘‘Distinct versus overlapping

effectiveness measurements’’ sections). Therefore, as long as the analysis also includes

individual control variables or provides a more atomistic (i.e., individual-level)

interpretation of the findings, a single person’s assessment can be a valid substitute

to approximate an organization’s characteristics.
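
To make the aggregation step concrete, the following sketch computes organization-level means together with ICC(1), one frequently used inter-rater reliability statistic derived from a one-way analysis of variance. It is an illustration only: the choice of ICC(1), the adjusted group size used for unequal numbers of raters, and the ratings themselves are assumptions, not the procedure of any cited study.

```python
def organization_means(groups):
    """Aggregate individual ratings to one mean score per organization."""
    return {org: sum(r) / len(r) for org, r in groups.items()}


def icc1(groups):
    """ICC(1) from a one-way ANOVA.

    groups: dict mapping organization -> list of individual ratings.
    """
    ratings = [r for g in groups.values() for r in g]
    n_total = len(ratings)
    n_groups = len(groups)
    grand_mean = sum(ratings) / n_total
    means = organization_means(groups)

    ss_between = sum(
        len(g) * (means[org] - grand_mean) ** 2 for org, g in groups.items()
    )
    ss_within = sum(
        (r - means[org]) ** 2 for org, g in groups.items() for r in g
    )
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)

    # Adjusted average group size, to allow unequal numbers of raters.
    k0 = (
        n_total - sum(len(g) ** 2 for g in groups.values()) / n_total
    ) / (n_groups - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


# Hypothetical 7-point effectiveness ratings by raters in three organizations.
ratings = {
    "org_a": [6, 7, 6, 7],
    "org_b": [3, 4, 3],
    "org_c": [5, 5, 6, 4, 5],
}

print(organization_means(ratings))
print(f"ICC(1): {icc1(ratings):.2f}")
```

A value close to 1 indicates that raters within the same organization largely agree, so that the aggregated mean is a defensible organization-level score; low or negative values signal that individual differences dominate.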

Rather than controlling for potential biases, it is also possible to consider more

complex research questions, in which both individuals and organizations are units of

interest. We know little about the factors and effects of the unique perceptions of

individual raters versus the shared perceptions among raters in the same

organization; substantial improvement of our understanding of nonprofit effective-

ness might result from a closer examination of such multilevel data structures (Lynn

et al. 2000; Yoo and Brooks 2005; Hitt et al. 2007). Thus, the challenges regarding

existing biases become research questions, with both high academic and practical

relevance (Lynn et al. 2000; Sowa et al. 2004; Willems et al. 2012a).

Internal Versus External Measurements

Closely related to the previous trade-off, some additional advantages and

disadvantages refer to internal assessments, being the opinions of people inside

the organization (e.g., CEO, board members, staff, volunteers) versus external

assessments by people outside the organization (e.g., customers, funders,

beneficiaries) (Van Puyvelde et al. 2012). As Green and Griesinger (1996) note, relying too

much on inside judgments can result in biased assessments due to social desirability

or ivory tower judgments, which arise when people in leadership positions lack

insight into actual operational performance. Yet the availability of information to

insiders means that substantially more and detailed information can be gathered

from these internal respondents (Brown 2005).

Using assessments by donors, beneficiaries, other organizations in the field, or

other external stakeholders offers the advantage of greater relevance, in terms of

real effects (Smith and Shen 1996; Forbes 1998; Liao et al. 2001). The perceptions

held by these external stakeholders determine (implying a reflective operational-

ization; see ‘‘Formative versus reflective measurements’’ section), for example,

donation levels, subsidies, volunteering intentions, and whether their interests are

sufficiently met by the organizational outcomes (Forbes 1998; Padanyi and Gainer

2003; Yoo and Brooks 2005; Daellenbach et al. 2006; Sarstedt and Schloderer 2010;

Mews and Boenigk 2012). Yet their perceptions also might reflect biases, including

those mentioned in the previous trade-off (i.e., different stakeholder needs or

different personal reference frameworks). The decisions about whom to rely on

(‘‘Internal versus external measurements’’ section) and how many sources to include

(‘‘Individual versus group measurements’’ section) thus might be combined, to deal

with the overall biases that can result from differences between the unit of interest

and the unit of data collection and/or analysis.

Leading Versus Lagging Measurements

The distinction between leading and lagging measurements, as noted in practice-

oriented business literature (Brewer and Speh 2000; Epstein and Wisner 2001;

Bremser and Barsky 2004), refers to the causal sequence of organizational actions

and direct outputs on one hand, and the effects for various stakeholders on the other.

From a practical perspective, lagging indicators focus on the overall, mission-

related effects of an organization for a certain period of time, following from the

investments made and actions performed by an organization (e.g., number of

healthy children after a vaccination program, change in literacy rate due to a

deployed reading program). Leading indicators instead forecast and quantify the

organization’s actions or direct outcomes (e.g., number of vaccinations given,

number of people educated in the reading programs).

Leading indicators tend to be more objectively observable. For example, whether

particular board practices are in place is easier to observe than the extent to which

public opinion has changed after a campaign. A focus on leading indicators might

help avoid biases due to social desirability, personal background, or strong

individual reference frameworks. Furthermore, leading indicators often can be

generalized across heterogeneous samples. For example, regardless of the mission

or strategy of an organization, specific board or management practices likely need

to be in place at various types of organizations, and can therefore more easily be

compared across organizations. Assessments of their achieved mission or strategy

instead are much more context dependent (Sawhill and Williamson 2001; Moxham

2009).

In contrast, lagging indicators have greater relevance, in the sense that they

reflect changes in society, achievement of mission, or impact on stakeholders. Yet

they also tend to be more subjective and dependent on personal reference

frameworks. To deal with these issues, we refer to prior suggestions regarding the

need to consider multiple external assessments, and control and test for contextual

and individual characteristics.

In this context, we note an interesting contribution by Packard (2010), who

describes a complex, conceptual sequence of relevant indicators: input indicators,

throughput indicators, management capacity, program capacity, and outputs. These

combined dimensions reveal the organization’s overall performance as an ‘‘over-

riding concept, including throughputs in terms of program and management

operations; and results including outputs, quality, efficiency, and effectiveness’’

(Packard 2010, p. 976). With a survey of 52 staff members in 14 nonprofit

programs, Packard finds that the respondents consider goal accomplishment of great

importance. These lagging indicators also are of stronger interest from a theoretical

perspective and they invoke generally high ratings when people are asked to judge

their own organization. Perhaps social desirability influences their answers,

especially those for indicators that are less objectively measurable. However,

respondents seem to rate leading indicators, such as customer and employee

satisfaction, as better indicators of nonprofit effectiveness, because such indicators

can more actively be managed, as well as compared across organizations.

Distinct Versus Overlapping Effectiveness Measurements

While the previous trade-off deals with the causal sequence of distinct concepts like

organizational actions, outcomes, and effects, here we address potentially overlap-

ping effectiveness-related concepts. The literature shows different logical sequences

regarding overall effectiveness, in which for example (1) particular board practices

lead to board performance (or governance quality), (2) board performance leads to

organizational performance, (3) organizational performance leads to reputational

effectiveness, and (4) reputational effectiveness leads to higher donations (stake-

holder performance) (Bradshaw et al. 1992; Green and Griesinger 1996; Kushner

and Poole 1996; Siciliano 1997; Smith and Shen 1996; Herman et al. 1997; Herman

and Renz 2000; Forbes 1998; Padanyi and Gainer 2003; Radbourne 2003; Brown

2005; Gill et al. 2005; Yoo and Brooks 2005; Daellenbach et al. 2006; Carman and

Fredericks 2010; Sarstedt and Schloderer 2010; Mews and Boenigk 2012; Helmig

et al. 2013).

When several of these effectiveness-related concepts appear in a single study,

regardless of whether they are formative or reflective, or leading or lagging, they

might be completely distinct, partially overlapping, or even completely overlapping

(Tacq 1984). Consider for example the unique impact of multiple management

practices and their quality on organizational effectiveness, tested in a regression

analysis. An inherent assumption is that they are distinct concepts in a causal

sequence. However, when the assessment of all these concepts (management

practices and effectiveness) relies on perceptions, the indicators could be related or

partially overlapping, as they contain very similar elements. The overlap even could

be so extensive that the assumed cause equals the hypothesized effect (i.e., two

conceptual names are used, but both refer to a single and general concept, or causa

aequat effectum, Tacq 1984, p. 146). If the researcher posits that the concepts are

distinct when they are not (or only partially), substantive causal interpretations

could emerge for non-existing relations (i.e., Type I error). In trade-off 2, we

described similar errors, which resulted from assigning different interpretations to

seemingly distinct dimensions that were not different in reality. Here, the errors

result from assuming causal relations among concepts that are, at least from a

measurement perspective, not different.

Thus, when multiple effectiveness-related concepts combine in a single study,

with causal relations hypothesized and tested across these concepts, we suggest an

initial, thorough analysis of whether the measured concepts actually may be

considered distinct. The discriminant and convergent validity of the concepts could

be tested (Fornell and Larcker 1981; Cohen 1988; Wilson et al. 2007; Farrell and

Rudd 2009; Farrell 2010; Shiu et al. 2011). For example, Boenigk and Helmig

(2013) assess whether donor–nonprofit identification and donor identity salience are

distinct constructs, with separate and direct effects on donation behavior. This

assessment revealed that the two identification constructs were distinct, even though

the indicators used for the two measurements seemed similar. To avoid Type I

errors, more careful interpretations are needed regarding whether different concepts

truly are being measured and if they causally relate, especially if single informants

provide the judgments for several effectiveness-related concepts, such as measures

of professionalism, aspects of leadership, or governance effectiveness (e.g., 260

CEOs judging their own organizations, LeRoux and Wright 2010), as well as

perceptions of chairs (e.g., Harrison et al. 2012 surveyed different people close to

board chairs). When extremely high standardized coefficients or correlations (up to

0.86) are reported as causal relationships between perception-based concepts, we

might suspect that the measurements combine leading and lagging indicators,

quantifying a single overall latent construct that the respondents perceive similarly,

rather than distinct concepts that are causally related.
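
A discriminant validity check of this kind can be sketched with the Fornell and Larcker (1981) criterion, under which two constructs are treated as distinct only if the average variance extracted (AVE) of each exceeds their squared correlation. In the illustration below, the standardized loadings are hypothetical, and the correlation of 0.86 simply mirrors the upper value mentioned above; the function names are likewise assumptions made for this example.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings of one construct."""
    return sum(l * l for l in loadings) / len(loadings)


def fornell_larcker_distinct(loadings_a, loadings_b, correlation):
    """True if both AVEs exceed the squared inter-construct correlation."""
    shared_variance = correlation ** 2
    return (
        average_variance_extracted(loadings_a) > shared_variance
        and average_variance_extracted(loadings_b) > shared_variance
    )


# Hypothetical standardized loadings for two perception-based constructs.
board_performance = [0.8, 0.8, 0.8]
organizational_performance = [0.8, 0.8, 0.8]

# A correlation of 0.86 implies about 74% shared variance, which is more
# than the roughly 64% each construct shares with its own indicators.
print(fornell_larcker_distinct(
    board_performance, organizational_performance, 0.86
))

# With a moderate correlation, the same constructs pass the criterion.
print(fornell_larcker_distinct(
    board_performance, organizational_performance, 0.50
))
```

Failing the criterion does not show which interpretation is correct, but it signals that a causal reading of the relation between the two measures deserves caution.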

In contrast, studying overlapping concepts can help in better understanding how

mental models are shaped by (groups of) practitioners. Having insight into how

individuals separately, or groups through mutual interaction, construct a mental

framework of how their actions and decisions potentially result in outcomes can

improve our understanding of individual and collective sensemaking processes. As

these processes are at the basis of actual managerial and governance decisions, they

can clarify what information is mainly taken into account and for what reasons

(Mitchell 2013).

Additive Versus Multiplicative Measurements

Researchers and practitioners might contemplate assumptions they make with

respect to aggregating data, which can reflect different criteria, dimensions, and

items (see ‘‘Uni- versus multidimensional measurements’’ and ‘‘Formative versus

reflective measurements’’ sections), or different sources (see ‘‘Individual versus

group measurements’’ section). In most cases, if multiple items refer to a single

dimension, an average score is used. In combination with the high common variance

of reflective constructs, such a quantification should give a reliable indication of an

existing latent concept. The combination of multiple sources, together with high

inter-rater reliability (Boyer and Verma 2000), also offers a reliable assessment. For

formative items, weights (comparable to regression coefficients) might be used to

extract the unique effects of the indicators. Similar considerations apply to

combinations of dimensions or of assessments by different stakeholder groups. In

these cases, the assumption is that the measurement model is additive, such that the

(weighted) sum of multiple assessments constitutes the score for the overall

concepts. From a practitioner perspective, a multiplicative approach may be more

useful for assessing and benchmarking organizations. This approach stresses the

conditionality of each dimension or data source. For example, if an observer reviews

several dimensions to make a holistic assessment of an organization’s effectiveness,

separate scores are available for each dimension. An organization might score high

on all except one dimension and extremely low on this single dimension. An

additive approach would produce a fairly positive outcome, because the negative

outcome for the one dimension would be compensated for by the other dimensions.

A multiplicative approach instead would lead to a lower value for the overall

concept, because in this view, an organization is effective only if it performs well on

every dimension. The latter approach thus could be valuable for selecting highly

effective organizations for further qualitative exploration (Balser and McClusky

2005; Herman and Renz 2000). Finally, a multiplicative approach can reveal the

causes and effects of a particular and obvious shortcoming, rather than of averaged

achievements across multiple dimensions or sources (Willems et al. 2012a).
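
The difference between the two aggregation rules can be shown with a small numerical sketch. It assumes, purely for illustration, that each dimension score has been rescaled to the interval from 0 to 1 and that the multiplicative rule is a simple product of those scores; the two organizations and their scores are hypothetical.

```python
from math import prod


def additive_score(scores, weights=None):
    """Weighted sum of dimension scores (equal weights by default)."""
    if weights is None:
        weights = [1 / len(scores)] * len(scores)
    return sum(w * s for w, s in zip(weights, scores))


def multiplicative_score(scores):
    """Product of dimension scores: one low score pulls the total down."""
    return prod(scores)


# Hypothetical scores on four dimensions, each rescaled to the 0-1 range.
balanced_org = [0.7, 0.7, 0.7, 0.7]
uneven_org = [0.9, 0.9, 0.9, 0.1]  # strong overall, one severe shortcoming

for name, scores in [("balanced", balanced_org), ("uneven", uneven_org)]:
    print(
        name,
        round(additive_score(scores), 4),
        round(multiplicative_score(scores), 4),
    )
```

Both organizations receive the same additive score, because the weak dimension is compensated for by the strong ones, whereas the multiplicative score of the uneven organization is clearly lower. Other multiplicative forms (for example, a weighted geometric mean) would preserve this conditionality while keeping the result on the original scale.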

Conclusions and Avenues for Further Research

Rather than striving for a ‘‘best’’ way that is applicable to any kind of nonprofit

organization, we have sought to present both the advantages and the disadvantages

of a variety of choices that can be made, taking into account the contextual elements

of an effectiveness measurement project. As our first contribution, we offer an

overview of these advantages and disadvantages, as summarized in Table 1. This

overview offers decision guidelines for researchers and practitioners who engage in

nonprofit effectiveness measurement projects.

As a second main contribution, we identify from the seven trade-offs five

research avenues that could extend and enhance the ongoing nonprofit effectiveness

discussion.

First, our approach started with existing theoretical insights, examples, and

literature overviews from the broad nonprofit research domain. Starting from the

seven trade-offs identified, a next step for research could be to pursue a more

standardized, abstracted approach for identifying general decision rules for

nonprofit effectiveness measurement projects. Our literature review enabled us to

identify the separate trade-offs, but it was outside the scope of this project to

investigate actual, concrete decisions that have been considered (or ignored) by the

authors of the studies that we cite. This limitation actually is inherent to the

academic publication process, which focuses on reporting detailed research findings

rather than on the trade-offs that the researchers or practitioners considered in reaching

these findings. It would be even harder to gain insights into all the contextual factors

that determined their choices. Therefore, we invite researchers and practitioners to

discuss and evaluate their own project-related decisions, using the trade-offs that we

identified (e.g., in the methodology section of their articles). In addition to clarifying

the purpose and decision processes underlying research projects, this could also

result in a more critical evaluation and elaboration of the trade-offs postulated

herein. Ultimately, this could also provide more standardized insights into how

contextual settings can determine optimal decisions for nonprofit effectiveness

measures.

Second, considering the misinterpretations that can result from misspecifications

or over-investigation of seemingly distinct but overlapping concepts (see ‘‘Forma-

tive versus reflective measurements’’ and ‘‘Distinct versus overlapping effectiveness

measurements’’ sections), we hope further research digs into the question of where

to draw boundaries between dimensions and/or related effectiveness concepts. In

contrast with the divergent evolution in the previous literature, typified by the

introduction of new classifications, dimensions, conditions, factors, and effects, we

consider a converging phase in nonprofit effectiveness research more relevant.

When new concepts are introduced, a proper clarification and testing of how they

are similar to or different from those that already exist is important for true

theory advancement. In particular, when concepts are quantified for further analysis,

we strongly suggest reusing existing measurements. As Baruch and Ramalho (2006)

recommend, new studies on nonprofit effectiveness might reuse basic, generalizable

concepts across contexts and for various research questions. Depending on the

particular context, additional measurement criteria might be introduced. Such

(partial) reuse of dimensions, criteria, or scales offers two major advantages for an

overall understanding of nonprofit effectiveness. It provides a continuous verifica-

tion of existing concepts, dimensions, and measurements, both theoretically and

methodologically, and it enables a more critical evaluation of the generalizability of

prior research findings across various contexts.

Third, we used the distinction between units of interest, data collection, and

analysis mainly to discuss potential measurement biases. However, the differences

among these units also imply strong opportunities for developing new theoretical

insights into nonprofit effectiveness. Several contributions already assert that

nonprofit effectiveness is a social construction (Herman and Renz 1997; Forbes

1998; Liao-Troth and Dunn 1999), focusing mainly on the fact that people

individually develop an understanding of what nonprofit effectiveness means to

them. However, through their social interactions they develop a certain degree of

common understanding. The social constructionist perspective thus far has served

mainly to argue how people differ in their assessments of organizations. An opposite

perspective, regarding the extent to which they agree, might deserve more attention.

That is, we know little about how and why groups of people come to agree about the

meaning of nonprofit effectiveness. Yet the perceived effectiveness of an organization

has significant impacts on the behavior of individuals and stakeholder groups

toward the organization (Yoo and Brooks 2005; Daellenbach et al. 2006;

Sarstedt and Schloderer 2010; Mews and Boenigk 2012), so research should address

in-depth the social interaction factors and processes that shape shared versus unique

perceptions of an organization’s effectiveness. This analysis might provide

important new insights into how social processes among individuals can be

managed to steer collective behavior toward an organization.

Fourth, for conciseness, we focused only on nonprofit-related examples and

literature reviews. However, challenges in measuring effectiveness-related concepts

go far beyond nonprofit realms. In addition to the supplementary suggestions that

we have made already, we recommend that continued contributions contemplate

and investigate similarities and dissimilarities with other contexts, such as public

and profit sectors (Rojas 2000; Micheli and Kennerley 2005). Such a comparison

could enhance our understanding of the extent to which these seven trade-offs are

unique to the nonprofit sector or generalizable to other areas, or whether these trade-

offs are dealt with differently, or result in other decisions for other research

domains.

Fifth, we note the challenges that remain for bridging the gap between theoretical

insights and practitioner recommendations for measuring nonprofit effectiveness.

We have postulated seven trade-offs, largely based on academic literature. Yet the

basic decisions in each trade-off are highly relevant for practitioners too. For

example, practitioners seek high-quality, reliable measures of their actions, outputs,

and effects, and they hope to obtain these measures at reasonable costs.

Accordingly, they face very similar decisions in terms of which criteria to measure,

what sources to use, how many sources to gather, how to aggregate data, how to

quantify various measurement types, and so on. Thus, research projects should focus

not only on the type of measurements used by nonprofit practitioners or their

usefulness but also on why they have been chosen and in which contexts. Such

studies could provide some verification for whether the seven trade-offs we propose

are truly relevant beyond the academic context. More important, they could reveal

new insights into the practical usefulness of various types of effectiveness

measurements, depending on the real-world context.

As nonprofit organizations are increasingly confronted with competition to attract

resources and with continuous internal and external accountability obligations

(Ashley and Faulk 2010; Faulk et al. 2012), the trade-offs described in this article

could help as guidelines in professionalizing their measurement-based management.

As pointed out by Mitchell (2013), nonprofit practitioners often rely on personal and

intuitive effectiveness assessments to make managerial decisions. In addition, the

content of these assessments unfortunately seems not to be fully captured by the

effectiveness focus taken by academic scholars. Therefore, nonprofit practitioners

and consultants could move forward by setting up their own measurement systems that

support their particular needs, yet provide sufficiently useful information

at a reasonable cost. In this context, organizations could continuously experiment

with different types of measurements and effectiveness reporting, in order to find

out how sufficient managerial benefits can be obtained [e.g., more efficient practices,

more (social) return on investment, more impact on stakeholders, etc.]. When

selecting these performance indicators, special attention could be devoted to finding an

optimal balance regarding the many advantages and disadvantages discussed in the

trade-offs. For example, practitioners could define for their own organization a set

of broad and general performance indicators that allow comparison with other

organizations. In doing so, organizations can benchmark themselves and learn from

each other when aiming to improve their practices and performance indicators. In

addition, specific indicators for the particular context of an organization and its

stakeholders can complement the core set of general indicators, aiming at better

understanding how particular actions cause improvements in serving stakeholder

interests and in achieving the organizational mission. In doing so, they can verify or

adjust their own mental models on how their actions potentially result in

performance, and they can incrementally improve their understanding of the true

impact of their decisions and actions.

References

Ashley, S., & Faulk, L. (2010). Nonprofit competition in the grants marketplace: Exploring the

relationship between nonprofit financial ratios and grant amount. Nonprofit Management and

Leadership, 21(1), 43–57.

Babiak, K. M. (2009). Criteria of effectiveness in multiple cross-sectoral interorganizational relation-

ships. Evaluation & Program Planning, 32(1), 1–12.

Balser, D., & McClusky, J. (2005). Managing stakeholder relationships and nonprofit organization

effectiveness. Nonprofit Management & Leadership, 15(3), 295–315.

Baruch, Y., & Ramalho, N. (2006). Communalities and distinctions in the measurement of organizational

performance and effectiveness across for-profit and nonprofit sectors. Nonprofit and Voluntary

Sector Quarterly, 35(1), 39–65.

Bergkvist, L., & Rossiter, J. R. (2007). The predictive validity of multiple-item versus single-item

measures of the same constructs. Journal of Marketing Research, 44(2), 175–184.

Boenigk, S., & Helmig, B. (2013). Why do donors donate? Examining the effects of organizational

identification and identity salience on the relationships among satisfaction, loyalty, and donation

behavior. Journal of Service Research, 16, 533–548.

Boenigk, S., Leipnitz, S., & Scherhag, C. (2011). Altruistic values, satisfaction and loyalty among first-time

blood donors. International Journal of Nonprofit and Voluntary Sector Marketing, 16(4), 356–370.

Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation

perspective. Psychological Bulletin, 110(2), 305–314.

Boyer, K. K., & Verma, R. (2000). Multiple raters in survey-based operations management research:

A review and tutorial. Production and Operations Management, 9(2), 128–140.

Bradshaw, P., Murray, V., & Wolpin, J. (1992). Do nonprofit boards make a difference? An exploration of

the relationships among board structure, process and effectiveness. Nonprofit and Voluntary Sector

Quarterly, 21(3), 227–249.

Bremser, W. G., & Barsky, N. (2004). Utilizing the balanced scorecard for R&D performance

measurement. R&D Management, 34(3), 229–238.

Brewer, P. C., & Speh, T. W. (2000). Using the balanced scorecard to measure supply chain performance.

Journal of Business Logistics, 21(1), 75–93.

Brown, W. A. (2005). Exploring the association between board and organizational performance in

nonprofit organizations. Nonprofit Management & Leadership, 15(3), 317–339.

Carman, J. G., & Fredericks, K. A. (2010). Evaluation capacity and nonprofit organizations: Is the glass

half-empty or half-full? American Journal of Evaluation, 31(1), 84–104.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence

Erlbaum.

Daellenbach, K., Davies, J., & Ashill, N. J. (2006). Understanding sponsorship and sponsorship

relationships—Multiple frames and multiple perspectives. International Journal of Nonprofit and

Voluntary Sector Marketing, 11(1), 73–87.

Dart, R. (2010). A grounded qualitative study of the meanings of effectiveness in Canadian ‘results-

focused’ environmental organizations. VOLUNTAS: International Journal of Voluntary and

Nonprofit Organizations, 21, 202–219.

DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed., Vol. 26). Applied Social

Research Methods Series. London: Sage Publications.

Diamantopoulos, A. (1999). Viewpoint–export performance measurement: Reflective versus formative

indicators. International Marketing Review, 16(6), 444–457.

Diamantopoulos, A., & Siguaw, J. A. (2006). Formative versus reflective indicators in organizational

measure development: A comparison and empirical illustration. British Journal of Management,

17(4), 263–282.

Diamantopoulos, A., & Winklhofer, H. M. (2001). Index construction with formative indicators:

An alternative to scale development. Journal of Marketing Research, 38(2), 269–277.

DiMaggio, P. (2001). Measuring the impact of the nonprofit sector on society is probably impossible but

possibly useful: A sociological perspective. In P. Flynn & V. Hodgkinson (Eds.), Measuring the

impact of the nonprofit sector (pp. 249–272). New York: Kluwer Academic/Plenum Publishers.

Drolet, A. L., & Morrison, D. G. (2001). Do we really need multiple-item measures in service research?

Journal of Service Research, 3(3), 196–204.

Eckerd, A., & Moulton, S. (2011). Heterogeneous roles and heterogeneous practices: Understanding the

adoption and uses of nonprofit performance evaluations. American Journal of Evaluation, 32(1),

98–117.

Epstein, M. J., & Wisner, P. S. (2001). Using a balanced scorecard to implement sustainability.

Environmental Quality Management, 11(2), 1–10.

Farrell, A. M. (2010). Insufficient discriminant validity: A comment on Bove, Pervan, Beatty and Shiu

(2009). Journal of Business Research, 63(5), 324–327.

Farrell, A. M., & Rudd, J. M. (2009). Factor analysis and discriminant validity: A brief review of some

practical issues. Australia-New Zealand Marketing Academy Conference (ANZMAC), December,

Melbourne, Australia.

Faulk, L., Lecy, J. D., & McGinnis, J. (2012). Nonprofit competitive advantage in grant markets:

Implications of network embeddedness. Andrew Young School of Policy Studies Research Paper

Series No. 13-07.

Forbes, D. P. (1998). Measuring the unmeasurable: Empirical studies of nonprofit organization

effectiveness from 1977 to 1997. Nonprofit and Voluntary Sector Quarterly, 27(2), 183–202.

Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables

and measurement error. Journal of Marketing Research, 18(1), 39–50.

Frumkin, P. (2002). On being nonprofit: A conceptual policy primer. Cambridge, MA: Harvard

University Press.

Gill, M., Flynn, R. J., & Reissing, E. (2005). The governance self-assessment checklist: An instrument for

assessing board effectiveness. Nonprofit Management & Leadership, 15(3), 271–294.

Gordon, T. P., Khumawala, S. B., Kraut, M., & Neely, D. G. (2010). Five dimensions of effectiveness for

nonprofit annual reports. Nonprofit Management & Leadership, 21(2), 209–228.

Green, J. C., & Griesinger, D. W. (1996). Board performance and organizational effectiveness in

nonprofit social services organizations. Nonprofit Management & Leadership, 6(4), 381–402.

Groves, R. M., & Heeringa, S. G. (2006). Responsive design for household surveys: Tools for actively

controlling survey errors and costs. Journal of the Royal Statistical Society Series A, 169(3),

439–457.

Hansmann, H. (1987). Economic theories of nonprofit organizations. In W. W. Powell (Ed.), The

nonprofit sector: A research handbook (pp. 27–42). New Haven, CT: Yale University Press.

Harrison, Y., Murray, V., & Cornforth, C. (2013). Perceptions of board chair leadership effectiveness in

nonprofit and voluntary sector organizations. VOLUNTAS: International Journal of Voluntary and

Nonprofit Organizations, 24(3), 688–712. doi:10.1007/s11266-012-9274-0.

Helmig, B., Ingerfurth, S., & Pinz, A. (2013). Success and failure of nonprofit organizations: Theoretical

foundations, empirical evidence, and future research. VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations. doi:10.1007/s11266-013-9402-5.

Herman, R. D., & Renz, D. O. (1997). Multiple constituencies and the social construction of nonprofit

organization effectiveness. Nonprofit and Voluntary Sector Quarterly, 26(2), 185–206.

Herman, R. D., & Renz, D. O. (1999). Theses on nonprofit organizational effectiveness. Nonprofit and

Voluntary Sector Quarterly, 28(2), 107–126.

Herman, R. D., & Renz, D. O. (2000). Board practices of especially effective and less effective local

nonprofit organizations. American Review of Public Administration, 30(2), 146–160.

Herman, R. D., & Renz, D. O. (2008). Advancing nonprofit organizational effectiveness research and

theory: Nine theses. Nonprofit Management & Leadership, 18(4), 399–415.

Herman, R. D., Renz, D. O., & Heimovics, R. D. (1997). Board practices and board effectiveness in local


Hitt, M. A., Beamish, P. W., Jackson, S. E., & Mathieu, J. E. (2007). Building theoretical and empirical

bridges across levels: Multilevel research in management. Academy of Management Journal, 50(6),

1385–1399.

Jackson, D. K., & Holland, T. P. (1998). Measuring effectiveness of nonprofit boards. Nonprofit and


Jarvis, C. B., MacKenzie, S. B., & Podsakoff, P. M. (2003). A critical review of construct indicators and

measurement model misspecification in marketing and consumer research. Journal of Consumer

Research, 30(2), 199–218.

Jöreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators and multiple

causes of a single latent variable. Journal of the American Statistical Association, 70(351), 631–639.

Jun, K.-N., & Shiau, E. (2012). How are we doing? A multiple constituency approach to civic association

effectiveness. Nonprofit and Voluntary Sector Quarterly, 41(4), 632–655.

Kaplan, R. S. (2001). Strategic performance measurement and management in nonprofit organizations.

Nonprofit Management & Leadership, 11(3), 353–370.

Krashinsky, M. (1997). Stakeholder theories of the non-profit sector: One cut at the economic literature.

VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations, 8(2), 149–161.

Kushner, R. J., & Poole, P. P. (1996). Exploring structure–effectiveness relationships in nonprofit arts

organizations. Nonprofit Management & Leadership, 7(2), 119–136.

Lecy, J. D., Schmitz, H. P., & Swedlund, H. (2012). Non-governmental and not-for-profit organizational

effectiveness: A modern synthesis. VOLUNTAS: International Journal of Voluntary and Nonprofit

Organizations, 23(2), 434–457.

LeRoux, K., & Wright, N. S. (2010). Does performance measurement improve strategic decision making?

Findings from a national survey of nonprofit social service agencies. Nonprofit and Voluntary Sector

Quarterly, 39(4), 571–587.

Liao, M., Foreman, S., & Sargeant, A. (2001). Market versus societal orientation in the nonprofit context.

International Journal of Nonprofit and Voluntary Sector Marketing, 6(3), 254–268.

Liao-Troth, M., & Dunn, C. P. (1999). Social constructs and human service: Managerial sensemaking of

volunteer motivation. VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations, 10(4), 345–361.

Lynn, L. E., Jr., Heinrich, C. J., & Hill, C. J. (2000). Studying governance and public management:

Challenges and prospects. Journal of Public Administration Research and Theory, 10(2), 233–261.

Maas, C. J. M., & Hox, J. (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3),

86–92.

Mews, M., & Boenigk, S. (2012). Does organizational reputation influence the willingness to donate

blood? International Review on Public and Nonprofit Marketing. doi:10.1007/s12208-012-0090-4.

Micheli, P., & Kennerley, M. (2005). Performance measurement frameworks in public and non-profit

sectors. Production Planning & Control, 16(2), 125–134.

Mistry, S. (2007). How does one voluntary organization engage with multiple stakeholder views of

effectiveness? Voluntary Sector Working Paper: Number 7, London School of Economics and

Political Science.

Mitchell, G. E. (2013). The construct of organizational effectiveness: Perspectives from leaders of

international nonprofits in the United States. Nonprofit and Voluntary Sector Quarterly, 42(2),

324–345.

Moxham, C. (2009). Performance measurement: Examining the applicability of the existing body of

knowledge to nonprofit organisations. International Journal of Operations & Production

Management, 29(7), 740–763.

Osborne, S. P., & Tricker, M. (1995). Researching non-profit organisational effectiveness: A comment on

Herman and Heimovics. VOLUNTAS: International Journal of Voluntary and Nonprofit Organi-

zations, 6(1), 85–92.

Packard, T. (2010). Staff perceptions of variables affecting performance in human service organizations.

Nonprofit and Voluntary Sector Quarterly, 39(6), 971–990.

Padanyi, P., & Gainer, B. (2003). Peer reputation in the nonprofit sector: Its role in nonprofit sector

management. Corporate Reputation Review, 6(3), 252–265.

Pandey, S. K., Coursey, D. H., & Moynihan, D. P. (2007). Organizational effectiveness and bureaucratic

red tape: A multimethod study. Public Performance & Management Review, 30(3), 398–425.

Perrow, C. (1961). The analysis of goals in complex organizations. American Sociological Review, 26(6),

854–866.

Petter, S., Straub, D., & Rai, A. (2007). Specifying formative constructs in information systems research.

Management Information Systems Quarterly, 31(4), 623–656.

Plantz, M. C., Greenway, M. T., & Hendricks, M. (1997). Outcome measurement: Showing results in the

nonprofit sector. New Directions for Evaluation, 75, 15–30.

Radbourne, J. (2003). Performing on board: The link between governance and corporate reputation in

nonprofit arts boards. Corporate Reputation Review, 6(3), 212–222.

Richard, P. J., Devinney, T. M., Yip, G. S., & Johnson, G. (2009). Measuring organizational performance:

Towards methodological best practice. Journal of Management, 35(3), 718–804.

Rojas, R. R. (2000). A review of models for measuring organizational effectiveness among for-profit and


Sarstedt, M., & Schloderer, M. P. (2010). Developing a measurement approach for reputation of non-

profit organizations. International Journal of Nonprofit and Voluntary Sector Marketing, 15(3),

276–299.

Sawhill, J. C., & Williamson, D. (2001). Mission impossible? Measuring success in nonprofit


Shilbury, D., & Moore, K. A. (2006). A study of organizational effectiveness for national Olympic

sporting organizations. Nonprofit and Voluntary Sector Quarterly, 35(1), 5–38.

Shiu, E., Pervan, S. J., Bove, L. L., & Beatty, S. E. (2011). Reflections on discriminant validity:

Reexamining the Bove et al. (2009) findings. Journal of Business Research, 64(3), 497–500.

Siciliano, J. I. (1997). The relationship between formal planning and performance in nonprofit


Smith, D. H., & Shen, C. (1996). Factors characterizing the most effective nonprofits managed by

volunteers. Nonprofit Management & Leadership, 6(3), 271–289.

Sowa, J. E., Selden, S. C., & Sandfort, J. R. (2004). No longer unmeasurable? A multidimensional

integrated model of nonprofit organizational effectiveness. Nonprofit and Voluntary Sector

Quarterly, 33(4), 711–728.

Stazyk, E. C., & Goerdel, H. T. (2010). The benefits of bureaucracy: Public managers’ perceptions of

political support, goal ambiguity, and organizational effectiveness. Journal of Public Administration

Research and Theory, 21(4), 645–672.

Tacq, J. J. A. (1984). Causaliteit in sociologisch onderzoek. Een beoordeling van causale analysetech-

nieken in het licht van wijsgerige opvattingen over causaliteit. Deventer: Van Loghum Slaterus.

Van Puyvelde, S., Caers, R., Du Bois, C., & Jegers, M. (2012). The governance of nonprofit

organizations: Integrating agency theory with stakeholder and stewardship theories. Nonprofit and


Willems, J., Huybrechts, G., Jegers, M., Weijters, B., Vantilborgh, T., Bidee, J., et al. (2012a). Nonprofit

governance quality: Concept and measurement. Journal of Social Service Research, 38(4), 561–578.

Willems, J., Van den Bergh, J., & Deschoolmeester, D. (2012b). Analyzing employee agreement on

maturity assessment tools for organizations. Knowledge and Process Management, 19(3), 142–147.

Wilson, B., Callaghan, W., Ringle, C. M., & Henseler, J. (2007). Exploring causal path directionality for a

marketing model using Cohen’s path method. In H. Martens, T. Naes, & M. Martens (Eds.),

Causalities Explored by Indirect Observation: Proceedings of the 5th International Symposium on

PLS and Related Methods (PLS’07) MATFORSK (pp. 57–61).

Yoo, J., & Brooks, D. (2005). The role of organizational variables in predicting service effectiveness:

An analysis of a multilevel model. Research on Social Work Practice, 15(4), 267–277.
