ORIGINAL PAPER
Seven Trade-offs in Measuring Nonprofit Performance and Effectiveness
Jurgen Willems • Silke Boenigk • Marc Jegers
Published online: 6 March 2014
© International Society for Third-Sector Research and The Johns Hopkins University 2014
Abstract To complement contemporary nonprofit literature, which mainly offers
theory-driven recommendations for measuring nonprofit effectiveness, performance,
or related concepts, this article presents seven trade-offs for researchers and
practitioners to consider before engaging in a nonprofit effectiveness measurement
project. For each trade-off, we offer examples and suggestions to clarify the
advantages and disadvantages of methodological choices that take various contex-
tual elements into account. In particular, we address the differences between for-
mative and reflective approaches, as well as the differences between unit of interest,
unit of data collection, and unit of analysis. These topics require more in-depth
attention in the nonprofit effectiveness literature to avoid misinterpretations and
measurement biases. Finally, this article concludes with five avenues for further
research to help address key challenges that remain in this research area.
Résumé To complement the contemporary literature on nonprofit organizations,
which mainly offers theory-based recommendations for evaluating their effectiveness,
performance, or related concepts, this article presents seven trade-offs that
researchers and practitioners can consider before engaging in a project to evaluate
the effectiveness of a nonprofit organization. For each trade-off, we give examples
and suggestions that highlight the advantages and disadvantages of methodological
choices that take various contextual elements into account. In particular, we
address the differences between formative and reflective approaches, as well as
between unit of interest, unit of data collection, and unit of analysis. These topics
need to be examined in greater depth in the literature on nonprofit effectiveness to
avoid misinterpretations and evaluation biases. Finally, this article concludes with
five avenues to explore in order to address the important challenges that remain in
this research area.
J. Willems (✉) · S. Boenigk
Department of Nonprofit & Public Management, University of Hamburg, Von-Melle-Park 5,
20146 Hamburg, Germany
e-mail: jurgen.willems@wiso.uni-hamburg.de
J. Willems · M. Jegers
Department of Applied Economics, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium
Voluntas (2014) 25:1648–1670
DOI 10.1007/s11266-014-9446-1
Zusammenfassung The conduct of empirical nonprofit research has increased sharply
in recent years, in both academic research and nonprofit practice. However, whereas
methodological debate and questions of measurement are discussed intensively and
sometimes critically in other disciplines and journals, this discourse has not yet
begun in nonprofit research. This article therefore pursues the general aim of
stimulating methodological discourse among nonprofit researchers and practitioners,
in order to contribute in the medium term to improving the measurement quality of
empirical nonprofit studies. To this end, seven decision areas are presented and
discussed in depth that deserve greater attention when conducting empirical studies
in nonprofit management, particularly when measuring nonprofit success and
effectiveness. Specifically, these are: (1) uni- versus multidimensionality,
(2) formative versus reflective measurement, (3) individual versus group opinion,
(4) internal versus external measurement, (5) leading versus lagging, (6) distinct
versus overlapping measurement, and finally (7) additive versus multiplicative
measurement approach. Each decision area is explained with its advantages and
disadvantages and discussed using examples, in order to derive recommendations for
the nonprofit community on which measurement approach is preferable in which study
situations.
Resumen To complement the contemporary published material on nonprofit
organizations, which mainly offers theory-driven recommendations for measuring the
effectiveness, performance, or related concepts of nonprofit organizations, this
article presents seven trade-offs for researchers and practitioners to consider
before engaging in a project to measure the effectiveness of nonprofit
organizations. For each trade-off, we offer examples and suggestions to clarify the
advantages and disadvantages of methodological choices that take various contextual
elements into account. In particular, we address the differences between formative
and reflective approaches, as well as the difference between unit of interest, unit
of data collection, and unit of analysis. These topics require more in-depth
attention in the published material on nonprofit effectiveness to avoid
misinterpretations and measurement biases. Finally, this article concludes with five
avenues for further research to help address the key challenges that remain in this
research area.
Keywords Nonprofit performance · Nonprofit effectiveness · Measurement · Formative or reflective specifications · Measurement biases
Introduction
Measuring nonprofit performance and effectiveness has remained a topic of
substantial debate for more than half a century. Nonprofit organizations are defined
solely by their lack of profit distributions (Hansmann 1987), so they can have various
goals (Perrow 1961; DiMaggio 2001). Therefore, a one-size-fits-all solution to
measure the achievement of these goals is unlikely. As Lecy et al. (2012) and Jun and
Shiau (2012) note in their extensive historical overviews of nonprofit effectiveness
literature, multiple studies acknowledge the complexity and the plethora of theoretical
elements that constitute the very concept of effectiveness (Osborne and Tricker 1995;
Plantz et al. 1997; Rojas 2000; DiMaggio 2001; Sawhill and Williamson 2001; Sowa
et al. 2004; Herman and Renz 2008; Carman and Fredericks 2010).
Although previous research has used the terms nonprofit performance and
nonprofit effectiveness synonymously, we clarify their usage for this study. In line
with nonprofit research and organizational performance science (e.g., Richard et al.
2009), nonprofit performance encompasses four concrete areas of interest:
(a) financial performance (e.g., donations raised in a year, state funding),
(b) stakeholder performance (e.g., volunteer satisfaction, donor loyalty, stakeholder
identification), (c) market performance (e.g., nonprofit image, nonprofit brand
reputation, service quality), and (d) mission performance (achieving the mission of
the organization). Nonprofit effectiveness is closely related to, but broader than,
nonprofit performance, in that it focuses more on the balanced input and output
achieved through the combination of processes, projects, and programs imple-
mented by the nonprofit organization to reach its predefined goals. These nonprofit
goals are framed by the organization’s mission, which generally focuses on creating
particular effects for various stakeholder groups (Frumkin 2002).
Herman and Renz (1997, 1999, 2000, 2008) offer an intensive discussion of
nonprofit effectiveness, from which they derive nine general theses. Some of these
theses have important consequences for how nonprofit effectiveness is measured.
However, as pointed out by Lecy et al. (2012) and Jun and Shiau (2012), substantial
challenges remain with respect to the empirical verification of nonprofit perfor-
mance and effectiveness insights. In particular, they note the paradoxical
observation that proposed guidelines for measuring nonprofit effectiveness are
seldom all met in a single study.
Therefore, this study takes a more empirically driven perspective on nonprofit
effectiveness measurements, and we add an important but missing methodological
part to the mainly theory-driven recommendations available in contemporary
literature (as summarized by Jun and Shiau 2012; Lecy et al. 2012). We focus on
how the contextual factors of a measurement project, from both research and
practitioner perspectives, can be addressed more clearly as a means to reach
appropriate nonprofit effectiveness measurements. Specifically, we aim to provide
researchers and practitioners with a detailed overview of criteria to be taken into
account when measuring nonprofit effectiveness. Furthermore, we formulate
concrete avenues for research that can enhance both the methodological robustness
and the theoretical depth of the ongoing discussion about nonprofit effectiveness.
As we strongly want to avoid adding another list of theory-driven requirements to
the broad literature that already exists, we purposefully adopt a distinct approach.
We identify from our literature review seven trade-offs that researchers and
practitioners should consider when they initiate a project to measure nonprofit
effectiveness. For each trade-off, we note the advantages and disadvantages of each
choice, devoting special attention to the context in which the measurements would
be used (e.g., target group, unit of analysis, or type of research question). From
this perspective, we define a trade-off in the context of this study as an evaluation of
several context-dependent advantages and disadvantages that results in a choice
between two methodological options, or a well-considered combination of these
options. Discussing trade-offs, rather than presenting a supposed one-size-fits-all list
of recommendations, enables us to focus on the importance of the contextual
elements on which decisions could be based in order to come to valuable and
adjusted measurement designs. We build on empirical studies in the domain of
nonprofit effectiveness, which provide examples of advantages and disadvantages of
the various choices. In addition, we use methodological contributions from outside
the traditional nonprofit research domain, such as organizational research,
marketing, psychology, and/or sociology, which offer more substantial experience
in terms of quantifying complex concepts.
Trade-off Decisions in Measuring Effectiveness
Before discussing each trade-off, we stress two points. First, we structure our
discussion in accordance with these seven trade-offs, because the classification
allows us to address stand-alone elements. That is, each element has to be
considered by a researcher or practitioner before engaging in a nonprofit
effectiveness measurement project. Furthermore, this classification enables us to
single out elements from the overall discussion of nonprofit effectiveness, on which
we base our recommendations for further research. We regard all seven trade-offs as
equally important for any measurement project. Our explanations of them
nevertheless differ in length, largely because some aspects have not previously
been discussed to the extent they require.
Second, because we rely on broader literature to frame our methodological
considerations (i.e., organizational research, marketing, psychology, and sociology),
our focal topics may be relevant in other research areas too (e.g., corporate social
responsibility in a for-profit or public setting). Nevertheless, to maintain a clear
focus and provide a targeted contribution, beyond the theoretical overviews already
offered by Jun and Shiau (2012) and Lecy et al. (2012), we use mainly nonprofit-
related examples and theoretical insights from the nonprofit research domain. The
seven trade-offs are summarized in Table 1, and explained in detail in the following
sections.
Table 1 Seven trade-offs when measuring nonprofit effectiveness and performance (advantages and disadvantages)

1. Uni- versus multidimensional measurement

Unidimensional measurement (one item, criterion, or dimension to measure effectiveness)
+ Can be applied in heterogeneous samples
+ To test relationships in the context of a particular theory, but applicable across many different settings (higher relevance for specific research projects)
+ Easier to find larger samples for which a unidimensional measurement is relevant for all sample entities (e.g., different organizations, stakeholders)
+ Less cost/effort needed to obtain measurements
– Focus on a single aspect of effectiveness
– Conclusions might remain general
– Lower potential content validity and/or reliability of measurements

Multidimensional measurement (multiple criteria and/or dimensions)
+ Gives more detailed insights in the true complexity of nonprofit effectiveness
+ Can be used to investigate complex relationships of elements with other concepts (detailed analyses)
+ Higher practical relevance
+ Use of multidimensional measurements conforms with contemporary expectations in nonprofit effectiveness literature
– Suitable for particular contexts (more homogeneous samples)
– Potential data samples might be naturally constrained (statistical power to test relationships of interest might be low)
– More complex analysis methods might be necessary

2. Formative versus reflective measurements

Formative measurement (criteria define nonprofit effectiveness)
+ Conforms with a normative approach to effectiveness measurement (dominant rationale in contemporary literature)
+ Can support theoretical claims that different dimensions of effectiveness have unique impacts
+ Can investigate complex relationships of elements of effectiveness with other concepts (detailed analyses)
+ Appropriate for action-oriented assessments of effectiveness
– When wrongly specified, tested, or validated, might result in severe misinterpretations

Reflective measurement (criteria summarize nonprofit effectiveness)
+ Conforms with a social constructionist approach to effectiveness measurement
+ Can be used to improve reliability of measurements
+ Appropriate for perception-based evaluations of effectiveness
+ Validation methods are better known and more used in contemporary literature
– When wrongly specified, tested, or validated, might result in severe misinterpretations

3. Individual versus group measurements

Individual measurement (single source is consulted to measure effectiveness)
+ Larger samples can be composed at the level of the unit of interest
+ A broad range of criteria can be probed from the source with the most access to information
– When unit of interest differs from unit of data collection or analysis, substantial measurement biases might exist due to (a) social desirability, (b) different background of raters, or (c) different personal reference frameworks
– Interpretation of results should be at the atomistic level (i.e., level of data collection, rather than level of interest)

Group measurement (multiple sources are consulted to measure effectiveness)
+ More reliable measurements might be obtained (countering measurement biases)
+ A more holistic view from different perspectives can be acquired
– Data collection might be costly and complex
– Assumptions necessary to aggregate data, which might result in few data points at the unit of interest
– Control variable at individual (source) level might be necessary

4. Internal versus external measurements

Internal measurement (internal sources)
+ More detailed information is available
– Measurement biases might exist due to social desirability and ivory tower judgments

External measurement (external sources, such as customers, other organizations, donors)
+ Information of higher relevance might be obtained (real world)
– Comparability of samples across organizations should exist
– Differences across sources in each organization should be accounted for (e.g., control variables, weights), to avoid biases due to the (a) different background of raters or (b) different personal reference frameworks

5. Leading versus lagging measurements

Leading measurement (focused on actions and their direct outcomes)
+ More objectively observable
+ Generalizable across more heterogeneous samples of organizations
– Assumptions should be realistic regarding the effects associated with these measurements

Lagging measurement (focused on the effects of organizational actions and outcomes)
+ Focus on the elements that are of real interest (effects: higher practical relevance)
– More subjective and dependent on personal backgrounds and reference frameworks
– Less generalizable across organizations with various missions
– Less generalizable across stakeholder groups with various interests and needs

6. Distinct versus overlapping measurements

Distinct concepts measured (to investigate causes and effect of nonprofit effectiveness)
+ To investigate factors and conditions that determine differences in nonprofit effectiveness
+ To identify management actions that should be in place or improved to obtain high effectiveness
+ Enables strong theorization
– Extensive pretesting necessary to determine distinctness of concepts
– More complex and costly data collection processes

Overlapping concepts measured (to investigate mental models on nonprofit effectiveness)
+ To investigate how concepts are perceptually related, and/or how mental models are constituted among managers or stakeholders (e.g., research on managerial sensemaking)
– Risk of over-investigation and interpretation of nonexistent relations (Type I errors)

7. Additive versus multiplicative measurements

Additive measurement (various criteria aggregated in additive way, adding or averaging)
+ To reach a more reliable measurement (reflective criteria) or when combining different sources (e.g., inter-rater agreement)
+ To obtain evaluative measurements when criteria can compensate for one another
– Unique contributions of separate criteria to overall effectiveness measurement not clearly observable

Multiplicative measurement (various criteria aggregated in a multiplicative way)
+ To make conditionality of particular criteria more prominent
+ To identify criteria, dimensions, or stakeholder groups that require most urgent management actions
+ Useful for the selection of highly effective organizations (e.g., qualitative research analysis)
– Less consistent with contemporary uses of variables and quantification, so assumptions about distribution, variances, and range might be violated (i.e., not suitable for commonly used scientific research methods)

Uni- Versus Multidimensional Measurements
The first trade-off is between a unidimensional or a multidimensional measurement
approach. In their extensive chronological literature review, Jun and Shiau (2012)
distinguish between unidimensional and framework-based approaches. Unidimen-
sional approaches, typical of the first generation of nonprofit effectiveness studies,
focus on a single dimension of nonprofit performance and are commonly applied
according to a particular theory. Framework-based approaches, or the second
generation of studies, apply an additional classification between multidimensional
and multi-constituency approaches. Multidimensional approaches take more than
one dimension of effectiveness into account; for example, Sowa et al. (2004)
differentiate management effectiveness from program effectiveness. A multi-
constituency approach instead addresses the different interests and perspectives of
separate stakeholder groups.
We consider the advantages and disadvantages of unidimensional versus
multidimensional measurements in this section. The reasons listed in contemporary
literature for using a multidimensional effectiveness assessment reflect mainly
theoretical considerations (Lecy et al. 2012). Most nonprofit organizations (1) have
multiple goals, (2) share goals with other organizations, (3) pursue subjective
outcomes, and (4) involve various stakeholders with different interests, so a single
criterion likely could not quantify an organization’s effectiveness sufficiently
(Perrow 1961; DiMaggio 2001; Kaplan 2001; Herman and Renz 2008; Moxham
2009). Therefore, considering various criteria would help researchers gain a more
holistic view of effectiveness, such that they could make better inferences about the
complexity of the ‘‘real world’’ (Micheli and Kennerley 2005; Gordon et al. 2010).
As another important benefit, a multidimensional scale can investigate the
differential impacts of various factors on different dimensions (Jackson and
Holland 1998; Brown 2005; Dart 2010).
However, at least three considerations involving multidimensional measurements
highlight their potential disadvantages. The first relates to the generalizability of the
findings. Because of the many differences that exist across nonprofit organizations,
it is hard, or even impossible, to find a set of performance dimensions that would be
broadly applicable to many different types of organizations and their varied
stakeholders (DiMaggio 2001; Micheli and Kennerley 2005; Baruch and Ramalho
2006; Eckerd and Moulton 2011). To ensure that the criteria adopted in a
multidimensional approach are relevant for all organizations and stakeholders
addressed, the data collection would need to be constrained to a particular context of
similar organizations and/or stakeholders (i.e., homogeneous sample). When
researchers seek generalizable findings or if practitioners want to make an overall
assessment across stakeholder groups, they might prefer measurements with a single
dimension or criterion, relevant for all the different entities. Thus they could
investigate more heterogeneous samples, in which the organizations or stakeholders
differ widely. A useful suggestion in this context comes from Baruch and Ramalho
(2006), who propose including both context-specific and generalist measurements
(items and/or dimensions) of effectiveness in every study. Such an approach could
offer a detailed analysis of a particular research question (context-specific
measures), while also framing and comparing results within a broader investigation
of effectiveness (general measurements).
A second consideration is the challenges that multidimensional measures create
for data analysis. Despite the growing availability of various statistical applications
(Sowa et al. 2004), to date, few contributions apply them to the study of
multidimensional measurements of effectiveness (Lynn et al. 2000; Lecy et al.
2012). These methods require substantial sets of observations (Maas and Hox 2005;
Herman and Renz 2008), but in most cases, either the populations are naturally
constrained (i.e., only a limited number of contexts exist in which all measurement
dimensions make sense), or the groups of target respondents are too heterogeneous
(i.e., requiring many control variables, reducing the statistical power for testing the
real relationships of interest). The use of multidimensional measurements thus
seems more appropriate for testing the potentially complex relatedness of a few
variables in a controlled environment. In contrast, a unidimensional measure could
test a more generalizable relationship, across more heterogeneous contexts, based
on a general theory (Jun and Shiau 2012).
Finally, from a scientific perspective, some nonprofit researchers pursue high
measurement quality and internal consistency by using items and criteria that are
very similar in nature. Despite their high internal reliability (e.g., high Cronbach’s
alpha values) and good model fit, the second or third item in such multi-item
constructs often contributes very little beyond the information obtained from the
first item (Drolet and Morrison 2001). Nor do multiple-item measurements,
compared with single-item measures, necessarily have better predictive validity
(Bergkvist and Rossiter 2007). When several similar criteria are combined, efforts
and costs increase for data collection, yet less relevant information might emerge,
due to the measurement of redundant items and criteria. Therefore, nonprofit
effectiveness could be measured with a single-item indicator if (1) it is used for
measuring an overall personal perception, (2) a consistent methodological approach
is applied, and (3) the interpretation of results sufficiently takes individual
respondents’ characteristics into account (Pandey et al. 2007; Stazyk and Goerdely
2010). Furthermore, a perception-based reflective effectiveness item can be useful
for explaining individual behavior, because it relates to multiple, more objective
organizational effectiveness criteria, such as giving donations, being committed to
the organization, or volunteering for the organization (Forbes 1998; Padanyi and Gainer
2003; Yoo and Brooks 2005; Daellenbach et al. 2006; Sarstedt and Schloderer 2010;
Mews and Boenigk 2012). For example, such an item might ask, ‘‘On an overall
basis, rank the effectiveness of your agency in accomplishing its core mission
(0 = not effective at all; 10 = extremely effective)’’ (Pandey et al. 2007, p. 406).
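The redundancy argument above can be illustrated with a small simulation. The sketch below is not taken from the studies cited; it uses hypothetical data in which three items are the same underlying perception plus a little noise (the noise level, sample size, and function names are assumptions). Under these assumptions, Cronbach's alpha is high, yet the first item alone already correlates strongly with the full three-item scale.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (equal-length score lists)."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var_sum = sum(statistics.variance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / statistics.variance(totals))


def simulate_items(n_respondents=500, n_items=3, noise_sd=0.3, seed=1):
    """Hypothetical data: every item is the same perception plus small noise."""
    rng = random.Random(seed)
    perception = [rng.gauss(0, 1) for _ in range(n_respondents)]
    return [[p + rng.gauss(0, noise_sd) for p in perception]
            for _ in range(n_items)]


items = simulate_items()
alpha = cronbach_alpha(items)
# Mean score across the three items for each respondent.
scale = [statistics.mean(row) for row in zip(*items)]
# How much of the full scale is already captured by the first item alone?
r_single_vs_scale = pearson(items[0], scale)
print(f"alpha = {alpha:.2f}, r(item 1, scale) = {r_single_vs_scale:.2f}")
```

With items this similar, adding the second and third item raises data collection effort while changing the resulting score very little, which is the situation the single-item argument addresses.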
Considering these three key elements, and potentially in contrast with a dominant
rationale in previous literature, we propose that a unidimensional or even single-
criterion perspective on effectiveness deserves more attention, especially if the aim
is to (1) improve generalizability across contexts, (2) enhance analytical robustness,
(3) deal with naturally constrained data availability, and/or (4) avoid costs or
efforts that are not worth the minimal extra information provided by an
additional item, criterion, or dimension.
Formative Versus Reflective Measurements
Another trade-off pertains to whether to apply a formative or reflective approach.
Each approach inherently encompasses very different assumptions, and when
wrongly applied, it could have severe consequences in terms of Type I errors
(Diamantopoulos 1999). These errors occur when the results suggest that a
relationship exists between two concepts, whereas in reality no such relationship
exists, or, for example, when findings recommend that practitioners invest in
particular practices, even though these practices produce no returns in reality.
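To make the notion of a Type I error concrete, the following sketch simulates many hypothetical studies in which two variables are generated independently, so that no relationship exists in reality. It illustrates only the definition of a Type I error, not the effect of measurement misspecification itself. The sample size, number of simulated studies, and the use of a Fisher z-test at the two-sided 5 % level are all assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def false_positive_rate(n_obs=50, n_studies=2000, seed=7):
    """Share of simulated studies that 'find' a relationship that does not exist."""
    rng = random.Random(seed)
    critical_z = 1.96  # two-sided 5% level under the normal approximation
    hits = 0
    for _ in range(n_studies):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        y = [rng.gauss(0, 1) for _ in range(n_obs)]  # independent of x
        # Fisher z-test for a correlation of zero.
        z = math.atanh(pearson(x, y)) * math.sqrt(n_obs - 3)
        hits += abs(z) > critical_z
    return hits / n_studies


rate = false_positive_rate()
print(f"Share of studies reporting a nonexistent relationship: {rate:.3f}")
```

Even with a correctly specified test, a small share of studies reports a relationship by chance; a wrongly specified measurement model can push the share of such false findings above this nominal level.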
A reflective approach assumes that the latent variable (i.e., the concept of
interest, which would be nonprofit effectiveness in our context) causes the different
indicators being measured (DeVellis 2003; Sarstedt and Schloderer 2010; Boenigk
et al. 2011). For example, if a donor is satisfied with an organization (which would
mean that the latent variable is ‘‘donor satisfaction’’), the reflective approach
includes the assumption that the donor will evaluate the overall service delivered by
the organization positively and believe that his or her personal expectations have
been fulfilled. As a result, reflective indicators of donor satisfaction might be: ‘‘I am
happy with the service delivered by this organization’’ or ‘‘My expectations are met
by this organization.’’
A formative approach instead takes the opposite assumption: All the indicators
together define or conceptually cause the latent construct (Bollen and Lennox 1991).
An example in the nonprofit context comes from Willems et al. (2012a). For the
latent variable ‘‘nonprofit governance quality,’’ whether organizational governance
quality is considered high depends on whether several distinct criteria are high. For
example, to consider an organization well governed, stakeholders would need to be
sufficiently involved in the organization’s decision processes, its internal structures
and procedures must be well developed, recurrent evaluations should review prior
achievements and outputs, etc. (together, these are all formative criteria for
‘‘governance quality’’; Willems et al. 2012a).
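The opposite causal directions of the two approaches can be sketched with simulated data. The example below is a minimal illustration under stated assumptions, not the cited authors' procedure: in the reflective case, a latent ‘‘donor satisfaction’’ score generates two indicators; in the formative case, three criteria are generated independently and averaged with equal weights into a ‘‘governance quality’’ index. All variable names, the noise level, the independence of the criteria, and the equal weighting are hypothetical choices.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)
n = 1000

# Reflective: one latent variable (donor satisfaction) drives both indicators.
satisfaction = [rng.gauss(0, 1) for _ in range(n)]
happy_with_service = [s + rng.gauss(0, 0.5) for s in satisfaction]
expectations_met = [s + rng.gauss(0, 0.5) for s in satisfaction]

# Formative: distinct criteria jointly define the construct (governance quality).
involvement = [rng.gauss(0, 1) for _ in range(n)]
structures = [rng.gauss(0, 1) for _ in range(n)]
evaluation = [rng.gauss(0, 1) for _ in range(n)]
governance_quality = [(a + b + c) / 3  # equal weights are an assumption
                      for a, b, c in zip(involvement, structures, evaluation)]

r_reflective = pearson(happy_with_service, expectations_met)
r_formative = pearson(involvement, structures)
r_criterion_index = pearson(involvement, governance_quality)

print(f"reflective indicators correlate: {r_reflective:.2f}")
print(f"formative criteria correlate:    {r_formative:.2f}")
print(f"criterion with its index:        {r_criterion_index:.2f}")
```

Reflective indicators correlate because they share a common cause, whereas formative criteria need not correlate at all and still each contribute to the index they define.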
For most constructs, the choice between a formative and a reflective approach
should be obvious (Petter et al. 2007; Sarstedt and Schloderer 2010). Yet the
different perspectives applied in the literature to nonprofit effectiveness make the
choice less obvious. Errors might result when methodological specifications are
inconsistent with theoretical claims and interpretations, so this second trade-off
deserves substantially more attention in the nonprofit literature. We address two
perspectives that appear in contemporary nonprofit literature and that each support
either a formative or a reflective approach. Depending on the context of their
particular study, researchers and practitioners can choose a formative, reflective,
or combined approach. These perspectives are the social construction and the
normative nature of nonprofit effectiveness.
If effectiveness is seen as a social construction for a particular group of people—
that is, they hold a shared, common perception of the concept of nonprofit
effectiveness (Herman and Renz 1997; Forbes 1998; Liao-Troth and Dunn
1999)—a reflective approach is appropriate. If people consider an organization
‘‘effective,’’ they might designate it as a good example for other organizations or
cite the organization as a best practice example. As a result, the reflective items in a
scale measuring overall effectiveness might include, ‘‘This organization is a good
example for other organizations’’ or ‘‘I would talk about this organization to
illustrate good practices.’’ From a reliability perspective, several such items,
preferably with high common variance, could be combined, such that the items
together measure a single latent variable (DeVellis 2003). Therefore, a reflective
approach is appropriate when the measurements focus on perceptions of effective-
ness or on reputational effectiveness (Smith and Shen 1996; Forbes 1998; Liao et al.
2001).
In contrast, from a normative perspective, nonprofit performance or effectiveness
requires a formative measurement approach (Petter et al. 2007; Willems et al.
2012a). This means that the theoretical framework requires a combined consider-
ation of multiple items or dimensions to provide a comprehensive assessment of the
organization’s effectiveness (e.g., Kaplan 2001). The assessment of the various
criteria thus leads to the conclusion about whether the nonprofit organization is
effective (i.e., causal relationship from items to latent variable). In this case, the
common variance of the items is less important, while their unique variances
become critical. A formative construct is useful if little common variance exists
among its indicators, as each indicator then has a relevant contribution to the overall
concept (which is usually assessed by the level of multicollinearity; Diamantopoulos
and Siguaw 2006). Specialized literature offers a more extensive overview of
the steps necessary to develop and test the appropriateness of formative constructs
(Jöreskog and Goldberger 1975; Bollen and Lennox 1991; Diamantopoulos and
Winklhofer 2001; Jarvis et al. 2003; Petter et al. 2007; Sarstedt and Schloderer
2010).
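As a hedged illustration of the multicollinearity check mentioned above, the following Python sketch computes variance inflation factors (VIFs) for a set of formative indicators as the diagonal of the inverted correlation matrix. The indicator names and scores are hypothetical, the helper functions are our own, and any cutoff for "too high" a VIF is a judgment that this sketch deliberately leaves to the analyst.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("indicators are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(indicators):
    """VIF per indicator: diagonal of the inverse correlation matrix."""
    corr = [[pearson(a, b) for b in indicators] for a in indicators]
    inverse = invert(corr)
    return [inverse[i][i] for i in range(len(indicators))]


# Hypothetical scores of four organizations on two formative criteria.
governance = [1, 2, 3, 4]
finances = [1, 2, 4, 3]
vif_values = variance_inflation_factors([governance, finances])
print([round(v, 2) for v in vif_values])  # r = 0.8, so VIF = 1 / (1 - 0.64)
```

A VIF close to 1 means an indicator carries information that the other indicators do not, which is the property a formative construct relies on.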
Arguments in existing literature that suggest the need to measure nonprofit
effectiveness with multiple dimensions also tend to take a normative perspective
(Herman and Renz 1999, 2008; Rojas 2000; DiMaggio 2001; Sowa et al. 2004;
Shilbury and Moore 2006). This rationale states that researchers and practitioners
should investigate multiple, distinct dimensions, because a single dimension cannot
sufficiently represent what effectiveness is. This argument inherently endorses the
idea that each dimension contains a piece of relevant information that cannot be
captured by other dimensions. Thus, a formative measurement model approach is
appropriate for a normative perspective. Unfortunately, several (frequently cited)
examples in nonprofit literature assert the need for multiple dimensions and propose
multiple criteria or dimensions, yet the results they report suggest that the
dimensions proposed are not multiple or distinct effectiveness criteria. For example,
principal component analysis (PCA), validation reporting (e.g., Cronbach’s alpha),
and interpretations all build on high common variances among items or dimensions
(e.g., Gill et al. 2005; Sowa et al. 2004; Shilbury and Moore 2006). Because such
approaches thus contradict the theoretical claims, they create the potential for
Type I errors (Diamantopoulos and Siguaw 2006). Therefore, we suggest that the
items proposed in such seemingly multidimensional scales could be used in further
research to quantify a single, unidimensional, “social constructionist” latent
variable that quantifies overall effectiveness (rather than measuring distinct
dimensions), or that new criteria and items are proposed that truly measure a
unique element of overall nonprofit effectiveness. We also suggest avoiding
substantive explanations of the superficial relatedness of arbitrarily chosen
dimensions (with one another or other concepts). Such efforts grant distinct
interpretations to relations between concepts that are not inherently different, which
again creates Type I errors (Diamantopoulos and Siguaw 2006). Furthermore, we
suggest re-validations, using formative approaches, of existing scales that initially
were developed to measure distinct dimensions but that led to high internal
correlations. These new evaluations also could clarify the extent to which the
proposed dimensions actually measure distinct aspects. If no such proof is
forthcoming, new scale development should be undertaken, to uncover multiple,
unrelated dimensions that accord with the issued theoretical claims.
Individual Versus Group Measurements
Although virtually no empirical tests address the extent to which we can rely on a
single rater’s opinion to assess an organization’s effectiveness (Lynn et al. 2000),
opinions among individuals can differ substantially (Green and Griesinger 1996;
Herman and Renz 2008; Willems et al. 2012b). First, raters engage in various types
of relationships with the organization, so they judge the usefulness of the
organizational outcomes differently depending on their personal needs (Krashinsky
1997; Balser and McClusky 2005; Mistry 2007; Babiak 2009). Second, a
comparative view is inherent to assessments of concepts that are difficult to
quantify objectively (Herman and Renz 1999, 2008; DiMaggio 2001), so the
personal reference frameworks of raters play important roles. That is, people rely on
their personal experiences with other organizations to judge the effectiveness of an
organization (Herman and Renz 1999, 2008). Third, the social desirability of the
reported output might result in (unsystematically) biased assessments that differ
across stakeholder types (Green and Griesinger 1996). For example, a fundraiser
might present organizational effectiveness higher than it actually is, because she or
he hopes to create a positive image of the organization and thereby obtain more
funding.
Considering these three sources of potential measurement biases in individual
assessments of organizational characteristics, it might be useful to distinguish
between the (1) unit of interest, (2) unit of data collection, and (3) unit of analysis
(though all three often are referred to with the common term “unit of analysis”).
The unit of interest pertains to the level on which the theoretical considerations are
focused. In nonprofit effectiveness literature, the unit of interest is often the
organization (e.g., “Do particular practices in an organization result in better effects
for its stakeholders?”). The unit of data collection is the level at which data are
collected. For example, raters might be asked to judge the effectiveness of an
organization, or researchers might review various annual reports by the organiza-
tion. In these cases, the respective units of data collection would be persons in the
organizations and years in which the organization was active. These data can be
analyzed in their original form, aggregated, or related across levels on the basis of
existing assumptions. These choices in turn determine the unit of analysis. If the unit
of interest (e.g., organization) differs from the unit of data collection or unit of
analysis (e.g., individual opinions; change in reported donors), important method-
ological considerations are necessary with respect to robustness, reliability, the
research design, and the contribution.
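The distinction between the three units can be made concrete with a small Python sketch. Here the unit of data collection is the individual rater, and aggregating the ratings moves the unit of analysis to the organization, which is also the unit of interest. The records, the field layout, and the function name are hypothetical and serve only as an illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey records; unit of data collection = individual rater.
# Each record is (organization, rater role, effectiveness rating on 1-7).
responses = [
    ("Org A", "board member", 6),
    ("Org A", "volunteer", 4),
    ("Org A", "staff", 5),
    ("Org B", "board member", 3),
    ("Org B", "staff", 5),
]


def aggregate_by_organization(records):
    """Move from rater-level records to one row per organization."""
    ratings_by_org = defaultdict(list)
    for organization, _role, rating in records:
        ratings_by_org[organization].append(rating)
    # Keep the number of raters: it differs across organizations and
    # affects how much weight each organizational mean deserves.
    return {
        organization: {"mean": mean(ratings), "n_raters": len(ratings)}
        for organization, ratings in ratings_by_org.items()
    }


org_level = aggregate_by_organization(responses)
print(org_level)
```

Reporting the number of raters next to each mean keeps visible that the organizational scores rest on unequal amounts of individual-level data.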
From a reliability perspective, different opinions or measurements can be
gathered and aggregated for a single organization, which requires the use of
inter-rater reliability measures (Boyer and Verma 2000). However, this method
requires an extensive data collection, which might result in high costs and reduce
the number of organizations in the sample, leading to less statistical power for
analyzing the unit of interest (Groves and Heeringa 2006; Herman and Renz 2008).
Thoughtful decisions also are necessary with respect to how to aggregate the data
(especially when the numbers of respondents differ across organizations). It should
be possible to find comparable groups of respondents across organizations (e.g.,
board members are likely more comparable across organizations than the
beneficiaries of different types of organizations). In contrast, researchers could
use the opinions of single respondents with comparable positions across organi-
zations to gather comparable answers from a relatively large sample of organiza-
tions, such as the chair of the board or the general manager (Brown 2005; Gill et al.
2005; LeRoux and Wright 2010). In this way, larger samples at the level of the unit
of interest can be composed, but biases due to differing backgrounds, social
desirability, and subjective measurements can induce misinterpretations of the
results, particularly for studies that rely on individual perceptions (see the
“Leading versus lagging measurements” and “Distinct versus overlapping
effectiveness measurements” sections). Therefore, as long as the analysis also includes
individual control variables or provides a more atomistic (i.e., individual-level)
interpretation of the findings, a single person’s assessment can be a valid substitute
to approximate an organization’s characteristics.
Rather than controlling for potential biases, it is also possible to consider more
complex research questions, in which both individuals and organizations are units of
interest. We know little about the factors and effects of the unique perceptions of
individual raters versus the shared perceptions among raters in the same
organization; substantial improvement of our understanding of nonprofit effective-
ness might result from a closer examination of such multilevel data structures (Lynn
et al. 2000; Yoo and Brooks 2005; Hitt et al. 2007). Thus, the challenges regarding
existing biases become research questions, with both high academic and practical
relevance (Lynn et al. 2000; Sowa et al. 2004; Willems et al. 2012a).
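One common way to quantify shared versus unique perceptions in such multilevel data is the intraclass correlation ICC(1) from a one-way analysis of variance. The Python sketch below is a minimal illustration under that assumption; the ratings are invented, the function name `icc1` is our own, and the cited studies may rely on other inter-rater measures.

```python
def icc1(groups):
    """ICC(1) from a one-way ANOVA.

    `groups` maps each organization to the list of ratings given by its
    raters. Group sizes may differ; an adjusted average group size is used.
    """
    sizes = [len(ratings) for ratings in groups.values()]
    n_groups, n_total = len(groups), sum(sizes)
    grand_mean = sum(sum(r) for r in groups.values()) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for ratings in groups.values():
        group_mean = sum(ratings) / len(ratings)
        ss_between += len(ratings) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in ratings)

    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_total - n_groups)
    # Adjusted average group size (equals the group size if balanced).
    k0 = (n_total - sum(s * s for s in sizes) / n_total) / (n_groups - 1)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)


# Hypothetical effectiveness ratings by three raters in each organization.
ratings_by_org = {
    "Org A": [5, 6, 7],
    "Org B": [2, 3, 4],
    "Org C": [4, 5, 6],
}
shared_share = icc1(ratings_by_org)
print(round(shared_share, 3))
```

In this invented example, about two thirds of the rating variance lies between organizations (shared perception) and the remainder between raters within the same organization (unique perception).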
Internal Versus External Measurements
Closely related to the previous trade-off, some additional advantages and
disadvantages refer to internal assessments, being the opinions of people inside
the organization (e.g., CEO, board members, staff, volunteers) versus external
assessments by people outside the organization (e.g., customers, funders, beneficiaries)
(Van Puyvelde et al. 2012). As Green and Griesinger (1996) note, relying too
much on inside judgments can result in biased assessments due to social desirability
or ivory tower judgments, which arise when people in leadership positions lack
insight into actual operational performance. Yet the availability of information to
insiders means that substantially more and detailed information can be gathered
from these internal respondents (Brown 2005).
Using assessments by donors, beneficiaries, other organizations in the field, or
other external stakeholders offers the advantage of greater relevance, in terms of
real effects (Smith and Shen 1996; Forbes 1998; Liao et al. 2001). The perceptions
held by these external stakeholders determine (implying a reflective operationalization;
see “Formative versus reflective measurements” section), for example,
donation levels, subsidies, volunteering intentions, and whether their interests are
sufficiently met by the organizational outcomes (Forbes 1998; Padanyi and Gainer
2003; Yoo and Brooks 2005; Daellenbach et al. 2006; Sarstedt and Schloderer 2010;
Mews and Boenigk 2012). Yet their perceptions also might reflect biases, including
those mentioned in the previous trade-off (i.e., different stakeholder needs or
different personal reference frameworks). The decisions about whom to rely on
(“Internal versus external measurements” section) and how many sources to include
(“Individual versus group measurements” section) thus might be combined, to deal
with the overall biases that can result from differences between the unit of interest
and the unit of data collection and/or analysis.
Leading Versus Lagging Measurements
The distinction between leading and lagging measurements, as noted in practice-
oriented business literature (Brewer and Speh 2000; Epstein and Wisner 2001;
Bremser and Barsky 2004), refers to the causal sequence of organizational actions
and direct outputs on the one hand, and the effects for various stakeholders on the other.
From a practical perspective, lagging indicators focus on the overall, mission-
related effects of an organization for a certain period of time, following from the
investments made and actions performed by an organization (e.g., number of
healthy children after a vaccination program, change in literacy rate due to a
deployed reading program). Leading indicators instead forecast and quantify the
organization’s actions or direct outcomes (e.g., number of vaccinations given,
number of people educated in the reading programs).
Leading indicators tend to be more objectively observable. For example, whether
particular board practices are in place is easier to observe than the extent to which
public opinion has changed after a campaign. A focus on leading indicators might
help avoid biases due to social desirability, personal background, or strong
individual reference frameworks. Furthermore, leading indicators often can be
generalized across heterogeneous samples. For example, regardless of the mission
or strategy of an organization, specific board or management practices likely need
to be in place at various types of organizations, and can therefore more easily be
compared across organizations. Assessments of their achieved mission or strategy
instead are much more context dependent (Sawhill and Williamson 2001; Moxham
2009).
In contrast, lagging indicators have greater relevance, in the sense that they
reflect changes in society, achievement of mission, or impact on stakeholders. Yet
they also tend to be more subjective and dependent on personal reference
frameworks. To deal with these issues, we refer to prior suggestions regarding the
need to consider multiple external assessments, and control and test for contextual
and individual characteristics.
In this context, we note an interesting contribution by Packard (2010), who
describes a complex, conceptual sequence of relevant indicators: input indicators,
throughput indicators, management capacity, program capacity, and outputs. These
combined dimensions reveal the organization’s overall performance as an “overriding
concept, including throughputs in terms of program and management
operations; and results including outputs, quality, efficiency, and effectiveness”
(Packard 2010, p. 976). With a survey of 52 staff members in 14 nonprofit
programs, Packard finds that the respondents consider goal accomplishment of great
importance. These lagging indicators also are of stronger interest from a theoretical
perspective and they invoke generally high ratings when people are asked to judge
their own organization. Perhaps social desirability influences their answers,
especially those for indicators that are less objectively measurable. However,
respondents seem to rate leading indicators, such as customer and employee
satisfaction, as better indicators of nonprofit effectiveness, because such indicators
can more actively be managed, as well as compared across organizations.
Distinct Versus Overlapping Effectiveness Measurements
While the previous trade-off deals with the causal sequence of distinct concepts like
organizational actions, outcomes, and effects, here we address potentially overlap-
ping effectiveness-related concepts. The literature shows different logical sequences
regarding overall effectiveness, in which for example (1) particular board practices
lead to board performance (or governance quality), (2) board performance leads to
organizational performance, (3) organizational performance leads to reputational
effectiveness, and (4) reputational effectiveness leads to higher donations (stake-
holder performance) (Bradshaw et al. 1992; Green and Griesinger 1996; Kushner
and Poole 1996; Siciliano 1997; Smith and Shen 1996; Herman et al. 1997; Herman
and Renz 2000; Forbes 1998; Padanyi and Gainer 2003; Radbourne 2003; Brown
2005; Gill et al. 2005; Yoo and Brooks 2005; Daellenbach et al. 2006; Carman and
Fredericks 2010; Sarstedt and Schloderer 2010; Mews and Boenigk 2012; Helmig
et al. 2013).
When several of these effectiveness-related concepts appear in a single study,
regardless of whether they are formative or reflective, or leading or lagging, they
might be completely distinct, partially overlapping, or even completely overlapping
(Tacq 1984). Consider for example the unique impact of multiple management
practices and their quality on organizational effectiveness, tested in a regression
analysis. An inherent assumption is that they are distinct concepts in a causal
sequence. However, when the assessment of all these concepts (management
practices and effectiveness) relies on perceptions, the indicators could be related or
partially overlapping, as they contain very similar elements. The overlap even could
be so extensive that the assumed cause equals the hypothesized effect (i.e., two
conceptual names are used, but both refer to a single and general concept, or causa
aequat effectum, Tacq 1984, p. 146). If the researcher posits that the concepts are
distinct when they are not (or only partially), substantive causal interpretations
could emerge for non-existing relations (i.e., Type I error). In trade-off 2, we
described similar errors, which resulted from assigning different interpretations to
seemingly distinct dimensions that were not different in reality. Here, the errors
result from assuming causal relations among concepts that are, at least from a
measurement perspective, not different.
Thus, when multiple effectiveness-related concepts combine in a single study,
with causal relations hypothesized and tested across these concepts, we suggest an
initial, thorough analysis of whether the measured concepts actually may be
considered distinct. The discriminant and convergent validity of the concepts could
be tested (Fornell and Larcker 1981; Cohen 1988; Wilson et al. 2007; Farrell and
Rudd 2009; Farrell 2010; Shiu et al. 2011). For example, Boenigk and Helmig
(2013) assess whether donor–nonprofit identification and donor identity salience are
distinct constructs, with separate and direct effects on donation behavior. This
assessment revealed that the two identification constructs were distinct, even though
the indicators used for the two measurements seemed similar. To avoid Type I
errors, more careful interpretations are needed regarding whether different concepts
truly are being measured and if they causally relate, especially if single informants
provide the judgments for several effectiveness-related concepts, such as measures
of professionalism, aspects of leadership, or governance effectiveness (e.g., 260
CEOs judging their own organizations, LeRoux and Wright 2010), as well as
perceptions of chairs (e.g., Harrison et al. 2012 surveyed different people close to
board chairs). When extremely high standardized coefficients or correlations (up to
0.86) are reported as causal relationships between perception-based concepts, we
might suspect that the measurements combine leading and lagging indicators,
quantifying a single overall latent construct that the respondents perceive similarly,
rather than distinct concepts that are causally related.
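A minimal Python sketch of one such discriminant validity check, the Fornell and Larcker (1981) criterion, is given below. It compares each construct's average variance extracted (AVE) with the squared correlation between the two constructs. The standardized loadings are hypothetical, and the function names are our own.

```python
def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings of a construct."""
    return sum(loading ** 2 for loading in loadings) / len(loadings)


def fornell_larcker_distinct(loadings_a, loadings_b, correlation):
    """True if both AVEs exceed the variance the two constructs share."""
    shared_variance = correlation ** 2
    return (average_variance_extracted(loadings_a) > shared_variance
            and average_variance_extracted(loadings_b) > shared_variance)


# Hypothetical standardized loadings for two perception-based constructs,
# e.g., perceived board performance and perceived organizational performance.
board_loadings = [0.8, 0.8, 0.7]         # AVE = 0.59
organization_loadings = [0.9, 0.8, 0.7]  # AVE of about 0.65

# A correlation of 0.86 implies about 74% shared variance, which exceeds
# both AVEs, so the constructs fail the criterion.
print(fornell_larcker_distinct(board_loadings, organization_loadings, 0.86))
# A correlation of 0.50 implies 25% shared variance, so they pass.
print(fornell_larcker_distinct(board_loadings, organization_loadings, 0.50))
```

Under these assumed loadings, a correlation as high as 0.86 would suggest that the two measures capture largely the same perception rather than a cause and its effect.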
In contrast, studying overlapping concepts can help us better understand how
(groups of) practitioners shape their mental models. Insight into how
individuals separately, or groups through mutual interaction, construct a mental
framework of how their actions and decisions potentially result in outcomes can
improve our understanding of individual and collective sensemaking processes. As
these processes are at the basis of actual managerial and governance decisions, they
can clarify what information is mainly taken into account and for what reasons
(Mitchell 2013).
Additive Versus Multiplicative Measurements
Researchers and practitioners might contemplate assumptions they make with
respect to aggregating data, which can reflect different criteria, dimensions, and
items (see “Uni- versus multidimensional measurements” and “Formative versus
reflective measurements” sections), or different sources (see “Individual versus
group measurements” section). In most cases, if multiple items refer to a single
dimension, an average score is used. In combination with the high common variance
of reflective constructs, such a quantification should give a reliable indication of an
existing latent concept. The combination of multiple sources, together with high
inter-rater reliability (Boyer and Verma 2000), also offers a reliable assessment. For
formative items, weights (comparable to regression coefficients) might be used to
extract the unique effects of the indicators. Similar considerations apply to
combinations of dimensions or of assessments by different stakeholder groups. In
these cases, the assumption is that the measurement model is additive, such that the
(weighted) sum of multiple assessments constitutes the score for the overall
concept. From a practitioner perspective, a multiplicative approach may be more
useful for assessing and benchmarking organizations. This approach stresses the
conditionality of each dimension or data source. For example, if an observer reviews
several dimensions to make a holistic assessment of an organization’s effectiveness,
separate scores are available for each dimension. An organization might score high
on all except one dimension and extremely low on this single dimension. An
additive approach would produce a fairly positive outcome, because the negative
outcome for the one dimension would be compensated for by the other dimensions.
A multiplicative approach instead would lead to a lower value for the overall
concept, because in this view, an organization is effective only if it performs well on
every dimension. The latter approach thus could be valuable for selecting highly
effective organizations for further qualitative exploration (Balser and McClusky
2005; Herman and Renz 2000). Finally, a multiplicative approach can reveal the
causes and effects of a particular and obvious shortcoming, rather than of averaged
achievements across multiple dimensions or sources (Willems et al. 2012a).
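The difference between the two aggregation rules can be shown with a short Python sketch. It assumes dimension scores rescaled to the interval from 0 to 1 and uses a plain product as the multiplicative rule. Both choices are assumptions made for illustration, as are the scores themselves and the function names.

```python
from math import prod


def additive_score(scores, weights=None):
    """Weighted average of dimension scores (equal weights by default)."""
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def multiplicative_score(scores):
    """Product of dimension scores: one weak dimension pulls down the whole."""
    return prod(scores)


# Hypothetical organization: strong on three dimensions, very weak on one.
dimension_scores = [0.9, 0.9, 0.9, 0.1]

additive = additive_score(dimension_scores)
multiplicative = multiplicative_score(dimension_scores)
print(round(additive, 4), round(multiplicative, 4))
```

In the additive result the three strong dimensions compensate for the weak one, whereas the product stays low as long as any single dimension is low, which mirrors the conditionality described above.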
Conclusions and Avenues for Further Research
Rather than striving for a “best” way that is applicable to any kind of nonprofit
organization, we have sought to present both the advantages and the disadvantages
of a variety of choices that can be made, taking into account the contextual elements
of an effectiveness measurement project. As our first contribution, we offer an
overview of these advantages and disadvantages, as summarized in Table 1. This
overview offers decision guidelines for researchers and practitioners who engage in
nonprofit effectiveness measurement projects.
As a second main contribution, we identify from the seven trade-offs five
research avenues that could extend and enhance the ongoing nonprofit effectiveness
discussion.
First, our approach started with existing theoretical insights, examples, and
literature overviews from the broad nonprofit research domain. Starting from the
seven trade-offs identified, a next step for research could be to pursue a more
standardized, abstracted approach for identifying general decision rules for
nonprofit effectiveness measurement projects. Our literature review enabled us to
identify the separate trade-offs, but it was outside the scope of this project to
investigate actual, concrete decisions that have been considered (or ignored) by the
authors of the studies that we cite. This limitation actually is inherent to the
academic publication process, which focuses on reporting detailed research findings
rather than on the trade-offs that the researchers or practitioners considered in reaching
these findings. It would be even harder to gain insights into all the contextual factors
that determined their choices. Therefore, we invite researchers and practitioners to
discuss and evaluate their own project-related decisions, using the trade-offs that we
identified (e.g., in the methodology section of their articles). In addition to clarifying
the purpose and decision processes underlying research projects, this could also
result in a more critical evaluation and elaboration of the trade-offs postulated
herein. Ultimately, this could also provide more standardized insights into how
contextual settings can determine optimal decisions for nonprofit effectiveness
measures.
Second, considering the misinterpretations that can result from misspecifications
or over-investigation of seemingly distinct but overlapping concepts (see “Formative
versus reflective measurements” and “Distinct versus overlapping effectiveness
measurements” sections), we hope further research digs into the question of where
to draw boundaries between dimensions and/or related effectiveness concepts. In
contrast with the divergent evolution in the previous literature, typified by the
introduction of new classifications, dimensions, conditions, factors, and effects, we
consider a converging phase in nonprofit effectiveness research more relevant.
When new concepts are introduced, properly clarifying and testing how they
are similar to or different from those that already exist is important for true
theory advancement. In particular, when concepts are quantified for further analysis,
we strongly suggest reusing existing measurements. As Baruch and Ramalho (2006)
recommend, new studies on nonprofit effectiveness might reuse basic, generalizable
concepts across contexts and for various research questions. Depending on the
particular context, additional measurement criteria might be introduced. Such
(partial) reuse of dimensions, criteria, or scales offers two major advantages for an
overall understanding of nonprofit effectiveness. It provides a continuous verifica-
tion of existing concepts, dimensions, and measurements, both theoretically and
methodologically, and it enables a more critical evaluation of the generalizability of
prior research findings across various contexts.
Third, we used the distinction between units of interest, data collection, and
analysis mainly to discuss potential measurement biases. However, the differences
among these units also imply strong opportunities for developing new theoretical
insights into nonprofit effectiveness. Several contributions already assert that
nonprofit effectiveness is a social construction (Herman and Renz 1997; Forbes
1998; Liao-Troth and Dunn 1999), focusing mainly on the fact that people
individually develop an understanding of what nonprofit effectiveness means to
them. However, through their social interactions they develop a certain degree of
common understanding. The social constructionist perspective thus far has served
mainly to argue how people differ in their assessments of organizations. An opposite
perspective, regarding the extent to which they agree, might deserve more attention.
That is, we know little about how and why groups of people come to agree about the
meaning of nonprofit effectiveness. Yet the perceived effectiveness of an organization
has significant impacts on the behavior of individuals and stakeholder groups
toward the organization (Yoo and Brooks 2005; Daellenbach et al. 2006;
Sarstedt and Schloderer 2010; Mews and Boenigk 2012), so research should address
in depth the social interaction factors and processes that shape shared versus unique
perceptions of an organization’s effectiveness. This analysis might provide
important new insights into how social processes among individuals can be
managed to steer collective behavior toward an organization.
Fourth, for conciseness, we focused only on nonprofit-related examples and
literature reviews. However, the challenges of measuring effectiveness-related concepts
go far beyond nonprofit realms. In addition to the supplementary suggestions that
we have made already, we recommend that continued contributions contemplate
and investigate similarities and dissimilarities with other contexts, such as public
and profit sectors (Rojas 2000; Micheli and Kennerley 2005). Such a comparison
could enhance our understanding of the extent to which these seven trade-offs are
unique to the nonprofit sector or generalizable to other areas, or whether these trade-
offs are dealt with differently, or result in other decisions for other research
domains.
Fifth, we note the challenges that remain for bridging the gap between theoretical
insights and practitioner recommendations for measuring nonprofit effectiveness.
We have postulated seven trade-offs, largely based on academic literature. Yet the
basic decisions in each trade-off are highly relevant for practitioners too. For
example, practitioners seek high-quality, reliable measures of their actions, outputs,
and effects, and they hope to obtain these measures at reasonable costs.
Accordingly, they face very similar decisions in terms of which criteria to measure,
what sources to use, how many sources to gather, how to aggregate data, how to
quantify various measurement types, and so on. Thus, research projects should focus
not only on the type of measurements used by nonprofit practitioners or their
usefulness but also on why they have been chosen and in which contexts. Such
studies could provide some verification for whether the seven trade-offs we propose
are truly relevant beyond the academic context. More important, they could reveal
new insights into the practical usefulness of various types of effectiveness
measurements, depending on the real-world context.
As nonprofit organizations are increasingly confronted with competition to attract
resources and with continuous internal and external accountability obligations
(Ashley and Faulk 2010; Faulk et al. 2012), the trade-offs described in this article
could help as guidelines in professionalizing their measurement-based management.
As pointed out by Mitchell (2013), nonprofit practitioners often rely on personal and
intuitive effectiveness assessments to make managerial decisions. In addition, the
content of these assessments unfortunately seems not to be fully captured by the
effectiveness focus taken by academic scholars. Therefore, nonprofit practitioners
and consultants could move forward by setting up their own measurement systems that
support their particular needs and that provide sufficiently useful information
at a reasonable cost. In this context, organizations could continuously experiment
with different types of measurements and effectiveness reporting to find
out how sufficient managerial benefits can be obtained [e.g., more efficient practices,
more (social) return on investment, more impact on stakeholders]. When
selecting these performance indicators, special attention could be devoted to finding an
optimal balance among the many advantages and disadvantages discussed in the
trade-offs. For example, practitioners could define for their own organization a set
of broad and general performance indicators that allow comparison with other
organizations. In doing so, organizations can benchmark themselves and learn from
each other when aiming to improve their practices and performance indicators. In
addition, specific indicators for the particular context of an organization and its
stakeholders can complement the core set of general indicators, aiming at better
understanding how particular actions cause improvements in serving stakeholder
interests and in achieving the organizational mission. In doing so, they can verify or
adjust their own mental models on how their actions potentially result in
performance, and they can incrementally improve their understanding of the true
impact of their decisions and actions.
References
Ashley, S., & Faulk, L. (2010). Nonprofit competition in the grants marketplace: Exploring the
relationship between nonprofit financial ratios and grant amount. Nonprofit Management &
Leadership, 21(1), 43–57.
Babiak, K. M. (2009). Criteria of effectiveness in multiple cross-sectoral interorganizational relation-
ships. Evaluation & Program Planning, 32(1), 1–12.
Balser, D., & McClusky, J. (2005). Managing stakeholder relationships and nonprofit organization
effectiveness. Nonprofit Management & Leadership, 15(3), 295–315.
Baruch, Y., & Ramalho, N. (2006). Communalities and distinctions in the measurement of organizational
performance and effectiveness across for-profit and nonprofit sectors. Nonprofit and Voluntary
Sector Quarterly, 35(1), 39–65.
Bergkvist, L., & Rossiter, J. R. (2007). The predictive validity of multiple-item versus single-item
measures of the same constructs. Journal of Marketing Research, 44(2), 175–184.
Boenigk, S., & Helmig, B. (2013). Why do donors donate? Examining the effects of organizational
identification and identity salience on the relationships among satisfaction, loyalty, and donation
behavior. Journal of Service Research, 16, 533–548.
Boenigk, S., Leipnitz, S., & Scherhag, C. (2011). Altruistic values, satisfaction and loyalty among
first-time blood donors. International Journal of Nonprofit and Voluntary Sector Marketing, 16(4), 356–370.
Bollen, K., & Lennox, R. (1991). Conventional wisdom on measurement: A structural equation
perspective. Psychological Bulletin, 110(2), 305–314.
Boyer, K. K., & Verma, R. (2000). Multiple raters in survey-based operations management research:
A review and tutorial. Production and Operations Management, 9(2), 128–140.
Bradshaw, P., Murray, V., & Wolpin, J. (1992). Do nonprofit boards make a difference? An exploration of
the relationships among board structure, process and effectiveness. Nonprofit and Voluntary Sector
Quarterly, 21(3), 227–249.
Bremser, W. G., & Barsky, N. (2004). Utilizing the balanced scorecard for R&D performance
measurement. R&D Management, 34(3), 229–238.
Brewer, P. C., & Speh, T. W. (2000). Using the balanced scorecard to measure supply chain performance.
Journal of Business Logistics, 21(1), 75–93.
Brown, W. A. (2005). Exploring the association between board and organizational performance in
nonprofit organizations. Nonprofit Management & Leadership, 15(3), 317–339.
Carman, J. G., & Fredericks, K. A. (2010). Evaluation capacity and nonprofit organizations: Is the glass
half-empty or half-full? American Journal of Evaluation, 31(1), 84–104.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence
Erlbaum.
Daellenbach, K., Davies, J., & Ashill, N. J. (2006). Understanding sponsorship and sponsorship
relationships—Multiple frames and multiple perspectives. International Journal of Nonprofit and
Voluntary Sector Marketing, 11(1), 73–87.
Dart, R. (2010). A grounded qualitative study of the meanings of effectiveness in Canadian ‘results-
focused’ environmental organizations. VOLUNTAS: International Journal of Voluntary and
Nonprofit Organizations, 21, 202–219.
DeVellis, R. F. (2003). Scale development: Theory and applications (2nd ed., Applied Social
Research Methods Series, Vol. 26). London: Sage Publications.
Diamantopoulos, A. (1999). Viewpoint–export performance measurement: Reflective versus formative
indicators. International Marketing Review, 16(6), 444–457.
Diamantopoulos, A., & Siguaw, J. A. (2006). Formative versus reflective indicators in organizational
measure development: A comparison and empirical illustration. British Journal of Management,
17(4), 263–282.
Diamantopoulos, A., & Winklhofer, H. M. (2001). Index construction with formative indicators:
An alternative to scale development. Journal of Marketing Research, 38(2), 269–277.
DiMaggio, P. (2001). Measuring the impact of the nonprofit sector on society is probably impossible but
possibly useful: A sociological perspective. In P. Flynn & V. Hodgkinson (Eds.), Measuring the
impact of the nonprofit sector (pp. 249–272). New York: Kluwer Academic/Plenum Publishers.
Drolet, A. L., & Morrison, D. G. (2001). Do we really need multiple-item measures in service research?
Journal of Service Research, 3(3), 196–204.
Eckerd, A., & Moulton, S. (2011). Heterogeneous roles and heterogeneous practices: Understanding the
adoption and uses of nonprofit performance evaluations. American Journal of Evaluation, 32(1),
98–117.
Epstein, M. J., & Wisner, P. S. (2001). Using a balanced scorecard to implement sustainability.
Environmental Quality Management, 11(2), 1–10.
Farrell, A. M. (2010). Insufficient discriminant validity: A comment on Bove, Pervan, Beatty and Shiu
(2009). Journal of Business Research, 63(5), 324–327.
Farrell, A. M., & Rudd, J. M. (2009). Factor analysis and discriminant validity: A brief review of some
practical issues. Australia-New Zealand Marketing Academy Conference (ANZMAC), December,
Melbourne, Australia.
Faulk, L., Lecy, J. D., & McGinnis, J. (2012). Nonprofit competitive advantage in grant markets:
Implications of network embeddedness. Andrew Young School of Policy Studies Research Paper
Series No. 13-07.
Forbes, D. P. (1998). Measuring the unmeasurable: Empirical studies of nonprofit organization
effectiveness from 1977 to 1997. Nonprofit and Voluntary Sector Quarterly, 27(2), 183–202.
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with unobservable variables
and measurement error. Journal of Marketing Research, 18(1), 39–50.
Frumkin, P. (2002). On being nonprofit: A conceptual policy primer. Cambridge, MA: Harvard
University Press.
Gill, M., Flynn, R. J., & Reissing, E. (2005). The governance self-assessment checklist: An instrument for
assessing board effectiveness. Nonprofit Management & Leadership, 15(3), 271–294.
Gordon, T. P., Khumawala, S. B., Kraut, M., & Neely, D. G. (2010). Five dimensions of effectiveness for
nonprofit annual reports. Nonprofit Management & Leadership, 21(2), 209–228.
Green, J. C., & Griesinger, D. W. (1996). Board performance and organizational effectiveness in
nonprofit social services organizations. Nonprofit Management & Leadership, 6(4), 381–402.
Groves, R. M., & Heeringa, S. G. (2006). Responsive design for household surveys: Tools for actively
controlling survey errors and costs. Journal of the Royal Statistical Society Series A, 169(3),
439–457.
Hansmann, H. (1987). Economic theories of nonprofit organizations. In W. W. Powell (Ed.), The
nonprofit sector: A research handbook (pp. 27–42). New Haven, CT: Yale University Press.
Harrison, Y., Murray, V., & Cornforth, C. (2013). Perceptions of board chair leadership effectiveness in
nonprofit and voluntary sector organizations. VOLUNTAS: International Journal of Voluntary and
Nonprofit Organizations, 24(3), 688–712. doi:10.1007/s11266-012-9274-0.
Helmig, B., Ingerfurth, S., & Pinz, A. (2013). Success and failure of nonprofit organizations: Theoretical
foundations, empirical evidence, and future research. VOLUNTAS: International Journal of Voluntary
and Nonprofit Organizations. doi:10.1007/s11266-013-9402-5.
Herman, R. D., & Renz, D. O. (1997). Multiple constituencies and the social construction of nonprofit
organization effectiveness. Nonprofit and Voluntary Sector Quarterly, 26(2), 185–206.
Herman, R. D., & Renz, D. O. (1999). Theses on nonprofit organizational effectiveness. Nonprofit and
Voluntary Sector Quarterly, 28(2), 107–126.
Herman, R. D., & Renz, D. O. (2000). Board practices of especially effective and less effective local
nonprofit organizations. American Review of Public Administration, 30(2), 146–160.
Herman, R. D., & Renz, D. O. (2008). Advancing nonprofit organizational effectiveness research and
theory: Nine theses. Nonprofit Management & Leadership, 18(4), 399–415.
Herman, R. D., Renz, D. O., & Heimovics, R. D. (1997). Board practices and board effectiveness in local
nonprofit organizations. Nonprofit Management & Leadership, 7(4), 373–385.
Hitt, M. A., Beamish, P. W., Jackson, S. E., & Mathieu, J. E. (2007). Building theoretical and empirical
bridges across levels: Multilevel research in management. Academy of Management Journal, 50(6),
1385–1399.
Jackson, D. K., & Holland, T. P. (1998). Measuring effectiveness of nonprofit boards. Nonprofit and
Voluntary Sector Quarterly, 27(2), 159–182.
Jarvis, C. B., MacKenzie, S. B., & Podsakoff, P. M. (2003). A critical review of construct indicators and
measurement model misspecification in marketing and consumer research. Journal of Consumer
Research, 30(2), 199–218.
Jöreskog, K. G., & Goldberger, A. S. (1975). Estimation of a model with multiple indicators and multiple
causes of a single latent variable. Journal of the American Statistical Association, 70(351), 631–639.
Jun, K.-N., & Shiau, E. (2012). How are we doing? A multiple constituency approach to civic association
effectiveness. Nonprofit and Voluntary Sector Quarterly, 41(4), 632–655.
Kaplan, R. S. (2001). Strategic performance measurement and management in nonprofit organizations.
Nonprofit Management & Leadership, 11(3), 353–370.
Krashinsky, M. (1997). Stakeholder theories of the non-profit sector: One cut at the economic literature.
VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations, 8(2), 149–161.
Kushner, R. J., & Poole, P. P. (1996). Exploring structure–effectiveness relationships in nonprofit arts
organizations. Nonprofit Management & Leadership, 7(2), 119–136.
Lecy, J. D., Schmitz, H. P., & Swedlund, H. (2012). Non-governmental and not-for-profit organizational
effectiveness: A modern synthesis. VOLUNTAS: International Journal of Voluntary and Nonprofit
Organizations, 23(2), 434–457.
LeRoux, K., & Wright, N. S. (2010). Does performance measurement improve strategic decision making?
Findings from a national survey of nonprofit social service agencies. Nonprofit and Voluntary Sector
Quarterly, 39(4), 571–587.
Liao, M., Foreman, S., & Sargeant, A. (2001). Market versus societal orientation in the nonprofit context.
International Journal of Nonprofit and Voluntary Sector Marketing, 6(3), 254–268.
Liao-Troth, M., & Dunn, C. P. (1999). Social constructs and human service: Managerial sensemaking of
volunteer motivation. VOLUNTAS: International Journal of Voluntary and Nonprofit Organizations,
10(4), 345–361.
Lynn, L. E., Jr., Heinrich, C. J., & Hill, C. J. (2000). Studying governance and public management:
Challenges and prospects. Journal of Public Administration Research and Theory, 10(2), 233–261.
Maas, C. J. M., & Hox, J. (2005). Sufficient sample sizes for multilevel modeling. Methodology, 1(3),
86–92.
Mews, M., & Boenigk, S. (2012). Does organizational reputation influence the willingness to donate
blood? International Review on Public and Nonprofit Marketing. doi:10.1007/s12208-012-0090-4.
Micheli, P., & Kennerley, M. (2005). Performance measurement frameworks in public and non-profit
sectors. Production Planning & Control, 16(2), 125–134.
Mistry, S. (2007). How does one voluntary organization engage with multiple stakeholder views of
effectiveness? Voluntary Sector Working Paper: Number 7, London School of Economics and
Political Science.
Mitchell, G. E. (2013). The construct of organizational effectiveness: Perspectives from leaders of
international nonprofits in the United States. Nonprofit and Voluntary Sector Quarterly, 42(2),
324–345.
Moxham, C. (2009). Performance measurement: Examining the applicability of the existing body of
knowledge to nonprofit organisations. International Journal of Operations & Production
Management, 29(7), 740–763.
Osborne, S. P., & Tricker, M. (1995). Researching non-profit organisational effectiveness: A comment on
Herman and Heimovics. VOLUNTAS: International Journal of Voluntary and Nonprofit
Organizations, 6(1), 85–92.
Packard, T. (2010). Staff perceptions of variables affecting performance in human service organizations.
Nonprofit and Voluntary Sector Quarterly, 39(6), 971–990.
Padanyi, P., & Gainer, B. (2003). Peer reputation in the nonprofit sector: Its role in nonprofit sector
management. Corporate Reputation Review, 6(3), 252–265.
Pandey, S. K., Coursey, D. H., & Moynihan, D. P. (2007). Organizational effectiveness and bureaucratic
red tape: A multimethod study. Public Performance & Management Review, 30(3), 398–425.
Perrow, C. (1961). The analysis of goals in complex organizations. American Sociological Review, 26(6),
854–866.
Petter, S., Straub, D., & Rai, A. (2007). Specifying formative constructs in information systems research.
Management Information Systems Quarterly, 31(4), 623–656.
Plantz, M. C., Greenway, M. T., & Hendricks, M. (1997). Outcome measurement: Showing results in the
nonprofit sector. New Directions for Evaluation, 75, 15–30.
Radbourne, J. (2003). Performing on board: The link between governance and corporate reputation in
nonprofit arts boards. Corporate Reputation Review, 6(3), 212–222.
Richard, P. J., Devinney, T. M., Yip, G. S., & Johnson, G. (2009). Measuring organizational performance:
Towards methodological best practice. Journal of Management, 35(3), 718–804.
Rojas, R. R. (2000). A review of models for measuring organizational effectiveness among for-profit and
nonprofit organizations. Nonprofit Management & Leadership, 11(1), 97–104.
Sarstedt, M., & Schloderer, M. P. (2010). Developing a measurement approach for reputation of non-
profit organizations. International Journal of Nonprofit and Voluntary Sector Marketing, 15(3),
276–299.
Sawhill, J. C., & Williamson, D. (2001). Mission impossible? Measuring success in nonprofit
organizations. Nonprofit Management & Leadership, 11(3), 371–386.
Shilbury, D., & Moore, K. A. (2006). A study of organizational effectiveness for national Olympic
sporting organizations. Nonprofit and Voluntary Sector Quarterly, 35(1), 5–38.
Shiu, E., Pervan, S. J., Bove, L. L., & Beatty, S. E. (2011). Reflections on discriminant validity:
Reexamining the Bove et al. (2009) findings. Journal of Business Research, 64(3), 497–500.
Siciliano, J. I. (1997). The relationship between formal planning and performance in nonprofit
organizations. Nonprofit Management & Leadership, 7(4), 387–403.
Smith, D. H., & Shen, C. (1996). Factors characterizing the most effective nonprofits managed by
volunteers. Nonprofit Management & Leadership, 6(3), 271–289.
Sowa, J. E., Selden, S. C., & Sandfort, J. R. (2004). No longer unmeasurable? A multidimensional
integrated model of nonprofit organizational effectiveness. Nonprofit and Voluntary Sector
Quarterly, 33(4), 711–728.
Stazyk, E. C., & Goerdel, H. T. (2010). The benefits of bureaucracy: Public managers’ perceptions of
political support, goal ambiguity, and organizational effectiveness. Journal of Public Administration
Research and Theory, 21(4), 645–672.
Tacq, J. J. A. (1984). Causaliteit in sociologisch onderzoek. Een beoordeling van causale
analysetechnieken in het licht van wijsgerige opvattingen over causaliteit. Deventer: Van Loghum Slaterus.
Van Puyvelde, S., Caers, R., Du Bois, C., & Jegers, M. (2012). The governance of nonprofit
organizations: Integrating agency theory with stakeholder and stewardship theories. Nonprofit and
Voluntary Sector Quarterly, 41(3), 431–451.
Willems, J., Huybrechts, G., Jegers, M., Weijters, B., Vantilborgh, T., Bidee, J., et al. (2012a). Nonprofit
governance quality: Concept and measurement. Journal of Social Service Research, 38(4), 561–578.
Willems, J., Van den Bergh, J., & Deschoolmeester, D. (2012b). Analyzing employee agreement on
maturity assessment tools for organizations. Knowledge and Process Management, 19(3), 142–147.
Wilson, B., Callaghan, W., Ringle, C. M., & Henseler, J. (2007). Exploring causal path directionality for a
marketing model using Cohen’s path method. In H. Martens, T. Naes, & M. Martens (Eds.),
Causalities Explored by Indirect Observation: Proceedings of the 5th International Symposium on
PLS and Related Methods (PLS’07) MATFORSK (pp. 57–61).
Yoo, J., & Brooks, D. (2005). The role of organizational variables in predicting service effectiveness:
An analysis of a multilevel model. Research on Social Work Practice, 15(4), 267–277.