Sterman, John D., "Appropriate Summary Statistics for Evaluating the Historical Fit of System Dynamics Models", 1983

Online content

Fullscreen
DH3993

APPROPRIATE SUMMARY STATISTICS FOR
EVALUATING THE HISTORICAL FIT
oF

SYSTEM DYNAMICS MODELS

John D. Sterman
Assistant Professor
Sloan School of Management
Massachusetts Institute of Technology
Cambridge, Massachusetts 02139

February 1983

203

D-3393

ABSTRACT

System Dynamics modelers are often faulted for
their reluctance to employ formal measures of
goodness~of-fit when assessing the historical
behavior of models. As a result, the validity of
system dynamics models is often questioned even
when the model's correspondence to historical
behavior is quite good. This paper argues that
the failure to present formal analysis of
historical behavior creates an impression of
sloppiness and unprofessionalism. After reviewing
the concept of validity in simulation modeling,
the paper proposes a simple set of summary
statistics appropriate for system dynamics models
(the root-mean-square error and Theil inequality
statistics). The statistics allow the error due
to individual behavior modes to be analyzed, do
not require the use of formal parameter estimation
procedures, and can be conveniently computed. A
large model of the U.S. economy is used to
illustrate the use of the statistics.

204
D-3393

Introduction

System Dynamics modelers are often faulted for their
reluctance to employ formal measures of goodness-of-fit when
assessing the historical behavior of models. As a result, the
validity of system dynamics models is often questioned even when
their correspondence to historical behavior is quite good. This
paper argues that the failure to present formal analysis of
historical behavior creates an impression of sloppiness and
unprofessionalism, After reviewing the theory of validity in
system dynamics, the paper proposes a simple set of summary
statistics appropriate for system dynamics models. The
statistics allow the error due to individual behavior modes to be
analyzed, do not require the use of formal parameter estimation

procedures, and can be conveniently computed.

The “Validation” of System Dynamics Models

Debate over the concept of "validity" in system dynamics has
as long a history as the field itself.” Discussions of
validation in system dynamics have stressed three basic points:

1, There can be no absolute test of validity,
2. There can be no objective tests of validity,
3. There can be no single test of validity.

From the first, system dynamicists have rejected the notion
that the validity of models can be established absolutely.

Rather, as Forrester emphatically states,

205

D-3393

The validity (or significance) of a model

should be judged by its suitability for a

particular’ purpose. A model is sound and

defendable if it accomplishes what is expected

of it. ... validity, as an abstract concept

divorced from purpose, has no useful meaning

(3).
A model intended for short-term prediction must be evaluated by
different procedures than models designed for long-term policy
analysis, exploration of possible future behavior modes, or
theory testing, a view widely shared by other modelers and social

seientists.4

Rejected also is the notion that, even given a clear
purpose, there can be objective criteria for validity. Forrester
correctly states that

Any "objective" model-validation procedure

rests eventually at some lower level on a

judgment or faith that either the procedure or

its goals are acceptable without objective

Proof (5).
For example, one of the most common tests of the significance of
parameter estimates in regressions is the t-statistic.® The
t-statistic is used to test, within some level of significance
(typically 5%) the hypothesis that the estimated parameter is
equal to zero (the analyst usually hopes to be able to reject the
hypothesis, establishing a significant nonzero value for the
estimated parameter). Econometrics texts teach that “one may

interpret a significant t-statistic ... as evidence tending to

206
D-3393

validate the model ... {and} an insignificant statistic would
lead toward invalidation of the mode1."7 However, the t-test in
the context of the standard single-equation least-squares
repression model rests on several assumptions ("maintained
hypotheses") that are not verifiable or go unquestioned in
practice, including

+s. perfect specification of the model being

estimated (including zero-mean, normally-

distributed noise inputs in each equation) and

perfect measurement of all variables (8).
Because the maintained hypotheses are not verified, the t-test is
not a test of validity: an insignificant value may indicate one
of the maintained hypotheses is violated rather than an insignif—
icant relationship, a result that has been demonstrated through
synthetic data experiments.? To treat the test as an indicator
of validity, then, is necessarily to make a subjective judgment
that the maintained hypotheses are in fact true. It may be
objected that the t-test and standard linear model are simplistic
and unrepresentative of actual practice. Econometricians have
developed many powerful procedures that allow maintained hypo-
theses to be tested, including tests of model specification.!°
However, useful though they may be, such tests necessarily invoke
other maintained hypotheses, shifting the locus of the inevitable

@ priori but never eliminating it.

207

D-3393

Validation is an inherently social process. It depends on
the cultural context and background of the model builders and
model users. It depends on whether one is an "observer" (e.g.4
an academic researcher) or an “operator," (e.g., a decisionmaker
who must act without waiting for more data or further analy-
sis). churchman goes so far as to argue that the process is
entirely social:

.++ a point of view, or a model, is realistic

to the extent that it can be adequately

interpreted, understood, ‘and accepted by other

points of view. (12)
Recognizing the ultimately subjective nature of all “objective”
tests means one can never validate a model in the sense of
establishing its truth. Rather, the notion of objective validity
has been replaced by the confidence the model builders and users
place in the model and its conclusions:

No model has ever been or ever will be

thoroughly validated. ... "Useful,"

"illuminating," “convincing,” or "inspiring

confidence" are more apt descriptors applying

to models than "valid" (13).

Emphasizing the process of building confidence in a model
means there can be no single test or measure of validity. No
responsible model builder or user would ever be satisfied with a
single test. Confidence must be developed through a process of
testing and evaluation along many dimensions, a point emphasized

by many.+4

208
D-3393

A wide variety of tests have been developed to aid the
diagnosis of errors and to assist the confidence-building process
in system dynamics, The tests, summarized in Table 1, include
tests of the structure, parameters, behavior, and policy

recommendations of the model.

The Role of torical Data in the Confidence-building

Process

A corollary of the three principles outlined above is that
the single most common measure of validity in the social
sciences, the historical fit of a model, is a weak test that
contributes little if anything to confidence. Analysis of the
historical fit of a model is a part of the Behavior Reproduction
test (Table 1). But the Behavior Reproduction test is more than
comparing the correspondence of simulated and actual data on a
point-by-point basis. The test usually focuses on the character
of the simulated data: does it exhibit the same modes, phase
relationships, relative amplitudes, and variability as the real
datazl>
Point-by~point or event-oriented comparisons to historical
data have been minimized in system dynamics for several reasons.
The behavior of any real system is the result of both the
systematic forces relevant to a particular model and purpose and
the peculiarities of historical circumstance: the randomness or

noise, that is, the aspects of behavior that are not relevant for

209

D-3393

the purpose of the study. The historical behavior of a social
system, then, can be viewed as analagous to a particular
simulation of a model with stochastic elements. The randomness
represents those aspects of decisionmaking that are weakly
coupled to the system of interest and have not been modeled.
Forrester has shown that point-prediction of social systems
beyond at most one-quarter of their natural period is impossible
in principle even when one has a perfectly specified and
estimated model, knows the nature of the noise or error terms,
and lacks only the precise values of the noise, assumptions which
can never be met in real life and only poorly approximated.® at
the same time, one can always fit any set of data to any degree
of precision required. Phelps Brown puts it even more strongly:

The case for validating assumptions by testing

their implications really rests on the possi-

bility of controlled experiment, but that

possibility is generally denied the economist

++..Where, as so often, the fluctuations of

different time series respond in common to the

pulse of the economy, it is fatally easy to

get a good fit, and get it for quite a number

of different equations....running regressions

between time series is only likely to deceive.
(17)

Because historical fit is a weak test, system dynamicists
have tended to ignore or minimize the comparison of the behavior
of their models to historical data, preferring to focus the
confidence-building process on the stronger tests outlined in
Table 1. When historical fit is considered, it is usually

presented in a highly informal manner. ‘Typically, the modeler

210
D-3393

presents a graph of the historical behavior alongside the
simulated version of the same data and asks the reader to judge
whether the degree of fit is "close enough" (the so-called

"Mistaken Identity test") .+

The failure of system dynamicists to treat historical fit
more rigorously is unfortunate. Although reproducing historical
behavior is only one of a large number of tests and activities
required to build confidence in a model, it is nonetheless an
extremely important one. Failure to satisfy a client or reviewer
that a model's historical fit is satisfactory is often sufficient
grounds to dismiss the model and its conclusions. Passing the
historical behavior test, while far from sufficient, is a
necessary step in the confidence-building process. Arnold
Zellner's response to Forrester's description of the use of
information in system dynamics modeling is perhaps typical of the
attitude of econometricians and other quantitative social
scientists:

one difference between Forrester's approach
and those of others, however, is that
Forrester apparently does not make explicit
use of formal statistical inference tech-
niques....I do believe that it would be
worthwhile for Forrester to consider incor-

porating...appropriate and relevant statis-
tical techniques in his approach (19).

More often than not, the historical fit of a system dynamics

model is sufficient for its purpose. The problem arises from the

ant

D-3393

informal way in which goodness-of-fit is demonstrated. The
Mistaken Identity test is considered naive and unprofessional,
and visual comparison alone is viewed as "sloppy" by economists
and other social scientists reared in more quantitative and
statistical methods. Like it or not, system dynamics models are
reviewed and evaluated by persons who expect a formal measure of
goodness-of-fit, and who are reluctant to place confidence in a
model unless its historical performance is appraised with some
summary statistics with which they are familiar. system
dynamicists, who emphasize the social nature of the confidence-
building process, should be the first to employ formal measures
of goodness-of-fit when the purpose of their models is to
communicate results to social scientists with quantitative

biases.

However, the use and interpretation of formal measures of
goodness-of-fit must remain true to the purpose of system
dynamics models and the confidence-building process. Historical
fit is a necessary but far from sufficient test. Matching
historical data must never become an end in itself, nor can the
availability of numerical data be allowed to dictate the
structure of a model. A good system dynamics model is expected
to generate the historical behavior of the system endogenously,
and without the extensive use of exogenous or dummy variables.

Historical data should not be used to estimate the parameters of

212
D-3393

a model directly; rather, parameters should be estimated from
data "below the level of aggregation" of the model--that is, from
interviews, engineering data, surveys, or.other disaggregate
studies that draw on descriptive knowledge of the system's

structure rather than its aggregate behavior.7?

Further, system dynamics models do not usually employ formal
estimation procedures that guarantee a minimum sum-of-squared-
errors over the range of available data, as in a regression.“

As a result, the error between simulated and actual data may be
larger than typically found in regression models. ‘There may also
be systematic bias between simulated and actual data, Yet
precisely because exogenous and dummy variables are not used and
the historical data are not used to derive the parameters that
minimize some measure of error, larger errors than are typical in
regression models do not necessarily compromise the validity of
system dynamics models or imply lack of confidence in their
results. In addition, system dynamics models are designed for a
specific purpose and may deliberately exclude some of the modes
of behavior present in the historical data, For example, a model
of long-term economic growth may exclude the business cycle. The
simulated GNP in such a model may not match the historical GNP,
which fluctuates with the business cycle, on a point-by-point
basis. The total error may be large even if the model matches

the relevant growth mode extremely well.

213

D-3393

For these reasons, the summary statistic most commonly used
to evaluate goodness-of-fit in regression models, the coefficient
of determination or R® (which measures the fraction of the total
variation explained by the model), is inappropriate for system

dynamics models.

Appropriate Summary Statistics for System Dynamics

To develop appropriate summary statistics to evaluate the
historical fit of system dynamics models, it is useful to review
the role of historical data in regression models such as
econometric models based on time-series data. Often only the
first portion of the available data is used to estimate the
parameters of a model. Within the period of fit, the R*,
t-statistics, and other usual measures of goodness-of-fit and
significance are applicable. The model is then simulated beyond
the period of fit, to generate an ex post forecast. Simulating

the model beyond the available data produces an ex ante
2

forecast.”

The purpose of an ex post forecast is precisely the same as
the purpose of analyzing the historical behavior of system
dynamics models: to build confidence in the model. An ex post
forecast provides a test of the model's ability to replicate the
behavior of the real system that is independent of the process by

which the structure and parameters of the model were chosen.

214
D-3393

(Using the entire set of available data to estimate a model is a
much weaker test, if passed, even if the resulting behavior is a
closer match because the data in that case are directly used to

find the structure and parameters that best match the data.)

Because system dynamics models typically do not employ the
aggregate historical data in developing the structure or
estimating the parameters, the behavior of the model over the
entire range of available data may be analyzed as an ex post
forecast, and summary statistics designed to measure forecast

error are thus the appropriate measures of £it.2?

The-measurement and interpretation of forecast error has
been studied extensively by statisticians and econometricians.
One of the most common measures of forecast error is the

Mean-square-error (MSE), defined as

a

z 2

2£ is,-a,)
tal

where
n = Number of observations (t = 1, ... n)
= Simulated value at time t

= Actual value at time t.

215

D-3393

The MSE error has the advantage that large errors are weighted
more heavily than small ones, and that errors of opposite sign do
not cancel each other out. Often the square root of the
mean-square error is taken, yielding the root-mean-square (RMS)
error. The RMS error provides a measure of error with the same

units as the variable under consideration.

It is often more convenient to compute a normalized measure
of error. A common and easily interpreted dimensionless measure

is the root-mean-square percent error (RMSPE),

Other normalizations are possible; the choice of an appropriate
measure depends on the purpose of the error analysis and the

nature of the data.74

Error Decomposition
In addition to the size of the total error, it is important

to know the sources of error. Failure to fit the data may be
caused by a poor model or by a large degree of randomness in the
historical data. The total error may be large if a mode of be-
havior in the real system is deliberately excluded as irrelevant
to the purpose of the model. While there is ultimately no sub-

stitute for plotting the simulated and actual data side-by-side,

216
D-3393
several statistical methods are available to decompose the total

error into systematic and random portions.

One elegant decomposition of the mean-square-error is
provided by the Theil inequality statistics, The Theil
statistics are derived from the following decomposition of the

mse:25
n
2 Dis,-ayy? = GH? + (sys)? + 20-4) 8,8,

where 5 and A are the means of S and A
LY, ane AE
Sg and s, equal the standard deviations of S and A

fl B2 fi
Vide, 5)? and vi

and finally r equals the correlation coefficient between

» respectively;

Le, x)?, respectively;

simulated and actual data

13% S) (A,-K,

The term (S-A)? measures the bias between simulated and actual

series. The term (sg75,)” is the component of the MSE due to a
difference in the variances of the simulated and actual series,
and measures the degree of unequal variation between the two

series. Finally, the term 2(1-r)s,s, is the component of the

217

D-3393 =

error due to incomplete covariation between the two series, and
measures the degree to which the changes in the simulated series
fail to match the changes in the actual series on a point-by-

point basis.

By dividing each of the components of the error by the total

mean-square-error, the “inequality proportions" are derived:

uM (5-5)?
FL Ay)”

vs =

uo =

of course, u" + uS + uc = 1, so uM, us, and ue reflect the
fraction of the mean-square-error due to bias, unequal variance,

and unequal covariance, respectively.

Interpretation of the Inequality Statistics

To see how the inequality statistics apply, consider each
term in turn. Bias, indicated by a large UY and small uS and u°,
can be thought of as a translation of one series by a constant

amount at a1] points in time (Figure 1a). A large bias (indi-~

218
D-3393

cated by both a large MSE and a large U“) reveals a systematic
difference between the model and reality. Errors due to bias are
potentially serious, possibly indicating specification or para~
meter errors, Alternatively, bias may be due to acceptable

simplifying assumptions which do not compromise the model.

Error due to unequal variance may also be systematic.
Consider two cases: suppose unequal variation (US) dominates the
error, with u" and u° small. ‘Then the two series match on
average and are highly correlated, but the magnitude of the
variation in the two around their common mean differs. One
variable is a “stretched out" version of the other. In Figure
1b, US is large because the magnitude of the trend in the two
variables is different. Such a case reveals a systematic
difference between simulated and actual series and directs
attention to the assumptions of the model, much as bias does.
systematic error is also the verdict in Figure lc, in which the
magnitude of a cyclical mode in one variable is underestimated by
the other, though the phasing is correct. Such a case would
direct attention to the factors controlling the amplitude and

damping of the cyclical mode. 7°

Alternatively, if US is large, but both series have the same

mean (U"=0) and if at.least one variable is nearly constant, ue

will be small because the standard deviation s, or s, will be

219

D-3393 +

small. In such a case (Figure 1d) the error would reflect random
noise or a cyclic mode in one of the series not present in the
other, The interpretation of such a situation depends on the
purpose of the model. If the model is designed to investigate
the cyclic mode, the failure of the model to generate the cycle
would clearly be a systematic error. But if the purpose of the
model is analysis of long-run behavior that abstracts from the
short-term cycle, failure to represent the cycle is unimportant.
The cycle becomes unsystematic noise relative to the model

purpose.

If the majority of the error is concentrated in unequal
covariation Uc, wnile u" and uS are small, it indicates that the
point-by-point values of the simulated and actual series do not
match even though the model captures the average value and
dominant trends in the actual data well. Such a case might
indicate a fairly constant phase shift or translation in time of
a cyclical mode otherwise reproduced well (Figure le). More
likely, a large U° indicates one of the variables has a large
random component or contains cyclical modes not present in the
other series. in particular, a large U° may be due to noise or
cyclical modes in the historical data not captured by the model.
A large Uc indicates the majority of the error is unsystematic
with respect to the purpose of the model, and the model should
not be faulted for failing to match the random component of the

data.?7

220
D-3393

Unsystematic error may also show up in US. suppose the
actual series has a trend as well as cyclic modes or noise
(Pigure 1£). I€ there is no bias, "so, The MSE will be divided

between U® and uc

: by virtue of the cycles or noise, the two
series will have slightly different variances and will be
imperfectly correlated even if the model matches the trend in the
data. The distribution of the MSE between u® and u° will depend
on the magnitude of the noise relative to that of the trend.

Even though U°>0 here, the error is unsystematic and does not

compromise the model.

In terms of building confidence in the ability of a model to
endogenously generate the behavior of the system, the error
should be small and unsystematic, that is, concentrated in u° or
u®, Large total errors need not compromise the model's utility
if they are due to excluded modes or noise in the historical
data. Conversely, large biases or unequal trend errors should
lead to questions about the assumptions of the model. As in all
Statistical tests, the choice of significance or tolerance levels
depends on the purpose of the model and the characteristics of

the data.

An Illustration

The mean-square-error and inequality statistics have been

used to evaluate the historical fit of a large system dynamics

221

D-3393

model of energy-economy interactions.?® he purpose of the model
is to investigate the effects of resource depletion and rising
energy prices on economic growth over the long term (the simu-
lations run from 1950 to 2050). The model focuses on long-run
growth and explicitly excludes the business cycle. The model is
a dynamic general disequilibrium representation of the U.S.
economy and energy sector, including OPEC. Table 2 summarizes
the major endogenous and exogenous variables. ‘Typical of system
dynamics models, the model boundary is quite wide. All the major
economic and energy aggregates are generated endogenously. In
contrast, there are but three exogenous variables. O£ these,
population and the index of technological progress are specified
at ten-year intervais, and linear interpolation is used in inter-
vening years. Only the historical OPEC price is represented
annually (and it is generated endogenously after 1982). There-
fore the behavior of the model and its ability to replicate
historical data, to capture trends and turning points, is predon-
inantly the result of the interaction of the endogenous
variables. Starting the model in 1950 provides roughly thirty
years of simulated data to compare to the actual behavior of the

economy.
Table 3 summarizes the error analysis for eleven variables.

The RMS percent error provides a normalized measure of the

magnitude of the error. The MSE error and inequality statistics

222
D-3393

provide a measure of the total error and how it breaks down into

bias, unequal variation, and unequal covariation components.7?

The RMS percent errors are below ten percent with the
exception of real private investment, the fraction of energy
imported, and real energy prices. Five variables including real
GNP, consumption, consumption as a fraction of GNP, and total

energy consumption have RMS errors under 5 percent

While the small total errors in most variables show the
model adequately tracks the major variables, the several large
errors might raise questions about the internal consistency of

the model or the structure controlling those variables.

The error decomposition helps resolve such doubts. Consider
real private investment. The RMS percent error is 11.7, But
only 2% of the mean-square-error is due to bias, and unequal var-
ation accounts for only 10% of the total. The vast majority of
the error (nearly 90%) is due to unequal covariation, indicating
that simulated investment tracks the underlying trend in actual
investment almost perfectly, but diverges point-by-point. Plot-
sting the two series and their residuals (Figure 2) shows that
actual investment is the culprit as it fluctuates with the
business cycle around. the simulated values. Since the business

cycle is explicitly excluded from the purpose of the model, the

223

D-3393

large RMS percent error is of little concern and does not

compromise the conclusions of the study.

The energy import fraction reveals the same pattern. Only
12% of the MSE is due to bias, and virtually none to unequal
variation. Imports, as the most costly source of energy, are the
most volatile component of energy consumption, Like investment,
the actual import fraction fluctuates with the business cycle
around the simulated value, again causing nearly 90% of the error
to show up as unequal covariation. As shown in Figure 3, the
model captures the rapid rise in imports that began around 1970

quite well, even though the point-by-point match is poor.

The largest RMS percent error, 14%, shows up in the real
energy price. Error decomposition shows the majority of the MSE
to be due to bias (58%) with the rest due to unequal covariation.
Plotting the two series (Figure 4) reveals the cause: The
average real energy price fell over 30% between 1950 and 1970
before rising 130% in the next seven years. The model does not
capture the full decline in real price, only some of which can be
explained by the pressure of inexpensive imports before 1973.
Several theories have been offered to explain the drop in real
energy prices up to 1970: economies of scale associated with
ever larger electric generation plants, higher than average

technical progress in the energy sector, and discovery of less

224
D-3393

costly resources. Economies of scale and technical progress
could be represented by assuming technology in the energy sector
improved faster than the average, but is excluded for simplicity
and since such an assumption would be exogenous. Similarly,
depletion is assumed to be strictly monotonic: in keeping with
traditional resource theory, it is assumed the least expensive
resources are exploited first. It would be an easy matter to
"tune" the model to reproduce the decline in real energy price.
If the purpose of the model were point~prediction or short-term
forecasting, such tuning would be appropriate and would help
build confidence in the utility of the model. But since the
purpose is assessment of long-term trends and policy analysis,
such tuning, relying as it would on exogenous variables and
ad-hoc adjustments to parameters, would contribute nothing to the
confidence-building process and might actually decrease
confidence by obscuring the model's ability to endogenously

capture the behavior of interest.

As a final illustration of error decomposition, consider the
9.7 RMS percent error in net energy consumption. Only 16% of the
MSE is due to bias, but nearly two-thirds is due to unequal
variance. Again, a simplifying assumption is responsible. Net
energy consumption (gross primary consumption less conversion and
distribution losses) is underestimated by the model during the

1950s and overestimated during the 1970s. Actual efficiency

225

D-3393

dropped from 88% in 1950 to 70% in 1977, primarily due to the
large conversion losses associated with electricity generation,
For simplicity, the model assumes a constant average efficiency
of 80%. Thus, as shown in Figure 5, simulated net consumption
grows more rapidly than the actual value, resulting in the large
error in variance. Since the model is not intended for fore-
casting but rather for policy analysis, the error in net energy
consumption is of little concern as it will not affect the

relative efficacy of policies.

Reviewing the other variables with RMS percent errors
greater than 5% shows they all have small bias and unequal
variance components. The bias fraction in the real wage
(RMSPE=5.4) is just .10, while the bias in primary energy
production (RMSPE=7.6) is just .14; the unequal variation terms

are .23 and .26, respectively.

The error analysis shows the model reproduces historical
behavior well. ‘The small number of large errors are readily
explained, with the help of error decomposition, as the result of
modes of behavior outside the purpose of the model, noise in the
historical data, or simplifying assumptions, In interpreting the
statistical results, it must be stressed that the historical data
were not used by a formal estimation procedure that guarantees

the minimum sum-of-squared errors. Parameters were chosen on the

226
D-3393

basis of disaggregate data, econometric estimation reported by
others, and other managerial and engineering data. More
important, no dummy variables and only three exogenous variables
are used. The ability of the model to capture the trends and
turning points in the historical data, over a three-decade span,
is due to the interaction of the endogenous variables. To the
author's knowledge, no other energy-economy model can make that

claim.

Conclusion: Rigor Means Never Having to Say You're Sorry

Though historical £it is but one of many tests required to
build confidence in a system dynamics model, and a weak one at
that, it-is nevertheless a necessary one. The process of
building confidence in system dynamics models has been hampered
by the reluctance of model builders to employ formal measures of
goodness-of-fit, even though the models often fit the historical
data quite well, generate the behavior endogenously, and pass a
variety of structure and behavior adequacy tests. The statistics
developed here provide a straightforward, easily interpreted
method to lend rigor to the analysis of historical behavior. The
root-mean-square percent error provides a simple way to gauge the
magnitude of the total error between simulated and actual
variables, The Theil inequality statistics are particularly well
suited for system dynamics models because they allow the analyst

to separate the fraction of the error due to excluded modes or

227

D-3393

noise from the error due to systematic differences between the

model and reality.

Other summary statistics may be more appropriate for some
purposes and systems. For example, a model focusing on an
oscillatory mode such as the business cycle will typically not be
expected to reproduce the point-by-point behavior of the system
because of the strong influence of noise on its exact trajectory.
In such a case, establishing confidence in the model rests on the
correspondence of the average period, amplitude, and phase rela~
tionships among the variables. appropriate summary statistics
might compare the means, variances, or spectral densities of the

variables.

The statistics proposed here are not tests of validity, but
are summary statistics: convenient, compact, ways to express the
correspondence between a model's behavior and numerical data.

The use of summary statistics (when numerical data exist) can
help to establish confidence in system dynamics models without
placing unwarranted emphasis onthe point-by-point correspondence
with historical data. But historical fit in itself must not be
viewed as a test of validity. Building confidence in the
structure of the model demands the analyst expose it to other,
more severe tests. Such tests may include statistical tests

where the important maintained hypotheses can be established, but

228
D-3393

will rest primarily on the structure and behavior adequacy tests
described in Table 1, The true test of a model is its ability to
reproduce historical behavior endogenously, with structure and
parameters that are consistent with descriptive knowledge of the
system. These are strong requirements that few models in
economics and social science can meet. Satisfying them is the
process of building confidence in a model, and a well-built and
carefully tested system dynamics model owes no apology to those

who would judge validity by statistics alone.

229

D-3393

APPENDIX

Computing the Summary Statistics with DYNAMO

The summary statistics presented above can be computed
easily with DYNAMO using the following macros. In general, a
simulation may start before and end after the period in which the
historical comparison is to be made. Further, the model may
compute the simulated values more frequently than the data are
available. It is necessary to compute the summary statistics

using sampled versions of simulated and actual data.

The Pick Function

MACRO PICK(ST,ET, PER) 1
PICK - PICK FUNCTION (DIMENSIONLESS) <2>
ST - START TIME FOR PICK FUNCTION (TIME UNIT)
BT - END TIME FOR PICK FUNCTION (TIME UNIT)
PER = - PERIOD OF DATA FOR PICK FUNCTION (TIME UNITS)
PICK.K=PULSE(1,ST, PER) * (1-STEP(1,ET+DT) ) A,2
PICK - PICK FUNCTION (DIMENSIONLESS) <2>
PULSE - PULSE FUNCTION
ST - START TIME FOR PICK FUNCTION (TIME UNIT)

PER - PERIOD OF DATA FOR PICK FUNCTION
(TIME UNITS)

STEP = STEP FUNCTION

eT - END TIME FOR PICK FUNCTION (TIME UNIT)

Dr - TIME STEP FOR SIMULATION (TIME UNITS)

MEND 3
The PICK function is used to sample a variable at a
specified period PER over a specified interval (from ST to ET):

1 TIME=ST,ST+PER,ST+2PER,
0 otherwise.

ET

PICK =

230
D-3393

PICK has a value of zero before ST and after ET, and takes a

value of 1 at intervals of PER between (and including) ST and ET.

Root-Mean-Square Percent Error Macro

MACRO RMSPE (HV,SV,ST,ET, PER, PE) 4
RMSPE - ROOT-MEAN-SQUARE PERCENT ERROR (%) <6>
HV - HISTORICAL VARIABLE (UNITS)
sv - SIMULATED VARIABLE (UNITS)
ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER - PERIOD OF DATA FOR PICK FUNCTION (TIME UNITS)
PE - PERCENT ERROR BETWEEN SIMULATED AND ACTUAL

VARIABLES (%) <5>
The RMSPE macro computes the root-mean-square percent error
between simulated and historical variables and also the percent
error at each moment. The RMSPE is computed every PER time units
between ST and ET, inclusive.
PE. K=100# (SV, K-HV. K) /HV.K
PE - PERCENT ERROR BETWEEN SIMULATED AND ACTUAL
VARIABLES (%) <5>
sv - SIMULATED VARIABLE (UNITS)
BV - HISTORICAL VARIABLE (UNITS)
The error between simulated and historical variables is computed

as a percent of the historical value,

RMSPE.K=SORT ($SSPE.K/$N.K) A,6
RMSPE - ROOT-MEAN-SQUARE PERCENT ERROR (%) <6>
SQRT - SQUARE ROOT
$8SPE - SUM OF SQUARED PERCENT ERRORS (8 SQUARED)
<9>
$n ~ NUMBER OF OBSERVATIONS (DIMENSIONLESS)
<7,18,30>

The root-mean-square percent error is defined as the square
root of the mean of the squared percent errors.

§N.K=$N.J+(DT/DT) *$IN.J

$N=1E-20

be?
E-2 N,7.1

D-3393

$n - NUMBER OF OBSERVATIONS (DIMENSIONLESS)
<7, 18, 30>
DT - TIME STEP FOR SIMULATION (TIME UNITS)
SIN - INCREMENT IN NUMBER OF OBSERVATIONS
(DIMENSIONLESS) <8,19,31>
SIN. K=PICK (ST, ET, PER) ALB
SIN - INCREMENT IN NUMBER OF OBSERVATIONS

(DIMENSIONLESS) <8,19,31>

PICK - PICK FUNCTION (DIMENSIONLESS) <2>

ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER ~ PERIOD OF DATA FOR PICK FUNCTION

(TIME UNITS)
The number of observations is incremented by one every PER time
units between ST and ET. (NB: The term 1£-20 prevents division
by zero in Eq. 6 before TIME=ST+PER, The term (DT/DT) in Eq. 7
is necessary only to prevent an “unusual format in level

equation" error from DYNAMO.)

$SSPE.K=$SSPE.J+(DT/DT) *$SPE.J Lo
$SSPE=0 N,9.1
$SSPE - SUM OF SQUARED PERCENT ERRORS (% SQUARED)
<9>
DT - TIME STEP FOR SIMULATION (TIME UNITS)
$SPE - SQUARED PERCENT ERROR (% SQUARED) <10>
$SPE.K=PE.K*PE,K*PICK (ST, ET, PER) A,10
$SPE - SQUARED PERCENT ERROR (% SQUARED) <10>
PE - PERCENT ERROR BETWEEN SIMULATED AND ACTUAL
VARIABLES (%) <5>
PICK - PICK FUNCTION (DIMENSIONLESS) <2>
ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET ~ END TIME FOR PICK FUNCTION (TIME UNIT)
PER ~ PERIOD OF DATA FOR PICK FUNCTION

(TIME UNITS)
MEND ql
The squared percent errors, sampled by the PICK function,

accumulate to yield the sum of squared percent errors,
D-3393

Inequality Statistics Macro
MACRO MSE (HV, SV,ST, ET, PER, UM,US, UC) 12

MSE = _ MEAN-SQUARE-ERROR (UNITS SQUARED) <13>

HV - HISTORICAL VARIABLE (UNITS)

sv - SIMULATED VARIABLE (UNITS)

ST - START TIME FOR PICK FUNCTION (TIME UNIT)

ET - END TIME FOR PICK FUNCTION (TIME UNIT)

PER = - PERIOD OF DATA FOR PICK FUNCTION
(TIME UNITS)

om - FRACTION OF MSE DUE TO BIAS (FRACTION)
<14>

us - FRACTION OF MSE DUE TO UNEQUAL VARIATION
(FRACTION) <15>

uc - FRACTION OF MSE DUE TO UNEQUAL COVARIATION

(FRACTION) <16>
The MSE macro computes the root-mean-square error between
simulated and historical variables and the Theil inequality

proportions u", uS, ana u°.

MSE.K=$SSE.K/$N.K A,13
MSE ~ MEAN-SQUARE-ERROR (UNITS SQUARED) <13>
$SSE - SUM OF SQUARED ERRORS (UNITS SQUARED) <19>
SN - NUMBER OF OBSERVATIONS (DIMENSIONLESS)
<7,17,29>
UM. x (SMSV.K-SMHV.K) (SMSV.K-SMHV.K) /(1E-204MSE.K) A,14
- FRACTION OF MSE DUE TO BIAS (FRACTION)
<14>
$MSV  - MEAN OF SIMULATED VARIABLE (UNITS) <21>
$MHV  - MEAN OF HISTORICAL VARIABLE (UNITS) <22>
MSE - MEAN-SQUARE-ERROR (UNITS SQUARED) <13>

US .K= ($SDSV.K-$SDHV.K) (SSDSV.K-S$SDHV.K)/(1E-204MSE.K) A, 15
us ~ FRACTION OF MSE DUE TO UNEQUAL, VARIATION
(FRACTION) <15>
$SDSV - STANDARD DEVIATION OF SIMULATED VARIABLE

(UNITS,
$SDHV - STANDARD DEVIATION OF HISTORICAL VARIABLE
(UNITS)
MSE = MEAN-SQUARE-ERROR (UNITS SQUARED) <13>
UC. Re (2) (2-SCORR-R) (S508V-K) ($SDAV.K)/(1B-204M8E-K) A,16
- FRACTION OF MSE DUE TO UNEQUAL

COVARIATION (FRACTION) <16>

233

D-3393

$CORR - CORRELATION COEFFICIENT BETWEEN SIMULATED
AND HISTORICAL VARIABLES (DIMENSIONLESS)

<23>

$SDSV - STANDARD DEVIATION OF SIMULATED VARIABLE
(UNITS)

$SDHV - STANDARD DEVIATION OF HISTORICAL VARIABLE
(UNITS)

MSE - MEAN-SQUARE-ERROR (UNITS SQUARED) <13>

A running measure of the mean-square-error (MSE) is computed
as the model moves through time. The fraction of the MSE due to
bias is given by the squared difference in the means of simulated
and historical series relative to the MSE. The fraction of the
MSE due to unequal variation is given by the squared difference
in the standard deviation of simulated and historical series
relative to the MSE. The fraction of the MSE due to unequal
covariation is given by the product (2) (l-r) (Sg) (s,) relative to
the MSE. (A small number is added to the denominator of the

inequality proportions to prevent division by zero.)

§N.K=$N.J+ (DT/DT) *$IN.J L,17
SN=1E-20 NAAT
$n - NUMBER OF OBSERVATIONS (DIMENSIONLESS)
<7,17,29>
DT - TIME STEP FOR SIMULATION (TIME UNITS)
SIN - INCREMENT IN NUMBER OF OBSERVATIONS
(DIMENSIONLESS) <8,18,30>
$IN.K=PICK (ST,ET, PER) A,18
$IN = INCREMENT IN NUMBER OF OBSERVATIONS

(DIMENSIONLESS) <8,18,30>
PICK FUNCTION (DIMENSIONLESS) <2>

PICK -

ST ~ START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER ~ PERIOD OF DATA FOR PICK FUNCTION

(TIME UNITS)

234
D~3393

The running total of the number of observations is calculated

exactly as in the RMSPE macro.

$SSE.K=$SSE.J+ (DT/DT) *$SE.J

$SSE=0
$SSE - SUM OF SQUARED ERRORS (UNITS SQUARED)
<19>
DT ~ TIME STEP FOR SIMULATION (TIME UNITS
$SE  - SQUARED ERROR (UNITS SQUARED) <20>

SSE. Kz (SV. K-HV.K) (SV-K-HV. K) *PICK (ST, ET, PER)
§SE - SQUARED ERROR (UNITS SQUARED) <20>

sv - SIMULATED VARIABLE (UNITS)

HV ~ HISTORICAL VARIABLE (UNITS)

PICK - PICK FUNCTION (DIMENSIONLESS) <2>

st - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER =~ PERIOD OF DATA FOR PICK FUNCTION

(TIME UNITS)

The squared errors, sampled by the PICK function, accumulate in

the sumof squared errors.

§MSV.K=MEAN (SV.K, ST, BT, PER, $SDSV.K)

§MSV - MEAN OF SIMULATED VARIABLE (UNITS) <21>
MEAN  ~ MEAN OF INPUT SERIES (UNITS) <28>

sv - SIMULATED VARIABLE (UNITS)

ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER - PERIOD OF DATA FOR PICK FUNCTION

(TIME UNITS)
$SDSV - STANDARD DEVIATION OF SIMULATED VARIABLE

(UNITS)
SMHV. K=MEAN (HV.K, ST, ET, PER, SSDHV. K)
$MHV - MEAN OF HISTORICAL VARIABLE (UNITS) <22>
MEAN - MEAN OF INPUT SERIES (UNITS) <28>
HV - HISTORICAL VARIABLE (UNITS)
ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER - PERIOD OF DATA FOR PICK FUNCTION
(TIME UNITS)
$SDHV - STANDARD DEVIATION OF HISTORICAL VARIABLE

(UNITS)

235

A,21

Ay22

D-3393

The means and standard deviations of the simulated and historical
variables are calculated over the relevant range of data by the
MEAN macro (below).

SCORR. K= ( ($SPSH.K/$N.K) -$MSV.K*$MHV. K) /(1E-20+
$SDSV.K*$SDHV.K) A,23
$CORR - CORRELATION COEFFICIENT BETWEEN SIMULATED
AND HISTORICAL VARIABLES (DIMENSIONLESS)
<23>

$SPSH - SUM OF PRODUCTS OF SIMULATED AND HISTORICAL
VARIABLES (UNITS SQUARED) <24>

$n - NUMBER OF OBSERVATIONS (DIMENSIONLESS)
£7,17,29>
SMSV - MEAN OF SIMULATED VARIABLE (UNITS) <21>
SMHV  - MEAN OF HISTORICAL VARIABLE (UNITS) <22>
$SDSV - STANDARD DEVIATION OF SIMULATED VARIABLE
(UNITS)
$SDHV - STANDARD DEVIATION OF HISTORICAL VARIABLE
(UNITS)
S$SPSH.K=$SPSH.J+(DT/DT) *$PSH.J L,24
SSPSH=! N,24.2
$SPSH - SUM OF PRODUCTS OF SIMULATED AND
HISTORICAL VARIABLES (UNITS SQUARED)
<24>
DT ~ TIME STEP FOR SIMULATION (TIME UNITS)
$PSH - PRODUCT OF SIMULATED AND HISTORICAL
VARIABLES (UNITS SQUARED) <25>
$PSH.K=SV.K*HV.K*PICK (ST, ET, PER) A,25
$PSH - PRODUCT OF SIMULATED AND HISTORICAL
VARIABLES (UNITS SQUARED) <25>
sv - SIMULATED VARIABLE (UNITS)
HV - HISTORICAL VARIABLE (UNITS)
PICK - PICK FUNCTION (DIMENSIONLESS) <2>
sT - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER PERIOD OF DATA FOR PICK PUNCTION
(TIME UNITS)
MEND 26

The "hand computation" formula is used to calculate the correla-
tion coefficient between simulated and historical series as the
model moves through time. The hand computation formula is based

on the definition of the correlation coefficient

236
D-3393
= COV(S,A)
85°

and the following formula for covariance

covis,a) = 23% (s,-5) a,-B1

1
= 2} say - 5%

Mean and Standard Deviation Macro

MACRO MEAN(IS,ST, ET, PER, SD) 27
MEAN . - MEAN OF INPUT SERIES (UNITS) <28>
is - INPUT SERIES (UNITS)
ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER - PERIOD OF DATA FOR PICK FUNCTION
(TIME UNITS)
sp - STANDARD DEVIATION OF INPUT SERIES (UNITS)
<33>

The MEAN macro computes running means and standard deviations

over a specified range and periodicity of data.

MEAN. K=$S1S.K/SN.K A,28
MEAN  - MEAN OF INPUT SERIES (UNITS) <28>
$SIS - SUM OF INPUT SERIES (UNITS) <31>
SN - NUMBER OF OBSERVATIONS (DIMENSIONLESS)
<7,17,29>

The mean is defined as the sum of the sampled input series over

the number of observations.

§N.K=$N.J+(DT/DT) *$1N. 3 L,29
§N=1E-20 N,29.1
SN - NUMBER OF OBSERVATIONS (DIMENSIONLESS)
<7,17,29>
DT - TIME STEP FOR SIMULATION (TIME UNITS)
$IN - INCREMENT IN NUMBER OF OBSERVATIONS

(DIMENSIONLESS) <8,18,30>

237

D-3393

SIN. K=PICK (ST, ET, PER) A,30
SIN - INCREMENT IN NUMBER OF OBSERVATIONS
(DIMENSIONLESS) <8,18,30>
PICK - PICK FUNCTION (DIMENSIONLESS) <2>
ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER - PERIOD OF DATA FOR PICK FUNCTION

(TIME UNITS)

The number of observations is computed exactly as in the RMSPE

and RMSE macros.

$SIS.K=$SIS.J+(DT/DT) *$18.3 1,32
$SIS=0 N,31.1
$SIS  - SUM OF INPUT SERIES (UNITS) <31>
DT - TIME STEP FOR SIMULATION (TIME UNITS)
$IS | - SAMPLED INPUT SERIES (UNITS) <32>
$IS.K=IS.K*PICK (ST, ET, PER) A, 32
$IS | - SAMPLED INPUT SERIES (UNITS) <32>
Is - INPUT SERIES (UNITS)
PICK - PICK FUNCTION (DIMENSIONLESS) <2>
ST - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER -- PERIOD OF DATA FOR PICK FUNCTION
(TIME UNITS)
The sampled input series is summed for computation of the mean.
SD.K=SQRT (MAX (0, ($SISS.K/$N.K)-MEAN. K*MEAN.K) ) A,33
sD - STANDARD DEVIATION OF INPUT SERIES
(UNITS) <33>
SQRT - SQUARE ROOT
MAX = = MAXIMUM FUNCTION
$SISS - SUM OP INPUT SERIES SQUARED
(UNITS SQUARED) <34>
Su - NUMBER.OF OBSERVATIONS (DIMENSIONLESS)
<7,17,29>
MEAN - MEAN OF INPUT SERIES (UNITS) <28>
$SISS.K=$SISS.J+(DT/DT) *$1SS.J L,34
$sISs=0 N,34.1

$SISS - SUM OF INPUT SERIES SQUARED
(UNITS SQUARED) <34>

DT --TIME STEP FOR SIMULATION (TIME UNITS)
$ISS - INPUT SERIES SQUARED (UNITS SQUARED)
<35>
238
D-3393
$ISS.K=IS.K*IS.K*PICK (ST,ET, PER) A,35
$Iss - INPUT SERIES SQUARED (UNITS SQUARED) <35>
Is - INPUT SERIES (UNITS)
PICK - PICK FUNCTION (DIMENSIONLESS) <2> l.
st - START TIME FOR PICK FUNCTION (TIME UNIT)
ET - END TIME FOR PICK FUNCTION (TIME UNIT)
PER - PERIOD OF DATA FOR PICK FUNCTION
(TIME. UNITS) 2.
MEND 36
The "hand computation" formula for the standard deviation is 3.
used to calculate the standard deviation without prior knowledge
of the mean. The hand computation formula follows from the
definition of variance: Be
6.
i x2
varia) = 25) (x-)?)
1 2. 1 2 7.
: =2Z 0) - bBo
8.
= 2c) ~ van (xy 1?
o
However, the hand computation formula (and the hand computation
formula for the correlation coefficient above) are subject to
more round-off error than the computations based on the defini-
tions of variance and covariance. (The error is larger because
the hand computation involves small differences of large num-
bers.) To guard against the possibility that the difference
$SISS.K/$N.K-MEAN.K*MEAN.K is negative, a MAX function is
inserted in Eq. 33. Round-off error has not been a problem in 10.
actual applications to date. Sterman 1981 describes an alterna- iy
tive approach to computing the statistics described here that
12.

involves less round-off error but is more cumbersome to use.

239

D-3393

NOTES

I am indebted to Ernst R. Berndt, Jack B. Homer, and George
P. Richardson for many useful comments and criticisms; of
course, all errors are mine.

E.g., Forrester 1961, Ch. 13; Ansoff and Slevin 1968;
Forrester 1968; Nordhaus 1973; Forrester 1973; Forrester et
al. 1974; Forrester and Senge 1980; Richardson and Pugh 1981.
115.

Forrester 1961, p.

Naylor and Finger 1967, p. B-97, McKenney 1967, p. B-102,
Hermann 1967, pp. 217£f, Lilien 1975, Pindyck and Rubinfeld
1976, p. 315. Greenberger et al. 1976, pp. 62-63 and -70-74.

Forrester 1961, p. 123.

The t-statistic and other common tests of significance are
discussed in any introductory econometrics text, e.g.,
Pindyck and Rubinfeld 1976.

Pindyck and Rubinfeld 1976, p. 37.

Mass and Senge 1978, p. 451.

Mass and Senge 1978 demonstrate that a moderate amount of
measurement error causes insignificant t-statistics in OLS
estimation of a model with the same specification as the
model used to generate the data in the first place. One way
to recover from such errors (if they are detected) is to
employ a more sophisticated estimation procedure, such as
full-information maximum likelihood estimation using Kalman
filtering (Peterson 1980). A simpler and often more
illuminating approach is to subject the model to the behavior
anomaly test (Table 1): does anomalous behavior arise if the
assumption is deleted? Senge 1978 uses such behavior tests
to establish the necessity of various hypotheses in an
investment function whose statistical performance was only
slightly better than that of the neoclassical function.

Hausman 1978 presents a test of model specification.
Forrester 1973, pp. 24-31. The operator/observer distinction
is an important one that accounts for much of the
disagreement on validation

Churchman 1973, p. 12.

240
D-3393

13.

14.

15.

16.
is
18.

19.
20.

21.

22.

Greenberger et al., pp. 70-71.

Naylor and Finger 1967 propose “multistage verification" as a
confidence-building procedure (see also Naylor 1971).

Hermann 1967 proposes a five-stage confidence-building
approach; Emshoff and Sisson 1970 also emphasize an itetative
and variegated approach to confidence building; Schrank and
Holt 1967 stress "the criterion of usefulness," (p. B-105).

Naylor and Finger 1967 (p. B-97) endorse Cyert's version of
the Behavior Reproduction test, which lists eight attributes
of similarity, the least important of which is "exact
matching of values of variables.”

Forrester 1961, app. K.
Phelps Brown 1972, pp. 5-6.

The Mistaken Identity Test is described in Forrester 1973,
pp. 53-54, For examples, see Naill 1977, app. A and Runge
1976, Ch. 5. The Mistaken Identity Test is similar to
McKenney's 1967 proposal to employ a Turing Test as “an
adequate method of validation." The Turing Test or imitation
game (Turing 1950) was originally proposed as a sufficient
test-for artificial intelligence: in Turing's view, if a
panel of human interrogators cannot distinguish the
performance of a machine (model) from that of a human (real
system), then the machine (model) is an artificial
intelligence (valid model).

Zellner 1980, p. 567.

The philosophy of parameter estimation in system dyanmics is
described in Forrester 1961, e.g., pp. 171-172, Forrester
1980, pp. 559-560, and Richardson and Pugh 1981, pp. 230-240.
Estimation of parameters "below the level of aggregation" of
a model is discussed in Graham 1980. The fallacy in using
aggregate data to evaluate structural relationships is
illustrated by Nordhaus 1973 and exposed by Forrester et al.

The maintained hypotheses of most single-equation techniques
are violated by the multiloop, nonlinear nature of complex
feedback systems with measurement error. However, optimal
filtering (Peterson 1980) offers a promising approach to
formal estimation of system dynamics models.

See, e.g., Pindyck and Rubinfeld 1976, pp. 157££.

D-3393

23. The summary statistics described here may be useful even if
the data are used to estimate the parameters. Relative to an
ex post forecast, such a case, like an in-sample simulation
Of an econometric model, is a weaker test of a model if it
passes and a stronger test if it fials.

24, The RMSPE and other error measures are discussed by Pindyck
and Rubinfeld 1976, pp. 314-320. Other normalizations
include the root-mean-square error as a percent of the mean,

Theil 1966, pp. 27-28, divides the MSE by the mean of the
squared actual values to define what he calls U, the
"fnequality coefficient";

L 2

FDC

ao

These and other normalizations have individual strengths and
weaknesses in particular situations. As usual one cannot
apply statistics blindly without considering the purpose of
the analysis or the nature of the data.

25. The description of the Theil statistics is reproduced from
Theil 1966, Ch, 2.3-2.5.

26. It would also make one suspect the model had been “fine-
tuned" with exogenous variables to match the phasing. Most
models of cyclical phenomena must include some randomness to
excite the latent oscillatory modes, but inclusion of noise
will certainly cause the turning points to differ from
historical behavior in the same way that two samples drawn
from the same distribution will differ point by point. In
practice, the error shown in Figure 3c is unlikely to arise.

27. Note that when the purpose of a model abstracts from a cycle,
the cycle becomes noise: that part of the decisionmaking
process that is not modeled. In such a case there will be
high serial correlation in the residuals, but the presence of
such autocorrelation does not. compromise the model.

242
D-3393

28. Sterman 1981, Sterman 1982.

29. Note that RNSPE? # MSE, so one cannot multiply the RMSPE? by
u", uS, or u° to yield the "RNSPE due to bias," etc.

D-3393

REFERENCES

Ansof£, Igor and Slevin, Dennis, "An Appreciation of Industrial
Dynamics," Management Science 14(7), March 1968, pp. 383-397.

Churchman, C. W., “Reliability of Models in the Social Sciences,"
Interfaces, 4(1) November 1973, pp. 1-12.

Emshoff, James and Sisson, Roger, Design and Use of Computer
imulation Models. New York, naewitian, 197

Forrester, Jay W., Industrial Dynamics. Cambridge: The MIT
Press, 1961.

Forrester, Jay W., Industrial Dynamics--A Response to Ansoff and
Slevin," Management Science, 14(9), May 1968, pp. 601-618.

Forrester, Jay W., "Confidence in Models of Social Behavior--With
Emphasis on System Dynamics Models," Working paper D-1967,
System Dynamics Group, MIT, December 1973.

Forrester, Jay W., "Information Sources for Modeling the National
Economy," Journal of the American Statistical Association,
75 (371), September 1980, pp. 555-574.

Forrester, Jay W. et al., "The Debate on World Dynamics--A
Response to Nordhaus,” Policy Sciences 5 (1974), Pp- 169-190.

Forrester, Jay W. and Senge, Peter M., “Tests for Building
Confidence in System Dynamics Models," TIMS Studies in the
gums Studies in the

Management Sciences 14 (1980), pp. 201-228.
Graham, Alan K., "Parameter Estimation in System Dynamics

Modeling," Elements of the System Dynamics Method (J¢érgen
Randers, ed.). Cambridge: The MIT Press, 1960, pp. 143-161

Greenberger, Martin et al., Models in the Policy Process. New
York: Russell Sage Foundation, 1976.

Hausman, J. A. “Specification Tests in Econometrics,"
Econometrica, 46 (1978), pp. 1251-1272.

Hermann, C. F., "Validation Problems in Games and Simulations,"
Behavioral Science, 12 (1967), pp. 216-231.

Lilien, Gary L., "Model Relativism: A Situational Approach to
Model Building," Interfaces, 5(3) (May 1975), pp. 11-18.
D-3393

Mass, Nathaniel J. and Senge, Peter M., “Alternative Tests for
the Selection of Model Variables," IEEE Transactions on

Systems, Man, and Cybernetics, SMC-8, no. 6, June 1978, pp.
450-460.

McKenney, James L., "Critique of ‘Verification of Computer
Simulation Modeis,'" Management Science, 14(2) (October
1967), pp. B-102 ~ B-103.

Naill, Roger F., Managing the Ener. Transition. Cambridge:
Billingst, (S77

Naylor, T. H. and Finger, J. M., "Verification of Computer
Simulation Models," Management Science, 14(2) (October 1967),
pp. B-92 - B-101.

Naylor, T. H., Computer Simulation Experiments with Models of
Economic Systems. New York: Wiley, 1971.

Nordhaus, William D., "World Dynamics: Measurement Without
Data," The Economic Journal 83, pp. 1156-1183.

Peterson, D. W., "Statistical Tools for System Dynamics,"
Elements of the System Dynamics Method (Jérgen Randers, ed.).
Cambridge: The NIT Press, 1980, pp. 224-245.

Phelps Brown, E. H., “The Underdevelopment of Economics,"
Economic Journal, 82(325) (March 1972), pp. 1-10.

The

Pindyck, Robert S. and Rubinfeld, Daniel L., Econometric Models
and Economic Forecasts. New York: McGraw Hill, 1976.

Richardson, George P. and Pugh, Alexander L., Introduction to
System Dynamics Modeling with DYNAMO. Cambridge: The MIT
Press, 1981.

Schrank, W. E., and Holt, C. C., "Critique of ‘Verification of
Computer Simulation Models,'" Management Science, 14(2)
(October 1967), pp. Brl04 = B-T0B

Senge, Peter M. The System Dynamics National Model Investment

Punction: A Comparison to the Neoclassical Investment
Functioi Ph.D. Dissertation, M.1.T.: 1978.

Sterman, John D., The Energy Transition and the Economy: A
System Dynamics Approach. Ph.D, Dissertation, M.I.T.: 1981.

Sterman, John D., “Economic Vulnerability and the Energy
Transition," Energy Systems and Policy. Forthcoming 1983.

D-3393

Theil, Henri, Applied Economic Forecasting. Amsterdam: North
Holland Publishing Company, 196:

Turing, Alan M., "Computing Machinery and Intelligence," Mind
5(1950), pp. 433-460.

Zellner, Arnold, "Comment" on Forrester's 'Information Sources
for Modeling the National Economy,' Journal of the American

Statistical Association 75(371), September 1980, pp. 567-569.

246
D-3393

Table 1. Tests for Building Confidence
In System Dynamics Models*

Tests of Model Structure

Question Addressed by the Test

Structure Verification

Parameter Verification

Extreme Conditions
Boundary Adequacy

(Structure)

Dimensional Consistency

Tests of Model Behavior

Behavior Reproduction

Behavior Anomaly

Family Member

Is the model structure consistent
with relevant descriptive knowledge
of the system?

Are the parameters consistent with
relevant descriptive (and numerical,
when available) knowledge of the
system?

Does each equation make sense even
when its inputs take on extreme
values?

Are the important concepts for
addressing the problem endogenous to
the model?

Is each equation dimensionally
consistent without the use of
parameters having no real-world
counterpart?

Does the model endogenously generate
the symptoms of the problem, be-
havior modes, phasing, frequencies,
and other characteristics of the
behavior of the real system?

Does anomalous behavior arise if an
assumption of the model is deleted?

Can the model reproduce the behavior
of other examples of systems in the
same class as the model (e.g., can
an urban model generate the behavior
of New York, Dallas, Carson City,
and Calcutta when parameterized for
each)?

247

D-3393

Surprise Behavior
Extreme Policy

Boundary Adequacy
(Behavior)

Behavior Sensitivity

Statistical Character

Tests of Policy Implications

System Improvement
Behavior Prediction
Boundary Adequacy

(Policy)

Policy Sensitivity

Does the model point to the exis-
tence of a previously unrecognized
mode of behavior in the real system?

Does the model behave properly when
subjected to extreme policies or
test inputs?

Is the behavior of the model sensi-
tive to the addition or alteration
of structure to represent plausible
alternative theories?

Is the behavior of the model sensi-
tive to plausible variations in
parameters?

Does the output of the model have
the same statistical character as
the "output" of the real system?

Is the performance of the real
system improved through use of the
model?

Does the model correctly describe
the results of a new policy?

Are the policy recommendations
sensitive to the addition or
alteration of structure to represent
plausible alternative theories?

Are the policy recommendations
sensitive to plausible variations in
parameters?

* Adapted from Forrester and Senge 1980, esp. p. 227, and from
Richardson and Pugh 1981, esp. pp. 313-319.

248
D-3393

D-3393
Table 2: Summary of Energy-Economy Model Boundary Table 3: Error Analysis of Energy-Economy Model
Root Mean Mean Square Inequality Statistics®
Square Error « 6 ¢
. Percent Error MSE U v u'

SREERGUS EKOGENOUS Variable RMSPE (%) (units?) (£raction of MSE)
GNP Population
Consumption Technological change 2
Investment Historical OPEC Price Telitior ier By 3.2 9.7x10 10 +00 +90
Savings tna (endogenous after 1982) [Bitiion year)
Mages: (Weal aaa Neatnaay Real Consumption 4.7 19x10? +5400 629017

Billion 1972 $/year)
Inflation Rate ¢
Lab o
Employment. paeteneeesen Consumption Fraction? 3.6 8.7x1074 +46 01.53
Unemployment (Zract ion)
Interest Rates 2
Money Supply Real Private 11.7 3.5x10' +02 410.88
Debt =
Energy Production (Billion 1972 $/year)
Energy Demand -2
Energy Imports eee wee 5.4 9.0x10 «10 +23 .67

1972: $/person/year)

Workforce Partici- 2.5 2.2x1074 +75 «16 +09
pation Fraction

(fraction)

Primary Energy 4.0 4.2x10° 204 605.92
Consumption
(Quads/year)

Primary Energy 71.6 9.1x10° +140 2659.
Production
(Quads/year)

4 +12 601.87

Energy Import 13.9 1.8x107
Fraction

(Eraction)

Real Energy Price 14.0 4.9x1073
(1972 $/mMBTU)

Net Energy 9.7 2.2x10) +16 0-62.22
Consumption
(Quads/year)

Totals may not add due to rounding.
Real Consumption/Real GNP
Computed from 1960 to 1977

a
b

249 250
1S2

2g2

Figure 1. Interpretation of the Theil Inequality Statistics

vt uS uv characterization Interpretation
a) 1 0 0 S,=A,+BIAS ‘Systematic error.
y (S is equal to
A translated
A by a constant BIAS.)
») 0 1 0 Sy-B=K(A,-A) Systematic error:
A where S=A. 8 and A have
(S, is a stretched different trends.
3 version of A, about
their common mean.)
ec) oO 1 © Same as (b). Systematic error:
S$ and A have the same
phasing but different
magnitude fluctu-
i
ations.
vos Characterization Interpretation

a) O 1 0 Sysks Ayskee(t) ‘The error is
where F(E)=0 unsystematic
(A has cycles unless purpose

s or noise not ‘otimadeliaasto
present in's.) study the cycles
A in A.
e) 0 0 1 AysKeKsin(wt)s Same means and
a variances but
8,-S+ksin(wtrp) ones
- phasing differs:
where A=S. peanenay
(S is a translation waayateaatl® aez0e,
in time of A by a
phase margin.)

f) ° a ira Sy=f(t); S and A have the
Ayrt(t)+e, same mean and trends
ansee eso but vary point by
(seqiata.e point: Unsystematic
sae TER error unless purpose

different values
of the ‘error'

tera.)

is to study the
cycles in A.
D-3393

D-3393

(wotevinsno Auyyulid JO NOWDVUs)
NOLOvus 280d AUN.

(uvaa/e Test worm@)
ANBWasdanl a1vAvad WEY

feaanattesstte

erage

Energy Import Fraction

Figure 3a:

Real Private Investment

Figure 2a:

(ama wnisy 30 %)
NOMLvu4 LUO ADUANT NI YOUU

(amvA wWALrv 30 *f)
ALNawlasannt aLyAVd WA AI YT

Real Private Investment: Residuals

Figure 2b:

Residuals

Energy Import Fraction:

Figure 3b

254

253
D-3393

D-3393

NOMaWnSNOD AUN 13N

(nnenw/ $ 2u61)
BoWud A9W3NB WAY JOvUaAN

roan

nawesat

Real Energy Price

Figure 4a:

Net Energy Consumption

Figure Sa:

(amu wnigy 40%)
NOLLAWNSNOD ADIN 13N Nt HORM

3omid AOUIN] WSU JOVUBAY NI HOUNS

Residuals

Pigure 4b:

Residuals

Net Energy Consumption:

Figure Sb:

Real Energy Price:

255

256
D-3393

D-3393

(tolnawnsnoo auntie 40 NOUV)
NOLDWS IHOAAI AS'NI

. *
(vans TAH nor)
ANBWASAANI BLUE WEB

rosea

Energy Import Fraction

Figure 3a:

Real Private Investment

Figure 2a:

(ann waioy 40%)
NOHOWus LNOgWI AON NI YOUU

(ANA WNL 30 “e)
ANSWLIFARN BLYALSS WB NI YO

ena

Residuals

Real Private Investment:

eat

Figure 2b:

Residuals

Energy Import Fraction

Figure 3b:

258

257
D-3393

D-3393

(awa, /Sauno}
NNouawnsNoD .9H3N3 13N

(nusW /$ 2261)
‘30}ud A0uaNa TW3Y 3OvUaAY

Net Energy Consumption

Figure 5a:

Real Energy Price

Figure 4a:

(ana Tn.2¥ 40%)
NOLLAWNSNOD ADNINS13N MI HOMER

(ana wwhioy 20 %)
‘3ojtd AOU3NA WSU 39MEBAY MI NOUNS

Residuals

Net Energy Consumption:

Figure 5b:

Real Energy Price: Residuals

Figure 4b:

260

259

Metadata

Resource Type:
Document
Description:
System Dynamics models are often faulted for their reluctance to employ formal measures of goodness-of-fit when assessing the historical behavior of models. As a result, the validity of system dynamics models is often questioned even when the model’s correspondence to historical behavior is quite good. This paper argues that the failure to present formal analysis of historical behavior creates an impression of sloppiness and unprofessionalism. After reviewing the concept of validity in simulation modeling, the paper proposes a simple set of summary statistics appropriate for system dynamics models (the root-mean-square error and Theil inequality statistics). The statistics allow the error due to individual behavior modes to be analyzed, do not require the use of formal parameter estimation procedures, and can be conveniently computed. A large model of the U.S. economy is used to illustrate the use of statistics.
Rights:
Date Uploaded:
December 5, 2019

Using these materials

Access:
The archives are open to the public and anyone is welcome to visit and view the collections.
Collection restrictions:
Access to this collection is unrestricted unless otherwide denoted.
Collection terms of access:
https://creativecommons.org/licenses/by/4.0/

Access options

Ask an Archivist

Ask a question or schedule an individualized meeting to discuss archival materials and potential research needs.

Schedule a Visit

Archival materials can be viewed in-person in our reading room. We recommend making an appointment to ensure materials are available when you arrive.