Model Evaluation Principles
Guidelines for how to evaluate models.
Four principles
It is important to establish a framework for assessing whether a modeling system (i.e., the emissions, meteorological, and dispersion models and their supporting data sets) performs with sufficient reliability to justify its use in analysis and planning activities. Such a framework rests on the following four principles:
The model should be viewed as a system
When we refer to evaluating a “model”, we mean the model in the broad sense: not only the core model itself but also its various components, including companion preprocessor models (i.e., the emissions and meteorological models), the supporting geophysical, aerometric, and emissions databases, and any other related analytical and numerical procedures used to produce modeling results. A principal emphasis in the model testing process is to identify and correct flawed model components.
Model acceptance is a continuing process of non-rejection
Over-reliance on explicit or implied model “acceptance” criteria should be avoided. Models should be accepted gradually as a consequence of successive non-rejections. Over time, confidence in a model builds as it is exercised in a number of different applications without encountering major or fatal flaws that cause the model to be rejected.
Previous experience should be used as a guide
Previous modeling experience serves as a primary guide for judging model acceptability. Interpreting the modeling results for each episode against the backdrop of previous modeling experience aids in identifying potential performance problems and suggests whether the model should be tested further or rejected.
Criteria for judging model performance should remain flexible
The criteria for judging the acceptability of model performance should remain flexible. Model performance cannot be quantified independently of a specific use case, and flexibility is required because model use is both varied and changing.
Four performance testing types
Operational Evaluation
An assessment of the ability of the model to correctly estimate values without regard to whether the individual process descriptions in the model are accurate. That is, operational evaluation is an examination of how well the model reproduces the observed values in time and space consistent with the input requirements of the model. The operational evaluation gives little, if any, information about whether the results are correct from a scientific perspective or whether they are simply the fortuitous product of compensating errors. Therefore, a “successful” operational evaluation is necessary but insufficient in terms of establishing reliable model performance.
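As an illustration of what an operational evaluation can involve in practice, the sketch below compares paired modeled and observed values using three common summary statistics. The choice of statistics (mean bias, root-mean-square error, and Pearson correlation), the function name, and the sample values are all assumptions made for this example; they are not prescribed by the text above.

```python
import math


def operational_stats(observed, modeled):
    """Summarize agreement between paired observed and modeled values.

    Hypothetical helper for illustration only; the statistics chosen
    here are assumptions, not a required set.
    """
    if len(observed) != len(modeled) or len(observed) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")

    n = len(observed)
    diffs = [m - o for o, m in zip(observed, modeled)]

    # Mean bias: positive means the model overestimates on average.
    mean_bias = sum(diffs) / n

    # Root-mean-square error: typical size of the model-observation gap.
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    # Pearson correlation: does the model track the observed variation?
    mean_o = sum(observed) / n
    mean_m = sum(modeled) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modeled))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_m = sum((m - mean_m) ** 2 for m in modeled)
    if var_o == 0 or var_m == 0:
        r = float("nan")  # correlation is undefined for a constant series
    else:
        r = cov / math.sqrt(var_o * var_m)

    return {"mean_bias": mean_bias, "rmse": rmse, "r": r}


# Made-up paired values, matched in time and space.
observed = [10.0, 20.0, 30.0, 40.0]
modeled = [12.0, 18.0, 33.0, 41.0]
print(operational_stats(observed, modeled))
```

Good scores on statistics like these say only that outputs match observations; as noted above, they cannot show whether the match comes from correct process descriptions or from compensating errors.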
Scientific Evaluation
Determines whether the model’s behavior, in the aggregate and in its component modules, is consistent with prevailing theory, knowledge of physical processes, and observations. The main objective is to reveal the presence of bias and internal (compensating) errors in the model that, unless discovered and rectified, or at least quantified, may lead to erroneous or fundamentally incorrect policy decisions based on model usage. The scientific evaluation ideally consists of a series of diagnostic and mechanistic tests aimed at (1) examining the existence of compensatory errors, (2) determining the causes of failure of a flawed model, (3) stressing a model to ensure failure if indeed the model is flawed, and (4) providing additional insight into model performance beyond that supplied through routine, operational evaluation procedures.
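To make the idea of compensating errors concrete, the following toy sketch checks each stage of a modeling chain against observations and flags the case where the final output looks acceptable while an intermediate stage does not. The stage names, the numbers, the ratio-based comparison, and the 20% tolerance are all hypothetical choices for this example, not values taken from any particular model or guideline.

```python
def find_compensating_errors(modeled, observed, final_stage, tolerance=0.2):
    """Flag a good final result that hides poor intermediate stages.

    `modeled` and `observed` map stage names to values. The tolerance is
    an arbitrary illustrative threshold on |modeled/observed - 1|.
    """
    ratios = {}
    for stage, obs_value in observed.items():
        if obs_value == 0:
            raise ValueError(f"observed value for {stage!r} is zero")
        ratios[stage] = modeled[stage] / obs_value

    final_ok = abs(ratios[final_stage] - 1.0) <= tolerance

    # Intermediate stages whose modeled/observed ratio is outside tolerance.
    suspect = [
        stage
        for stage, ratio in ratios.items()
        if stage != final_stage and abs(ratio - 1.0) > tolerance
    ]

    # Compensation is suspected only when the end result looks fine
    # even though at least one upstream stage does not.
    return final_ok and bool(suspect), suspect, ratios


# Made-up example: emissions are doubled and plume rise is halved,
# yet the final concentration is within 5% of the observation.
observed = {"emissions": 100.0, "plume_rise": 1000.0, "concentration": 20.0}
modeled = {"emissions": 200.0, "plume_rise": 500.0, "concentration": 21.0}

flagged, suspect, ratios = find_compensating_errors(
    modeled, observed, final_stage="concentration"
)
print(flagged, suspect)
```

A check like this depends on having observations for the intermediate stages, which is often the limiting factor in practice.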
Mechanistic Evaluation
Explores the behavior of individual process modules within the overall modeling system with the intent of identifying possible flaws and/or systematic biases that may not be apparent when examining the model as a whole. The mechanistic evaluation is often severely constrained by the available data. This lack of available mechanistic tests is mitigated somewhat by the fact that the various modules comprising models have undergone some testing prior to their incorporation into an overall modeling framework. While these modules have not necessarily been tested extensively with observation data (since these data are invariably lacking), significant numerical experimentation has, nevertheless, been applied to each one.
Dynamic Evaluation
This type of evaluation has only recently been used in air quality modeling, and has not been applied to some other kinds of modeling covered by SEMIP. Since the real impact of potential emission control scenarios on air quality cannot be directly evaluated, the usual approach taken in regulatory modeling is to establish a model’s credibility based on its ability to reproduce observed concentrations for the “base case”. This base case reflects a simulation under recent historical meteorological conditions with the best emission inventory available. Evaluating the model’s ability to reproduce past air quality conditions is very different from evaluating the model’s ability to predict changes in air quality given changes in emissions. The latter, referred to here as a dynamic evaluation, is only possible if a retrospective case exists where (1) substantial emission reductions have resulted in discernible changes in air quality over time and (2) the change in emissions can be quantified accurately. An additional challenge is that air quality changes over time are also driven by meteorological variability.
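A minimal sketch of the comparison behind a dynamic evaluation is shown below: rather than asking how close modeled values are to observations in one period, it asks how well the model reproduces the observed change between two periods. The function name, the use of simple period means, and the sample concentrations are assumptions for this example. The sketch makes no attempt to separate the effect of emission changes from meteorological variability, which a real analysis would need to address.

```python
from statistics import mean


def dynamic_response(obs_before, obs_after, mod_before, mod_after):
    """Compare modeled and observed change between two periods.

    Hypothetical helper: each argument is a list of concentrations for
    the period before or after a known emission change.
    """
    obs_change = mean(obs_after) - mean(obs_before)
    mod_change = mean(mod_after) - mean(mod_before)

    if obs_change == 0:
        raise ValueError("no observed change; response ratio is undefined")

    # A ratio near 1 means the model reproduces the observed change;
    # below 1 means it responds too weakly, above 1 too strongly.
    return {
        "observed_change": obs_change,
        "modeled_change": mod_change,
        "response_ratio": mod_change / obs_change,
    }


# Made-up concentrations before and after an emission reduction.
result = dynamic_response(
    obs_before=[50.0, 54.0, 52.0],
    obs_after=[40.0, 42.0, 44.0],
    mod_before=[48.0, 50.0, 52.0],
    mod_after=[44.0, 46.0, 42.0],
)
print(result)
```

In this made-up case the model is close to the observations in the earlier period, yet it captures only part of the observed decrease, which is the kind of discrepancy a base-case comparison alone would not reveal.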
SEMIP uses all of these types in different ways. By focusing on outputs, SEMIP is doing operational evaluations primarily, but by breaking the modeling chain into its intermediate output levels, it is also mechanistic and scientific. SEMIP's focus on various conditions in its test cases can be considered part of dynamic evaluation.

