Metric and Statistic Definitions and Discussion

last modified Jun 06, 2009 10:56 AM

Objectives of this document

  1. To list available techniques used to evaluate model output.
  2. Discuss preferred model performance metrics, statistics, and plots.
  3. Narrow the list of metrics, statistics, and plots to those that can initially be used to evaluate simulated data against observed data.
  4. Give examples, brief descriptions, and instructions for the plots used to evaluate simulated data.

Background

Appropriate model performance metrics will vary depending on the simulated output type.  Potential output types include fuel information, consumption rate of fuels, total consumption, emission rates, time profile of emissions, plume rise, and surface concentrations.  For many case studies the available observations will be limited to surface concentrations (usually PM2.5), as these are the most readily available data.  Additionally, considerable fuel loading, consumption, and aerosol optical depth data are available.

Selection of the suite of model evaluation tools should be based on the type of evaluation the model or framework-path is undergoing.  Chang and Hanna (2004) describe three types of statistical evaluation: comparing modeled and observed time-averaged values; evaluating modeled and observed values at a specified time (e.g., peak hour) or location; and examining the arrival and departure times of the observed and modeled values (e.g., concentration plume arrival).  Model performance metrics may vary for the same model, depending on the type of output undergoing evaluation.

Model evaluation is difficult.  A point measurement may not accurately represent the simulated value if the simulated value is based on a 3-dimensional volume.  In some cases the observed value may represent only a small area surrounding the measurement location, or it may not represent the absolute in-situ value (Boylan and Russell, 2006).  Performance metrics that treat the observations as absolute truth should be used only when the known measurement error is small and when the point measurements represent an area similar in size to the grid cell used in the simulation (US EPA, 2007).  Incorporating knowledge of measurement error into the analysis will assist with intelligent model evaluation.

Key Concepts

Non-normalcy and Robustness

Some statistics (e.g., the mean and standard deviation) assume that the data or data subsets are normally distributed.  Before using these types of statistics, test the dataset population or sample subset for normality, or check whether the central limit theorem can be applied to the data.  A dataset is normal when its distribution has a skewness of zero and a kurtosis of three (D’Agostino et al., 1990).  D’Agostino et al. (1990) explain normality, give equations for testing data skewness and kurtosis, and propose a hypothesis test of normality.
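As a rough illustration of the shape check described above, the sketch below computes moment-based skewness and kurtosis using only the Python standard library.  The function name `shape_moments` and the sample values are assumptions made for this example, and the calculation is a simple descriptive check, not the formal test of D’Agostino et al. (1990).

```python
from statistics import fmean


def shape_moments(values):
    """Return (skewness, kurtosis) computed from population central moments.

    A normal distribution has skewness 0 and kurtosis 3.
    """
    n = len(values)
    if n < 3:
        raise ValueError("need at least three values")
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    if m2 == 0:
        raise ValueError("all values are identical")
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical samples: one symmetric, one with a single high outlier.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 20.0]

skew_sym, kurt_sym = shape_moments(symmetric)
skew_right, kurt_right = shape_moments(right_skewed)
```

A skewness far from zero or a kurtosis far from three suggests that statistics relying on normality should be used with care for that dataset.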

Robust statistics are minimally affected by outliers.  Examples of robust statistics include the median and the quartiles.

Unpaired and Paired analyses

Unpaired and paired (in space and time) statistics and metrics should be used to evaluate the simulated data.

Unpaired analyses are used to evaluate overall model trends and performance.  Here "overall" means throughout the domain or a selected subset of the domain, and throughout the duration of the event or a selected subset of it (days, hours, etc.).  Unpaired analyses examine the data distribution and shape and answer:

Did the model simulate the range and frequency of concentrations observed during the events?

Does simulated data from model A have the same data distribution as simulated data from model B?

Paired analyses are used to compare the simulated data to observations or other simulated data in mutual space and time.  These analyses demonstrate how well the model performs at specific locations and times.  The results may point out physical micro-influences that the model does or does not handle well, such as a mischaracterization of fuel loadings that increased the rate of consumption beyond the modeled rate, or the accurate characterization of terrain effects on air flow.  Paired analyses attempt to answer:

How well do the simulated data represent the observations in space (at the observation station) and time (hour by hour)?

What is the range of simulated data from different model sources at a given location and time?
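The difference between the two approaches can be sketched in a few lines of Python.  The data layout (dictionaries keyed by a station identifier and an hour) and the function names are assumptions made for this illustration only.

```python
def pair_in_space_and_time(observed, simulated):
    """Match values on shared (station, hour) keys; unmatched keys are dropped."""
    shared = sorted(set(observed) & set(simulated))
    return [(key, observed[key], simulated[key]) for key in shared]


def unpaired_values(data):
    """Discard the space-time keys and keep only the sorted value distribution."""
    return sorted(data.values())


# Hypothetical concentrations keyed by (station, hour).
observed = {("A", 1): 12.0, ("A", 2): 30.0, ("B", 1): 8.0}
simulated = {("A", 1): 10.0, ("A", 2): 18.0, ("B", 2): 25.0}

pairs = pair_in_space_and_time(observed, simulated)
simulated_distribution = unpaired_values(simulated)
```

The paired result keeps only values that coincide in space and time, while the unpaired result keeps every value but loses the information about where and when it occurred.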

Interpretive statistical “translation”

Models produce data that are meant to simulate the real world, but the output often requires translation to be most useful.  For example, a model might be perfectly accurate once its known bias is removed.  Removing that bias is an application of a statistical “translation” of the model output to bring it more in line with actual observations.  We utilize several simple statistical translations, including linear models and categorical threshold models, as discussed below.
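Two of the simplest translations, removing a constant bias and fitting a linear model, might look like the following sketch.  The function names and sample values are hypothetical, and ordinary least squares is assumed here as the fitting method; the document does not prescribe one.

```python
from statistics import fmean


def remove_mean_bias(simulated, observed):
    """Shift the simulated values so that their mean matches the observed mean."""
    bias = fmean(simulated) - fmean(observed)
    return [s - bias for s in simulated]


def fit_linear_translation(simulated, observed):
    """Least-squares fit of observed = slope * simulated + intercept."""
    mean_s, mean_o = fmean(simulated), fmean(observed)
    sxx = sum((s - mean_s) ** 2 for s in simulated)
    if sxx == 0:
        raise ValueError("simulated values are all identical")
    sxy = sum((s - mean_s) * (o - mean_o) for s, o in zip(simulated, observed))
    slope = sxy / sxx
    return slope, mean_o - slope * mean_s


# Hypothetical paired values in which the model runs high.
simulated = [2.0, 4.0, 6.0, 8.0]
observed = [2.0, 3.0, 4.0, 5.0]

shifted = remove_mean_bias(simulated, observed)
slope, intercept = fit_linear_translation(simulated, observed)
translated = [slope * s + intercept for s in simulated]
```

In practice the coefficients would be fitted on one period and applied to another; fitting and evaluating on the same data overstates the benefit of the translation.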

Error Propagation

Each smoke modeling step has inherent errors, and these errors can propagate into the next step and throughout the smoke modeling pathway.  As they propagate, the errors may magnify or cancel one another.  Characterizing the error and its magnitude requires in-depth analysis at each modeling step.

Types of evaluations

Description of Statistics Commonly Used to Evaluate Model Performance

Several performance metrics and statistics exist to determine whether the simulated dataset is within an acceptable range of the observed dataset.  A model performance metric is a measurement of how the model performs relative to observation values.  A statistic is a description of a dataset population or sample.  Evaluation tools include quantitative metrics (Table 1) and statistics (Table 2) paired in time and space; paired in space; and graphical plots displaying paired and unpaired data (Table 3).  Each of the following sections gives a brief description and lists a few model evaluation tools.  It is best to choose a combination consisting of metrics, statistics, and graphical plots to make up the model performance evaluation suite.

Table 1: Model Performance Metrics

Several metrics (Table 1) should be used to describe total model performance.  Using only one metric can give an unbalanced account.  Terms like the mean bias (MB) and mean error (ME) return values in absolute units, and their equations treat the observations as absolute truth.  Knowledge of observational error, magnitude, and variability may help determine whether the MB and ME results are substantial.  Consideration should be given to the area a point observation represents; the radius of representation may vary with distance and height (Boylan and Russell, 2006).  If the measurement’s radius of representation is similar in size to the grid cell used in the simulation, then the point observation may represent the volumetric simulated value.  If the radius of representation is smaller than the grid cell, then the measurement may not be the absolute truth compared to the 3-D simulated volume, and other or additional metrics should be used in the evaluation.

The bias-type metrics report overall over- or under-estimation of the simulated values.  The error terms convey the general error between modeled and observed values.  Normalization with both the observed and modeled values [mean fractional bias (MFB) and mean fractional error (MFE)] allows for evaluation of the simulated results with consideration for an observed value with a small radius of representation and measurement error.  MFB and MFE become less useful when the sum of the modeled and observed values approaches zero.  MFB and MFE are useful because they give equal weight to over- and under-predictions, both have an ideal value of 0, and neither requires a minimum threshold in order to be appropriate for analyses (Seigneur et al., 2000).  The normalized mean bias factor (B_NMBF), an unbounded metric, and the normalized mean absolute error factor (E_NMAEF), bounded below at 0, are relatively new metrics proposed by Yu et al. (2006).  B_NMBF estimates both the sign and the magnitude of the model error: the model overestimates by a factor of 1 + B_NMBF when the mean simulated value exceeds the mean observed value, and underestimates by a factor of 1 - B_NMBF when the mean simulated value is below the mean observed value.  E_NMAEF is used to determine the magnitude of the absolute error, which equals E_NMAEF x mean observed value (Yu et al., 2006).  These metrics are also symmetric, giving equal weight to over- and under-estimations made by the model.  Whereas MFB and MFE do not require minimum observation thresholds, B_NMBF and E_NMAEF should be used carefully when the observed and simulated values approach zero.
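The metrics discussed above can be sketched in Python as follows.  The formulas are the commonly cited forms from Boylan and Russell (2006) for the bias and error terms and from Yu et al. (2006) for the normalized mean bias factor and normalized mean absolute error factor; they are reproduced here from those papers as an assumption, and Table 1 remains the reference.  Function names and sample values are hypothetical.

```python
def mean_bias(model, obs):
    """MB: average of (model - obs), in the units of the data."""
    return sum(m - o for m, o in zip(model, obs)) / len(obs)


def mean_error(model, obs):
    """ME: average of |model - obs|, in the units of the data."""
    return sum(abs(m - o) for m, o in zip(model, obs)) / len(obs)


def mean_fractional_bias(model, obs):
    """MFB: average of (model - obs) / ((model + obs) / 2), as a fraction.

    Undefined for any pair where model + obs == 0.
    """
    return 2.0 * sum((m - o) / (m + o) for m, o in zip(model, obs)) / len(obs)


def mean_fractional_error(model, obs):
    """MFE: average of |model - obs| / ((model + obs) / 2), as a fraction."""
    return 2.0 * sum(abs(m - o) / (m + o) for m, o in zip(model, obs)) / len(obs)


def normalized_mean_bias_factor(model, obs):
    """Positive for overall overestimation, negative for underestimation."""
    sum_m, sum_o = sum(model), sum(obs)
    return sum_m / sum_o - 1.0 if sum_m >= sum_o else 1.0 - sum_o / sum_m


def normalized_mean_absolute_error_factor(model, obs):
    """Absolute error normalized by the smaller of the two totals."""
    abs_err = sum(abs(m - o) for m, o in zip(model, obs))
    sum_m, sum_o = sum(model), sum(obs)
    return abs_err / sum_o if sum_m >= sum_o else abs_err / sum_m


# Hypothetical paired values.
model = [2.0, 4.0, 6.0]
obs = [1.0, 4.0, 3.0]

mb = mean_bias(model, obs)
mfb = mean_fractional_bias(model, obs)
bias_factor = normalized_mean_bias_factor(model, obs)
error_factor = normalized_mean_absolute_error_factor(model, obs)
```

Swapping the two inputs flips the sign of the bias factor but keeps its magnitude, which is the symmetry property described above.  Multiply the fractional metrics by 100 to express them as percentages.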

The fraction of the time the model returns simulated values within a factor of 2 (Fa2) of the observed values summarizes overall model performance numerically; Fa2 = 1 if all values of the modeled-to-observed ratio are between 0.5 and 2.  This metric is useful for evaluating the entire population or sub-populations of the dataset.
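A minimal sketch of the Fa2 calculation follows.  The function name is hypothetical, and skipping pairs with a non-positive observation is an assumption made here so that the ratio is always defined.

```python
def fraction_within_factor_of_two(model, obs):
    """Fraction of pairs whose model/obs ratio lies between 0.5 and 2."""
    ratios = [m / o for m, o in zip(model, obs) if o > 0]
    if not ratios:
        raise ValueError("no pairs with a positive observed value")
    return sum(0.5 <= r <= 2.0 for r in ratios) / len(ratios)


# Hypothetical paired values; the ratios are 0.5, 1.0, 5.0, and 0.75.
model = [1.0, 4.0, 10.0, 3.0]
obs = [2.0, 4.0, 2.0, 4.0]

fa2 = fraction_within_factor_of_two(model, obs)
```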

Table 2: Model Evaluation Statistics

The standard statistics (Table 2), namely the peak, mean, standard deviation, median, and first and third quartiles, reveal data distributions and facilitate comparison of dataset shapes.  These values can easily be displayed both numerically and through graphical plots.  In some cases applying a running mean to both the observed and simulated values will assist in model evaluation.  This is especially useful when the observations change rapidly but the model is used for general trends (Smyth et al., 2006).
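The robust statistics and the running mean can be computed with the Python standard library, as in the sketch below.  The function names, the window length, and the sample values are assumptions for illustration, and the "inclusive" quartile method is only one of several common conventions.

```python
from statistics import fmean, median, quantiles


def summary(values):
    """Peak, median, and first and third quartiles of a dataset."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"peak": max(values), "median": median(values), "q1": q1, "q3": q3}


def running_mean(values, window):
    """Mean over each full window; the result has len(values) - window + 1 items."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [fmean(values[i:i + window]) for i in range(len(values) - window + 1)]


# Hypothetical hourly values.
hourly = [1.0, 2.0, 3.0, 4.0, 5.0]

stats = summary(hourly)
smoothed = running_mean(hourly, window=3)
```

Applying the same window to both the observed and simulated series keeps the comparison consistent.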

Evaluation of peak timing is important.  Trends in the model, such as a tendency to hit or miss peak timing by Δt, can be found rapidly through this type of assessment.  See Appendix C for peak hit and miss statistics.
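One simple way to quantify Δt is to compare the times at which the observed and simulated series reach their maxima.  The sketch below assumes both series share the same time axis; the function names, the tolerance, and the sample values are hypothetical, and the hit and miss statistics in Appendix C may be defined differently.

```python
def peak_time(times, values):
    """Time of the first occurrence of the maximum value."""
    peak_index = max(range(len(values)), key=values.__getitem__)
    return times[peak_index]


def peak_timing_offset(times, observed, simulated):
    """Simulated peak time minus observed peak time (positive means late)."""
    return peak_time(times, simulated) - peak_time(times, observed)


# Hypothetical hourly series; the simulated peak arrives one hour late.
hours = [0, 1, 2, 3, 4]
observed = [5.0, 20.0, 60.0, 30.0, 10.0]
simulated = [4.0, 10.0, 25.0, 55.0, 20.0]

delta_t = peak_timing_offset(hours, observed, simulated)
tolerance_hours = 1  # illustrative tolerance, not a recommended value
is_hit = abs(delta_t) <= tolerance_hours
```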

Some statistics, such as the standard deviation, require the data to be near normal.  Test for normality before using these types of statistics.

 

Table 3: Graphical Plots

Several graphical plots (Table 3) can be used to display modeled values with respect to the measurements.  The plots selected should demonstrate model strengths and weaknesses.  This is a narrowed list of ways to plot the data.  A brief description of each plot follows; for examples and detailed instructions on how to create the plots, please refer to Appendix A.

Time Series and Vertical Profile

The time series and vertical profiles display data with respect to change in time or change in height.  By replacing time with horizontal distance, the rate of change with distance can be displayed.  These plots can be used for evaluation of peaks and trends.

Box Plot

The box plot displays the median and first and third quartiles of the simulated and observed data.  The quartiles form the bottom and top of the box and the median is shown as a line through the middle of the box.  The box plot can be plotted as a time series or vertical profile, or the values can be plotted against a dependent variable (e.g., number of fires).  The box plot is good at providing an aggregate overview of the data in comparison to other distributions.

Ratio of Modeled-to-Observed

Plotting the ratio of modeled-to-observed values against the observed values readily displays where the model performs well.  Adding horizontal lines at y = 0.5 and y = 2 helps visualize where the ratios fall; ideally, the ratios should fall along the 1.0 line.  Model performance standards will vary from model to model and may even vary with different output from the same model.  Some modeled data are considered to perform well when the majority of the ratios are within a factor of 2.  Other modeled output may have more or less rigorous requirements (e.g., within a factor of 10) depending on the current state of technology.  This type of plot illustrates the strengths and weaknesses of the model by showing where the model performs well: for all values, for values near the peak, for values near the average, or for values lower than average.  Model tendencies to over- or under-simulate the observed values can be seen in this plot.

Scatter Plot

The scatter plot conveys the correlation between the two datasets, measured and simulated.  Usually, the observed values are plotted on the x axis and the simulated values are plotted on the y axis.  Making the axes the same size and plotting the 45° line facilitates quick assessment of the data.  Displaying the linear regression equation and the coefficient of determination (R²) is also useful for dataset comparisons.
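The regression line and the squared correlation that are often annotated on a scatter plot can be computed as in the sketch below.  It follows the axis convention described above (observed on x, simulated on y); the function name and sample values are hypothetical.

```python
from math import sqrt
from statistics import fmean


def regression_and_r_squared(observed, simulated):
    """Least-squares fit simulated = slope * observed + intercept, plus R squared."""
    mean_o, mean_s = fmean(observed), fmean(simulated)
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((s - mean_s) ** 2 for s in simulated)
    sxy = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    if sxx == 0 or syy == 0:
        raise ValueError("one of the datasets has no variability")
    slope = sxy / sxx
    intercept = mean_s - slope * mean_o
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r * r


# Hypothetical paired values in which the model doubles every observation.
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [2.0, 4.0, 6.0, 8.0]

slope, intercept, r_squared = regression_and_r_squared(observed, simulated)
```

This example shows why the 45° line matters: the correlation is perfect even though every simulated value is twice the observed value, so a high R squared alone does not indicate agreement.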

Shape Comparison and Contour-Point Plot

Plotting the simulated and observed shapes (e.g., simulated and observed smoke plumes) or plotting simulated contours and point observations demonstrates spatial model performance.  Comparing the simulated and observed shape is a qualitative measure of model performance.  This type of plot reveals the differences between observed and simulated spatially distributed shapes.  A similar analysis can be made with the contour plot, which reveals model performance over the entire domain or specified area.  Contours, color coded to represent various ranges of value, display the simulated values.  Adding the observed values as color-coded circles, using the same scale as the contours, allows for dataset comparison and examination of simulated contour gradients in the proximity of the measurement site.

Bugle Plot

The bugle plot is used to graphically display MFB and MFE versus observed values along with their corresponding model performance goal and criterion.  The ideal value for both MFB and MFE is zero; the closer the MFB and MFE values fall to zero, the better the model performance.  Model performance goal and criterion are defined as ‘the best the model can achieve’ and ‘acceptable model performance’, respectively, and both are set based on current knowledge and technological capabilities to simulate the value of interest (Boylan and Russell, 2006).  For the major components of post-analysis PM2.5, model performance goals have been defined as MFB within ±30% and MFE no greater than +50%, and model performance criteria have been established as MFB within ±60% and MFE no greater than +75% (Boylan and Russell, 2006; US EPA, 2007).  Goals and criteria may differ for different datasets or they may not yet be defined, as is the case with most predictive data.

The goals and criteria can be plotted as a horizontal line at a fixed value or as an exponential curve.  For PM2.5, the equation presented by Boylan and Russell (2006, p. 4952, with coefficients in their Table 1, p. 4955) can be used to smooth the goal and criterion lines asymptotically from high to low values with an exponential curve.  This approach widens the goal and criterion envelope near zero and is useful if many data are near zero.  Data that fall within the goal envelope exhibit excellent model performance.  Data falling between the goal and criterion demonstrate satisfactory model performance with improvement needed.  Data falling outside the criterion show poor model performance and may require further investigation.
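The three performance bands can be expressed as a small classifier.  The sketch below uses only the fixed-value goal and criterion quoted above (30% and 50%, 60% and 75%); it does not reproduce the exponential curves of Boylan and Russell (2006), and the function name and band labels are assumptions for this example.

```python
# Fixed-value limits in percent, as quoted in the text above.
GOAL = {"mfb": 30.0, "mfe": 50.0}
CRITERION = {"mfb": 60.0, "mfe": 75.0}


def classify_performance(mfb_percent, mfe_percent):
    """Place an (MFB, MFE) pair, both in percent, into a performance band."""

    def within(limits):
        return abs(mfb_percent) <= limits["mfb"] and mfe_percent <= limits["mfe"]

    if within(GOAL):
        return "within goal"
    if within(CRITERION):
        return "within criterion"
    return "outside criterion"


# Hypothetical (MFB, MFE) results for three datasets.
bands = [
    classify_performance(10.0, 40.0),
    classify_performance(-45.0, 60.0),
    classify_performance(70.0, 80.0),
]
```

Requiring both metrics to satisfy a band is a design choice of this sketch; MFB and MFE can also be classified separately, as they are when drawn on separate bugle plots.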

M and O ≥ Vsig Plot

Models applied to simulate regulated quantities, for example known hazardous fuel loadings or the six primary pollutants regulated by the US EPA through the National Ambient Air Quality Standards (NAAQS), can be evaluated by plotting the modeled and observed values that are greater than a chosen value of significance (Vsig).  The model can be considered to perform well if it accurately simulates the observed values that exceed Vsig.  This type of model evaluation is biased towards not missing an event that puts the measured values above Vsig.
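Alongside the plot, the paired values can be counted by whether each side reaches Vsig.  The category names (hit, miss, false alarm) and the threshold value below are assumptions chosen for illustration, not definitions or standards taken from this document.

```python
def exceedance_counts(model, obs, v_sig):
    """Count paired values by whether the model and observation reach v_sig."""
    counts = {"hit": 0, "miss": 0, "false_alarm": 0, "both_below": 0}
    for m, o in zip(model, obs):
        if o >= v_sig and m >= v_sig:
            counts["hit"] += 1
        elif o >= v_sig:
            counts["miss"] += 1  # observed event that the model did not reach
        elif m >= v_sig:
            counts["false_alarm"] += 1  # modeled event that was not observed
        else:
            counts["both_below"] += 1
    return counts


# Hypothetical paired values and an arbitrary illustrative threshold.
model = [40.0, 10.0, 50.0, 5.0]
obs = [38.0, 45.0, 20.0, 3.0]

counts = exceedance_counts(model, obs, v_sig=35.0)
```

Because this evaluation is oriented toward not missing events, the miss count is usually the first number to inspect.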

Quantile-Quantile (QQ) plot

The Quantile-Quantile (QQ) plot is used to demonstrate similarity between the shapes of two dataset distributions (Wilks, 2006).  The data in this plot have been unpaired in time and space.  Both datasets are sorted from lowest value to highest value and the new pairs are plotted.  If the datasets have a similar distribution, the plotted values will fall along a 1:1 line.  Points above the 1:1 line indicate general model over-simulation and points below the 1:1 line indicate general model under-simulation (Venkatram et al., 2001).  The overall model tendencies are displayed in the QQ plot, and the model’s general capability to simulate low, average, or high values becomes apparent.  To assist in evaluation, 2:1 and 0.5:1 lines can also be plotted; data that fall between these lines show where the model produces simulated values within a factor of 2 of the observed values.
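The sorting step that produces the QQ pairs is short enough to show directly.  The sketch below assumes the two datasets contain the same number of values; datasets of different sizes would first need to be reduced to common quantiles.  The function name and sample values are hypothetical.

```python
def qq_pairs(observed, simulated):
    """Sort each dataset independently and pair the values by rank."""
    if len(observed) != len(simulated):
        raise ValueError("this sketch assumes datasets of equal size")
    return list(zip(sorted(observed), sorted(simulated)))


# Hypothetical values; the original pairing in time and space is discarded.
observed = [5.0, 1.0, 3.0]
simulated = [2.0, 8.0, 4.0]

pairs = qq_pairs(observed, simulated)
points_above_one_to_one = sum(sim > obs for obs, sim in pairs)
```

Plotting the first element of each pair on the x axis and the second on the y axis gives the QQ plot; here every point lies above the 1:1 line.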

Evaluating the Surrounding Nodes

Evaluating simulated gridded data with respect to observations made at a single location can be difficult.  Ideally, grid size should be selected at a resolution that is compatible with the observation radius of representation.  Often this is computationally unrealistic, particularly for predictive systems.  To compare point observations to simulated gridded data, the nearest node to the observation location is usually selected.  Sometimes it is worthwhile to investigate the eight surrounding nodes; this may reveal a spatial shift in the simulated data or discrepancies between the location of simulated and observed peak values.

The nine simulated datasets can be plotted with the measured dataset in time series format (Appendix A).  The simulated dataset from the node closest to the measured value (the primary dataset, a.k.a. center node) and the observed dataset are represented by their own lines and the simulated values from the eight surrounding nodes are plotted in subdued colors, or as brackets surrounding the dataset.  Using this method to examine model output demonstrates the simulated radius of representation for a single node and aids in model evaluation.
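Selecting the center node and its neighbors might look like the following sketch.  It assumes a regular grid described by one-dimensional x and y node coordinates, with values stored as grid[i][j] for x index i and y index j; the function names and values are hypothetical.

```python
def nearest_node(xs, ys, x_obs, y_obs):
    """Indices (i, j) of the grid node closest to the observation location."""
    i = min(range(len(xs)), key=lambda k: abs(xs[k] - x_obs))
    j = min(range(len(ys)), key=lambda k: abs(ys[k] - y_obs))
    return i, j


def neighborhood(grid, i, j):
    """Values at the center node and its neighbors, keyed by index offset.

    Nodes on the edge of the domain have fewer than eight neighbors.
    """
    rows, cols = len(grid), len(grid[0])
    return {
        (di, dj): grid[i + di][j + dj]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if 0 <= i + di < rows and 0 <= j + dj < cols
    }


# Hypothetical 3 x 3 field of simulated values at a single time step.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.0]
grid = [[1.0, 2.0, 3.0], [4.0, 9.0, 6.0], [7.0, 8.0, 5.0]]

center = nearest_node(xs, ys, x_obs=1.2, y_obs=0.9)
nodes = neighborhood(grid, *center)
peak_offset = max(nodes, key=nodes.get)  # (0, 0) means the peak is at the center
```

Repeating the extraction at every time step gives the nine time series described above; a peak offset that is consistently away from (0, 0) would suggest a spatial shift in the simulated field.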

Evaluating Simulated PM2.5 Data

Most of the tools described above can be used to evaluate PM2.5 surface concentrations; however, some of these metrics, statistics, and plots can immediately demonstrate model performance without further investigation.  Observed PM2.5 surface concentrations can approach zero, the radius of representation of the PM2.5 measurement is usually small, and instrument error can be large depending on the chemical makeup of the particulate matter.  Taking this into consideration, metrics chosen to evaluate simulated PM2.5 concentrations should not require a minimum threshold, should not treat observations as the absolute truth, and should give over- and under-prediction equal weight.  The MFB and MFE have all of these qualities, including a high tolerance for near-zero values.  The statistics peak, median, and first and third quartiles numerically describe the dataset distributions, while the QQ plot graphically displays whether or not the two datasets have the same shape.  The time series and scatter plots are graphical tools that quickly and easily illustrate model performance of peak and overall timing.

To summarize, two model performance metrics, four statistics, and three graphical plots can be used to quickly determine model performance for simulated PM2.5 surface concentrations.

List of tools

Performance Metrics:         MFB, MFE

Statistics:                          Peak, Median, First and Third Quartiles

Graphical Plots:                 QQ, Time Series, Scatter

Using other metrics and graphical plots can illustrate different aspects of model performance and these should not be disregarded.

Conclusion

Model performance evaluation is necessary in order to build model confidence (Chang and Hanna, 2004).  Evaluation statistics, metrics, and graphical plots should be used to demonstrate complete performance, including model strengths and weaknesses, and multiple model performance tools are usually required to reveal total model capability.  Many more performance metrics exist (Yu et al., 2006; Boylan and Russell, 2006).  The proper metrics should be selected carefully, keeping in mind simulated and measurement errors, range, and measurement radius of representation.

References

Boylan, J. W. and Russell, A. G., 2006.  PM and light extinction model performance metrics, goals, and criteria for three-dimensional air quality models.  Atmos. Environ. 40: 4946-4959.

Chang, J. C. and Hanna, S. R., 2004.  Air quality model performance evaluation.  Meteorol. Atmos. Phys. 87: 167-196.

D’Agostino, R. B., Belanger, A., and D’Agostino, R. B. Jr., 1990.  A suggestion for using powerful and informative tests of normality.  Amer. Statistician 44: 316-321.

Eder, B., Kang D., Mathur, R., Yu S., and Schere K., 2005. An operational evaluation of the Eta–CMAQ air quality forecast model.  Atmos. Environ. 40: 4894-4905.

Larkin, N. K., O’Neill, S. M., Solomon, R., Krull, C., Raffuse, S., Rorig, M., Peterson, J., and Ferguson, S. A., 2007.  The BlueSky smoke modeling framework:  design, application, and performance.  (Submitted)

O’Neill, S. M., Hoadley, J., Ferguson, S. A., Solomon, R., Peterson, J., Larkin, N., Peterson, R., Wilson, R., and Matheny, D., 2005.  Applications of the BlueSkyRAINS smoke prediction system.  EM., J. Air Waste Ma.  September, 2005.  pp. 20-23

Seigneur, C., Pun, B., Prasad, P., Louis, J.-F., Solomon, P., Emery, C., Morris, R., Zahniser, M., Worsnop, D., Koutrakis, P., White, W., and Tombach, I., 2000.  Guidance for the performance evaluation of three-dimensional air quality modeling systems for particulate matter and visibility.  J. Air Waste Ma. 50: 588-599.

Smyth, S. C., Jiang, W., Yin, D., Roth, H., and Giroux, É., 2006.  Evaluation of CMAQ O3 and PM2.5 performance using Pacific 2001 measurement data.  Atmos. Environ. 40: 2735-2749.

US EPA., 2007.  Guidance on the use of models and other analyses for demonstrating attainment of air quality goals for ozone, PM2.5, and regional haze.  Publication No. EPA-454/B-07-002, April 2007. pp. 253.

Venkatram, A., Brode, R., Cimorelli, A., Lee, R., Paine, R., Perry, S., Peters, W., Weil, J., and Wilson, R., 2001.  A complex terrain dispersion model for regulatory applications.  Atmos. Environ. 35: 4211-4221.

Wilks, D. S., 2006.  Statistical methods in the atmospheric sciences.  Academic Press, Burlington, MA, USA, 627 pp.

Yu, S., Eder, B., Dennis, R., Chu, S.-H., and Schwartz, S. E., 2006.  New unbiased symmetric metrics for evaluation of air quality models.  Atmos. Sci. Let. 7: 26-34.

 

Appendix A: Graphical Plot Examples

Appendix B: Forecast Verification

Appendix C:  Further Statistic Definitions

For all appendices

Please see the linked PDF document (large):

 
